Python: force c integer overflow behavior
  2013-09-04

Python language

Python-2.x offers 2 types of integers: plain integers and long integers.

While plain integers have a least a 32 bits precision, long integers have unlimited precision (Numeric types)

Plain integers are automatically promoted to long integers if an overflow happens:

Python 2.7.5 (default, May 12 2013, 12:00:47)
[GCC 4.8.0 20130502 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.maxint
9223372036854775807
>>> a = sys.maxint
>>> print type(a)
<type 'int'>
>>> a +=1
>>> print a
9223372036854775808
>>> print type(a)
<type 'long'>

C language

In C language, integers overflow behavior is different regarding the integer signedness. 2 situations arise: (Basics of Integer Overflow)

  • signed integer overflow : undefined behavior
  • unsigned integer overflow : safely wraps around (UINT_MAX + 1 gives 0)

Here is an example of undefined behavior: (if this is really too dumb, I would be glad to be enlightened )

#include <limits.h>
#include <stdlib.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
  int i;

  unsigned int ui = UINT_MAX - 3;
  printf("Unsigned overflow:\n");

  for(i=0; i<10; i++) {
    ui += 1;
    printf("%u\n", ui);
  }

  signed int si = INT_MAX - 3;
  printf("Signed overflow:\n");

  for(i=0; i<10; i++) {
    si += 1;
    printf("%i\n", si);
  }

  return 0;
}

Let’s try:

$ gcc -Wall -o overflow overflow.c
$ ./overflow
Unsigned overflow:
4294967293
4294967294
4294967295
0
1
2
3
4
5
6
Signed overflow:
2147483645
2147483646
2147483647
-2147483648
-2147483647
-2147483646
-2147483645
-2147483644
-2147483643
-2147483642

The unsigned integer wraps, while the signed one does not.

Compiling with gcc -O3 optimization flag even emits a warning (gcc-4.8.2), and the signed loop also behaves differently :

$ gcc -O3 -Wall -o overflow overflow.c
overflow.c: In function ‘main’:
overflow.c:21:8: attention : iteration 3u invokes undefined behavior [-Waggressive-loop-optimizations]
     si += 1;
        ^
overflow.c:20:3: note: containing loop
   for(
$ ./overflow
Unsigned overflow:
4294967293
4294967294
4294967295
0
1
2
3
4
5
6
Signed overflow:
2147483645
2147483646
2147483647
-2147483648
-2147483647
-2147483646
-2147483645
-2147483644
-2147483643
-2147483642
(...)

But that’s not the point, there are a lot of interesting references about that.

Intentional unsigned overflow is useful for specific purposes, like hashing, cryptography, or random number generation.

The Python’s way of handling overflows is very convenient for the programmer, who does not have to care about a plain integer overflow, the integer capacity is upgraded transparently to a longer one.

Mimic c overflow behavior

However, problem arise when trying to mimic c unsigned overflow in Python. I had the problem when translating the following c hashing algorithm to Python: (from CFEngine source code)

unsigned int StringHash(const char *str, unsigned int seed, unsigned int max)
{
    unsigned const char *p = str;
    unsigned int h = seed;
    size_t len = strlen(str);

    for (size_t i = 0; i < len; i++)
    {
        h += p[i];
        h += (h << 10);
        h ^= (h >> 6);
    }

    h += (h << 3);
    h ^= (h >> 11);
    h += (h << 15);

    return (h & (max - 1));
}

A naive Python translation may look like that:

#!/usr/bin/python2.7

import sys

def StringHash(s, seed, max):
  h = seed
  for i in bytearray(s):
    h += i
    h += (h << 10)
    h ^= (h >> 6)

  h += (h << 3)
  h ^= (h >> 11)
  h += (h << 15)

  return (h & (max - 1));

if __name__ == "__main__":
  print StringHash('debian70.boring+192.168.2.110+0', 0, 8192)

Of course, it does not give the same result as its c counterpart.

Below a self-contained version of StringHash() c function…

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned int StringHash(const char *, unsigned int, unsigned int);

int main(int argc, char **argv) {

  char *str = "debian70.boring+192.168.2.110+0";
  printf("%d\n", StringHash(str, 0, 8192));

}

unsigned int StringHash(const char *str, unsigned int seed, unsigned int max)
{
    unsigned const char *p = str;
    unsigned int h = seed;
    unsigned int hh = 0;
    size_t len = strlen(str);

    for (size_t i = 0; i < len; i++)
    {
        h += p[i];
        h += (h << 10);
        h ^= (h >> 6);
        //printf("h: %u\n", h);
    }

    h += (h << 3);
    h ^= (h >> 11);
    h += (h << 15);

    return (h & (max - 1));
}

And a verification that the result is different given the same string to hash “debian70…”

$ gcc -std=c99 -o stringhash stringhash.c
$ ./stringhash
6219

$ ./stringhash.py
2498

Solution: Numpy package

A solution I’ve found is to use the Numpy package. Its generic unsigned integer overflows as wanted:

Python 2.7.5 (default, May 12 2013, 12:00:47)
[GCC 4.8.0 20130502 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> ui = np.uint32(2**32-1)
>>> ui
4294967295
>>> ui += np.uint32(10)
__main__:1: RuntimeWarning: overflow encountered in uint_scalars
>>> ui
9
>>>
Since the overflow is expected, the warning can be shut with:
np.warnings.simplefilter("ignore", RuntimeWarning)

Finally, the Python version of StringHash() is:

def StringHash(s, seed, max):
  """
  Python version of StringHash from string_lib.c
  """
  h = np.uint32(seed)
  for i in bytearray(s):
    h += np.uint32(i)
    h += np.uint32(h << 10)
    h ^= np.uint32(h >> 6)

  h += np.uint32(h << 3)
  h ^= np.uint32(h >> 11)
  h += np.uint32(h << 15)

  return np.uint32(h & (max - 1))

I was struggling with integers overflow for my cfe-rsplaytime quick hack.