Bad, vile and meaningless: Why Python rocks from Alan's clob

Glotski in Python

I also implemented yesterday's glotski solver in Python. I implemented the two the same way, meaning that I painstakingly converted each Perl instruction to the corresponding Python expression, regular expressions included.

Here's a typical piece of conversion:

$b =~ s/(\w)/"\x1b[3;4" . ((ord($1) % 7) + 1) . "m" . $1/ge;

b_replacer = re.compile(r'(\w)')
b = b_replacer.sub(lambda x: "\x1b[3;4%dm%s" %
    ((ord(x.group(1)) % 7) + 1, x.group(1)), b)

Pretty, huh? Well, it's pretty obvious that perlisms do not translate well into Python. Judging from this, you'd conclude that the Python version is considerably uglier. But Python's simpler syntax definitely benefits us elsewhere:

for d in 'drul':
    for c in pieces:
        newmaze = move_piece(maze, c, d)
        if newmaze:
            moves.append([newmaze, move_so_far + d + c])

You'll find the equivalent Perl implementation in yesterday's entry.

Minor trouble on the way...

However, I did run into a problem before I could meaningfully benchmark this code: Python doesn't have a string xor operation. That means that the core of my piece-moving algorithm can't be written the same way in Python. That code in Perl looks like $tmp ^ $mask ^ "\x00" . substr($mask, 1) and it can't be expressed in Python quite so simply.

So, let us pretend we have string xor. We can write strxor(strxor(tmp, mask), '\x00'+mask[1:]) as the equivalent. A naive implementation goes as follows:

def strxor(s1, s2):
    return "".join(map(lambda x, y: chr(ord(x) ^ ord(y)), s1, s2))

but with that implementation, my Python program ran about 20% slower than the Perl program.

I improved things first by writing a three-argument strxor() that simply handled all three strings at once. That reduced the Python program's penalty to 10% relative to the Perl program. Clearly, the overhead of strxor() mattered. With that thought in my mind, I went to bed.
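The three-argument version isn't shown above, but here's a sketch of what it plausibly looked like (strxor3 is my name for it, not necessarily the original's); the point is folding all three strings into a single pass so no intermediate string gets built:

```python
def strxor3(s1, s2, s3):
    # XOR three equal-length strings in one pass; avoids the intermediate
    # string that strxor(strxor(s1, s2), s3) would have to allocate.
    return "".join(chr(ord(x) ^ ord(y) ^ ord(z))
                   for x, y, z in zip(s1, s2, s3))
```

With that, the piece-moving expression becomes strxor3(tmp, mask, '\x00' + mask[1:]), with only one temporary string instead of two.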

On to a fast implementation

So today, in the interests of giving Python the same benefit as Perl was enjoying with its string xor implementation, I wrote the whole thing in C.

I'm not a C programmer, and I mostly detest C programmers in general, but nevertheless, I think it turned out good enough. That's because Python strives to do things as simply as possible. For instance, the Makefile for building a Python extension boils down, essentially, to a gcc -shared -o foo.so foo.c style command. After that, you can do import foo, and if the .so defines an initfoo symbol, you've got yourself a Python module!

So, here's the corresponding bits in my strxor module:

PyMethodDef methods[] = {
    { "strxor", strxor, METH_VARARGS, NULL },
    { NULL, NULL, 0, NULL }  /* sentinel */
};

void initstrxor() {
    (void) Py_InitModule("strxor", methods);
}
What this means is that when the module's init hook initstrxor() runs, Python registers the C function strxor() under the Python name strxor. The body of strxor() itself is naturally a bit longer, but nothing drastic:

PyObject * strxor(PyObject * self, PyObject * args) {
    PyObject *result = NULL;

    char * a, * b, * c;
    int al, bl, i;

    if (! PyArg_ParseTuple(args, "s#s#", &a, &al, &b, &bl)) {
        /* error raised by PyArg_ParseTuple */
        return result;
    }

    if (al != bl) {
        PyErr_SetString(PyExc_ValueError,
                        "strxor arguments must be of equal length");
        return result;
    }

    c = (char *) malloc(al);
    if (c == NULL)
        return PyErr_NoMemory();

    for (i = 0; i < al; i += 1)
        c[i] = a[i] ^ b[i];

    /* build the python value. c is copied, and we discard it. */
    result = Py_BuildValue("s#", c, al);
    (void) free(c);

    return result;
}
The whole core is obviously the small for loop that does the string xor. The rest is simply about getting the arguments from Python and passing the result back to Python. I suspect that Python does quite a bit of extra copying here that could be dispensed with, but I don't know how to tell it "I will agree, hand on the holy bible, not to tamper with the memory regions pointed to by a and b, so just give me your internal buffers". Copy avoidance would also help with c on the way back.

On the other hand, I could avoid the malloc()s by keeping a static buffer, grown dynamically to accommodate the largest xor'd string so far. That, however, is against the spirit of Python, which values correctness above performance, and I see nothing wrong with doing a dispensable malloc() in a central function.

Towards triumph!

With my new implementation, Python went ahead of Perl! The difference is minimal, almost within inter-run variation. But something like 0.5% faster it is.

Memory usage profiles are also damn-nigh identical, with Perl consuming slightly more RAM. It's as if these programming languages had copied each other's implementations or something...

Update: Python 2.4 ran my code about 5% faster than 2.3 did. Looks like Python is now significantly ahead in the CPU department! Stay tuned for the bitter battle for dominance between these programming-language giants!