[SOLVED] Low Performance unpacking byte array to data structure

Issue

I’m having performance trouble copying data from a byte array into my internal data structure. The data contains several nested arrays and is extracted as in the attached code. In C, reading the same data from a stream takes about one second, but in Python it takes almost a minute. I guess indexing and calling int.from_bytes was not the best idea.
Does anybody have a suggestion for improving the performance?

...
ycnt = int.from_bytes(bytedat[idx:idx + 4], 'little')
idx += 4
while ycnt > 0:
    ky = int.from_bytes(bytedat[idx:idx + 4], 'little')
    idx += 4
    dv = DataObject()
    xvec.update({ky: dv})
    dv.x = int.from_bytes(bytedat[idx:idx + 4], 'little')
    idx += 4
    dv.y = int.from_bytes(bytedat[idx:idx + 4], 'little')
    idx += 4
    cntv = int.from_bytes(bytedat[idx:idx + 4], 'little')
    idx += 4
    while cntv > 0:
        dv.data_values.append(int.from_bytes(bytedat[idx:idx + 4], 'little', signed=True))
        idx += 4
        cntv -= 1
    dv.score = struct.unpack('d', bytedat[idx:idx + 8])[0]
    idx += 8
    ycnt -= 1
...

Solution

First, a factor of 60 between Python and C is normal for low-level code like this. This is not where Python shines, because it doesn’t get compiled down to machine code.

Micro-Optimizations

The most obvious one is to reduce the per-integer overhead by using struct.unpack() properly. See the format string documentation. Something like this:

ky, dv.x, dv.y, cntv = struct.unpack('<4i', bytedat[idx:idx + 4*4])

(Note that dv = DataObject() has to exist before this line, and idx then advances by 16 in one step.)
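A minimal sketch of what the rewritten inner loop could look like. The helper name read_record and the demo record layout are my assumptions, reconstructed from the snippet in the question; precompiled struct.Struct objects plus unpack_from() also avoid re-parsing the format string and slicing the buffer:

```python
import struct

# Precompiled formats: '<' = little-endian, no padding.
HEADER = struct.Struct('<4i')   # ky, x, y, cntv in one call
SCORE = struct.Struct('<d')

def read_record(bytedat, idx):
    """Read one record; returns (ky, x, y, values, score, new_idx)."""
    ky, x, y, cntv = HEADER.unpack_from(bytedat, idx)
    idx += HEADER.size
    # Read all cntv signed 32-bit ints in a single call.
    values = list(struct.unpack_from('<%di' % cntv, bytedat, idx))
    idx += 4 * cntv
    (score,) = SCORE.unpack_from(bytedat, idx)
    idx += SCORE.size
    return ky, x, y, values, score, idx

# Demo: one record with ky=7, x=1, y=2, two values, score=0.5
buf = struct.pack('<4i2id', 7, 1, 2, 2, 10, -20, 0.5)
print(read_record(buf, 0))  # (7, 1, 2, [10, -20], 0.5, 32)
```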

The second one is to load your int arrays (if they are large) "in batch" instead of with the (interpreted!) while cntv > 0 loop. I would use a numpy array:

numpy.frombuffer(bytedat, dtype='<i4', count=cntv, offset=idx)

(The '<i4' dtype pins the byte order to little-endian, and the count/offset arguments save you the slice.)
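A small self-contained sketch of the batch load (the bytedat blob here is synthetic, built just for the demo):

```python
import struct
import numpy as np

# Build a little-endian blob of four int32 values for the demo.
bytedat = struct.pack('<4i', 3, -1, 4, 1)
cntv = 4
idx = 0

# One call replaces the whole interpreted while-loop.
arr = np.frombuffer(bytedat, dtype='<i4', count=cntv, offset=idx)
idx += 4 * cntv
print(arr.tolist())  # [3, -1, 4, 1]
```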

Why not a list? A Python list contains generic Python objects. It requires extra memory and a pointer indirection for each item. Libraries cannot use optimized C code (for example, to calculate the sum), because each item first has to be dereferenced and then checked for its type.

A numpy array, on the other hand, is basically a wrapper that manages the memory of a C array. Loading it will probably boil down to a memcpy(), or it may even just reference the bytes memory you passed in.

And thirdly, instead of xvec.update({ky: dv}) you can simply write xvec[ky] = dv. This avoids the creation of a temporary dict object.
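Both spellings end up with the same dict contents; the item assignment just skips the throwaway one-entry dict (the placeholder value below stands in for the real DataObject):

```python
xvec = {}
dv = "some DataObject"   # placeholder for the real DataObject instance
ky = 42

xvec.update({ky: dv})    # builds a temporary {42: dv} dict first
xvec[ky] = dv            # direct item assignment, no temporary
print(xvec)              # {42: 'some DataObject'}
```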

Compiling your Python-Code

There are ways to compile Python (partially) down to machine code (PyPy, Numba, Cython). It’s a bit involved, but your original byte-indexing code would then run at C speed.

However, you are filling a Python list and a dict in the inner loop. This is never going to get "C"-like fast because it will have to deal with Python objects and reference counting, even when it gets compiled down to C.

Different file format

The easiest way is to use a data format handled by a fast, specialized library (like numpy, HDF5 via h5py, Pillow, maybe even pandas).
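For example, if you control the writer, storing the int arrays in numpy's own .npy format makes the reader a single fast call (a BytesIO stands in for a real file here):

```python
import io
import numpy as np

values = np.array([10, -20, 30], dtype='<i4')

buf = io.BytesIO()
np.save(buf, values)      # writer side: one call
buf.seek(0)
loaded = np.load(buf)     # reader side: one call, no Python-level loop
print(loaded.tolist())    # [10, -20, 30]
```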

The pickle module may also help, but only if you can control the writing and everything is trusted, and you mainly care about loading speed.
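A sketch of the pickle round trip; the record dict below is a hypothetical stand-in mirroring the question's DataObject fields:

```python
import pickle

# Hypothetical record mirroring the DataObject layout from the question.
record = {"x": 1, "y": 2, "data_values": [10, -20], "score": 0.5}

blob = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)   # only do this with trusted data!
print(restored == record)       # True
```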

Answered By – maxy

Answer Checked By – Robin (BugsFixing Admin)
