[SOLVED] Performance of sorting structured arrays (numpy)

Issue

I have an array with several fields, which I want to be sorted with respect to 2 of them. One of these fields is binary, e.g.:

size = 100000
data = np.empty(
            shape=2 * size,
            dtype=[('class', int),
                   ('value', int),]
)

data['class'][:size] = 0
data['value'][:size] = (np.random.normal(size=size) * 10).astype(int)
data['class'][size:] = 1
data['value'][size:] = (np.random.normal(size=size, loc=0.5) * 10).astype(int)

np.random.shuffle(data)

I need the result to be sorted with respect to value, and for same values class=0 should go first. Doing it like so (a):

idx = np.argsort(data, order=['value', 'class'])
data_sorted = data[idx]

seems to be an order of magnitude slower compared to sorting just data['value']. Is there a way to improve the speed, given that there are only two classes?

By experimenting randomly I noticed that an approach like this (b):

idx = np.argsort(data['value'])
data_sorted = data[idx]
idx = np.argsort(data_sorted, order=['value', 'class'], kind='mergesort')
data_sorted = data_sorted[idx]

takes ~20% less time than (a). Changing field datatypes seem to also have some effect – floats instead of ints seem to be slightly faster.

Solution

The simplest way to do this is using the order parameter of sort

sort(data, order=['value', 'class'])

However, this takes 121 ms to run on my computer, while data['class'] and data['value'] take only 2.44 and 5.06 ms respectively. Interestingly, sort(data, order='class') takes 135 ms again, suggesting the problem is with sorting structured arrays.

So, the approach you’ve taken of sorting each field using argsort then indexing the final array seems to be on the right track. However, you need to sort each field individually,

idx=argsort(data['class'])
data_sorted = data[idx][argsort(data['value'][idx], kind='stable')]

This runs in 43.9 ms.
You can get a very slight speedup by removing one temporary array from indexing

idx = argsort(data['class'])
tmp = data[idx]
data_sorted = tmp[argsort(tmp['value'], kind='stable')]

Which runs in 40.8 ms. Not great, but it is a workaround if performance is critical.

This seems to be a known problem:
sorting numpy structured and record arrays is very slow

Edit
The sourcecode for the comparisons used in sort can be seen at https://github.com/numpy/numpy/blob/dea85807c258ded3f75528cce2a444468de93bc1/numpy/core/src/multiarray/arraytypes.c.src .
The numeric types are much, much simpler. Still, that large of a difference in performance is surprising.

Answered By – user2699

Answer Checked By – Mary Flores (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *