I have an array with several fields, which I want to be sorted with respect to 2 of them. One of these fields is binary, e.g.:
size = 100000 data = np.empty( shape=2 * size, dtype=[('class', int), ('value', int),] ) data['class'][:size] = 0 data['value'][:size] = (np.random.normal(size=size) * 10).astype(int) data['class'][size:] = 1 data['value'][size:] = (np.random.normal(size=size, loc=0.5) * 10).astype(int) np.random.shuffle(data)
I need the result to be sorted with respect to
value, and for same values
class=0 should go first. Doing it like so (a):
idx = np.argsort(data, order=['value', 'class']) data_sorted = data[idx]
seems to be an order of magnitude slower compared to sorting just
data['value']. Is there a way to improve the speed, given that there are only two classes?
By experimenting randomly I noticed that an approach like this (b):
idx = np.argsort(data['value']) data_sorted = data[idx] idx = np.argsort(data_sorted, order=['value', 'class'], kind='mergesort') data_sorted = data_sorted[idx]
takes ~20% less time than (a). Changing field datatypes seem to also have some effect – floats instead of ints seem to be slightly faster.
The simplest way to do this is using the
order parameter of
sort(data, order=['value', 'class'])
However, this takes 121 ms to run on my computer, while
data['value'] take only 2.44 and 5.06 ms respectively. Interestingly,
sort(data, order='class') takes 135 ms again, suggesting the problem is with sorting structured arrays.
So, the approach you’ve taken of sorting each field using
argsort then indexing the final array seems to be on the right track. However, you need to sort each field individually,
idx=argsort(data['class']) data_sorted = data[idx][argsort(data['value'][idx], kind='stable')]
This runs in 43.9 ms.
You can get a very slight speedup by removing one temporary array from indexing
idx = argsort(data['class']) tmp = data[idx] data_sorted = tmp[argsort(tmp['value'], kind='stable')]
Which runs in 40.8 ms. Not great, but it is a workaround if performance is critical.
This seems to be a known problem:
sorting numpy structured and record arrays is very slow
The sourcecode for the comparisons used in sort can be seen at https://github.com/numpy/numpy/blob/dea85807c258ded3f75528cce2a444468de93bc1/numpy/core/src/multiarray/arraytypes.c.src .
The numeric types are much, much simpler. Still, that large of a difference in performance is surprising.
Answered By – user2699
Answer Checked By – Mary Flores (BugsFixing Volunteer)