# [SOLVED] Performance of sorting structured arrays (numpy)

## Issue

I have an array with several fields, which I want to be sorted with respect to 2 of them. One of these fields is binary, e.g.:

``````size = 100000
data = np.empty(
shape=2 * size,
dtype=[('class', int),
('value', int),]
)

data['class'][:size] = 0
data['value'][:size] = (np.random.normal(size=size) * 10).astype(int)
data['class'][size:] = 1
data['value'][size:] = (np.random.normal(size=size, loc=0.5) * 10).astype(int)

np.random.shuffle(data)
``````

I need the result to be sorted with respect to `value`, and for same values `class=0` should go first. Doing it like so (a):

``````idx = np.argsort(data, order=['value', 'class'])
data_sorted = data[idx]
``````

seems to be an order of magnitude slower compared to sorting just `data['value']`. Is there a way to improve the speed, given that there are only two classes?

By experimenting randomly I noticed that an approach like this (b):

``````idx = np.argsort(data['value'])
data_sorted = data[idx]
idx = np.argsort(data_sorted, order=['value', 'class'], kind='mergesort')
data_sorted = data_sorted[idx]
``````

takes ~20% less time than (a). Changing field datatypes seem to also have some effect – floats instead of ints seem to be slightly faster.

## Solution

The simplest way to do this is using the `order` parameter of `sort`

``````sort(data, order=['value', 'class'])
``````

However, this takes 121 ms to run on my computer, while `data['class']` and `data['value']` take only 2.44 and 5.06 ms respectively. Interestingly, `sort(data, order='class')` takes 135 ms again, suggesting the problem is with sorting structured arrays.

So, the approach you’ve taken of sorting each field using `argsort` then indexing the final array seems to be on the right track. However, you need to sort each field individually,

``````idx=argsort(data['class'])
data_sorted = data[idx][argsort(data['value'][idx], kind='stable')]
``````

This runs in 43.9 ms.
You can get a very slight speedup by removing one temporary array from indexing

``````idx = argsort(data['class'])
tmp = data[idx]
data_sorted = tmp[argsort(tmp['value'], kind='stable')]
``````

Which runs in 40.8 ms. Not great, but it is a workaround if performance is critical.

This seems to be a known problem:
sorting numpy structured and record arrays is very slow

Edit
The sourcecode for the comparisons used in sort can be seen at https://github.com/numpy/numpy/blob/dea85807c258ded3f75528cce2a444468de93bc1/numpy/core/src/multiarray/arraytypes.c.src .
The numeric types are much, much simpler. Still, that large of a difference in performance is surprising.