## Issue

Suppose we have an array of doubles `x` and an array of indices `y`, and we want to sort these indices by the corresponding values in `x` (that is, sort each `i` in `y` using `x[i]` as the key).

We can then create an array of pairs, with one component being the key value and one being the index, and for example do something like this:

```cpp
boost::sort::spreadsort::float_sort(
    sortdata, sortdata + n,
    [](const std::pair<double, int> &a, const unsigned offset) -> boost::int64_t {
        return boost::sort::spreadsort::float_mem_cast<double, boost::int64_t>(a.first) >> offset;
    },
    [](const std::pair<double, int> &a, const std::pair<double, int> &b) -> bool {
        return a.first < b.first;
    });
```

This costs quite a bit of memory, so instead we can omit the creation of this array of pairs and directly use the data like this:

```cpp
boost::sort::spreadsort::float_sort(
    y, y + n,
    [x](const int a, const unsigned offset) -> boost::int64_t {
        return boost::sort::spreadsort::float_mem_cast<double, boost::int64_t>(x[a]) >> offset;
    },
    [x](const int a, const int b) -> bool {
        return x[a] < x[b];
    });
```

Now, when using this on very large data (say, 50,000,000 entries), the second method takes more than twice as long as the first. As far as I know, a `std::pair` is simply a `struct`. So, can an array access really be that much less efficient than a struct access? Or am I doing something wrong here?

Here is a full example for comparison:

```cpp
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <utility>
#include <boost/sort/spreadsort/spreadsort.hpp>

int main() {
    int n = 50000000;
    double *x = new double[n];
    int *y = new int[n];
    std::pair<double, int> *sortdata = new std::pair<double, int>[n];
    for (int i = 0; i < n; i++) {
        x[i] = ((double) std::rand()) / ((double) std::rand());
        sortdata[i].first = x[i];
        y[i] = i;
        sortdata[i].second = i;
    }

    std::time_t t = std::time(0);
    boost::sort::spreadsort::float_sort(
        sortdata, sortdata + n,
        [](const std::pair<double, int> &a, const unsigned offset) -> boost::int64_t {
            return boost::sort::spreadsort::float_mem_cast<double, boost::int64_t>(a.first) >> offset;
        },
        [](const std::pair<double, int> &a, const std::pair<double, int> &b) -> bool {
            return a.first < b.first;
        });
    std::cout << std::time(0) - t << "\n";

    t = std::time(0);
    boost::sort::spreadsort::float_sort(
        y, y + n,
        [x](const int a, const unsigned offset) -> boost::int64_t {
            return boost::sort::spreadsort::float_mem_cast<double, boost::int64_t>(x[a]) >> offset;
        },
        [x](const int a, const int b) -> bool {
            return x[a] < x[b];
        });
    std::cout << std::time(0) - t << "\n";
}
```

Running this with `g++ -O2` gives 3 seconds for the first sort and 9 seconds for the second on my system.

## Solution

Not an expert, but here is what I think is going on:

Jumping around randomly in memory is a costly operation, no matter the programming language. This comes down to the caching architecture of your CPU.

With an array of pairs, the data is stored interleaved in memory (an "array of structs" layout); there is no class boundary or other overhead. It looks like this:

```
value0 index0 value1 index1 value2 index2 ...
```

When you instead store the data in two separate arrays (a "struct of arrays" layout), it looks like this:

```
value0 value1 value2 ......... index0 index1 index2 .....
```

Ok, now let’s see what happens when the sorting algorithm tries to figure out whether the data at index 46 should go before or after the data at index 47:

In the case of pairs:

```
1. ask RAM for value46 (expensive!)
   because of how cache lines work, ~64 bytes of data are pulled:
   value46 index46 value47 index47, maybe a bit more.
2. ask RAM for value47 (cheap, it's already cached)
```

In the case of two arrays:

```
1. ask RAM for index46 (expensive)
   this pulls index46, index47, ...
2. ask RAM for index47 (cheap, already cached)
3. ask RAM for x[index46] (expensive)
   this pulls x[index46], x[index46 + 1], ...
4. ask RAM for x[index47] (index47 is an arbitrary value, so x[index47]
   usually sits in a different cache line: often another expensive fetch)
```

The exact cost of that last fetch depends on where the indices happen to point, but the point stands: the two-array version touches two to three times as many distinct memory locations per comparison, and the fetches into `x` are data-dependent and unpredictable. I'm pretty sure that's why you're measuring a runtime that's more than twice as long.

Here is a final example to get this point across:

```cpp
// needs <vector>, <random>, <algorithm>, <cstdlib>, <numeric>
const int N = 100000;
std::vector<float> data(N);
std::vector<int> idx_1(N), idx_2(N);
for (int i = 0; i < N; i++) data[i] = std::rand();   // random data
std::iota(idx_1.begin(), idx_1.end(), 0);            // 0, 1, ..., N-1
idx_2 = idx_1;
std::shuffle(idx_2.begin(), idx_2.end(), std::mt19937(42));  // random permutation

// this will be very fast: the accesses into data are sequential,
// so the hardware prefetcher always gets it right.
float sum_1 = 0;
for (int i = 0; i < N; i++) sum_1 += data[idx_1[i]];

// this will be very slow: the accesses into data are random,
// so nearly every one misses the cache.
float sum_2 = 0;
for (int i = 0; i < N; i++) sum_2 += data[idx_2[i]];
```

Answered By – kritzikratzi

Answer Checked By – Mary Flores (BugsFixing Volunteer)