## Issue

I’m writing a program in which I’m trying to see how well a given redshift makes a set of lines detected in a spectrum match up with an atomic line database. The more closely a redshift makes the lines overlap, the lower the "score" and the more likely that redshift is correct.

I do this by looping over a range of possible redshifts, calculating the score for each. Within that outer loop, I was looping over each line in the set of detected lines to calculate its sub-score, and summing over that inner loop to get the overall score.

I tried to vectorize the inner loop with numpy, but surprisingly it actually slowed down the execution. In the example given, the nested for loop takes ~2.6 seconds on my laptop to execute, while the single for loop with numpy on the inside takes ~5.3 seconds.

Why would vectorizing the inner loop slow things down? Is there a better way to do this that I’m missing?

```
import numpy as np
import time

def find_nearest_line(lines, energies):
    # Return the indices (into the lines vector) that are closest to the energies supplied
    # Vectorized with help from https://stackoverflow.com/a/53111759
    energies = np.expand_dims(energies, axis=-1)
    idx = np.abs(lines / energies - 1).argmin(axis=-1)
    return idx

def calculate_distance_to_line(lines, energies):
    # Return the distance between an array of lines and an array of energies
    z = (lines / energies) - 1
    return z

rng = np.random.default_rng(2021)
det_lines = rng.random(1000)
atom_lines = rng.random(10000)
redshifts = np.linspace(-0.1, 0.1, 100)

# loop version
start = time.time()
scores = []
for redshift in redshifts:
    atom_lines_shifted = atom_lines * (1 + redshift)
    score = 0
    for det_line in det_lines:
        closest_atom_line = find_nearest_line(atom_lines_shifted, det_line)
        score += abs(calculate_distance_to_line(atom_lines_shifted[closest_atom_line], det_line))
    scores.append(score)
print(time.time() - start)
print(scores)

# (semi-)vectorized version
start = time.time()
scores = []
for redshift in redshifts:
    atom_lines_shifted = atom_lines * (1 + redshift)
    closest_atom_lines = find_nearest_line(atom_lines_shifted, det_lines)
    score = np.sum(np.abs(calculate_distance_to_line(atom_lines_shifted[closest_atom_lines], det_lines)))
    scores.append(score)
print(time.time() - start)
print(scores)
```

## Solution

NumPy code generally creates many *temporary arrays*. This is the case for your function `find_nearest_line`, for example. Working on all the items of `det_lines` simultaneously results in the creation of several relatively big arrays (`1000 * 10_000 * 8 = 76 MiB` per array). The thing is that **big arrays often do not fit in CPU caches**. When that happens, the array has to be stored in RAM, which has much *lower throughput* and much *higher latency*. Moreover, allocating and freeing bigger arrays takes more time and often results in more page faults (due to the actual implementation of most default standard allocators). It is sometimes still faster to use big arrays because the overhead of the CPython interpreter is huge, but both strategies are inefficient in practice.
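To make the size of those temporaries concrete, here is a small sketch (using the same array sizes as in the question) that materializes the broadcasted intermediate from `find_nearest_line` explicitly:

```
import numpy as np

rng = np.random.default_rng(2021)
det_lines = rng.random(1000)
atom_lines = rng.random(10000)

# Broadcasting a (1000, 1) array against a (10000,) array materializes
# a (1000, 10000) float64 temporary: 1000 * 10000 * 8 bytes = 80 MB (~76 MiB),
# far larger than typical L1/L2/L3 CPU caches.
tmp = np.abs(atom_lines / det_lines[:, None] - 1)
print(tmp.shape)   # (1000, 10000)
print(tmp.nbytes)  # 80000000
```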

The deeper problem is that the **algorithm is not efficient**. Indeed, you can **sort the array** and use a **binary search** to find the closest value much more efficiently. `np.searchsorted` does most of the work, but it only returns the index of the closest value greater than (or equal to) the target value. Thus, some additional work is needed to get the closest value overall (which may be greater or smaller than the target value). Note that this algorithm does not generate huge arrays, thanks to the binary search.
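A minimal sketch of the `searchsorted`-plus-neighbor-check idea, using small made-up sample values (the sorted `lines` and `targets` here are illustrative, not from the question):

```
import numpy as np

lines = np.array([0.1, 0.4, 0.7, 0.9])   # must be sorted
targets = np.array([0.05, 0.5, 0.95])

# searchsorted returns the insertion index: the first element >= target
idx = np.searchsorted(lines, targets)

# The closest value is either lines[idx] or lines[idx - 1];
# clamp the indices so both candidates stay in bounds
lo = lines[np.maximum(idx - 1, 0)]
hi = lines[np.minimum(idx, lines.size - 1)]
closest = np.where(np.abs(targets - lo) <= np.abs(targets - hi), lo, hi)
print(closest)  # [0.1 0.4 0.9]
```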

```
scores = []
n = atom_lines.size
m = det_lines.size
line_idx = np.arange(m)

for redshift in redshifts:
    atom_lines_shifted = atom_lines * (1 + redshift)
    sorted_atom_lines_shifted = np.sort(atom_lines_shifted)
    close_idx = np.searchsorted(sorted_atom_lines_shifted, det_lines)
    lower_bound = sorted_atom_lines_shifted[np.maximum(close_idx - 1, 0)]
    upper_bound = sorted_atom_lines_shifted[np.minimum(close_idx, n - 1)]
    bounds = np.hstack((lower_bound[:, None], upper_bound[:, None]))
    closest_bound_idx = find_nearest_line(bounds, det_lines)
    close_values = bounds[line_idx, closest_bound_idx]
    score = np.sum(np.abs(calculate_distance_to_line(close_values, det_lines)))
    scores.append(score)
```

Since `atom_lines` is not modified and *the multiplication preserves the order*, the algorithm can be further optimized by sorting `atom_lines` once, outside the loop:

```
scores = []
n = atom_lines.size
m = det_lines.size
line_idx = np.arange(m)
sorted_atom_lines = np.sort(atom_lines)

for redshift in redshifts:
    sorted_atom_lines_shifted = sorted_atom_lines * (1 + redshift)
    close_idx = np.searchsorted(sorted_atom_lines_shifted, det_lines)
    lower_bound = sorted_atom_lines_shifted[np.maximum(close_idx - 1, 0)]
    upper_bound = sorted_atom_lines_shifted[np.minimum(close_idx, n - 1)]
    bounds = np.hstack((lower_bound[:, None], upper_bound[:, None]))
    closest_bound_idx = find_nearest_line(bounds, det_lines)
    close_values = bounds[line_idx, closest_bound_idx]
    score = np.sum(np.abs(calculate_distance_to_line(close_values, det_lines)))
    scores.append(score)
```

This last implementation is about **300 times faster** on my machine.
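As a sanity check, here is a self-contained sketch (with smaller, made-up array sizes, and hypothetical helper names `scores_bruteforce` / `scores_sorted` introduced only for this comparison) verifying that the `searchsorted` approach produces the same scores as a brute-force reference:

```
import numpy as np

rng = np.random.default_rng(2021)
det_lines = rng.random(100)        # smaller sizes keep the check fast
atom_lines = rng.random(1000)
redshifts = np.linspace(-0.1, 0.1, 20)

def scores_bruteforce(atom_lines, det_lines, redshifts):
    # Reference: compute all pairwise relative distances per redshift
    scores = []
    for redshift in redshifts:
        shifted = atom_lines * (1 + redshift)
        z = np.abs(shifted / det_lines[:, None] - 1)
        scores.append(z.min(axis=1).sum())
    return np.array(scores)

def scores_sorted(atom_lines, det_lines, redshifts):
    # searchsorted-based version: sort once, binary-search per redshift
    n = atom_lines.size
    line_idx = np.arange(det_lines.size)
    sorted_atom_lines = np.sort(atom_lines)
    scores = []
    for redshift in redshifts:
        shifted = sorted_atom_lines * (1 + redshift)
        close_idx = np.searchsorted(shifted, det_lines)
        lower = shifted[np.maximum(close_idx - 1, 0)]
        upper = shifted[np.minimum(close_idx, n - 1)]
        bounds = np.stack((lower, upper), axis=1)
        best = np.abs(bounds / det_lines[:, None] - 1).argmin(axis=1)
        close_values = bounds[line_idx, best]
        scores.append(np.abs(close_values / det_lines - 1).sum())
    return np.array(scores)

ref = scores_bruteforce(atom_lines, det_lines, redshifts)
fast = scores_sorted(atom_lines, det_lines, redshifts)
print("match:", np.allclose(ref, fast))
```

Both versions should agree to floating-point precision: for positive energies, minimizing `|shifted / e - 1|` over the lines picks the same line as minimizing `|shifted - e|`, so the binary search finds the same nearest neighbor as the exhaustive scan.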

Answered By – Jérôme Richard
