Table of Contents

## Issue

I get drastically different timings (8x difference) for the following `np.sum`

values.

```
from numpy.random import default_rng
r = 10**9
a = default_rng().integers(0, r, size=(r, 2))
np.sum(a, axis=0) # 12s
np.sum(np.asfortranarray(a), axis=0) # 1.5s
```

Does this mean that np traverses `a`

multiple times? shouldn’t the 2 sum values be calculated simultaneously while the data is streamed through the memory (using AVX may be)?

## Solution

**TL;DR:** While the simplest explanation is that the performance gap is mainly due to a less efficient memory access pattern in the first case, it turns out that **this assertion is completely wrong** here! The **Numpy (1.20.3) code simply generates a very inefficient code** so far (especially in the first case).

## Analysis

First of all, using `np.asfortranarray`

actually cause a *transposition* of the input array which is copied in memory. This means that **the two cases read memory using an inefficient pattern**. Not to mention the second one actually store a huge temporary array in memory (using sub-optimal code) which is slow.

Moreover, usual `np.sum`

does not use SIMD instructions for integers yet (in fact all integer types) and it adds some overheads due to the conversion to a bigger type on platform using the `np.int32`

type by default. Thus, in the second case, the executed assembly code of the hottest part is a basic scalar loop. It should be memory bound on mainstream x86-64 PC using the `np.int64`

type by default. The loop can be compute bound on the one using the `np.int32`

type by default.

The main reason `np.sum`

is so slow in the first case is because **Numpy generates a very inefficient code in this case**. A profiling of the Numpy code gives the following results:

```
%time | Function
---------------------------------------------------
54% | reduce_loop
30% | LONG_add_avx2
14% | npyiter_buffered_reduce_iternext_iters2
2% | Other functions
```

`reduce_loop`

results in a very inefficient assembly code that spend 75% of its time copying data and calling other functions in a huge loop. `LONG_add_avx2`

does the main computational part. It spends a major part of its time performing conditional jump in a huge loop. Based on the assembly code, it appears that the loop is mainly optimized for a large number of items in the last dimension so Numpy can use the SIMD instructions (AVX2 on my machine). The thing is the number of item in the last dimension is too small to use AVX2. Thus, the code fallback on a slow scalar implementation… Here is a part of the executed assembly code (timings in % are relative to the function execution time):

```
; Beginning of the huge loop
6b: leave ; Slow instruction (~8% of the time)
ret
nop
70: cmp r10,0x8
↓ je 290 ; Slow jump (~8% of the time)
; Huge assembly code with 136 instructions
; that are not actually executed.
[...]
; This block perform many slow conditional
; jumps taking ~40% of function time.
; It performs many conditonal jumps on other
; blocks finally jumping on the label 34e.
290: cmp r11,0x8
↑ jne 3f
mov rax,rcx
mov rdx,r8
sub rax,r8
sub rdx,rcx
cmp r8,rcx
cmovb rdx,rax
test rdx,rdx
↓ jne 63a
cmp rcx,rdi
↓ jbe 52e
mov rax,rcx
sub rax,rdi
cmp rax,0x3ff
↓ jbe 540
2d1: test rsi,rsi
↑ jle 6b
lea rax,[rsi-0x1]
cmp rax,0x2
↓ jbe 73c
; Actual computing block:
; Compute only 2 values per loop iteration
; This only take ~15% of the time
34e: mov rdx,QWORD PTR [r8]
add rdx,QWORD PTR [rdi]
mov QWORD PTR [rcx],rdx
lea rdx,[rax+0x1]
cmp rdx,rsi
↑ jge 6b
mov rdx,QWORD PTR [r8+0x8]
add rax,0x2
add rdx,QWORD PTR [rdi+0x8]
mov QWORD PTR [rcx+0x8],rdx
cmp rax,rsi
jge 6b
```

Based on this, one can conclude that less than 5% of the execution time is spend in doing actual useful work with an inefficient scalar accumulation!

## Fast implementation

You can generate a much more efficient code using **Numba** in this specific case:

```
import numba as nb
@nb.njit('int64[::1](int_[:,::1])')
def fast_sum(arr):
assert arr.shape[1] == 2
a = b = np.int64(0)
for i in nb.prange(arr.shape[0]):
a += arr[i, 0]
b += arr[i, 1]
return np.array([a, b], dtype=np.int64)
fast_sum(a)
```

Note that the `assert`

tells to Numba that the contiguous dimension is small so to better optimize the code and also not to generate a big code similar to the one of Numpy. In practice, Numba generate a much bigger code without it but it is still quite fast. Numba succeed to generate a fast AVX2 assembly code on my machine.

Note also that the code can be parallelized so to be faster on machine with a high memory throughput (ie. mainly computing server and not PCs).

## Results

Here are performance results on my i5-9600KF processor with `r = 10**7`

:

```
np.sum(a, axis=0): 87.5 ms
np.sum(np.asfortranarray(a), axis=0): 34.9 ms
np.sum(b, axis=0) + precomputed b*: 8.9 ms
Sequential fast_sum(a): 5.5 ms
Parallel fast_sum(a): 3.7 ms
* b = np.asfortranarray(a)
```

Note that the parallel Numba implementation is **optimal** on my machine as it succeeds to fully saturate my RAM. Note that my Linux machine use `np.int64`

by default and the results should be even better with `np.int32`

values (as more items can be packed in the same memory space).

Answered By – Jérôme Richard

Answer Checked By – Jay B. (BugsFixing Admin)