# [SOLVED] Numpy sum take longer for C-order arrays

## Issue

I get drastically different timings (8x difference) for the following `np.sum` values.

``````from numpy.random import default_rng
r = 10**9
a = default_rng().integers(0, r, size=(r, 2))
np.sum(a, axis=0)  # 12s
np.sum(np.asfortranarray(a), axis=0) # 1.5s
``````

Does this mean that np traverses `a` multiple times? shouldn’t the 2 sum values be calculated simultaneously while the data is streamed through the memory (using AVX may be)?

## Solution

TL;DR: While the simplest explanation is that the performance gap is mainly due to a less efficient memory access pattern in the first case, it turns out that this assertion is completely wrong here! The Numpy (1.20.3) code simply generates a very inefficient code so far (especially in the first case).

## Analysis

First of all, using `np.asfortranarray` actually cause a transposition of the input array which is copied in memory. This means that the two cases read memory using an inefficient pattern. Not to mention the second one actually store a huge temporary array in memory (using sub-optimal code) which is slow.

Moreover, usual `np.sum` does not use SIMD instructions for integers yet (in fact all integer types) and it adds some overheads due to the conversion to a bigger type on platform using the `np.int32` type by default. Thus, in the second case, the executed assembly code of the hottest part is a basic scalar loop. It should be memory bound on mainstream x86-64 PC using the `np.int64` type by default. The loop can be compute bound on the one using the `np.int32` type by default.

The main reason `np.sum` is so slow in the first case is because Numpy generates a very inefficient code in this case. A profiling of the Numpy code gives the following results:

`````` %time |  Function
---------------------------------------------------
54%  |  reduce_loop
14%  |  npyiter_buffered_reduce_iternext_iters2
2%  |  Other functions
``````

`reduce_loop` results in a very inefficient assembly code that spend 75% of its time copying data and calling other functions in a huge loop. `LONG_add_avx2` does the main computational part. It spends a major part of its time performing conditional jump in a huge loop. Based on the assembly code, it appears that the loop is mainly optimized for a large number of items in the last dimension so Numpy can use the SIMD instructions (AVX2 on my machine). The thing is the number of item in the last dimension is too small to use AVX2. Thus, the code fallback on a slow scalar implementation… Here is a part of the executed assembly code (timings in % are relative to the function execution time):

``````       ; Beginning of the huge loop
6b:   leave                         ; Slow instruction (~8% of the time)
ret
nop
70:   cmp          r10,0x8
↓ je           290              ; Slow jump (~8% of the time)

; Huge assembly code with 136 instructions
; that are not actually executed.
[...]

; This block perform many slow conditional
; jumps taking ~40% of function time.
; It performs many conditonal jumps on other
; blocks finally jumping on the label 34e.
290:   cmp          r11,0x8
↑ jne          3f
mov          rax,rcx
mov          rdx,r8
sub          rax,r8
sub          rdx,rcx
cmp          r8,rcx
cmovb        rdx,rax
test         rdx,rdx
↓ jne          63a
cmp          rcx,rdi
↓ jbe          52e
mov          rax,rcx
sub          rax,rdi
cmp          rax,0x3ff
↓ jbe          540
2d1:   test         rsi,rsi
↑ jle          6b
lea          rax,[rsi-0x1]
cmp          rax,0x2
↓ jbe          73c

; Actual computing block:
; Compute only 2 values per loop iteration
; This only take ~15% of the time
34e:   mov          rdx,QWORD PTR [r8]
add          rdx,QWORD PTR [rdi]
mov          QWORD PTR [rcx],rdx
lea          rdx,[rax+0x1]
cmp          rdx,rsi
↑ jge          6b
mov          rdx,QWORD PTR [r8+0x8]
add          rdx,QWORD PTR [rdi+0x8]
mov          QWORD PTR [rcx+0x8],rdx
cmp          rax,rsi
jge          6b
``````

Based on this, one can conclude that less than 5% of the execution time is spend in doing actual useful work with an inefficient scalar accumulation!

## Fast implementation

You can generate a much more efficient code using Numba in this specific case:

``````import numba as nb

@nb.njit('int64[::1](int_[:,::1])')
def fast_sum(arr):
assert arr.shape[1] == 2
a = b = np.int64(0)
for i in nb.prange(arr.shape[0]):
a += arr[i, 0]
b += arr[i, 1]
return np.array([a, b], dtype=np.int64)

fast_sum(a)
``````

Note that the `assert` tells to Numba that the contiguous dimension is small so to better optimize the code and also not to generate a big code similar to the one of Numpy. In practice, Numba generate a much bigger code without it but it is still quite fast. Numba succeed to generate a fast AVX2 assembly code on my machine.

Note also that the code can be parallelized so to be faster on machine with a high memory throughput (ie. mainly computing server and not PCs).

## Results

Here are performance results on my i5-9600KF processor with `r = 10**7`:

``````np.sum(a, axis=0):                      87.5 ms
np.sum(np.asfortranarray(a), axis=0):   34.9 ms
np.sum(b, axis=0) + precomputed b*:      8.9 ms
Sequential fast_sum(a):                  5.5 ms
Parallel fast_sum(a):                    3.7 ms

* b = np.asfortranarray(a)
``````

Note that the parallel Numba implementation is optimal on my machine as it succeeds to fully saturate my RAM. Note that my Linux machine use `np.int64` by default and the results should be even better with `np.int32` values (as more items can be packed in the same memory space).

Answered By – Jérôme Richard

Answer Checked By – Jay B. (BugsFixing Admin)