[SOLVED] Numpy sum take longer for C-order arrays

Issue

I get drastically different timings (8x difference) for the following np.sum values.

from numpy.random import default_rng
r = 10**9
a = default_rng().integers(0, r, size=(r, 2))
np.sum(a, axis=0)  # 12s
np.sum(np.asfortranarray(a), axis=0) # 1.5s

Does this mean that np traverses a multiple times? shouldn’t the 2 sum values be calculated simultaneously while the data is streamed through the memory (using AVX may be)?

Solution

TL;DR: While the simplest explanation is that the performance gap is mainly due to a less efficient memory access pattern in the first case, it turns out that this assertion is completely wrong here! The Numpy (1.20.3) code simply generates a very inefficient code so far (especially in the first case).


Analysis

First of all, using np.asfortranarray actually cause a transposition of the input array which is copied in memory. This means that the two cases read memory using an inefficient pattern. Not to mention the second one actually store a huge temporary array in memory (using sub-optimal code) which is slow.

Moreover, usual np.sum does not use SIMD instructions for integers yet (in fact all integer types) and it adds some overheads due to the conversion to a bigger type on platform using the np.int32 type by default. Thus, in the second case, the executed assembly code of the hottest part is a basic scalar loop. It should be memory bound on mainstream x86-64 PC using the np.int64 type by default. The loop can be compute bound on the one using the np.int32 type by default.

The main reason np.sum is so slow in the first case is because Numpy generates a very inefficient code in this case. A profiling of the Numpy code gives the following results:

 %time |  Function
---------------------------------------------------
  54%  |  reduce_loop
  30%  |  LONG_add_avx2
  14%  |  npyiter_buffered_reduce_iternext_iters2
   2%  |  Other functions

reduce_loop results in a very inefficient assembly code that spend 75% of its time copying data and calling other functions in a huge loop. LONG_add_avx2 does the main computational part. It spends a major part of its time performing conditional jump in a huge loop. Based on the assembly code, it appears that the loop is mainly optimized for a large number of items in the last dimension so Numpy can use the SIMD instructions (AVX2 on my machine). The thing is the number of item in the last dimension is too small to use AVX2. Thus, the code fallback on a slow scalar implementation… Here is a part of the executed assembly code (timings in % are relative to the function execution time):

       ; Beginning of the huge loop
 6b:   leave                         ; Slow instruction (~8% of the time)
       ret                           
       nop                           
 70:   cmp          r10,0x8          
     ↓ je           290              ; Slow jump (~8% of the time)

       ; Huge assembly code with 136 instructions 
       ; that are not actually executed.
       [...]

       ; This block perform many slow conditional 
       ; jumps taking ~40% of function time.
       ; It performs many conditonal jumps on other 
       ; blocks finally jumping on the label 34e.
290:   cmp          r11,0x8                  
     ↑ jne          3f                       
       mov          rax,rcx                  
       mov          rdx,r8                   
       sub          rax,r8                   
       sub          rdx,rcx                  
       cmp          r8,rcx                   
       cmovb        rdx,rax                  
       test         rdx,rdx                  
     ↓ jne          63a                      
       cmp          rcx,rdi                  
     ↓ jbe          52e                      
       mov          rax,rcx                  
       sub          rax,rdi                  
       cmp          rax,0x3ff                
     ↓ jbe          540                      
2d1:   test         rsi,rsi                  
     ↑ jle          6b                       
       lea          rax,[rsi-0x1]            
       cmp          rax,0x2                  
     ↓ jbe          73c                      

       ; Actual computing block:
       ; Compute only 2 values per loop iteration
       ; This only take ~15% of the time
34e:   mov          rdx,QWORD PTR [r8]       
       add          rdx,QWORD PTR [rdi]      
       mov          QWORD PTR [rcx],rdx      
       lea          rdx,[rax+0x1]            
       cmp          rdx,rsi                  
     ↑ jge          6b                       
       mov          rdx,QWORD PTR [r8+0x8]   
       add          rax,0x2                  
       add          rdx,QWORD PTR [rdi+0x8]  
       mov          QWORD PTR [rcx+0x8],rdx  
       cmp          rax,rsi                  
       jge          6b                       

Based on this, one can conclude that less than 5% of the execution time is spend in doing actual useful work with an inefficient scalar accumulation!


Fast implementation

You can generate a much more efficient code using Numba in this specific case:

import numba as nb

@nb.njit('int64[::1](int_[:,::1])')
def fast_sum(arr):
    assert arr.shape[1] == 2
    a = b = np.int64(0)
    for i in nb.prange(arr.shape[0]):
        a += arr[i, 0]
        b += arr[i, 1]
    return np.array([a, b], dtype=np.int64)

fast_sum(a)

Note that the assert tells to Numba that the contiguous dimension is small so to better optimize the code and also not to generate a big code similar to the one of Numpy. In practice, Numba generate a much bigger code without it but it is still quite fast. Numba succeed to generate a fast AVX2 assembly code on my machine.

Note also that the code can be parallelized so to be faster on machine with a high memory throughput (ie. mainly computing server and not PCs).


Results

Here are performance results on my i5-9600KF processor with r = 10**7:

np.sum(a, axis=0):                      87.5 ms
np.sum(np.asfortranarray(a), axis=0):   34.9 ms
np.sum(b, axis=0) + precomputed b*:      8.9 ms
Sequential fast_sum(a):                  5.5 ms
Parallel fast_sum(a):                    3.7 ms

* b = np.asfortranarray(a)

Note that the parallel Numba implementation is optimal on my machine as it succeeds to fully saturate my RAM. Note that my Linux machine use np.int64 by default and the results should be even better with np.int32 values (as more items can be packed in the same memory space).

Answered By – Jérôme Richard

Answer Checked By – Jay B. (BugsFixing Admin)

Leave a Reply

Your email address will not be published. Required fields are marked *