[SOLVED] Why does numpy.view(bool) makes numpy.logical_and significantly faster?

Table of Contents

Issue

When passing a numpy.ndarray of uint8 to numpy.logical_and, it runs significantly faster if I apply numpy.view(bool) to its inputs.

a = np.random.randint(0, 255, 1000 * 1000 * 100, dtype=np.uint8)
b = np.random.randint(0, 255, 1000 * 1000 * 100, dtype=np.uint8)

%timeit np.logical_and(a, b)
126 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.logical_and(a.view(bool), b.view(bool))
20.9 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Can someone explain why this is happening?

Furthermore, why numpy.logical_and doesn’t automatically apply view(bool) to an array of uint8? (Is there any situation where we shouldn’t use view(bool)?)

EDIT:

It seems that this is an issue with Windows environment.
I just tried the same thing in the official python docker container (which is debian) and found no difference between them.

My environment:

  • OS: Windows 10 Pro 21H2
  • CPU: AMD Ryzen 9 5900X
  • Python: Python 3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)] on win32
  • numpy: 1.22.2

Solution

This is a performance issue of the current Numpy implementation. I can also reproduce this problem on Windows (using an Intel Skylake Xeon processor with Numpy 1.20.3). np.logical_and(a, b) executes a very-inefficient scalar assembly code based on slow conditional jumps while np.logical_and(a.view(bool), b.view(bool)) executes relatively-fast SIMD instructions.

Currently, Numpy uses a specific implementation for bool-types. Regarding the compiler used, the general-purpose implementation can be significantly slower if the compiler used to build Numpy failed to automatically vectorize the code which is apparently the case on Windows (and explain why this is not the case on other platforms since the compiler is likely not exactly the same). The Numpy code can be improved for non-bool types. Note that the vectorization of Numpy is an ongoing work and we plan optimize this soon.


Deeper analysis

Here is the assembly code executed by np.logical_and(a, b):

Block 24:                         
    cmp byte ptr [r8], 0x0        ; Read a[i]
    jz <Block 27>                 ; Jump to block 27 if a[i]!=0
Block 25:                         
    cmp byte ptr [r9], 0x0        ; Read b[i]
    jz <Block 27>                 ; Jump to block 27 if b[i]!=0
Block 26:                         
    mov al, 0x1                   ; al = 1
    jmp <Block 28>                ; Skip the next instruction
Block 27:                         
    xor al, al                    ; al = 0
Block 28:                         
    mov byte ptr [rdx], al        ; result[i] = al
    inc r8                        ; i += 1
    inc rdx                       
    inc r9                        
    sub rcx, 0x1                  
    jnz <Block 24>                ; Loop again while i<a.shape[0]

As you can see, the loop use several data-dependent conditional jumps to write per item of a and b read. This is very inefficient here since the branch taken cannot be predicted by the processor with random values. As a result the processor stall for few cycles (typically about 10 cycles on modern x86 processors).

Here is the assembly code executed by np.logical_and(a.view(bool), b.view(bool)):

Block 15:
    movdqu xmm1, xmmword ptr [r10]               ; xmm1 = a[i:i+16]
    movdqu xmm0, xmmword ptr [rbx+r10*1]         ; xmm0 = b[i:i+16]
    lea r10, ptr [r10+0x10]                      ; i += 16
    pcmpeqb xmm1, xmm2                           ; \
    pandn xmm1, xmm0                             ;  | Complex sequence to just do:
    pcmpeqb xmm1, xmm2                           ;  | xmm1 &= xmm0
    pandn xmm1, xmm3                             ; /
    movdqu xmmword ptr [r14+r10*1-0x10], xmm1    ; result[i:i+16] = xmm1
    sub rcx, 0x1                                 
    jnz <Block 15>                               ; Loop again while i!=a.shape[0]//16

This code use the SIMD instruction set called SSE which is able to work on 128-bit wide registers. There is no conditional jumps. This code is far more efficient as it operates on 16 items at once per iteration and each iteration should be much faster.

Note that this last code is not optimal either as most modern x86 processors (like your AMD one) supports the 256-bit AVX-2 instruction set (twice as fast). Moreover, the compiler generate an inefficient sequence of SIMD instruction to perform the logical-and that can be optimized. The compiler seems to assume the boolean can be values different of 0 or 1. That being said, the input arrays are too big to fit in your CPU cache and so the code is bounded by the throughput of your RAM as opposed to the first one. This is why the SIMD-friendly code is not drastically faster. The difference between the two version is certainly much bigger with arrays of less than 1 MiB on your processor (like on almost all other modern processor).

Answered By – Jérôme Richard

Answer Checked By – Cary Denson (BugsFixing Admin)

Leave a Reply

Your email address will not be published. Required fields are marked *