[SOLVED] Numba np.convolve really slow

Issue

I’m trying to speed up a piece of code convolving a 1D array (filter) over each column of a 2D array. Somehow, when I run it with numba’s njit, I get a 7x slow down. My thoughts:

  • Maybe column indexing is slowing it down, but switching to row indexing didn’t affect performance
  • Maybe slice indexing the results of the convolution is slow, but removing it didn’t change anything
  • I’ve checked that numba understands all the types properly

(tested on Windows 10, python 3.9.4 from conda, numpy 1.12.2, numba 0.53.1)

Can anyone tell me why this code is slow?

import numpy as np
from numba import njit

def f1(a1, filt):
    l2 = filt.size // 2
    res = np.empty(a1.shape)
    for i in range(a1.shape[1]):
        res[:, i] = np.convolve(a1[:, i], filt)[l2:-l2]
    return res

@njit
def f1_jit(a1, filt):
    l2 = filt.size // 2
    res = np.empty(a1.shape)
    for i in range(a1.shape[1]):
        res[:, i] = np.convolve(a1[:, i], filt)[l2:-l2]
    return res

a1 = np.random.random((6400, 1000))
filt = np.random.random((65))
f1(a1, filt)
f1_jit(a1, filt)

%timeit f1(a1, filt)     # 404 ms ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f1_jit(a1, filt) # 2.8 s ± 66.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution

The problem comes from the Numba implementation of np.convolve. This is a known issue. It turns out that the current Numba implementation is much slower than the one of Numpy (version <=0.54.1 tested on Windows).


Under the hood

On one hand, the Numpy implementation call correlate which itself performs a dot product that should be implemented by the fast BLAS library available on your system. On the other hand, the Numba implementation calls _get_inner_prod which use np.dot that should also use the same BLAS library (assuming a BLAS is detected which should be the case)…

That being said, there are multiple issues related to the dot product:

First of all, if the internal variable _HAVE_BLAS of numba/np/arraymath.py is manually disabled, Numba use a fallback implementation of the dot product supposed to be significantly slower. However, it turns out that using the fallback dot product implementation used by np.convolve result in a 5 times faster execution than with the BLAS wrapper on my machine! Using additionally the parameter fastmath=True in the njit Numba decorator results in an overall 8.7 times faster execution! Here is the testing code:

import numpy as np
import numba as nb

def npConvolve(a, b):
    return np.convolve(a, b)

@nb.njit('float64[:](float64[:], float64[:])')
def nbConvolveUncont(a, b):
    return np.convolve(a, b)

@nb.njit('float64[::1](float64[::1], float64[::1])')
def nbConvolveCont(a, b):
    return np.convolve(a, b)

a = np.random.random(6400)
b = np.random.random(65)
%timeit -n 100 npConvolve(a, b)
%timeit -n 100 nbConvolveUncont(a, b)
%timeit -n 100 nbConvolveCont(a, b)

Here are raw interesting results:

With _HAVE_BLAS=True (default):
126 µs ± 292 ns per loop
1.6 ms ± 21.3 µs per loop
1.6 ms ± 18.5 µs per loop

With _HAVE_BLAS=False:
125 µs ± 359 ns per loop
311 µs ± 1.18 µs per loop
268 µs ± 4.26 µs per loop

With _HAVE_BLAS=False and fastmath=True:
125 µs ± 757 ns per loop
327 µs ± 3.69 µs per loop
183 µs ± 654 ns per loop

Moreover, np_convolve of Numba internally flip some array parameter and then perform the dot product using a flipped array that have a non trivial stride (ie. not 1). Such a non-trivial stride may have an impact on the dot product performance. More generally, any transformation preventing the compiler to know that arrays are contiguous will certainly strongly impact the performances. Indeed, the following test shows the impact of working on a contiguous array with the dot product implementation of Numba:

import numpy as np
import numba as nb

def np_dot(a, b):
    return np.dot(a, b)

@nb.njit('float64(float64[::1], float64[::1])')
def nb_dot_cont(a, b):
    return np.dot(a, b)

@nb.njit('float64(float64[::1], float64[:])')
def nb_dot_stride(a, b):
    return np.dot(a, b)

v = np.random.random(128*1024)
%timeit -n 200 np_dot(v, v)         #  36.5 µs ±  4.9 µs per loop
%timeit -n 200 nb_dot_stride(v, v)  # 361.0 µs ± 17.1 µs per loop  (x10 !!!)
%timeit -n 200 nb_dot_cont(v, v)    #  34.1 µs ±  2.9 µs per loop

Some general notes about Numpy and Numba

Note that Numba can hardly accelerate the Numpy calls when they work on pretty-big arrays since Numba re-implement Numpy functions mostly in Python and use a JIT compiler (LLVM-Lite) to speed them up, while Numpy is mostly implemented in plain-C (with a rather-slow Python wrapping code). The Numpy code uses low-level processor features like SIMD instructions to speed up a lot the execution of many functions. Both appear to use BLAS libraries known to be highly optimized. Numpy tends to be generally more optimized since Numpy is currently more mature than Numba: Numpy has much more contributors working since a longer time.

Answered By – Jérôme Richard

Answer Checked By – Terry (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *