# [SOLVED] Numba parallel code slower than its sequential counterpart

## Issue

I’m new to Numba and I’m trying to implement an old Fortran code in Python using Numba (version 0.54.1), but when I add `parallel = True` the program actually slows down. My program is very simple: I change the positions x and y in a L x L grid and for each position in the grid I perform a summation

``````import numpy as np
import numba as nb

@nb.njit(parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
L = len(x_grid)
lypnv = np.zeros((L, L))
for ii in nb.prange(L):
for jj in range(L):
x = x_grid[ii]
y = y_grid[jj]
beta0 = 0
sumT11 = 0

for j in range(N):
y = (y - k*np.sin(x)) % (2*np.pi)
x = (x + y) % (2*np.pi)
J = np.array([[1.0, -k*np.cos(x)], [1.0, 1.0 - k*np.cos(x)]])
beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
sumT11 += np.log(abs(T11))/np.log(2)

beta0 = beta

lypnv[ii, jj] = sumT11/N
return lypnv

# Compile
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1, 10)
# Parameters
N = int(1e3)
L = 128
pi = np.pi
k = 1.5
# Limits of the phase space
x0 = -pi
xf = pi
y0 = -pi
yf = pi
# Grid positions
x = np.linspace(x0, xf, L, endpoint=True)
y = np.linspace(y0, yf, L, endpoint=True)

lypnv = lyapunov_grid(x, y, k, N)
``````

With `parallel=False` it takes about 8s to run, however with `parallel=True` it takes about 14s. I also tested with another code from https://github.com/animator/mandelbrot-numba and in this case the parallelization works.

``````import math
import numpy as np
import numba as nb

WIDTH = 1000
MAX_ITER = 1000

@nb.njit(parallel=True)
def mandelbrot(width, max_iter):
pixels = np.zeros((width, width, 3), dtype=np.uint8)
for y in nb.prange(width):
for x in range(width):
c0 = complex(3.0*x/width - 2, 3.0*y/width - 1.5)
c = 0
for i in range(1, max_iter):
if abs(c) > 2:
log_iter = math.log(i)
pixels[y, x, :] = np.array([int(255*(1+math.cos(3.32*log_iter))/2),
int(255*(1+math.cos(0.774*log_iter))/2),
int(255*(1+math.cos(0.412*log_iter))/2)],
dtype=np.uint8)
break
c = c * c + c0
return pixels

# compile
_ = mandelbrot(WIDTH, 10)

calcpixels = mandelbrot(WIDTH, MAX_ITER)
``````

## Solution

One main issue is that the second function call compile the function again. Indeed, the types of the provided arguments change: in the first call the third argument is an integer (`int` transformed to a `np.int_`) while in the second call the third argument (`k`) is a floating point number (`float` transformed to a `np.float64`). Numba recompiles the function for different parameter types because they are deduced from the type of the arguments and it does not know you want to use a `np.float64` type for the third argument (since the first time the function is compiled with for a `np.int_` type). One simple solution to fix the problem is to change the first call to:

``````_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1.0, 10)
``````

However, this is not a robust way to fix the problem. You can specify the parameter types to Numba so it will compile the function at declaration time. This also remove the need to artificially call the function (with useless parameters).

``````@nb.njit('float64[:,:](float64[::1], float64[::1], float64, float64)', parallel=True)
``````

Note that `(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0))` is zero the first time resulting in a division by 0.

Another main issue comes from the allocations of many small arrays in the loop causing a contention of the standard allocator (see this post for more information). While Numba could theoretically optimize it (ie. replace the array with local variables), it actually does not, resulting in a huge slowdown and a contention. Hopefully, in your case, you do not need to actually create the array. At last, you can create it only in the encompassing loop and modify it in the innermost loop. Here is the optimized code:

``````@nb.njit('float64[:,:](float64[::1], float64[::1], float64, float64)', parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
L = len(x_grid)
lypnv = np.zeros((L, L))
for ii in nb.prange(L):
J = np.ones((2, 2), dtype=np.float64)

for jj in range(L):
x = x_grid[ii]
y = y_grid[jj]
beta0 = 0
sumT11 = 0

for j in range(N):
y = (y - k*np.sin(x)) % (2*np.pi)
x = (x + y) % (2*np.pi)
J[0, 1] = -k*np.cos(x)
J[1, 1] = 1.0 - k*np.cos(x)
beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
sumT11 += np.log(abs(T11))/np.log(2)

beta0 = beta

lypnv[ii, jj] = sumT11/N
return lypnv
``````

Here is the results on a old 2-core machine (with 4 hardware threads):

``````Original sequential:   15.9 s
Original parallel:     11.9 s
Fix-build sequential:  15.7 s
Fix-build parallel:    10.1 s
Optimized sequential:  2.73 s
Optimized parallel:    0.94 s
``````

The optimized implementation is much faster than the others. The parallel optimized version scale very well compared than the original one (2.9 times faster than the sequential one). Finally, the best version is about 12 times faster than the original parallel version. I expect a much faster computation on a recent machine with many more cores.