Table of Contents

## Issue

I have two large arrays, one containing values, and one being a mask basically. The code below shows the function I want to implement.

```
from scipy.signal import convolve2d
import numpy as np
sample = np.array([[6, 4, 5, 5, 5],
[7, 1, 0, 8, 3],
[2, 5, 4, 8, 4],
[2, 0, 2, 6, 0],
[5, 7, 2, 3, 2]])
mask = np.array([[1, 0, 1, 1, 0],
[0, 0, 1, 0, 1],
[0, 1, 0, 0, 0],
[0, 0, 0, 1, 0],
[1, 1, 0, 0, 1]])
neighbors_sum = convolve2d(sample, np.ones((3,3), dtype=int), mode='same', boundary='wrap')
# neighbors_sum = np.array([[40, 37, 35, 33, 44],
# [37, 34, 40, 42, 48],
# [24, 23, 34, 35, 40],
# [27, 29, 37, 31, 32],
# [31, 33, 34, 30, 34]])
result = np.where(mask, neighbors_sum, 0)
print(result)
```

This code works, and gets me what I expects:

```
np.array([[40, 0, 35, 33, 0],
[ 0, 0, 40, 0, 48],
[ 0, 23, 0, 0, 0],
[ 0, 0, 0, 31, 0],
[31, 33, 0, 0, 34]])
```

So far, so good. However, where I’m encountering some large issue is when I increase the size of the arrays. In my case, instead of a 5×5 input and a 3×3 summing mask, I need a 50,000×20,000 input and a 100×100 summing mask. And when I move to that, the convolve2d function is in all kinds of trouble and the calculation is extremely long.

Given that I only care about the masked result, and thus only care about the summation from convolve2d at those points, can anyone think of a smart approach to take here? Going to a for loop and selecting only the points of interest would lose the speed advantage of the vectorization so I’m not convinced this would be worth it.

Any suggestion welcome!

## Solution

`convolve2d`

is very inefficient in this case. Since the mask is `np.ones`

, you can split the filter in two trivial ones thanks to **separable filtering**: one `np.ones(100, 1)`

filter and one `np.ones(1, 100)`

filter. Moreover, a **rolling sum** can be used to speed up even more the computation.

Here is a simple solution without a rolling sum:

```
# Simple faster implementation
tmp = convolve2d(sample, np.ones((1,100), dtype=int), mode='same', boundary='wrap')
neighbors_sum = convolve2d(tmp, np.ones((100,1), dtype=int), mode='same', boundary='wrap')
result = np.where(mask, neighbors_sum, 0)
```

You can compute the rolling sum efficiently using **Numba**. The strategy is to split the computation in 3 parts: the horizontal rolling sum, the vertical rolling sum and the final masking. Each step can be fully parallelized using **multiple threads** (although parallelizing the vertical rolling sum is harder with Numba). Each part needs to work line by line so to be *cache friendly*.

```
# Complex very-fast implementation
import numba as nb
# Numerical results may diverge if the input contains big
# values with many small ones.
# Does not support inputs containing NaN values or +/- Inf ones.
@nb.njit('float64[:,::1](float64[:,::1], int_)', parallel=True, fastmath=True)
def horizontalRollingSum(sample, filterSize):
n, m = sample.shape
fs = filterSize
# Make the wrapping part of the rolling sum much simpler
assert fs >= 1
assert n >= fs and m >= fs
# Horizontal rolling sum.
tmp = np.empty((n, m), dtype=np.float64)
for i in nb.prange(n):
s = 0.0
lShift = fs//2
rShift = (fs-1)//2
for j in range(m-lShift, m):
s += sample[i, j]
for j in range(0, rShift+1):
s += sample[i, j]
tmp[i, 0] = s
for j in range(1, m):
jLeft, jRight = (j-1-lShift)%m, (j+rShift)%m
s += sample[i, jRight] - sample[i, jLeft]
tmp[i, j] = s
return tmp
@nb.njit('float64[:,::1](float64[:,::1], int_)', fastmath=True)
def verticaltalRollingSum(sample, filterSize):
n, m = sample.shape
fs = filterSize
# Make the wrapping part of the rolling sum much simpler
assert fs >= 1
assert n >= fs and m >= fs
# Horizontal rolling sum.
tmp = np.empty((n, m), dtype=np.float64)
tShift = fs//2
bShift = (fs-1)//2
for j in range(m):
tmp[0, j] = 0.0
for i in range(n-tShift, n):
for j in range(m):
tmp[0, j] += sample[i, j]
for i in range(0, bShift+1):
for j in range(m):
tmp[0, j] += sample[i, j]
for i in range(1, n):
iTop = (i-1-tShift)%n
iBot = (i+bShift)%n
for j in range(m):
tmp[i, j] = tmp[i-1, j] + (sample[iBot, j] - sample[iTop, j])
return tmp
@nb.njit('float64[:,::1](float64[:,::1], int_[:,::1], int_)', parallel=True, fastmath=True)
def compute(sample, mask, filterSize):
n, m = sample.shape
tmp = horizontalRollingSum(sample, filterSize)
neighbors_sum = verticaltalRollingSum(tmp, filterSize)
res = np.empty((n, m), dtype=np.float64)
for i in nb.prange(n):
for j in range(n):
res[i, j] = neighbors_sum[i, j] * mask[i, j]
return res
```

## Benchmark & Notes

Here is the testing code:

```
n, m = 5000, 2000
sample = np.random.rand(n, m)
mask = (np.random.rand(n, m) < 0.05).astype(int)
```

Here are the results on my 6-core machine:

```
Initial solution: 174366 ms (x1)
With separate filters: 5710 ms (x31)
Final Numba solution: 40 ms (x4359)
Optimal theoretical time: 10 ms (optimistic)
```

Thus, the Numba implementation is **4359 times faster** than the initial one.

That being said, be careful of possible numerical issues that this last implementation can have regarding the input array (see the comments in the code). It should be fine as long as `np.std(sample)`

is relatively small and `np.all(np.isfinite(sample))`

is true.

Note that the code can be further optimized: the vertical rolling sum can be parallelized; modulus operations can be avoided in the horizontal rolling sum; the vertical rolling sum and the masking steps can be merged together (ie. by computing `res`

on-the-fly and not storing `tmp`

); tiling can be used to compute all the steps simultaneously in a more cache-friendly way. However, these optimizations make the code more complex and some of them are very hard to perform (especially the last one with Numba).

Note that using a *boolean mask* (instead of an integer-based one) should make the algorithm faster since it takes less memory and processors can fetch values faster.

Answered By – Jérôme Richard

Answer Checked By – Jay B. (BugsFixing Admin)