# [SOLVED] Broadcasting + reshaping VS list comprehension speed in python

## Issue

Given the following `numpy` arrays:

``````import numpy as np
import time

v1 = np.linspace(20, 250, 100000000)
a = np.array([12.592,16.320])
m = np.array([3, 5])
``````

How is it possible that list comprehension:

``````start = time.time()
v2 = np.max(
[10 ** _a * np.power(v1.astype(float), -_m) for _m, _a in zip(m, a)],
axis=0
)
end = time.time()
print(end - start)  # prints 5.822041034698486
``````

is more than twice as fast than `numpy`‘s `broadcasting`?

``````start = time.time()
v2 = np.max(
np.power(10, a) * np.power(v1.astype(float)[:, None], -m).reshape(
v1.shape,-1),
axis=1
)
end = time.time()
print(end - start)  # prints 12.292157173156738
``````

And what is the "fastest approach" of calculating `v2`?

## Solution

Both are inefficient. Indeed, you can rewrite the operation much more efficiently using mathematical property of the natural logarithm and the exponential. Here is an equivalent (unoptimized) expression that can be further optimized:

``````v2 = np.max(
[np.exp(_a * np.log(10) - _m * np.log(v1.astype(float))) for _m, _a in zip(m, a)],
axis=0
)
``````

Since `np.log(v1.astype(float)))` is a constant, you can pre-compute it. Actually, `v1` items are already of type `float` (note that `float` means `np.float64` for Numpy and not the CPython `float` object type). `np.log(v1)` will do the job correctly (unless `v1` is set to an array of `np.float32` for example). Additionally, `np.exp` can be computed only on the final result (ie. after computing `np.max`) since `a < b` is equivalent to `e**a < e**b`. Finally, you can use in-place operations to avoid creating several expensive temporary arrays. This can be done using the `out` parameter of many functions like `np.subtract` or `np.multiply` for example.

Here is the resulting code:

``````log_v1 = np.log(v1)
tmp = np.empty((len(a), v1.size), dtype=np.float64)
v2 = np.exp(np.max(
[np.subtract(_a * np.log(10), np.multiply(log_v1, _m, out=_tmp), out=_tmp) for _m, _a, _tmp in zip(m, a, tmp)],
axis=0,
out=tmp
), out=tmp)
``````

For even faster performance, you can simply use Numba so to avoid writing huge array in memory (which is slow). Numba can also use multiple threads to speed this up a lot. Here is an example:

``````import numba as nb

# Note that fastmath=True can cause issues with values likes NaN Inf.
# Please disable it if your input array contains such spacial values.
@nb.njit('float64[::1](float64[::1], float64[::1], float64[::1])', fastmath=True, parallel=True)
def compute(v1, m, a):
assert a.size > 0 and a.size == m.size
out = np.empty(v1.size, dtype=np.float64)
log10 = np.log(10)
for i in nb.prange(v1.size):
log_v1 = np.log(v1[i])
maxi = a * log10 - m * log_v1
for j in range(1, len(a)):
value = a[j] * log10 - m[j] * log_v1
if value > maxi:
maxi = value
out[i] = np.exp(maxi)
return out

v2 = compute(v1, m.astype(float), a)
``````

## Benchmark

Here are results on my 6-core machine:

``````Initial code with big arrays:    10.013 s    (inefficient)
Initial code with lists:          8.147 s    (inefficient)
Optimized Numpy code:             2.009 s    (memory-bound)
Optimized Numba code:             0.300 s    (compute-bound)
``````

As you can see, Numba is much faster than the other methods: about 30 times faster.