I have a big vector and I want to compute its element-wise inverse, each time with a small perturbation. For example, for large `N`, I have `y` and its element-wise inverse:

```python
y = np.random.rand(1, N)
y_inv = 1. / y
y_inv[y_inv == np.inf] = 0  # y may contain zeros
```

and I want to compute

```python
p = 0.01
z = y + p
z_inv = 1. / z  # = 1. / (y + p)
```

multiple times for different `p` and fixed `y`. Since `N` is very large, is there an efficient way, or an approximation, to reuse the already computed `y_inv` in the computation of `z_inv`? Any suggestions to speed up inverting `z` are highly appreciated.
Floating-point division is slow, especially for double-precision floating-point numbers. Single precision is faster (with a relative error likely below 2.5e-07), and it can be made even faster if you do not need high precision by computing an approximate reciprocal (the maximum relative error of this approximation is less than 3.66e-4 on x86-64 processors).
Assuming it is OK for you to decrease the precision of the results, here is the throughput of the different approaches on a Skylake processor (assuming NumPy is compiled with support for the AVX SIMD instruction set); lower is faster:

- Double-precision division: 8 cycles/operation/core
- Single-precision division: 5 cycles/operation/core
- Single-precision approximate reciprocal: 1 cycle/operation/core
You can easily switch to single precision with NumPy by specifying the `dtype` of the arrays. For the approximate reciprocal, you need to use Numba (or Cython) configured to use the fastmath model (the `fastmath=True` flag of the Numba JIT decorator).
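As a minimal sketch of the `dtype` switch (the array size and variable names are illustrative):

```python
import numpy as np

N = 1_000_000
y64 = np.random.rand(N)        # float64 by default
y32 = y64.astype(np.float32)   # single precision: cheaper divisions

# Keep the constant in float32 too, so the division stays single precision
inv32 = np.float32(1.0) / y32
```

Note that the result dtype stays `float32` throughout, so no silent promotion back to double precision occurs.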
Another solution is simply to execute this operation using multiple threads. This can easily be done with Numba using the `parallel=True` flag and a loop iterating over `nb.prange`. This solution can be combined with the previous one (about precision), resulting in a much faster computation compared to the initial double-precision NumPy code.
Moreover, computing arrays in place should be faster (especially when using multiple threads). Alternatively, you can preallocate the output array and reuse it (slower than the in-place method but faster than the naive approach). The `out` parameter of NumPy functions (like `np.divide`) can be used to do that.
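For instance, a pure-NumPy sketch of the preallocated-buffer approach (the buffer name `z_inv` and the set of `p` values are illustrative):

```python
import numpy as np

N = 1_000_000
y = np.random.rand(N)
z_inv = np.empty_like(y)  # allocated once, reused for every p

for p in (0.01, 0.02, 0.05):
    np.add(y, p, out=z_inv)           # z_inv = y + p, no temporary array
    np.divide(1.0, z_inv, out=z_inv)  # z_inv = 1 / (y + p), in place
```

Writing both steps through `out=` avoids allocating a fresh temporary on every iteration, which matters when the loop over `p` runs many times.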
Here is an (untested) example of parallel Numba code:

```python
import numba as nb

# See fastmath=True and float32 types for faster approximate results.
# Assume `out` is already preallocated.
@nb.njit('void(float64[::1], float64[::1], float64)', parallel=True)
def compute(out, y, p):
    assert out.size == y.size
    for i in nb.prange(y.size):
        out[i] = 1.0 / (y[i] + p)
```
Answered By – Jérôme Richard