[SOLVED] Optimized way of element wise vector inverse with perturbations

Issue

I have a large vector and I want to compute its element-wise inverse, each time with a small perturbation added. For example, for large N, I have y and its element-wise inverse y_inv

y = np.random.rand(1, N)  
y_inv = 1. / y
y_inv[y_inv == np.inf] = 0 # y may contain zeros

and I want to compute

p = 0.01
z = y + p
z_inv = 1. / z  # = 1. / (y + p)

multiple times for different p and fixed y. Because N is very large, is there an efficient way, or an approximation, to reuse the already computed y_inv in the computation of z_inv? Any suggestions to speed up inverting z are highly appreciated.

Solution

Floating-point divisions are slow, especially for double-precision numbers. Single-precision is faster (with a relative error likely less than 2.5e-07), and it can be made even faster, if you do not need high precision, by computing the approximate reciprocal (the maximum relative error of this approximation is less than 3.66e-4 on x86-64 processors).

Assuming it is OK for you to decrease the precision of the results, here is the throughput of the different approaches on a Skylake processor, in cycles per SIMD instruction, so lower is better (assuming Numpy is compiled with support for the AVX SIMD instruction set):

Double-precision division:                8 cycles/instruction/core
Single-precision division:                5 cycles/instruction/core
Single-precision approximate reciprocal:  1 cycle/instruction/core

You can easily switch to single-precision in Numpy by specifying the dtype of the arrays. For the approximate reciprocal, you need to use Numba (or Cython) configured to use the fast-math mode (the fastmath flag of njit).
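For instance, a minimal sketch of the single-precision Numpy variant (the array size N and the value of p are placeholders):

```python
import numpy as np

N = 1_000_000
y = np.random.rand(N).astype(np.float32)  # single precision: half the memory traffic
p = np.float32(0.01)

z_inv = np.float32(1.0) / (y + p)         # the whole computation stays in float32
```

Keeping every operand in float32 matters: mixing in a float64 literal would silently promote the result back to double precision.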

Another solution is simply to execute this operation using multiple threads. This can easily be done with Numba by passing parallel=True to njit and iterating with nb.prange. This solution can be combined with the previous one (about precision), resulting in a much faster computation than the initial double-precision Numpy code.

Moreover, computing the arrays in-place should be faster (especially when using multiple threads). Alternatively, you can preallocate the output array and reuse it (slower than the in-place method but faster than the naive approach). The out parameter of Numpy functions (like np.divide) can be used for that.
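A small sketch of the preallocate-and-reuse approach with the out parameter (sizes and p values are illustrative):

```python
import numpy as np

N = 1_000_000
y = np.random.rand(N)
out = np.empty_like(y)            # allocated once, reused for every p

for p in (0.01, 0.02, 0.03):
    np.add(y, p, out=out)         # out = y + p   (no temporary array)
    np.divide(1.0, out, out=out)  # out = 1 / (y + p), no extra allocation
```

This avoids allocating a fresh array on every iteration, which is where much of the time goes for large N.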

Here is an (untested) example of parallel Numba code:

import numba as nb

# Use fastmath=True and float32 types for faster, approximate results
# Assume `out` is already preallocated
@nb.njit('void(float64[::1], float64[::1], float64)', parallel=True)
def compute(out, y, p):
    assert out.size == y.size
    for i in nb.prange(y.size):
        out[i] = 1.0 / (y[i] + p)

Answered By – Jérôme Richard

Answer Checked By – David Marino (BugsFixing Volunteer)
