## Issue

I have a large vector and I want to compute its element-wise inverse each time with a small perturbation. For example, for large `N`, I have `y` and its element-wise inverse `y_inv`:

```
y = np.random.rand(1, N)
y_inv = 1. / y
y_inv[y_inv == np.inf] = 0  # y may contain zeros
```

and I want to compute

```
p = 0.01
z = y + p
z_inv = 1. / z  # i.e. 1. / (y + p)
```

multiple times for different `p` and fixed `y`. Because `N` is very large, is there an efficient way, or an approximation, to use the already computed `y_inv` in the computation of `z_inv`? Any suggestions to increase the speed of inverting `z` are highly appreciated.

## Solution

Floating-point division is slow, especially for double-precision floating-point numbers. **Single precision** is faster (with a *relative* error likely less than 2.5e-07), and it can be made even faster, if you do not need high precision, by computing the **approximate reciprocal** (the maximum *relative* error of this approximation is less than 3.66e-4 on x86-64 processors).

Assuming it is acceptable to reduce the precision of the results, here is the throughput of the different approaches on a Skylake processor (assuming NumPy is compiled with support for the AVX SIMD instruction set):

```
Double-precision division:               8 cycles/operation/core
Single-precision division:               5 cycles/operation/core
Single-precision approximate reciprocal: 1 cycle/operation/core
```

You can easily switch to single precision with NumPy by specifying the `dtype` of the arrays. For the approximate reciprocal, you need to use Numba (or Cython) configured to use the `fastmath` model (a flag of `njit`).
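As a minimal sketch of the single-precision approach (the array size `N` and the perturbation value here are illustrative, not from the question):

```python
import numpy as np

N = 1_000_000
# Single precision halves memory traffic and uses the faster
# single-precision SIMD division units.
y = np.random.rand(N).astype(np.float32)
p = np.float32(0.01)
z_inv = np.float32(1.0) / (y + p)  # stays entirely in float32
```

Note that mixing in any float64 scalar (e.g. a plain Python `0.01`) would silently promote the computation back to double precision, so the scalar must be a `np.float32` too.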

Another solution is simply to execute this operation using **multiple threads**. This can easily be done with Numba using the `njit` flag `parallel=True` and a loop iterating over `nb.prange`. This solution can be combined with the previous one (about precision), resulting in a much faster computation compared to the initial double-precision NumPy code.

Moreover, computing arrays in place should be faster (especially when using multiple threads). Alternatively, you can preallocate the output array and reuse it (slower than the in-place method but faster than the naive approach). The `out` parameter of NumPy functions (like `np.divide`) can be used to do that.

Here is an (untested) example of parallel Numba code:

```
import numba as nb

# See fastmath=True and float32 types for faster, approximate results.
# Assume `out` is already preallocated.
@nb.njit('void(float64[::1], float64[::1], float64)', parallel=True)
def compute(out, y, p):
    assert out.size == y.size
    for i in nb.prange(y.size):
        out[i] = 1.0 / (y[i] + p)
```

Answered By – Jérôme Richard

Answer Checked By – David Marino (BugsFixing Volunteer)