## Issue

**Update:** this feature is now available as `scipy.stats.qmc.discrepancy`, where it has been ported to Cython and parallelised.
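As a quick sketch of using that SciPy port (assuming SciPy >= 1.7; the default `method='CD'` computes the centered discrepancy, which is the `c2` quantity the code below computes by hand):

```python
import numpy as np
from scipy.stats import qmc

# Sample must lie in the unit hypercube [0, 1)^dim.
sample = np.random.random_sample((100, 2))

# Centered discrepancy, computed in Cython and optionally parallelised
# via the workers argument.
d = qmc.discrepancy(sample, method='CD')
```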

I have a function using some for loops, and I wanted to improve its speed using NumPy. But this does not seem to do the trick, as the NumPy version appears to be about 2 times slower. Here is the code:

```
import numpy as np
import itertools
import timeit


def func():
    sample = np.random.random_sample((100, 2))
    disc1 = 0
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]
    for i in range(n_sample):
        prod = 1
        for k in range(dim):
            sub = np.abs(sample[i, k] - 0.5)
            prod *= 1 + 0.5 * sub - 0.5 * sub ** 2
        disc1 += prod
    for i, j in itertools.product(range(n_sample), range(n_sample)):
        prod = 1
        for k in range(dim):
            a = 0.5 * np.abs(sample[i, k] - 0.5)
            b = 0.5 * np.abs(sample[j, k] - 0.5)
            c = 0.5 * np.abs(sample[i, k] - sample[j, k])
            prod *= 1 + a + b - c
        disc2 += prod
    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2


def func_numpy():
    sample = np.random.random_sample((100, 2))
    disc2 = 0
    n_sample = len(sample)
    dim = sample.shape[1]
    disc1 = np.sum(np.prod(1 + 0.5 * np.abs(sample - 0.5)
                           - 0.5 * np.abs(sample - 0.5) ** 2, axis=1))
    for i, j in itertools.product(range(n_sample), range(n_sample)):
        disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5)
                         + 0.5 * np.abs(sample[j] - 0.5)
                         - 0.5 * np.abs(sample[i] - sample[j]))
    c2 = (13 / 12) ** dim - 2 / n_sample * disc1 + 1 / (n_sample ** 2) * disc2


print('Normal function time: ', timeit.repeat('func()', number=20, repeat=5, setup="from __main__ import func"))
print('numpy function time: ', timeit.repeat('func_numpy()', number=20, repeat=5, setup="from __main__ import func_numpy"))
```

The timing output is:

```
Normal function time: [2.831496894999873, 2.832342429959681, 2.8009242500411347, 2.8075121529982425, 2.824807019031141]
numpy function time: [5.154757721000351, 5.2011515340418555, 5.148996959964279, 5.095560318033677, 5.125199959962629]
```

What am I missing here? I know that the bottleneck is the `itertools` part, because it is a 100 × 100 × 2 loop instead of the 100 × 2 loop before.

Do you see another way to do that?

## Solution

With NumPy, one must look to vectorize things, and we can certainly do so here.

Taking a closer look at the loop portion, we are iterating along the first axis of the input data `sample` twice with this loop setup:

```
for i, j in itertools.product(range(n_sample), range(n_sample)):
```

We can convert these iterations into vectorized operations by letting `broadcasting` handle them.
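As a minimal illustration of that broadcasting idea: `x[:, None]` turns a length-`N` vector into an `(N, 1)` column, so subtracting the original `(N,)` vector yields the full `(N, N)` table of pairwise differences in one operation, with no explicit `i, j` loop:

```python
import numpy as np

x = np.array([0.1, 0.4, 0.9])

# (3, 1) broadcast against (3,) gives a (3, 3) result,
# where pairwise[i, j] == |x[i] - x[j]|.
pairwise = np.abs(x[:, None] - x)
```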

Now, a fully vectorized solution would need a lot more memory, specifically an `(N, N, M)` intermediate array, where `(N, M)` is the shape of the input data.
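For reference, that memory-heavy fully vectorized route would look something like this (a sketch; the function name is mine). The `(N, N, M)` intermediates come from broadcasting `sample` against itself along two new axes:

```python
import numpy as np

def disc2_full_vec(sample):
    # sample has shape (N, M); each term below broadcasts to (N, N, M).
    a = 0.5 * np.abs(sample[:, None, :] - 0.5)          # (N, 1, M)
    b = 0.5 * np.abs(sample[None, :, :] - 0.5)          # (1, N, M)
    c = 0.5 * np.abs(sample[:, None, :] - sample[None, :, :])  # (N, N, M)
    # Product over the last axis collapses to (N, N); sum gives disc2.
    return np.prod(1 + a + b - c, axis=-1).sum()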

Another noticeable aspect is that each iteration doesn't do much work: we operate on one row at a time, and each row holds only `2` elements for the given sample. So the idea is to instead run a loop along `M`, computing and accumulating the `prod` at each iteration. For the given sample, that is just two loop iterations.

Getting out of the loop, we have the accumulated `prod`, which just needs a summation to give `disc2` as the final output.

Here's an implementation of the ideas above:

```
prod_arr = 1
for i in range(sample.shape[1]):
    si = sample[:, i]
    prod_arr *= (1 + 0.5 * np.abs(si[:, None] - 0.5)
                   + 0.5 * np.abs(si - 0.5)
                   - 0.5 * np.abs(si[:, None] - si))
disc2 = prod_arr.sum()
```

**Runtime test**

Below are the stripped-down loopy portion of the original approach and the modified version:

```
def org_app(sample):
    disc2 = 0
    n_sample = len(sample)
    for i, j in itertools.product(range(n_sample), range(n_sample)):
        disc2 += np.prod(1 + 0.5 * np.abs(sample[i] - 0.5)
                         + 0.5 * np.abs(sample[j] - 0.5)
                         - 0.5 * np.abs(sample[i] - sample[j]))
    return disc2


def mod_app(sample):
    prod_arr = 1
    for i in range(sample.shape[1]):
        si = sample[:, i]
        prod_arr *= (1 + 0.5 * np.abs(si[:, None] - 0.5)
                       + 0.5 * np.abs(si - 0.5)
                       - 0.5 * np.abs(si[:, None] - si))
    return prod_arr.sum()
```

Timings and verification:

```
In [10]: sample = np.random.random_sample((100, 2))

In [11]: org_app(sample)
Out[11]: 11934.878683659041

In [12]: mod_app(sample)
Out[12]: 11934.878683659068

In [14]: %timeit org_app(sample)
10 loops, best of 3: 84.4 ms per loop

In [15]: %timeit mod_app(sample)
10000 loops, best of 3: 94.6 µs per loop
```

That's roughly a **900x** speedup! This should be motivating enough to look to vectorize things whenever possible.
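Putting this together with the already-vectorized `disc1` part from the question, the whole computation can be sketched as one function (the name `c2_discrepancy` is mine) that returns `c2` instead of discarding it:

```python
import numpy as np

def c2_discrepancy(sample):
    n, dim = sample.shape

    # disc1: vectorized over all rows at once, as in the question.
    abs_half = np.abs(sample - 0.5)
    disc1 = np.sum(np.prod(1 + 0.5 * abs_half - 0.5 * abs_half ** 2, axis=1))

    # disc2: loop along the small dimension M, broadcasting along N.
    prod_arr = 1
    for i in range(dim):
        si = sample[:, i]
        prod_arr *= (1 + 0.5 * np.abs(si[:, None] - 0.5)
                       + 0.5 * np.abs(si - 0.5)
                       - 0.5 * np.abs(si[:, None] - si))
    disc2 = prod_arr.sum()

    return (13 / 12) ** dim - 2 / n * disc1 + disc2 / n ** 2
```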

Answered By – Divakar

Answer Checked By – Katrina (BugsFixing Volunteer)