# [SOLVED] numpy.where on 2D array is slower than list comprehension of numpy.where on 1D array

## Issue

1. Why is Numpy slower than list comprehensions in this case?

2. What is the best way to vectorize this grid-construction?

``````In : import numpy as np

In : mesh = np.linspace(-1, 1, 3000)

In : rowwise, colwise = np.meshgrid(mesh, mesh)

In : f = lambda x, y: np.where(x > y, x**2, x**3)

# Using 2D arrays:
In : %timeit f(colwise, rowwise)
285 ms ± 2.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Using 1D array and list-comprehension:
In : %timeit np.array([f(x, mesh) for x in mesh])
58 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Equivalent result
In : np.allclose(f(colwise, rowwise), np.array([f(x, mesh) for x in mesh]))
True
``````

## Solution

Why is Numpy slower than list comprehensions in this case?

you are essentially suffering from 2 problems.

the first one is the cache utilization,
the second version uses a smaller subset of the space (3000,1) (1,3000) for the calculation which can fit nicely in you cache, so `x>y, x**2 , x**3` can all fit inside your cache which somewhat speeds things up,
the first version is calculating each of those 3 for a 3000×3000 array (9 million entries) which can never sit inside your cache (usually ~ 2-5 MB), then np.where is called that has to get parts of the data from your RAM (and not cache) in order to do its memory copying, which is then returned piece by piece to your RAM, which is very expensive.

also numpy implementation of np.where is somewhat alignment unaware and is accessing your arrays column-wise, not row-wise, so it’s essentially grabbing each and every entry from your RAM, and not utilizing the cache at all.

your list comprehension actually solves this issue as it is only dealing with a small subset of data at a given time, and therefore all the data can sit in you cache, but it is still using np.where, it’s only forcing it to use a row-wise access and therefore utilize your cache.

the second problem is the calculation of `x**2` and `x**3` which is a floating point exponentiation, which is very expensive, consider replacing it with `x*x` and `x*x*x`

What is the best way to vectorize this grid-construction?

apparently you have written it in your second method.

an even faster but unnecessary optimization by utilization of cache is to write your own code in C and call it from within python so you don’t have to evaluate either `x*x or x*x*x` unless you need to, and won’t have to store `x>y,x*x,x*x*x` but the speedup won’t be worth the trouble.