[SOLVED] Increase performance of np.where() loop

Issue

I am trying to speed up the code in the following script (ideally >4×) without multiprocessing. In a later step I will add multiprocessing, but the current speed is too slow even if I split the work across 40 cores. Therefore I'm trying to optimize the code first.

import numpy as np

def loop(x, y, q, z):
    matchlist = []
    for ind in range(len(x)):
        matchlist.append(find_match(x[ind], y[ind], q, z))
    return matchlist

def find_match(x, y, q, z):
    A = np.where(q == x)
    B = np.where(z == y)
    return np.intersect1d(A, B)


# N will finally scale up to 10^9
N = 1000
M = 300

X = np.random.randint(M, size=N)
Y = np.random.randint(M, size=N)

# Q and Z size is fixed at 120000
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)

# convert int32 arrays to str64 arrays, to represent original data (which are strings and not numbers)
X = np.char.mod('%d', X)
Y = np.char.mod('%d', Y)
Q = np.char.mod('%d', Q)
Z = np.char.mod('%d', Z)

matchlist = loop(X,Y,Q,Z)

I have two arrays (X and Y) of identical length. Each element of these arrays corresponds to one DNA sequencing read (basically a string over the letters 'A', 'C', 'G', 'T'; the details are not relevant for the example code here).

I also have two 'reference arrays' (Q and Z) of identical length, and I want to find the occurrences (with np.where()) of every element of X within Q, as well as of every element of Y within Z (basically the find_match() function). Afterwards I want to know whether the indexes found for X overlap/intersect with those found for Y.
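To make the intended behaviour concrete, here is a toy run of the same logic on made-up data (the letters and arrays below are illustrative, not from my real dataset):

```python
import numpy as np

# Toy stand-ins for the reference arrays Q/Z and one (x, y) query pair
Q = np.array(['A', 'C', 'A', 'G'])
Z = np.array(['T', 'T', 'G', 'G'])

A = np.where(Q == 'A')[0]  # indexes where 'A' occurs in Q -> [0, 2]
B = np.where(Z == 'G')[0]  # indexes where 'G' occurs in Z -> [2, 3]

# The result is the indexes present in both sets: [2]
print(np.intersect1d(A, B))
```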

Example output (matchlist): some rows of X/Y have matching indexes in Q/Z, some don't (e.g. index 11):

[Screenshot: example output of matchlist]

The code works fine so far, but it would take far too long with my final dataset, where N = 10^9 (N = 1000 in this example to keep the tests quick). 1000 rows of X/Y take about 2.29 s on my laptop:

[Screenshot: timeit test of loop()]

Each find_match() call takes about 2.48 ms, which is roughly 1/1000 of the full loop:

[Screenshot: timeit test of find_match()]

A first approach would be to combine (x with y) as well as (q with z) so that np.where() only needs to run once, but I couldn't get that to work yet.
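For illustration, the combined version I have in mind would look roughly like this (the separator and the toy arrays are my own placeholders; it assumes the separator character never occurs in the data):

```python
import numpy as np

# Toy stand-ins for Q, Z, X, Y
Q = np.array(['1', '2', '3'])
Z = np.array(['a', 'b', 'a'])
X = np.array(['1', '3', '2'])
Y = np.array(['a', 'a', 'a'])

sep = '|'  # assumed never to appear in the data
QZ = np.char.add(np.char.add(Q, sep), Z)  # ['1|a', '2|b', '3|a']
XY = np.char.add(np.char.add(X, sep), Y)  # ['1|a', '3|a', '2|a']

# One np.where() per query pair instead of two plus an intersect
matchlist = [np.where(QZ == xy)[0] for xy in XY]
```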

I've tried looping and looking up within Pandas (.loc), but this was about 4× slower than np.where().

The question is closely related to a recent question from philshem (Combine several NumPy "where" statements to one to improve performance), however, the solutions provided on this question do not work for my approach here.

Solution

NumPy isn't too helpful here, since what you need is a lookup into a jagged array, with strings as the indexes.

# Map each (q, z) pair to the list of reference indexes where it occurs
lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
    lookup.setdefault((q, z), []).append(i)

# One O(1) dict lookup per (x, y) pair; [] where there is no match
matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]

If you don’t need the output as a jagged array, but are OK with just a boolean denoting presence, and can preprocess each string to a number, there is a much faster method.

# Assumes X, Y, Q, Z have been mapped to integers in range(300)
lookup = np.zeros((300, 300), dtype=bool)
lookup[Q, Z] = True       # mark every (q, z) pair present in the references
matchlist = lookup[X, Y]  # True where the (x, y) pair occurs in (Q, Z)

You typically won't want to use this method as a replacement for the jagged case above, since dense variants (e.g. Daniel F's solution) are memory-inefficient and NumPy does not support sparse arrays well. If more speed is needed, though, a sparse solution is certainly possible.
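As a sketch of the preprocessing step mentioned above: the string arrays can be mapped to small integers with np.unique, factorizing each query array together with its reference array so the codes agree, and then the dense boolean lookup applies unchanged. The toy arrays are illustrative, not the question's data:

```python
import numpy as np

# Toy stand-ins: reference pairs are ('7','b'), ('12','a'), ('99','a')
X = np.array(['12', '7', '12'])
Y = np.array(['a', 'b', 'b'])
Q = np.array(['7', '12', '99'])
Z = np.array(['b', 'a', 'a'])

# Factorize X with Q (and Y with Z) together so both share one code space
xq, xq_codes = np.unique(np.concatenate([X, Q]), return_inverse=True)
x_codes, q_codes = xq_codes[:len(X)], xq_codes[len(X):]
yz, yz_codes = np.unique(np.concatenate([Y, Z]), return_inverse=True)
y_codes, z_codes = yz_codes[:len(Y)], yz_codes[len(Y):]

# Dense boolean lookup over the integer codes
lookup = np.zeros((len(xq), len(yz)), dtype=bool)
lookup[q_codes, z_codes] = True
matchlist = lookup[x_codes, y_codes]  # one bool per (x, y) query pair
```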

Answered By – Veedrac

