[SOLVED] Faster numpy isin alternative for strings using numba

Issue

I’m trying to implement a faster version of the np.isin in numba, this is what I have so far:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def isin(a, b):
    out=np.empty(a.shape[0], dtype=nb.boolean)
    b = set(b)
    for i in nb.prange(a.shape[0]):
        if a[i] in b:
            out[i]=True
        else:
            out[i]=False
    return out

For numbers it works, as seen in this example:

a = np.array([1,2,3,4])
b = np.array([2,4])

isin(a,b)
>>> array([False,  True, False,  True])

And it’s faster than np.isin:

a = np.random.rand(20000)
b = np.random.rand(5000)

%time isin(a,b)
CPU times: user 3.96 ms, sys: 0 ns, total: 3.96 ms
Wall time: 1.05 ms

%time np.isin(a,b)
CPU times: user 11 ms, sys: 0 ns, total: 11 ms
Wall time: 8.48 ms

However, I would like to use this function with arrays containing strings. The problem is that whenever I try to pass an array of strings, numba complains that it cannot interpret the set() operation with these data.

a = np.array(['A','B','C','D'])
b = np.array(['B','D'])

isin(a,b)
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<class 'set'>) found for signature:
 
 >>> set(array([unichr x 1], 1d, C))
 
There are 2 candidate implementations:
  - Of which 2 did not match due to:
  Overload of function 'set': File: numba/core/typing/setdecl.py: Line 20.
    With argument(s): '(array([unichr x 1], 1d, C))':
   No match.

During: resolving callee type: Function(<class 'set'>)
During: typing of call at /tmp/ipykernel_20582/4221597969.py (7)


File "../../../../tmp/ipykernel_20582/4221597969.py", line 7:
<source missing, REPL/exec in use?>

Is there a way, like specifying a signature, that will allow me to use this directly on arrays of strings?

I know I could assign a numerical value to each string, but for large arrays I think this will take a while and will make the whole thing slower than just using np.isin.

Any ideas?

Solution

Strings are barely supported by Numba (like bytes although the support is slightly better). Set and dictionary are supported with some strict restriction and are quite experimental/new. Sets of strings are not supported yet regarding the documentation:

Sets must be strictly homogeneous: Numba will reject any set containing objects of different types, even if the types are compatible (for example, {1, 2.5} is rejected as it contains a int and a float). The use of reference counted types, e.g. strings, in sets is unsupported.

You can try to cheat using a binary search. Unfortunately, np.searchsorted is not yet implemented for string-typed Numpy array (although np.unique is). I think you can implement a binary search yourself but this is cumbersome to do and in the end. I am not sure this will be faster in the end, but I think it should because of the O(Ns Na log Nb)) running time complexity (with Ns the average size of string length of b unique items, Na the number of items in a and Nb the number of unique items in b). Indeed, the running time complexity of np.isin is O(Ns (Na+Nb) log (Na+Nb)) if arrays are about similar sizes and O(Ns Na Nb) if Nb is much smaller than Na. Note that the best theoretical running time complexity is AFAIK O(Ns (Na + Nb)) thanks to a hash table with a good hash function (Tries can also achieve this but they should be practically slower unless the hash function is not great).

Note that typed dictionaries support statically declared string but not dynamic ones (and this is an experimental feature for static ones).

Another cheat (that should work) is to store string hashes as the key of a typed dictionary and associate each hash to an array of indices referencing the string locations in b having the hash of the associated key. The parallel loop needs to hash a items and fetch the location of strings item in b having this hash so you can then compare the strings. A faster implementation would be to assume that the hash function for b strings is perfect and that there is no collision so you can directly use a TypedDict[np.int64, np.int64] hash table. You can test this hypothesis at runtime when you build b. Writing such a code is a bit tedious. Note that this implementation may not be faster than Numpy in the end in sequential because Numba TypedDicts are currently pretty slow… However, this should be faster in parallel on processors with enough cores.

Answered By – Jérôme Richard

Answer Checked By – Terry (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *