# [SOLVED] Length of the intersections between a list and a list of lists

## Issue

Note: almost a duplicate of *Numpy vectorization: Find intersection between list and list of lists*.

Differences :

• I am focused on efficiency when the lists are large
• I’m searching for the largest intersections.
``````
x = [500 numbers between 1 and N]
y = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12], etc. up to N]
``````

Here are some assumptions:

• `y` is a list of ~500,000 sublists of ~500 elements each
• each sublist in `y` is a range, so `y` is characterized by the last element of each sublist. In the example: 3, 7, 9, 12, …
• `x` is not sorted
• `y` contains each number between 1 and ~500,000 × 500 exactly once
• `y` is sorted in the sense that, as in the example, each sublist is sorted and the first element of one sublist immediately follows the last element of the previous one
• `y` is known long in advance, effectively at compile time

My purpose is to know, among the sublists of `y`, which have at least 10 intersections with `x`.

I can obviously write a loop:

``````
def find_best(x, y):
    x_set = set(x)  # build the set once instead of on every iteration
    result = []

    for index, sublist in enumerate(y):
        intersection = x_set.intersection(sublist)
        if len(intersection) >= 2:  # in real life: >= 10
            result.append(index)

    return result

x = [1, 2, 3, 4, 5, 6]
y = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]

res = find_best(x, y)
print(res)   # [0, 2]
``````

Here the result is `[0, 2]` because the first and third sublists of `y` each share at least 2 elements with `x`.

Another method is to traverse `y` only once and count the intersections:

``````
def find_intersec2(x, y):
    n_sublists = len(y)
    res = {num: 0 for num in range(n_sublists)}
    for list_no, sublist in enumerate(y):
        for num in sublist:
            if num in x:
                x.remove(num)
                res[list_no] += 1
    return [n for n in range(n_sublists) if res[n] >= 2]
``````

This second method makes more use of the assumptions.
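One weak point of the second method is the `num in x` test, which is O(len(x)) on a list; a set makes each lookup O(1). Here is a hypothetical variant along those lines (`find_intersec2_fast` is an illustrative name, not from the original post):

```python
def find_intersec2_fast(x, y):
    # Same single-pass idea, but membership tests hit a set in O(1)
    # instead of scanning a list.
    remaining = set(x)
    counts = [0] * len(y)
    for list_no, sublist in enumerate(y):
        for num in sublist:
            if num in remaining:
                remaining.discard(num)  # each number occurs once in y
                counts[list_no] += 1
    return [i for i, c in enumerate(counts) if c >= 2]  # in real life: >= 10

x = [1, 2, 3, 4, 5, 6]
y = [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
print(find_intersec2_fast(x, y))  # [0, 2]
```

This still scans all of `y`, so it remains O(total size of `y`) per query.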

Questions :

• What optimizations are possible?
• Is there a completely different approach? Indexing, a k-d tree? In my use case, the large list `y` is known days before the actual run, so I'm not afraid of building an index or whatever from `y`. The small list `x` is only known at runtime.

## Solution

Since `y` contains disjoint ranges and the union of them is also a range, a very fast solution is to first perform a binary search on `y` and then count the resulting indices and only return the ones that appear at least 10 times. The complexity of this algorithm is `O(Nx log Ny)` with `Nx` and `Ny` the number of items in respectively `x` and `y`. This algorithm is nearly optimal (since `x` needs to be read entirely).
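The idea can be sketched with just the standard library before going to Numpy. Here `starts` is an assumed representation of `y` holding the first value of each range, so a binary search locates the range containing any value:

```python
import bisect
from collections import Counter

# Assumed boundary representation: starts[i] is the first value of range i.
# Since the ranges are contiguous, a value v belongs to range
# bisect_right(starts, v) - 1.
starts = [1, 4, 5, 7, 8]          # ranges [1,2,3], [4], [5,6], [7], [8,...]
x = [1, 2, 3, 4, 5, 6]

counts = Counter(bisect.bisect_right(starts, v) - 1 for v in x)
result = sorted(i for i, c in counts.items() if c >= 2)  # real case: >= 10
print(result)  # [0, 2]
```

Each lookup costs `O(log Ny)`, giving the `O(Nx log Ny)` total; the Numpy version below vectorizes exactly this.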

## Actual implementation

First of all, you need to transform your current `y` into a Numpy array containing the beginning value of all ranges (in increasing order) with `N` as the last value (assuming `N` is excluded from the ranges of `y`, or `N+1` otherwise). This part can be considered free since `y` can be computed at compile time in your case. Here is an example:

``````
import numpy as np
y = np.array([1, 4, 8, 10, 13, ..., N])
``````
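If it helps, the boundary array can be derived mechanically from the original nested list during preprocessing. This is a sketch under the stated assumptions (`nested_y` and the way `N` is obtained here are illustrative, not from the original answer):

```python
import numpy as np

# Hypothetical preprocessing, done once long before runtime.
nested_y = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12]]
N = nested_y[-1][-1] + 1  # first value after the last range (assumption)

# Since the ranges are contiguous, the first element of each sublist
# is enough, plus the exclusive upper bound N at the end.
y = np.array([sub[0] for sub in nested_y] + [N])
print(y.tolist())  # [1, 4, 8, 10, 13]
```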

Then, you need to perform the binary search and check that the values fit in the range of `y`:

``````
indices = np.searchsorted(y, x, 'right')

# Given the stated input, the `0 < indices < len(y)` check should not be
# needed; in that case a simple `indices -= 1` is enough.
indices = indices[(0 < indices) & (indices < len(y))] - 1
``````

Then you need to count the indices and keep only the ones that appear at least 10 times:

``````
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 10]
``````
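As a side note (not part of the original answer): since the indices are small non-negative integers, `np.bincount` is an alternative to `np.unique` for the counting step and may be faster:

```python
import numpy as np

indices = np.array([0, 0, 0, 1, 2, 2])       # example output of the search step
counts = np.bincount(indices, minlength=5)   # counts[i] = hits in sublist i
result = np.flatnonzero(counts >= 2)         # real case: >= 10
print(result.tolist())  # [0, 2]
```

`bincount` avoids the sort that `unique` performs, at the cost of allocating one counter per sublist.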

Here is an example based on yours:

``````
x = np.array([1, 2, 3, 4, 5, 6])

# [[1, 2, 3], [4], [5, 6], [7], [8, 9, 10, 11]]
y = np.array([1, 4, 5, 7, 8, 12])

# Actual simplified version of the above algorithm
indices = np.searchsorted(y, x, 'right') - 1
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 2]

# [0, 2]
print(result.tolist())
``````

It runs in less than 0.1 ms on my machine on a random input based on your input constraints.
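To sanity-check such timings against the sizes stated in the question, a throwaway harness along these lines can be used (the synthetic data layout here is an assumption: contiguous ranges of equal length):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n_sublists, sublist_len = 500_000, 500   # sizes from the question
N = n_sublists * sublist_len

# Boundary array: sublist i covers [i*sublist_len + 1, (i+1)*sublist_len]
y = np.arange(1, N + 2, sublist_len)
x = rng.integers(1, N + 1, size=500)

t0 = time.perf_counter()
indices = np.searchsorted(y, x, 'right') - 1
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 10]
elapsed = time.perf_counter() - t0
print(f"{elapsed * 1e3:.3f} ms")
```

Only the query part is timed; building `y` is the preprocessing that the question says can happen days in advance.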