# [SOLVED] Numpy: Filter segments in array based on how much overlap there is with a second array with different segmentation type

## Issue

I am looking for the most computationally efficient way to filter out arrays from one array, based on segment overlap in a second array, and that array has a different segmentation type.

This is my first array

``````iob = np.array(
[0, 1, 2, 2, 2, 2, 2, 2, 2, 0, 0, 1, 2, 2, 2, 2, 0, 0, 1, 2, 1, 2, 2, 0]
)
``````

The number 1 is the start of each segment, the number 2 is the rest of the segment, and the number 0 indicates no segment. So for this array, the segments are `1, 2, 2, 2, 2, 2, 2, 2`, `1, 2, 2, 2, 2`, `1, 2`, and `1, 2, 2`.

This is my second array

``````output = np.array(
[0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
)
``````

The segments are only defined by the number 1, and 0 indicates no segment. So the segments for this array are `1, 1, 1, 1`, `1, 1`, and `1, 1, 1`.

In the first array, I want filter out segments whose contents to not overlap by at least 50% by a segment in the 2nd array. In other words, if at least half of the contents in a segment in the first array overlaps with a segment in the 2nd array, I want to keep that segment.

So this is the desired result

``````array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0])
``````

I am looking for the most computationally efficient method to calculate this solution.

Current solution

I can get the segment indices using the technique described here https://stackoverflow.com/a/71297663/3259896

``````output_zero = np.concatenate((output, ))
zero_output = np.concatenate((, output))
iob_zero = np.concatenate((iob, ))

iob_starts = np.where(iob_zero == 1)
iob_ends = np.where((iob_zero[:-1] != 0) & (iob_zero[1:] != 2))
iob_pairs = np.column_stack((iob_starts, iob_ends))

output_starts = np.where((zero_output[:-1] == 0) & (zero_output[1:] == 1))
output_ends = np.where((output_zero[:-1] == 1) & (output_zero[1:] == 0))
output_pairs = np.column_stack((output_starts, output_ends))
``````

Next, I directly compare all possible combination of segments to see which ones have at least a 50% overlap, and only keep those segments.

``````valid_pairs = []
for o_p in output_pairs:
for i_p in iob_pairs:
overlap = 1 + min(o_p, i_p) - max(o_p, i_p)
if overlap > np.diff(i_p)/2:
valid_pairs.append(i_p)
valid_pairs = np.array(valid_pairs)
``````

Finally, I used the filtered indices to create the desired array

``````final = np.zeros_like(output)
for block in valid_pairs:
final[block:block+1] = 1
final
``````
``````array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0])
``````

I suspect that this may not be the most computationally efficient solution. It uses quite a few lines of code, and also used a nested for loop to do all possible comparisons, and uses another loop to create the desired array. I don’t have a mastery on use of all of numpy’s functions, and I am wondering if there is a more computationally efficient way to calculate this.

## Solution

Here’s a 4-liner:

``````output = np.concatenate((output, ))
iob = np.concatenate((iob, ))
idx = np.dstack(np.where(iob == 1) + tuple(np.array(np.where(np.diff(iob) < 0)) + 1)).flatten()
desired_array = np.concatenate([np.full(len(x), 1 if ((np.sum(x==2) >= np.sum(x==1)) and x.sum()>0) else 0) for x in np.split(np.where(iob>0, 1, 0) + output, idx)])[0:-1]
``````

Output:

``````>>> desired_array
array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0])
``````

And here is a more verbose version

``````
output_0 = np.concatenate((output, ))
iob_0 = np.concatenate((iob, ))
overlap = np.where(iob_0>0, 1, 0) + output_0

iob_segment_starts = np.where(iob_0 == 1)
iob_segment_ends = tuple(np.array(np.where(np.diff(iob_0) < 0)) + 1)
iob_segment_indxs = np.dstack(iob_segment_starts + iob_segment_ends).flatten()
overlap_segments = np.split(overlap, iob_segment_indxs)
filtered_segment_params = (
(
len(x),
np.sum(x == 2) >= np.sum(x == 1) and x.sum() > 0,
)
for x in overlap_segments
)
filtered_segment_pieces = [
np.full(length, value, dtype=int)
for length, value in filtered_segment_params
]
filtered_array = np.concatenate(filtered_segment_pieces)[:-1]
filtered_array
``````

Pytorch Version

``````import torch
output = torch.tensor(
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
)
iob = torch.tensor(
[0, 1, 2, 2, 2, 2, 2, 2, 2, 0, 0, 1, 2, 2, 2, 2, 0, 0, 1, 2, 1, 2, 2, 1]
)
output_0 = torch.cat((output, torch.tensor()))
iob_0 = torch.cat((iob, torch.tensor()))
overlap = torch.where(iob_0>0, 1, 0) + output_0

iob_segment_starts = torch.where(iob_0 == 1)
iob_segment_ends = torch.where(torch.diff(iob_0) < 0) + 1
iob_segment_cuts = torch.cat([
iob_segment_starts,
iob_segment_ends,
torch.tensor([0, overlap.shape])
]).sort()
iob_segment_sizes = torch.diff(iob_segment_cuts)
overlap_segments = torch.split(overlap, iob_segment_sizes.tolist(), dim=0)

filtered_segment_params = (
(
len(x),
torch.sum(x == 2) >= torch.sum(x == 1) and x.sum() > 0,
)
for x in overlap_segments
)
filtered_segment_pieces = [
torch.full((1, length), value, dtype=int).flatten(0)
for length, value in filtered_segment_params
]
filtered_array = torch.cat(filtered_segment_pieces)[:-1]
filtered_array
``````