[SOLVED] Improve performance of Cartesian product without duplicates and repeated elements

Issue

Input:

import numpy as np
import itertools
a = np.array([2,  3,  5, 13, 14, 16, 19, 20, 22, 29, 30, 34, 41, 43, 45, 49, 50,
       54, 60, 63, 66, 73, 75, 79, 82, 93]).astype(np.uint8)
b = np.array([ 9, 10, 11, 13, 14, 16, 17, 20, 22, 28, 30, 35, 36, 41, 45, 47, 49,
       54, 58, 63, 64, 66, 68, 73, 74, 76, 78, 83, 84, 86, 87, 92, 94, 95]).astype(np.uint8)
c = np.array([ 8, 11, 15, 17, 28, 32, 36, 38, 39, 42, 47, 52, 53, 58, 61, 62, 64,
       68, 69, 70, 71, 74, 78, 79, 80, 81, 83, 84, 85, 86, 89, 92]).astype(np.uint8)
d = np.array([ 6,  8, 12, 15, 21, 23, 25, 26, 31, 32, 37, 38, 44, 51, 52, 53, 56,
       57, 61, 62, 65, 67, 70, 71, 72, 81, 85, 89, 90]).astype(np.uint8)
e = np.array([ 0,  1,  4,  7, 18, 21, 23, 24, 25, 26, 27, 33, 40, 44, 46, 48, 51,
       55, 56, 57, 59, 65, 67, 77, 88, 90, 91]).astype(np.uint8)

Current Solution:

# Get product and ensure no element is repeated
main_combos = np.array([i for i in itertools.product(a, b, c, d, e) if len(set(i)) == 5])

# Remove duplicates (should not return both [13, 14, 8, 6, 0], [14, 13, 8, 6, 0])
u = np.sort(main_combos, axis=1)
_, idx = np.unique(u, axis=0, return_index=True)
main_combos = main_combos[idx]

This currently takes ~41 seconds on my machine to get all 14,133,909 rows, I was wondering if the performance of this can be improved.

The order of each individual row needs to stay the same for example the first element of each array [2, 9, 8, 6, 0] should be in this same order in the final output.

Solution

Your solution takes 58 seconds on my machine.

If you dont require the use of Numpy, the following runs in 23 seconds on my machine, using pure Python

main_combos = {tuple(set(i)): i for i in itertools.product(a, b, c, d, e)}
main_combos = [val for key, val in main_combos.items() if len(key)==5]

Answered By – Atnas

Answer Checked By – David Goodson (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *