Suppose I have an array
a = np.array([1, 2, 1, 3, 3, 3, 0])
How can I (efficiently, Pythonically) find which elements of a are duplicates (i.e., non-unique values)? In this case the result would be array([1, 3, 3]), or possibly array([1, 3]) if efficient.
I’ve come up with a few methods that appear to work:
m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]

a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]
This one is cute but probably illegal (as a isn’t actually unique):
np.setxor1d(a, np.unique(a), assume_unique=True)
u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]

s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]

s = pd.Series(a)
s[s.duplicated()]
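For reference, here is a minimal sketch that wraps the sorting approach and a counting approach as functions, so the two result shapes mentioned above can be compared directly. The function names are made up for illustration, and the second one assumes a NumPy version where np.unique accepts return_counts (1.9 or later).

```python
import numpy as np

def duplicates_by_sort(a):
    """Return every repeated occurrence beyond the first, in sorted order."""
    s = np.sort(a, axis=None)
    # An element is a repeat if it equals its successor in the sorted array.
    return s[:-1][s[1:] == s[:-1]]

def duplicate_values(a):
    """Return each duplicated value exactly once."""
    values, counts = np.unique(a, return_counts=True)
    return values[counts > 1]

a = np.array([1, 2, 1, 3, 3, 3, 0])
print(duplicates_by_sort(a))  # [1 3 3]
print(duplicate_values(a))    # [1 3]
```

The first form keeps one entry per extra occurrence; the second collapses them to the distinct duplicated values.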
Is there anything I’ve missed? I’m not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million in size).
Testing with a 10 million size data set (on a 2.8GHz Xeon):
a = np.random.randint(10**7, size=10**7)
The fastest is sorting, at 1.1s. The dubious xor1d is second at 2.6s, followed by masking and Pandas Series.duplicated at 3.1s, bincount at 5.6s, and in1d and senderle’s setdiff1d both at 7.3s. Steven’s Counter is only a little slower, at 10.5s; trailing behind are Burhan’s Counter.most_common at 110s and DSM’s Counter subtraction at 360s.
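A rough sketch of how a comparison like this could be reproduced with timeit follows. The helper names (by_sort, by_bincount, time_methods) are hypothetical, only two of the methods are included, and the array here is much smaller than the 10 million elements used above, so the absolute numbers will not match.

```python
import timeit
import numpy as np

def by_sort(a):
    # Sorting approach: compare each sorted element with its successor.
    s = np.sort(a, axis=None)
    return s[:-1][s[1:] == s[:-1]]

def by_bincount(a):
    # Counting approach: count how often each unique value occurs.
    u, i = np.unique(a, return_inverse=True)
    return u[np.bincount(i) > 1]

def time_methods(n, repeat=3):
    """Return the best wall-clock time in seconds for each method."""
    rng = np.random.default_rng(0)
    a = rng.integers(n, size=n)  # n random integers in [0, n)
    return {
        f.__name__: min(timeit.repeat(lambda: f(a), number=1, repeat=repeat))
        for f in (by_sort, by_bincount)
    }

print(time_methods(10**5))
```

Taking the minimum over a few repeats reduces noise from other processes; raising n toward 10**7 gives figures closer in scale to the ones quoted.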
I’m going to use sorting for performance, but I’m accepting Steven’s answer because the performance is acceptable and it feels clearer and more Pythonic.
Edit: discovered the Pandas solution. If Pandas is available it’s clear and performs well.
I think this is most clearly done outside of numpy. You’ll have to time it against your numpy solutions if you are concerned with speed.
>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).items() if count > 1]
[1, 3]
Note: This is similar to Burhan Khalid’s answer, but the use of items without subscripting in the condition should be faster.
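To illustrate the distinction in the note, here is a small sketch contrasting the two styles. The subscripted version is only an illustration of the pattern, not a quote of the other answer, and a.tolist() is an added assumption so that the results are plain Python ints.

```python
from collections import Counter
import numpy as np

a = np.array([1, 2, 1, 3, 3, 3, 0])
counts = Counter(a.tolist())  # tolist() yields plain Python ints

# Unpack each (item, count) pair in the loop header.
unpacked = [item for item, count in counts.items() if count > 1]

# Subscript each pair inside the expression and the condition instead.
subscripted = [pair[0] for pair in counts.items() if pair[1] > 1]

print(unpacked)     # [1, 3]
print(subscripted)  # [1, 3]
```

Both forms give the same result; the unpacked one avoids an index lookup per element and reads more directly.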
Answered By – Steven Rumbalski