# [SOLVED] Determining duplicate values in an array

## Issue

Suppose I have an array:

``````
a = np.array([1, 2, 1, 3, 3, 3, 0])
``````

How can I (efficiently, Pythonically) find which elements of `a` are duplicates (i.e., non-unique values)? In this case the result would be `array([1, 3, 3])`, or possibly `array([1, 3])` if that is more efficient to compute.

I’ve come up with a few methods that appear to work:

### Masking

``````
m = np.zeros_like(a, dtype=bool)
# return_index=True returns (values, first_indices); [1] selects the indices
m[np.unique(a, return_index=True)[1]] = True
a[~m]
``````

### Set operations

``````
# [1] selects the first-occurrence indices from the (values, indices) tuple
a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]
``````

This one is cute but probably illegal (as `a` isn’t actually unique):

``````
np.setxor1d(a, np.unique(a), assume_unique=True)
``````

### Histograms

``````
u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]
``````
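To make the histogram idea concrete, here is a small sketch of the intermediate values on the example array. The `return_counts=True` shortcut at the end is an assumption that a reasonably recent NumPy is installed; it is not one of the methods timed in the conclusions.

```python
import numpy as np

a = np.array([1, 2, 1, 3, 3, 3, 0])

# u holds the sorted distinct values; i maps each element of a to its slot in u
u, i = np.unique(a, return_inverse=True)

# counts[k] is how many times u[k] occurs in a
counts = np.bincount(i)

# Keep only the values that occur more than once
dups = u[counts > 1]

# Assumed shortcut: let np.unique count the occurrences directly
u2, c2 = np.unique(a, return_counts=True)
dups2 = u2[c2 > 1]
```

Both variants return each duplicated value once, not once per repeat.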

### Sorting

``````
s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]
``````
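As a rough sketch, the sorting approach can be wrapped in a small helper that returns either form of the result. The name `sorted_duplicates` and the `distinct` flag are made up for illustration and are not part of NumPy.

```python
import numpy as np

def sorted_duplicates(a, distinct=False):
    """Return the repeated values of `a` using a sort-and-compare pass.

    Hypothetical helper: the name and the `distinct` flag are illustrative.
    """
    s = np.sort(a, axis=None)          # flattens and sorts a copy
    repeats = s[:-1][s[1:] == s[:-1]]  # one entry per extra occurrence
    if distinct:
        # repeats is already sorted, so collapse runs of equal neighbours
        keep = np.ones(len(repeats), dtype=bool)
        keep[1:] = repeats[1:] != repeats[:-1]
        return repeats[keep]
    return repeats

a = np.array([1, 2, 1, 3, 3, 3, 0])
all_repeats = sorted_duplicates(a)                      # array([1, 3, 3])
distinct_repeats = sorted_duplicates(a, distinct=True)  # array([1, 3])
```

The second pass reuses the same neighbour comparison, so it avoids another sort.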

### Pandas

``````
import pandas as pd
s = pd.Series(a)
s[s.duplicated()]
``````

Is there anything I’ve missed? I’m not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million elements).

## Conclusions

Testing with a data set of 10 million elements (on a 2.8 GHz Xeon):

``````
a = np.random.randint(10**7, size=10**7)
``````

The fastest is sorting, at 1.1s. The dubious `setxor1d` is second at 2.6s, followed by masking and Pandas `Series.duplicated` at 3.1s, `bincount` at 5.6s, and `in1d` and senderle’s `setdiff1d` both at 7.3s. Steven’s `Counter` is only a little slower, at 10.5s; trailing far behind are Burhan’s `Counter.most_common` at 110s and DSM’s `Counter` subtraction at 360s.
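A minimal timing sketch for reproducing this kind of comparison might look like the following. The function names, the `time_it` helper, and the reduced array size are hypothetical choices for illustration; absolute numbers will vary with the machine and the NumPy version.

```python
import timeit
import numpy as np

def by_sorting(a):
    s = np.sort(a, axis=None)
    return s[:-1][s[1:] == s[:-1]]

def by_masking(a):
    m = np.zeros_like(a, dtype=bool)
    m[np.unique(a, return_index=True)[1]] = True
    return a[~m]

def by_bincount(a):
    u, i = np.unique(a, return_inverse=True)
    return u[np.bincount(i) > 1]

def time_it(func, a, repeat=3):
    """Best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(lambda: func(a), number=1, repeat=repeat))

n = 10**5  # assumed smaller size; raise toward 10**7 to approximate the test above
a = np.random.randint(n, size=n)
timings = {f.__name__: time_it(f, a) for f in (by_sorting, by_masking, by_bincount)}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.4f}s")
```

Taking the minimum over a few repeats reduces noise from other processes on the machine.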

I’m going to use sorting for performance, but I’m accepting Steven’s answer because the performance is acceptable and it feels clearer and more Pythonic.

Edit: I have since discovered the Pandas solution. If Pandas is available, it’s clear and performs well.

## Solution

I think this is clearest when done outside of `numpy`. You’ll have to time it against your `numpy` solutions if you are concerned with speed.

``````
>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).items() if count > 1]
[1, 3]
``````

Note: This is similar to Burhan Khalid’s answer, but the use of `items` without subscripting in the condition should be faster.
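One small variation that may be worth timing, offered as an assumption and not a measured result: converting the array with `tolist()` first means `Counter` works on built-in Python `int`s instead of NumPy scalars, and the result prints as ordinary integers.

```python
from collections import Counter

import numpy as np

a = np.array([1, 2, 1, 3, 3, 3, 0])

# tolist() yields built-in Python ints instead of NumPy scalar objects
counts = Counter(a.tolist())

# Unpack (item, count) pairs directly instead of subscripting the counter
duplicates = [item for item, count in counts.items() if count > 1]
print(duplicates)  # [1, 3]
```

The values come out in order of first appearance in `a`, since `Counter` preserves insertion order on Python 3.7 and later.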