Table of Contents

## Issue

Suppose I have an array

```
a = np.array([1, 2, 1, 3, 3, 3, 0])
```

How can I (efficiently, Pythonically) find which elements of `a`

are duplicates (i.e., non-unique values)? In this case the result would be `array([1, 3, 3])`

or possibly `array([1, 3])`

if efficient.

I’ve come up with a few methods that appear to work:

### Masking

```
m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]
```

### Set operations

```
a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]
```

This one is cute but probably illegal (as `a`

isn’t actually unique):

```
np.setxor1d(a, np.unique(a), assume_unique=True)
```

### Histograms

```
u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]
```

### Sorting

```
s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]
```

### Pandas

```
s = pd.Series(a)
s[s.duplicated()]
```

Is there anything I’ve missed? I’m not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million in size).

## Conclusions

Testing with a 10 million size data set (on a 2.8GHz Xeon):

```
a = np.random.randint(10**7, size=10**7)
```

The fastest is sorting, at 1.1s. The dubious `xor1d`

is second at 2.6s, followed by masking and Pandas `Series.duplicated`

at 3.1s, `bincount`

at 5.6s, and `in1d`

and senderle’s `setdiff1d`

both at 7.3s. Steven’s `Counter`

is only a little slower, at 10.5s; trailing behind are Burhan’s `Counter.most_common`

at 110s and DSM’s `Counter`

subtraction at 360s.

I’m going to use sorting for performance, but I’m accepting Steven’s answer because the performance is acceptable and it *feels* clearer and more Pythonic.

Edit: discovered the Pandas solution. If Pandas is available it’s clear and performs well.

## Solution

I think this is most clear done outside of `numpy`

. You’ll have to time it against your `numpy`

solutions if you are concerned with speed.

```
>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).items() if count > 1]
[1, 3]
```

*note:* This is similar to Burhan Khalid’s answer, but the use of `items`

without subscripting in the condition should be faster.

Answered By – Steven Rumbalski

Answer Checked By – Dawn Plyler (BugsFixing Volunteer)