[SOLVED] Pandas replace/dictionary slowness


Please help me understand why this “replace from dictionary” operation is slow in Python/Pandas:

# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)

Dictionary lookups should be O(1). Replacing a value in a column should be O(1). Isn’t this a vectorized operation? Even if it’s not vectorized, iterating 200 rows is only 200 iterations, so how can it be slow?

Here is an SSCCE demonstrating the issue:

import pandas as pd
import random

# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)

# The actual operation we care about
series.replace(dictionary, inplace=True)

Running that command takes more than 1 second on my machine, which is thousands of times longer than I would expect for fewer than 1,000 operations.


It looks like replace carries a lot of overhead when it is given a large dictionary. Explicitly telling the Series what to do via map yields the best performance:

series = series.map(lambda x: dictionary.get(x,x))
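
As a quick sanity check, here is a minimal sketch comparing the map-based approach with replace on a tiny Series. The names lookup, small, via_replace, and via_map are hypothetical and used only for illustration. It assumes integer keys and that no replacement value is itself a key in the dictionary:

```python
import pandas as pd

# Hypothetical toy data: 4 is deliberately missing from the lookup table.
lookup = {1: "one", 2: "two", 3: "three"}
small = pd.Series([1, 2, 4, 3, 1])

# replace() with a dict: values without a matching key are left untouched.
via_replace = small.replace(lookup)

# map() with a per-element dict lookup that falls back to the original value.
via_map = small.map(lambda x: lookup.get(x, x))

# Both approaches should produce the same values for this data.
print(via_replace.tolist() == via_map.tolist())
```

The fallback argument in lookup.get(x, x) is what keeps unmatched values intact, which is the behavior replace gives you by default.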

If you’re sure that every value in the Series is a key in your dictionary, you can get a very slight performance boost by skipping the lambda and passing the dictionary.get function directly. Beware: any value that is not a key in the dictionary comes back as a missing value (None/NaN) with this method:

series = series.map(dictionary.get)
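
To illustrate the caveat about missing keys, the following sketch uses a hypothetical lookup table that lacks one of the values in the Series. The names lookup, small, strict, lenient, and n_missing are made up for this example:

```python
import pandas as pd

lookup = {1: "one", 2: "two"}   # hypothetical lookup table
small = pd.Series([1, 2, 99])   # 99 has no entry in lookup

# Passing dict.get directly: a value with no key becomes a missing value.
strict = small.map(lookup.get)

# Lambda with a fallback: a value with no key keeps its original value.
lenient = small.map(lambda x: lookup.get(x, x))

# Count how many values were lost because their key was absent.
n_missing = int(strict.isna().sum())
print(n_missing)
print(lenient.tolist())
```

Checking strict.isna().sum() after the mapping is a cheap way to confirm that the "all keys are present" assumption actually holds before relying on the faster form.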

You can also supply just the dictionary itself, but in the timings below this is roughly 40 times slower than passing dictionary.get:

series = series.map(dictionary)


Some timing comparisons using your example data:

%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop

%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop

%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop

%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
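
The %timeit lines above only work in IPython or Jupyter. As a rough sketch, the same comparison can be reproduced in a plain script with the standard timeit module. The candidates and timings names are hypothetical, the dummy data is built like the question's example, and the absolute numbers will differ by machine and pandas version:

```python
import random
import timeit

import pandas as pd

# Dummy data built like the question's example.
dictionary = {x: "Some string " + str(x) for x in range(11270)}
series = pd.Series([random.randint(1, 11269) for _ in range(200)])

# Each candidate is a zero-argument callable so timeit can invoke it directly.
candidates = {
    "map(dictionary.get)": lambda: series.map(dictionary.get),
    "map(lambda x: get(x, x))": lambda: series.map(lambda x: dictionary.get(x, x)),
    "map(dictionary)": lambda: series.map(dictionary),
    "replace(dictionary)": lambda: series.replace(dictionary),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 single calls, in seconds; one call per repeat keeps the slow case short.
    best = min(timeit.repeat(func, number=1, repeat=3))
    timings[label] = best
    print(f"{label:<26} {best * 1e3:10.3f} ms per call")
```

None of the candidates modifies series in place, so every call runs against the same input.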

Answered By – root

Answer Checked By – David Marino (BugsFixing Volunteer)
