[SOLVED] Why is pandas read_csv converters performance much slower and non-vectorized?

Issue

Why is it that passing vectorized functions to the read_csv converters parameter, they get executed on each value individually, instead of on all values at once? This is much slower (~60x) than manually converting the corresponding columns afterwards.

Description

While reading large csv files (several million rows of mixed data), I used the converters parameter of pandas’ read_csv method to conveniently pass functions that convert strings to datetime objects etc.

However, using the converters parameter is very slow in comparison to manually converting the corresponding columns.

Example Code

For illustration, let’s convert strings to datetime objects using 3 different methods:

  1. converters parameter
  2. parse_dates/date_parser parameters
  3. manually after loading the csv

Note, the conversion from string to datetime is arbitrary here. This could be replaced with other functions (except for not having the specific parse_dates/date_parser parameters).

import pandas as pd # 0.19.2 with python 3.5

# create dummy data
rows = 100000
data = {"dates": pd.date_range("2017-02-27 20:44:23", periods=rows, freq="S")}

# save as temporary file for timeit
pd.DataFrame(data).to_csv("dummy")

# define converters
def convert_datetime(series):
    return pd.to_datetime(series, format="%Y-%m-%d %H:%M:%S")

Now, lets see the timeit (iPython) comparisons:

# METHOD 1
%%timeit
df = pd.read_csv("dummy", converters={"dates": convert_datetime})
# 1 loop, best of 3: 7.76 s per loop

# METHOD 2
%%timeit
df = pd.read_csv("dummy", parse_dates=["dates"], date_parser=convert_datetime)
# 10 loops, best of 3: 125 ms per loop   

# METHOD 3
%%timeit
df = pd.read_csv("dummy")
df["dates"] = convert_datetime(df["dates"])
# 10 loops, best of 3: 129 ms per loop

Result

The converters’ version is roughly 60 times slower than the other ones. To understand this a bit better, I wrapped the convert_datetime function into a little decorator class to count the number of calls:

class Counter:
   def __init__(self, func):
       self.func = func
       self.count = 0
   
   def __call__(self, *args, **kwargs):
       self.count += 1
       return self.func(*args, **kwargs)

@Counter
def convert_datetime(series):
    return pd.to_datetime(series, format="%Y-%m-%d %H:%M:%S")

It revealed that using the converters’ parameter calls the convert_datetime function for every single value whereas the other versions call the converter function only once. This explains the performance deficit.

Solution

From the docs for read_csv (emphasis mine),

converters : dict, default None

  • Dict of functions for converting values in certain columns. Keys can either be integers or column labels

The idea of the converters keyword parameter is to provide functions which act upon individual values, not entire columns. This can be seen by redefining the converter function

def convert_datetime(val):
    return datetime.datetime.strptime(val, "%Y-%m-%d %H:%M:%S")
%timeit pd.read_csv("dummy", converters={"dates": convert_datetime})
1 loop, best of 3: 2.81 s per loop

And as you can imagine, this amounts to a lot of Python function calls.


As for why the converters parameter doesn’t accept vectorized functions, my best guess would be that they offer somewhat less flexibility than the current implementation. With the idea being that you can parse necessary columns of dates, etc. which may require some parsing logic with vectorized parse_dates, date_parser already, and most further column operations could be done post-read with vectorized methods.

In other words, being able to parse elements which have to end up in a different type (like datetimes) is convenient to have vectorized methods for in read_csv. Beyond that, converters is just a convenience parameter which can flexibly act on individual values – as further vectorized methods can just be done post-read anyways.

Answered By – miradulo

Answer Checked By – Marie Seifert (BugsFixing Admin)

Leave a Reply

Your email address will not be published. Required fields are marked *