[SOLVED] Is there a faster way to concat huge data frames (40GB) using pandas

Issue

I have 3 huge data frames, about 40 GB in size, which I opened using chunks. I then want to concatenate them together. Here is what I tried:

    import os
    import pandas as pd

    path = 'path/to/myfiles'
    files = [os.path.join(path, i) for i in os.listdir(path) if i.endswith('tsv')]
    cols = ['col1', 'col2', 'col3']
    for file in files:
        # read each large TSV in chunks of 10 million rows
        chunks = pd.read_table(file, sep='\t', names=cols, chunksize=10000000)

However, when I try to concatenate all the files, it takes forever.
I would like some suggestions for concatenating all the data frames more quickly.
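
The concatenation step itself isn't shown in the post; a minimal sketch of what it presumably looks like, collecting every chunk and calling pd.concat once at the end, would be:

    import os
    import pandas as pd

    path = 'path/to/myfiles'
    files = [os.path.join(path, i) for i in os.listdir(path) if i.endswith('tsv')]
    cols = ['col1', 'col2', 'col3']

    # Gather every chunk of every file, then concatenate once at the end;
    # reading ~40 GB of text is the expensive part here, not pd.concat itself.
    pieces = []
    for file in files:
        for chunk in pd.read_table(file, sep='\t', names=cols, chunksize=10000000):
            pieces.append(chunk)

    df = pd.concat(pieces, ignore_index=True)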

Solution

.tsv and .csv are fairly slow formats to read/write. I’ve found parquet works best for most of the stuff I end up doing. It’s quite fast on reads and writes, and also allows you to read back a chunked folder of files as a single table easily. It does require string column names, however:

In [102]: df = pd.DataFrame(np.random.random((100000, 100)), columns=[str(i) for i in range(100)])

In [103]: %time df.to_parquet("out.parquet")
Wall time: 980 ms

In [104]: %time df.to_csv("out.csv")
Wall time: 14 s

In [105]: %time df = pd.read_parquet("out.parquet")
Wall time: 195 ms

In [106]: %time df = pd.read_csv("out.csv")
Wall time: 1.53 s

If you don’t have control over the format those chunked files are in, you’ll obviously need to pay the read cost of them at least once, but converting them could still save you some time in the long run if you do a lot of other read/writes.
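
For example, here is a rough sketch of that one-time conversion, using the path and column names from the question (the `parquet_dir` output folder is just illustrative): each TSV is read in chunks, each chunk is written out as its own parquet file, and the whole folder can then be read back as a single frame.

    import os
    import pandas as pd

    path = 'path/to/myfiles'
    parquet_dir = 'path/to/myfiles_parquet'   # illustrative output folder
    os.makedirs(parquet_dir, exist_ok=True)

    cols = ['col1', 'col2', 'col3']           # already strings, as parquet requires
    files = [os.path.join(path, i) for i in os.listdir(path) if i.endswith('tsv')]

    # One-time conversion: write each chunk of each TSV as its own parquet file.
    for file in files:
        base = os.path.splitext(os.path.basename(file))[0]
        chunks = pd.read_table(file, sep='\t', names=cols, chunksize=10000000)
        for i, chunk in enumerate(chunks):
            chunk.to_parquet(os.path.join(parquet_dir, f'{base}_{i:04d}.parquet'))

    # pandas (via pyarrow) treats the folder as one dataset, so this returns
    # all the files already concatenated into a single DataFrame.
    df = pd.read_parquet(parquet_dir)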

Answered By – Randy

