[SOLVED] How to parse a pandas column of JSON content efficiently?

Issue

Let’s say I have the following DataFrame, where the data column contains a nested JSON string that I want to parse into separate columns:

import pandas as pd

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0},  "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Bob"}',        
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Carol"}'        
    ]},
    index = ['Alice', 'Alice', 'Bob', 'Carol']
)


df

I’ve found the json_normalize function and am currently parsing the JSON in a list comprehension. The result is correct, but it is slow: 1000 rows take 1-2 seconds, and my real data has about a million rows:

import json
# pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize

parsed_df = pd.concat([json_normalize(json.loads(js)) for js in df['data']])

parsed_df['bank_account'] = df['bank_account'].values
parsed_df.index = parsed_df['uid']  # the JSON key is 'uid'; there is no 'user_id' column

parsed_df

Is there a faster way to parse this data into a nice-looking DataFrame?

Solution

I see a small (~20-25%) performance improvement from bypassing pandas.concat and building the DataFrame directly from each row's values.

Otherwise, rewriting / optimizing json_normalize doesn’t seem straightforward.

def original(df):
    parsed_df = pd.concat([json_normalize(json.loads(js)) for js in df['data']])

    parsed_df['bank_account'] = df['bank_account'].values
    parsed_df.index = parsed_df['uid']

    return parsed_df

def jp(df):

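    # Take the column names from the first parsed row instead of hard-coding them;
    # the column order json_normalize returns differs between pandas versions.
    # Assumes every row has the same keys in the same order.
    cols = json_normalize(json.loads(df['data'].iloc[0])).columns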

    parsed_df = pd.DataFrame([json_normalize(json.loads(js)).values[0] for js in df['data']],
                             columns=cols)

    parsed_df['bank_account'] = df['bank_account'].values
    parsed_df.index = parsed_df['uid']

    return parsed_df

df = pd.concat([df]*100, ignore_index=True)

%timeit original(df)  # 675 ms per loop
%timeit jp(df)        # 526 ms per loop
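
As a further variation, json_normalize also accepts a list of dicts, so the per-row call can be replaced by a single call over all decoded records. The sketch below assumes pandas 1.0 or later (for pd.json_normalize); the function name parse_in_one_call and the two-row sample are made up for illustration, and it has not been timed against the numbers above.

```python
import json

import pandas as pd


def parse_in_one_call(df):
    # Decode every JSON string first, then flatten all records in one call
    # instead of calling json_normalize once per row.
    records = [json.loads(js) for js in df['data']]
    parsed_df = pd.json_normalize(records)

    parsed_df['bank_account'] = df['bank_account'].values
    parsed_df.index = parsed_df['uid']
    return parsed_df


# Hypothetical two-row sample in the same shape as the question's data.
sample = pd.DataFrame({
    'bank_account': [101, 201],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0}, "fees": {"monthly": 13.5}}, "user_name": "Bob"}',
    ],
})

result = parse_in_one_call(sample)
```

On the question's df this would be called as parse_in_one_call(df). Rows that lack a key get NaN in the corresponding column rather than raising an error.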

Answered By – jpp

Answer Checked By – Willingham (BugsFixing Volunteer)
