[SOLVED] drastically improve speed to subset and summarize pandas dataframe

Issue

I have a dataframe which has history of purchases
(dataframe in the last)

I have to generate a summary of purchases made by them on first date of thier coming, 2nd date, 1 week, 1 month and so on like below:

purchaser firstDay secondDay firstWeek firstMonth 6months oneyear
0 anil 1 0 0 0 1
1 mukesh 1 0 1 7 0
2 ravi 8 0 0 4 1

what I have done is as below:
created a summary

summary=df.groupby('purchaser').agg('min').rename(columns={'date':'min'}).reset_index()
summary['oneday_date']=summary['min']+dt.timedelta(days=1)
summary['oneweek_date']=summary['min']+dt.timedelta(days=7)
summary['onemonth_date']=summary['min']+dt.timedelta(days=30)
summary['sixmonth_date']=summary['min']+dt.timedelta(days=183)
summary['year_date']=summary['min']+dt.timedelta(days=365)

and then iterated and counted for each purchaser.

%%time
result=[]
for num, row in summary.iterrows():
    purchaser=row['purchaser']
    mindate=row['min']
    oneday=row['oneday_date']
    oneweek=row['oneweek_date']
    onemonth=row['onemonth_date']
    sixmonth=row['sixmonth_date']
    oneyear=row['year_date']
    
    subdf=df[df['purchaser']==purchaser]
    
    count0=len(subdf[(subdf['date']>=mindate) & (subdf['date']<oneday)])
    count1=len(subdf[(subdf['date']>=oneday) & (subdf['date']<oneweek)])
    count2=len(subdf[(subdf['date']>=oneweek) & (subdf['date']<onemonth)])
    count3=len(subdf[(subdf['date']>=onemonth) & (subdf['date']<sixmonth)])
    count4=len(subdf[(subdf['date']>=sixmonth) & (subdf['date']<oneyear)])
    count5=len(subdf[subdf['date']>=oneyear])
    
    result.append([purchaser,count0,count1,count2,count3,count4,count5])

CPU times: user 13.2 ms, sys: 587 ┬Ás, total: 13.8 ms
Wall time: 11.9 ms

My actual data is 1000,000 times bigger than this.

What I have already tried is

  • Indexing the dataframe on date df=df.set_index('date')
  • Sorting the subdf on dates

both did not bring any speed improvement

full data

df=pd.DataFrame({'purchaser':['anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi'],
'article':['pencil', 'pencil', 'pencil', 'pencil', 'rubber', 'rubber', 'rubber', 'rubber', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'pencil', 'pencil', 'rubber', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'pencil', 'pencil', 'pencil', 'pencil', 'pencil', 'pencil', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'rubber'],
'date':[1611316328000000000, 1612432758000000000, 1616319170000000000, 1622455063000000000, 1604242496000000000, 1604245635000000000, 1605421133000000000, 1570823168000000000, 1594919491000000000, 1604248351000000000, 1604237937000000000, 1604233396000000000, 1604251740000000000, 1601216201000000000, 1604232509000000000, 1604249925000000000, 1604246581000000000, 1603559931000000000, 1603946050000000000, 1603956529000000000, 1604228447000000000, 1604233557000000000, 1604212924000000000, 1604212924000000000, 1604212924000000000, 1612539904000000000, 1614939815000000000, 1614964750000000000, 1621581174000000000, 1604218928000000000, 1604222345000000000, 1604239015000000000, 1613635361000000000, 1604208994000000000]})
df['date']=pd.to_datetime(df['date'])

Solution

Algorithm:

  1. Calculate the first date of purchase for each purchaser using .groupby(), then calculate the second date, first week, … one year dates. Keep this in a second data frame.
  2. Left join these dates to the original dataset containing all purchases on the "purchaser" column.
  3. Calculate the columns desired based on these date columns in the all-purchase set df. This can now be done using vectorized operations, instead of iterating over the whole array, leading to a much faster run time.
  4. .groupby() on purchaser and sum each column’s counts to produce the final desired output.

Answered By – Ben Goodman

Answer Checked By – Terry (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *