Issue
I have a dataframe with a history of purchases (the full dataframe is at the end). For each purchaser I have to generate a summary of the purchases they made on their first day, second day, within the first week, the first month, and so on, like below:
|   | purchaser | firstDay | secondDay | firstWeek | firstMonth | 6months | oneyear |
|---|-----------|----------|-----------|-----------|------------|---------|---------|
| 0 | anil      | 1        | 0         | 0         | 0          | 1       |         |
| 1 | mukesh    | 1        | 0         | 1         | 7          | 0       |         |
| 2 | ravi      | 8        | 0         | 0         | 4          | 1       |         |
What I have done so far:

First, I created a summary with each purchaser's first purchase date and the cut-off dates for each window:

```python
import datetime as dt

summary = df.groupby('purchaser').agg('min').rename(columns={'date': 'min'}).reset_index()
summary['oneday_date'] = summary['min'] + dt.timedelta(days=1)
summary['oneweek_date'] = summary['min'] + dt.timedelta(days=7)
summary['onemonth_date'] = summary['min'] + dt.timedelta(days=30)
summary['sixmonth_date'] = summary['min'] + dt.timedelta(days=183)
summary['year_date'] = summary['min'] + dt.timedelta(days=365)
```
Then I iterated over the purchasers and counted the purchases in each window:
```python
%%time
result = []
for num, row in summary.iterrows():
    purchaser = row['purchaser']
    mindate = row['min']
    oneday = row['oneday_date']
    oneweek = row['oneweek_date']
    onemonth = row['onemonth_date']
    sixmonth = row['sixmonth_date']
    oneyear = row['year_date']
    subdf = df[df['purchaser'] == purchaser]
    count0 = len(subdf[(subdf['date'] >= mindate) & (subdf['date'] < oneday)])
    count1 = len(subdf[(subdf['date'] >= oneday) & (subdf['date'] < oneweek)])
    count2 = len(subdf[(subdf['date'] >= oneweek) & (subdf['date'] < onemonth)])
    count3 = len(subdf[(subdf['date'] >= onemonth) & (subdf['date'] < sixmonth)])
    count4 = len(subdf[(subdf['date'] >= sixmonth) & (subdf['date'] < oneyear)])
    count5 = len(subdf[subdf['date'] >= oneyear])
    result.append([purchaser, count0, count1, count2, count3, count4, count5])
```
```
CPU times: user 13.2 ms, sys: 587 µs, total: 13.8 ms
Wall time: 11.9 ms
```
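The question stops at the `result` list; for completeness, turning it into the summary table shown above would look something like this (an assumed final step, not shown in the original; column names taken from the desired output):

```python
import pandas as pd

# Hypothetical final step: wrap the per-purchaser counts in a DataFrame
# using the column names from the desired output table.
result_df = pd.DataFrame(
    result,
    columns=['purchaser', 'firstDay', 'secondDay', 'firstWeek',
             'firstMonth', '6months', 'oneyear'],
)
```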
My actual data is about 1,000,000 times bigger than this sample.
What I have already tried:

- Indexing the dataframe on date: `df = df.set_index('date')`
- Sorting `subdf` on dates

Neither brought any speed improvement.
Full data:

```python
import pandas as pd

df = pd.DataFrame({'purchaser': ['anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'anil', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'mukesh', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi', 'ravi'],
                   'article': ['pencil', 'pencil', 'pencil', 'pencil', 'rubber', 'rubber', 'rubber', 'rubber', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'pencil', 'pencil', 'rubber', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'pencil', 'pencil', 'pencil', 'pencil', 'pencil', 'pencil', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'sharpner', 'rubber'],
                   'date': [1611316328000000000, 1612432758000000000, 1616319170000000000, 1622455063000000000, 1604242496000000000, 1604245635000000000, 1605421133000000000, 1570823168000000000, 1594919491000000000, 1604248351000000000, 1604237937000000000, 1604233396000000000, 1604251740000000000, 1601216201000000000, 1604232509000000000, 1604249925000000000, 1604246581000000000, 1603559931000000000, 1603946050000000000, 1603956529000000000, 1604228447000000000, 1604233557000000000, 1604212924000000000, 1604212924000000000, 1604212924000000000, 1612539904000000000, 1614939815000000000, 1614964750000000000, 1621581174000000000, 1604218928000000000, 1604222345000000000, 1604239015000000000, 1613635361000000000, 1604208994000000000]})
df['date'] = pd.to_datetime(df['date'])
```
Solution
Algorithm:

- Calculate the first date of purchase for each purchaser using `.groupby()`, then compute the second-day, first-week, … one-year cut-off dates from it. Keep these in a second dataframe.
- Left join these dates onto the original dataset containing all purchases, on the `purchaser` column.
- Calculate the desired columns from these date columns in the all-purchase set `df`. This can now be done with vectorized operations, instead of iterating over the whole array, leading to a much faster run time.
- `.groupby()` on `purchaser` and sum each column's counts to produce the final desired output (a sketch follows this list).
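A minimal sketch of this approach, using the sample `df` above and the same window boundaries as the question's loop (the variable and helper column names here are illustrative, not from the original answer):

```python
import pandas as pd

# Step 1: first purchase date per purchaser, plus the window cut-off dates.
first = df.groupby('purchaser', as_index=False)['date'].min().rename(columns={'date': 'min'})
for col, days in [('oneday', 1), ('oneweek', 7), ('onemonth', 30),
                  ('sixmonth', 183), ('year', 365)]:
    first[col] = first['min'] + pd.Timedelta(days=days)

# Step 2: left join the cut-off dates onto every purchase row.
m = df.merge(first, on='purchaser', how='left')

# Step 3: vectorized window flags (each purchase lands in exactly one bucket).
m['firstDay']   = (m['date'] >= m['min'])      & (m['date'] < m['oneday'])
m['secondDay']  = (m['date'] >= m['oneday'])   & (m['date'] < m['oneweek'])
m['firstWeek']  = (m['date'] >= m['oneweek'])  & (m['date'] < m['onemonth'])
m['firstMonth'] = (m['date'] >= m['onemonth']) & (m['date'] < m['sixmonth'])
m['6months']    = (m['date'] >= m['sixmonth']) & (m['date'] < m['year'])
m['oneyear']    = (m['date'] >= m['year'])

# Step 4: group by purchaser and sum the boolean flags into counts.
cols = ['firstDay', 'secondDay', 'firstWeek', 'firstMonth', '6months', 'oneyear']
result = m.groupby('purchaser', as_index=False)[cols].sum()
```

On the sample data this reproduces the loop's counts, but the cost is a single merge plus vectorized comparisons over the whole table, so it scales with the total number of purchase rows rather than running one filter pass per purchaser.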
Answered By – Ben Goodman
Answer Checked By – Terry (BugsFixing Volunteer)