[SOLVED] How can I replace a loop to filter, splice, and process a dataframe to improve performance?

Issue

I’d like to improve my code so that it’s more pythonic and enhances the speed at which data is processed. The current code works, but I’m sure this can be improved somewhat. The .csv file is 702 MB, so it’s taking about 7-10 minutes for me to get to the end result:

def delayed_vs_punctual(self, df):
    filtered_for_carriers = df['UniqueCarrier']
    number_of_entries_each_carrier = filtered_for_carriers.value_counts()
    carriers = number_of_entries_each_carrier.index

    percent_delayed_all = []
    for carrier in carriers:  
        total_number_of_carrier = number_of_entries_each_carrier[carrier]
        mask = df.loc[df['UniqueCarrier'] == carrier]

        d = 0
        for index, row in mask.iterrows():
            ArrDelay = row['ArrDelay']
            if ArrDelay > 0:
                d += 1
            else:
                pass
        percent_delayed = d/total_number_of_carrier
        percent_delayed_all.append(percent_delayed)

    percentage_delay_dict = dict(zip(carriers, percent_delayed_all))  

    return percent_delayed_all

I’m quite sure looping is not the best method. Anyway, sample data:

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0
2007,1,1,1,1918,1905,2043,2035,WN,462,N370,85,90,74,8,13,SMF,PDX,479,5,6,0,,0,0,0,0,0,0
2007,1,1,1,2206,2130,2334,2300,WN,1229,N685,88,90,73,34,36,SMF,PDX,479,6,9,0,,0,3,0,0,0,31
2007,1,1,1,1230,1200,1356,1330,WN,1355,N364,86,90,75,26,30,SMF,PDX,479,3,8,0,,0,23,0,0,0,3

I’m filtering through and splicing the DF so that it is grouped into airlines (UniqueCarrier). Then in each of these new mini-DFs (still pretty large though), I’m checking each row for a certain condition, like if there was a delay. Then a percentage is calculated (delayed flights against total flights for that particular airline). The end result is a dict:

percent_delayed_all = {'YV': 0.42212989448366295, 'US': 0.53435287477314719, 'MQ': 0.46239551225360503, 'AA': 0.49731090766529357, \
    'FL': 0.43394297743949478, 'NW': 0.56168732479989192, 'HA': 0.25596795727636851, 'F9': 0.50444967266775775, \
    'WN': 0.41947657183726861, 'OH': 0.50945518784192445, 'OO': 0.46118130333410273, '9E': 0.41249599190267761, \
    'B6': 0.45879864194306608, 'UA': 0.47631438239027596, 'AS': 0.47851546649186877, 'CO': 0.45207967792146703, \
    'AQ': 0.27577653149266607, 'XE': 0.40724700015870352, 'EV': 0.52604861756464993, 'DL': 0.45727049795225355}

From the dict you can see the airline as the key, and the percentage of delayed flights, so FL would be 43%, for example. Delayed in this case means > 0 minutes more than the expected arrival time.

Solution

df.ArrDelay.gt(0).groupby(df.UniqueCarrier).mean().to_dict()

Answered By – piRSquared

Answer Checked By – David Goodson (BugsFixing Volunteer)

Leave a Reply

Your email address will not be published. Required fields are marked *