[SOLVED] Optimising applying an if function to a dataframe, am I doing it the slow way? (Python, Pandas)

Issue

First question here in a very long time, as I've been picking Python back up at work recently. I've been working on cleaning / prepping some data with pandas, and I've found that when applying a function to a smaller sample (500,000 rows) of the total data (~30,000,000 rows) it takes a very long time to run a specific chunk of my code (~8 mins). My thinking is that I've written something that works but isn't very optimal for what I'm trying to do, and that it's going to become a very long process when applied to the whole data set. I'm not completely sure, but I think running this kind of thing in a programme like Alteryx would be much faster, so I'm thinking I must have done something wrong. Any help or ideas to make it faster would be massively appreciated!

Dataframe example:

import pandas as pd

po_data = pd.DataFrame({'Order Quantity Received Type':['Order Cancelled - None Received','Order Partially Fulfilled'],'Order Quantity Change Type':['Order Cancelled','Increased'],'Received Quantity':[0,3],'Current Order Quantity':[0,5]})

Func:

def order_quantity_received(df,output_col,cancelled,received_quant,ordered_quant):
    # df is a single row here, because the function is called through apply(axis=1)
    if (df[cancelled] == "Order Cancelled") & (df[received_quant] == 0):
        df[output_col] = "Order Cancelled - None Received"
    elif (df[cancelled] == "Order Cancelled") & (df[received_quant] > 0):
        df[output_col] = "Order Cancelled - Items Received"
    elif df[received_quant] > df[ordered_quant]:
        df[output_col] = "Order Over Fulfilled"
    elif (df[received_quant] < df[ordered_quant]) & (df[received_quant] > 0):
        df[output_col] = "Order Partially Fulfilled"
    elif df[received_quant] == df[ordered_quant]:
        df[output_col] = "Order Fully Fulfilled"
    elif (df[received_quant] == 0) & (df[ordered_quant] > 0):
        df[output_col] = "Order Not Fulfilled"
    else:
        df[output_col] = "Error"
    return df

func call:

po_data = po_data.apply(lambda row: order_quantity_received(row,'Order Quantity Received Type','Order Quantity Change Type','Received Quantity','Current Order Quantity'),axis=1)

Solution

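The fastest way to work with pandas and NumPy is to vectorize your operations. Running a function element by element along an array or a Series using for loops, list comprehensions, or apply() is slow, because every row goes through the Python interpreter; apply(..., axis=1) additionally builds a Series object for each row.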
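As a rough illustration of the difference, the sketch below labels one kind of order with a single column-wide comparison instead of a per-row function call. The three-row frame and the `label` column are made up for the example; only the two input column names come from the question.

```python
import pandas as pd

# Hypothetical three-row sample; the column names follow the question.
orders = pd.DataFrame({
    "Order Quantity Change Type": ["Order Cancelled", "Increased", "Order Cancelled"],
    "Received Quantity": [0, 3, 2],
})

# One boolean Series for the whole column, computed without a Python-level loop.
cancelled_none = (
    (orders["Order Quantity Change Type"] == "Order Cancelled")
    & (orders["Received Quantity"] == 0)
)

# Label every matching row in one step ("label" is an illustrative column name).
orders["label"] = "Other"
orders.loc[cancelled_none, "label"] = "Order Cancelled - None Received"
```

The comparison and the `&` operate on entire columns at once, which is what "vectorized" means in this context.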

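I will just give an example for the "cancelled orders" branch:

import numpy as np

def order_cancelled(change_type, received):
    ## called once per row; add the remaining branches in the same way
    if change_type == "Order Cancelled":
        return "Order Cancelled - None Received" if received == 0 else "Order Cancelled - Items Received"
    return "Error"

And then vectorize your function (note that NumPy documents np.vectorize as a convenience wrapper that is essentially a for loop, so most of the gain comes from avoiding the per-row overhead of apply):

po_data['Order Quantity Received Type'] = np.vectorize(order_cancelled, otypes=[object])(po_data['Order Quantity Change Type'], po_data['Received Quantity'])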
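For the full if/elif ladder from the question, one option that avoids the per-row Python call altogether is `np.select`, which takes a list of boolean conditions and a matching list of labels and, for each row, picks the label of the first condition that is true. The following is a sketch under assumptions, not the answerer's code: the helper name `classify_orders` is made up, the second branch assumes "Items Received" means a received quantity above zero, the labels are spelled "Fulfilled" as in the sample frame, and the six test rows are invented.

```python
import numpy as np
import pandas as pd

def classify_orders(df, cancelled, received_quant, ordered_quant):
    """Return one label per row (hypothetical helper, not from the original post)."""
    is_cancelled = df[cancelled] == "Order Cancelled"
    received = df[received_quant]
    ordered = df[ordered_quant]

    # Conditions are checked in order, mirroring the if/elif ladder.
    conditions = [
        is_cancelled & (received == 0),
        is_cancelled & (received > 0),   # assumption: "Items Received" means > 0
        received > ordered,
        (received < ordered) & (received > 0),
        received == ordered,
        (received == 0) & (ordered > 0),
    ]
    labels = [
        "Order Cancelled - None Received",
        "Order Cancelled - Items Received",
        "Order Over Fulfilled",
        "Order Partially Fulfilled",
        "Order Fully Fulfilled",
        "Order Not Fulfilled",
    ]
    # Rows matching no condition fall through to "Error", like the final else.
    return np.select(conditions, labels, default="Error")

# Invented sample rows, one per branch.
po_data = pd.DataFrame({
    "Order Quantity Change Type": ["Order Cancelled", "Increased", "Order Cancelled",
                                   "Decreased", "Increased", "Increased"],
    "Received Quantity": [0, 3, 2, 5, 0, 7],
    "Current Order Quantity": [0, 5, 0, 5, 4, 5],
})
po_data["Order Quantity Received Type"] = classify_orders(
    po_data, "Order Quantity Change Type", "Received Quantity", "Current Order Quantity"
)
```

Because each condition is a whole-column comparison, the work happens inside NumPy rather than in a Python loop, which is where the speed-up on millions of rows would be expected to come from.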

Answered By – yakutsa

Answer Checked By – Katrina (BugsFixing Volunteer)
