## Issue

First question here in a very long time, as I've been picking Python back up at work recently. I've been cleaning and prepping some data with pandas, and I've found that applying a function to a smaller sample (500,000 rows) of the total data (~30,000,000 rows) takes a very long time for one specific chunk of my code (~8 mins). My thinking is that I've written something that works but isn't optimal for what I'm trying to do, and that it will become a very long process when applied to the whole data set. I'm not completely sure, but I think running this kind of thing in a program like Alteryx would be much faster, so I suspect I've done something wrong. Any help or ideas to make it faster would be massively appreciated!

Dataframe example:

```
import pandas as pd

po_data = pd.DataFrame({
    'Order Quantity Received Type': ['Order Cancelled - None Received', 'Order Partially Fulfilled'],
    'Order Quantity Change Type': ['Order Cancelled', 'Increased'],
    'Received Quantity': [0, 3],
    'Current Order Quantity': [0, 5],
})
```

Func:

```
def order_quantity_received(df, output_col, cancelled, received_quant, ordered_quant):
    # With apply(axis=1), `df` is a single row (a Series), so each test is a scalar.
    if (df[cancelled] == "Order Cancelled") & (df[received_quant] == 0):
        df[output_col] = "Order Cancelled - None Received"
    # The original repeated the `== 0` test here, so this branch could never run;
    # `> 0` is assumed to be the intended condition.
    elif (df[cancelled] == "Order Cancelled") & (df[received_quant] > 0):
        df[output_col] = "Order Cancelled - Items Received"
    elif df[received_quant] > df[ordered_quant]:
        df[output_col] = "Order Over Fulfilled"
    elif (df[received_quant] < df[ordered_quant]) & (df[received_quant] > 0):
        df[output_col] = "Order Partially Fulfilled"
    elif df[received_quant] == df[ordered_quant]:
        df[output_col] = "Order Fully Fulfilled"
    elif (df[received_quant] == 0) & (df[ordered_quant] > 0):
        df[output_col] = "Order Not Fulfilled"
    else:
        df[output_col] = "Error"
    return df
```

func call:

```
po_data = po_data.apply(
    lambda row: order_quantity_received(
        row,
        'Order Quantity Received Type',
        'Order Quantity Change Type',
        'Received Quantity',
        'Current Order Quantity',
    ),
    axis=1,
)
```

## Solution

The fastest way to work with pandas and NumPy is to vectorize your functions. Running a function element by element along an array or a Series with for loops, list comprehensions, or apply() is slow, because every element goes through a separate Python-level call instead of one operation over the whole column.
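To make "vectorize" concrete, here is a minimal sketch that computes the same comparison row by row and then on whole columns. The toy data is hypothetical; only the column names are taken from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "Received Quantity": [0, 3, 5],
    "Current Order Quantity": [0, 5, 5],
})

# Row by row: the lambda is called once per row, in Python.
row_wise = df.apply(
    lambda row: row["Received Quantity"] < row["Current Order Quantity"],
    axis=1,
)

# Vectorized: one comparison applied to the two columns as a whole.
column_wise = df["Received Quantity"] < df["Current Order Quantity"]
```

Both give the same boolean Series; the second form hands the loop to pandas/NumPy instead of running it in Python.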

Here is an example for just the "cancelled orders" case:

```
def order_cancelled(a, b):
    # define your function logic however you want
    return a - b
```

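And then vectorize your function with np.vectorize (the NumPy documentation notes that it is provided primarily for convenience, not performance, and is essentially a for loop internally):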

```
import numpy as np

df['output_col'] = np.vectorize(order_cancelled)(df['cancelled'], df['received_quant'])
```
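As an alternative to wrapping a scalar function, the whole if/elif chain from the question can probably be expressed as boolean column masks passed to np.select, which picks the first matching condition for each row. This is a sketch rather than the answer's own method: the sample rows are hypothetical, the labels mirror the question's function (spelled "Fulfilled" as in its sample data), and the second condition assumes "cancelled with items received" means a received quantity above zero.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; column names follow the question.
po_data = pd.DataFrame({
    "Order Quantity Change Type": [
        "Order Cancelled", "Order Cancelled",
        "Increased", "Increased", "Increased", "Increased",
    ],
    "Received Quantity": [0, 2, 3, 7, 5, 0],
    "Current Order Quantity": [0, 0, 5, 5, 5, 5],
})

cancelled = po_data["Order Quantity Change Type"] == "Order Cancelled"
received = po_data["Received Quantity"]
ordered = po_data["Current Order Quantity"]

# Conditions are checked in order, like the if/elif chain: first match wins.
conditions = [
    cancelled & (received == 0),
    cancelled & (received > 0),  # assumption: the intended "items received" test
    received > ordered,
    (received < ordered) & (received > 0),
    received == ordered,
    (received == 0) & (ordered > 0),
]
labels = [
    "Order Cancelled - None Received",
    "Order Cancelled - Items Received",
    "Order Over Fulfilled",
    "Order Partially Fulfilled",
    "Order Fully Fulfilled",
    "Order Not Fulfilled",
]

# Rows matching no condition get the default, like the final else branch.
po_data["Order Quantity Received Type"] = np.select(conditions, labels, default="Error")
```

Each mask is computed once over the full column, so no Python function is called per row.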

Answered By – yakutsa
