# [SOLVED] Most efficient way to forward-fill NaN values in numpy array

## Example Problem

As a simple example, consider the numpy array `arr` as defined below:

``````
import numpy as np
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
``````

where `arr` looks like this in console output:

``````
array([[  5.,  nan,  nan,   7.,   2.],
       [  3.,  nan,   1.,   8.,  nan],
       [  4.,   9.,   6.,  nan,  nan]])
``````

I would now like to row-wise ‘forward-fill’ the `nan` values in array `arr`. By that I mean replacing each `nan` value with the nearest valid value from the left. The desired result would look like this:

``````
array([[  5.,   5.,   5.,   7.,   2.],
       [  3.,   3.,   1.,   8.,   8.],
       [  4.,   9.,   6.,   6.,   6.]])
``````

## Tried thus far

I’ve tried using for-loops:

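``````
# Start at column 1: column 0 has no left neighbour, and index -1 would wrap to the last column.
for row_idx in range(arr.shape[0]):
    for col_idx in range(1, arr.shape[1]):
        if np.isnan(arr[row_idx, col_idx]):
            arr[row_idx, col_idx] = arr[row_idx, col_idx - 1]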
``````

I’ve also tried using a pandas dataframe as an intermediate step (since pandas dataframes have a very neat built-in method for forward-filling):

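``````
import pandas as pd
df = pd.DataFrame(arr)
df = df.ffill(axis=1)  # same as fillna(method='ffill', axis=1), which newer pandas deprecates
arr = df.to_numpy()    # replaces as_matrix(), which was removed in pandas 1.0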
``````

Both of the above strategies produce the desired result, but I keep on wondering: wouldn’t a strategy that uses only numpy vectorized operations be the most efficient one?

## Summary

Is there another more efficient way to ‘forward-fill’ `nan` values in numpy arrays? (e.g. by using numpy vectorized operations)

# Update: Solutions Comparison

I’ve tried to time all solutions thus far. This was my setup script:

``````
import numba as nb
import numpy as np
import pandas as pd

def random_array():
    choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
    out = np.random.choice(choices, size=(1000, 10))
    return out

def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

@nb.jit
def numba_loops_fill(arr):
    '''Numba decorator solution provided by shx2.'''
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

def pandas_fill(arr):
    df = pd.DataFrame(arr)
    df = df.ffill(axis=1)  # same as fillna(method='ffill', axis=1)
    out = df.to_numpy()    # replaces as_matrix(), removed in pandas 1.0
    return out

def numpy_fill(arr):
    '''Solution provided by Divakar.'''
    mask = np.isnan(arr)
    # Column index of each valid entry, 0 for NaNs.
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    # Running maximum: each NaN inherits the index of the last valid entry to its left.
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:, None], idx]
    return out
``````

followed by this console input:

``````
%timeit -n 1000 loops_fill(random_array())
%timeit -n 1000 numba_loops_fill(random_array())
%timeit -n 1000 pandas_fill(random_array())
%timeit -n 1000 numpy_fill(random_array())
``````

resulting in this console output:

``````
1000 loops, best of 3: 9.64 ms per loop
1000 loops, best of 3: 377 µs per loop
1000 loops, best of 3: 455 µs per loop
1000 loops, best of 3: 351 µs per loop
``````
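The timings are only comparable if the implementations return the same result, so a quick agreement check is worth running first. The following is a minimal sketch rather than part of the original benchmark: it repeats `loops_fill` and `numpy_fill` so that it runs on its own, and the fixed seed and the name `sample` are assumptions made for reproducibility.

```python
import numpy as np

def loops_fill(arr):
    # Reference implementation: plain Python loops, starting at column 1.
    out = arr.copy()
    for r in range(out.shape[0]):
        for c in range(1, out.shape[1]):
            if np.isnan(out[r, c]):
                out[r, c] = out[r, c - 1]
    return out

def numpy_fill(arr):
    # Vectorized implementation based on np.maximum.accumulate.
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]

rng = np.random.default_rng(0)  # assumption: fixed seed, not in the original setup
choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
sample = rng.choice(choices, size=(1000, 10))

# assert_array_equal treats NaNs in matching positions as equal,
# so rows that start with NaN do not cause a false failure.
np.testing.assert_array_equal(loops_fill(sample), numpy_fill(sample))
```

Both versions leave a NaN in place when a row begins with one, since there is no valid value to its left.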

## Solution

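Here’s one approach: record the column index of each valid entry (0 for NaNs), take a running maximum along each row so that every NaN inherits the index of the last valid entry to its left, then index `arr` with the result:

``````
mask = np.isnan(arr)
idx = np.where(~mask, np.arange(mask.shape[1]), 0)
np.maximum.accumulate(idx, axis=1, out=idx)
out = arr[np.arange(idx.shape[0])[:, None], idx]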
``````
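To make the index trick easier to follow, here is the same approach run on the 3×5 array from the question, with the intermediate `idx` values written out as comments. It is meant as an illustrative walkthrough, not a different method.

```python
import numpy as np

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

mask = np.isnan(arr)

# Column index where the value is valid, 0 where it is NaN.
idx = np.where(~mask, np.arange(mask.shape[1]), 0)
# idx is now:
# [[0 0 0 3 4]
#  [0 0 2 3 0]
#  [0 1 2 0 0]]

# Running maximum along each row: every NaN position takes over the
# column index of the last valid value to its left.
np.maximum.accumulate(idx, axis=1, out=idx)
# idx is now:
# [[0 0 0 3 4]
#  [0 0 2 3 3]
#  [0 1 2 2 2]]

# Integer-array indexing: position (i, j) reads arr[i, idx[i, j]].
out = arr[np.arange(idx.shape[0])[:, None], idx]
```

The row indices have shape `(3, 1)` and broadcast against `idx` with shape `(3, 5)`, so each output element is picked from its own row.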

If you don’t want to create another array and just want to fill the NaNs in `arr` itself, replace the last step with this:

``````
arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]
``````
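As a rough sketch, the in-place variant can be wrapped in a function. The name `ffill_inplace` is hypothetical, chosen only for this example. The small input also shows one edge case: a NaN at the start of a row has no valid value to its left, so it stays NaN.

```python
import numpy as np

def ffill_inplace(arr):
    """Forward-fill NaNs along each row of a 2-D float array, modifying arr.

    Hypothetical helper name, used only for illustration.
    """
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    # np.nonzero(mask)[0] gives the row of every NaN; idx[mask] gives the
    # column to copy from. Only the NaN positions are written.
    arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]
    return arr

a = np.array([[np.nan, 2.0, np.nan],
              [1.0, np.nan, np.nan]])
ffill_inplace(a)
# a[0, 0] is still NaN (nothing to its left); every other NaN is filled.
```

If leading NaNs need a value as well, they have to be handled separately, for example with a backward fill or a default.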

Sample input and output:

``````
In : arr
Out:
array([[  5.,  nan,  nan,   7.,   2.,   6.,   5.],
       [  3.,  nan,   1.,   8.,  nan,   5.,  nan],
       [  4.,   9.,   6.,  nan,  nan,  nan,   7.]])

In : out
Out:
array([[ 5.,  5.,  5.,  7.,  2.,  6.,  5.],
       [ 3.,  3.,  1.,  8.,  8.,  5.,  5.],
       [ 4.,  9.,  6.,  6.,  6.,  6.,  7.]])
``````