
Imputing missing values

From the mean of a feature

Country Name    Year    GDP
Aruba           1965    null
Aruba           1966    5.872478e+08

Say you have a dataframe of GDP by Country Name and Year, but some years have no GDP value. One way to deal with the missing values is to fill them in with that country's mean GDP, computed over the years that do have data:

df['GDP_filled'] = df.groupby('Country Name')['GDP'].transform(lambda x: x.fillna(x.mean()))
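To see the per-country mean fill end to end, here is a minimal sketch on a toy dataframe. The rows and GDP values are made up for illustration, and the intermediate name country_mean is an assumption, not part of the original snippet. It uses transform('mean') instead of the lambda, which should give the same result on data like this.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the GDP values are invented for illustration.
df = pd.DataFrame({
    "Country Name": ["Aruba", "Aruba", "Aruba", "Chad", "Chad"],
    "Year": [1965, 1966, 1967, 1965, 1966],
    "GDP": [np.nan, 4.0, 6.0, 10.0, np.nan],
})

# Per-country mean (missing values are skipped), repeated on every row
# of that country so it lines up with the original index.
country_mean = df.groupby("Country Name")["GDP"].transform("mean")

# Fill only the missing entries; observed values are left untouched.
df["GDP_filled"] = df["GDP"].fillna(country_mean)

print(df)
```

Here the missing Aruba value is filled with Aruba's own mean and the missing Chad value with Chad's, so one country's numbers never leak into another's.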

With forward fill

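We can also use forward fill (ffill) from pandas, which carries the last known value forward into the gaps that follow it.

First we need to sort the data by Year, then group by Country Name so that the forward fill stays within each country. A missing value at the very start of a country's series has nothing before it, so it stays missing.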

df['GDP_ffill'] = df.sort_values('Year').groupby('Country Name')['GDP'].ffill()
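A small sketch of why the sort and the groupby both matter. The data is invented and deliberately shuffled, and the column name GDP_ffill is an assumption chosen for illustration. The filled series keeps the original index labels, so assigning it back to the unsorted dataframe lines up row by row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, deliberately out of year order.
df = pd.DataFrame({
    "Country Name": ["Aruba", "Chad", "Aruba", "Chad", "Aruba"],
    "Year": [1967, 1966, 1965, 1965, 1966],
    "GDP": [np.nan, np.nan, np.nan, 10.0, 4.0],
})

# Sort by year so "forward" means forward in time, then fill within
# each country so values never cross a country boundary.
ffilled = df.sort_values("Year").groupby("Country Name")["GDP"].ffill()

# Assignment aligns on the index, so the unsorted frame is filled correctly.
df["GDP_ffill"] = ffilled

print(df.sort_values(["Country Name", "Year"]))
```

Aruba's 1965 value remains missing because there is no earlier Aruba observation to carry forward.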

With backward fill

Of course there is backward fill too:

df['GDP_bfill'] = df.sort_values('Year').groupby('Country Name')['GDP'].bfill()
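Backward fill has the mirror-image limitation: a gap at the end of a country's series has no later value to pull from. One possible way around this, sketched below on made-up data, is to chain a forward fill and a backward fill inside each group. The column names GDP_bfill and GDP_both are assumptions for illustration, and whether chaining is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps at the start and end of each series.
df = pd.DataFrame({
    "Country Name": ["Aruba", "Aruba", "Aruba", "Chad", "Chad"],
    "Year": [1965, 1966, 1967, 1965, 1966],
    "GDP": [np.nan, 4.0, np.nan, 10.0, np.nan],
})

by_country = df.sort_values("Year").groupby("Country Name")["GDP"]

# Backward fill alone: trailing gaps within a country stay missing.
df["GDP_bfill"] = by_country.bfill()

# Forward fill first, then backward fill whatever is still missing,
# all within the same country.
df["GDP_both"] = by_country.transform(lambda s: s.ffill().bfill())

print(df)
```

A country with no observed GDP at all would still be entirely missing after both fills, since there is no value to propagate.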
