Detect and handle missing values in Python

Data Science and Machine Learning Using Python

About Lesson

When you work on a real project, you will probably get a raw dataset. Raw data is simply a dataset that might have missing values, non-numeric columns, outliers, unnecessary columns, and so on. As data scientists, it is our task to clean the dataset so that it is suitable for a machine learning model. In this lesson, we will discuss several methods that help us detect and handle missing values.

How do we handle missing values?

In Python, the Pandas library offers several methods for handling missing values, and the right choice depends on the type of dataset. In this section, we will go through the following methods.

Drop missing values
Forward filling
Backward filling
Replace missing values

First, let us create a sample dataset that will have some missing values in it.

# import module
import pandas as pd

# sample data with two missing values
data = {"Column_1": [3, 8, 9, None, 3, 8, 3, None, 8]}

# dataframe
df = pd.DataFrame(data)
df

If you have a large data frame with many rows and columns, you can use df.isnull().sum() to count the missing values in each column.

df.isnull().sum()
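The count tells us how many values are missing, but not where they are. A minimal sketch of how we might locate them, using the same sample data (the variable names missing_per_column and rows_with_missing are only illustrative):

```python
import pandas as pd

# Same sample data as in this lesson
df = pd.DataFrame({"Column_1": [3, 8, 9, None, 3, 8, 3, None, 8]})

# Number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing.index.tolist())  # [3, 7]
```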

In our case, the data frame has two missing values.

How do we drop missing values from the data frame?

The dropna() method in Pandas gets rid of missing values by dropping them. When you apply the dropna() method to a data frame, it drops every row in which it finds a missing value.

It is recommended to use the dropna() method only when the data frame is very large compared to the number of missing values. In such a case, dropping a few rows will not affect the overall trend of the dataset.
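Before dropping anything, it can help to check what share of the rows would be lost. The sketch below is one possible way to do that; the name missing_fraction is illustrative, and how large a share is acceptable depends on the project.

```python
import pandas as pd

# Same sample data as in this lesson
df = pd.DataFrame({"Column_1": [3, 8, 9, None, 3, 8, 3, None, 8]})

# Fraction of rows that contain at least one missing value
missing_fraction = df.isnull().any(axis=1).mean()
print(f"{missing_fraction:.1%} of the rows would be dropped")
```

On our small sample, 2 of the 9 rows would be lost, which is a large share. On a data frame with thousands of rows and only a handful of missing values, the share would be negligible.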

# df copy
df1 = df.copy()

# dropna with inplace=True
df1.dropna(inplace=True)

First, we copied the original data frame into another variable, df1. Then we applied the dropna() method. Passing inplace=True tells dropna() to modify df1 itself. Without it, dropna() returns a new data frame with the rows removed and leaves df1 unchanged, so the result is lost unless we assign it to a variable.
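As a small sketch of the alternative, we can skip inplace and assign the returned data frame instead. The name cleaned is illustrative.

```python
import pandas as pd

# Same sample data as in this lesson
df = pd.DataFrame({"Column_1": [3, 8, 9, None, 3, 8, 3, None, 8]})

# Without inplace=True, dropna() returns a new data frame
cleaned = df.dropna()

print(len(df))       # 9, the original is unchanged
print(len(cleaned))  # 7, the two rows with missing values are gone

# The remaining rows keep their original index labels;
# reset_index(drop=True) renumbers them from 0
cleaned = cleaned.reset_index(drop=True)
```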

How does the ffill() work in Pandas?

The ffill() method is the Pandas method that handles missing values by forward filling. When we apply the ffill() method to a dataset that has null values, each null value is filled with the value from the previous row. The last known value is carried forward, which is why this method is known as forward filling.

The important question is when to use this type of filling. It is recommended to use ffill() when the rows of the dataset are related to each other. In particular, if you have a time series data frame, you can use ffill() to fill the missing values.

# df copy
df2 = df.copy()

# ffill()
df2.ffill(inplace=True)

In this case, we again copied the data frame and stored it in another variable. Then we applied the ffill() method with inplace=True, which means df2 itself is modified.
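To make the effect visible, the sketch below prints the forward-filled column and also shows one limitation: a missing value in the very first row has no previous row to copy from, so it stays missing. The names filled and leading are illustrative.

```python
import pandas as pd

# Same sample data as in this lesson
df = pd.DataFrame({"Column_1": [3, 8, 9, None, 3, 8, 3, None, 8]})

# Each missing value takes the value from the row before it
filled = df.ffill()
print(filled["Column_1"].tolist())
# [3.0, 8.0, 9.0, 9.0, 3.0, 8.0, 3.0, 3.0, 8.0]

# A missing value at the start cannot be forward filled
leading = pd.Series([None, 1.0, None]).ffill()
print(leading.isnull().tolist())  # [True, False, False]
```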

How does the bfill() work in Pandas?

The bfill() method is another Pandas method that is used to fill missing values. It fills each null value with the value from the row that comes after the missing one. That is why this method is known as backward filling.

In the same way, we can apply the bfill() method to time series datasets because, in a time series, consecutive rows are related to each other.

# df copy
df3 = df.copy()

# bfill()
df3.bfill(inplace=True)
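The following sketch shows what backward filling produces on the sample data. It also shows the mirror image of the forward-fill limitation: a missing value in the last row has no following row, so it stays missing. The names filled and trailing are illustrative.

```python
import pandas as pd

# Same sample data as in this lesson
df = pd.DataFrame({"Column_1": [3, 8, 9, None, 3, 8, 3, None, 8]})

# Each missing value takes the value from the row after it
filled = df.bfill()
print(filled["Column_1"].tolist())
# [3.0, 8.0, 9.0, 3.0, 3.0, 8.0, 3.0, 8.0, 8.0]

# A missing value at the end cannot be backward filled
trailing = pd.Series([None, 1.0, None]).bfill()
print(trailing.isnull().tolist())  # [False, False, True]
```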

How does the fillna() work in Pandas?

Apart from filling with the value of the previous or the next row, we can also fill the missing values with values of our own. For example, in some cases, we may want to replace the missing values with zeroes or with the mean of the column.

Let us see how we can apply the fillna() method to replace the missing values with the mean of the column.

# df copy
df4 = df.copy()

# find the mean of dataframe
mean = df4.mean()

# fill with mean
df4.fillna(mean, inplace=True)

The mean() method in Pandas finds the mean of each column separately, skipping the missing values, and the fillna() method replaces the missing values in each column with the mean of that specific column.
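As a quick check on the sample data, the sketch below computes the mean by hand and compares filling with the mean against filling with zero. The names column_mean, filled_mean and filled_zero are illustrative.

```python
import pandas as pd

# Same sample data as in this lesson
df = pd.DataFrame({"Column_1": [3, 8, 9, None, 3, 8, 3, None, 8]})

# mean() skips missing values by default: (3+8+9+3+8+3+8) / 7 = 6.0
column_mean = df["Column_1"].mean()
print(column_mean)  # 6.0

# Fill with the column mean (a dict maps column name to fill value)
filled_mean = df.fillna({"Column_1": column_mean})
print(filled_mean["Column_1"].tolist())
# [3.0, 8.0, 9.0, 6.0, 3.0, 8.0, 3.0, 6.0, 8.0]

# Fill with a constant instead
filled_zero = df.fillna(0)
print(filled_zero["Column_1"].tolist())
# [3.0, 8.0, 9.0, 0.0, 3.0, 8.0, 3.0, 0.0, 8.0]
```

Filling with the mean keeps the column mean at 6.0, while filling with zero pulls it down, so the choice of fill value changes the statistics of the column.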

Conclusion

In this short lesson, we discussed the most basic preprocessing technique: handling missing values. We can either use the methods mentioned above to handle missing values or use machine learning models to predict the missing values for us. Here, we looked at four different methods that can be used to handle missing values depending on the data frame.
