Course Content
Data Reading using Python
As data scientists, we should be able to read many types of datasets. In this lesson, we will cover how to read datasets in different formats using Python and its various modules.
Data Preprocessing
In this section, we will cover various methods for cleaning a dataset and making it suitable for a machine learning model. We will focus on the basic methods that every data scientist should know: encoding, outliers, null values, and handling imbalanced datasets. By the end of this section, you will be comfortable preprocessing and cleaning a dataset.
Project-1: Data Analysis Project
Welcome to the first project! This is a simple data analysis project. Your task is to import the raw dataset and apply various methods to analyze it and find its hidden trends. We have provided the solution as well, but we recommend that you try the project yourself first and then look at the attached file for the solution.
Supervised Machine Learning
Machine learning uses models to go through a dataset and automatically find trends and information in the data. Machine learning can be supervised or unsupervised. In supervised machine learning, the dataset contains a target variable; in other words, the dataset is labeled. A supervised model learns the relation between the input data and the target variable and uses this relation to make predictions later. In this section, we will go through various kinds of supervised machine learning models and analyze them.
Data Science and Machine Learning Using Python
About Lesson

When you work on a real project, you will probably get a raw dataset. A raw dataset is one that might have missing values, non-numeric columns, outliers, unnecessary columns, etc. As data scientists, it is our task to clean the dataset and make it suitable for a machine learning model. In this lesson, we will discuss various methods that can help us handle missing values.

How do we handle missing values?

Pandas provides several methods for handling missing values, and the best choice depends on the type of dataset. In this section, we will go through the following methods.

  • Drop missing values
  • Forward filling
  • Backward filling
  • Replace missing values

First, let us create a sample dataset that will have some missing values in it. 

# import module
import pandas as pd

# sample data with two missing values (None)
data = {"Column_1": [3, 8, 9, None, 3, 8, 3, None, 8]}

# build the data frame from the dictionary
df = pd.DataFrame(data)
df

If you have a large data frame with many columns and rows, you can use df.isnull().sum() to count the missing values in each column.

df.isnull().sum()

In our case, Column_1 has two missing values.
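Counting missing values is often only the first step. The sketch below, which rebuilds the same sample data, also computes the fraction of missing rows per column and a single yes/no flag. The variable names are illustrative only.

```python
import pandas as pd

# Same sample data as in the lesson
df = pd.DataFrame({"Column_1": [3, 8, 9, None, 3, 8, 3, None, 8]})

# Number of missing values in each column
missing_count = df.isnull().sum()

# Fraction of rows that are missing in each column
# (the mean of a True/False column is the share of True values)
missing_fraction = df.isnull().mean()

# Single flag: does the data frame contain any missing value at all?
has_missing = df.isnull().values.any()

print(missing_count)
print(missing_fraction)
print(has_missing)
```

The fraction is useful when deciding how to handle the gaps: a column with only a small share of missing rows is a reasonable candidate for dropping rows, while a larger share usually calls for filling instead.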

How do we drop missing values from the data frame?

The dropna() method in Pandas gets rid of missing values by dropping them. When you apply the dropna() method to a data frame, it drops every row in which it finds a missing value.

It is recommended to use the dropna() method only when the number of rows in the data frame is very large compared to the number of missing values. In such a case, dropping a few rows will not affect the overall trend of the dataset.

# df copy
df1 = df.copy()

# dropna with inplace =True
df1.dropna(inplace = True)

First, we copied the original data frame into another variable, df1. Then we applied the dropna() method. Passing inplace=True tells dropna() to modify df1 directly. Without it, dropna() returns a new data frame with the rows removed and leaves df1 unchanged, so you would need to assign the result to a variable to keep it.
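As an alternative to inplace=True, you can assign the returned data frame to a new variable. The following minimal sketch, using the same sample data, shows that the original stays untouched and that the surviving rows keep their old index labels until you reset them. The name cleaned is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({"Column_1": [3, 8, 9, None, 3, 8, 3, None, 8]})

# Without inplace=True, dropna() returns a new data frame
cleaned = df.dropna()

# The original still has 9 rows; the cleaned copy has 7
print(len(df), len(cleaned))

# The dropped rows leave gaps in the index (labels 3 and 7 are gone),
# so renumber the rows from 0 if a continuous index is needed
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```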

How does the ffill() work in Pandas?

The ffill() method in Pandas handles missing values by forward filling. When we apply the ffill() method to a dataset that has null values, each null value is replaced with the most recent non-null value from the rows above it. The last known value is carried forward, which is why this method is known as forward filling.

The important question is when to use this type of filling. The ffill() method is a good choice when the rows are ordered and each row is related to the rows around it. In particular, if you have a time series data frame, you can use ffill() to fill the missing values.

# df copy
df2 = df.copy()

# ffill()
df2.ffill(inplace=True)

In this case, we again copied the data frame and stored it in another variable. Then we applied the ffill() method with inplace=True, which means df2 itself is modified.
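To see forward filling on time series data, here is a small sketch with a hypothetical series of daily temperature readings (the dates and values are made up for illustration). It also shows two details worth knowing: a missing value at the very start has nothing before it and stays missing, and the limit parameter caps how many consecutive gaps are filled.

```python
import pandas as pd

# Hypothetical daily readings; dates and values are invented for this example
dates = pd.date_range("2024-01-01", periods=6, freq="D")
temps = pd.Series([None, 21.5, None, None, 23.0, None], index=dates)

# Carry the last known reading forward into each gap
filled = temps.ffill()

# Fill at most one consecutive missing value after each known reading
limited = temps.ffill(limit=1)

print(pd.DataFrame({"original": temps, "ffill": filled, "ffill_limit_1": limited}))
```

The first reading remains missing in both results, so forward filling is sometimes followed by a second step (for example, backward filling) to handle gaps at the start.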

How does the bfill() work in Pandas?

The bfill() method is another Pandas method for filling missing values. It fills each null value with the next non-null value from the rows below it. That is why this method is known as backward filling.

In the same way, we can apply the bfill() method to time series datasets because, in such datasets, neighboring rows are related to each other.

# df copy
df3 = df.copy()

# bfill()
df3.bfill(inplace=True)
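
To make the difference between the two directions concrete, the sketch below places the original column next to its forward-filled and backward-filled versions. The comparison data frame and its column names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"Column_1": [3, 8, 9, None, 3, 8, 3, None, 8]})

# Put the original column next to both filled versions
comparison = pd.DataFrame({
    "original": df["Column_1"],
    "ffill": df["Column_1"].ffill(),
    "bfill": df["Column_1"].bfill(),
})
print(comparison)

# A missing value at the very end has no later row to copy from,
# so bfill() leaves it missing
trailing = pd.Series([1.0, None]).bfill()
print(trailing)
```

In row 3, ffill() copies the 9 from the row above, while bfill() copies the 3 from the row below. In row 7, ffill() gives 3 and bfill() gives 8.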

How does the fillna() work in Pandas?

Apart from filling in the values of previous or following rows, we can also fill the missing values with values of our own. For example, in some cases, we may want to replace the missing values with zeros or with the mean of the column.

Let us see how we can apply the fillna() method to replace the missing values with the column mean.

# df copy
df4 = df.copy()

# find the mean of each column
mean = df4.mean()

# fill with mean
df4.fillna(mean, inplace=True)

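The mean() method in Pandas computes the mean of each column separately, skipping missing values, and the fillna() method replaces the missing values in each column with the mean of that column. In our example, the mean of the seven non-missing values is 6, so both missing values become 6.0.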
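
The mean is only one possible replacement value. The sketch below fills the same column with zero and with the median, and then shows one way to give each column its own fill value by passing a dictionary to fillna(). The second data frame, with its Age and City columns, is hypothetical and exists only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"Column_1": [3, 8, 9, None, 3, 8, 3, None, 8]})

# Replace missing values with a constant
zero_filled = df.fillna(0)

# Replace missing values with the column median,
# which is less sensitive to outliers than the mean
median_filled = df.fillna(df.median())

# Hypothetical data frame mixing a numeric and a text column
mixed = pd.DataFrame({"Age": [25, None, 31], "City": ["Oslo", None, "Lima"]})

# A dictionary maps each column name to its own fill value
filled_mixed = mixed.fillna({"Age": mixed["Age"].mean(), "City": "Unknown"})

print(zero_filled)
print(median_filled)
print(filled_mixed)
```

Note that the mean is taken from the Age column alone. Calling mean() on a whole data frame that contains text columns can raise an error in recent Pandas versions, so computing the statistic per numeric column is the safer habit.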

Conclusion

In this short lesson, we discussed the most basic preprocessing technique: handling missing values. We can either use the methods mentioned above to handle missing values or use machine learning models to predict the missing values for us. Here, we looked at four different methods that can be used to handle missing values depending on the data frame. 
