Encoding Methods in Data Science Using Pandas

In general, we cannot apply machine learning models directly to a dataset that contains non-numeric values. For example, suppose a dataset has a column holding the names of various colors. Models rely on mathematical formulae to produce their outputs, so they cannot work with such a column as it is. In such cases, we have to convert the non-numeric columns to numeric ones, and this process is known as encoding.

What is the Encoding Method?

Encoding is the conversion of non-numeric (categorical) values to numeric ones. There are various encoding methods. In this lesson, we are going to cover the three most commonly used ones:

Label Encoding
Dummy Variable / One-hot Encoding
Replace method

Let us first create a data frame that contains some non-numeric values.

# create a small sample dataset
import pandas as pd

data = {"Grade": ["A", "A", "B", "C", "A", "D", "D"]}

df = pd.DataFrame(data)

To find out whether the dataset has non-numeric values, we can look at the general information about the data frame. In pandas, the info() method reports the data type of each column.

# check whether the dataset has non-numeric values
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Grade   7 non-null      object
dtypes: object(1)
memory usage: 184.0+ bytes

This output shows the data type of each column. In our case, the Dtype of the Grade column is object, which here means that the column holds strings rather than numbers. Most models cannot be trained directly on object columns, so as data scientists we convert them to numeric columns before training.
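When a data frame has many columns, reading the info() output by eye becomes tedious. The following is a minimal sketch of listing the non-numeric columns programmatically with select_dtypes(). The Score column is a hypothetical addition, included only so that the data frame has one numeric column to contrast with Grade.

```python
import pandas as pd

# Same Grade values as in the lesson; Score is a hypothetical numeric column
df = pd.DataFrame({
    "Grade": ["A", "A", "B", "C", "A", "D", "D"],
    "Score": [91, 95, 82, 74, 90, 61, 65],
})

# Keep every column whose dtype is not numeric and collect the names
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

print(non_numeric)
```

The resulting list contains the columns that still need to be encoded.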

Label Encoding in Python

Label encoding is one of the most commonly used encoding methods. It assigns a unique integer, starting from zero, to every category in the column. For example, the first category gets the value 0, the next one gets 1, and so on.

For example, take a column with the three colors red, green, and blue. scikit-learn's LabelEncoder assigns the integers in alphabetical order of the category names, so blue is assigned 0, green is assigned 1, and red is assigned 2.

One drawback of label encoding is that it imposes an artificial ranking on the categories. Imagine we have 12 different colors: the first color gets the value 0 and the last color gets the value 11. When we later train a machine learning model, the model might be biased toward some colors simply because they have higher or lower numeric values, even though the colors have no natural order.

To implement label encoding, we import LabelEncoder from the preprocessing module of scikit-learn (sklearn).

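# copy the original dataset so that df stays unchanged
df1 = df.copy()

# import the encoder from scikit-learn
from sklearn.preprocessing import LabelEncoder

# initialize the encoder
label = LabelEncoder()

# fit the encoder on the Grade column and replace it with integer codes
df1['Grade'] = label.fit_transform(df1['Grade'])

# display the encoded data frame
df1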

Once you get the output, you will notice that the values in the Grade column have been replaced by integers starting from 0.

In the example above, the fit_transform() method takes the column with the non-numeric values, learns its categories, and transforms the values to numeric ones, which we then store back in the same column.
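It is often useful to see which integer was assigned to which category. A minimal sketch, assuming the same Grade data as above: a fitted LabelEncoder keeps its sorted categories in the classes_ attribute, and inverse_transform() converts the codes back to the original labels. The variable names mapping and restored are chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Grade": ["A", "A", "B", "C", "A", "D", "D"]})

label = LabelEncoder()
codes = label.fit_transform(df["Grade"])

# classes_ holds the categories in sorted order; position i is encoded as i
mapping = {str(category): code for code, category in enumerate(label.classes_)}
print(mapping)

# Convert the integer codes back to the original labels
restored = label.inverse_transform(codes)
print(list(restored))
```

Keeping the fitted encoder around lets you decode a model's numeric output back into readable categories later.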

Dummy Variable in Pandas

The dummy variable (one-hot encoding) method avoids the artificial ranking that label encoding introduces. Unlike the Label Encoder, which assigns a unique integer value to every category, one-hot encoding creates a separate column for every category and puts either 0 or 1 in it.

One hot encoding method

Consider again a column holding three different colors. When one-hot encoding is applied to this column, it generates a separate column for every color. If the first value in the original data is red, then in the encoded data only the red column is 1 in that row, and the rest of the columns are 0. In the same way, if the next value is blue, then only the blue column is 1 in that row, and the rest are 0.

One limitation of this method is that it increases the number of columns (dimensions) in the dataset, which can increase the training time of a machine learning model, especially when a column has many distinct categories.

To implement one-hot encoding, you can use either scikit-learn or pandas. In this case, we will be using pandas.

# copy the dataset
df2 = df.copy()

# apply the dummy variables
df2 = pd.get_dummies(df2, columns=['Grade'])

df2

Once you print the data frame after applying one-hot encoding, you will see that the Grade column has been replaced by one new column per category: Grade_A, Grade_B, Grade_C, and Grade_D.
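In recent pandas versions (2.0 and later), get_dummies() returns True/False columns by default instead of 0/1. The sketch below, assuming the same Grade data, passes dtype=int to get integer columns and checks that every row has exactly one column set to 1. The variable name encoded is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Grade": ["A", "A", "B", "C", "A", "D", "D"]})

# dtype=int makes the dummy columns hold 0/1 rather than False/True
encoded = pd.get_dummies(df, columns=["Grade"], dtype=int)

print(encoded)

# Each row belongs to exactly one category, so each row sums to 1
print(encoded.sum(axis=1).tolist())
```

The row sums are a quick sanity check that the encoding is truly one-hot.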

If you have more than one column with non-numeric values, simply list all of them when calling the function.

# apply the dummy variables on multiple columns
# (assumes a data frame that contains the columns Grade, Grade2 and Grade3;
#  our sample data frame only has Grade)
df2 = pd.get_dummies(df2, columns=['Grade', 'Grade2', 'Grade3'])
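Since the sample data frame in this lesson has only one column, here is a small runnable sketch of encoding several columns at once. The data frame df_multi and its Section and Score columns are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical data frame with two categorical columns and one numeric column
df_multi = pd.DataFrame({
    "Grade": ["A", "B", "A"],
    "Section": ["X", "Y", "Y"],
    "Score": [90, 80, 85],
})

# Only the listed columns are encoded; Score is left untouched
encoded_multi = pd.get_dummies(df_multi, columns=["Grade", "Section"], dtype=int)

print(encoded_multi)
```

Each new column is named after its source column and category, for example Grade_A or Section_Y, so the origin of every dummy column stays clear.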

It is highly recommended to apply one-hot encoding only to the input dataset. We will discuss this point in detail once we start the machine learning section.

Replace method for Encoding

Now assume that instead of assigning integers starting from 0 or creating a column per category, we want each category to be represented by a value that we define ourselves. In this case, let us say that we want Grade A to be represented by 90, Grade B by 80, and so on.

# copy dataset
df3 = df.copy()

# create a nested dictionary: {column name: {category: numeric value}}
encoded = {'Grade': {'A': 90, 'B': 80, 'C': 70, 'D': 60}}

# replace() returns a new data frame, so assign the result back
df3 = df3.replace(encoded)

df3

In this case, we first have to define a nested dictionary, which is a dictionary that has another dictionary inside it. The outer key is the column name, and the inner dictionary pairs each non-numeric value with a numeric value.
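For a single column, a related option is the Series.map() method, which takes a flat dictionary instead of a nested one. This is an alternative to replace(), not the method used above; the sketch assumes the same Grade data, and the name grade_points is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Grade": ["A", "A", "B", "C", "A", "D", "D"]})

# Flat dictionary: category -> numeric value chosen by us
grade_points = {"A": 90, "B": 80, "C": 70, "D": 60}

df3 = df.copy()

# map() looks up every value of the column in the dictionary
df3["Grade"] = df3["Grade"].map(grade_points)

print(df3["Grade"].tolist())
```

One difference to keep in mind: with map(), any category that is missing from the dictionary becomes NaN, whereas replace() leaves unlisted values unchanged.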

Summary

In this lesson, we discussed three encoding methods in Python (label encoding, dummy variables / one-hot encoding, and the replace method) that can be applied to a dataset to convert non-numeric categorical values to numeric ones.
