One-hot encoding is the process of converting categorical values in a dataset into numeric values so that a machine learning model can process them, and the Sklearn one hot encoder is the scikit-learn class that performs it. This step is part of data preprocessing. In this article, we will learn how to use the Sklearn one hot encoder to convert categorical values to numeric values by working through several examples. By the end of this article, you will know:
- What is one hot encoder and why encoding is important in Machine Learning?
- How to use one hot encoder to encode categorical values?
- How to use sklearn one hot encoder to encode multiple columns?
What is Sklearn Module?
Sklearn, also known as Scikit-learn, is one of the most widely used libraries for machine learning in Python. It contains efficient tools for machine learning and statistical modeling, including classification (KNN, SVM, decision trees), regression (linear regression, random forest), anomaly detection (isolation forest), clustering (k-means clustering), and dimensionality reduction (PCA). It is built on top of the Python numerical and scientific libraries NumPy and SciPy.
It also provides utilities for preparing data, including data splitting, data encoding, and many more. In this article, we will focus on only one encoding method, which is one-hot encoding.
How Does Sklearn One Hot Encoder Work?
As discussed earlier, the sklearn one hot encoder converts categorical values into numeric values. The one-hot encoding technique creates one additional feature for each unique value in the categorical feature. For this reason, one-hot encoding is also known as the process of creating dummy variables.
For example, assume that a column has two categorical values (Male and Female). When we apply the sklearn one hot encoder, it creates two new columns, one for Male and one for Female. A row gets the value 1 in the Male column and 0 in the Female column if the person is male, and the reverse if the person is female.
In general, the sklearn one hot encoder creates as many new columns as there are categories and fills them with ones and zeros. These numeric values can be processed directly by machine learning algorithms.
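The Male and Female example described above can be sketched in a few lines. This is a minimal illustration on a made-up DataFrame (the `people` frame and its `gender` column are assumptions, not part of the article's dataset). Note that the encoder sorts the categories, so the Female column comes first.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not taken from the article's dataset
people = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

encoder = OneHotEncoder()
# fit_transform returns a SciPy sparse matrix; toarray() makes it dense
encoded = encoder.fit_transform(people[["gender"]]).toarray()

# Categories are sorted, so column 0 is Female and column 1 is Male
print(encoder.categories_)
print(encoded)
```

Each row contains exactly one 1, placed in the column of that row's category.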
The main reason to use the sklearn one hot encoder is that many machine learning models cannot work with non-numeric values directly. The next section demonstrates this with a dataset that has categorical values in the output.
Why Use Sklearn One Hot Encoder?
An encoder in machine learning converts non-numeric values to numeric ones. If we feed a machine learning model a dataset that has non-numeric values, we will often get an error. Let us see this error using an example.
# importing pandas
import pandas as pd
# importing dataset
data = pd.read_excel('Label_Encoding.xlsx')
# heading of data
data.head()
Output:
Age Marrige_Status
0 21 Yes
1 18 Yes
2 20 Yes
3 65 Yes
4 18 Yes
As you can see, the Marrige_Status column has categorical values (the column name is spelled this way in the dataset). If we train a model on this data as it is, we will get an error because of those categorical values. For example, let us apply the xgboost algorithm to the given dataset, using Marrige_Status as the output.
# dividing the dataset
X = data.drop('Marrige_Status', axis=1)
y = data['Marrige_Status']
# importing the xgboost module
import xgboost as xgb
# Default parameters
xgboost_clf = xgb.XGBClassifier()
# training the model
xgboost_clf.fit(X,y)
Output:
ValueError: Invalid classes inferred from unique values of `y`. Expected: [0 1], got ['No' 'Yes']
We get an error because the model expects numeric class labels (0 and 1) but received the strings 'No' and 'Yes'. That is why it is necessary to encode the categorical values before applying machine learning models.
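A comparable failure can be reproduced with scikit-learn alone when a feature column holds strings. The sketch below is an illustration under assumptions: the four-row DataFrame is made up, and DecisionTreeClassifier stands in for the xgboost model used above. It shows the fit failing on raw strings and succeeding after one-hot encoding.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical feature and a numeric target
X = pd.DataFrame({"Marrige_Status": ["Yes", "No", "Yes", "No"]})
y = [1, 0, 1, 0]

# Fitting on the raw strings fails because they cannot be cast to floats
try:
    DecisionTreeClassifier().fit(X, y)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True

# After one-hot encoding, the same kind of model trains without error
X_encoded = OneHotEncoder().fit_transform(X).toarray()
model = DecisionTreeClassifier(random_state=0).fit(X_encoded, y)
predictions = model.predict(X_encoded)

print(raw_fit_failed)
print(predictions)
```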
Examples of Sklearn One Hot Encoder
Now we will work through several examples and learn how to apply the sklearn one hot encoder in Python to convert categorical values into numeric values. First, make sure that you have installed the sklearn module on your system. You can install it with the pip command; the package is published under the name scikit-learn. Once it is installed, you can check the version as shown below.
# importing sklearn module
import sklearn
# version checking
print(sklearn.__version__)
Output:
1.1.2
In my case, I have sklearn version 1.1.2 installed on my system.
Sklearn One Hot Encoder
Let us first import the dataset and then print the first few rows to get familiar with it.
# importing pandas
import pandas as pd
# importing dataset
data = pd.read_excel('Label_Encoding.xlsx')
# heading of data
data.head()
We have categorical values. Now let us import the sklearn one hot encoder and encode the categorical values into numeric ones.
# importing sklearn one hot encoding
from sklearn.preprocessing import OneHotEncoder
# initializing one hot encoding
encoding = OneHotEncoder()
# applying one hot encoding in python
transformed_data = encoding.fit_transform(data[['Marrige_Status']])
# printing the encoded values as a dense array
print(transformed_data.toarray())
Output:
[[0. 1.]
[0. 1.]
[0. 1.]
[0. 1.]
[0. 1.]
[0. 1.]
[0. 1.]
[0. 1.]
[1. 0.]
[1. 0.]
[1. 0.]
[0. 1.]
[1. 0.]
[1. 0.]
[1. 0.]]
The Marrige_Status column has been converted into binary values by the sklearn one-hot encoding method. We have only two columns because there were only two categories in the original column. The first column corresponds to 'No' and the second to 'Yes', as shown below:
# Getting one hot encoding categories
print(encoding.categories_)
Output:
[array(['No', 'Yes'], dtype=object)]
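The call to toarray() in the example above is needed because, with default settings, the encoder returns a SciPy sparse matrix rather than a NumPy array. The following sketch checks this on a made-up three-row column (the `status` frame is an assumption for illustration). The encoder also has a constructor option to return a dense array directly, but its name differs between scikit-learn versions (sparse in older releases, sparse_output from 1.2), so the sketch sticks to toarray().

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column reusing the article's column name
status = pd.DataFrame({"Marrige_Status": ["Yes", "No", "Yes"]})

encoded = OneHotEncoder().fit_transform(status)

# With default settings the result is a sparse matrix
is_sparse = sparse.issparse(encoded)
print(is_sparse)
print(encoded.shape)

# toarray() converts it to a regular dense NumPy array
dense = encoded.toarray()
print(dense)
```

Sparse storage only keeps the non-zero entries, which saves memory when a column has many categories.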
Now, let us add the encoded part back to our dataset and print it.
# adding the encoded values
data[encoding.categories_[0]] = transformed_data.toarray()
# deleting the uncoded one
data.drop('Marrige_Status', axis=1, inplace=True)
# data heading
data.head()
Output:
Age No Yes
0 21 0.0 1.0
1 18 0.0 1.0
2 20 0.0 1.0
3 65 0.0 1.0
4 18 0.0 1.0
Now, the data has been converted into numeric values.
One Hot Encoding on Multiple Columns
So far, we have used the one hot encoding method to encode only one column. Now let us use the sklearn one hot encoder to convert multiple columns of a dataset.
We will use a built-in dataset from the Seaborn module.
# importing the dataset loader
from seaborn import load_dataset
# loading dataset
data = load_dataset('penguins')
# heading
data.head()
We have multiple columns with different categories. Now we will apply one hot encoding method on multiple columns.
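# importing the column transformer helper and the encoder
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
# taking only the columns we need
data = data[['island', 'sex', 'body_mass_g']]
# dropping any null values
data = data.dropna()
# encoding multiple columns and passing the remaining column through unchanged
transformer = make_column_transformer(
(OneHotEncoder(), ['island', 'sex']),
remainder='passthrough')
# transforming
transformed = transformer.fit_transform(data)
# converting back to a DataFrame (get_feature_names_out needs scikit-learn 1.0 or newer)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
# head
transformed_df.head()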
We have successfully encoded the multiple columns using Sklearn one hot encoder method.
Summary
One-hot encoding in machine learning is the conversion of categorical information into a numeric format that can be fed into machine learning algorithms. It is a common method for processing categorical data. In this short article, we learned how to use the Sklearn one hot encoder to convert categorical values into numeric values, both for a single column and for multiple columns.