Sklearn LabelEncoder Examples in Machine Learning

Sklearn LabelEncoder converts categorical values into numeric values so that machine learning models can work with the data and find hidden patterns. There are various approaches to categorical encoding, and Sklearn LabelEncoder is one of them. In this short article, we will learn how Sklearn LabelEncoder works through several examples. We will also look at how it compares with the Sklearn one-hot encoder.

What is the Sklearn Module?

Sklearn, also known as scikit-learn, is one of the most widely used Python libraries for machine learning. It contains efficient tools for machine learning and statistical modeling, including classification (KNN, SVM, decision trees), regression (linear regression, random forest), anomaly detection (isolation forest), clustering (k-means), and dimensionality reduction (PCA). It is built on top of Python numerical and scientific libraries such as NumPy and SciPy.

More importantly, it has various tools for data preprocessing, including data splitting, data encoding, and many more. In this article, we will focus on only one encoding method: label encoding.

How Does Sklearn LabelEncoder Work?

Sklearn LabelEncoder converts labels or categories into integers so that the machine learning model can work with the dataset. It sorts the distinct categories and assigns each one an integer from 0 to n_classes - 1.

[Figure: labelencoder]

Every occurrence of the same category receives the same integer value.
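
The mapping described above can be sketched in a few lines. This is a minimal illustration, not taken from the original dataset: the `cities` list is a made-up example, while `fit`, `transform`, `inverse_transform`, and the `classes_` attribute are part of the LabelEncoder API.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values, used only for illustration
cities = ["Paris", "London", "Tokyo", "London", "Paris"]

encoder = LabelEncoder()
encoder.fit(cities)

# classes_ holds the sorted unique categories; the position of each
# category in this array is the integer assigned to it
print(encoder.classes_)      # ['London' 'Paris' 'Tokyo']

encoded = encoder.transform(cities)
print(encoded)               # [1 0 2 0 1]

# inverse_transform maps the integers back to the original categories
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

Because the categories are sorted before integers are assigned, the order in which values appear in the data does not affect the result.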

What are the limitations of Label Encoding?

Most machine learning models in Sklearn cannot be trained on a dataset that has non-numeric values, so categorical columns have to be encoded first. Among the encoding methods, label encoding is popular and commonly used. However, it has limitations. The biggest drawback is that it can bias the model. For example, if we have 10 different categories, the label encoder assigns each one a unique integer from 0 to 9. This imposes a ranking on the categories even when they have no natural order. A model that treats its inputs as numbers, such as linear regression, may read a category with a higher integer as larger or more important than one with a lower integer, so there is a risk of the model being biased. For this reason, the scikit-learn documentation describes LabelEncoder as a tool for encoding target labels (y) rather than input features; for unordered input features, one-hot encoding is the usual alternative.
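
To make the ranking problem concrete, here is a small sketch comparing label encoding with one-hot encoding on the same values. The `colors` array is a hypothetical example; `OneHotEncoder` is the Sklearn one-hot encoder, which expects a 2D input and returns a sparse matrix by default, so the sketch reshapes the data and calls `toarray()`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categories with no natural order
colors = np.array(["red", "green", "blue", "green"])

# Label encoding: one integer per category (sorted: blue=0, green=1, red=2)
label_codes = LabelEncoder().fit_transform(colors)
print(label_codes)   # [2 1 0 1]

# One-hot encoding: one binary column per category, so no category
# is numerically "larger" than another
one_hot = OneHotEncoder().fit_transform(colors.reshape(-1, 1)).toarray()
print(one_hot)
```

With label encoding, "red" (2) looks twice as large as "green" (1), which means nothing for colors. The one-hot version avoids this at the cost of one extra column per category.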

Examples of Sklearn LabelEncoder

Now, we will work through some examples of the Sklearn label encoder. Here is what we are going to cover in this section:

  • Sklearn label encoding one column
  • Sklearn label encoding multiple columns

Sklearn Label Encoding on One Column

Let us first import the dataset and then use the sklearn label encoding to convert categorical values to numeric ones.

# importing pandas
import pandas as pd
# importing dataset
data = pd.read_excel('Label_Encoding.xlsx')
# heading of data
data.head()

Output:

Age	Marrige_Status
0	21	Yes
1	18	Yes
2	20	Yes
3	65	Yes
4	18	Yes

The Marrige_Status column is categorical. Now, we will use the Sklearn LabelEncoder to convert its values into numeric values.

# importing the sklearn preprocessing module
from sklearn import preprocessing

# initializing sklearn LabelEncoder
label_encoder = preprocessing.LabelEncoder()

# encoding the Marrige_Status column
data['Marrige_Status'] = label_encoder.fit_transform(data['Marrige_Status'])
# printing the unique encoded values
data['Marrige_Status'].unique()

Output:

array([1, 0])

As you can see, there are only numeric values in the output column.
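
Since the Excel file is not included here, the following sketch reproduces the same step on a small in-memory DataFrame and shows how to check which integer each category received. The `df` values and the `mapping` dictionary are assumptions made for illustration; only the column names follow the dataset above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the Excel file; the values are made up
df = pd.DataFrame({"Age": [21, 18, 40, 65],
                   "Marrige_Status": ["Yes", "No", "Yes", "No"]})

label_encoder = LabelEncoder()
df["Marrige_Status"] = label_encoder.fit_transform(df["Marrige_Status"])

# Build a category -> integer lookup from the fitted encoder
mapping = {str(c): i for i, c in enumerate(label_encoder.classes_)}
print(mapping)   # {'No': 0, 'Yes': 1}
```

Keeping the fitted `label_encoder` around is useful, because the same object is needed later to decode predictions or to encode new data consistently.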

Sklearn Label Encoding Multiple Columns

Encoding multiple columns in Sklearn is very similar to encoding a single column. We just need to specify the names of all the columns.

data = pd.DataFrame({"Names": ['B', 'A', 'S'], 
                    "Grade": ['A', 'B', 'A']})

data.head()

Output:

Names	Grade
0	B	A
1	A	B
2	S	A

There are two columns with categorical values. Let us now apply the Sklearn label encoder to convert the data into numeric values.

You can either select the columns and encode them in one step, or use a for loop. Here is the first way of applying the label encoding method to multiple columns:

# import the encoder class
from sklearn.preprocessing import LabelEncoder

# select the columns
cols = ['Names', 'Grade']

# initialize the encoder
encoder = LabelEncoder()

# fit and transform each column separately
data[cols] = data[cols].apply(encoder.fit_transform)

data.head()

Output:

Names	Grade
0	1	0
1	0	1
2	2	0

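Another way to encode multiple columns is to use a for loop and iterate through the columns one by one. We have to make sure that we only iterate through the columns with the object data type. If the columns were already encoded in the previous step, re-create the DataFrame first, because the encoded columns are no longer of the object type.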

# selecting the names of the columns with the object dtype
cols = data.select_dtypes('object').columns

# going through each object column
for i in cols:
    # the encoder is re-fitted on every column
    data[i] = encoder.fit_transform(data[i])

data.head()

Output:

Names	Grade
0	1	0
1	0	1
2	2	0
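
Both approaches above reuse one encoder, so after the last column is processed it only remembers that column's categories. If the original values need to be recovered later, one option is to keep a separate encoder per column. The sketch below assumes this design; the `encoders` dictionary is a hypothetical helper, not a Sklearn feature.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({"Names": ["B", "A", "S"],
                     "Grade": ["A", "B", "A"]})

# One fitted encoder per column, stored under the column name
encoders = {}
for col in ["Names", "Grade"]:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])

print(data)

# Each column can be decoded with its own encoder
original_names = encoders["Names"].inverse_transform(data["Names"])
print(original_names.tolist())   # ['B', 'A', 'S']
```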

Summary

Label encoding assigns each category an integer based on the sorted (alphabetical) order of the categories. In this short article, we learned how to use the Sklearn label encoder to convert categorical values to numeric ones, on a single column and on multiple columns. If you have any specific questions related to the label encoding method in Sklearn, please let us know through the comments.
