Sklearn StandardScaler With Examples

Sklearn's StandardScaler converts numeric data to a standard scale, which makes it easier for a machine learning model to analyze. Machine learning models have been observed to perform better when the data is scaled to a specific range, especially algorithms that are sensitive to the magnitude of the input values, such as linear regression, KNN, and logistic regression. In this short article, we will learn how to use the sklearn StandardScaler to convert data to a standard scale. Moreover, we will also learn why it is important to scale the data before training a model.

Introduction to Sklearn StandardScaler

Before going into the sklearn StandardScaler, let us first understand the concept of scaling. In machine learning, scaling simply means normalizing the range of the features in a dataset. A dataset can contain features of very different dimensions and scales, which can affect the training process of a model: a model trained on unscaled data can produce biased outcomes. So, it is always important to scale the data to a specific range before applying any machine learning model.

Sklearn StandardScaler is one of the scaling methods; it scales the data in a standard way (zero mean, unit variance) and makes it suitable for machine learning models.

If you compare the data before and after scaling, you will see that before scaling the features were spread over widely different ranges, while after scaling the data is clustered in a narrow range around zero.

What Are Numeric Data Scaling Methods?

The two most widely used scaling techniques are normalization and standardization. In normalization, each data point is scaled to the range 0-1, while standardization scales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation, shifting the distribution to have a mean of zero and a standard deviation of one.

Normalization uses the following equation:

y = (x – min) / (max – min)

The min and max are the minimum and maximum values in the dataset.

Standardizing a dataset involves rescaling the distribution of values so that the mean of the observed values is 0 and the standard deviation is 1. Standardization assumes that the data roughly follows a normal distribution with a well-defined mean and standard deviation.

It uses the following equation:

y = (x – mean)/standard deviation

Where the mean and standard deviation are calculated as follows:

mean = sum(x) / count(x)
std = sqrt( sum( (x – mean)^2 ) / count(x) )
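
To make the two formulas concrete, here is a minimal sketch that applies both by hand to a tiny NumPy array (the sample values are made up for illustration):

# importing numpy
import numpy as np

# a few made-up sample values
x = np.array([4.0, 8.0, 34.0, 100.0])

# normalization (min-max scaling): maps the values into the range 0-1
x_norm = (x - x.min()) / (x.max() - x.min())

# standardization: subtract the mean, divide by the standard deviation
x_std = (x - x.mean()) / x.std()

print(x_norm)  # all values now fall between 0 and 1
print(x_std)   # mean is 0, standard deviation is 1

Note that np.std divides by count(x) by default, which matches the formula above and what StandardScaler computes internally.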

So far we have covered the theoretical part of the sklearn StandardScaler; now it is time to jump into the practical part and implement it. Also, check out K-means Clustering in Python Visualization and Implementation.

Examples of Sklearn StandardScaler

In this section, we will walk through various examples of the sklearn StandardScaler and scale our data to a specific range. Before starting, make sure that you have installed the following Python libraries, as we will be using them throughout.

pip install scikit-learn
pip install pandas
pip install numpy
pip install matplotlib

You can install the modules using the pip command.

Sklearn StandardScaler on a Simple Dataset

First, let us create a simple dataset.

# importing the asarray helper from numpy
from numpy import asarray

# creating the dataset
data = asarray([[100, 0.001],
                [8, 0.05],
                [50, 0.005],
                [88, 0.07],
                [4, 0.1],
                [35, 1],
                [45, 0.006],
                [34, 0.3]])

As you can see, we have a dataset with two columns, where the first column has higher values and the second column has lower values. Let us visualize the data through a box plot to see the distribution.

# importing seaborn module
import seaborn as sns

# plotting box plot
sns.boxplot(data=data)

There is a huge difference in the distribution of the two columns. Now, let us apply the sklearn StandardScaler and scale the dataset.

# importing sklearn standardscaler
from sklearn.preprocessing import StandardScaler

# define standard scaler
scaler = StandardScaler()

# transform data
scaled = scaler.fit_transform(data)

# plotting the data
sns.boxplot(data=scaled)

After scaling, the data is much easier to compare: both columns are now on the same scale.
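
As a side note, the fitted scaler keeps the statistics it learned from the data, and the transformation can be undone. A quick sketch, reusing the scaler and scaled array from above:

# the per-column statistics learned during fit
print(scaler.mean_)   # per-column means
print(scaler.scale_)  # per-column standard deviations

# undo the scaling to recover the original values
restored = scaler.inverse_transform(scaled)
print(restored)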

Sklearn StandardScaler on One Column

In the first example, we applied the sklearn StandardScaler to the whole dataset. In this section, we will learn how to scale a specific column.

We will take the same dataset and apply the sklearn StandardScaler to the very first column.

# importing sklearn standardscaler
from sklearn.preprocessing import StandardScaler

# define standard scaler
scaler = StandardScaler()

# transform data
scaled = scaler.fit_transform(data[:, :1])

# plotting the data
sns.boxplot(data=scaled)

The column is now centered around zero with unit variance, with values falling roughly between -1.3 and 1.7, while the original values ranged from 4 to 100.
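
One practical caveat worth mentioning: in a real workflow, the scaler should be fitted on the training data only and then reused to transform the test data, so that no information from the test set leaks into the model. A minimal sketch, assuming a hypothetical train/test split of the dataset from above:

# importing the train/test split helper
from sklearn.model_selection import train_test_split

# hypothetical split of the dataset defined earlier
X_train, X_test = train_test_split(data, test_size=0.25, random_state=42)

# fit on the training data only, then reuse the fitted statistics
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)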

Sklearn StandardScaler on a DataFrame

So far we have scaled datasets that we created on our own. This time we will read a dataset from an external file and then scale it. If your data contains non-numeric values, it is highly recommended to apply encoding methods first.

Let us first import the dataset.

# importing pandas
import pandas as pd

# importing dataset
data = pd.read_excel('data.xlsx')

# plotting box plot with seaborn
sns.boxplot(data=data)
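
Note that data.xlsx is a local file; if you don't have an equivalent file at hand, a small made-up DataFrame with an Age column (using the same values that appear in the error trace below) works just as well for following along:

# a substitute for data.xlsx, for readers without the file
data = pd.DataFrame({'Age': [21, 18, 20, 65, 18, 24, 45, 35,
                             23, 32, 34, 31, 43, 32, 20]})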

Most of the data is in the range of 20-45. Now, let us apply standard scaling and see the result.

If you directly scale the Series object, you will get the following error.

# define standard scaler
scaler = StandardScaler()

# transform data
scaled = scaler.fit_transform(data['Age'])

# plotting the data
sns.boxplot(data=scaled)

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_239177/163014125.py in <module>
      6 
      7 # transform data
----> 8 scaled = scaler.fit_transform(data['Age'])
      9 
     10 # plotting the data

~/.local/lib/python3.10/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    865         if y is None:
    866             # fit method of arity 1 (unsupervised transformation)
--> 867             return self.fit(X, **fit_params).transform(X)
    868         else:
    869             # fit method of arity 2 (supervised transformation)

~/.local/lib/python3.10/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y, sample_weight)
    807         # Reset internal state before fitting
    808         self._reset()
--> 809         return self.partial_fit(X, y, sample_weight)
    810 
    811     def partial_fit(self, X, y=None, sample_weight=None):

~/.local/lib/python3.10/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y, sample_weight)
    842         """
    843         first_call = not hasattr(self, "n_samples_seen_")
--> 844         X = self._validate_data(
    845             X,
    846             accept_sparse=("csr", "csc"),

~/.local/lib/python3.10/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    575             raise ValueError("Validation should be done on X, y or both.")
    576         elif not no_val_X and no_val_y:
--> 577             X = check_array(X, input_name="X", **check_params)
    578             out = X
    579         elif no_val_X and not no_val_y:

~/.local/lib/python3.10/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    877             # If input is 1D raise error
    878             if array.ndim == 1:
--> 879                 raise ValueError(
    880                     "Expected 2D array, got 1D array instead:\narray={}.\n"
    881                     "Reshape your data either using array.reshape(-1, 1) if "

ValueError: Expected 2D array, got 1D array instead:
array=[21. 18. 20. 65. 18. 24. 45. 35. 23. 32. 34. 31. 43. 32. 20.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

To avoid such errors, we need to do some transformations as shown in the code below:

# importing numpy
import numpy as np

# define standard scaler
scaler = StandardScaler()

# transform data
scaled = scaler.fit_transform(np.array(data['Age']).reshape(-1, 1))

# plotting the data
sns.boxplot(data=scaled)

This time the column is scaled without any errors.
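
Alternatively, since StandardScaler simply needs a 2D input, you can avoid the NumPy conversion by selecting the column with double brackets, which returns a one-column DataFrame instead of a Series:

# selecting with double brackets keeps the input 2D
scaled = scaler.fit_transform(data[['Age']])

# plotting the data
sns.boxplot(data=scaled)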

Summary

Feature scaling is a method used to normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. In this short article, we learned how to use the sklearn StandardScaler to scale a dataset to a standard range through various examples.
