Looking for a way to visualize data using pandas? In this article, we will plot various graphs using the pandas module.
Pandas is an open-source library for Python. It provides ready-to-use, high-performance data structures and data analysis tools. The pandas module runs on top of NumPy and is widely used in data science and data analytics. It is also very useful for data visualization: its built-in plot() method, which uses Matplotlib behind the scenes, lets us turn complex data into simple and useful plots with a single call. In this article, we will discuss how to use pandas for data visualization by plotting various useful graphs.
Exploring Datasets Using Pandas
Exploring and preprocessing a dataset is an important step in data science and machine learning because it helps us understand the dataset and make it suitable for learning models. You can get access to the dataset and the source code from my GitHub account.
Let us first import the dataset from a CSV file using the pandas module.
# importing pandas module
import pandas as pd
# importing the dataset
data = pd.read_csv('house.csv')
# printing few rows of the dataset
data.head()
The dataset contains information about house prices, which depend on five different input variables.
As you can see, there are many null values in the dataset. Pandas provides a built-in method, dropna(), to remove them. Let us remove all the rows that contain null values from our dataset.
# removing null values
data.dropna(axis=0, inplace=True)
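To check what dropna() actually removes, it can help to count the missing values before and after the call. The following is a minimal sketch on a small made-up DataFrame that stands in for house.csv; the column names and values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for house.csv (column names are assumptions)
sample = pd.DataFrame({
    "price": [250000, np.nan, 410000, 325000],
    "area": [80.0, 95.0, np.nan, 110.0],
    "floor": [2, 5, 3, 1],
})

# counting missing values per column before dropping anything
missing_before = sample.isnull().sum()
print(missing_before)

# without inplace=True, dropna returns a new DataFrame and leaves the original untouched
cleaned = sample.dropna(axis=0)

# counting missing values and rows after dropping
print(cleaned.isnull().sum().sum())
print(len(sample), "rows before,", len(cleaned), "rows after")
```

Here axis=0 drops every row that has at least one missing value, so it is worth checking how many rows are lost before applying it to a real dataset.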
One of the useful methods of pandas is the info() method, which reports the column names, the number of non-null values, and the data type of each column. Let us now use this method to learn some useful details about the dataset.
Another useful method available in pandas is describe(), which returns the count, mean, standard deviation, minimum, maximum, and quartiles of each numeric attribute in the dataset.
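A minimal sketch of calling info() could look like the following. It uses a small made-up DataFrame in place of house.csv, so the column names and values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the house dataset (column names are assumptions)
frame = pd.DataFrame({
    "price": [250000.0, 410000.0, 325000.0],
    "floor": [2, 3, 1],
})

# info() prints the index range, column names, non-null counts, dtypes and memory usage
frame.info()
```

On the real dataset, the same call would be data.info().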
# describe function of pandas
data.describe()
As you can see, the describe function returns summary statistics for each numeric column of the dataset.
Data Visualization Using Pandas Module
Data visualization is one of the most important steps in the life cycle of data science and data analytics. A study or analysis is more convincing and engaging when we present it with the help of colors and graphics. With visualization elements like graphs, charts, and maps, it becomes easier for clients to understand the underlying structure, trends, patterns, and relationships among variables within the dataset.
There are various modules in Python that help us visualize data in different ways, for example heatmaps, 3D plots, and data plotted on a Google map, but these usually need specialized modules. If you know only pandas, you can still visualize your data through various plots, some of which we will discuss in this section.
Line plots in Pandas
A line plot connects consecutive data points with straight line segments, showing how a value changes along the index. It works best for ordered data and is especially useful for visualizing time series.
Visualizing line plots using the pandas module is very easy. We just need to call the plot() function. For example, see the line plot below where we will plot the line graph of the prices.
# plotting line plots using pandas
data['price'].plot(figsize=(10, 6), c='m')
As you can see, the height of the line shows the price of each house. We can also plot more than one variable using the line plot. See the example below:
# plotting multi-line plots using pandas
data.drop('price', axis=1).plot(figsize=(10, 6))
As you can see, we have plotted the different independent variables from the dataset on the same graph. The problem is that all the variables share the same scale, so we cannot actually see how the latitude and longitude are changing. We can plot them on separate axes using subplots.
# plotting each variable on its own subplot using pandas
data.drop('price', axis=1).plot(figsize=(10, 6), subplots=True)
This time, as you can see, each variable is plotted on its own subplot with its own y-axis scale.
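Since line plots are most useful for time series, here is a short sketch with a date index. The data is synthetic (a random walk generated with NumPy), so the values and the name "value" are assumptions made only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display; omit in a notebook
import numpy as np
import pandas as pd

# creating a synthetic daily time series (illustrative data only)
rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", periods=60, freq="D")
series = pd.Series(rng.normal(size=60).cumsum(), index=dates, name="value")

# plotting the series; pandas uses the date index for the x-axis
ax = series.plot(figsize=(10, 6), color="m", title="Synthetic daily series")
ax.set_ylabel("value")
```

When the index holds dates, pandas formats the x-axis tick labels as dates automatically.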
Bar Plots in Pandas
A bar plot shows the relationship between a numeric and a categorical variable. Each category is represented as a bar, and the length of the bar represents its numeric value. Let us now use the pandas module to plot bar charts. Because the dataset is large, we will first take a random 0.9% sample of it and plot the bar charts from that sample.
We will first plot the bar chart for the number of floors.
# taking a random 0.9% sample of the data
dataset = data.sample(frac = 0.009)
# plotting bar charts
dataset['floor'].plot(kind='bar', figsize=(10, 6))
As you can see, each bar on the x-axis corresponds to one sampled row, labeled by its index, and the height of the bar on the y-axis is the number of floors for that row.
We can also plot the bar chart for more than one variable at the same time. For example, let us plot the bar chart for the number of floors and the number of rooms together.
# copying the dataset
bar_plot = dataset.copy()
# dropping unwanted variables
bar_plot.drop(columns=['area', 'latitude', 'longitude', 'price'], inplace=True)
# plotting the bar chart
bar_plot.plot(kind='bar', figsize=(10, 6))
As you can see, the orange bars show the number of rooms and the blue bars show the number of floors. Another functionality of the bar plot is that we can also plot the stacked bar plots in pandas as shown below:
# plotting the bar chart
bar_plot.plot(kind='bar', figsize=(10, 6), stacked=True)
As you can see, we have plotted a stacked bar chart using pandas. We can also plot the same bar chart horizontally by using the barh function.
# plotting the bar chart
bar_plot.plot.barh(stacked=True, figsize=(6, 8))
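The bar charts above draw one bar per row. To show one bar per category instead, a common approach is to count the categories first with value_counts() and then plot the counts. The sketch below uses a small made-up series of floor numbers, so the values are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import pandas as pd

# hypothetical floor numbers for ten houses (illustrative data only)
floors = pd.Series([1, 2, 2, 3, 1, 2, 5, 3, 2, 1], name="floor")

# counting how many houses have each number of floors, ordered by floor number
counts = floors.value_counts().sort_index()
print(counts)

# plotting one bar per distinct floor number
ax = counts.plot(kind="bar", figsize=(10, 6))
ax.set_xlabel("number of floors")
ax.set_ylabel("number of houses")
```

With the sample from this article, the equivalent call would be dataset['floor'].value_counts().sort_index().plot(kind='bar').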
Histogram in Pandas
A histogram shows a frequency distribution: it groups the values of a variable into intervals (bins) and shows how many values fall into each one. It is one of the most commonly used graphs for showing frequency distributions. Let us now plot the histogram of the above dataset.
# copying the dataset
hist_plot = bar_plot.copy()
# plotting histogram chart
hist_plot.plot.hist(figsize=(10, 6))
As with the bar plots, we can also plot stacked histograms.
# plotting histogram chart
hist_plot.plot.hist(figsize=(10, 6), stacked=True)
We can also plot a cumulative histogram, in which each bar counts all the values up to and including its bin. Let us plot the cumulative histogram by passing cumulative=True.
# plotting cumulative histogram chart
hist_plot.plot.hist(figsize=(10, 6), stacked=True, cumulative=True)
Apart from the cumulative histogram chart, we can also plot the histogram chart for each of the columns separately.
# histogram for each of the columns
dataset.hist(figsize=(10, 6))
As you can see, we have successfully plotted histograms for each of the columns.
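The number of bins has a large effect on how a histogram looks. The following sketch, on synthetic data generated only for illustration, sets the bins parameter explicitly and uses alpha to make overlapping histograms readable.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import numpy as np
import pandas as pd

# creating two synthetic columns with different centers (illustrative data only)
rng = np.random.default_rng(1)
df_hist = pd.DataFrame({
    "A": rng.normal(loc=0.0, scale=1.0, size=500),
    "B": rng.normal(loc=2.0, scale=1.0, size=500),
})

# bins sets the number of intervals; alpha makes the overlapping bars semi-transparent
ax = df_hist.plot.hist(bins=30, alpha=0.5, figsize=(10, 6))
ax.set_xlabel("value")
```

Fewer bins give a smoother but coarser picture, while more bins reveal detail at the cost of noise.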
Area Plots in Pandas
An area chart combines the line chart and bar chart to show how one or more groups’ numeric values change over the progression of a second variable, typically that of time. An area chart is distinguished from a line chart by the addition of shading between lines and a baseline, like in a bar chart. Let us first plot the area plot of the price variable.
# plotting the area plot
dataset['price'].plot.area(figsize=(10, 6))
As you can see, the area under the line chart has been shaded. The plot is very irregular because the rows were sampled randomly and have no fixed trend. Let us now create a random dataset and visualize it using an area plot.
# importing the module
import numpy as np
# creating a dataset
df = pd.DataFrame(np.random.rand(20, 5), columns=['A', 'B', 'C', 'D', 'E'])
# plotting the area plot
df.plot.area(figsize=(10,6))
As you can see, the area plots are stacked by default. We can also plot them without stack as shown below:
# plotting area plot
df.plot.area(figsize=(10,6), stacked=False)
As you can see, this time the plots are not stacked.
Scatter Plots in Pandas
Scatter plots are used to plot data points on a horizontal and a vertical axis in an attempt to show how much one variable is affected by another. Let us now plot the scatter plot of the dataset using pandas. Unlike other plots, in scatter plots, we have to specify the variables on axes.
# plotting the scatter plot
data.plot.scatter(x='price', y='area', figsize=(10, 6), c='m')
Now, let us apply a little styling and visualize two different datasets on one plot.
# creating scatter plot
ax=data.plot.scatter(x="price", y="area", color="m", marker="*", s=50, figsize=(10, 6))
# adding one more scatter plot on the same graph
data.plot.scatter(x="price", y="latitude", color="g", s=100, ax=ax)
As you can see, the purple-colored dots show the relation between price and area while the green dots show the relation between the price and the latitude. Another way of coloring the data points is using any of the column values. For example, see below:
# plotting the scatter plot based on coloring
data.plot.scatter(x="area", y="price", c='floor', s=100, figsize=(10, 6))
As you can see, the darkest points correspond to the highest number of floors (20), and as the shade gets lighter, the number of floors decreases.
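The color scale used for the points can be changed with the colormap parameter. Below is a minimal sketch on a small made-up DataFrame; the column names mirror the ones used above, but the values are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import pandas as pd

# hypothetical houses (illustrative data only)
houses = pd.DataFrame({
    "area": [60, 75, 90, 110, 130, 150],
    "price": [180000, 210000, 260000, 320000, 390000, 450000],
    "floor": [1, 3, 5, 8, 12, 20],
})

# coloring the points by the 'floor' column with the viridis colormap
ax = houses.plot.scatter(x="area", y="price", c="floor",
                         colormap="viridis", s=100, figsize=(10, 6))
```

A color bar labeled with the column name is added next to the plot, so readers can map each color back to a number of floors.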
Box Plots in Pandas
A box plot is a graphical summary of data based on the minimum, first quartile, median, third quartile, and maximum. The term box plot comes from the fact that the graph looks like a rectangle with lines (whiskers) extending from the top and bottom. Pandas can also be used to visualize box plots. Let us first draw the box plot of the price variable.
# data visualization using pandas box plot
data['price'].plot.box(figsize=(10, 6))
All the points drawn beyond the ends of the whiskers (the horizontal caps) are treated as outliers. We can also plot the box plot for multiple variables.
# plotting the box plot
data.drop('price', axis=1).plot.box(figsize=(10, 6))
As you can see, we have plotted the box plot for multiple variables using pandas.
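To see which values a box plot would mark as outliers, we can reproduce the usual rule by hand. Matplotlib, which pandas uses for plotting, by default draws the whiskers to the most extreme points within 1.5 times the interquartile range (IQR) of the box. The sketch below applies that rule to a small made-up price series; the values are assumptions.

```python
import pandas as pd

# hypothetical prices in thousands (illustrative data only)
prices = pd.Series([200, 220, 250, 260, 270, 300, 310, 900], name="price")

# first and third quartiles and the interquartile range
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# limits beyond which points are drawn as outliers (1.5 * IQR rule)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# selecting the values outside the limits
outliers = prices[(prices < lower) | (prices > upper)]
print(q1, q3, iqr)          # 242.5 302.5 60.0
print(lower, upper)         # 152.5 392.5
print(outliers.tolist())    # [900]
```

Only the value 900 lies outside the limits, so it is the only point that would be drawn separately from the whiskers.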
Pie Charts in Pandas
Pie charts are useful when we have a small number of categorical values that we need to compare. The readability of pie charts goes way down with the slightest increase in the number of categorical values. Let us first create a Pie chart for the number of floors.
# plotting a pie chart
dataset['floor'].plot.pie(figsize=(10, 6))
Each wedge corresponds to one row of the sample, and its size is proportional to the number of floors in that row. We can also draw a separate pie chart for each column using subplots. Now, we will generate random values to visualize subplots.
# creating a DataFrame
Data = pd.DataFrame(np.random.rand(6, 3), columns=['A', 'B', 'C'])
# plotting pie chart subplots using pandas
Data.plot.pie(subplots=True, figsize=(10, 6))
As you can see, we have visualized subplots of pie.
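Pie charts are easiest to read when each wedge stands for a category rather than a single row. One way to get there is to count the categories with value_counts() and label the wedges with percentages through the autopct parameter. The floor numbers below are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import pandas as pd

# hypothetical floor numbers for eight houses (illustrative data only)
floor_values = pd.Series([2, 3, 3, 4, 2, 3, 5, 3], name="floor")

# counting the houses per floor number, ordered by floor number
floor_counts = floor_values.value_counts().sort_index()

# one wedge per distinct floor number, labeled with its share in percent
ax = floor_counts.plot.pie(autopct="%1.1f%%", figsize=(6, 6))
```

With four distinct values the chart stays readable; with many more categories, a bar chart is usually the better choice.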
Summary
Visualization is effective because it lets us see trends, patterns, and outliers that are hard to spot in raw numbers. In this article, we learned how to use the pandas module to explore a dataset and visualize it with line plots, bar plots, histograms, area plots, scatter plots, box plots, and pie charts.