Course Content
Data Reading using Python
As a data scientist we should be able to read different types of datasets. In this lesson, we will cover how to read dataset in different formats using Python and its various modules.
0/1
Data Preprocessing
In this section, we will cover various methods that can help us to clean the dataset and make it suitable for a machine learning model. In general, we will cover the basic methods that every data scientist should know. We will learn about encoding, outliers, null values, and hundling imbalance datasets. By the end of this section, you will be comfortable to preprocess the dataset and make it clean.
0/6
Project-1: Data Analysis Project
Welcome to the first project! This is going to be a very simple project which is about data analysis. Your task is to import the raw dataset and apply various methods to analyze and find the hidden trends from the dataset. We have already provided the solution as well, but you are recommended to try the project first by yourself and then look at the attached file for the solution.
0/1
Supervised Machine Learning
A machine learning is actually using some models to go through our dataset to find the trends and information from our data automatically. The machine learning can be superivsed or unsupervised. The supervised machine learning is when you have the target variable in our your dataset or you have a labeled dataset. The supervised machine learning will find the relation between the input data and the target variable and will use this relation to make prediction later. In this section, we will go through various kinds of supervised machine learning models and will analyze them.
0/4
Data Science and Machine Learning Using Python
About Lesson

As a data scientist, our main task is usually to deal with various kinds of datasets. This data can be in various formats from Excel to CSV and from pdf to JSON files. As a data scientist, we should know how to read different kinds of datasets. In this lesson, we will learn how we can read Excel sheets, CSV files, and JSON file using Python. 

Table of Contents

Pandas is an open-source module in Python that is mostly used for data analysis, filtering, preprocessing, and reading. Let us use this module to read different datasets. 


# import pandas module
import pandas as pd

Once the module is imported, we can then use it to read various datasets. 

Read CSV files in Pandas

In order to read a CSV file, we need to use the read_csv() function in pandas. This function takes the path to the CSV file and imports the content from the CSV file and stores in a variable. 

# import csv file in pandas
data = pd.read_csv("dataset.csv")

You can provide the name of the CSV file to the read_csv() function if the dataset and the Jupyter Notebook are located in the same directory. 

If the dataset is located in another folder, then you have to give an absolute or relative path to the file. An example of how the absolute path looks like in the Linux operating system. 

# import csv file in pandas
data = pd.read_csv("/home/Documents/DataScience/1. Data Reading/dataset.csv")

In some cases, the CSV files might not be separated by commas. If you have a CSV file and the separation had to be done using any other symbol or operator, then while reading the file, you have to mention the separation as well. 

# import csv file in pandas
data = pd.read_csv("dataset.csv", sep = ";)

We assumed that the separation of the columns in the CSV file was using semi-colons as we have specified. 

Read an Excel Sheet in Pandas

Excel sheets are the most common files that contain datasets in different sheets. The read_excel() function in pandas is used to import the content from an Excel sheet. 

# import excel sheets in pandas
data = pd.read_excel("dataset.excel")

In some cases, you may want to read only the specific sheet of the excel file. So, we have to tell pandas which sheet we want to import from the whole file. 

# import specific sheet from the excel file
data = pd.read_excel('data.xlsx', sheet_name='sheet 1')

Here we assumed that the name of the first sheet is sheet 1. To read multiple sheets from the Excel file, provide the names of the sheets as a list to the read_excel() function. 

# import multiple sheet from the excel file
data = pd.read_excel('data.xlsx', sheet_name=['sheet 1', 'sheet 2'])

If you are using pandas for the first time, then you may get an error while importing the Excel files. No module named openpyxl found error may occur. In such a case, it is required to install the openpyxl module on your system. 

# Install openpyxl if required
pip install openpyxl xlrd

It will take a while and the module will be installed on your system. 

Read a JSON file in Pandas

In pandas, we can use the read_json() function to import the content from a JSON file. 

# read dataset from json file
df = pd.read_json('dataseet.json')

In some cases, JSON file may come in various formats. Some of the possible JSON files format are:

  • Records
  • Columns
  • Index

In such cases, we are required to specify the type of the file while importing the content. 

# import json file in specific format
df = pd.read_json('data.json', orient='records')

In this case, we assumed that the JSON file is in records format. 

Final Thoughts

In this short lesson, we learned how we can read various kinds of datasets using Pandas. In our upcoming lessons, we will be using these methods to read different kinds of datasets. 

Exercise Files
1. Read dataset using Pandas.pdf
Size: 18.54 KB