As a data scientist, our main task is usually to deal with various kinds of datasets. This data can be in various formats from Excel to CSV and from pdf to JSON files. As a data scientist, we should know how to read different kinds of datasets. In this lesson, we will learn how we can read Excel sheets, CSV files, and JSON file using Python.
Pandas is an open-source module in Python that is mostly used for data analysis, filtering, preprocessing, and reading. Let us use this module to read different datasets.
# import pandas module
import pandas as pd
Once the module is imported, we can then use it to read various datasets.
Read CSV files in Pandas
In order to read a CSV file, we need to use the read_csv() function in pandas. This function takes the path to the CSV file and imports the content from the CSV file and stores in a variable.
# import csv file in pandas
data = pd.read_csv("dataset.csv")
You can provide the name of the CSV file to the read_csv() function if the dataset and the Jupyter Notebook are located in the same directory.
If the dataset is located in another folder, then you have to give an absolute or relative path to the file. An example of how the absolute path looks like in the Linux operating system.
# import csv file in pandas
data = pd.read_csv("/home/Documents/DataScience/1. Data Reading/dataset.csv")
In some cases, the CSV files might not be separated by commas. If you have a CSV file and the separation had to be done using any other symbol or operator, then while reading the file, you have to mention the separation as well.
# import csv file in pandas
data = pd.read_csv("dataset.csv", sep = ";)
We assumed that the separation of the columns in the CSV file was using semi-colons as we have specified.
Read an Excel Sheet in Pandas
Excel sheets are the most common files that contain datasets in different sheets. The read_excel() function in pandas is used to import the content from an Excel sheet.
# import excel sheets in pandas
data = pd.read_excel("dataset.excel")
In some cases, you may want to read only the specific sheet of the excel file. So, we have to tell pandas which sheet we want to import from the whole file.
# import specific sheet from the excel file
data = pd.read_excel('data.xlsx', sheet_name='sheet 1')
Here we assumed that the name of the first sheet is sheet 1. To read multiple sheets from the Excel file, provide the names of the sheets as a list to the read_excel() function.
# import multiple sheet from the excel file
data = pd.read_excel('data.xlsx', sheet_name=['sheet 1', 'sheet 2'])
If you are using pandas for the first time, then you may get an error while importing the Excel files. No module named openpyxl found error may occur. In such a case, it is required to install the openpyxl module on your system.
# Install openpyxl if required
pip install openpyxl xlrd
It will take a while and the module will be installed on your system.
Read a JSON file in Pandas
In pandas, we can use the read_json() function to import the content from a JSON file.
# read dataset from json file
df = pd.read_json('dataseet.json')
In some cases, JSON file may come in various formats. Some of the possible JSON files format are:
- Records
- Columns
- Index
In such cases, we are required to specify the type of the file while importing the content.
# import json file in specific format
df = pd.read_json('data.json', orient='records')
In this case, we assumed that the JSON file is in records format.
Final Thoughts
In this short lesson, we learned how we can read various kinds of datasets using Pandas. In our upcoming lessons, we will be using these methods to read different kinds of datasets.