Pandas Masterclass — Your Foundation To Data Science
Part 2: DataFrame operations on special file formats
Welcome back! In the previous post we gave a brief introduction to Pandas and showed how to use its functions to read various file types and extract data from DataFrames.
In this post, we shall discuss writing DataFrame content to a file and reading special data sources such as databases, JSON data and binary data, and then head over to sorting, filtering and grouping data.
1. Reading data from Database
We first have to be familiar with the process of handling databases in Python before we can see how to use Pandas functions to manipulate and make the best use of the data. We shall explore the process with SQLite, a lightweight, serverless, file-based database engine that comes pre-installed on macOS and Linux and is a very light package to install on Windows. (Python's `sqlite3` module ships with the standard library, so no separate driver is needed.) The basic steps involved are as follows.
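Before walking through each step, here is a minimal end-to-end sketch of the whole workflow. It uses SQLite's special `:memory:` name so the example is self-contained; substitute a filename such as `"myDb1.sqlite"` to persist the data to disk.

```python
import sqlite3

# Connect to an in-memory database (use a filename instead to persist to disk)
db = sqlite3.connect(":memory:")

# Create a table, insert a row with ? placeholders, and commit the changes
db.execute("CREATE TABLE test (name VARCHAR(50), age INTEGER)")
db.execute("INSERT INTO test VALUES (?, ?)", ("Raghu", 34))
db.commit()

# Query the table; rows come back as tuples
rows = db.execute("SELECT * FROM test").fetchall()
print(rows)  # [('Raghu', 34)]

db.close()
```

Each of these steps is explained in detail below.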
i) Importing the package that helps us execute the commands
Each database has its respective package with various pre-defined functions that help us process the data in the database. Here is the syntax to import the package for SQLite:
import sqlite3
ii) Creating a connection with the database
Then we establish a connection with the database. In a single program we can establish as many connections as needed, using a new variable name for each connection. The syntax is:
db = sqlite3.connect("myDb1.sqlite")
This tries to connect to the database `myDb1.sqlite` if it exists; otherwise it creates a new database `myDb1.sqlite`.
iii) Executing queries
Once we successfully connect to the database, we can execute any number of queries on it using the `execute` or `executemany` functions. This is best explained with examples.
#Creating a new table in the database
db.execute("CREATE TABLE test (name VARCHAR(50), age INTEGER)")
db.commit()
We are executing a table creation query and then committing the change to the database. This creates a new table named `test` having columns `name`, a variable-character field of length 50, and `age`, an integer.
We can use the same `execute` function to insert and retrieve data from the `test` table.
db.execute("INSERT INTO test values(?, ?)", ("Raghu", 34))
What we have to remember here is that the `?` placeholders are filled from a sequence of parameters, so the arguments must be passed as a tuple, as in the above example, which inserts the values `"Raghu"` and `34` as the first row of the table. Rows are likewise retrieved as tuples.
For larger sets of data held in a list or nested tuple, we can simply run a for loop as below:
list1 = [("Divya", 28), ("Dad", 69), ("Mom", 63)]
for entry in list1:
    db.execute("INSERT INTO test values(?, ?)", entry)
db.commit()
The above lines of code sequentially read each tuple from the list `list1` and execute the `INSERT` statement for each set of values.
Or, we can use the more convenient `executemany` function as follows:
list1 = [("Divya", 28), ("Dad", 69), ("Mom", 63)]
db.executemany("INSERT INTO test values(?, ?)", list1)
db.commit()
Selecting content from the database is a bit different. Executing the query returns a cursor object, which then has to be either converted into a list that can be iterated over, or consumed with its `fetchall()` function, which does the same job.
results = db.execute("SELECT * FROM test ORDER BY age DESC")
print(type(results)) #prints <class 'sqlite3.Cursor'>
rows = results.fetchall()
print(rows)
or
results = db.execute("SELECT * FROM test ORDER BY age DESC")
print(type(results)) #prints <class 'sqlite3.Cursor'>
rows = list(results)
print(rows)
Both create a list named `rows` that stores the entire contents of the `test` table, with each row stored as a tuple.
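As a self-contained sketch (using an in-memory database and the same rows inserted earlier), both approaches yield identical lists of tuples, here ordered by descending age:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # in-memory database for illustration
db.execute("CREATE TABLE test (name VARCHAR(50), age INTEGER)")
db.executemany("INSERT INTO test VALUES (?, ?)",
               [("Raghu", 34), ("Divya", 28), ("Dad", 69), ("Mom", 63)])
db.commit()

# fetchall() on one cursor, list() on another -- both give the same rows
rows_fetch = db.execute("SELECT * FROM test ORDER BY age DESC").fetchall()
rows_list = list(db.execute("SELECT * FROM test ORDER BY age DESC"))
print(rows_fetch)  # [('Dad', 69), ('Mom', 63), ('Raghu', 34), ('Divya', 28)]
print(rows_fetch == rows_list)  # True
```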
Once we are able to extract the data out of a table and store it in a list, we can actually convert it into a DataFrame for further processing of data.
The first step is to read the column labels using the cursor's `description` attribute; the first element of each entry in `description` is the column name.
print(results.description)
Then we use the `DataFrame` function of Pandas to convert the `test` table contents stored as a list in the `rows` variable, assigning the column labels from `results.description`.
import pandas as pd

pdResults = pd.DataFrame(rows, columns=[x[0] for x in results.description])
print(pdResults)
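Pandas can also run the query and build the DataFrame in a single step with `read_sql_query`, which takes the SQL string and the connection object and picks up the column labels automatically. A sketch using an in-memory database:

```python
import sqlite3
import pandas as pd

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE test (name VARCHAR(50), age INTEGER)")
db.executemany("INSERT INTO test VALUES (?, ?)",
               [("Raghu", 34), ("Divya", 28)])
db.commit()

# One call replaces execute() + fetchall() + DataFrame(): the column
# labels are taken from the query result automatically
pdResults = pd.read_sql_query("SELECT * FROM test ORDER BY age DESC", db)
print(pdResults)
```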
2. Reading JSON data
Lately, most communication with web APIs happens by transferring data in JSON format, so it's quite essential to have the skills to parse JSON data and extract information from it into a DataFrame.
Let’s begin with some simple JSON data. Do note, the default functions of Python’s JSON package can parse the JSON data only when it is passed as a string. Consider the following JSON data in string format below.
obj = """{"name":"Wes",
"places_lived":["United States", "Spain", "Germany"],
"pet":null,
"siblings": [{"name":"Scott", "age":30, "pets":["Zeus", "Zuko"]},
{"name":"Katie", "age":38, "pets":["Sixes", "Stache", "Cisco"]}]
}"""  #triple quotes let us write the JSON data as a multi-line string
When we check the `obj` variable's data type, it shows as `str` as well.
We then load the JSON data into a variable after importing the `json` package. The variable into which the JSON data is loaded will be of type `dict`.
import json

results = json.loads(obj)
print(type(results)) #this prints out <class 'dict'>
We can use free online JSON viewers to explore JSON data visually; one website you can make use of is JsonViewer, which renders the above `obj` data as a collapsible tree.
You can convert back from `dict` format to `str` format for JSON data using the `dumps()` function.
dumpResults = json.dumps(results)
print(type(dumpResults)) #this prints out <class 'str'>
Reading specific JSON data happens by indexing the dict with a `key`, which outputs the `value` associated with that key.
print(results["siblings"]) #prints all values of key 'siblings'
print(results["siblings"][0]) #prints the first value of key 'siblings'
print(results["siblings"][0]["name"]) #prints value of 'name' key in first value of 'siblings' key value.
Once we are able to selectively access the JSON data, we can convert it into a DataFrame, provided the selected value is a list of records of the same shape (here, a list of dicts sharing the same keys).
dfResults = pd.DataFrame(results["siblings"], columns = ["name", "age", "pets"])
print(dfResults)
The above code calls the `'siblings'` key to retrieve its values and specifies that we want to extract all columns; we could as well omit the `columns` argument, as by default data from all columns is retrieved.
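For nested JSON, Pandas also provides `json_normalize`, which flattens nested dictionaries into columns. Applied to the siblings records above it yields the same three columns, with list values such as `pets` left as Python lists:

```python
import pandas as pd

siblings = [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
            {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]

# json_normalize flattens nested dicts into dotted column names;
# plain list values such as 'pets' are kept as-is
dfNorm = pd.json_normalize(siblings)
print(dfNorm.columns.tolist())  # ['name', 'age', 'pets']
```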
To get a list of all keys in the JSON data we can use the `keys()` function of the dictionary data type.
print(results.keys()) #JSON data keys
print(dfResults.keys()) #DataFrame keys
3. Reading data from Web APIs
As we have just discussed loading JSON data into a DataFrame, let's now take yet another JSON example, this time reading the JSON data from a web API. Reading web API data is handled by the `requests` package in Python.
import requests

url = "https://api.github.com/repos/pandas-dev/pandas/issues"
webData = requests.get(url)
print(type(webData))
The `get()` function responds with a response object, which is stored into the variable `webData`. We then use its `json()` function to parse the JSON data received as a response from the web URL.
jsonData = webData.json()
Now the variable `jsonData` stores the JSON data, which can be viewed using any online JSON visualiser as discussed earlier. We can extract parts of the JSON data into a DataFrame just as discussed earlier for JSON data.
dfData = pd.DataFrame(jsonData, columns=['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url'])
print(dfData.iloc[1])
4. Reading and operating on Binary Data
The most used binary file formats in Machine Learning are the Pickle and HDF5 file formats. As most of the functions are quite similar, we shall only take a simple read and write example in the Pickle file format.
To write content stored in a DataFrame into the Pickle file format, we use the `to_pickle()` function with the following syntax:
results = pd.read_csv("stock_px.csv") #reading a csv file into a dataframe
results.to_pickle("pickleStocks") #stores in a pickle file format with name of the file as 'pickleStocks'
This creates a pickle file named `pickleStocks` and stores the data in the DataFrame `results` into it.
To read the data from a pickle file we use the `read_pickle()` function as follows, which loads the data into a DataFrame:
pickleDf = pd.read_pickle("pickleStocks")
Once we know how to read and write data in pickle format from and to a DataFrame, the further operations are quite simple.
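As a self-contained sketch (using a small toy DataFrame instead of `stock_px.csv`, and writing to a temporary file), a round trip through pickle preserves the DataFrame exactly:

```python
import os
import tempfile

import pandas as pd

# A small toy DataFrame standing in for the CSV data
df = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "price": [170.5, 310.2]})

# Write to a temporary pickle file, read it back, and compare
path = os.path.join(tempfile.mkdtemp(), "pickleStocks")
df.to_pickle(path)
roundTrip = pd.read_pickle(path)
print(df.equals(roundTrip))  # True
```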
With this post, we pretty much complete reading content from various types of files using Pandas, and we have partly covered selectively displaying DataFrame data using the `iloc` and `loc` functions. In the upcoming parts of the series we shall explore storing JSON data to files, and sorting, filtering and grouping data.
All parts of the series:
Pandas Masterclass — Your Foundation To Data Science (Part 1: Basic Dataframe operations)
Pandas Masterclass — Your Foundation To Data Science (Part 2: DataFrame operations on special file formats)
Pandas Masterclass — Your Foundation To Data Science (Part 3: sorting, filtering and grouping DataFrame data and writing to files)
Pandas Masterclass — Your Foundation To Data Science (Part 4: pandas functions)
Pandas Masterclass — Your Foundation To Data Science (Part 5: Multi-index DataFrames, handling na values and Combining DataFrames)