Pandas Masterclass — Your Foundation To Data Science
Part 2: DataFrame operations on special file formats
Welcome back! In the previous post we gave a brief introduction to Pandas and showed how to use its functions to read various file types and extract data from DataFrames.
In this post, we shall discuss writing DataFrame content to a file and reading special data sources such as databases, JSON data and binary data, and then head over to sorting, filtering and grouping data.
1. Reading data from Database
We first have to be familiar with the process of handling databases in Python before we can see how to use Pandas functions to manipulate and make the best use of the data. We shall explore the process with SQLite, a lightweight, serverless, file-based database engine that comes pre-installed on macOS and Linux and is a very light package to install on Windows. (Python's `sqlite3` module ships with the standard library, so no separate driver is needed.) The basic steps involved are as follows.
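Before walking through each step, here is a minimal end-to-end sketch of the whole workflow. It uses SQLite's special `:memory:` name so the example is self-contained; substitute a filename such as `"myDb1.sqlite"` to persist the data to disk.

```python
import sqlite3

# Connect to an in-memory database (use a filename instead to persist to disk)
db = sqlite3.connect(":memory:")

# Create a table, insert a row with ? placeholders, and commit the changes
db.execute("CREATE TABLE test (name VARCHAR(50), age INTEGER)")
db.execute("INSERT INTO test VALUES (?, ?)", ("Raghu", 34))
db.commit()

# Query the table; rows come back as tuples
rows = db.execute("SELECT * FROM test").fetchall()
print(rows)  # [('Raghu', 34)]

db.close()
```

Each of these steps is explained in detail below.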
i) Importing the package that helps us execute the commands
Each database has its respective package with various pre-defined functions that help us process the data in the database. Here is the syntax to import the package for SQLite:
import sqlite3
ii) Creating a connection with the database
Then we establish a connection with the database. In a single program we can establish as many connections as needed, using a new variable name for each connection. The syntax is:
db = sqlite3.connect("myDb1.sqlite")
This tries to connect to the database `myDb1.sqlite` if it exists; otherwise it creates a new database `myDb1.sqlite`.
iii) Executing queries
Once we successfully connect to the database, we can execute any number of queries on it using the `execute` or `executemany` functions. This is best explained with examples.
#Creating a new table in the database
db.execute("CREATE TABLE test (name VARCHAR(50), age INTEGER)")
db.commit()
We are executing a table creation query and then committing the change to the database. This creates a new table named `test` having columns `name`, a variable-character field of length 50, and `age`, an integer.
We can use the same `execute` function to insert and retrieve data from the `test` table.
db.execute("INSERT INTO test values(?, ?)", ("Raghu", 34))
What we have to remember here is that the `?` placeholders are filled from a sequence of parameters, so the arguments must be passed as a tuple, as in the above example, which inserts the values `"Raghu"` and `34` as the first row of the table. Rows are likewise retrieved as tuples.
For larger sets of data held in a list or nested tuple, we can simply run a for loop as below:
list1 = [("Divya", 28), ("Dad", 69), ("Mom", 63)]
for entry in list1:
    db.execute("INSERT INTO test values(?, ?)", entry)
db.commit()
The above lines of code sequentially read each tuple from the list `list1` and execute the `INSERT` statement for each set of values.
Or, we can use the more convenient `executemany` function as follows:
list1 = [("Divya", 28), ("Dad", 69), ("Mom", 63)]
db.executemany("INSERT INTO test values(?, ?)", list1)
db.commit()
Selecting content from the database is a bit different. Executing the query returns a cursor object, which then has to be either converted into a list that can be iterated over, or consumed with its `fetchall()` function, which does the same job.
results = db.execute("SELECT * FROM test ORDER BY age DESC")
print(type(results)) #prints <class 'sqlite3.Cursor'>
rows = results.fetchall()
print(rows)
or
results = db.execute("SELECT * FROM test ORDER BY age DESC")
print(type(results)) #prints <class 'sqlite3.Cursor'>
rows = list(results)
print(rows)
Both create a list named `rows` that stores the entire contents of the `test` table, with each row stored as a tuple.
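As a self-contained sketch (using an in-memory database and the same rows inserted earlier), both approaches yield identical lists of tuples, here ordered by descending age:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # in-memory database for illustration
db.execute("CREATE TABLE test (name VARCHAR(50), age INTEGER)")
db.executemany("INSERT INTO test VALUES (?, ?)",
               [("Raghu", 34), ("Divya", 28), ("Dad", 69), ("Mom", 63)])
db.commit()

# fetchall() on one cursor, list() on another -- both give the same rows
rows_fetch = db.execute("SELECT * FROM test ORDER BY age DESC").fetchall()
rows_list = list(db.execute("SELECT * FROM test ORDER BY age DESC"))
print(rows_fetch)  # [('Dad', 69), ('Mom', 63), ('Raghu', 34), ('Divya', 28)]
print(rows_fetch == rows_list)  # True
```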
Once we are able to extract the data out of a table and store it in a list, we can actually convert it into a DataFrame for further processing of data.
The first step is to read the column labels using the cursor's `description` attribute; the first element of each entry in `description` is the column name.
print(results.description)
Then we use the `DataFrame` function of Pandas to convert the `test` table contents stored as a list in the `rows` variable, assigning the column labels from `results.description`.
import pandas as pd

pdResults = pd.DataFrame(rows, columns=[x[0] for x in results.description])
print(pdResults)
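Pandas can also run the query and build the DataFrame in a single step with `read_sql_query`, which takes the SQL string and the connection object and picks up the column labels automatically. A sketch using an in-memory database:

```python
import sqlite3
import pandas as pd

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE test (name VARCHAR(50), age INTEGER)")
db.executemany("INSERT INTO test VALUES (?, ?)",
               [("Raghu", 34), ("Divya", 28)])
db.commit()

# One call replaces execute() + fetchall() + DataFrame(): the column
# labels are taken from the query result automatically
pdResults = pd.read_sql_query("SELECT * FROM test ORDER BY age DESC", db)
print(pdResults)
```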
2. Reading JSON data
Lately, most communication with web APIs happens by transferring data in JSON format, so it's quite essential to have the skills to parse JSON data and extract information from it into a DataFrame.
Let’s begin with some simple JSON data. Do note, the default functions of Python’s JSON package can parse the JSON data only when it is passed as a string. Consider the following JSON data in string format below.
obj = """{"name":"Wes",
"places_lived":["United States", "Spain", "Germany"],
"pet":null,
"siblings": [{"name":"Scott", "age":30, "pets":["Zeus", "Zuko"]},
{"name":"Katie", "age":38, "pets":["Sixes", "Stache", "Cisco"]}]
}"""  #triple quotes let us write the JSON data as a multi-line string
When we check the `obj` variable's data type, it shows as `str` as well.
We then load the JSON data into a variable after importing the `json` package. The variable into which the JSON data is loaded will be of type `dict`.
import json

results = json.loads(obj)
print(type(results)) #this prints out <class 'dict'>
We can use free online JSON viewers to explore JSON data visually; one website you can make use of is JsonViewer, which renders the above `obj` data as a collapsible tree.
You can convert back from `dict` format to `str` format for JSON data using the `dumps()` function.
dumpResults = json.dumps(results)
print(type(dumpResults)) #this prints out <class 'str'>
Reading specific JSON data happens by indexing the dict with a `key`, which outputs the `value` associated with that key.
print(results["siblings"]) #prints all values of key 'siblings'
print(results["siblings"][0]) #prints the first value of key 'siblings'
print(results["siblings"][0]["name"]) #prints value of 'name' key in first value of 'siblings' key value.
Once we are able to selectively access the JSON data, we can convert it into a DataFrame, provided the selected value is a list of records of the same shape (here, a list of dicts sharing the same keys).
dfResults = pd.DataFrame(results["siblings"], columns = ["name", "age", "pets"])
print(dfResults)
The above code calls the `'siblings'` key to retrieve its values and specifies that we want to extract all columns; we could as well omit the `columns` argument, as by default data from all columns is retrieved.
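For nested JSON, Pandas also provides `json_normalize`, which flattens nested dictionaries into columns. Applied to the siblings records above it yields the same three columns, with list values such as `pets` left as Python lists:

```python
import pandas as pd

siblings = [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
            {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]

# json_normalize flattens nested dicts into dotted column names;
# plain list values such as 'pets' are kept as-is
dfNorm = pd.json_normalize(siblings)
print(dfNorm.columns.tolist())  # ['name', 'age', 'pets']
```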
To get a list of all keys in the JSON data we can use the `keys()` function of the dictionary data type.
print(results.keys()) #JSON data keys
print(dfResults.keys()) #DataFrame keys
3. Reading data from Web APIs
As we have just discussed loading JSON data into a DataFrame, let's now take yet another JSON example, this time reading the JSON data from a web API. Reading web API data is handled by the `requests` package in Python.
import requests

url = "https://api.github.com/repos/pandas-dev/pandas/issues"
webData = requests.get(url)
print(type(webData))
The `get()` function responds with a response object, which is stored into the variable `webData`. We then use its `json()` function to parse the JSON data received as a response from the web URL.
jsonData = webData.json()
Now the variable `jsonData` stores the JSON data, which can be viewed using any online JSON visualiser as discussed earlier. We can extract parts of the JSON data into a DataFrame just as discussed earlier for JSON data.
dfData = pd.DataFrame(jsonData, columns=['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url'])
print(dfData.iloc[1])
4. Reading and operating on Binary Data
The most used binary file formats in Machine Learning are the Pickle and HDF5 file formats. As most of the functions are quite similar, we shall only take a simple read and write example in the Pickle file format.
To write content stored in a DataFrame into the Pickle file format, we use the `to_pickle()` function with the following syntax:
results = pd.read_csv("stock_px.csv") #reading a csv file into a dataframe
results.to_pickle("pickleStocks") #stores in a pickle file format with name of the file as 'pickleStocks'
This creates a pickle file named `pickleStocks` and stores the data in the DataFrame `results` into it.
To read the data from a pickle file we use the `read_pickle()` function as follows, which loads the data into a DataFrame:
pickleDf = pd.read_pickle("pickleStocks")
Once we know how to read and write data in pickle format from and to a DataFrame, the further operations are quite simple.
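As a self-contained sketch (using a small toy DataFrame instead of `stock_px.csv`, and writing to a temporary file), a round trip through pickle preserves the DataFrame exactly:

```python
import os
import tempfile

import pandas as pd

# A small toy DataFrame standing in for the CSV data
df = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "price": [170.5, 310.2]})

# Write to a temporary pickle file, read it back, and compare
path = os.path.join(tempfile.mkdtemp(), "pickleStocks")
df.to_pickle(path)
roundTrip = pd.read_pickle(path)
print(df.equals(roundTrip))  # True
```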
With this post, we pretty much complete reading content from various types of files using Pandas, and we have partly covered selectively displaying DataFrame data using the `iloc` and `loc` functions. In the upcoming parts of the series we shall explore storing JSON data to files, and sorting, filtering and grouping data.
All parts of the series:
Pandas Masterclass — Your Foundation To Data Science (Part 1: Basic Dataframe operations)
Pandas Masterclass — Your Foundation To Data Science (Part 2: DataFrame operations on special file formats)
Pandas Masterclass — Your Foundation To Data Science (Part 3: sorting, filtering and grouping DataFrame data and writing to files)
Pandas Masterclass — Your Foundation To Data Science (Part 4: pandas functions)
Pandas Masterclass — Your Foundation To Data Science (Part 5: Multi-index DataFrames, handling na values and Combining DataFrames)