Pandas Masterclass — Your Foundation To Data Science

Raghu Prodduturi
Analytics Vidhya
Published in
7 min readJan 10, 2020

--

Part 1: Basic DataFrame operations

Image Source: Towards Data Science

Pandas, Panel Data, is a software library of Python programming language that’s immensely helpful for data manipulation and analysis. According to Wikipedia, what you can achieve with Pandas is

  • DataFrame object for data manipulation with integrated indexing.
  • Tools for reading and writing data between in-memory data structures and different file formats.
  • Data alignment and integrated handling of missing data.
  • Reshaping and pivoting of data sets.
  • Label-based slicing, fancy indexing, and subsetting of large data sets.
  • Data structure column insertion and deletion.
  • Group by engine allowing split-apply-combine operations on data sets.
  • Data set merging and joining.
  • Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
  • Time series-functionality: Date range generation[4] and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging.
  • Provides data filtration.

Let’s cut the chase and begin with Pandas implementation in Python and learn by doing. This is going to be a very long post and give you a comprehensive experience in learning Pandas to play with your data.

This will be series or 3 to 4 different posts that will cover major Pandas functionality in depth.

1. Creating an Alias for easy reference
To begin with, we have to import the pandas library into our python program using the statement import pandas. The easier option is to import it as an alias which helps us cut down the hassle of typing out the whole word pandas each time we execute a function related to the package.

import pandas as pd

2. Reading a file
Reading a file’s content is quite simple in pandas. There are multiple in-built functions available in pandas to read various types of files directly without worrying about delimiter/separator of the file. For the examples to follow, we shall use the stock_px.csv file that has time stamp and prices of 10 different stocks at that timestamp value.

Part of the contents of stock_px.csv file

To read a csv file we can use the read_csv() function as follows.

results = pd.read_csv("stock_px.csv")

This reads the csv file and stores it into a DataFrame named as results. Similarly we have functions such as read_tsv() to read tab separated values read_excel() to read excel files, except that for excel we have to mention the sheet names to be read using the sheetname attribute and mention the sheet name/names.

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table. When contents of a file are read into a DataFrame, the first row by default is considered as column labels. We can overwrite it and specify our own column labels using the following syntax.

results1 = pd.read_csv("stock_px.csv", names=["Date & Time", "American Associates", "Apple", "General Electricals", "IGM", "JNJ","Microsoft", "PEP", "Sparx", "Xerox"])

This will omit the first like of the stock_px.csv and give the column names as provided in the names= [] attribute of read_csv() function.

User given column names

While reading the file, if we want default system generate column labels and would like to omit the first row of the csv file we can use the following syntax.

results1 = pd.read_csv("stock_px.csv", header=None)

This will allocate the column labels starting with 0 and the increments the value by 1 for each following columns.

System generated column names

3. Extracting DataFrame content
Once we are able to retrieve a file’s content in a DataFrame, Pandas library provides lots of functions to retrieve data from DataFrame. Let’s start with the most basic options and then move on to more complex ones.

i) Print entire DataFrame
To show the entire contents of a DataFrame, all we have to do is to print out the DataFrame itself. The system prints the abridged version of the DataFrame so that it’s more readable in the limited space to display.

print(results)
Entire DataFrame, results, printed

ii) Print first few or last few rows of DataFrame
We have useful functions such as head() and tail() that by default prints the first 5 rows of the DataFrame and last 5 rows of the DataFrame respectively by default but we can provide any argument such as 4 or 10 in the head() or tail() function to print that many rows.

print(results.head(2)) #prints the first 2 rows of DataFrame
print(results.tail()) #prints the last 5 rows of DataFrame
The output of printing head(2) and tail() function of DataFrame

iii)Counting number of rows of each column
To give a count of number of rows for each column, we can use the count() following function.

print(results.count())
Output of count() function

iv) Length of the DataFrame
To simply get the length of the DataFrame, which is also just the number of rows, we can use the len() function as follows.

print("Length of results dataframe is:",len(results))
Length of the DataFrame

v) Column labels of the DataFrame
If one wants to retrieve a list of all column labels in the DataFrame that can be used for further processing, we can use the columns.values attribute as follows that will result in an array.

print(results.columns.values)
print(type(results.columns.values))
Column names and the data type of the result

Let’s step it now and explore how we can selectively print out certain rows or columns or a combination of both or results with certain restrictions applied.

vi) Printing all rows of specific columns
We can selectively print rows of specific columns of the DataFrame using the column labels. If you want to select a range of column labels, you can retrieve the column labels using the previous method columns.values and then select the range of values from the list.

First, a simple example of printing all rows of one column, 'AA'.

print(results['AA'])
Printing all rows of column ‘AA’

Printing the 10 rows after the 2nd row of columns 'AA', 'AAPL' and 'PEP'. Here, what we have to remember is that when more than one column value has to be retrieved we have to pass it as a list to the results DataFrame as given below.

print(results[['AA', 'AAPL', 'PEP]][2:12])

This skips the first two rows with index value 0, 1 and then begins printing rows from 2:11 included.

vii) Using iloc and loc
Both these functions take values in the format loc[rows, columns] where as columns is optional. If we do not specify any columns it prints entire rows.

The basic difference between iloc and loc is that iloc accepts only row and column values allotted by system by default, these begin at 0 and keeps incrementing by 1 for each row. So if we have 10 rows the row index values would be 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 by default.

But loc accepts user defined labels as well. Here in our DataFrame, the column at index 1 has a label 'AA' and we can use iloc[0, 'AA'] to retrieve value of column 'AA' for row 0 where as to use loc the only way to do it is by taking system defined default values, loc[0, 0].

Consider the examples below

print(results.iloc[0, 0]) #prints timestamp of first row
print(results.loc[0, 'AAPL']) #prints value of 'AAPL' from first row
print(results.iloc[0:4]) #prints entire first 4 rows
print(results.loc[0:4, 'GE']) #prints values of GE from row index 0 to 3
Results for running the above examples of iloc and loc

We can step it up a bit and use loc to filter out results based on specific conditions as well. If we want to display results only matching certain values in certain rows we can use the following syntax.

print(results.loc[results['GE'] == 2.87])
#prints all rows where column 'GE' has a value 2.87
print(results[['GE','AA']].loc[results['GE'] == 2.87])
#prints columns 'GE' and 'AA' only for columns where 'GE' has a value of 2.87

There will be numerous instances wherein we will have the need to fetch the index number as well as contents of each row sequentially and then apply some logic on it. One easy way to iterate over the DataFrame and fetch data that way is as below.

for index, row in results.iterrows():
print("Index no.:",index)
print(row)

A simple 3 lines of code to do the job. We can take this further and retrieve selective column values only by specifying column labels against row variable.

for index, row in results.iterrows():
print(index, row['AAPL'])

The above piece of code selectively extracts values of 'AAPL' column only in each iteration.

In the next parts I would be talking more about writing to a file, sorting DataFrame content, Grouping DataFrame content, DataFrame data manipulation, Reading specific content from JSON, Databases etc., and many other useful specific operations.

All parts of the series:
Pandas Masterclass — Your Foundation To Data Science (Part 1: Basic Dataframe operations)
Pandas Masterclass — Your Foundation To Data Science (Part 2: DataFrame operations on special file formats)
Pandas Masterclass — Your Foundation To Data Science (Part 3: sorting, filtering and grouping DataFrame data and writing to files)
Pandas Masterclass — Your Foundation To Data Science (Part 4: pandas functions)
Pandas Masterclass — Your Foundation To Data Science (Part 5: Multi-index DataFrames, handling na values and Combining DataFrames)

--

--

Raghu Prodduturi
Analytics Vidhya

Programmer by education, business man by profession and entrepreneur by heart. Conflicting with a pinch of clarity.