Pandas Masterclass — Your Foundation To Data Science
Part 3: sorting, filtering and grouping DataFrame data and writing to files
Once we know how to read and write data into a DataFrame, we can perform various useful operations on it: sorting data, filtering data on various conditions, grouping data, identifying null values and filling them with standard values, getting the shape of the data, and so on. We will first discuss sorting data, then filtering data on various conditions, and finally grouping of data.
1. Sorting Data
Once we have the data loaded into a DataFrame, we can sort it by any single column or a combination of columns. When we use more than one column to sort the data, preference is given in the order the columns are specified: the data is first sorted by the first column passed to the sort function, and wherever two or more rows share the same value in that column, the tie is broken by the values in the second column. For each column, we can specify whether to sort in ascending or descending order; the default is ascending if nothing is specified.
Let’s now look into a few examples, implemented on the very first data set we considered, the stocks information data. The first 5 lines of the file are shared below as a recap of the file’s structure; the data is loaded into the DataFrame results.
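For readers following along, here is a minimal, self-contained sketch of such a results DataFrame; the values here are illustrative stand-ins for the stocks file, not the real data:

```python
import pandas as pd
from io import StringIO

# A tiny stand-in for the stocks CSV; the numbers are made up for illustration.
csv_text = """AA,AAPL,GE,SPX
4.98,7.86,2.87,909.03
5.04,8.00,2.87,908.59
"""
results = pd.read_csv(StringIO(csv_text))
print(results.head())  # recap of the file structure
```

In practice, results would instead be loaded with pd.read_csv() pointed at the stocks file from the earlier parts of the series.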
print(results.sort_values('AA', ascending = False))
#sorts results data by 'AA' column in descending order
In the next example, we will first sort the data in the results DataFrame by the 'AAPL' column in ascending order and, in case of matching values in 'AAPL', then sort by the 'GE' column in descending order.
print(results.sort_values(['AAPL','GE'], ascending = [True, False]))
As we can see in the results above, row numbers 2001, 1999 and 1997 all have the 'AAPL' column value 3.28. So sorting is then applied on the 'GE' column, in descending order as indicated above.
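To see the tie-breaking behaviour in isolation, here is a small sketch on made-up values (not the actual stocks data):

```python
import pandas as pd

# Three rows share the same 'AAPL' value, so within that group
# the order is decided by 'GE' in descending order.
df = pd.DataFrame({'AAPL': [3.28, 3.28, 3.28, 3.30],
                   'GE':   [2.87, 2.90, 2.60, 2.87]})
sorted_df = df.sort_values(['AAPL', 'GE'], ascending=[True, False])
print(sorted_df)
```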
2. Filtering data
There will be multiple use cases where one would like to retrieve data selectively from the DataFrame depending on various criteria. We have discussed some basic examples already under vii) Using iloc and loc section of the first part of our series Pandas Masterclass — Your Foundation To Data Science (Part 1: Basic Dataframe operations). We shall now see how to achieve more advanced filtering of data.
A few basic points to keep in mind: whenever we perform a conditional operator check on DataFrame data, it returns boolean values, True for every row value of the column that satisfies the conditional statement and False for every value that does not. A small example below, working on the same stocks data set that we have in the results DataFrame.
print(results['AA'] > 5)
We can see above that row (index value) 0 has False, followed by True for index values 1 to 6, etc. This essentially means that index value 0 has a value of 5 or less, while index values 1 to 6 have values greater than 5.
We can then use this boolean result in loc to selectively print out only the rows whose index has a True value, as follows.
print(results.loc[results['AA'] > 5])
We can combine multiple conditions using the and operator &, the or operator | and the not operator ~ to extract data accordingly. Note that each individual condition must be wrapped in parentheses, since these operators bind more tightly than comparison operators.
print(results.loc[(results['AA'] == 4.98) & (results['AAPL'] == 7.86)])
This prints out the DataFrame content having the column 'AA' value 4.98 and the column 'AAPL' value 7.86.
print(results.loc[(results['AA'] == 4.98) | (results['AAPL'] == 7.86) | (results['SPX'] == max(results['SPX']))])
Over here, we are stepping it up a notch and using the max() function. We have three conditions joined with the or operator |. Executing the above code will return all rows of the DataFrame that satisfy at least one of the three conditional statements. The last condition basically says to return all rows of the DataFrame where the column 'SPX' has a value equal to the maximum value of the column 'SPX'.
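Here is a self-contained sketch of the same idea on made-up values (results below is a toy stand-in, not the real stocks data); note that results['SPX'].max() is the more idiomatic pandas equivalent of max(results['SPX']):

```python
import pandas as pd

# Illustrative stand-in for the stocks data.
results = pd.DataFrame({'AA':   [4.98, 5.10, 6.00, 5.50],
                        'AAPL': [7.40, 7.86, 7.90, 7.50],
                        'SPX':  [909.0, 915.0, 920.0, 910.0]})

# Rows 0, 1 and 2 each satisfy one condition; row 3 satisfies none.
mask = (results['AA'] == 4.98) | (results['AAPL'] == 7.86) | (results['SPX'] == results['SPX'].max())
print(results.loc[mask])
```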
Another way to filter data is to use string functions on columns that hold string data. Let’s consider a few examples working on data that has passenger information of the Titanic ship. We are not discussing the intricate details of the data in the DataFrame, as we are only considering one column, 'Name', and applying string functions on it to filter out data.
print(titanicData.loc[titanicData['Name'].str.contains("John")]["Name"])
The above condition makes use of the contains() string function and returns all the names that contain the word 'John'.
import re
print(titanicData.loc[titanicData['Name'].str.contains("pi[a-z]*", regex=True, flags=re.I)]["Name"])
We can also use regular expressions in the contains() function. Regular expressions are a complex topic that I won’t be discussing here, but in case you are further interested you can read more about them on W3Schools. The above expression searches for names that contain 'pi' followed by any number of characters between a-z. The flags parameter accepts multiple values; the one we used, re.I, makes the search case-insensitive.
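A minimal sketch of the same pattern on a few invented names (not the actual Titanic data):

```python
import re
import pandas as pd

# Made-up names purely to demonstrate the regex; 'Piazza' and 'Pickard'
# match "pi[a-z]*" case-insensitively, 'Smith' does not.
names = pd.Series(["Piazza, Mr. Pompeo", "Smith, Mr. John", "Pickard, Mr. Berk"])
mask = names.str.contains("pi[a-z]*", regex=True, flags=re.I)
print(names[mask])
```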
We can also use conditional filtering to change column values of the rows returned by the filter. The code below performs a conditional filter on the 'GE' column values of the stocks data and then, for the rows that satisfy the condition, changes the value of the 'AAPL' column to 0.
print(results.head())
results.loc[results['GE'] == 2.87, 'AAPL'] = 0
print(results.head())
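The same before/after effect can be seen on a tiny invented frame:

```python
import pandas as pd

# Illustrative values only: set 'AAPL' to 0 wherever 'GE' equals 2.87.
df = pd.DataFrame({'GE': [2.87, 2.90, 2.87], 'AAPL': [7.86, 8.00, 7.90]})
df.loc[df['GE'] == 2.87, 'AAPL'] = 0
print(df)  # rows 0 and 2 now have AAPL == 0
```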
3. Grouping data
Grouping data is quite straightforward. The groupby() function groups rows by a column and is then combined with an aggregate function such as sum(), count() or mean(), which respectively sums the values within each group, counts the number of values, or finds their average. It is easiest to understand with a few examples.
Consider a DataFrame holding details of various Pokemon and their attributes such as name, type, HP, attack etc. The various column names are indicated below, with data from one row displayed to show what sort of data exists in the DataFrame.
As we have useful fields such as 'Type 1' and 'Type 2', it is easy to run various grouping statements to show results effectively.
If we would like to get the mean of all non-object columns and then sort by the 'Attack' column, we can do so with a single line of code.
print(pokeData.groupby('Type 1').mean().sort_values('Attack', ascending = False))
The above code groups the entire pokeData DataFrame by the 'Type 1' column and then, for each group, calculates the average of the other columns for each type of Pokemon. We then use the sort_values() function to sort by the 'Attack' column values in descending order.
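A minimal sketch of the same grouping on a made-up miniature of pokeData; note that recent pandas versions require selecting numeric columns (or passing numeric_only=True) before calling mean() on a frame that also holds object columns such as 'Name':

```python
import pandas as pd

# A tiny invented Pokemon-style frame, not the real data set.
pokeData = pd.DataFrame({
    'Name':   ['Bulbasaur', 'Charmander', 'Ivysaur', 'Charmeleon'],
    'Type 1': ['Grass', 'Fire', 'Grass', 'Fire'],
    'Attack': [49, 52, 62, 64],
})

# Mean attack per type, highest first: Fire (52+64)/2 = 58, Grass (49+62)/2 = 55.5.
means = pokeData.groupby('Type 1')[['Attack']].mean().sort_values('Attack', ascending=False)
print(means)
```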
4. Writing data frame content to a file
Now that we have covered the most frequently used operations on data, from loading data from various sources into a DataFrame/Series to performing operations such as grouping, sorting and filtering, the next logical step is to find out how to write data present in a DataFrame into a file.
There are many useful functions and tasks in Pandas that help perform various operations that I shall discuss in the next part of this series.
The various file write options available are as follows.
df.to_csv() #CSV format
df.to_json() #JSON format
df.to_html() #HTML format
df.to_clipboard() #to local clipboard
df.to_excel() #Excel format
df.to_hdf() #HDF5 format (binary)
df.to_feather() #Feather format
df.to_parquet() #Parquet format
df.to_msgpack() #Msgpack format (deprecated in recent pandas)
df.to_stata("filename") #Stata format
df.to_pickle("filename") #pickle file format (binary)
df.to_sql() #SQL database
df.to_gbq() #Google BigQuery
Each of these functions accepts various useful parameters, ranging from an option to write only specific columns into the file using the columns = ['Col1', 'Col2'] parameter, to suppressing the index labels using index = False, and similarly omitting the column labels using header = False. The full list of parameters is exhaustive and can be referred to when in need.
A few examples of writing data in various file formats follow. We will not be discussing the exact contents of the DataFrame, but one can assess the contents by looking at the snapshots of the resulting file after executing the lines of code given below.
titanicDF.to_csv("testNoIndexNoHeader.csv", index = False, header = False)
This writes the content of the DataFrame titanicDF into a CSV file named testNoIndexNoHeader.csv without any column or index labels, as shown below.
titanicDF.to_csv("testOneColumnOnly.csv", columns = ["PassengerId"])
The above code writes the 'PassengerId' column of the DataFrame titanicDF, along with the index and column labels, into a CSV file testOneColumnOnly.csv as shown below.
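A self-contained sketch of the same call, using a hypothetical three-row stand-in for titanicDF and reading the file back to confirm what was written:

```python
import pandas as pd

# Hypothetical miniature frame standing in for titanicDF.
titanicDF = pd.DataFrame({'PassengerId': [1, 2, 3],
                          'Name': ['A', 'B', 'C']})
titanicDF.to_csv("testOneColumnOnly.csv", columns=["PassengerId"])

# Reading the file back: only the index and the selected column were written.
check = pd.read_csv("testOneColumnOnly.csv", index_col=0)
print(check)
```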
The next parts of the series will be a lot more exciting, discussing in detail some intricate functions and specific tasks to be achieved.
All parts of the series:
Pandas Masterclass — Your Foundation To Data Science (Part 1: Basic Dataframe operations)
Pandas Masterclass — Your Foundation To Data Science (Part 2: DataFrame operations on special file formats)
Pandas Masterclass — Your Foundation To Data Science (Part 3: sorting, filtering and grouping DataFrame data and writing to files)
Pandas Masterclass — Your Foundation To Data Science (Part 4: pandas functions)
Pandas Masterclass — Your Foundation To Data Science (Part 5: Multi-index DataFrames, handling na values and Combining DataFrames)