Pandas Masterclass — Your Foundation To Data Science

Part 3: sorting, filtering and grouping DataFrame data and writing to files

Raghu Prodduturi
Analytics Vidhya

--

Once we know how to read data into a DataFrame and write it back out, we can perform various useful operations on it, ranging from sorting, filtering on various conditions and grouping, to identifying null values and filling them with standard values, getting the shape of the data and so on. We will first discuss sorting, then filtering on various conditions, and finally grouping of data.

1. Sorting Data
Once we have the data loaded into a DataFrame, we can sort it by any single column or a combination of columns. When we use more than one column to sort, preference is given in the order the columns are specified. The data is first sorted by the first column passed to the sort function; where two or more rows have the same value in that column, it then sorts by the values in the second column. For each column we can specify whether we want to sort in ascending or descending order; the default is ascending if nothing is specified.
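As a minimal, self-contained sketch of this behaviour (using a small made-up DataFrame, not the stocks file used below):

```python
import pandas as pd

# Toy data standing in for the stocks DataFrame
df = pd.DataFrame({
    "AA": [4.98, 5.60, 5.60, 4.40],
    "GE": [2.87, 2.87, 3.10, 2.60],
})

# Sort by 'AA' descending; ties in 'AA' are broken by 'GE' ascending
sorted_df = df.sort_values(["AA", "GE"], ascending=[False, True])
print(sorted_df)
```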

Let’s look at a few examples now, implemented on the very first data set we considered: the stocks information data. We are sharing the first 5 lines of the file below as a recap of the file’s structure; the data is loaded into the DataFrame results.

Showing 5 rows of results DataFrame having stocks data loaded into it.
print(results.sort_values('AA', ascending = False))
#sorts results data by 'AA' column in descending order
Results data frame sorted by column ‘AA’ in descending order

In the next example, we will first sort data in the results DataFrame by 'AAPL' column in ascending order and in case of similar values in 'AAPL' column, we then sort by 'GE' column in descending order.

print(results.sort_values(['AAPL','GE'], ascending = [True, False]))
Sorting by ‘AAPL’ column ascending and then by ‘GE’ column descending

As we can see in the results above, row numbers 2001, 1999 and 1997 all have an 'AAPL' column value of 3.28. Sorting is therefore applied to column 'GE' next, in descending order, as indicated above.

2. Filtering data
There are multiple use cases where one would like to retrieve data selectively from the DataFrame depending on various criteria. We have already discussed some basic examples under the vii) Using iloc and loc section of the first part of our series, Pandas Masterclass — Your Foundation To Data Science (Part 1: Basic Dataframe operations). We shall now see how to achieve more advanced filtering of data.

A few basic points to keep in mind. Whenever we apply a conditional operator to DataFrame data, it returns a boolean Series with True for every row that satisfies the condition and False for every row that does not. A small example below, working on the same stocks data set that we have in the results DataFrame.

print(results['AA'] > 5)
Results of running a conditional statement on column ‘AA’

We can see above that row (index value) 0 has False, followed by True for index values 1 to 6, and so on. This essentially means that index value 0 has a value of 5 or less, while index values 1 to 6 have values greater than 5.

We can then use this boolean result inside loc to selectively print only the rows with a True value, as follows.

print(results.loc[results['AA'] > 5])
Results showing ‘AA’ values greater than 5

We can join multiple conditions using the and operator (&), the or operator (|) and the not operator (~), and extract data accordingly.

print(results.loc[(results['AA'] == 4.98) & (results['AAPL'] == 7.86)])

This prints out the DataFrame content having column 'AA' value as 4.98 and column 'AAPL' values as 7.86.
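The not operator ~ simply inverts a boolean mask. A quick sketch on toy data (the values here are made up, not taken from the stocks file):

```python
import pandas as pd

df = pd.DataFrame({"AA": [4.98, 5.60, 3.20], "AAPL": [7.86, 7.86, 2.10]})

# ~ inverts the mask: rows where 'AA' is NOT greater than 5
not_over_5 = df.loc[~(df["AA"] > 5)]
print(not_over_5)
```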

print(results.loc[(results['AA'] == 4.98) | (results['AAPL'] == 7.86) | (results['SPX'] == results['SPX'].max())])

Here we are stepping it up a notch and using the max() function. We have three conditions joined with the or operator |. Executing the above code returns all rows of the DataFrame that satisfy at least one of the three conditions. The last condition basically says to return all rows where column 'SPX' equals the maximum value of the 'SPX' column.

Results of executing both the above multiple conditional statement filtering

Another way to filter data is to use string functions on columns that hold string data. Let’s consider a few examples working on a data set with passenger information from the Titanic. We are not discussing the intricate details of the data in the DataFrame, as we are only considering one column, 'Name', and applying string functions to it to filter the data.

print(titanicData.loc[titanicData['Name'].str.contains("John")]["Name"])

The above condition makes use of the contains() string function and returns all the names that contain the word 'John' in them.

import re
print(titanicData.loc[titanicData['Name'].str.contains("pi[a-z]*", regex=True, flags=re.I)]["Name"])

We can also use regular expressions in the contains() function. It’s a complex topic that I won’t be discussing here, but in case you are further interested you can read more about it on W3Schools. The above expression searches for names that have 'pi' in them followed by any number of characters between a-z. The flags parameter accepts several values; the one we used, re.I, makes the search case-insensitive.

Results of executing the above two code snippets
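The same two filters can be sketched on a few made-up names (these are not rows from the actual Titanic file):

```python
import re
import pandas as pd

# Toy passenger names standing in for the Titanic 'Name' column
names = pd.DataFrame({"Name": ["Piatt, Mr. John", "Smith, Mrs. Anna", "PINSKY, Mr. Leo"]})

# Plain substring match
johns = names.loc[names["Name"].str.contains("John"), "Name"]

# Case-insensitive regex: 'pi' followed by any number of letters a-z
pis = names.loc[names["Name"].str.contains("pi[a-z]*", regex=True, flags=re.I), "Name"]
print(johns)
print(pis)
```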

We can also use conditional filtering to change column values in the rows that match the filter. The code below filters the stocks data on the 'GE' column and, for the rows that satisfy the condition, sets the value of the 'AAPL' column to 0.

print(results.head())
results.loc[results['GE'] == 2.87, 'AAPL'] = 0
print(results.head())
Showing first 5 rows of results DataFrame before and after making changes
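The same filter-and-assign pattern on a toy DataFrame (made-up values):

```python
import pandas as pd

df = pd.DataFrame({"GE": [2.87, 3.10, 2.87], "AAPL": [7.86, 5.50, 4.20]})

# For rows where 'GE' equals 2.87, overwrite 'AAPL' with 0
df.loc[df["GE"] == 2.87, "AAPL"] = 0
print(df)
```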

3. Grouping data
Grouping data is quite straightforward. The groupby() function groups rows by a column’s values; we then apply an aggregate function such as sum(), count() or mean(), which respectively sums the grouped values, counts the number of values, or finds their average. (Many other aggregates are available too.) It is quite easy to understand with a few examples.
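Before turning to a real data set, here is a minimal sketch of the three aggregates on made-up data (the column names here are invented for illustration):

```python
import pandas as pd

# Toy data: a type column to group by and a numeric column to aggregate
df = pd.DataFrame({
    "Type 1": ["Fire", "Water", "Fire", "Water"],
    "Attack": [52, 48, 64, 63],
})

grouped = df.groupby("Type 1")["Attack"]
print(grouped.sum())    # total Attack per type
print(grouped.count())  # number of rows per type
print(grouped.mean())   # average Attack per type
```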

Consider a DataFrame having details of various Pokemon and its attributes such as name, type, HP, attack etc. The various column names are indicated below, with data from one row displayed to understand what sort of data exists in the DataFrame.

One row of the Pokemon Data Frame

As we have useful fields such as 'Type 1' and 'Type 2', it is easy to run various grouping statements to show results effectively.

If we would like to get the mean of all non-object columns and then sort by the 'Attack' column, we can execute a simple line of code to do so.

print(pokeData.groupby('Type 1').mean(numeric_only = True).sort_values('Attack', ascending = False))
Data frame content grouped by Type 1 column and averages calculated

The above code basically groups the entire pokeData DataFrame by the 'Type 1' column and then calculates the average of each remaining numeric column for each type of Pokemon. We then use the sort_values() function to sort by the 'Attack' column values in descending order.

4. Writing data frame content to a file
Now that we have covered the most frequently used operations on data, right from loading data from various sources into a DataFrame/Series and performing various operations on DataFrame data such as grouping, sorting and filtering, the next logical step is to find out how to write data present in a DataFrame into a file.

There are many useful functions and tasks in Pandas that help perform various operations that I shall discuss in the next part of this series.

Various file write options that we have are as follows. (Source)

df.to_csv() #csv format
df.to_json() #JSON format
df.to_html() #HTML format
df.to_clipboard() #to local clipboard
df.to_excel() #Excel format
df.to_hdf() #HDF format (binary)
df.to_feather() #Feather format
df.to_parquet() #Parquet format
df.to_msgpack() #Msgpack format (removed in recent pandas versions)
df.to_stata("filename") #Stata format
df.to_pickle("filename") #pickle file format (binary)
df.to_sql() #SQL
df.to_gbq() #Google big query

Each of these functions accepts various useful parameters, ranging from an option to write only specific columns using columns = ['Col1', 'Col2'], to suppressing the index labels with index = False and, similarly, the column labels with header = False. The full list of parameters is exhaustive and can be referred to when needed.
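A quick way to see these parameters in action without creating files: when to_csv() is called without a path, it returns the CSV text directly (toy data below, invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2], "Col2": ["a", "b"], "Col3": [3.0, 4.0]})

# Write only two columns and drop the index labels
csv_text = df.to_csv(columns=["Col1", "Col2"], index=False)
print(csv_text)
```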

A few examples of writing data in various file formats. We will not discuss the exact contents of the DataFrame, but one can assess them by looking at snapshots of the resulting files after executing the lines of code given below.

titanicDF.to_csv("testNoIndexNoHeader.csv", index = False, header = False)

This writes the content of DataFrame titanicDF into a CSV file named testNoIndexNoHeader.csv without any column or index labels, as shown below.

A snapshot of ‘testNoIndexNoHeader.csv’ file
titanicDF.to_csv("testOneColumnOnly.csv", columns = ["PassengerId"])

The above code writes the 'PassengerId' column of titanicDF, along with the index and column labels, into a CSV file testOneColumnOnly.csv as shown below.

A snapshot of ‘testOneColumnOnly.csv’ file

The next parts of the series will be a lot more exciting, discussing some intricate functions and specific tasks in detail.

All parts of the series:
Pandas Masterclass — Your Foundation To Data Science (Part 1: Basic Dataframe operations)
Pandas Masterclass — Your Foundation To Data Science (Part 2: DataFrame operations on special file formats)
Pandas Masterclass — Your Foundation To Data Science (Part 3: sorting, filtering and grouping DataFrame data and writing to files)
Pandas Masterclass — Your Foundation To Data Science (Part 4: pandas functions)
Pandas Masterclass — Your Foundation To Data Science (Part 5: Multi-index DataFrames, handling na values and Combining DataFrames)
