Pandas Masterclass — Your Foundation To Data Science

Part 4: pandas functions

Raghu Prodduturi
Analytics Vidhya

--

1. Using chunksize parameter to process large data
Files that run into several GB can be handled using the chunksize parameter, which reads the data in parts: we process each chunk as needed, store it in a temporary variable and then concatenate all the chunks together.

We shall first share a code segment that makes use of the chunksize parameter and then decode it step by step. We will build two different DataFrames from the same CSV file, just to validate that both eventually end up with the same content.

df = pd.read_csv("pokemon_data.csv")
print(len(df))

newDf = pd.DataFrame(columns=df.columns)
# read chunks of 5 rows at a time, process each as needed, then concatenate
for tempDf in pd.read_csv("pokemon_data.csv", chunksize=5):
    newDf = pd.concat([newDf, tempDf])

print(len(newDf))

The first two lines of code are straightforward: we read the pokemon_data.csv file and print the length of the resulting DataFrame.

Next, we create an empty DataFrame newDf with the same column labels as the first DataFrame df. In the for loop, the chunksize parameter of the read_csv function makes each iteration yield just 5 rows of pokemon_data.csv; we concatenate those 5 rows onto newDf in each iteration, so by the end newDf holds exactly the same data as df. We can verify this by printing the length of newDf, which will be the same as the length of df.
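In practice the point of chunking is to process each piece before combining, rather than merely reassembling the file. Below is a minimal sketch of that pattern; the column names and data are made up, and io.StringIO stands in for a large CSV file on disk.

```python
import io
import pandas as pd

# Simulated CSV file in memory (stands in for a large file on disk)
csv_data = io.StringIO(
    "name,hp\n"
    "Bulbasaur,45\n"
    "Charmander,39\n"
    "Squirtle,44\n"
    "Pikachu,35\n"
    "Eevee,55\n"
    "Snorlax,160\n"
    "Mewtwo,106\n"
)

# Process each chunk independently (here: keep only rows with hp > 40),
# then stitch the filtered pieces back together
filtered_parts = []
for chunk in pd.read_csv(csv_data, chunksize=3):
    filtered_parts.append(chunk[chunk['hp'] > 40])

result = pd.concat(filtered_parts, ignore_index=True)
print(result)
```

Because the filter runs per chunk, only the surviving rows of each chunk are held before concatenation, which is what keeps memory usage bounded on genuinely large files.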

2. The inplace parameter
There are various instances in which we make changes to a DataFrame, such as resetting its index. To make the change stick, we can reassign the result back to the DataFrame using the code below.

new_results = new_results.reset_index(drop = True)

Or, we can use the inplace parameter as follows, without reassigning the DataFrame to itself.

new_results.reset_index(drop = True, inplace = True)
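One pitfall worth noting: methods called with inplace=True mutate the DataFrame and return None, so reassigning their result destroys the data. A small sketch with a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]}, index=[10, 20, 30])

# Reassignment style: reset_index returns a new DataFrame
df2 = df.reset_index(drop=True)
print(df2.index.tolist())  # [0, 1, 2]

# inplace=True mutates df and returns None -- do NOT reassign the result
ret = df.reset_index(drop=True, inplace=True)
print(ret)                 # None
print(df.index.tolist())   # [0, 1, 2]
```

Writing `df = df.reset_index(drop=True, inplace=True)` would therefore leave df set to None; pick one style or the other, never both.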

3. Changing categorical data in a column
Categorical data means a column whose values come from a limited, fixed set of possibilities. For example, the 'Type 1' column of a Pokémon DataFrame takes values such as 'Grass', 'Fire' and 'Water', which are the categories of that column.

We might encounter multiple instances where we will have to change categorical data. The process is quite simple which can again be explained with an example below.

print(pokeData.head())
types = pd.Categorical(pokeData['Type 1'])
print(types.categories)
# rename_categories returns a new Categorical; reassign the result
# (the inplace parameter was removed in newer pandas versions)
types = types.rename_categories([x + '1' for x in types.categories])
print(types)
pokeData['Type 1'] = types
print(pokeData.head())

We first print out the pokeData's first 5 rows to understand what sort of data is stored in the DataFrame.

pokeData DataFrame’s first 5 rows

Now let’s change the categorical data of column 'Type 1' by adding the suffix ‘1’ to the end of each category value. We first load all the categorical data of the column into a temporary variable types using the pd.Categorical function. We then print the types variable to check which categories exist in column 'Type 1'.

Categorical data of ‘Type 1’ column

We then use the rename_categories() function to rename the categories. We just have to ensure that the new category names are provided as a list with exactly the same number of values as originally present: if the column initially has 13 distinct categories, we must provide a list of 13 new names to rename_categories(). Here, we simply retrieve the existing category names and add the suffix ‘1’ to each.

We can validate the new categorical values by printing them out. We can assign these new values to the 'Type 1' column of the DataFrame once again.

New categorical data shown and applied to DataFrame
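The same flow can be run end to end on a toy column (the category values here are made up for illustration); note the reassignment, since rename_categories no longer accepts inplace=True in newer pandas versions:

```python
import pandas as pd

# Hypothetical miniature version of the pokeData DataFrame
pokeData = pd.DataFrame({'Type 1': ['Grass', 'Fire', 'Water', 'Fire']})

types = pd.Categorical(pokeData['Type 1'])
# Add the '1' suffix to every category and assign the result back
types = types.rename_categories([c + '1' for c in types.categories])
pokeData['Type 1'] = types
print(pokeData['Type 1'].tolist())  # ['Grass1', 'Fire1', 'Water1', 'Fire1']
```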

4. Data types of columns
We can find the data types of the columns of a DataFrame using the dtypes attribute. We can further filter out the columns belonging to a particular data type, either by filtering the dtypes result directly or by using the select_dtypes function.

print(titanicData.dtypes)
print("\n\nColumns with object data type")
print(titanicData.dtypes[titanicData.dtypes == "object"])
# or, using select_dtypes:
print(results.select_dtypes(include=['float']).head())

Showing the dtypes of all columns and filtering out object-type columns
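A self-contained sketch of both filtering styles, on a made-up DataFrame; 'number' is a convenient dtype alias that covers both integer and float columns:

```python
import pandas as pd

# Hypothetical mixed-type DataFrame
df = pd.DataFrame({
    'name': ['AA', 'AAPL'],
    'price': [10.5, 150.2],
    'volume': [1000, 2000],
})

print(df.dtypes)
# Keep only floating-point columns
print(df.select_dtypes(include=['float']).columns.tolist())
# Keep all numeric columns (int and float alike)
print(df.select_dtypes(include=['number']).columns.tolist())
```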

5. Arithmetic operations on DataFrame rows data
We can use float or integer values in DataFrame columns to perform an arithmetic operation across all rows quite easily. We can apply operators to two columns directly, with syntax such as results['AA'] + results['AAPL'] to produce the row-wise sum of columns 'AA' and 'AAPL'. We can also use functions such as sum() to add several columns at once instead of typing them all out, which is quite useful on large data sets.

print("Before")
print(results.head())
results['Total'] = results.iloc[:, 1:10].sum(axis=1)
print("\n\nAfter")
print(results.head())

The 'Total' column is appended to the right of the existing results DataFrame. iloc[:, 1:10] selects columns 1 through 9 (the end of the slice is exclusive), and axis=1 tells sum() to add across columns, i.e. within each row.

‘Total’ column generation by adding columns 1 to 9
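The positional iloc slice can equally be written with explicit column names, which is more robust if the column order ever changes. A sketch with made-up stock columns:

```python
import pandas as pd

# Hypothetical stand-in for the stocks DataFrame
results = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02'],
    'AA':   [1.0, 2.0],
    'AAPL': [3.0, 4.0],
    'GE':   [5.0, 6.0],
})

# Same row-wise sum as iloc[:, 1:4], but by name
cols = ['AA', 'AAPL', 'GE']
results['Total'] = results[cols].sum(axis=1)
print(results['Total'].tolist())  # [9.0, 12.0]
```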

6. Rearranging DataFrame columns
Consider a typical use case where we sum the values of certain columns over all rows and show the result as a total. The newly added total column typically appears as the rightmost column of the DataFrame, but if we want to place it at the beginning of the columns being added, we can do that quite easily. Consider the example below for the stocks DataFrame we have been using.

allCols = list(results.columns.values)
print("All columns", allCols)
colsSummed = list(results.columns.values[1:10])
print("Columns to be summed", colsSummed)
print("\n\nBefore")
print(results.head())
results = results[[allCols[0]] + [allCols[-1]] + colsSummed]
print(results.head())

We can get all the column names of a DataFrame as a list using its columns.values attribute, then use list slicing to separate out the columns that were summed to produce the 'Total' column, stored here in the colsSummed variable.

We can then do a simple rearranging by mentioning the column names order in the DataFrame as shown above.

Shifting ‘Total’ column to desired position
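An alternative to rebuilding the full column list is the pop/insert pair, which moves one column without spelling out the rest. A sketch on a toy DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame with a 'Total' column already at the end
results = pd.DataFrame({
    'date': ['2024-01-01'],
    'AA': [1.0],
    'AAPL': [3.0],
    'Total': [4.0],
})

# Remove 'Total' from its current position and re-insert it at position 1
total = results.pop('Total')
results.insert(1, 'Total', total)
print(results.columns.tolist())  # ['date', 'Total', 'AA', 'AAPL']
```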

7. Describe function
When called on a DataFrame, the describe() function by default separates out the columns that hold integer or float values and gives us a snapshot of each: the number of rows, mean, standard deviation, minimum, maximum and so on. Calling describe() on object-type data instead gives different details: the number of values, the number of unique values, the most frequent value and its count. Let’s run a few describe() scenarios over the stocks data.

print(results.describe())
print(results['Unnamed: 0'].describe())
print(results.describe().loc["count"])

We first call the function on the overall DataFrame, then on an object column, and then retrieve only the 'count' values for all columns.
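The two behaviours can also be combined with include='all', which describes numeric and object columns in one table (NaN fills the statistics that do not apply to a column). A runnable sketch with made-up data:

```python
import pandas as pd

# Hypothetical mixed-type DataFrame
df = pd.DataFrame({
    'ticker': ['AA', 'AAPL', 'AA'],
    'price': [10.0, 150.0, 11.0],
})

print(df.describe())              # numeric columns only (default)
print(df['ticker'].describe())    # object column: count, unique, top, freq
print(df.describe(include='all').loc['count'])  # both at once
```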

8. Skiprows and nrows parameters while reading files into a DataFrame
We can skip rows of the file we are reading using the skiprows parameter.

results = pd.read_csv("stock_px.csv", skiprows = [0,2,3])
results = pd.read_csv("stock_px.csv", skiprows = range(3,30), nrows = 40)

We can either provide specific row numbers of the file to be skipped or a range of rows using the range() function.

The nrows parameter limits the number of rows read into the DataFrame. With nrows = 40, only the first 40 rows that survive the skiprows filter are read into the results DataFrame.

9. Finding out null values in a DataFrame
We can use the isnull() function to find the null values of a DataFrame. It returns a boolean table with True where values are null and False where values are present.

print(pd.isnull(results).head())
Printing out results of calling isnull() function on first 5 rows of DataFrame
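The boolean table becomes most useful when aggregated: summing it counts the nulls per column, and summing again gives the total. A sketch with made-up Titanic-style data:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with some missing values
df = pd.DataFrame({
    'Cabin': ['C85', np.nan, 'E46', np.nan],
    'Age': [22.0, 38.0, np.nan, 35.0],
})

print(df.isnull())               # boolean mask, True where null
print(df.isnull().sum())         # null count per column
print(df.isnull().sum().sum())   # total nulls in the DataFrame
```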

10. Shape of DataFrame
We can know the shape of the DataFrame, i.e. the number of rows x columns, using the shape attribute which returns a tuple of form (rows, columns).

results.shape

Calling the above attribute shall return the value (5472, 10) indicating 5472 rows and 10 columns in DataFrame results.

11. Finding unique values of a column
We can find all the unique values of a column using the unique() function, which returns an array of those unique values.

print(titanicData['Cabin'].unique())
An array of all unique values from ‘Cabin’ column
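Two related helpers are worth knowing alongside unique(): nunique() counts the distinct values (excluding NaN by default), and value_counts() shows how often each value occurs. A sketch on a toy 'Cabin' column:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature Titanic-style DataFrame
titanicData = pd.DataFrame({'Cabin': ['C85', 'C85', 'E46', np.nan]})

print(titanicData['Cabin'].unique())        # array of distinct values, NaN included
print(titanicData['Cabin'].nunique())       # count of distinct values, NaN excluded
print(titanicData['Cabin'].value_counts())  # frequency of each value
```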

All parts of the series:
Pandas Masterclass — Your Foundation To Data Science (Part 1: Basic Dataframe operations)
Pandas Masterclass — Your Foundation To Data Science (Part 2: DataFrame operations on special file formats)
Pandas Masterclass — Your Foundation To Data Science (Part 3: sorting, filtering and grouping DataFrame data and writing to files)
Pandas Masterclass — Your Foundation To Data Science (Part 4: pandas functions)
Pandas Masterclass — Your Foundation To Data Science (Part 5: Multi-index DataFrames, handling na values and Combining DataFrames)

--

Raghu Prodduturi
Analytics Vidhya

Programmer by education, business man by profession and entrepreneur by heart. Conflicting with a pinch of clarity.