Powerful Pandas functions for better data handling

Lekha Priya
Jul 20, 2022

Pandas is the gold-standard data manipulation library for all things data. With the functionality it offers for manipulating and handling data, it is a favorite among data scientists today, including me. When I started using it, I stuck to the very basic functions for everyday analysis: loading data, filtering a few rows, sorting, and indexing the DataFrame. After learning about pandas' lesser-known yet important functions that make handling data a whole lot easier and cleaner, I decided to list a few of my favorites.

1. Combining DataFrames

A relatively unknown fact about pandas DataFrames is that there are two different methods of combining them, and each produces different results. It is important to select the right one for what you want to achieve; each method also has parameters that can further customize the combination.

Let’s check them out.

Concatenate: It is common practice to split a large dataset into multiple sets, which is referred to as sharding. When loading these datasets into pandas, we can vertically stack the DataFrame of each shard to create one big DataFrame holding all of the data.

The code below shows how to concatenate DataFrames vertically in pandas.

# Vertical concat
pd.concat([october_df, november_df, december_df], axis=0)

Something similar can be done by splitting the data into columns instead of rows: a few columns for each CSV file (each with all the rows of the dataset).

This is like dividing the dataset's features into different shards and then horizontally stacking them to combine those columns/features.

# Horizontal concat
pd.concat([features_1to5_df, features_6to10_df, features_11to15_df], axis=1)
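As a minimal, self-contained sketch of both directions (with small made-up DataFrames standing in for the monthly shards and feature shards above):

```python
import pandas as pd

# Two small "shards" of the same dataset, split by rows
october_df = pd.DataFrame({"id": [1, 2], "sales": [100, 150]})
november_df = pd.DataFrame({"id": [3, 4], "sales": [120, 130]})

# Vertical concat: stack rows; ignore_index=True rebuilds the row index
combined = pd.concat([october_df, november_df], axis=0, ignore_index=True)
print(combined.shape)  # (4, 2)

# The same dataset split by columns instead
features_a = pd.DataFrame({"f1": [1, 2], "f2": [3, 4]})
features_b = pd.DataFrame({"f3": [5, 6]})

# Horizontal concat: stack columns side by side (rows matched by index)
wide = pd.concat([features_a, features_b], axis=1)
print(list(wide.columns))  # ['f1', 'f2', 'f3']
```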

Merging: Merging combines two DataFrames into a single one by joining them on some common attribute, using the merge function. Let's see how it works in the code below.

pd.merge(left=ids_and_time_df,
         right=ids_and_videos_df,
         on="id")

The left and right parameters refer to the two DataFrames you wish to merge, while on specifies the column to be used for the matching.

Further, we can use the how parameter to select the type of SQL-style join to perform: inner, outer, left, or right.
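A runnable sketch, with made-up data standing in for the hypothetical ids_and_time_df and ids_and_videos_df tables:

```python
import pandas as pd

# Two tables sharing an "id" column (illustrative data)
ids_and_time_df = pd.DataFrame({"id": [1, 2, 3], "minutes": [10, 20, 30]})
ids_and_videos_df = pd.DataFrame({"id": [2, 3, 4], "videos": [5, 7, 2]})

# Inner join (the default): keep only ids present in both frames
inner = pd.merge(left=ids_and_time_df, right=ids_and_videos_df, on="id")
print(list(inner["id"]))  # [2, 3]

# Left join: keep every row from the left frame; missing matches become NaN
left = pd.merge(left=ids_and_time_df, right=ids_and_videos_df,
                on="id", how="left")
print(len(left))  # 3
```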

2. Reshaping DataFrames

There are several ways to reshape a DataFrame using pandas. Let's look at the three most common methods.

Transpose: Transposing swaps a DataFrame's rows with its columns. The transpose() function (or its shorthand .T) transposes the index and columns, reflecting the DataFrame over its main diagonal by writing rows as columns and vice versa.

import pandas as pd

d1 = {'c1': [2, 3], 'c2': [4, 5]}
df1 = pd.DataFrame(data=d1)
print(df1)

df1_transposed = df1.T  # or df1.transpose()
print(df1_transposed)

Groupby

Groupby splits a DataFrame into multiple parts based on some key. Once the DataFrame is split up, we can loop through the parts and apply operations to each one independently.

Example: Given a DataFrame of players with corresponding years and points, applying a groupby splits the data up by player. Each player gets its own group, showing how many points that player scored in each year they were active.

groups_df = df.groupby('Player')

for player, group in groups_df:
    print("----- {} -----".format(player))
    print(group)
    print("")

### This prints out the following
"""
----- Batman -----
Player Year Points
1 Batman 2000 43
3 Batman 2001 65
6 Batman 2002 23
9 Batman 2004 76

----- Black Widow -----
Player Year Points
8 Black Widow 2003 89

----- Superman -----
Player Year Points
0 Superman 2000 23
5 Superman 2002 34
11 Superman 2005 87

----- Thanos -----
Player Year Points
2 Thanos 2000 45
4 Thanos 2001 76
7 Thanos 2002 78
10 Thanos 2004 92

"""

Stacking: Stacking transforms the DataFrame into one with a multi-level index, i.e., each row has multiple sub-parts. These sub-parts are created from the DataFrame's columns, compressed into the multi-index. Overall, stacking can be thought of as compressing columns into multi-index rows.

This is best illustrated by an example, shown down below.

df = df.stack()

print(df)

"""
0 Player Superman
Year 2000
Points 23
1 Player Batman
Year 2000
Points 43
2 Player Thanos
Year 2000
Points 45
3 Player Batman
Year 2001
Points 65
4 Player Thanos
Year 2001
Points 76
5 Player Superman
Year 2002
Points 34
6 Player Batman
Year 2002
Points 23
7 Player Thanos
Year 2002
Points 78
8 Player Black Widow
Year 2003
Points 89
9 Player Batman
Year 2004
Points 76
10 Player Thanos
Year 2004
Points 92
11 Player Superman
Year 2005
Points 87

"""

3. Working with time data

Pandas comes with a convenient function called to_datetime() that can combine multiple DataFrame columns into a single datetime column.

To use the to_datetime() function this way, we pass it the "Day", "Month", and "Year" columns. Once we have things in datetime format, we no longer need the original columns and can simply drop them.

The code below illustrates how it all works.

from itertools import product
import pandas as pd
import numpy as np

col_names = ["Day", "Month", "Year"]

df = pd.DataFrame(list(product([10, 11, 12], [8, 9], [2018, 2019])),
                  columns=col_names)

df['data'] = np.random.randn(len(df))

df = df.sort_values(['Year', 'Month'], ascending=[True, True])

df.insert(loc=0, column="date", value=pd.to_datetime(df[col_names]))
df = df.drop(col_names, axis=1).squeeze()

print(df)

"""
date data
0 2018-08-10 -0.328973
4 2018-08-11 -0.670790
8 2018-08-12 -1.360565
2 2018-09-10 -0.401973
6 2018-09-11 -1.238754
10 2018-09-12 0.957695
1 2019-08-10 0.571126
5 2019-08-11 -1.320735
9 2019-08-12 0.196036
3 2019-09-10 -1.717800
7 2019-09-11 0.074606
11 2019-09-12 -0.643198
"""

4. Mapping into Groups

Mapping is a method that helps organize categorical data. To categorize a column in a DataFrame with thousands of rows, we first put the column's items into a pandas Series, create a dictionary describing the mapping we want, and then apply it. If the DataFrame is huge, say 1,000,000 items, looping through each item one by one is impractically slow.

Instead of a basic for-loop, we can write a function that uses pandas' built-in .map() method to perform the mapping in an optimized way.

Check out the code below to see the function and how it’s applied.

def membership_map(pandas_series, groups_dict):
    # Invert the dict: each item maps to the name of its group
    groups = {x: k for k, v in groups_dict.items() for x in v}
    mapped_series = pandas_series.map(groups)
    return mapped_series

mapped_data = membership_map(foods, groups_dict)
print(list(mapped_data))

In the code above, we first loop through our dictionary to build a new dictionary where the keys are every possible item in the pandas Series and the values are the mapped group names. Then we simply apply pandas' built-in map function to map all of the values in the Series.

Check out the output below to see the results!

['Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein']
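For reference, here is a self-contained sketch of the same idea. The foods Series and groups_dict below are made up, since the article doesn't show the original data:

```python
import pandas as pd

def membership_map(pandas_series, groups_dict):
    # Invert the groups dict so each item points at its group name
    groups = {x: k for k, v in groups_dict.items() for x in v}
    return pandas_series.map(groups)

# Hypothetical sample data
foods = pd.Series(["bread", "rice", "chicken", "egg", "pasta"])
groups_dict = {
    "Carbs": ["bread", "rice", "pasta"],
    "Protein": ["chicken", "egg"],
}

mapped = list(membership_map(foods, groups_dict))
print(mapped)  # ['Carbs', 'Carbs', 'Protein', 'Protein', 'Carbs']
```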

I hope you enjoyed this article. Let me know your views in the comments section. Looking forward to the next topic soon.
