Powerful Pandas functions for better data handling

Lekha Priya
Jul 20, 2022

Pandas is the gold-standard data manipulation library for all things data. With the functionality it offers for manipulating and handling data, it is a favorite among data scientists today, including me. When I started using it, I stuck to the very basic functions for everyday analysis: loading data, filtering a few rows, sorting, and indexing the DataFrame. After learning about pandas' lesser-known yet important functions that make handling data a whole lot easier and cleaner, I decided to list a few of my favorites.

1. Combining DataFrames

A relatively unknown fact about pandas DataFrames is that there are two different methods of combining them, and each produces different results. It is important to select the right one for what you want to achieve; each method also has parameters that can further customize the combination.

Let’s check them out.

Concatenate: It is common practice to split a large dataset into multiple sets, which is referred to as sharding. When loading these datasets into pandas, we can vertically stack the DataFrame of each shard to create one big DataFrame holding all of the data.

The code below shows how to concatenate DataFrames vertically in pandas.

# Vertical concat
pd.concat([october_df, november_df, december_df], axis=0)

Something similar can be done by splitting the data into columns instead of rows: a few columns for each CSV file (each with all the rows of the dataset).

This is like dividing the dataset's features into different shards and then horizontally stacking them to combine those columns/features.

# Horizontal concat
pd.concat([features_1to5_df, features_6to10_df, features_11to15_df], axis=1)
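As a minimal, self-contained sketch of both directions (with small made-up DataFrames standing in for the monthly shards and feature shards above):

```python
import pandas as pd

# Two small "shards" of the same dataset, split by rows
october_df = pd.DataFrame({"id": [1, 2], "sales": [100, 150]})
november_df = pd.DataFrame({"id": [3, 4], "sales": [120, 130]})

# Vertical concat: stack rows; ignore_index=True rebuilds the row index
combined = pd.concat([october_df, november_df], axis=0, ignore_index=True)
print(combined.shape)  # (4, 2)

# The same dataset split by columns instead
features_a = pd.DataFrame({"f1": [1, 2], "f2": [3, 4]})
features_b = pd.DataFrame({"f3": [5, 6]})

# Horizontal concat: stack columns side by side (rows matched by index)
wide = pd.concat([features_a, features_b], axis=1)
print(list(wide.columns))  # ['f1', 'f2', 'f3']
```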

Merging: Merging combines two DataFrames into a single one by joining them on some common attribute, using the merge function. Let's see how it works in the code below.

pd.merge(left=ids_and_time_df,
         right=ids_and_videos_df,
         on="id")

The left and right parameters refer to the two DataFrames you wish to merge, while on specifies the column to be used for the matching.

Further, we can use the how parameter to select the type of SQL-style join to perform: inner, outer, left, or right.
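A runnable sketch, with made-up data standing in for the hypothetical ids_and_time_df and ids_and_videos_df tables:

```python
import pandas as pd

# Two tables sharing an "id" column (illustrative data)
ids_and_time_df = pd.DataFrame({"id": [1, 2, 3], "minutes": [10, 20, 30]})
ids_and_videos_df = pd.DataFrame({"id": [2, 3, 4], "videos": [5, 7, 2]})

# Inner join (the default): keep only ids present in both frames
inner = pd.merge(left=ids_and_time_df, right=ids_and_videos_df, on="id")
print(list(inner["id"]))  # [2, 3]

# Left join: keep every row from the left frame; missing matches become NaN
left = pd.merge(left=ids_and_time_df, right=ids_and_videos_df,
                on="id", how="left")
print(len(left))  # 3
```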

2. Reshaping DataFrames

There are several ways to reshape a DataFrame using pandas. Let's look at the three most common methods.

Transpose: Transposing swaps a DataFrame's rows with its columns. The transpose() function (or its shorthand .T) transposes the index and columns, reflecting the DataFrame over its main diagonal by writing rows as columns and vice versa.

import pandas as pd

d1 = {'c1': [2, 3], 'c2': [4, 5]}
df1 = pd.DataFrame(data=d1)
print(df1)

df1_transposed = df1.T  # or df1.transpose()
print(df1_transposed)

Groupby

Groupby splits a DataFrame into multiple parts based on some key. Once the DataFrame is split up, we can loop through the parts and apply operations to each one independently.

Example: Given a DataFrame of players with corresponding years and points, applying a groupby splits the data up by player. Each player gets its own group, showing how many points that player scored in each year they were active.

groups_df = df.groupby('Player')

for player, group in groups_df:
    print("----- {} -----".format(player))
    print(group)
    print("")

### This prints out the following
"""
----- Batman -----
Player Year Points
1 Batman 2000 43
3 Batman 2001 65
6 Batman 2002 23
9 Batman 2004 76

----- Black Widow -----
Player Year Points
8 Black Widow 2003 89

----- Superman -----
Player Year Points
0 Superman 2000 23
5 Superman 2002 34
11 Superman 2005 87

----- Thanos -----
Player Year Points
2 Thanos 2000 45
4 Thanos 2001 76
7 Thanos 2002 78
10 Thanos 2004 92

"""

Stacking: Stacking transforms the DataFrame into one with a multi-level index, i.e., each row has multiple sub-parts. These sub-parts are created from the DataFrame's columns, compressed into the multi-index. Overall, stacking can be thought of as compressing columns into multi-index rows.

This is best illustrated by an example, shown down below.

df = df.stack()

print(df)

"""
0 Player Superman
Year 2000
Points 23
1 Player Batman
Year 2000
Points 43
2 Player Thanos
Year 2000
Points 45
3 Player Batman
Year 2001
Points 65
4 Player Thanos
Year 2001
Points 76
5 Player Superman
Year 2002
Points 34
6 Player Batman
Year 2002
Points 23
7 Player Thanos
Year 2002
Points 78
8 Player Black Widow
Year 2003
Points 89
9 Player Batman
Year 2004
Points 76
10 Player Thanos
Year 2004
Points 92
11 Player Superman
Year 2005
Points 87

"""

3. Working with time data

Pandas comes with a convenient function called to_datetime() that can combine multiple DataFrame columns into a single datetime column.

To use the to_datetime() function this way, we pass it the "Day", "Month", and "Year" columns. Once we have things in datetime format, we no longer need the original columns and can simply drop them.

The code below illustrates how it all works.

from itertools import product
import pandas as pd
import numpy as np

col_names = ["Day", "Month", "Year"]

df = pd.DataFrame(list(product([10, 11, 12], [8, 9], [2018, 2019])),
                  columns=col_names)

df['data'] = np.random.randn(len(df))

df = df.sort_values(['Year', 'Month'], ascending=[True, True])

df.insert(loc=0, column="date", value=pd.to_datetime(df[col_names]))
df = df.drop(col_names, axis=1).squeeze()

print(df)

"""
date data
0 2018-08-10 -0.328973
4 2018-08-11 -0.670790
8 2018-08-12 -1.360565
2 2018-09-10 -0.401973
6 2018-09-11 -1.238754
10 2018-09-12 0.957695
1 2019-08-10 0.571126
5 2019-08-11 -1.320735
9 2019-08-12 0.196036
3 2019-09-10 -1.717800
7 2019-09-11 0.074606
11 2019-09-12 -0.643198
"""

4. Mapping into Groups

Mapping is a method that helps organize categorical data. To categorize a column in a DataFrame with thousands of rows, we first put the column's items into a pandas Series, create a dictionary describing the mapping we want, and then apply it. If the DataFrame is huge, say 1,000,000 items, looping through each item one by one is impractically slow.

Instead of a basic for-loop, we can write a function that uses pandas' built-in .map() method to perform the mapping in an optimized way.

Check out the code below to see the function and how it’s applied.

def membership_map(pandas_series, groups_dict):
    # Invert the dict: each item maps to the name of its group
    groups = {x: k for k, v in groups_dict.items() for x in v}
    mapped_series = pandas_series.map(groups)
    return mapped_series

mapped_data = membership_map(foods, groups_dict)
print(list(mapped_data))

In the code above, we first loop through our dictionary to build a new dictionary where the keys are every possible item in the pandas Series and the values are the mapped group names. Then we simply apply pandas' built-in map function to map all of the values in the Series.

Check out the output below to see the results!

['Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein']
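For reference, here is a self-contained sketch of the same idea. The foods Series and groups_dict below are made up, since the article doesn't show the original data:

```python
import pandas as pd

def membership_map(pandas_series, groups_dict):
    # Invert the groups dict so each item points at its group name
    groups = {x: k for k, v in groups_dict.items() for x in v}
    return pandas_series.map(groups)

# Hypothetical sample data
foods = pd.Series(["bread", "rice", "chicken", "egg", "pasta"])
groups_dict = {
    "Carbs": ["bread", "rice", "pasta"],
    "Protein": ["chicken", "egg"],
}

mapped = list(membership_map(foods, groups_dict))
print(mapped)  # ['Carbs', 'Carbs', 'Protein', 'Protein', 'Carbs']
```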

I hope you enjoyed this article. Let me know your views in the comments section. Looking forward to the next topic soon.
