Left Join with Pandas Data Frames in Python

Merging Pandas data frames is covered extensively in a StackOverflow article Pandas Merging 101. However, my experience of grading data science take-home tests leads me to believe that left joins remain to be a challenge for many people. In this post, I show how to properly handle cases when the right table (data frame) in a Pandas left join contains nulls.

Let’s consider a scenario where we have a table transactions containing transactions performed by some users and a table users containing some user properties, for example, their favorite color. We want to annotate the transactions with the users’ properties. Here are the data frames:

import numpy as np
import pandas as pd

np.random.seed(0)
# transactions
left = pd.DataFrame({'transaction_id': ['A', 'B', 'C', 'D'],
                     'user_id': ['Peter', 'John', 'John', 'Anna'],
                     'value': np.random.randn(4),
                    })

# users
right = pd.DataFrame({'user_id': ['Paul', 'Mary', 'John', 'Anna'],
                      'favorite_color': ['blue', 'blue', 'red', np.NaN],
                     })

Note that Peter is not in the users table and Anna doesn’t have a favorite color.

>>> left
  transaction_id user_id     value
0              A   Peter  1.867558
1              B    John -0.977278
2              C    John  0.950088
3              D    Anna -0.151357

>>> right
  user_id favorite_color
0    Paul           blue
1    Mary           blue
2    John            red
3    Anna            NaN

Adding the user’s favorite color to the transaction table seems straightforward using a left join on the user id:

>>> left.merge(right, on='user_id', how='left')
  transaction_id user_id     value favorite_color
0              A   Peter  1.867558            NaN
1              B    John -0.977278            red
2              C    John  0.950088            red
3              D    Anna -0.151357            NaN

We see that Peter and Anna have NaNs in the favorite_color column. However, the missing values are there for two different reasons: Peter’s record didn’t have a match in the users table, while Anna didn’t have a value for the favorite color. In some cases, this subtle difference is important. For example, it can be critical to understanding the data during initial exploration and to improving data quality.

Here are two simple methods to track the differences in why a value is missing in the result of a left join. The first is provided directly by the merge function through the indicator parameter. When set to True, the resulting data frame has an additional column _merge:

>>> left.merge(right, on='user_id', how='left', indicator=True)
  transaction_id user_id     value favorite_color     _merge
0              A   Peter  1.867558            NaN  left_only
1              B    John -0.977278            red       both
2              C    John  0.950088            red       both
3              D    Anna -0.151357            NaN       both

The second method is related to how it would be done in the SQL world and explicitly adds a column representing the user_id in the right table. We note that if the join columns in the two tables have different names, both columns appear in the resulting data frame, so we rename the user_id column in the users table before merging.

>>> left.merge(right.rename({'user_id': 'user_id_r'}, axis=1),
               left_on='user_id', right_on='user_id_r', how='left')
  transaction_id user_id     value user_id_r favorite_color
0              A   Peter  1.867558       NaN            NaN
1              B    John -0.977278      John            red
2              C    John  0.950088      John            red
3              D    Anna -0.151357      Anna            NaN

An equivalent SQL query is

select
    t.transaction_id
    , t.user_id
    , t.value
    , u.user_id as user_id_r
    , u.favorite_color
from
    transactions t
    left join
    users u
    on t.user_id = u.user_id
;

In conclusion, adding an extra column that indicates whether there was a match in the Pandas left join allows us to subsequently treat the missing values for the favorite color differently depending on whether the user was known but didn’t have a favorite color or the user was missing from the users table.

Photo by Ilona Froehlich on Unsplash.

Copyright (c) 2018-2020 Sergei Izrailev. All opinions are my own.