Join, Merge, Append and Concatenate

Table of Contents

Working with multiple data frames often involves joining two or more tables to in bring out more no. of columns from another table by joining on some sort of relationship which exists within a table or appending two tables which is adding one or more table over another table with keeping the same order of columns.


Example of append data -> monthly files of revenue sheets of a company and wee need at end of the year to be clubbed into single table.

alt text      alt text

alt text

alt text


Example of merging -> multiple files regarding employee education, compensation, performance all linked to each other in some identifier in each one of them which maps to employee master table and for doing analysis we need data from each of these tables in the same which can be achieved by merging.


We’ll look out for merging/joining two tables now and later will discuss the possibilities around appending to tables using pandas.
To begin with let’s get create some dummy datasets.

import pandas as pd
states_codes = pd.DataFrame({'State': ['Haryana', 'Punjab', 'Rajasthan', 'Uttar Pradesh', 'Madhya Pradesh'],
                        'Code': ['HR', 'PB', 'RJ', 'UP', 'MP']})
states_area = pd.DataFrame({'State': ['Haryana', 'Punjab',  'Uttar Pradesh', 'Bihar'],
                        'Area_InSquareKM': [44212, 50362, 243290, 94165]})
# a dummy data file
states_literacyrate = pd.read_csv('literacy.csv')
states_codes
State Code
0 Haryana HR
1 Punjab PB
2 Rajasthan RJ
3 Uttar Pradesh UP
4 Madhya Pradesh MP
```python states_area ```
State Area_InSquareKM
0 Haryana 44212
1 Punjab 50362
2 Uttar Pradesh 243290
3 Bihar 94165
```python states_literacyrate ```
State Year Literacy Rate
0 Haryana 2011 76.64
1 Rajasthan 2011 67.06
2 Uttar Pradesh 2011 69.72
3 Haryana 2001 67.91
4 Uttar Pradesh 2001 56.27
5 Rajasthan 2001 60.41
As you can see State column repeats across all three tables, meaning to say that in case it is required to pull out from these three table
# Merge vs Join
area_codes_inner = pd.merge(states_area,states_codes)
area_codes_inner
State Area_InSquareKM Code
0 Haryana 44212 HR
1 Punjab 50362 PB
2 Uttar Pradesh 243290 UP
```python results_inner = pd.merge(area_codes_inner,states_literacyrate) results_inner ```
State Area_InSquareKM Code Year Literacy Rate
0 Haryana 44212 HR 2011 76.64
1 Haryana 44212 HR 2001 67.91
2 Uttar Pradesh 243290 UP 2011 69.72
3 Uttar Pradesh 243290 UP 2001 56.27
```python area_codes_left = pd.merge(states_area,states_codes,how ='left') area_codes_left ```
State Area_InSquareKM Code
0 Haryana 44212 HR
1 Punjab 50362 PB
2 Uttar Pradesh 243290 UP
3 Bihar 94165 NaN
```python results_left = pd.merge(area_codes_left,states_literacyrate,how ='left') results_left ```
State Area_InSquareKM Code Year Literacy Rate
0 Haryana 44212 HR 2011.0 76.64
1 Haryana 44212 HR 2001.0 67.91
2 Punjab 50362 PB NaN NaN
3 Uttar Pradesh 243290 UP 2011.0 69.72
4 Uttar Pradesh 243290 UP 2001.0 56.27
5 Bihar 94165 NaN NaN NaN
```python area_codes_right = pd.merge(states_area,states_codes,how ='right') area_codes_right ```
State Area_InSquareKM Code
0 Haryana 44212.0 HR
1 Punjab 50362.0 PB
2 Uttar Pradesh 243290.0 UP
3 Rajasthan NaN RJ
4 Madhya Pradesh NaN MP
```python results_right = pd.merge(area_codes_right,states_literacyrate,how ='right') results_right ```
State Area_InSquareKM Code Year Literacy Rate
0 Haryana 44212.0 HR 2011 76.64
1 Haryana 44212.0 HR 2001 67.91
2 Uttar Pradesh 243290.0 UP 2011 69.72
3 Uttar Pradesh 243290.0 UP 2001 56.27
4 Rajasthan NaN RJ 2011 67.06
5 Rajasthan NaN RJ 2001 60.41
```python area_codes_outer = pd.merge(states_area,states_codes, how = 'outer') area_codes_outer ```
State Area_InSquareKM Code
0 Haryana 44212.0 HR
1 Punjab 50362.0 PB
2 Uttar Pradesh 243290.0 UP
3 Bihar 94165.0 NaN
4 Rajasthan NaN RJ
5 Madhya Pradesh NaN MP
```python results_outer = pd.merge(area_codes_outer,states_literacyrate,how ='outer') results_outer ```
State Area_InSquareKM Code Year Literacy Rate
0 Haryana 44212.0 HR 2011.0 76.64
1 Haryana 44212.0 HR 2001.0 67.91
2 Punjab 50362.0 PB NaN NaN
3 Uttar Pradesh 243290.0 UP 2011.0 69.72
4 Uttar Pradesh 243290.0 UP 2001.0 56.27
5 Bihar 94165.0 NaN NaN NaN
6 Rajasthan NaN RJ 2011.0 67.06
7 Rajasthan NaN RJ 2001.0 60.41
8 Madhya Pradesh NaN MP NaN NaN
other options in merge function are - left_on right_on on left_index right_index sort

a detailed description is can be found here - https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

states_area
State Area_InSquareKM
0 Haryana 44212
1 Punjab 50362
2 Uttar Pradesh 243290
3 Bihar 94165
```python states_codes ```
State Code
0 Haryana HR
1 Punjab PB
2 Rajasthan RJ
3 Uttar Pradesh UP
4 Madhya Pradesh MP
-- .merge : For column(s)-on-columns(s) operations -- .join : Join DataFrames using their indexes., if need to be on specific keys, then set keys to be the index
states_area.join(states_literacyrate,lsuffix='_FromArea',rsuffix='_FromLiteracy', how='outer')
State_FromArea Area_InSquareKM State_FromLiteracy Year Literacy Rate
0 Haryana 44212.0 Haryana 2011 76.64
1 Punjab 50362.0 Rajasthan 2011 67.06
2 Uttar Pradesh 243290.0 Uttar Pradesh 2011 69.72
3 Bihar 94165.0 Haryana 2001 67.91
4 NaN NaN Uttar Pradesh 2001 56.27
5 NaN NaN Rajasthan 2001 60.41
```python pd.merge(states_area,states_literacyrate, left_index=True, right_index=True, how='outer') ```
State_x Area_InSquareKM State_y Year Literacy Rate
0 Haryana 44212.0 Haryana 2011 76.64
1 Punjab 50362.0 Rajasthan 2011 67.06
2 Uttar Pradesh 243290.0 Uttar Pradesh 2011 69.72
3 Bihar 94165.0 Haryana 2001 67.91
4 NaN NaN Uttar Pradesh 2001 56.27
5 NaN NaN Rajasthan 2001 60.41
```python ## merge on multiple columns ```
city_states_literacy = pd.read_csv('City_States_Literacy.csv')
city_states_area = pd.read_csv('City_States_Area.csv')
city_states_area
State City Area_InSquareKM
0 Haryana 1 983
1 Haryana 2 886
2 Rajasthan 1 881
3 Rajasthan 2 980
4 Rajasthan 3 895
5 Rajasthan 4 1010
6 Rajasthan 5 1075
7 Bihar 1 802
8 Bihar 2 859
9 Bihar 3 1020
10 Uttar Pradesh 4 945
11 Uttar Pradesh 5 787
12 Madhya Pradhes 6 983
```python results_inner = pd.merge(city_states_literacy,city_states_area, left_on=['State','City'], right_on=['State','City']) results_inner ```
State City Literacy Rate Area_InSquareKM
0 Haryana 1 74.78 983
1 Haryana 2 63.80 886
2 Rajasthan 1 61.68 881
3 Rajasthan 2 68.61 980
4 Rajasthan 3 70.78 895
5 Rajasthan 4 75.80 1010
6 Rajasthan 5 80.65 1075
7 Uttar Pradesh 4 73.77 945
8 Uttar Pradesh 5 60.61 787
```python pd.merge(city_states_literacy,city_states_area, left_on=['State','City'], right_on=['State','City']) ```
State City Literacy Rate Area_InSquareKM
0 Haryana 1 74.78 983
1 Haryana 2 63.80 886
2 Rajasthan 1 61.68 881
3 Rajasthan 2 68.61 980
4 Rajasthan 3 70.78 895
5 Rajasthan 4 75.80 1010
6 Rajasthan 5 80.65 1075
7 Uttar Pradesh 4 73.77 945
8 Uttar Pradesh 5 60.61 787
---

Related Posts

Read data into pandas

December 23, 2018

Read More
Subsetting a dataframe in pandas

January 5, 2019

Read More
Connect to azure datalake store using python

December 20, 2018

Read More