Combining data sources#
Data can come from different files and you can combine them in a single dataframe.
When working with data, storing data across multiple files or data sources is common. Combining data from different files or data sources into a single dataframe can be useful for data analysis and modelling.
In Pandas, combining data from additional files or sources is done using the merge or join functions. These functions allow us to connect data based on a common key or index, which can be helpful when analyzing related data.
This notebook will explore how to merge or join data from different files or sources into a single Pandas dataframe. We will also discuss different types of joins and merge operations and how to handle missing or duplicate data during the merging process.
How To#
import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
df_geo = df[["latitude", "longitude", "ocean_proximity"]]
df_geo.head()
latitude | longitude | ocean_proximity | |
---|---|---|---|
0 | 37.88 | -122.23 | NEAR BAY |
1 | 37.86 | -122.22 | NEAR BAY |
2 | 37.85 | -122.24 | NEAR BAY |
3 | 37.85 | -122.25 | NEAR BAY |
4 | 37.85 | -122.25 | NEAR BAY |
df_geo.join(df["median_house_value"])
latitude | longitude | ocean_proximity | median_house_value | |
---|---|---|---|---|
0 | 37.88 | -122.23 | NEAR BAY | 452600.0 |
1 | 37.86 | -122.22 | NEAR BAY | 358500.0 |
2 | 37.85 | -122.24 | NEAR BAY | 352100.0 |
3 | 37.85 | -122.25 | NEAR BAY | 341300.0 |
4 | 37.85 | -122.25 | NEAR BAY | 342200.0 |
... | ... | ... | ... | ... |
20635 | 39.48 | -121.09 | INLAND | 78100.0 |
20636 | 39.49 | -121.21 | INLAND | 77100.0 |
20637 | 39.43 | -121.22 | INLAND | 92300.0 |
20638 | 39.43 | -121.32 | INLAND | 84700.0 |
20639 | 39.37 | -121.24 | INLAND | 89400.0 |
20640 rows × 4 columns
df_price = df[["longitude", "latitude", "median_house_value"]]
df_geo.merge(df_price, left_on="latitude", right_on="latitude")
latitude | longitude_x | ocean_proximity | longitude_y | median_house_value | |
---|---|---|---|---|---|
0 | 37.88 | -122.23 | NEAR BAY | -122.23 | 452600.0 |
1 | 37.88 | -122.23 | NEAR BAY | -122.34 | 350000.0 |
2 | 37.88 | -122.23 | NEAR BAY | -122.29 | 216700.0 |
3 | 37.88 | -122.23 | NEAR BAY | -122.28 | 261300.0 |
4 | 37.88 | -122.23 | NEAR BAY | -122.26 | 391800.0 |
... | ... | ... | ... | ... | ... |
1775707 | 35.96 | -119.04 | INLAND | -119.04 | 51700.0 |
1775708 | 35.86 | -119.46 | INLAND | -119.46 | 42700.0 |
1775709 | 35.85 | -119.12 | INLAND | -119.12 | 58300.0 |
1775710 | 35.89 | -119.27 | INLAND | -119.27 | 53300.0 |
1775711 | 35.88 | -119.27 | INLAND | -119.27 | 43700.0 |
1775712 rows × 5 columns
pd.concat([df_price, df_geo], join="inner", axis=1)
longitude | latitude | median_house_value | latitude | longitude | ocean_proximity | |
---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 452600.0 | 37.88 | -122.23 | NEAR BAY |
1 | -122.22 | 37.86 | 358500.0 | 37.86 | -122.22 | NEAR BAY |
2 | -122.24 | 37.85 | 352100.0 | 37.85 | -122.24 | NEAR BAY |
3 | -122.25 | 37.85 | 341300.0 | 37.85 | -122.25 | NEAR BAY |
4 | -122.25 | 37.85 | 342200.0 | 37.85 | -122.25 | NEAR BAY |
... | ... | ... | ... | ... | ... | ... |
20635 | -121.09 | 39.48 | 78100.0 | 39.48 | -121.09 | INLAND |
20636 | -121.21 | 39.49 | 77100.0 | 39.49 | -121.21 | INLAND |
20637 | -121.22 | 39.43 | 92300.0 | 39.43 | -121.22 | INLAND |
20638 | -121.32 | 39.43 | 84700.0 | 39.43 | -121.32 | INLAND |
20639 | -121.24 | 39.37 | 89400.0 | 39.37 | -121.24 | INLAND |
20640 rows × 6 columns
df_geo.append(df_geo.sample(1000))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_2260/902374054.py in ?()
----> 1 df_geo.append(df_geo.sample(1000))
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/generic.py in ?(self, name)
5985 and name not in self._accessors
5986 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5987 ):
5988 return self[name]
-> 5989 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'
Exercise#
Familiarize yourself with the merge operations and try how inner
, left
, right
, and outer
change the merge results.
pd.concat([df_price, df_geo], join="inner", axis=1)
longitude | latitude | median_house_value | latitude | longitude | ocean_proximity | |
---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 452600.0 | 37.88 | -122.23 | NEAR BAY |
1 | -122.22 | 37.86 | 358500.0 | 37.86 | -122.22 | NEAR BAY |
2 | -122.24 | 37.85 | 352100.0 | 37.85 | -122.24 | NEAR BAY |
3 | -122.25 | 37.85 | 341300.0 | 37.85 | -122.25 | NEAR BAY |
4 | -122.25 | 37.85 | 342200.0 | 37.85 | -122.25 | NEAR BAY |
... | ... | ... | ... | ... | ... | ... |
20635 | -121.09 | 39.48 | 78100.0 | 39.48 | -121.09 | INLAND |
20636 | -121.21 | 39.49 | 77100.0 | 39.49 | -121.21 | INLAND |
20637 | -121.22 | 39.43 | 92300.0 | 39.43 | -121.22 | INLAND |
20638 | -121.32 | 39.43 | 84700.0 | 39.43 | -121.32 | INLAND |
20639 | -121.24 | 39.37 | 89400.0 | 39.37 | -121.24 | INLAND |
20640 rows × 6 columns