Formatting and deduping data#
Formatting columns and removing duplicates is an important part of data preparation.
Preparing data for analysis is a crucial step in any data science project. One aspect of data preparation is formatting columns and removing duplicates. Inaccurate or inconsistent formatting of columns can make it difficult to analyze data or even result in incorrect results. Similarly, duplicate data can skew analysis and lead to inaccurate conclusions.
This notebook will explore how to format columns in Pandas dataframes to ensure data accuracy and consistency. We will also discuss detecting and removing duplicate data and handling missing values in columns. These techniques ensure data is adequately prepared for analysis and modelling, leading to more accurate and reliable results.
How To#
import pandas as pd
df = pd.read_csv("data/housing.csv", dtype={"housing_median_age": int,"ocean_proximity": "category"})
df.dtypes
longitude float64
latitude float64
housing_median_age int64
total_rooms float64
total_bedrooms float64
population float64
households float64
median_income float64
median_house_value float64
ocean_proximity category
dtype: object
df.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
int_cols = ["total_rooms", "population", "households", "median_house_value"]
df[int_cols] = df[int_cols].astype(int)
df.dtypes
longitude float64
latitude float64
housing_median_age int64
total_rooms int64
total_bedrooms float64
population int64
households int64
median_income float64
median_house_value int64
ocean_proximity category
dtype: object
De-duplicating data#
df.duplicated()
0 False
1 False
2 False
3 False
4 False
...
20635 False
20636 False
20637 False
20638 False
20639 False
Length: 20640, dtype: bool
df.ocean_proximity.duplicated("last")
0 True
1 True
2 True
3 True
4 True
...
20635 True
20636 True
20637 True
20638 True
20639 False
Name: ocean_proximity, Length: 20640, dtype: bool
df.append(df.sample(5)).duplicated()
0 False
1 False
2 False
3 False
4 False
...
12292 True
15857 True
16022 True
10211 True
10793 True
Length: 20645, dtype: bool
df_dup = df.append(df.sample(5))
df_dup.duplicated()
0 False
1 False
2 False
3 False
4 False
...
4667 True
19131 True
16386 True
11879 True
14716 True
Length: 20645, dtype: bool
df_dup[~df_dup.duplicated()]
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41 | 880 | 129.0 | 322 | 126 | 8.3252 | 452600 | NEAR BAY |
1 | -122.22 | 37.86 | 21 | 7099 | 1106.0 | 2401 | 1138 | 8.3014 | 358500 | NEAR BAY |
2 | -122.24 | 37.85 | 52 | 1467 | 190.0 | 496 | 177 | 7.2574 | 352100 | NEAR BAY |
3 | -122.25 | 37.85 | 52 | 1274 | 235.0 | 558 | 219 | 5.6431 | 341300 | NEAR BAY |
4 | -122.25 | 37.85 | 52 | 1627 | 280.0 | 565 | 259 | 3.8462 | 342200 | NEAR BAY |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20635 | -121.09 | 39.48 | 25 | 1665 | 374.0 | 845 | 330 | 1.5603 | 78100 | INLAND |
20636 | -121.21 | 39.49 | 18 | 697 | 150.0 | 356 | 114 | 2.5568 | 77100 | INLAND |
20637 | -121.22 | 39.43 | 17 | 2254 | 485.0 | 1007 | 433 | 1.7000 | 92300 | INLAND |
20638 | -121.32 | 39.43 | 18 | 1860 | 409.0 | 741 | 349 | 1.8672 | 84700 | INLAND |
20639 | -121.24 | 39.37 | 16 | 2785 | 616.0 | 1387 | 530 | 2.3886 | 89400 | INLAND |
20640 rows × 10 columns
Exercise#
Try generating unique values in the median age from the dataset.
df[...]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:2646, in Index.get_loc(self, key, method, tolerance)
2645 try:
-> 2646 return self._engine.get_loc(key)
2647 except KeyError:
File pandas/_libs/index.pyx:111, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:1619, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:1627, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: Ellipsis
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
Cell In[13], line 1
----> 1 df[...]
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/frame.py:2800, in DataFrame.__getitem__(self, key)
2798 if self.columns.nlevels > 1:
2799 return self._getitem_multilevel(key)
-> 2800 indexer = self.columns.get_loc(key)
2801 if is_integer(indexer):
2802 indexer = [indexer]
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:2648, in Index.get_loc(self, key, method, tolerance)
2646 return self._engine.get_loc(key)
2647 except KeyError:
-> 2648 return self._engine.get_loc(self._maybe_cast_indexer(key))
2649 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2650 if indexer.ndim > 1 or indexer.size > 1:
File pandas/_libs/index.pyx:111, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:1619, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:1627, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: Ellipsis