Formatting and deduping data#

Formatting columns and removing duplicates is an important part of data preparation.

Preparing data for analysis is a crucial step in any data science project. One aspect of data preparation is formatting columns and removing duplicates. Inaccurate or inconsistent formatting of columns can make it difficult to analyze data or even result in incorrect results. Similarly, duplicate data can skew analysis and lead to inaccurate conclusions.

This notebook will explore how to format columns in Pandas dataframes to ensure data accuracy and consistency. We will also discuss detecting and removing duplicate data and handling missing values in columns. These techniques ensure data is adequately prepared for analysis and modelling, leading to more accurate and reliable results.

How To#

import pandas as pd
df = pd.read_csv("data/housing.csv", dtype={"housing_median_age": int,"ocean_proximity": "category"})
df.dtypes
longitude              float64
latitude               float64
housing_median_age       int64
total_rooms            float64
total_bedrooms         float64
population             float64
households             float64
median_income          float64
median_house_value     float64
ocean_proximity       category
dtype: object
df.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
int_cols = ["total_rooms", "population", "households", "median_house_value"]
df[int_cols] = df[int_cols].astype(int)
df.dtypes
longitude              float64
latitude               float64
housing_median_age       int64
total_rooms              int64
total_bedrooms         float64
population               int64
households               int64
median_income          float64
median_house_value       int64
ocean_proximity       category
dtype: object

De-duplicating data#

df.duplicated()
0        False
1        False
2        False
3        False
4        False
         ...  
20635    False
20636    False
20637    False
20638    False
20639    False
Length: 20640, dtype: bool
df.ocean_proximity.duplicated("last")
0         True
1         True
2         True
3         True
4         True
         ...  
20635     True
20636     True
20637     True
20638     True
20639    False
Name: ocean_proximity, Length: 20640, dtype: bool
df.append(df.sample(5)).duplicated()
0        False
1        False
2        False
3        False
4        False
         ...  
12292     True
15857     True
16022     True
10211     True
10793     True
Length: 20645, dtype: bool
df_dup = df.append(df.sample(5))
df_dup.duplicated()
0        False
1        False
2        False
3        False
4        False
         ...  
4667      True
19131     True
16386     True
11879     True
14716     True
Length: 20645, dtype: bool
df_dup[~df_dup.duplicated()]
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41 880 129.0 322 126 8.3252 452600 NEAR BAY
1 -122.22 37.86 21 7099 1106.0 2401 1138 8.3014 358500 NEAR BAY
2 -122.24 37.85 52 1467 190.0 496 177 7.2574 352100 NEAR BAY
3 -122.25 37.85 52 1274 235.0 558 219 5.6431 341300 NEAR BAY
4 -122.25 37.85 52 1627 280.0 565 259 3.8462 342200 NEAR BAY
... ... ... ... ... ... ... ... ... ... ...
20635 -121.09 39.48 25 1665 374.0 845 330 1.5603 78100 INLAND
20636 -121.21 39.49 18 697 150.0 356 114 2.5568 77100 INLAND
20637 -121.22 39.43 17 2254 485.0 1007 433 1.7000 92300 INLAND
20638 -121.32 39.43 18 1860 409.0 741 349 1.8672 84700 INLAND
20639 -121.24 39.37 16 2785 616.0 1387 530 2.3886 89400 INLAND

20640 rows × 10 columns

Exercise#

Try generating unique values in the median age from the dataset.

df[...]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:2646, in Index.get_loc(self, key, method, tolerance)
   2645 try:
-> 2646     return self._engine.get_loc(key)
   2647 except KeyError:

File pandas/_libs/index.pyx:111, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:1619, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:1627, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: Ellipsis

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
Cell In[13], line 1
----> 1 df[...]

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/frame.py:2800, in DataFrame.__getitem__(self, key)
   2798 if self.columns.nlevels > 1:
   2799     return self._getitem_multilevel(key)
-> 2800 indexer = self.columns.get_loc(key)
   2801 if is_integer(indexer):
   2802     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:2648, in Index.get_loc(self, key, method, tolerance)
   2646         return self._engine.get_loc(key)
   2647     except KeyError:
-> 2648         return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650 if indexer.ndim > 1 or indexer.size > 1:

File pandas/_libs/index.pyx:111, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:1619, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:1627, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: Ellipsis

Additional Resources#