Formatting and deduping data#
Formatting columns and removing duplicates is an important part of data preparation.
Preparing data for analysis is a crucial step in any data science project. One aspect of data preparation is formatting columns and removing duplicates. Inaccurate or inconsistent formatting of columns can make it difficult to analyze data or even result in incorrect results. Similarly, duplicate data can skew analysis and lead to inaccurate conclusions.
This notebook will explore how to format columns in Pandas dataframes to ensure data accuracy and consistency. We will also discuss detecting and removing duplicate data and handling missing values in columns. These techniques ensure data is adequately prepared for analysis and modelling, leading to more accurate and reliable results.
How To#
import pandas as pd
df = pd.read_csv("data/housing.csv", dtype={"housing_median_age": int,"ocean_proximity": "category"})
df.dtypes
longitude float64
latitude float64
housing_median_age int64
total_rooms float64
total_bedrooms float64
population float64
households float64
median_income float64
median_house_value float64
ocean_proximity category
dtype: object
df.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
int_cols = ["total_rooms", "population", "households", "median_house_value"]
df[int_cols] = df[int_cols].astype(int)
df.dtypes
longitude float64
latitude float64
housing_median_age int64
total_rooms int64
total_bedrooms float64
population int64
households int64
median_income float64
median_house_value int64
ocean_proximity category
dtype: object
De-duplicating data#
df.duplicated()
0 False
1 False
2 False
3 False
4 False
...
20635 False
20636 False
20637 False
20638 False
20639 False
Length: 20640, dtype: bool
df.ocean_proximity.duplicated("last")
0 True
1 True
2 True
3 True
4 True
...
20635 True
20636 True
20637 True
20638 True
20639 False
Name: ocean_proximity, Length: 20640, dtype: bool
df.append(df.sample(5)).duplicated()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_2356/2165154540.py in ?()
----> 1 df.append(df.sample(5)).duplicated()
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/generic.py in ?(self, name)
5985 and name not in self._accessors
5986 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5987 ):
5988 return self[name]
-> 5989 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'
df_dup = df.append(df.sample(5))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_2356/3740749831.py in ?()
----> 1 df_dup = df.append(df.sample(5))
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/generic.py in ?(self, name)
5985 and name not in self._accessors
5986 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5987 ):
5988 return self[name]
-> 5989 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'
df_dup.duplicated()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[11], line 1
----> 1 df_dup.duplicated()
NameError: name 'df_dup' is not defined
df_dup[~df_dup.duplicated()]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[12], line 1
----> 1 df_dup[~df_dup.duplicated()]
NameError: name 'df_dup' is not defined
Exercise#
Try generating unique values in the median age from the dataset.
df[...]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3653, in Index.get_loc(self, key)
3652 try:
-> 3653 return self._engine.get_loc(casted_key)
3654 except KeyError as err:
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: Ellipsis
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[13], line 1
----> 1 df[...]
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
3759 if self.columns.nlevels > 1:
3760 return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
3762 if is_integer(indexer):
3763 indexer = [indexer]
File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3655, in Index.get_loc(self, key)
3653 return self._engine.get_loc(casted_key)
3654 except KeyError as err:
-> 3655 raise KeyError(key) from err
3656 except TypeError:
3657 # If we have a listlike key, _check_indexing_error will raise
3658 # InvalidIndexError. Otherwise we fall through and re-raise
3659 # the TypeError.
3660 self._check_indexing_error(key)
KeyError: Ellipsis