Formatting and deduping data

Formatting columns and removing duplicates is an important part of data preparation.

Preparing data for analysis is a crucial step in any data science project. One aspect of this preparation is formatting columns and removing duplicates. Inconsistent or incorrect column formatting can make data hard to analyze or even produce wrong results. Similarly, duplicate records can skew an analysis and lead to inaccurate conclusions.

This notebook explores how to format columns in pandas DataFrames so that data stays accurate and consistent. We will also discuss detecting and removing duplicate data and handling missing values in columns. These techniques help ensure that data is adequately prepared for analysis and modelling, which leads to more accurate and reliable results.

How To

import pandas as pd
# Set the dtypes of two columns while reading the CSV
df = pd.read_csv("data/housing.csv", dtype={"housing_median_age": int, "ocean_proximity": "category"})
df.dtypes
longitude              float64
latitude               float64
housing_median_age       int64
total_rooms            float64
total_bedrooms         float64
population             float64
households             float64
median_income          float64
median_house_value     float64
ocean_proximity       category
dtype: object
df.head()
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0    -122.23     37.88                  41        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1    -122.22     37.86                  21       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2    -122.24     37.85                  52       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3    -122.25     37.85                  52       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4    -122.25     37.85                  52       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY
# Convert columns that only hold whole numbers from float to int
int_cols = ["total_rooms", "population", "households", "median_house_value"]
df[int_cols] = df[int_cols].astype(int)
df.dtypes
longitude              float64
latitude               float64
housing_median_age       int64
total_rooms              int64
total_bedrooms         float64
population               int64
households               int64
median_income          float64
median_house_value       int64
ocean_proximity       category
dtype: object
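
Note that total_bedrooms stays float64 in the output above. A likely reason (an assumption, since the output shown here does not confirm it) is that the column contains missing values: a plain integer column cannot hold NaN, so astype(int) fails on it. The following is a minimal sketch on a made-up toy frame, not the housing file itself, showing two common ways around this: the nullable "Int64" dtype, or filling the gap before casting. The fill value (the median) is only one possible choice.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration: one missing value in the column.
toy = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 200.0]})

# A plain int cast cannot represent NaN, so it raises an error.
try:
    toy["total_bedrooms"].astype(int)
    plain_int_ok = True
except (ValueError, TypeError):
    plain_int_ok = False

# Option 1: the nullable extension dtype "Int64" (capital I) keeps the gap as <NA>.
nullable = toy["total_bedrooms"].astype("Int64")

# Option 2: fill the gap first (here with the column median), then cast to int.
median = toy["total_bedrooms"].median()
filled = toy["total_bedrooms"].fillna(median).astype(int)

print(plain_int_ok)
print(nullable)
print(filled)
```

Which option fits depends on the analysis: the nullable dtype preserves the fact that a value is missing, while filling changes the data.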

De-duplicating data

df.duplicated()
0        False
1        False
2        False
3        False
4        False
         ...
20635    False
20636    False
20637    False
20638    False
20639    False
Length: 20640, dtype: bool
0         True
1         True
2         True
3         True
4         True
         ...
20635     True
20636     True
20637     True
20638     True
20639    False
Name: ocean_proximity, Length: 20640, dtype: bool
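
The boolean Series above look like output from the pandas duplicated method, but the exact calls are not shown, so the following is only a sketch on a made-up toy frame. It assumes the second result came from checking a single column with keep="last", which would explain why only the final row is False there. It also shows drop_duplicates, which removes the flagged rows.

```python
import pandas as pd

# Toy data, made up for illustration: rows 0 and 1 are identical.
toy = pd.DataFrame(
    {
        "households": [126, 126, 177, 219],
        "ocean_proximity": ["NEAR BAY", "NEAR BAY", "NEAR BAY", "INLAND"],
    }
)

# Flag rows that repeat an earlier row across all columns.
full_dupes = toy.duplicated()

# Flag repeats within one column; keep="last" leaves only the
# final occurrence of each value unflagged.
col_dupes = toy["ocean_proximity"].duplicated(keep="last")

# Remove fully duplicated rows, keeping the first occurrence of each.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(full_dupes.tolist())
print(col_dupes.tolist())
print(deduped)
```

Passing subset=[...] to duplicated or drop_duplicates restricts the comparison to the listed columns, which is useful when only some columns define what counts as the same record.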
As an exercise, try listing the unique values of the housing_median_age column in the dataset.
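
One possible starting point for this exercise, sketched on the five ages visible in the table earlier rather than on the full file: unique returns the distinct values, nunique counts them, and value_counts shows how often each one occurs. The variable names are illustrative only.

```python
import pandas as pd

# The first five housing_median_age values from the table shown earlier.
ages = pd.Series([41, 21, 52, 52, 52], name="housing_median_age")

# Distinct values, in order of first appearance.
unique_ages = ages.unique()

# Number of distinct values.
n_unique = ages.nunique()

# Frequency of each value, most common first.
counts = ages.value_counts()

print(unique_ages)
print(n_unique)
print(counts)
```

On the real data, the same calls would be applied to df["housing_median_age"].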

Additional Resources