# Formatting and deduping data

Formatting columns and removing duplicates are important parts of data preparation.

Preparing data for analysis is a crucial step in any data science project. Columns stored with the wrong or inconsistent types are harder to analyze and can produce incorrect results. Duplicate rows, in turn, can skew statistics and lead to inaccurate conclusions.

This notebook explores how to format columns in pandas DataFrames so that the data is accurate and consistent. We will also discuss detecting and removing duplicate rows and handling missing values in columns. Together, these techniques prepare the data for analysis and modelling and make the results more reliable.

## How To

In [1]:
import pandas as pd

In [6]:
# Set column types while loading: integer ages, categorical ocean proximity
df = pd.read_csv("data/housing.csv", dtype={"housing_median_age": int, "ocean_proximity": "category"})

In [7]:
df.dtypes

longitude              float64
latitude               float64
housing_median_age       int32
total_rooms            float64
total_bedrooms         float64
population             float64
households             float64
median_income          float64
median_house_value     float64
ocean_proximity       category
dtype: object
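
The `dtype` mapping used in `read_csv` can be tried without the housing file. The following is a minimal sketch on a made-up, in-memory CSV (the three rows are illustrative, not taken from `data/housing.csv`). The exact integer width (`int32` as shown above, or `int64`) depends on the platform and library versions.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for data/housing.csv
csv_text = (
    "housing_median_age,ocean_proximity\n"
    "41,NEAR BAY\n"
    "21,INLAND\n"
    "52,NEAR BAY\n"
)

# Same dtype mapping as in the cell above
toy = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"housing_median_age": int, "ocean_proximity": "category"},
)

print(toy.dtypes)
# A category column stores each distinct label once
print(toy["ocean_proximity"].cat.categories)
```

Setting the types at load time avoids a second conversion pass, and a `category` column usually needs less memory than plain strings when few distinct labels repeat many times.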

In [8]:
df.head()

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [9]:
# These columns hold whole numbers, so convert them from float to integer
int_cols = ["total_rooms", "population", "households", "median_house_value"]
df[int_cols] = df[int_cols].astype(int)

In [10]:
df.dtypes

longitude              float64
latitude               float64
housing_median_age       int32
total_rooms              int32
total_bedrooms         float64
population               int32
households               int32
median_income          float64
median_house_value       int32
ocean_proximity       category
dtype: object
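
In the listing above, `total_bedrooms` stays `float64`. Assuming this is because the column contains missing values (the notebook does not say so), a plain `astype(int)` would fail on it. The sketch below uses a made-up three-value column to show two possible ways around that: the nullable `Int64` dtype, or filling the gaps first. The fill value `0` is only a placeholder, not a recommendation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; values are invented for illustration
toy = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0]})

# NaN has no plain-integer representation, so this conversion raises
try:
    toy["total_bedrooms"].astype(int)
    plain_int_failed = False
except (ValueError, TypeError):
    plain_int_failed = True

# Option 1: the nullable "Int64" extension dtype keeps the gap as <NA>
bedrooms_nullable = toy["total_bedrooms"].astype("Int64")

# Option 2: fill the gap first (0 is a placeholder), then use a plain integer
bedrooms_filled = toy["total_bedrooms"].fillna(0).astype("int64")

print(plain_int_failed)
print(bedrooms_nullable)
print(bedrooms_filled)
```

Which option fits depends on the analysis: `Int64` preserves the fact that a value is missing, while filling replaces it with a number that later statistics will treat as real.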

## De-duplicating data

In [11]:
df.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
20635    False
20636    False
20637    False
20638    False
20639    False
Length: 20640, dtype: bool

In [13]:
# keep="last" leaves only the final occurrence of each value unmarked
df["ocean_proximity"].duplicated(keep="last")

0         True
1         True
2         True
3         True
4         True
         ...  
20635     True
20636     True
20637     True
20638     True
20639    False
Name: ocean_proximity, Length: 20640, dtype: bool
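
The output above is almost entirely `True` because the column has only a handful of distinct values, so nearly every row repeats a value that occurs again later. The effect of the `keep` argument is easier to see on a small Series; the one below is made up for illustration.

```python
import pandas as pd

# Made-up labels; "ISLAND" appears only once
s = pd.Series(["NEAR BAY", "INLAND", "NEAR BAY", "INLAND", "ISLAND"])

first = s.duplicated(keep="first")  # marks every repeat after the first occurrence
last = s.duplicated(keep="last")    # marks every occurrence except the last one
every = s.duplicated(keep=False)    # marks all members of any repeated group

print(pd.DataFrame({"value": s, "first": first, "last": last, "every": every}))
```

`keep="first"` is the default, and `keep=False` is useful for inspecting every row involved in a duplication before deciding which one to keep.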

In [14]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
pd.concat([df, df.sample(5)]).duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
3041      True
19520     True
22        True
1215      True
8840      True
Length: 20645, dtype: bool

In [15]:
# Append five randomly sampled rows to create known duplicates
df_dup = pd.concat([df, df.sample(5)])

In [16]:
df_dup.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
14693     True
14572     True
7974      True
4140      True
20393     True
Length: 20645, dtype: bool
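
With more than twenty thousand rows, the truncated boolean output is hard to read. One possible approach is to count the `True` values and to pull out the affected rows. The sketch below does this on a made-up three-row frame; `random_state=0` is added only to make the sample repeatable.

```python
import pandas as pd

# Toy stand-in for the housing data; values are invented for illustration
toy = pd.DataFrame(
    {
        "longitude": [-122.23, -122.22, -122.24],
        "housing_median_age": [41, 21, 52],
    }
)

# Re-append two sampled rows, mirroring the df_dup construction
toy_dup = pd.concat([toy, toy.sample(2, random_state=0)])

# True counts as 1, so the sum is the number of rows flagged as duplicates
n_dup = int(toy_dup.duplicated().sum())

# keep=False shows both the originals and their copies
dup_rows = toy_dup[toy_dup.duplicated(keep=False)]

print(n_dup)
print(dup_rows)
```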

In [24]:
df_dup[~df_dup.duplicated()]

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41,880,129.0,322,126,8.3252,452600,NEAR BAY
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,358500,NEAR BAY
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,352100,NEAR BAY
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,341300,NEAR BAY
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,342200,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25,1665,374.0,845,330,1.5603,78100,INLAND
20636,-121.21,39.49,18,697,150.0,356,114,2.5568,77100,INLAND
20637,-121.22,39.43,17,2254,485.0,1007,433,1.7000,92300,INLAND
20638,-121.32,39.43,18,1860,409.0,741,349,1.8672,84700,INLAND
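
Filtering with a negated `duplicated()` mask gives the same rows as `DataFrame.drop_duplicates()`, which also accepts `subset` and `keep` arguments. The sketch below uses a made-up four-row frame. Treating longitude and latitude as the identity of a row is an assumption made for illustration, not something the dataset guarantees.

```python
import pandas as pd

# Toy frame: row 3 is an exact copy of row 0; row 2 shares its location only
toy = pd.DataFrame(
    {
        "longitude": [-122.23, -122.22, -122.23, -122.23],
        "latitude": [37.88, 37.86, 37.88, 37.88],
        "median_house_value": [452600, 358500, 450000, 452600],
    }
)

# Same result as toy[~toy.duplicated()]: only exact copies are removed
deduped = toy.drop_duplicates()

# Compare only the location columns and keep the last row of each group
deduped_by_location = toy.drop_duplicates(
    subset=["longitude", "latitude"], keep="last"
).reset_index(drop=True)

print(deduped)
print(deduped_by_location)
```

`reset_index(drop=True)` renumbers the rows after removal; without it the surviving rows keep their original index labels.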


## Exercise
Try listing the unique values of the median age column (`housing_median_age`) in the dataset.

In [None]:
df[...]

## Additional Resources

- [Pandas astype Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html)
- [Pandas Duplicated Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html)