Scaling and binning data#
Data comes in all shapes and forms, but sometimes it’s essential to get data into the same range of values. Sometimes scaling the data is not enough, but grouping data into similar categories is beneficial for analysis.
Data comes in various shapes and forms, including numerical, categorical, and textual data. Analyzing different types of data often requires extra preprocessing steps.
In some cases, data needs to be standardized to a standard range of values, such as scaling numerical data. However, scaling alone may not be enough, and grouping data into similar categories can be more beneficial for analysis. Grouping data can also simplify complex datasets and improve the accuracy of models.
This notebook will discuss different techniques for grouping data, including binning, one-hot encoding, and clustering. We will also explore the benefits and limitations of each technique and how to choose the most appropriate method based on the data and analysis goals.
How To#
import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
Min-Max Scaling#
\(x' = \frac{x - \text{min}(x)}{\text{max}(x)-\text{min}(x)}\)
df["house_value_minmax"] = (df.median_house_value-df.median_house_value.min()) / (df.median_house_value.max()-df.median_house_value.min())
df.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | house_value_minmax | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY | 0.902266 |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY | 0.708247 |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY | 0.695051 |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY | 0.672783 |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY | 0.674638 |
Z-Score#
\(x' = \frac{x - \bar{x}}{\sigma}\)
df["house_value_z"] = (df.median_house_value - df.median_house_value.mean()) / df.median_house_value.std()
df.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | house_value_minmax | house_value_z | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY | 0.902266 | 2.129580 |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY | 0.708247 | 1.314124 |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY | 0.695051 | 1.258663 |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY | 0.672783 | 1.165072 |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY | 0.674638 | 1.172871 |
Binning#
bin_names = ["cheap", "medium", "high", "luxury"]
df["price_range"] = pd.cut(df.house_value_minmax,
[0, .25, .5, .75, 1],
labels=bin_names)
df.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | house_value_minmax | house_value_z | price_range | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY | 0.902266 | 2.129580 | luxury |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY | 0.708247 | 1.314124 | high |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY | 0.695051 | 1.258663 | high |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY | 0.672783 | 1.165072 | high |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY | 0.674638 | 1.172871 | high |
df["price_range_quantile"] = pd.qcut(df.house_value_minmax,
4, labels = bin_names)
df.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | house_value_minmax | house_value_z | price_range | price_range_quantile | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY | 0.902266 | 2.129580 | luxury | luxury |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY | 0.708247 | 1.314124 | high | luxury |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY | 0.695051 | 1.258663 | high | luxury |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY | 0.672783 | 1.165072 | high | luxury |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY | 0.674638 | 1.172871 | high | luxury |
df.price_range_quantile.unique()
['luxury', 'high', 'medium', 'cheap']
Categories (4, object): ['cheap' < 'medium' < 'high' < 'luxury']
Advanced Scaling#
from sklearn import preprocessing
out = preprocessing.MinMaxScaler().fit_transform(df[["median_house_value"]])
import numpy as np
np.squeeze(out) == df.house_value_minmax
0 False
1 True
2 False
3 True
4 False
...
20635 False
20636 False
20637 True
20638 True
20639 True
Name: house_value_minmax, Length: 20640, dtype: bool
print(out[0], df.house_value_minmax[0])
[0.90226638] 0.9022663824066705
np.allclose(np.squeeze(out), df.house_value_minmax)
True
preprocessing.StandardScaler().fit_transform(df[["median_house_value"]])
array([[ 2.12963148],
[ 1.31415614],
[ 1.25869341],
...,
[-0.99274649],
[-1.05860847],
[-1.01787803]])
preprocessing.RobustScaler().fit_transform(df[["median_house_value"]])
array([[ 1.88044789],
[ 1.23204134],
[ 1.18794143],
...,
[-0.60223945],
[-0.6546081 ],
[-0.62222222]])
Exercise#
Familiarize yourself with the preprocessing library in scikit-learn.
from sklearn import preprocessing