Scaling and binning data#

Data comes in all shapes and forms, but sometimes it’s essential to get data into the same range of values. Sometimes scaling the data is not enough, but grouping data into similar categories is beneficial for analysis.

Data comes in various shapes and forms, including numerical, categorical, and textual data. Analyzing different types of data often requires extra preprocessing steps.

In some cases, data needs to be standardized to a standard range of values, such as scaling numerical data. However, scaling alone may not be enough, and grouping data into similar categories can be more beneficial for analysis. Grouping data can also simplify complex datasets and improve the accuracy of models.

This notebook will discuss different techniques for grouping data, including binning, one-hot encoding, and clustering. We will also explore the benefits and limitations of each technique and how to choose the most appropriate method based on the data and analysis goals.

How To#

import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY

Min-Max Scaling#

\(x' = \frac{x - \text{min}(x)}{\text{max}(x)-\text{min}(x)}\)

df["house_value_minmax"] = (df.median_house_value-df.median_house_value.min()) / (df.median_house_value.max()-df.median_house_value.min())
df.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity house_value_minmax
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 0.902266
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 0.708247
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 0.695051
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY 0.672783
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY 0.674638

Z-Score#

\(x' = \frac{x - \bar{x}}{\sigma}\)

df["house_value_z"] = (df.median_house_value - df.median_house_value.mean()) / df.median_house_value.std()
df.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity house_value_minmax house_value_z
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 0.902266 2.129580
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 0.708247 1.314124
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 0.695051 1.258663
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY 0.672783 1.165072
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY 0.674638 1.172871

Binning#

bin_names = ["cheap", "medium", "high", "luxury"]
df["price_range"] = pd.cut(df.house_value_minmax, 
                           [0, .25, .5, .75, 1],
                           labels=bin_names)
df.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity house_value_minmax house_value_z price_range
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 0.902266 2.129580 luxury
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 0.708247 1.314124 high
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 0.695051 1.258663 high
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY 0.672783 1.165072 high
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY 0.674638 1.172871 high
df["price_range_quantile"] = pd.qcut(df.house_value_minmax, 
                                     4, labels = bin_names)
df.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity house_value_minmax house_value_z price_range price_range_quantile
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY 0.902266 2.129580 luxury luxury
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY 0.708247 1.314124 high luxury
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY 0.695051 1.258663 high luxury
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY 0.672783 1.165072 high luxury
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY 0.674638 1.172871 high luxury
df.price_range_quantile.unique()
['luxury', 'high', 'medium', 'cheap']
Categories (4, object): ['cheap' < 'medium' < 'high' < 'luxury']

Advanced Scaling#

from sklearn import preprocessing
out = preprocessing.MinMaxScaler().fit_transform(df[["median_house_value"]])
import numpy as np
np.squeeze(out) == df.house_value_minmax
0        False
1         True
2        False
3         True
4        False
         ...  
20635    False
20636    False
20637     True
20638     True
20639     True
Name: house_value_minmax, Length: 20640, dtype: bool
print(out[0], df.house_value_minmax[0])
[0.90226638] 0.9022663824066705
np.allclose(np.squeeze(out), df.house_value_minmax)
True
preprocessing.StandardScaler().fit_transform(df[["median_house_value"]])
array([[ 2.12963148],
       [ 1.31415614],
       [ 1.25869341],
       ...,
       [-0.99274649],
       [-1.05860847],
       [-1.01787803]])
preprocessing.RobustScaler().fit_transform(df[["median_house_value"]])
array([[ 1.88044789],
       [ 1.23204134],
       [ 1.18794143],
       ...,
       [-0.60223945],
       [-0.6546081 ],
       [-0.62222222]])

Exercise#

Familiarize yourself with the preprocessing library in scikit-learn.

from sklearn import preprocessing

Additional Resources#