Advanced Strategies (Encoding)

Sometimes it helps to change data from one representation to another.

Encoding converts information from one format or representation to another. It can improve storage efficiency, strengthen data security, and make data easier to process and analyze.

Encoding techniques include compression, encryption, and hashing. Encoding can also mean transforming data into a format better suited to a particular application or platform, such as turning a categorical text column into numbers that a machine learning model can use.
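To make the difference between these techniques concrete, here is a minimal sketch using only the Python standard library. The sample string is made up for illustration, and encryption is left out because the standard library does not ship a general-purpose cipher.

```python
import base64
import hashlib
import zlib

# Made-up sample text with a lot of repetition, so it compresses well.
text = "NEAR BAY,NEAR BAY,INLAND,INLAND,INLAND;" * 10
raw = text.encode("utf-8")  # character encoding: text -> bytes

compressed = zlib.compress(raw)              # lossless compression, reversible
digest = hashlib.sha256(raw).hexdigest()     # one-way hash, cannot be reversed
b64 = base64.b64encode(raw).decode("ascii")  # bytes -> ASCII-safe text, reversible

restored = zlib.decompress(compressed)
print(len(raw), len(compressed), restored == raw)
```

Compression and Base64 can be undone to recover the original bytes, while the hash is a fixed-length fingerprint that cannot be turned back into the data.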

Choosing an appropriate encoding can help organizations manage their data more effectively and improve the performance of the systems that consume it.

How To

import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY
df.ocean_proximity
0        NEAR BAY
1        NEAR BAY
2        NEAR BAY
3        NEAR BAY
4        NEAR BAY
           ...   
20635      INLAND
20636      INLAND
20637      INLAND
20638      INLAND
20639      INLAND
Name: ocean_proximity, Length: 20640, dtype: object
pd.get_dummies(df.ocean_proximity)
       <1H OCEAN  INLAND  ISLAND  NEAR BAY  NEAR OCEAN
    0      False   False   False      True       False
    1      False   False   False      True       False
    2      False   False   False      True       False
    3      False   False   False      True       False
    4      False   False   False      True       False
  ...        ...     ...     ...       ...         ...
20635      False    True   False     False       False
20636      False    True   False     False       False
20637      False    True   False     False       False
20638      False    True   False     False       False
20639      False    True   False     False       False

20640 rows × 5 columns
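Each row of a one-hot table contains exactly one True, so one of the columns is redundant. The sketch below, which uses a small made-up Series rather than the housing file, shows how `drop_first=True` removes the first category column.

```python
import pandas as pd

# Made-up stand-in for df.ocean_proximity.
proximity = pd.Series(
    ["NEAR BAY", "INLAND", "ISLAND", "INLAND"], name="ocean_proximity"
)

full = pd.get_dummies(proximity)                      # one column per category
reduced = pd.get_dummies(proximity, drop_first=True)  # first category dropped

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping a column keeps the same information, because a row with no True in the remaining columns must belong to the dropped category. This is mostly relevant for linear models, where the redundant column can cause collinearity.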

df.join(pd.get_dummies(df.ocean_proximity))
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity  <1H OCEAN  INLAND  ISLAND  NEAR BAY  NEAR OCEAN
    0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY      False   False   False      True       False
    1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY      False   False   False      True       False
    2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY      False   False   False      True       False
    3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY      False   False   False      True       False
    4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY      False   False   False      True       False
  ...        ...       ...                 ...          ...             ...         ...         ...            ...                 ...              ...        ...     ...     ...       ...         ...
20635    -121.09     39.48                25.0       1665.0           374.0       845.0       330.0         1.5603             78100.0           INLAND      False    True   False     False       False
20636    -121.21     39.49                18.0        697.0           150.0       356.0       114.0         2.5568             77100.0           INLAND      False    True   False     False       False
20637    -121.22     39.43                17.0       2254.0           485.0      1007.0       433.0         1.7000             92300.0           INLAND      False    True   False     False       False
20638    -121.32     39.43                18.0       1860.0           409.0       741.0       349.0         1.8672             84700.0           INLAND      False    True   False     False       False
20639    -121.24     39.37                16.0       2785.0           616.0      1387.0       530.0         2.3886             89400.0           INLAND      False    True   False     False       False

20640 rows × 15 columns
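Joining the dummies keeps the original text column alongside the new ones. As an alternative sketch, `pd.get_dummies` can also take the whole DataFrame and replace the named column in one step. The toy frame and the `op` prefix below are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the housing data.
toy = pd.DataFrame(
    {
        "median_income": [8.3, 1.6, 3.8],
        "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
    }
)

# columns= selects what to encode; prefix= labels the new columns;
# dtype=int yields 0/1 instead of False/True.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], prefix="op", dtype=int)
print(encoded)
```

The prefix keeps the new column names traceable to their source column, which helps once several categorical columns are encoded in the same frame.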

from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit(df.ocean_proximity.unique().reshape(-1, 1))
OneHotEncoder()
enc.transform(df.ocean_proximity.unique().reshape(-1, 1)).toarray()
array([[0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0.]])
transformed = enc.transform(df[["ocean_proximity"]]).toarray()
import numpy as np
arr = np.array([[0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0.]])
enc.inverse_transform(arr)
array([['NEAR BAY'],
       ['<1H OCEAN'],
       ['INLAND'],
       ['NEAR OCEAN'],
       ['ISLAND']], dtype=object)
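One practical difference from `pd.get_dummies` is that a fitted encoder remembers its categories and can be reused on new data. The sketch below uses made-up training and new frames, and assumes that ignoring unseen categories is the desired behaviour, which may not hold for every application.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: "NEAR OCEAN" does not occur in the training frame.
train = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND"]})
new = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR OCEAN"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

encoded = enc.transform(new).toarray()
print(enc.categories_)
print(encoded)
```

With the default `handle_unknown="error"`, the same `transform` call would raise a `ValueError` for the unseen value.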
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
enc.get_feature_names_out()
array(['x0_<1H OCEAN', 'x0_INLAND', 'x0_ISLAND', 'x0_NEAR BAY',
       'x0_NEAR OCEAN'], dtype=object)
enc = preprocessing.OrdinalEncoder().fit(df.ocean_proximity.unique().reshape(-1, 1))
enc.transform(df.ocean_proximity.unique().reshape(-1, 1))
array([[3.],
       [0.],
       [1.],
       [4.],
       [2.]])
enc.transform(df[["ocean_proximity"]])
array([[3.],
       [3.],
       [3.],
       ...,
       [1.],
       [1.],
       [1.]])
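By default `OrdinalEncoder` numbers the categories in sorted order, which is why `<1H OCEAN` maps to 0 and `NEAR OCEAN` to 4 above. If the integers are meant to carry a ranking, the order can be passed explicitly. The ranking in this sketch is hypothetical and chosen only for illustration; it is not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ranking, roughly "far from the water" to "surrounded by it".
order = ["INLAND", "<1H OCEAN", "NEAR OCEAN", "NEAR BAY", "ISLAND"]

# Made-up stand-in for df[["ocean_proximity"]].
toy = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND"]})

# categories= takes one list per input column, in the desired order.
enc = OrdinalEncoder(categories=[order])
codes = enc.fit_transform(toy)
print(codes.ravel())
```

Ordinal codes imply that categories are ordered and evenly spaced, so for unordered categories a one-hot encoding is often the safer choice.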

Exercise

Explore different encodings.

# Complete this line with another encoder from sklearn.preprocessing, then fit and transform as above.
# enc = preprocessing.

Additional Resources