Advanced Strategies (Encoding)
Sometimes it helps to change data from one representation to another.
Encoding converts information from one format or representation to another. Typical goals are more efficient storage, better security, and easier processing and analysis.
Encoding techniques include compression, encryption, and hashing. Encoding can also mean transforming data into a format better suited to a particular application or platform, for example turning a text category into numeric columns that a machine-learning model can consume.
Choosing a suitable encoding can simplify data management and improve the overall performance of a system.
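To make the techniques named above concrete, here is a minimal sketch that uses only the Python standard library to compress, hash, and re-encode a byte string. The payload and variable names are invented for illustration. Encryption is left out because the standard library does not ship a general-purpose cipher.

```python
import base64
import hashlib
import zlib

# Hypothetical sample payload: repetitive text compresses well.
payload = ("NEAR BAY," * 200).encode("utf-8")

# Compression: a reversible encoding that reduces storage size.
compressed = zlib.compress(payload)
restored = zlib.decompress(compressed)

# Hashing: a one-way, fixed-length digest that cannot be decoded back.
digest = hashlib.sha256(payload).hexdigest()

# Base64: a reversible re-encoding of arbitrary bytes as ASCII text.
as_text = base64.b64encode(compressed).decode("ascii")

print(len(payload), len(compressed))  # original size vs. compressed size
print(restored == payload)            # compression round-trips losslessly
print(len(digest))                    # SHA-256 hex digest length
```

Compression and Base64 can be reversed, while a hash cannot. That difference usually decides which technique fits a given task.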
How To
import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()
 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
df.ocean_proximity
0 NEAR BAY
1 NEAR BAY
2 NEAR BAY
3 NEAR BAY
4 NEAR BAY
...
20635 INLAND
20636 INLAND
20637 INLAND
20638 INLAND
20639 INLAND
Name: ocean_proximity, Length: 20640, dtype: object
pd.get_dummies(df.ocean_proximity)
 | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
---|---|---|---|---|---|
0 | False | False | False | True | False |
1 | False | False | False | True | False |
2 | False | False | False | True | False |
3 | False | False | False | True | False |
4 | False | False | False | True | False |
... | ... | ... | ... | ... | ... |
20635 | False | True | False | False | False |
20636 | False | True | False | False | False |
20637 | False | True | False | False | False |
20638 | False | True | False | False | False |
20639 | False | True | False | False | False |
20640 rows × 5 columns
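The table above has one boolean column per distinct category, and each row is True in exactly one of them. The pure-Python sketch below reproduces that idea on a few made-up values. It illustrates the concept only and does not show how pandas implements get_dummies.

```python
# Hypothetical sample values standing in for a categorical column.
values = ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"]

# One output column per distinct category, in sorted order.
categories = sorted(set(values))

# Each row holds True in the column matching its value, False elsewhere.
one_hot = [[value == category for category in categories] for value in values]

print(categories)
for row in one_hot:
    print(row)
```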
df.join(pd.get_dummies(df.ocean_proximity))
 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY | False | False | False | True | False |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY | False | False | False | True | False |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY | False | False | False | True | False |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY | False | False | False | True | False |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY | False | False | False | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND | False | True | False | False | False |
20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND | False | True | False | False | False |
20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND | False | True | False | False | False |
20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND | False | True | False | False | False |
20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND | False | True | False | False | False |
20640 rows × 15 columns
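The join above keeps the original text column next to its indicator columns. As an alternative sketch, get_dummies can encode a named column of a whole DataFrame in one call. The toy frame and the "prox" prefix below are assumptions made for this example and are not part of the housing data.

```python
import pandas as pd

# Hypothetical toy frame; the column names mimic the housing data.
toy = pd.DataFrame(
    {
        "median_income": [8.3, 1.6, 3.8],
        "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
    }
)

# columns= encodes the named column and drops the original text column;
# prefix= labels the new columns; dtype=int yields 0/1 instead of True/False.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], prefix="prox", dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Dropping the original column avoids feeding the same information to a model twice, and integer indicators are often easier to pass to numeric code than booleans.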
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit(df.ocean_proximity.unique().reshape(-1, 1))
OneHotEncoder()
enc.transform(df.ocean_proximity.unique().reshape(-1, 1)).toarray()
array([[0., 0., 0., 1., 0.],
[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 0., 0., 1.],
[0., 0., 1., 0., 0.]])
transformed = enc.transform(df[["ocean_proximity"]]).toarray()
import numpy as np
arr = np.array([[0., 0., 0., 1., 0.],
[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 0., 0., 1.],
[0., 0., 1., 0., 0.]])
enc.inverse_transform(arr)
array([['NEAR BAY'],
['<1H OCEAN'],
['INLAND'],
['NEAR OCEAN'],
['ISLAND']], dtype=object)
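One practical difference from get_dummies is that a fitted OneHotEncoder remembers its categories and can be reused on new data. The sketch below assumes two small made-up frames and shows what happens with a category that was not seen during fitting when handle_unknown="ignore" is set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and new data; "NEAR OCEAN" is absent from training.
train = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND"]})
new = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR OCEAN"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore").fit(train)

encoded = encoder.transform(new).toarray()
print(encoder.categories_)  # categories learned during fit, sorted
print(encoded)
```

With the default handle_unknown="error", the same transform call would raise a ValueError for the unseen value.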
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
enc.get_feature_names_out()
enc = preprocessing.OrdinalEncoder().fit(df.ocean_proximity.unique().reshape(-1, 1))
enc.transform(df.ocean_proximity.unique().reshape(-1, 1))
array([[3.],
[0.],
[1.],
[4.],
[2.]])
enc.transform(df[["ocean_proximity"]])
array([[3.],
[3.],
[3.],
...,
[1.],
[1.],
[1.]])
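By default, OrdinalEncoder numbers the categories in sorted order, which is why NEAR BAY maps to 3 above. When the categories have a meaningful order, it can be passed explicitly through the categories parameter. The ordering used below is an illustrative assumption and is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Illustrative ordering only (an assumption, not taken from the dataset).
order = ["ISLAND", "NEAR OCEAN", "NEAR BAY", "<1H OCEAN", "INLAND"]

# categories= takes one list per input column; codes follow list positions.
encoder = OrdinalEncoder(categories=[order])

sample = np.array([["NEAR BAY"], ["INLAND"], ["ISLAND"]], dtype=object)
codes = encoder.fit_transform(sample)

print(codes.ravel())
```

Ordinal codes imply an order and a distance between categories, so for purely nominal data a one-hot encoding is often the safer choice.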
Exercise
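Explore different encodings: complete the stub below with another encoder from sklearn.preprocessing and compare its output with the one-hot and ordinal results above.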
enc = preprocessing.