# Advanced Strategies (Encoding)

Sometimes it helps to change data from one representation to another.

Encoding converts information from one format or representation to another. It can improve storage efficiency, enhance data security, and make data easier to process and analyze.

Encoding covers techniques such as compression, encryption, and hashing. It can also mean transforming data into a format better suited to a particular application or platform, for example turning a categorical text column into numeric columns that a machine-learning model can consume, which is what the How To section does.

Choosing an encoding that fits the task helps keep data management simple and can improve the overall performance of a system.

## How To

In [1]:
import pandas as pd
# Load the housing dataset and preview the first five rows
df = pd.read_csv("data/housing.csv")
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [2]:
# Inspect the categorical (text) column we want to encode
df.ocean_proximity

0        NEAR BAY
1        NEAR BAY
2        NEAR BAY
3        NEAR BAY
4        NEAR BAY
           ...   
20635      INLAND
20636      INLAND
20637      INLAND
20638      INLAND
20639      INLAND
Name: ocean_proximity, Length: 20640, dtype: object

In [3]:
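# One-hot encode: one indicator column per category.
# dtype=int keeps the 0/1 output shown here (pandas 2.x defaults to booleans).
pd.get_dummies(df.ocean_proximity, dtype=int)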

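| | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... |
| 20635 | 0 | 1 | 0 | 0 | 0 |
| 20636 | 0 | 1 | 0 | 0 | 0 |
| 20637 | 0 | 1 | 0 | 0 | 0 |
| 20638 | 0 | 1 | 0 | 0 | 0 |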
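
To make the idea behind `pd.get_dummies` concrete, here is a minimal pure-Python sketch of one-hot encoding. The helper name `one_hot` and the four-value sample are illustrative assumptions, not part of pandas or of the housing dataset, and pandas does the same job far more efficiently.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 indicator per distinct value."""
    # Sort the distinct values so the column order is deterministic
    categories = sorted(set(values))
    position = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # exactly one "hot" entry per row
        rows.append(row)
    return categories, rows


# Hypothetical sample, not taken from the housing data
sample = ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"]
categories, rows = one_hot(sample)
print(categories)
for row in rows:
    print(row)
```

Each row contains a single 1, in the column that matches the original value.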


In [6]:
# Append the indicator columns to the original frame (joined on the row index)
df.join(pd.get_dummies(df.ocean_proximity, dtype=int))

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY | 0 | 0 | 0 | 1 | 0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY | 0 | 0 | 0 | 1 | 0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY | 0 | 0 | 0 | 1 | 0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY | 0 | 0 | 0 | 1 | 0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND | 0 | 1 | 0 | 0 | 0 |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND | 0 | 1 | 0 | 0 | 0 |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND | 0 | 1 | 0 | 0 | 0 |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND | 0 | 1 | 0 | 0 | 0 |
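
As an alternative to `df.join(...)`, `pd.get_dummies` can encode a named column of a whole frame through its `columns` argument. This drops the original text column and prefixes the new columns with its name. The sketch below assumes a small hand-made frame (`toy`) rather than the real housing data.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the housing data
toy = pd.DataFrame({
    "median_income": [8.3252, 1.5603, 3.8462],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# Encode only the named column; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Untouched columns keep their position, and the indicator columns are appended at the end.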


In [23]:
from sklearn import preprocessing

In [24]:
enc = preprocessing.OneHotEncoder()

In [26]:
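# Fit on the distinct category values; reshape(-1, 1) gives the 2-D
# (n_samples, 1) shape that scikit-learn encoders expect
enc.fit(df.ocean_proximity.unique().reshape(-1, 1))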

OneHotEncoder()

In [29]:
# transform() returns a sparse matrix; toarray() makes it dense for display
enc.transform(df.ocean_proximity.unique().reshape(-1, 1)).toarray()

array([[0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0.]])

In [36]:
# Encode the full column; the double brackets keep the selection 2-D
transformed = enc.transform(df[["ocean_proximity"]]).toarray()

In [39]:
import numpy as np
# Rebuild the one-hot rows from the earlier output to demonstrate the inverse mapping
arr = np.array([[0., 0., 0., 1., 0.],
                [1., 0., 0., 0., 0.],
                [0., 1., 0., 0., 0.],
                [0., 0., 0., 0., 1.],
                [0., 0., 1., 0., 0.]])

In [40]:
enc.inverse_transform(arr)

array([['NEAR BAY'],
       ['<1H OCEAN'],
       ['INLAND'],
       ['NEAR OCEAN'],
       ['ISLAND']], dtype=object)

In [42]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
enc.get_feature_names_out()

array(['x0_<1H OCEAN', 'x0_INLAND', 'x0_ISLAND', 'x0_NEAR BAY',
       'x0_NEAR OCEAN'], dtype=object)

In [43]:
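# Ordinal encoding maps each category to a single integer code
# (categories are sorted, so the codes follow alphabetical order by default)
enc = preprocessing.OrdinalEncoder().fit(df.ocean_proximity.unique().reshape(-1, 1))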

In [45]:
enc.transform(df.ocean_proximity.unique().reshape(-1, 1))

array([[3.],
       [0.],
       [1.],
       [4.],
       [2.]])

In [47]:
enc.transform(df[["ocean_proximity"]])

array([[3.],
       [3.],
       [3.],
       ...,
       [1.],
       [1.],
       [1.]])
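
The codes above follow the sorted order of the category names, which carries no real meaning for `ocean_proximity`. When an order does matter, `OrdinalEncoder` accepts it explicitly through its `categories` parameter. The sketch below uses a hypothetical three-row frame, and the particular ordering in `order` is an arbitrary assumption chosen for illustration, not a ranking taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative order only (an assumption, not defined by the dataset)
order = ["INLAND", "<1H OCEAN", "NEAR BAY", "NEAR OCEAN", "ISLAND"]

# Hypothetical frame standing in for the housing data
toy = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND"]})

# One list of categories per input column, hence the nested list
enc = OrdinalEncoder(categories=[order])
codes = enc.fit_transform(toy)
print(codes)
```

Each code is the position of the value in `order`, so `NEAR BAY` becomes 2 here rather than the 3 it gets under the default sorted order.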

## Exercise

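Explore different encodings: pick another encoder from `sklearn.preprocessing`, fit it on `ocean_proximity`, and compare the result with the one-hot and ordinal outputs above.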

In [None]:
enc = preprocessing.

## Additional Resources

- [Scikit Learn LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)
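
As a quick orientation for the link above: `LabelEncoder` is intended for 1-D target labels rather than feature columns, so it takes a flat sequence instead of a 2-D array. This is a minimal sketch on a hypothetical four-value list, not on the housing data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, passed as a flat 1-D sequence
labels = ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"]

le = LabelEncoder()
codes = le.fit_transform(labels)

print(le.classes_)                  # sorted distinct labels
print(codes)                        # integer code of each label
print(le.inverse_transform(codes))  # back to the original strings
```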