Advanced Strategies (Encoding)

Sometimes it helps to change data from one representation to another.

Encoding converts information from one format or representation to another. Typical reasons are to store data more efficiently, to protect it, or to make it easier to process and analyze.

Encoding techniques include compression, encryption, and hashing. Encoding can also mean transforming data into a format better suited to a particular application or platform, such as turning a text category column into numeric columns that a machine learning model can consume.

Choosing an appropriate encoding can simplify data management and improve how well downstream systems and models perform.

How To

import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY

df.ocean_proximity

0        NEAR BAY
1        NEAR BAY
2        NEAR BAY
3        NEAR BAY
4        NEAR BAY
           ...   
20635      INLAND
20636      INLAND
20637      INLAND
20638      INLAND
20639      INLAND
Name: ocean_proximity, Length: 20640, dtype: object

pd.get_dummies(df.ocean_proximity)

	<1H OCEAN	INLAND	ISLAND	NEAR BAY	NEAR OCEAN
0	False	False	False	True	False
1	False	False	False	True	False
2	False	False	False	True	False
3	False	False	False	True	False
4	False	False	False	True	False
...	...	...	...	...	...
20635	False	True	False	False	False
20636	False	True	False	False	False
20637	False	True	False	False	False
20638	False	True	False	False	False
20639	False	True	False	False	False

20640 rows × 5 columns
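A few optional arguments of `pd.get_dummies` are often useful at this step. The sketch below uses a small made-up Series (not the housing data) to illustrate `prefix`, `dtype`, and `drop_first`; whether dropping a column is appropriate depends on the downstream model, so treat this as one possible setup rather than a recommendation.

```python
import pandas as pd

# Toy stand-in for df.ocean_proximity; the values are illustrative only.
proximity = pd.Series(
    ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"], name="ocean_proximity"
)

# prefix labels the new columns; dtype=int yields 0/1 instead of True/False.
dummies = pd.get_dummies(proximity, prefix="ocean", dtype=int)

# drop_first=True drops the first category (in sorted order), leaving k-1 columns.
reduced = pd.get_dummies(proximity, prefix="ocean", dtype=int, drop_first=True)

print(dummies)
print(reduced.columns.tolist())
```

With `drop_first=True`, a row of all zeros stands for the dropped category, which avoids perfectly collinear columns in linear models.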

df.join(pd.get_dummies(df.ocean_proximity))

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity	<1H OCEAN	INLAND	ISLAND	NEAR BAY	NEAR OCEAN
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY	False	False	False	True	False
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY	False	False	False	True	False
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY	False	False	False	True	False
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY	False	False	False	True	False
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY	False	False	False	True	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
20635	-121.09	39.48	25.0	1665.0	374.0	845.0	330.0	1.5603	78100.0	INLAND	False	True	False	False	False
20636	-121.21	39.49	18.0	697.0	150.0	356.0	114.0	2.5568	77100.0	INLAND	False	True	False	False	False
20637	-121.22	39.43	17.0	2254.0	485.0	1007.0	433.0	1.7000	92300.0	INLAND	False	True	False	False	False
20638	-121.32	39.43	18.0	1860.0	409.0	741.0	349.0	1.8672	84700.0	INLAND	False	True	False	False	False
20639	-121.24	39.37	16.0	2785.0	616.0	1387.0	530.0	2.3886	89400.0	INLAND	False	True	False	False	False

20640 rows × 15 columns
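The join above keeps the original text column next to the new indicator columns. As an alternative sketch, `pd.get_dummies` can also take the whole frame plus a `columns` argument, which encodes the named column and drops the original in one step. The toy frame below is made up for illustration and only mimics two columns of the housing data.

```python
import pandas as pd

# Toy frame standing in for the housing data; the values are illustrative only.
toy = pd.DataFrame({
    "median_income": [8.3, 1.6, 3.8],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# columns= encodes only the named column(s) and removes the original text column.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"])

print(encoded.columns.tolist())
```

The new columns are prefixed with the source column name, which helps when several categorical columns are encoded at once.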

from sklearn import preprocessing

enc = preprocessing.OneHotEncoder()

enc.fit(df.ocean_proximity.unique().reshape(-1, 1))

OneHotEncoder()

enc.transform(df.ocean_proximity.unique().reshape(-1, 1)).toarray()

array([[0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0.]])
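The `.toarray()` call is needed because, with default settings, `OneHotEncoder.transform` returns a SciPy sparse matrix rather than a NumPy array. The following minimal sketch, using made-up category values, shows the difference; it assumes SciPy is installed, which scikit-learn already requires.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy categories; the values are illustrative only.
values = np.array(["NEAR BAY", "INLAND", "ISLAND"], dtype=object).reshape(-1, 1)

enc = OneHotEncoder().fit(values)
encoded = enc.transform(values)

# With default settings the result is a SciPy sparse matrix.
print(sparse.issparse(encoded))

# .toarray() materialises every zero, which is fine for small data.
dense = encoded.toarray()
print(dense.shape)
```

For columns with many distinct categories, keeping the sparse form usually saves memory, since each row holds a single non-zero entry.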

transformed = enc.transform(df[["ocean_proximity"]]).toarray()

import numpy as np

# The one-hot rows shown above, one per unique category.
arr = np.array([[0., 0., 0., 1., 0.],
                [1., 0., 0., 0., 0.],
                [0., 1., 0., 0., 0.],
                [0., 0., 0., 0., 1.],
                [0., 0., 1., 0., 0.]])

enc.inverse_transform(arr)

array([['NEAR BAY'],
       ['<1H OCEAN'],
       ['INLAND'],
       ['NEAR OCEAN'],
       ['ISLAND']], dtype=object)
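Round-tripping also raises the question of what happens when new data contains a category the encoder never saw. By default `transform` raises an error; a sketch of one alternative, `handle_unknown="ignore"`, is shown below. The category `"LAKESIDE"` is invented for this example and does not occur in the housing data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training categories; the values are illustrative only.
train = np.array(["NEAR BAY", "INLAND", "ISLAND"], dtype=object).reshape(-1, 1)
enc = OneHotEncoder(handle_unknown="ignore").fit(train)

# "LAKESIDE" is a made-up category that was not present during fit.
new = np.array(["INLAND", "LAKESIDE"], dtype=object).reshape(-1, 1)

# The unknown category is encoded as a row of zeros instead of raising.
encoded = enc.transform(new).toarray()

# inverse_transform maps an all-zero row back to None.
decoded = enc.inverse_transform(encoded)

print(encoded)
print(decoded)
```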

# Recent scikit-learn versions expose the column names via get_feature_names_out().
enc.get_feature_names_out()

array(['x0_<1H OCEAN', 'x0_INLAND', 'x0_ISLAND', 'x0_NEAR BAY',
       'x0_NEAR OCEAN'], dtype=object)

enc = preprocessing.OrdinalEncoder().fit(df.ocean_proximity.unique().reshape(-1, 1))

enc.transform(df.ocean_proximity.unique().reshape(-1, 1))

array([[3.],
       [0.],
       [1.],
       [4.],
       [2.]])

enc.transform(df[["ocean_proximity"]])

array([[3.],
       [3.],
       [3.],
       ...,
       [1.],
       [1.],
       [1.]])
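By default `OrdinalEncoder` numbers the categories in sorted order, so the integers above carry no real meaning beyond alphabetical position. If the categories have a natural order, it can be passed explicitly through the `categories` argument. The ordering used below is a hypothetical one chosen for illustration, not something stated by the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND", "INLAND"]})

# Hypothetical ordering (an assumption made for this example).
order = ["INLAND", "NEAR BAY", "ISLAND"]

# One list of categories per encoded column.
enc = OrdinalEncoder(categories=[order])
codes = enc.fit_transform(toy[["ocean_proximity"]])

print(codes.ravel())
```

An ordinal encoding implies that the gaps between codes are meaningful, so it tends to suit tree-based models or genuinely ordered categories better than linear models on unordered ones.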

Exercise

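Explore different encodings. Complete the line below with another encoder from sklearn.preprocessing, fit it on ocean_proximity, and compare its output with the one-hot and ordinal results above.

enc = preprocessing.  # complete this line with an encoder of your choice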

Additional Resources

Scikit Learn LabelEncoder
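As a starting point for reading that reference, here is a minimal sketch of `LabelEncoder` on a made-up list of labels. Unlike the encoders above, it works on a 1-D sequence and is intended for target labels rather than feature columns.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels; the values are illustrative only.
labels = ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"]

enc = LabelEncoder()

# fit_transform takes a 1-D sequence, so no reshape(-1, 1) is needed.
codes = enc.fit_transform(labels)

print(enc.classes_)
print(codes)
print(enc.inverse_transform(codes))
```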
