Validating machine learning models

Once we have built a machine learning model, we need to check that it has learned something meaningful from the training data. This step is machine learning validation.

Validating a machine learning model is essential in developing any data-driven solution.

It ensures that the model performs as intended and has learned relevant patterns from the data. Validation involves assessing a model’s accuracy, reliability, and generalization performance. Machine learning validation is crucial because models can easily overfit the training data, making them unreliable in real-world scenarios.
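The overfitting risk can be made concrete by comparing a model's score on its own training data with its score on held-out data. The sketch below is illustrative only: it uses synthetic data from `make_regression` rather than the housing data used later, and an unconstrained `DecisionTreeRegressor`, chosen here because a fully grown tree tends to memorise its training set.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression data (an illustrative stand-in, not the housing data)
X, y = make_regression(n_samples=500, n_features=10, noise=50.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# A tree with no depth limit can fit the training data almost perfectly
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = tree.score(X_train, y_train)  # score on data the model has seen
val_r2 = tree.score(X_val, y_val)        # score on data the model has not seen
print(f"train R^2: {train_r2:.3f}")
print(f"validation R^2: {val_r2:.3f}")
```

A large gap between the two scores is the usual symptom of overfitting, which is why the training score alone says little about real-world performance.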

In practice, this means splitting the data into training and validation sets, evaluating the model on the validation set, and tuning the model's hyperparameters until the performance is acceptable. A separate test set, left untouched during tuning, is kept for a final check.
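The split, evaluate, tune loop described above might look like the following minimal sketch. The synthetic data, the `DecisionTreeRegressor`, and the candidate `max_depth` values are all assumptions made for illustration; any estimator and any hyperparameter grid could take their place.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data as a stand-in for a real dataset
X, y = make_regression(n_samples=600, n_features=10, noise=50.0, random_state=0)

# 50% training, 25% validation, 25% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit one candidate per hyperparameter value and score each on the validation set
val_scores = {}
for depth in [2, 4, 8, None]:
    candidate = DecisionTreeRegressor(max_depth=depth, random_state=0)
    candidate.fit(X_train, y_train)
    val_scores[depth] = candidate.score(X_val, y_val)

# Keep the setting with the best validation score
best_depth = max(val_scores, key=val_scores.get)

# The test set is used once, after tuning is finished
final = DecisionTreeRegressor(max_depth=best_depth, random_state=0).fit(X_train, y_train)
test_r2 = final.score(X_test, y_test)
print(val_scores, best_depth, test_r2)
```

Because the validation set guided the choice of `best_depth`, its score is optimistically biased; the test score is the more honest estimate.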

How To

from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("data/housing.csv")
df.head()

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY

df = df.dropna()

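# Hold out half the data (stratified by ocean_proximity); the target is median_house_value.
x_train, x_, y_train, y_ = train_test_split(
    df.drop(["longitude", "latitude", "ocean_proximity", "median_house_value"], axis=1),
    df.median_house_value, test_size=.5, stratify=df.ocean_proximity)

# Split the held-out half evenly into validation and test sets.
x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)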

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor().fit(x_train, y_train)

# For regressors, score() returns the coefficient of determination (R^2).
model.score(x_val, y_val)

0.6505743884778422
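For a regressor, `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand and compares it with `sklearn.metrics.r2_score`. It uses synthetic data and a small forest (`n_estimators=20`), both assumptions made to keep the example self-contained and fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

reg = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)
pred = reg.predict(X_val)

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_val - pred) ** 2)
ss_tot = np.sum((y_val - y_val.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, r2_score(y_val, pred), reg.score(X_val, y_val))
```

A value of 1 means perfect predictions, 0 means no better than always predicting the mean of the evaluation targets, and negative values mean worse than that.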

Cross-validation

from sklearn.model_selection import cross_val_score, cross_val_predict

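# cross_val_score refits a clone of the model on each fold of the data it is given
# (5 folds by default) and returns one score per fold.
cross_val_score(model, x_val, y_val)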

array([0.64030317, 0.63953665, 0.67780258, 0.61851229, 0.60711769])

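# cross_val_predict returns one prediction per sample, each made by a model
# that did not see that sample during fitting.
cross_val_predict(model, x_test, y_test)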

array([199369.  , 142416.  , 185627.02, ..., 135886.  , 155118.  ,
       419845.36])
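To show what these helpers do internally, here is a minimal sketch of a manual K-fold loop set against `cross_val_score`. It assumes synthetic data and an explicit `KFold(n_splits=5)`; fixing `random_state` on the estimator makes the two approaches directly comparable. Cross-validation is commonly run on the training portion of the data, so no separate split is made here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Synthetic stand-in data
X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)
est = RandomForestRegressor(n_estimators=20, random_state=0)
cv = KFold(n_splits=5)

# Manual loop: fit a fresh copy on four folds, score it on the remaining fold
manual_scores = []
for train_idx, test_idx in cv.split(X):
    fold_model = clone(est).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# Library helpers doing the same work
library_scores = cross_val_score(est, X, y, cv=cv)
oof_pred = cross_val_predict(est, X, y, cv=cv)  # one out-of-fold prediction per sample

print(np.round(manual_scores, 3))
print(np.round(library_scores, 3))
```

Note that the estimator passed in is never modified: each fold works on a clone, so a previously fitted model is not reused.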

Dummy Models

from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier

dummy = DummyRegressor(strategy="mean")

dummy.fit(x_train, y_train)

DummyRegressor()

dummy.score(x_val, y_val)

-0.00012451028437854283

cross_val_predict(dummy, x_test, y_test)

array([204526.08000979, 204526.08000979, 204526.08000979, ...,
       203670.76834638, 203670.76834638, 203670.76834638])
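The near-zero score above follows from the definition of R^2: a model that always predicts the training mean scores exactly 0 on the training data and at most 0 on any other data. The sketch below checks this on synthetic data (an assumption, not the housing data). It also hints at why the `cross_val_predict` output above contains more than one constant: each fold's dummy is fitted on a different subset, so its mean differs slightly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X, y = make_regression(n_samples=400, n_features=5, noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

# Every prediction is the mean of the training targets; the features are ignored
val_pred = baseline.predict(X_val)

train_r2 = baseline.score(X_train, y_train)  # 0 up to floating-point error
val_r2 = baseline.score(X_val, y_val)        # never above 0 for a constant prediction
print(train_r2, val_r2)
```

Any real model should beat this baseline clearly; if it does not, the features carry little usable signal or something in the pipeline is broken.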

# Switch to a classification task: predict ocean_proximity from the same features.
x_train, x_, y_train, y_ = train_test_split(
    df.drop(["longitude", "latitude", "ocean_proximity", "median_house_value"], axis=1),
    df.ocean_proximity, test_size=.5)

x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)

dummy = DummyClassifier(strategy="prior")

dummy.fit(x_train, y_train)

DummyClassifier()

dummy.score(x_val, y_val)

0.4355912294440094

model = RandomForestClassifier().fit(x_train, y_train)

model.score(x_val, y_val)

0.6004306969459671

cross_val_score(model, x_test, y_test)

array([0.57632094, 0.58414873, 0.60665362, 0.57827789, 0.5798237 ])

cross_val_score(dummy, x_test, y_test)

array([0.44716243, 0.44618395, 0.44618395, 0.44618395, 0.44662096])
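The dummy scores above are best read as the share of the most common class: with `strategy="prior"`, `predict` always returns the class that was most frequent in the training data. The sketch below checks this on a synthetic, imbalanced three-class problem. The `make_classification` settings are assumptions chosen for illustration, not a model of the housing data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 60% / 30% / 10% across three classes
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

baseline = DummyClassifier(strategy="prior").fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# The baseline always predicts the majority class seen during training
majority = np.bincount(y_train).argmax()
baseline_acc = baseline.score(X_val, y_val)
majority_share = np.mean(y_val == majority)

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"majority share:    {majority_share:.3f}")
print(f"forest accuracy:   {forest.score(X_val, y_val):.3f}")
```

On imbalanced data a seemingly decent accuracy can be mostly class frequency, so the margin over the dummy is the figure worth reporting.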

Exercise

Try different dummy strategies and see how they compare.

dummy = DummyClassifier(strategy=...)
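One possible way to organise the comparison is a loop over strategy names, sketched here on synthetic data so that the housing exercise stays open. The four names used are strategies that `DummyClassifier` accepts; the data settings and the `scores` dictionary are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data as a stand-in
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Fit one baseline per strategy and record its validation accuracy
scores = {}
for strategy in ["most_frequent", "prior", "stratified", "uniform"]:
    baseline = DummyClassifier(strategy=strategy, random_state=0).fit(X_train, y_train)
    scores[strategy] = baseline.score(X_val, y_val)

for strategy, acc in scores.items():
    print(f"{strategy:>14}: {acc:.3f}")
```

`most_frequent` and `prior` give the same accuracy because both always predict the majority class; they differ only in what `predict_proba` returns.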