Validating machine learning models

Once we have built a machine learning model, we need to check that it has learned something meaningful from the training data. This step is called model validation.

Validating a machine learning model is an essential step in developing any data-driven solution. It checks that the model performs as intended and has learned relevant patterns from the data, by assessing the model's accuracy, reliability, and generalization performance. Validation matters because a model can easily overfit the training data: it scores well on the examples it has seen but becomes unreliable on new, real-world data.

In practice, we split the data into training, validation, and test sets, fit the model on the training set, evaluate it on the validation set, and tune the model's hyperparameters until the validation performance is acceptable. The test set is kept aside for a final check of the chosen model.
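The split, evaluate, and tune loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the data is synthetic (`make_regression`), and the tuned hyperparameter (`max_depth`), the candidate values, and all variable names are chosen for the example and are not part of the housing walkthrough in this tutorial.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the housing data).
X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)

# 50% training, 25% validation, 25% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Tune one hyperparameter using only the validation set.
val_scores = {}
for max_depth in (2, 4, 8):
    candidate = RandomForestRegressor(
        n_estimators=50, max_depth=max_depth, random_state=0)
    candidate.fit(X_train, y_train)
    val_scores[max_depth] = candidate.score(X_val, y_val)

best_depth = max(val_scores, key=val_scores.get)

# Refit the chosen configuration and score it once on the untouched test set.
final_model = RandomForestRegressor(
    n_estimators=50, max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)

print(val_scores, best_depth, test_score)
```

Keeping the test set out of the tuning loop means the final score is not biased by the hyperparameter choice.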

How To

from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("data/housing.csv")
df.head()
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY
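# Drop rows with missing values before splitting.
df = df.dropna()
# Keep half of the data for training, stratified by ocean_proximity.
# The coordinates, the categorical column, and the target are removed from the features.
x_train, x_, y_train, y_ = train_test_split(
    df.drop(["longitude", "latitude", "ocean_proximity", "median_house_value"], axis=1),
    df.median_house_value, test_size=.5, stratify=df.ocean_proximity)

# Split the held-out half evenly into validation and test sets.
x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)
from sklearn.ensemble import RandomForestRegressor
# Fit on the training set only, then score on the validation set.
# For regressors, score() returns the coefficient of determination (R^2).
model = RandomForestRegressor().fit(x_train, y_train)
model.score(x_val, y_val)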
0.6505743884778422
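The number returned by `score` for a regressor is the coefficient of determination (R^2), which has no units. As a hedged sketch, the example below shows that `score` matches `r2_score` and adds the mean absolute error, which is expressed in the units of the target. The synthetic data and the variable names are assumptions made for the example, not part of the housing walkthrough.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the housing data).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
pred = reg.predict(X_val)

# score() on a regressor is R^2 computed on the given data.
r2_from_score = reg.score(X_val, y_val)
r2_from_metric = r2_score(y_val, pred)

# Mean absolute error: average size of the prediction error, in target units.
mae = mean_absolute_error(y_val, pred)

print(r2_from_score, r2_from_metric, mae)
```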

Cross-validation

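from sklearn.model_selection import cross_val_score, cross_val_predict
# cross_val_score clones the model, refits it on each fold (5 folds by default),
# and returns one score per fold.
cross_val_score(model, x_val, y_val)
array([0.64030317, 0.63953665, 0.67780258, 0.61851229, 0.60711769])
# cross_val_predict returns out-of-fold predictions: each row is predicted
# by a model that did not see that row during fitting.
cross_val_predict(model, x_test, y_test)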
array([199369.  , 142416.  , 185627.02, ..., 135886.  , 155118.  ,
       419845.36])
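A single split gives one score that depends on which rows happened to land in the validation set. Cross-validation repeats the fit on several folds, so we can look at the spread of the scores. The sketch below makes the fold setup explicit with `KFold`. The synthetic data, the shuffling, and the fixed seeds are assumptions for the example and differ from the default settings used in the calls above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Synthetic stand-in data (an assumption, not the housing data).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
reg = RandomForestRegressor(n_estimators=30, random_state=0)

# Five shuffled folds with a fixed seed so the result is repeatable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One R^2 score per fold; the estimator is cloned and refit for each fold.
scores = cross_val_score(reg, X, y, cv=cv)

# One out-of-fold prediction per row.
oof_pred = cross_val_predict(reg, X, y, cv=cv)

print(scores.mean(), scores.std())
```

Reporting the mean together with the standard deviation shows how stable the estimate is across folds.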

Dummy Models

from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier
# Baseline that ignores the features and always predicts the mean of the training targets.
dummy = DummyRegressor(strategy="mean")
dummy.fit(x_train, y_train)
DummyRegressor()
dummy.score(x_val, y_val)
-0.00012451028437854283
cross_val_predict(dummy, x_test, y_test)
array([204526.08000979, 204526.08000979, 204526.08000979, ...,
       203670.76834638, 203670.76834638, 203670.76834638])
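The outputs above follow from how the mean baseline works: it predicts one constant, the mean of the targets it was fitted on. Its R^2 is therefore zero on its own training data and zero or slightly negative on held-out data. Under cross-validation the constant changes from fold to fold, because each fold is predicted from the mean of the remaining folds. A minimal sketch on synthetic data (an assumption, with names chosen for the example):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the housing data).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
pred = baseline.predict(X_val)

# Every prediction equals the mean of the training targets.
all_same = np.allclose(pred, y_train.mean())

# R^2 is zero on the training data and cannot exceed zero on held-out data.
train_r2 = baseline.score(X_train, y_train)
val_r2 = baseline.score(X_val, y_val)

print(all_same, train_r2, val_r2)
```

Any useful regressor should score clearly above this baseline on the same validation data.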
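# Switch to a classification task: predict ocean_proximity from the same numeric features.
x_train, x_, y_train, y_ = train_test_split(
    df.drop(["longitude", "latitude", "ocean_proximity", "median_house_value"], axis=1),
    df.ocean_proximity, test_size=.5)

# Split the held-out half evenly into validation and test sets.
x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)
# Baseline that always predicts the most frequent class seen in training.
dummy = DummyClassifier(strategy="prior")
dummy.fit(x_train, y_train)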
DummyClassifier()
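# Accuracy of the baseline on the validation set.
dummy.score(x_val, y_val)
0.4355912294440094
# Accuracy of a real classifier on the same validation set, for comparison.
model = RandomForestClassifier().fit(x_train, y_train)
model.score(x_val, y_val)
0.6004306969459671
# Cross-validated accuracy of the model, then of the baseline.
cross_val_score(model, x_test, y_test)
array([0.57632094, 0.58414873, 0.60665362, 0.57827789, 0.5798237 ])
cross_val_score(dummy, x_test, y_test)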
array([0.44716243, 0.44618395, 0.44618395, 0.44618395, 0.44662096])
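The baseline accuracy is simply the share of the most frequent class, which is why an accuracy figure only means something relative to it. The sketch below checks this on an imbalanced synthetic data set. The data, the class weights, and the variable names are assumptions for the example and are unrelated to the housing data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data with roughly a 70/30 class imbalance (an assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

baseline = DummyClassifier(strategy="prior").fit(X_train, y_train)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# The baseline always predicts the most frequent training class,
# so its accuracy is that class's share of the validation set.
majority_class = np.bincount(y_train).argmax()
majority_share = np.mean(y_val == majority_class)

baseline_acc = baseline.score(X_val, y_val)
model_acc = clf.score(X_val, y_val)

print(majority_share, baseline_acc, model_acc)
```

The gap between `model_acc` and `baseline_acc` is the part of the accuracy that the model earned from the features.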

Exercise

Try different dummy strategies and see how they compare.

dummy = DummyClassifier(strategy=...)

Additional Resources