# Validating machine learning models

Once we have built a machine learning model, we need to check that it has learned something meaningful from the training data. This step is called model validation.

Validation is an essential step in developing any data-driven solution. It checks that the model performs as intended and has learned relevant patterns from the data by assessing its accuracy, reliability, and generalization performance. This matters because models can easily overfit the training data, which makes them unreliable in real-world scenarios.

In practice, this means splitting the data into training and validation sets, evaluating the model on the validation set, and tuning the model's hyperparameters until performance is acceptable. A separate test set is held back for a final check once tuning is done.
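
The split, evaluate, and tune loop can be sketched end to end on a small synthetic dataset. This is only an illustrative sketch: the data, the `max_depth` candidates, and the fixed `random_state` values are assumptions made for the example, not part of the housing notebook that follows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration, not the housing data).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=400)

# Hold out half of the rows, then split that half into validation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Tune one hyperparameter using the validation set only.
val_scores = {}
for max_depth in (2, 4, 8):
    candidate = RandomForestRegressor(n_estimators=50, max_depth=max_depth, random_state=0)
    candidate.fit(X_train, y_train)
    val_scores[max_depth] = candidate.score(X_val, y_val)  # R^2 on unseen rows

best_depth = max(val_scores, key=val_scores.get)

# Refit with the chosen setting and use the test set once, at the very end.
final_model = RandomForestRegressor(n_estimators=50, max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(val_scores, best_depth, test_score)
```

Keeping the test set out of the tuning loop is the design choice that matters here: the validation score is used to pick `max_depth`, so only the test score remains an unbiased estimate.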

## How To

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("data/housing.csv")
df.head()

| Unnamed: 0 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [2]:
df = df.dropna()

In [3]:
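# Hold out 50% of the rows, stratified by ocean_proximity. The "Unnamed: 0" column
# (the row index saved in the CSV) is not dropped, so it remains a feature.
x_train, x_, y_train, y_ = train_test_split(
    df.drop(["longitude", "latitude", "ocean_proximity", "median_house_value"], axis=1),
    df.median_house_value, test_size=.5, stratify=df.ocean_proximity)

# Split the held-out half evenly into validation and test sets.
x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)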

In [4]:
from sklearn.ensemble import RandomForestRegressor

In [5]:
model = RandomForestRegressor().fit(x_train, y_train)

In [6]:
model.score(x_val, y_val)  # R^2 on the validation set

0.6693145287445711

## Cross-validation

In [7]:
from sklearn.model_selection import cross_val_score, cross_val_predict

In [8]:
# 5-fold cross-validation (the default): a clone of the model is refit on each fold
cross_val_score(model, x_val, y_val)

array([0.64611466, 0.65298153, 0.65183365, 0.63241862, 0.61532077])

In [9]:
# Out-of-fold prediction for each row of x_test
cross_val_predict(model, x_test, y_test)

array([144938.  , 162973.02, 168389.  , ..., 192755.01, 240199.01,
        93347.  ])
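
To make the cross-validation calls above more explicit, here is a minimal sketch that passes its own `KFold` splitter and summarizes the fold scores. The synthetic data, the shuffled 5-fold splitter, and the `random_state` values are assumptions for illustration; the notebook itself relies on the defaults.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One R^2 score per fold; the spread hints at how stable the estimate is.
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())

# One prediction per row, each made by a model that never saw that row.
oof_pred = cross_val_predict(model, X, y, cv=cv)
print(oof_pred.shape)
```

Reporting the mean together with the standard deviation is usually more informative than a single split score, because it shows how much the result depends on which rows end up in which fold.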

## Dummy Models

In [23]:
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier

In [11]:
dummy = DummyRegressor(strategy="mean")

In [13]:
dummy.fit(x_train, y_train)

DummyRegressor()

In [14]:
dummy.score(x_val, y_val)

-9.191303146915963e-05
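
A score this close to zero is expected: for regressors, `score` returns the coefficient of determination R^2 = 1 - SS_res / SS_tot, which compares the model against a predictor that always outputs the mean. The sketch below recomputes that value by hand for a mean-predicting dummy; the synthetic targets and the all-zero feature matrix are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Synthetic targets (an assumption); the dummy ignores the features entirely.
rng = np.random.default_rng(0)
y_train = rng.normal(loc=200_000, scale=50_000, size=500)
y_val = rng.normal(loc=200_000, scale=50_000, size=500)
X_train = np.zeros((500, 1))
X_val = np.zeros((500, 1))

dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)
pred = dummy.predict(X_val)  # the training mean, repeated for every row

# R^2 by hand: residual sum of squares relative to the spread around the validation mean.
ss_res = np.sum((y_val - pred) ** 2)
ss_tot = np.sum((y_val - y_val.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, dummy.score(X_val, y_val))
```

Because the dummy predicts the training mean rather than the validation mean, its R^2 on held-out data ends up at or slightly below zero, which is the floor any useful regressor should clear.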

In [15]:
cross_val_predict(dummy, x_test, y_test)

array([207418.42427208, 207418.42427208, 207418.42427208, ...,
       206627.68517613, 206627.68517613, 206627.68517613])

In [16]:
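# Same features as before, but the target is now ocean_proximity (a classification task).
x_train, x_, y_train, y_ = train_test_split(
    df.drop(["longitude", "latitude", "ocean_proximity", "median_house_value"], axis=1),
    df.ocean_proximity, test_size=.5)

# Split the held-out half evenly into validation and test sets.
x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)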

In [20]:
dummy = DummyClassifier(strategy="prior")

In [21]:
dummy.fit(x_train, y_train)

DummyClassifier(strategy='prior')

In [22]:
dummy.score(x_val, y_val)

0.43872357086922475

In [24]:
model = RandomForestClassifier().fit(x_train, y_train)

In [25]:
model.score(x_val, y_val)

0.5920125293657008

In [26]:
cross_val_score(model, x_test, y_test)



array([0.58708415, 0.61056751, 0.57729941, 0.59197652, 0.57884427])

In [27]:
cross_val_score(dummy, x_test, y_test)



array([0.44129159, 0.44129159, 0.44031311, 0.44031311, 0.44074437])
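
The comparison above is easier to interpret once you see where the dummy's accuracy comes from. The following sketch uses a synthetic, imbalanced two-class dataset (an assumption, generated with `make_classification`) to show that the `"prior"` strategy scores exactly the share of the majority class, and that a real model is only useful to the extent it beats that number.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 80% of rows belong to class 0 (an assumption).
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           weights=[0.8, 0.2], class_sep=1.5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

baseline = DummyClassifier(strategy="prior").fit(X_train, y_train)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

baseline_acc = baseline.score(X_val, y_val)   # always predicts the majority class
model_acc = model.score(X_val, y_val)
majority_share = np.bincount(y_val).max() / len(y_val)

print(baseline_acc, majority_share, model_acc)
```

On imbalanced data a raw accuracy figure can look respectable while being no better than guessing the most common class, so the gap between `model_acc` and `baseline_acc` is the number worth reporting.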

## Exercise

Try different dummy strategies and see how their scores compare.

In [None]:
dummy = DummyClassifier(strategy=...)

## Additional Resources

- [ELI5](https://eli5.readthedocs.io/)
- [Dummy Models](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)
- [ML Fairness](https://en.wikipedia.org/wiki/Fairness_(machine_learning))