{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Validating machine learning models\n", "\n", "Once we have built a machine learning model, we need to check that it learned something meaningful during training. This step is called model validation.\n", "\n", "Validation is an essential part of developing any data-driven solution. \n", "\n", "It checks that the model performs as intended and has learned relevant patterns from the data, by assessing its accuracy, reliability, and generalization performance. This matters because models can easily overfit the training data, which makes them unreliable on data they have not seen. \n", "\n", "The process involves splitting the data into training and validation sets, evaluating the model's performance on the validation set, and tuning the model's hyperparameters until performance is acceptable. A separate test set, left untouched during tuning, then provides the final performance estimate." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How To" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>longitude</th><th>latitude</th><th>housing_median_age</th><th>total_rooms</th><th>total_bedrooms</th><th>population</th><th>households</th><th>median_income</th><th>median_house_value</th><th>ocean_proximity</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>-122.23</td><td>37.88</td><td>41.0</td><td>880.0</td><td>129.0</td><td>322.0</td><td>126.0</td><td>8.3252</td><td>452600.0</td><td>NEAR BAY</td></tr>\n",
"<tr><th>1</th><td>-122.22</td><td>37.86</td><td>21.0</td><td>7099.0</td><td>1106.0</td><td>2401.0</td><td>1138.0</td><td>8.3014</td><td>358500.0</td><td>NEAR BAY</td></tr>\n",
"<tr><th>2</th><td>-122.24</td><td>37.85</td><td>52.0</td><td>1467.0</td><td>190.0</td><td>496.0</td><td>177.0</td><td>7.2574</td><td>352100.0</td><td>NEAR BAY</td></tr>\n",
"<tr><th>3</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1274.0</td><td>235.0</td><td>558.0</td><td>219.0</td><td>5.6431</td><td>341300.0</td><td>NEAR BAY</td></tr>\n",
"<tr><th>4</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1627.0</td><td>280.0</td><td>565.0</td><td>259.0</td><td>3.8462</td><td>342200.0</td><td>NEAR BAY</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value ocean_proximity \n", "0 322.0 126.0 8.3252 452600.0 NEAR BAY \n", "1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n", "2 496.0 177.0 7.2574 352100.0 NEAR BAY \n", "3 558.0 219.0 5.6431 341300.0 NEAR BAY \n", "4 565.0 259.0 3.8462 342200.0 NEAR BAY " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "import pandas as pd\n", "\n", "df = pd.read_csv(\"data/housing.csv\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = df.dropna()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "x_train, x_, y_train, y_ = train_test_split(df.drop([\"longitude\",\"latitude\", \"ocean_proximity\", \"median_house_value\"], axis=1), \n", " df.median_house_value, test_size=.5, stratify=df.ocean_proximity)\n", "\n", "x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "model = RandomForestRegressor().fit(x_train, y_train)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6693145287445711" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.score(x_val, y_val)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross-validation" ] }, { "cell_type": "code", 
"execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score, cross_val_predict" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.64611466, 0.65298153, 0.65183365, 0.63241862, 0.61532077])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cross_val_score(model, x_val, y_val)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([144938. , 162973.02, 168389. , ..., 192755.01, 240199.01,\n", " 93347. ])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cross_val_predict(model, x_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dummy Models" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "from sklearn.dummy import DummyClassifier, DummyRegressor\n", "from sklearn.ensemble import RandomForestClassifier" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "dummy = DummyRegressor(strategy=\"mean\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DummyRegressor()" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy.fit(x_train, y_train)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-9.191303146915963e-05" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy.score(x_val, y_val)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([207418.42427208, 207418.42427208, 207418.42427208, ...,\n", " 206627.68517613, 206627.68517613, 206627.68517613])" ] }, "execution_count": 15, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "cross_val_predict(dummy, x_test, y_test)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "x_train, x_, y_train, y_ = train_test_split(df.drop([\"longitude\",\"latitude\", \"ocean_proximity\", \"median_house_value\"], axis=1), \n", " df.ocean_proximity, test_size=.5)\n", "\n", "x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "dummy = DummyClassifier(strategy=\"prior\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DummyClassifier(strategy='prior')" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy.fit(x_train, y_train)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.43872357086922475" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy.score(x_val, y_val)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "model = RandomForestClassifier().fit(x_train, y_train)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5920125293657008" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.score(x_val, y_val)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\tools\\Anaconda3\\envs\\skillshare\\lib\\site-packages\\sklearn\\model_selection\\_split.py:670: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.\n", " warnings.warn((\"The least populated class in y has only %d\"\n" ] }, { "data": { "text/plain": [ "array([0.58708415, 0.61056751, 0.57729941, 0.59197652, 0.57884427])" ] 
}, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Per-fold accuracy of the random forest, refit on each split of the test set\n", "cross_val_score(model, x_test, y_test)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\tools\\Anaconda3\\envs\\skillshare\\lib\\site-packages\\sklearn\\model_selection\\_split.py:670: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.\n", " warnings.warn((\"The least populated class in y has only %d\"\n" ] }, { "data": { "text/plain": [ "array([0.44129159, 0.44129159, 0.44031311, 0.44031311, 0.44074437])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Per-fold accuracy of the baseline, for comparison\n", "cross_val_score(dummy, x_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise\n", "\n", "Try different dummy strategies and see how they compare. For `DummyClassifier`, the available strategies are `most_frequent`, `prior`, `stratified`, `uniform`, and `constant`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dummy = DummyClassifier(strategy=...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional Resources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [ELI5](https://eli5.readthedocs.io/)\n", "- [Dummy Models](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)\n", "- [ML Fairness](https://en.wikipedia.org/wiki/Fairness_(machine_learning))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }