{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Machine learning classification\n", "\n", "Building machine learning models to assign data to classes.\n", "\n", "Machine learning is a widely used tool for solving classification problems.\n", "\n", "The goal is to assign data points to pre-defined classes based on their features or attributes. This technique has applications in a wide range of fields, from image and speech recognition to fraud detection and spam filtering. Building a classification model involves training an algorithm on a labelled dataset, in which each data point is associated with a specific class label. By analyzing the relationships between the input features and the output labels, the model learns to classify new, unseen data points.\n", "\n", "In this way, machine learning provides a tool for automating classification tasks and supporting decision-making." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How To" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>longitude</th><th>latitude</th><th>housing_median_age</th><th>total_rooms</th><th>total_bedrooms</th><th>population</th><th>households</th><th>median_income</th><th>median_house_value</th><th>ocean_proximity</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>-122.23</td><td>37.88</td><td>41.0</td><td>880.0</td><td>129.0</td><td>322.0</td><td>126.0</td><td>8.3252</td><td>452600.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>1</th><td>-122.22</td><td>37.86</td><td>21.0</td><td>7099.0</td><td>1106.0</td><td>2401.0</td><td>1138.0</td><td>8.3014</td><td>358500.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>2</th><td>-122.24</td><td>37.85</td><td>52.0</td><td>1467.0</td><td>190.0</td><td>496.0</td><td>177.0</td><td>7.2574</td><td>352100.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>3</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1274.0</td><td>235.0</td><td>558.0</td><td>219.0</td><td>5.6431</td><td>341300.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>4</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1627.0</td><td>280.0</td><td>565.0</td><td>259.0</td><td>3.8462</td><td>342200.0</td><td>NEAR BAY</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value ocean_proximity \n", "0 322.0 126.0 8.3252 452600.0 NEAR BAY \n", "1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n", "2 496.0 177.0 7.2574 352100.0 NEAR BAY \n", "3 558.0 219.0 5.6431 341300.0 NEAR BAY \n", "4 565.0 259.0 3.8462 342200.0 NEAR BAY " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "import pandas as pd\n", "\n", "df = pd.read_csv(\"data/housing.csv\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df = df.dropna()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "x_train, x_, y_train, y_ = train_test_split(df.drop([\"longitude\",\"latitude\",\"ocean_proximity\"], axis=1), \n", " df.ocean_proximity, test_size=.5)\n", "\n", "x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Nearest Neighbours" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "model = KNeighborsClassifier(n_neighbors=10)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "KNeighborsClassifier(n_neighbors=10)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(x_train, y_train)" ] }, { "cell_type": "code", "execution_count": 23, 
"metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6080657791699295" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.score(x_val, y_val)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Random Forest" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "rf = RandomForestClassifier()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier()" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf.fit(x_train, y_train)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6697337509788567" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf.score(x_val, y_val)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.1250961 , 0.12509637, 0.10339515, 0.12511516, 0.10756115,\n", " 0.12778423, 0.28595185])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf.feature_importances_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistic Regression" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "model = LogisticRegression(max_iter=10000)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(max_iter=10000)" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"model.fit(x_train, y_train)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.591425215348473" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.score(x_val, y_val)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-1.33993608e-02, 8.16578224e-04, 9.13929488e-04,\n", " 2.36299181e-03, -7.96006970e-04, -4.64591499e-04,\n", " 5.54100347e-06],\n", " [ 3.11373930e-02, 2.06688806e-03, 3.67085594e-03,\n", " 1.62954916e-03, -6.83993928e-03, 3.68037960e-03,\n", " -8.21354760e-06],\n", " [-3.78814880e-04, -5.12065433e-03, -9.88388439e-04,\n", " -4.76942738e-03, -1.29697099e-03, -3.79998357e-05,\n", " -2.32419383e-06],\n", " [ 2.90352744e-03, 1.37494140e-03, -6.76907251e-03,\n", " -5.52893927e-04, 1.05223855e-02, -1.18871567e-03,\n", " 1.32836663e-06],\n", " [-2.02627448e-02, 8.62246654e-04, 3.17267552e-03,\n", " 1.32978033e-03, -1.58946824e-03, -1.98907259e-03,\n", " 3.66837131e-06]])" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test different numbers of neighbours for the KNN classifier and see how pre-processing like scaling affects our results." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional Resources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Sklearn Classification](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }