Linear regression#
A simple machine learning model that can uncover relationships in data.
Linear regression is a widely used machine learning algorithm for modeling and analyzing data.
It is a simple and effective technique for quantifying relationships between variables and predicting outcomes. The basic premise of linear regression is to find the linear function of the independent variables (the features) that best fits the dependent variable (the target), usually by minimizing the sum of squared differences between predicted and observed values. The fitted coefficients show how strongly, and in which direction, each feature is associated with the target, and the model can then be used to predict the target for new data.
Linear regression is a versatile tool with applications in various fields, from finance and economics to healthcare and engineering.
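As a minimal sketch of what fitting a line means, the snippet below solves an ordinary least squares problem directly with NumPy. The data is synthetic and the true slope and intercept (3 and 2) are assumptions chosen for illustration; they have nothing to do with the housing dataset used later.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3*x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=200)

# Design matrix with a column of ones so an intercept is fitted as well.
X = np.column_stack([x, np.ones_like(x)])

# Least squares: choose the slope and intercept that minimise the
# sum of squared residuals between X @ [slope, intercept] and y.
(slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The recovered values should land close to the slope and intercept used to generate the data, which is the sense in which the fitted line is the "best" one.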
How To#
import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()
 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
Preparing training data#
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Hold out half of the rows; stratify so both halves share the same ocean_proximity mix.
x_train, x_test, y_train, y_test = train_test_split(df[["housing_median_age", "total_rooms", "median_income"]],
                                                    df.median_house_value, test_size=.5,
                                                    stratify=df.ocean_proximity)
df.shape
(20640, 10)
x_train.shape
(10320, 3)
x_test.shape
(10320, 3)
Building the model#
model = LinearRegression()
model.fit(x_train, y_train)
LinearRegression()
model.score(x_test, y_test)
0.5150544602369341
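For regressors in scikit-learn, `score` returns the coefficient of determination R², so a value around 0.515 suggests that the three features account for roughly half of the variance in house values on the held-out data. The sketch below recomputes R² by hand on synthetic data to show what the number measures; the data and the names `demo_model`, `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): three features, noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=2.0, size=500)

# Fit on the first half, evaluate on the second half.
demo_model = LinearRegression().fit(X_demo[:250], y_demo[:250])
X_eval, y_eval = X_demo[250:], y_demo[250:]

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
residual_ss = np.sum((y_eval - demo_model.predict(X_eval)) ** 2)
total_ss = np.sum((y_eval - y_eval.mean()) ** 2)
r2_manual = 1 - residual_ss / total_ss

# Should agree with the built-in score up to floating point error.
r2_sklearn = demo_model.score(X_eval, y_eval)
print(r2_manual, r2_sklearn)
```

An R² of 1 would mean perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible on held-out data.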
Improving the model#
from sklearn import preprocessing
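# Split the held-out half again: 75% for validation, 25% for the final test (the default test_size is 0.25).
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test)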
x_test.shape
(2580, 3)
scaler = preprocessing.StandardScaler()
model = LinearRegression()
scaler.fit(x_train)
StandardScaler()
x_scaled = scaler.transform(x_train)
x_scaled
array([[-0.61168405, 0.10966594, -0.34801344],
[ 0.73900819, -0.42930933, 0.93166467],
[ 0.1828408 , -1.08498446, -0.27156486],
...,
[ 0.42119825, -0.1452928 , 1.36494401],
[-0.45277908, 1.66613277, 3.34677127],
[-1.00894647, 0.32572472, 1.99345637]])
model.fit(x_scaled, y_train)
LinearRegression()
model.score(scaler.transform(x_val), y_val)
0.5113048620857856
scaler = preprocessing.MinMaxScaler().fit(x_train)
model = LinearRegression().fit(scaler.transform(x_train), y_train)
model.score(scaler.transform(x_val), y_val)
0.5113048620857856
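The identical validation scores for `StandardScaler` and `MinMaxScaler` are expected rather than a mistake. Both scalers only shift and rescale each feature, and an ordinary least squares fit absorbs that into its coefficients and intercept, so the predictions stay the same. Scaling tends to matter for models that penalise coefficient size (for example `Ridge` or `Lasso`) or that rely on distances. The sketch below checks this on synthetic data; the feature scales and the names used are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic features on very different scales (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3)) * np.array([1.0, 1000.0, 0.01])
y_demo = X_demo @ np.array([2.0, 0.003, 50.0]) + rng.normal(size=400)
X_fit, X_check = X_demo[:300], X_demo[300:]
y_fit, y_check = y_demo[:300], y_demo[300:]

scores = {}
for name, demo_scaler in [("standard", StandardScaler()), ("minmax", MinMaxScaler())]:
    # Fit the scaler on the fitting rows only, then reuse it for the check rows.
    demo_scaler.fit(X_fit)
    reg = LinearRegression().fit(demo_scaler.transform(X_fit), y_fit)
    scores[name] = reg.score(demo_scaler.transform(X_check), y_check)

# Same model without any scaling, for comparison.
scores["unscaled"] = LinearRegression().fit(X_fit, y_fit).score(X_check, y_check)
print(scores)
```

All three scores should match up to floating point error.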
Predicting with the model#
model.predict(scaler.transform(x_test))
array([144942.9245965 , 203274.22981059, 358335.3990082 , ...,
289425.51119122, 176933.16912778, 182636.22567211])
y_test
9630 77400.0
3498 185800.0
10837 374200.0
13601 91500.0
1903 162500.0
...
4343 173900.0
9957 361100.0
15846 356100.0
18956 94200.0
12742 133000.0
Name: median_house_value, Length: 2580, dtype: float64
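Comparing the two printouts by eye only goes so far. One common way to summarise how far predictions are from the true values is the mean absolute error (MAE) or the root mean squared error (RMSE), both in the units of the target. The sketch below uses small made-up arrays, not the housing results above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration only; not taken from the housing data.
y_true = np.array([100_000.0, 200_000.0, 300_000.0, 400_000.0])
y_pred = np.array([110_000.0, 190_000.0, 330_000.0, 380_000.0])

# MAE: average size of the errors, ignoring their sign.
mae = mean_absolute_error(y_true, y_pred)

# RMSE: square root of the mean squared error; penalises large misses more.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"MAE: {mae:,.0f}  RMSE: {rmse:,.0f}")
```

In the notebook, the same two calls could be applied to `y_test` and `model.predict(scaler.transform(x_test))`.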
Inspecting the model#
model.coef_
array([105149.78222749, 153490.93791669, 618396.24168631])
model.intercept_
-3995.519358216203
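These coefficients belong to the model fitted on min-max-scaled features, so each one is roughly the change in predicted house value as a feature moves from its training minimum to its training maximum. To read them per original unit, they can be multiplied by the scaler's `scale_` attribute. The sketch below checks that conversion on synthetic data; the feature ranges, true coefficients and names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (an assumption for illustration) with housing-like feature ranges.
rng = np.random.default_rng(0)
X_demo = rng.uniform(low=[0, 0, 0], high=[50, 10_000, 15], size=(500, 3))
y_demo = X_demo @ np.array([1_000.0, 5.0, 40_000.0]) + rng.normal(scale=10_000, size=500)

demo_scaler = MinMaxScaler().fit(X_demo)
demo_model = LinearRegression().fit(demo_scaler.transform(X_demo), y_demo)

# MinMaxScaler computes X_scaled = X * scale_ + min_, so multiplying each
# coefficient by scale_ gives the change in y per one original unit of the feature.
coef_per_unit = demo_model.coef_ * demo_scaler.scale_

# A model fitted on the raw features should give the same per-unit coefficients.
raw_model = LinearRegression().fit(X_demo, y_demo)
print(coef_per_unit)
print(raw_model.coef_)
```

With `StandardScaler`, the analogous conversion would divide each coefficient by the scaler's `scale_`, since that scaler divides by the standard deviation instead.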
Exercise#
Experiment with how preprocessing affects your data.