Linear regression#

A simple machine learning model that can uncover relationships in data.

Linear regression is a widely used machine learning algorithm for modelling and analyzing data.

It is a simple and effective technique for discovering relationships between variables and predicting outcomes. The basic premise of linear regression is to find the linear function of the independent variables (the features) that best fits the dependent variable (the target), usually by minimizing the sum of squared errors between predictions and observed values. The fitted coefficients indicate how each feature relates to the target, and the fitted model can be used to predict the target for new data.

Linear regression is a versatile tool with applications in various fields, from finance and economics to healthcare and engineering.
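To make the idea of a "best linear relationship" concrete, here is a minimal sketch that fits a straight line by least squares using NumPy only. The data is synthetic and the true slope and intercept (3 and 2) are assumptions chosen for illustration; they are unrelated to the housing data used below.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3x + 2 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=200)

# Design matrix with a column of ones so that an intercept is fitted as well.
X = np.column_stack([x, np.ones_like(x)])

# Least-squares solution: the (slope, intercept) that minimizes sum((X @ w - y) ** 2).
(slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)

print(slope, intercept)  # should land close to 3 and 2
```

scikit-learn's `LinearRegression`, used in the rest of this page, solves the same kind of least-squares problem for any number of features.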

How To#

import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY

Preparing training data#

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
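# Use three numeric features to predict median_house_value.
# Hold out half of the rows, stratified so both halves have the same mix of ocean_proximity values.
x_train, x_test, y_train, y_test = train_test_split(df[["housing_median_age", "total_rooms", "median_income"]], 
                                                    df.median_house_value, test_size=.5,
                                                    stratify=df.ocean_proximity)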
df.shape
(20640, 10)
x_train.shape
(10320, 3)
x_test.shape
(10320, 3)
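The `stratify` argument used above asks `train_test_split` to keep the category proportions the same in both parts of the split. The following sketch shows the effect on a small, hypothetical toy frame (the column names and the 50/30/20 mix are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 1000 rows, three categories in a 50/30/20 mix.
toy = pd.DataFrame({
    "feature": range(1000),
    "category": ["A"] * 500 + ["B"] * 300 + ["C"] * 200,
})

# Split in half while preserving the category mix in both halves.
train, test = train_test_split(
    toy, test_size=0.5, stratify=toy["category"], random_state=0
)

# Relative frequency of each category in the two halves.
train_mix = train["category"].value_counts(normalize=True).sort_index()
test_mix = test["category"].value_counts(normalize=True).sort_index()
print(train_mix)
print(test_mix)
```

Both printed distributions should match the mix in the full frame. Passing `random_state` is optional; it only makes the shuffle reproducible.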

Building the model#

model = LinearRegression()
model.fit(x_train, y_train)
LinearRegression()
model.score(x_test, y_test)
0.5096314845775749
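For a regressor, `score` returns the coefficient of determination R², so the value above means the model explains roughly half of the variance in house values on the held-out data. As a rough sketch of what that number measures, the snippet below computes R² by hand on synthetic data (the data and coefficients are assumptions for illustration) and compares it with `score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=500)

# Fit on the first half, evaluate on the second half.
model = LinearRegression().fit(X[:250], y[:250])
X_test, y_test = X[250:], y[250:]

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean).
pred = model.predict(X_test)
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X_test, y_test))  # the two values should agree
```

An R² of 1 would mean perfect predictions, while 0 corresponds to always predicting the mean of the test targets.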

Improving the model#

from sklearn import preprocessing
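# Split the held-out half again: 75% for validation, 25% (the default test_size) for the final test set.
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test)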
x_test.shape
(2580, 3)
scaler = preprocessing.StandardScaler()
model = LinearRegression()
scaler.fit(x_train)
StandardScaler()
x_scaled = scaler.transform(x_train)
x_scaled
array([[-0.8516978 ,  2.93659372,  2.16035996],
       [ 0.5746837 , -0.47577298,  0.10809316],
       [ 1.12938762, -0.13154698, -1.22763257],
       ...,
       [ 0.41619687, -0.12792355, -0.07403226],
       [-0.45548072,  0.25661313, -0.15765223],
       [-1.72337538, -0.44814432,  0.07221127]])
model.fit(x_scaled, y_train)
LinearRegression()
model.score(scaler.transform(x_val), y_val)
0.5130250865571753
# Try min-max scaling instead of standardization; fit the scaler on the training data only.
scaler = preprocessing.MinMaxScaler().fit(x_train)
model = LinearRegression().fit(scaler.transform(x_train), y_train)
model.score(scaler.transform(x_val), y_val)
0.5130250865571754
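The two validation scores above are identical up to floating-point rounding. That is expected: ordinary least squares with an intercept gives the same predictions under any per-feature shift and rescaling, so `StandardScaler` and `MinMaxScaler` cannot change R² on their own. The earlier score of about 0.5096 differs mainly because it was computed on a different set of rows. The sketch below checks the invariance on synthetic data with features on very different scales (the data is an assumption for illustration) and uses `make_pipeline` so the scaler is always fitted on the training rows only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3)) * np.array([1.0, 1000.0, 0.01])
y = X @ np.array([2.0, 0.003, 150.0]) + rng.normal(size=400)
X_train, X_val, y_train, y_val = X[:300], X[300:], y[:300], y[300:]

# Same model with no scaling, standardization, and min-max scaling.
variants = {
    "raw": [],
    "standard": [StandardScaler()],
    "minmax": [MinMaxScaler()],
}

scores = {}
for name, steps in variants.items():
    pipe = make_pipeline(*steps, LinearRegression())
    scores[name] = pipe.fit(X_train, y_train).score(X_val, y_val)

print(scores)  # all three validation scores should match closely
```

Scaling does matter for models that are sensitive to feature magnitude, such as regularized regression (for example `Ridge` or `Lasso`) or distance-based methods, which may be a more rewarding place to experiment.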

Predicting with the Model#

# Predict on the held-out test set, reusing the scaler that was fitted on the training data.
model.predict(scaler.transform(x_test))
array([105754.56009361, 307266.29688142, 169758.79773491, ...,
       355118.98252233, 217674.69084895, 334322.95938733])
y_test
2400     100000.0
968      296900.0
3421     144300.0
10335    413700.0
13988     66800.0
           ...   
3886     199800.0
6748     225000.0
9295     400000.0
4146     196400.0
15629    250000.0
Name: median_house_value, Length: 2580, dtype: float64
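Comparing the two printouts by eye only goes so far; an error metric in the target's own units is easier to read. As a small sketch, the snippet below computes the mean absolute error and root mean squared error for just the first three prediction/target pairs shown above. Three rows are not representative; on the real data you would pass the full `y_test` and the full prediction array instead.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# First three targets and predictions, copied from the output above.
y_true = np.array([100000.0, 296900.0, 144300.0])
y_pred = np.array([105754.56009361, 307266.29688142, 169758.79773491])

# Mean absolute error: average size of the miss, in dollars.
mae = mean_absolute_error(y_true, y_pred)

# Root mean squared error: penalizes large misses more heavily.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(mae, rmse)
```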

Inspecting the model#

model.coef_
array([102851.44422959, 140859.04292946, 614483.19883308])
model.intercept_
-1388.5362536045432
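Note that these coefficients belong to the model fitted on min-max scaled features, so each one is the change in predicted house value as a feature moves from its training minimum to its training maximum, not the change per original unit. A minimal sketch of converting such coefficients back to original units follows; it uses synthetic data whose ranges and true coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic features with very different ranges (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform([1, 0, 0], [52, 40000, 15], size=(500, 3))
y = X @ np.array([1000.0, 5.0, 40000.0]) + 20000.0 + rng.normal(scale=100.0, size=500)

# Fit on min-max scaled features, as in the walkthrough above.
scaler = MinMaxScaler().fit(X)
model = LinearRegression().fit(scaler.transform(X), y)

# MinMaxScaler computes x_scaled = (x - data_min_) / data_range_,
# so dividing by the range expresses each coefficient per original unit.
coef_original = model.coef_ / scaler.data_range_
intercept_original = model.intercept_ - np.sum(coef_original * scaler.data_min_)

print(coef_original, intercept_original)
```

The converted values should match what `LinearRegression` returns when fitted directly on the unscaled features.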

Exercise#

Experiment with how different preprocessing steps affect your data and the model's score.

Additional Resources#