Linear regression#
A simple machine learning model that can uncover relationships in data.
Linear regression is a simple, widely used machine learning algorithm for modeling and analyzing data.
It is an effective technique for discovering relationships between variables and predicting future outcomes. The basic premise of linear regression is to find the linear relationship between the independent (input) variables and the dependent (target) variable that best fits the data. Doing so helps identify patterns, trends, and correlations in the data, enabling informed decisions and accurate predictions.
Linear regression is a versatile tool with applications in many fields, from finance and economics to healthcare and engineering.
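With a single input variable, for example, the model simply fits a line y = a·x + b to the observations. A minimal sketch of that idea, using made-up numbers rather than the housing data used below:
import numpy as np
from sklearn.linear_model import LinearRegression
# toy data: y is roughly 2*x + 1 plus a little noise
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])
toy_model = LinearRegression().fit(x, y)
toy_model.coef_, toy_model.intercept_  # slope and intercept of the fitted line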
How To#
import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
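Before preparing the training data, it can be worth a quick look at column types, missing values, and summary statistics (an optional inspection step, not required for the walkthrough below):
df.info()      # column dtypes and non-null counts
df.describe()  # summary statistics for the numeric columns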
Preparing training data#
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# split features and target 50/50; stratify on ocean_proximity so both halves
# contain a similar mix of location categories
x_train, x_test, y_train, y_test = train_test_split(
    df[["housing_median_age", "total_rooms", "median_income"]],
    df.median_house_value,
    test_size=0.5,
    stratify=df.ocean_proximity,
)
df.shape
(20640, 10)
x_train.shape
(10320, 3)
x_test.shape
(10320, 3)
Building the model#
model = LinearRegression()
model.fit(x_train, y_train)
LinearRegression()
model.score(x_test, y_test)
0.5216731917619295
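The score returned here is the coefficient of determination (R²) on the held-out data, so these three features explain roughly half of the variance in median house value. If an error in the target's own units is easier to interpret, one option is the mean absolute error (a small sketch, not part of the original walkthrough):
from sklearn.metrics import mean_absolute_error
y_pred = model.predict(x_test)
mean_absolute_error(y_test, y_pred)  # average absolute prediction error in dollars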
Improving the model#
from sklearn import preprocessing
# split the held-out half again: 75% becomes a validation set, 25% a final test set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test)
x_test.shape
(2580, 3)
scaler = preprocessing.StandardScaler()
model = LinearRegression()
scaler.fit(x_train)
StandardScaler()
x_scaled = scaler.transform(x_train)
x_scaled
array([[ 0.43061665, -0.02302084, 0.29384573],
[ 0.19201884, 0.29239232, -0.20693597],
[ 0.27155144, -0.40272796, 0.00145671],
...,
[-0.92143763, -0.04631575, 0.64424097],
[-0.12611158, -0.49357813, 0.9849423 ],
[ 0.98734489, -0.19307372, -0.83195681]])
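StandardScaler rescales each column using the mean and standard deviation it learned from the training data, so every column of x_scaled should have roughly zero mean and unit standard deviation. A quick sanity check (optional):
x_scaled.mean(axis=0)  # expected to be close to 0 for each column
x_scaled.std(axis=0)   # expected to be close to 1 for each column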
model.fit(x_scaled, y_train)
LinearRegression()
model.score(scaler.transform(x_val), y_val)
0.5216152602665426
scaler = preprocessing.MinMaxScaler().fit(x_train)
model = LinearRegression().fit(scaler.transform(x_train), y_train)
model.score(scaler.transform(x_val), y_val)
0.5216152602665426
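Both scalers give essentially the same score as the unscaled model. That is expected: ordinary least-squares linear regression is invariant to affine rescaling of the features, so scaling changes the learned coefficients but not the predictions; it matters much more for regularized or gradient-based models. A quick check of that claim, reusing the split and the MinMax scaler from above (a sketch, and the agreement is expected rather than guaranteed to be exact):
import numpy as np
pred_raw = LinearRegression().fit(x_train, y_train).predict(x_val)
pred_scaled = LinearRegression().fit(scaler.transform(x_train), y_train).predict(scaler.transform(x_val))
np.allclose(pred_raw, pred_scaled)  # expected to be True up to floating-point error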
Predicting with the model#
model.predict(scaler.transform(x_test))
array([181000.00001305, 232931.86956911, 122120.58875964, ...,
148425.64797643, 158802.69973773, 248682.67338617])
y_test
19658 127100.0
7992 195400.0
2033 45500.0
11348 190400.0
9865 142100.0
...
6858 341100.0
9230 95300.0
4896 99200.0
18239 181300.0
18289 500001.0
Name: median_house_value, Length: 2580, dtype: float64
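To eyeball how the predictions line up with the actual values, one option is to place them side by side (a small sketch; the results name is just illustrative):
results = pd.DataFrame({
    "actual": y_test,
    "predicted": model.predict(scaler.transform(x_test)),
})
results.head()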
Inspecting the model#
model.coef_
array([100342.80625166, 154700.44979323, 604979.56346357])
model.intercept_
1950.5338472489384
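The coefficients follow the order of the training columns, and because this model was fit on MinMax-scaled features they apply to the scaled values rather than the original units. Pairing them with the column names makes them easier to read (a small sketch):
pd.Series(model.coef_, index=x_train.columns)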
Exercise#
Experiment with how different preprocessing choices affect your model's performance.