Decision trees and random forests#
In this notebook, we change up the machine learning models.
Decision trees and random forests are popular machine learning techniques for classification and regression tasks.
A decision tree is a tree-like model in which each internal node represents a decision based on a feature and each branch represents an outcome of that decision. A random forest is an ensemble of decision trees, each trained on a random subset of the data and a random subset of the features, whose individual predictions are averaged (for regression) or voted on (for classification). These algorithms are powerful and widely used because they can handle large datasets, deal with missing values, and provide interpretable results.
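To make the ensemble idea concrete, here is a minimal sketch of bagging by hand, assuming a tiny made-up dataset: each tree is fit on a bootstrap sample of the rows, and the "forest" prediction is the average of the trees' predictions. (scikit-learn's RandomForestRegressor additionally considers a random subset of features at each split.)

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data, purely illustrative
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

rng = np.random.default_rng(0)
trees = []
for _ in range(10):
    # Bootstrap: sample rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Average the individual trees' predictions for a new point
prediction = np.mean([t.predict([[2.5]]) for t in trees])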
This notebook will explore decision trees and random forests in more detail and discuss their strengths and weaknesses.
How To#
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the California housing dataset
df = pd.read_csv("data/housing.csv")
df.head()
|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
# Hold out half the data, then split the holdout evenly into validation and test sets
x_train, x_, y_train, y_ = train_test_split(df[["housing_median_age", "total_rooms", "median_income"]],
                                            df.median_house_value, test_size=.5)
x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)
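Applying test_size=.5 twice yields roughly a 50/25/25 train/validation/test split. A quick sanity check (a sketch, not part of the original notebook):

# Each set's share of the full dataset (expect about 0.5, 0.25, 0.25)
for name, part in [("train", x_train), ("val", x_val), ("test", x_test)]:
    print(name, len(part) / len(df))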
Decision Trees#
from sklearn import preprocessing
from sklearn import tree

# Standardize the features, then fit a single decision tree regressor
scaler = preprocessing.StandardScaler()
model = tree.DecisionTreeRegressor()
scaler.fit(x_train)
StandardScaler()
model.fit(scaler.transform(x_train), y_train)
DecisionTreeRegressor()
model.score(scaler.transform(x_val), y_val)
0.11076885309119011
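An unconstrained decision tree keeps splitting until it nearly memorizes the training data, which is one plausible reason the validation R² is this low. A sketch of constraining the tree with the max_depth parameter (the depth values are arbitrary choices, not tuned recommendations):

# Limit tree depth to reduce overfitting; depth values are illustrative
for depth in [2, 4, 8, 16]:
    m = tree.DecisionTreeRegressor(max_depth=depth)
    m.fit(scaler.transform(x_train), y_train)
    print(depth, m.score(scaler.transform(x_val), y_val))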
Build a forest of decision trees#
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest (an ensemble of decision trees) on the raw features
rf = RandomForestRegressor()
rf.fit(x_train, y_train)
RandomForestRegressor()
# R^2 on the training set
rf.score(x_train, y_train)
0.9325366452603119
# R^2 on the held-out validation set
rf.score(x_val, y_val)
0.5100602814817495
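The gap between the training score (about 0.93) and the validation score (about 0.51) suggests the forest is overfitting. One common remedy is a hyperparameter search; the sketch below uses scikit-learn's GridSearchCV with made-up candidate values:

from sklearn.model_selection import GridSearchCV

# Candidate values are illustrative, not tuned recommendations
grid = GridSearchCV(
    RandomForestRegressor(),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [4, 8, None]},
    cv=3,
)
grid.fit(x_train, y_train)
grid.best_params_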
rf.feature_importances_
array([0.14080726, 0.19557078, 0.66362196])
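The importances line up with the columns passed to train_test_split, so pairing them with the feature names makes the output easier to read; here median_income carries most of the weight. A small sketch:

# Pair each importance with its feature name
dict(zip(x_train.columns, rf.feature_importances_))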
Exercise#
Experiment with different machine learning models.
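As a starting point, any scikit-learn regressor can be dropped into the same fit/score pattern. For example, with gradient boosting (a sketch; the resulting score is not shown here):

from sklearn.ensemble import GradientBoostingRegressor

gb = GradientBoostingRegressor()
gb.fit(x_train, y_train)
gb.score(x_val, y_val)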