Machine learning interpretability#
In modern machine learning it is important to be able to explain how our models “think”: a simple accuracy score isn’t enough. This notebook explores the lesson on interpretability.
Machine learning interpretability is an increasingly important topic in artificial intelligence. As models become more complex, understanding how they make predictions becomes more difficult. This lack of transparency can erode trust in a model and make it harder to identify and correct errors. Interpretability is the ability to explain how a machine learning model arrived at a particular decision, and it is essential for building trust and understanding in these powerful tools.
This notebook explores why interpretability matters and provides practical examples of how to achieve it.
How To#
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/housing.csv")  # the classic California housing data
df.head()
|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
df = df.dropna()  # drop rows with missing values
# 50% train; the rest is split evenly into validation and test sets below
x_train, x_, y_train, y_ = train_test_split(
    df.drop(["longitude", "latitude", "ocean_proximity", "median_house_value"], axis=1),
    df.median_house_value, test_size=.5, stratify=df.ocean_proximity)
x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)
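A quick sanity check, not part of the original run, confirms the resulting 50/25/25 split:

print(x_train.shape, x_val.shape, x_test.shape)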
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(x_train, y_train)
RandomForestRegressor()
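The forest trains with scikit-learn’s defaults (100 trees). If you want to tune it, the constructor exposes the usual knobs; the values below are purely illustrative:

model = RandomForestRegressor(n_estimators=300, min_samples_leaf=2, n_jobs=-1)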
model.score(x_val, y_val)
0.6653737863987246
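For a regressor, score returns the coefficient of determination (R²), so the model explains roughly two thirds of the variance on the validation set. A complementary, easier-to-communicate check is the mean absolute error; a minimal sketch using sklearn.metrics:

from sklearn.metrics import mean_absolute_error

mean_absolute_error(y_val, model.predict(x_val))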
Influence of Variables#
import eli5
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[8], line 1
----> 1 import eli5

...

ImportError: cannot import name 'if_delegate_has_method' from 'sklearn.utils.metaestimators' (/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/sklearn/utils/metaestimators.py)
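The root cause is a version mismatch rather than a bug in the notebook: eli5 still imports if_delegate_has_method, a helper that newer scikit-learn releases have removed. If you want to run the eli5 cells, one possible workaround is pinning an older scikit-learn in your environment, for example:

pip install "scikit-learn<1.3" eli5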
eli5.explain_weights(model)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[9], line 1
----> 1 eli5.explain_weights(model)

NameError: name 'eli5' is not defined
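eli5.explain_weights on a random forest essentially reports the model’s built-in impurity-based importances, so a plain scikit-learn sketch, using only objects already defined above, gives comparable output:

pd.Series(model.feature_importances_, index=x_train.columns).sort_values(ascending=False)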
for x in range(5):
display(eli5.explain_prediction(model, x_train.iloc[x, :]))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[10], line 2
      1 for x in range(5):
----> 2     display(eli5.explain_prediction(model, x_train.iloc[x, :]))

NameError: name 'eli5' is not defined

Both failures trace back to the broken eli5 import, so we fall back on scikit-learn’s own inspection tools.
from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure how much the model's score drops (5 repeats by default)
permutation_importance(model, x_train, y_train)
{'importances_mean': array([0.31167605, 0.23759556, 0.41586708, 0.3610504 , 0.30871098,
1.54918227]),
'importances_std': array([0.00514937, 0.00433542, 0.00666387, 0.00283454, 0.0056444 ,
0.02092214]),
'importances': array([[0.31652346, 0.30323932, 0.310437 , 0.31760539, 0.31057506],
[0.23951261, 0.24095205, 0.233456 , 0.24256299, 0.23149416],
[0.4127844 , 0.4197916 , 0.41431098, 0.40636382, 0.4260846 ],
[0.35828569, 0.36594585, 0.36184542, 0.36093884, 0.35823622],
[0.31188079, 0.31009499, 0.30322236, 0.30153734, 0.31681944],
[1.55416512, 1.53079852, 1.54025123, 1.53299162, 1.58770485]])}
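The arrays follow the column order of x_train, with one column per repeat. Pairing the means with feature names makes the result much easier to read; in this run median_income clearly dominates. A small sketch, assuming the objects defined above:

result = permutation_importance(model, x_train, y_train)
pd.Series(result.importances_mean, index=x_train.columns).sort_values(ascending=False)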
from sklearn.inspection import plot_partial_dependence
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[13], line 1
----> 1 from sklearn.inspection import plot_partial_dependence

ImportError: cannot import name 'plot_partial_dependence' from 'sklearn.inspection' (/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/sklearn/inspection/__init__.py)
plot_partial_dependence(model, x_train, x_train.columns)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[14], line 1
----> 1 plot_partial_dependence(model, x_train, x_train.columns)

NameError: name 'plot_partial_dependence' is not defined
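plot_partial_dependence was removed from recent scikit-learn releases; its replacement lives on the PartialDependenceDisplay class. A sketch of the equivalent call with the modern API, assuming the model and x_train from above:

from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(model, x_train, x_train.columns)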
SHAP#
import shap
expl = shap.TreeExplainer(model)

# A background dataset can also be supplied; it changes the baseline (expected value) the explainer uses
shap.TreeExplainer(model, data=x_train)
<shap.explainers._tree.TreeExplainer at 0x7fcf57225af0>
shap_val = expl.shap_values(x_val)
shap.initjs()
shap.force_plot(expl.expected_value, shap_val[0, :], x_val.iloc[0, :])
(Interactive force plot; it renders as JavaScript in a live, trusted notebook.)
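Because the force plot requires JavaScript, it will not appear on static pages. shap also ships static matplotlib plots that work anywhere; for example, a summary plot of the values computed above:

shap.summary_plot(shap_val, x_val)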
Exercise#
Check out shap further and see which plots you can generate.