Machine learning fairness

Machine learning fairness is an important part of modern data modeling. Here we introduce ways to check whether a model is fair and equitable, and to make it more so.

Machine learning is a powerful tool that has revolutionized many industries by enabling computers to learn from data and make predictions or decisions.

However, as machine learning algorithms become increasingly ubiquitous in our daily lives, concerns about fairness and equity have emerged. Machine learning fairness refers to the idea that machine learning models should not perpetuate or exacerbate existing biases or discrimination. In practice, this means that a model's predictions and errors should not differ systematically between individuals or groups because of race, gender, ethnicity, or other protected characteristics.
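One simple way to make this idea concrete is to compute the same metric separately for each group and look at the gap. The sketch below is only an illustration: the helper `accuracy_by_group` is hypothetical and the labels, predictions, and group names are invented.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for parallel lists of labels, predictions, and group ids."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        counts[group] += 1
        hits[group] += int(truth == pred)
    return {g: hits[g] / counts[g] for g in counts}


# Toy data, invented for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_group = accuracy_by_group(y_true, y_pred, groups)
# Difference between the best and worst served group
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)
```

A large gap does not prove that a model is unfair, but it is a signal that the groups deserve a closer look.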

This notebook will provide an overview of the key concepts and challenges in machine learning fairness, as well as some techniques commonly used to address them.

How To

from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("data/housing.csv").dropna()
df.head()
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY
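# Hold out half of the data, stratified by ocean_proximity.
# The location columns and the target are dropped from the features.
x_train, x_, y_train, y_ = train_test_split(
    df.drop(["longitude", "latitude", "ocean_proximity", "median_house_value"], axis=1),
    df.median_house_value,
    test_size=.5,
    stratify=df.ocean_proximity,
)

# Split the held-out half evenly into validation and test sets
x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)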
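Only the first split above is stratified, so a rare class can end up unevenly distributed between validation and test. The sketch below shows one way to check the class shares after splitting. It runs on a synthetic stand-in for the housing data; the column names, group names, and `random_state` values are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data; names and sizes are made up
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "group": rng.choice(["common", "rare"], size=1000, p=[0.9, 0.1]),
})

train, rest = train_test_split(toy, test_size=0.5, stratify=toy["group"], random_state=0)
# Stratify the second split as well so the rare group stays represented
val, test = train_test_split(rest, test_size=0.5, stratify=rest["group"], random_state=0)

# Share of each group in every subset, one column per subset
shares = pd.DataFrame({
    name: part["group"].value_counts(normalize=True)
    for name, part in [("all", toy), ("train", train), ("val", val), ("test", test)]
})
print(shares.round(3))
```

If the shares differ noticeably between columns, per-class scores on the smaller subsets will rest on very few samples.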
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor().fit(x_train, y_train)
model.score(x_val, y_val)
from sklearn.model_selection import cross_val_score

# Score the model separately on each ocean_proximity class
for cls in df.ocean_proximity.unique():
    idx = df[df.ocean_proximity.isin([cls])].index

    try:
        idx_val = x_val.index.intersection(idx)
        print(model.score(x_val.loc[idx_val, :], y_val.loc[idx_val]))

        # cross_val_score refits clones of the model on this subset only
        val = cross_val_score(model, x_val.loc[idx_val, :], y_val.loc[idx_val])
        print(val.mean(), " +- ", val.std(), "\n")
    except ValueError:
        # Raised when a class has too few samples to score or to split into folds
        print("Error in Validation")

    try:
        idx_test = x_test.index.intersection(idx)
        print(model.score(x_test.loc[idx_test, :], y_test.loc[idx_test]))

        tst = cross_val_score(model, x_test.loc[idx_test, :], y_test.loc[idx_test])
        print(tst.mean(), " +- ", tst.std(), "\n")
    except ValueError:
        print("Error in Test")
[0.51361569 0.64630664 0.59280962 0.69167432 0.66258008]
0.621397271310316  +-  0.06275266873707354 

[0.46930571 0.43675151 0.56149112 0.53991517 0.43855913]
0.4892045308559778  +-  0.05197914753138655 

[0.62935218 0.58064886 0.57057219 0.62693846 0.66328712]
0.6141597616245293  +-  0.034148086465257306 

[0.62363128 0.61138027 0.60811331 0.57578551 0.60926118]
0.6056343121336554  +-  0.015919533470271377 

[0.60097059 0.43543378 0.47527694 0.48630933 0.57854818]
0.5153077647383298  +-  0.06349913017713156 

[0.47371532 0.54227822 0.23409711 0.52081891 0.43949161]
0.4420802336257415  +-  0.1100035220041293 

[0.56994925 0.62178888 0.52010515 0.50148382 0.57472886]
0.5576111924627768  +-  0.042710663438220095 

[0.5883007  0.61904115 0.55511549 0.49335263 0.55245425]
0.5616528456131616  +-  0.04194233268167542 

Error in Validation
Error in Test
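The values printed above are R² scores, which is what `score` returns for scikit-learn regressors. R² is measured relative to the spread of the target within the subset being scored, so per-class scores are not directly comparable to the overall score. The sketch below illustrates this with invented numbers and a hand-written `r2` helper (a hypothetical name): both groups have the same absolute errors but very different R².

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Invented data: every prediction is off by exactly 0.5 in both groups
y_true = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.5, 10.5, 19.5, 30.5, 39.5])
group = np.array(["low"] * 4 + ["high"] * 4)

scores = {g: r2(y_true[group == g], y_pred[group == g]) for g in np.unique(group)}
overall = r2(y_true, y_pred)
print(scores, overall)
```

For this reason it can help to report an absolute error per class alongside R² when comparing groups.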

Calculate Residuals

from yellowbrick.regressor import residuals_plot, prediction_error
residuals_plot(model, x_train, y_train, x_test, y_test)
prediction_error(model, x_train, y_train, x_test, y_test)
[Figure: Prediction Error for RandomForestRegressor, $y$ against $\hat{y}$]
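The plots show residuals for the whole data set. For a fairness check it can also help to summarize residuals per group. The following is a minimal sketch with invented numbers and group names; residuals are defined here as observed minus predicted, which is a convention chosen for this example.

```python
import pandas as pd

# Invented targets and predictions for two made-up groups
toy = pd.DataFrame({
    "group": ["inland", "inland", "inland", "coast", "coast", "coast"],
    "y_true": [100.0, 120.0, 110.0, 300.0, 320.0, 310.0],
    "y_pred": [105.0, 118.0, 112.0, 280.0, 300.0, 295.0],
})

# Residual convention for this sketch: observed minus predicted
toy["residual"] = toy["y_true"] - toy["y_pred"]

# A mean far from zero means the model is systematically off for that group
summary = toy.groupby("group")["residual"].agg(["mean", "std", "count"])
print(summary)
```

In this toy data the model underpredicts the "coast" group consistently, while the "inland" residuals are centered near zero.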
Confusion Matrix for Classifiers

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestClassifier

# Predict the ocean_proximity class instead of the house value
x_train, x_, y_train, y_ = train_test_split(
    df.drop(["longitude", "latitude", "ocean_proximity"], axis=1),
    df.ocean_proximity,
    test_size=.5,
    stratify=df.ocean_proximity,
)

x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)

model = RandomForestClassifier().fit(x_train, y_train)
# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement
ConfusionMatrixDisplay.from_estimator(model, x_test, y_test)
ConfusionMatrixDisplay.from_estimator(model, x_test, y_test, normalize="all")
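The choice of normalization changes what the matrix tells you. With `normalize="all"` every cell is a share of all samples, so small classes almost vanish. With `normalize="true"` each row sums to one and the diagonal is the per-class recall. The sketch below shows the difference on a handful of invented labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels; "island" stands in for a very rare class
y_true = ["bay", "bay", "bay", "inland", "inland", "inland", "inland", "island"]
y_pred = ["bay", "inland", "bay", "inland", "inland", "inland", "bay", "inland"]
labels = ["bay", "inland", "island"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_all = confusion_matrix(y_true, y_pred, labels=labels, normalize="all")
cm_true = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

# Per-class recall: correct predictions divided by the row total
recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(recall)
```

Row-normalized values make it easier to see that a rare class is never predicted correctly, even when the overall accuracy looks acceptable.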
Other Important Visualizations

from yellowbrick.classifier import confusion_matrix, classification_report, precision_recall_curve, roc_auc
confusion_matrix(model, x_train, y_train, x_test, y_test)
[Figure: RandomForestClassifier Confusion Matrix, Predicted Class against True Class]
classification_report(model, x_train, y_train, x_test, y_test)
[Figure: RandomForestClassifier Classification Report]
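# Import under an alias so the yellowbrick classification_report used above is not shadowed
from sklearn.metrics import classification_report as sk_classification_report
print(sk_classification_report(y_test, model.predict(x_test)))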
              precision    recall  f1-score   support

   <1H OCEAN       0.64      0.85      0.73      2243
      INLAND       0.78      0.81      0.79      1632
      ISLAND       0.00      0.00      0.00         2
    NEAR BAY       0.52      0.27      0.35       590
  NEAR OCEAN       0.37      0.09      0.14       642

    accuracy                           0.67      5109
   macro avg       0.46      0.40      0.40      5109
weighted avg       0.64      0.67      0.63      5109
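The two average rows answer different questions. The macro average treats every class equally, while the weighted average weights each class by its support, so large classes dominate. The sketch below recomputes both recall averages from the rounded per-class values in the report above; because the inputs are rounded to two decimals, it only reproduces the report to that precision.

```python
# Per-class recall and support, copied from the report above
recall = {"<1H OCEAN": 0.85, "INLAND": 0.81, "ISLAND": 0.00, "NEAR BAY": 0.27, "NEAR OCEAN": 0.09}
support = {"<1H OCEAN": 2243, "INLAND": 1632, "ISLAND": 2, "NEAR BAY": 590, "NEAR OCEAN": 642}

# Macro average: plain mean over classes, regardless of class size
macro = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support
total = sum(support.values())
weighted = sum(recall[c] * support[c] for c in recall) / total

print(round(macro, 2), round(weighted, 2))
```

The gap between the two averages reflects how poorly the smaller classes are served compared with the two largest ones.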
precision_recall_curve(model, x_train, y_train, x_test, y_test)
[Figure: Precision-Recall Curve for RandomForestClassifier, Recall against Precision]
roc_auc(model, x_train, y_train, x_test, y_test)
[Figure: ROC Curves for RandomForestClassifier, False Positive Rate against True Positive Rate]
Modify the code to generate dummy models for each class.
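One possible reading of this exercise is to fit a trivial baseline per class and compare it with the trained model on the same subset. The sketch below does this for a regression task with scikit-learn's `DummyRegressor`. It uses synthetic data so that it runs on its own; the column names, group names, and target formula are invented, and the same pattern could be applied to the housing data or adapted to `DummyClassifier`.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on income plus a group-dependent offset
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "income": rng.uniform(1, 10, size=n),
    "group": rng.choice(["inland", "coast"], size=n),
})
offset = np.where(toy["group"] == "coast", 100, 0)
toy["value"] = 50 * toy["income"] + offset + rng.normal(0, 20, size=n)

train, test = train_test_split(toy, test_size=0.5, stratify=toy["group"], random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(train[["income"]], train["value"])

results = {}
for cls, test_part in test.groupby("group"):
    train_part = train[train["group"] == cls]
    # Baseline that always predicts the mean target of this class's training rows
    dummy = DummyRegressor(strategy="mean")
    dummy.fit(train_part[["income"]], train_part["value"])
    results[cls] = {
        "dummy_r2": dummy.score(test_part[["income"]], test_part["value"]),
        "model_r2": model.score(test_part[["income"]], test_part["value"]),
    }

print(pd.DataFrame(results).T)
```

A constant prediction can never score above zero in R², so a model that barely beats the dummy for one class is adding little for that class, whatever its overall score.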

Additional Resources