Finding and understanding relationships in data#

Raw data on its own tells us little. It becomes useful once we can identify the relationships and patterns within it.

Finding and understanding relationships in data underpins informed decisions, predictive models, and new insights. With statistical techniques and machine learning algorithms, we can uncover connections and dependencies between variables, which helps us make better predictions and improves our understanding of complex systems.

Whether in business, science, or everyday life, the ability to analyze and interpret data is becoming increasingly important, and finding relationships within it is an essential skill for success.

How To#

import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
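# ocean_proximity holds text, so limit the correlation to numeric columns
# (pandas 2.0 and later raise an error otherwise)
df.corr(numeric_only=True)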
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
longitude 1.000000 -0.924664 -0.108197 0.044568 0.069608 0.099773 0.055310 -0.015176 -0.045967
latitude -0.924664 1.000000 0.011173 -0.036100 -0.066983 -0.108785 -0.071035 -0.079809 -0.144160
housing_median_age -0.108197 0.011173 1.000000 -0.361262 -0.320451 -0.296244 -0.302916 -0.119034 0.105623
total_rooms 0.044568 -0.036100 -0.361262 1.000000 0.930380 0.857126 0.918484 0.198050 0.134153
total_bedrooms 0.069608 -0.066983 -0.320451 0.930380 1.000000 0.877747 0.979728 -0.007723 0.049686
population 0.099773 -0.108785 -0.296244 0.857126 0.877747 1.000000 0.907222 0.004834 -0.024650
households 0.055310 -0.071035 -0.302916 0.918484 0.979728 0.907222 1.000000 0.013033 0.065843
median_income -0.015176 -0.079809 -0.119034 0.198050 -0.007723 0.004834 0.013033 1.000000 0.688075
median_house_value -0.045967 -0.144160 0.105623 0.134153 0.049686 -0.024650 0.065843 0.688075 1.000000
df.total_rooms.corr(df.households)
0.9184844926543085
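
A full correlation matrix is hard to scan, so it can help to sort a single column of it. The sketch below ranks the numeric columns by the magnitude of their Pearson correlation with a chosen target. `rank_by_correlation` is a hypothetical helper written for this illustration, not a pandas function, and the small frame is synthetic stand-in data rather than values from housing.csv.

```python
import numpy as np
import pandas as pd


def rank_by_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return Pearson correlations with `target`, strongest first.

    Hypothetical helper for illustration; non-numeric columns are ignored.
    """
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Sort by magnitude so strong negative relationships are not buried.
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Synthetic stand-in data (assumed values, not the housing dataset).
rng = np.random.default_rng(0)
income = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "income": income,
        "noise": rng.normal(size=200),
        "value": 3 * income + rng.normal(scale=0.5, size=200),
        "label": ["a", "b"] * 100,  # text column, skipped by the helper
    }
)

ranking = rank_by_correlation(demo, "value")
print(ranking)
```

On the housing data, the same helper could be called as `rank_by_correlation(df, "median_house_value")`.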

More than linear correlation#
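
The Pearson coefficient computed by `corr` only measures linear association, so it can understate or miss other kinds of dependence. A minimal sketch on synthetic data (assumed for illustration, not taken from the housing set) shows two such cases. The Spearman coefficient is computed here as the Pearson correlation of the ranks, which is its definition.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(-1, 1, 201))

# Symmetric, non-monotonic dependence: y is fully determined by x,
# yet the Pearson coefficient is essentially zero.
parabola = x ** 2
pearson_parabola = x.corr(parabola)

# Monotonic but non-linear dependence: Pearson falls below 1,
# while the rank-based Spearman coefficient equals 1.
curve = np.exp(5 * x)
pearson_curve = x.corr(curve)
spearman_curve = x.rank().corr(curve.rank())

print(f"parabola, Pearson:   {pearson_parabola:.3f}")
print(f"exp curve, Pearson:  {pearson_curve:.3f}")
print(f"exp curve, Spearman: {spearman_curve:.3f}")
```

Rank-based measures recover the monotonic case but not the parabola, which is one motivation for model-based relationship scores.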

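# Third-party package that scores each (feature, target) pair
from discover_feature_relationships import discover
# Work on a 500-row random sample of the data
rel = discover.discover(df.sample(500))
# Pivot the long-form result into a target-by-feature matrix;
# missing entries (such as the diagonal) are filled with 1
beyond_corr = rel.pivot(index="target", columns="feature", values="score").fillna(1)
beyond_corr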
feature households housing_median_age latitude longitude median_house_value median_income ocean_proximity population total_bedrooms total_rooms
target
households 1.000000 -0.292469 -0.407305 -0.443292 -0.454553 -0.661367 -0.038141 0.769386 0.927661 0.793106
housing_median_age -0.384953 1.000000 -0.005774 0.166421 -0.266405 -0.390467 0.109177 -0.364504 -0.340247 -0.271119
latitude -0.488429 -0.077087 1.000000 0.878377 -0.306961 -0.593725 0.385670 -0.451856 -0.478616 -0.474196
longitude -0.450075 -0.040020 0.865377 1.000000 -0.375989 -0.586797 0.329246 -0.489123 -0.473913 -0.463378
median_house_value -0.427585 -0.051834 -0.155626 0.091216 1.000000 0.099368 0.239027 -0.459812 -0.349830 -0.366259
median_income -0.416885 -0.026529 -0.284841 -0.170559 0.193204 1.000000 0.048077 -0.792870 -0.571752 -0.234037
ocean_proximity -0.579639 -0.096715 0.198061 0.292165 -0.561875 -0.341718 1.000000 -0.541385 -0.450755 -0.508544
population 0.806036 -0.298956 -0.377054 -0.361676 -0.366975 -0.635945 -0.018287 1.000000 0.737155 0.725931
total_bedrooms 0.941491 -0.179944 -0.352001 -0.401063 -0.539463 -0.674509 -0.030705 0.718642 1.000000 0.817562
total_rooms 0.803418 -0.194646 -0.355038 -0.324732 -0.332408 -0.467443 -0.014449 0.646621 0.796884 1.000000
import seaborn as sns
sns.heatmap(beyond_corr, vmin=-1, vmax=1)
[Figure: heatmap of beyond_corr; x-axis: feature, y-axis: target]

Exercise#

Additional Resources#