Finding and understanding relationships in data
Raw data says little on its own. Most of its value lies in the relationships and patterns between its variables.
Identifying those relationships supports informed decisions, predictive models, and new insights. Statistical techniques and machine learning algorithms can expose connections and dependencies between variables that are not obvious from the raw values, which improves both predictions and our understanding of complex systems.
In business and science alike, analysing and interpreting data is an increasingly important skill, and finding relationships in the data is a core part of it.
How To
import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
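# ocean_proximity holds text; recent pandas versions (2.0 and later) raise an
# error on it unless non-numeric columns are excluded explicitly
df.corr(numeric_only=True)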
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
---|---|---|---|---|---|---|---|---|---|
longitude | 1.000000 | -0.924664 | -0.108197 | 0.044568 | 0.069608 | 0.099773 | 0.055310 | -0.015176 | -0.045967 |
latitude | -0.924664 | 1.000000 | 0.011173 | -0.036100 | -0.066983 | -0.108785 | -0.071035 | -0.079809 | -0.144160 |
housing_median_age | -0.108197 | 0.011173 | 1.000000 | -0.361262 | -0.320451 | -0.296244 | -0.302916 | -0.119034 | 0.105623 |
total_rooms | 0.044568 | -0.036100 | -0.361262 | 1.000000 | 0.930380 | 0.857126 | 0.918484 | 0.198050 | 0.134153 |
total_bedrooms | 0.069608 | -0.066983 | -0.320451 | 0.930380 | 1.000000 | 0.877747 | 0.979728 | -0.007723 | 0.049686 |
population | 0.099773 | -0.108785 | -0.296244 | 0.857126 | 0.877747 | 1.000000 | 0.907222 | 0.004834 | -0.024650 |
households | 0.055310 | -0.071035 | -0.302916 | 0.918484 | 0.979728 | 0.907222 | 1.000000 | 0.013033 | 0.065843 |
median_income | -0.015176 | -0.079809 | -0.119034 | 0.198050 | -0.007723 | 0.004834 | 0.013033 | 1.000000 | 0.688075 |
median_house_value | -0.045967 | -0.144160 | 0.105623 | 0.134153 | 0.049686 | -0.024650 | 0.065843 | 0.688075 | 1.000000 |
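A full correlation matrix is hard to scan once there are more than a few columns. One way to read it is to keep each pair once and sort by absolute correlation. The sketch below uses a small synthetic frame (`demo`, with made-up columns `a`, `b`, `c`) so it runs on its own; the same lines should apply to the housing correlation matrix.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a real dataset (column names are illustrative only)
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),  # almost a copy of "a"
    "c": rng.normal(size=200),                    # independent noise
})

corr = demo.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)      # lower triangle and diagonal become NaN
        .stack()           # long format: (row, column) -> correlation
        .dropna()          # drop the masked cells if stack() kept them
        .sort_values(key=abs, ascending=False)
)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations, such as longitude against latitude in the table above, near the top.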
df.total_rooms.corr(df.households)
0.9184844926543085
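By default `Series.corr` computes the Pearson coefficient: the sum of products of the centred values divided by the square root of the product of their sums of squares. A minimal check on made-up numbers (the values of `x` and `y` are illustrative, not taken from the housing data):

```python
import numpy as np
import pandas as pd

# Illustrative values only
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.1, 5.9, 8.2, 9.8])

# Centre both series around their means
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: sum of co-deviations, normalised by the size of each deviation vector
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses method="pearson" unless told otherwise
r_pandas = x.corr(y)
print(r_manual, r_pandas)
```

Because the coefficient only measures how well a straight line fits, it can miss relationships that are strong but not linear.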
More than linear correlation
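from discover_feature_relationships import discover
# Score the column pairs on a random sample of 500 rows
# (the sample is unseeded, so the scores change between runs)
rel = discover.discover(df.sample(500))
# One row per target, one column per feature; cells without a score
# (here the diagonal) are filled with 1
beyond_corr = rel.pivot(index="target", columns="feature", values="score").fillna(1)
beyond_corr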
target \ feature | households | housing_median_age | latitude | longitude | median_house_value | median_income | ocean_proximity | population | total_bedrooms | total_rooms |
---|---|---|---|---|---|---|---|---|---|---|
households | 1.000000 | -0.292469 | -0.407305 | -0.443292 | -0.454553 | -0.661367 | -0.038141 | 0.769386 | 0.927661 | 0.793106 |
housing_median_age | -0.384953 | 1.000000 | -0.005774 | 0.166421 | -0.266405 | -0.390467 | 0.109177 | -0.364504 | -0.340247 | -0.271119 |
latitude | -0.488429 | -0.077087 | 1.000000 | 0.878377 | -0.306961 | -0.593725 | 0.385670 | -0.451856 | -0.478616 | -0.474196 |
longitude | -0.450075 | -0.040020 | 0.865377 | 1.000000 | -0.375989 | -0.586797 | 0.329246 | -0.489123 | -0.473913 | -0.463378 |
median_house_value | -0.427585 | -0.051834 | -0.155626 | 0.091216 | 1.000000 | 0.099368 | 0.239027 | -0.459812 | -0.349830 | -0.366259 |
median_income | -0.416885 | -0.026529 | -0.284841 | -0.170559 | 0.193204 | 1.000000 | 0.048077 | -0.792870 | -0.571752 | -0.234037 |
ocean_proximity | -0.579639 | -0.096715 | 0.198061 | 0.292165 | -0.561875 | -0.341718 | 1.000000 | -0.541385 | -0.450755 | -0.508544 |
population | 0.806036 | -0.298956 | -0.377054 | -0.361676 | -0.366975 | -0.635945 | -0.018287 | 1.000000 | 0.737155 | 0.725931 |
total_bedrooms | 0.941491 | -0.179944 | -0.352001 | -0.401063 | -0.539463 | -0.674509 | -0.030705 | 0.718642 | 1.000000 | 0.817562 |
total_rooms | 0.803418 | -0.194646 | -0.355038 | -0.324732 | -0.332408 | -0.467443 | -0.014449 | 0.646621 | 0.796884 | 1.000000 |
import seaborn as sns
sns.heatmap(beyond_corr, vmin=-1, vmax=1)
[Figure: heatmap of beyond_corr, with feature on the x-axis and target on the y-axis, colour scale from -1 to 1]
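Unlike the correlation matrix, the score matrix above is not symmetric: how well a feature predicts a target can differ from the reverse direction. The sketch below illustrates why on synthetic data where `y` is roughly `x` squared. It does not reproduce how `discover_feature_relationships` computes its score. Instead it swaps in a much simpler stand-in, `binned_r2` (a hypothetical helper), which predicts the target by its mean within quantile bins of the feature and reports the in-sample share of variance explained.

```python
import numpy as np
import pandas as pd

def binned_r2(feature: pd.Series, target: pd.Series, bins: int = 20) -> float:
    """In-sample R^2 when the target is predicted by its mean within
    quantile bins of the feature (hypothetical helper, for illustration)."""
    groups = pd.qcut(feature, q=bins, duplicates="drop")
    predicted = target.groupby(groups, observed=True).transform("mean")
    ss_res = ((target - predicted) ** 2).sum()
    ss_tot = ((target - target.mean()) ** 2).sum()
    return float(1 - ss_res / ss_tot)

# Synthetic data: y depends strongly on x, but not along a straight line
rng = np.random.default_rng(0)
x = pd.Series(rng.uniform(-1, 1, size=2000), name="x")
y = (x ** 2 + rng.normal(scale=0.05, size=2000)).rename("y")

pearson = x.corr(y)             # linear correlation between x and y
x_predicts_y = binned_r2(x, y)  # how well x explains y
y_predicts_x = binned_r2(y, x)  # how well y explains x

print(f"pearson={pearson:.3f}  x->y={x_predicts_y:.3f}  y->x={y_predicts_x:.3f}")
```

Here `x` determines `y` almost completely, while `y` cannot tell whether `x` was positive or negative, so the two directions score very differently even though the linear correlation is close to zero.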
