Finding and understanding relationships in data

Finding and understanding relationships in data#

In today’s data-driven world, we are constantly bombarded with vast amounts of information. However, this raw data is often meaningless without the ability to identify relationships and patterns within it.

Finding and understanding relationships in data is crucial for making informed decisions, developing predictive models, and discovering new insights. Through the use of statistical techniques and machine learning algorithms, we can uncover hidden connections and dependencies between variables, enabling us to make accurate predictions and improve our understanding of complex systems.

Whether in business, science, or everyday life, the ability to analyze and interpret data is becoming increasingly important, and finding relationships within it is an essential skill for success.

How To#

import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY

df.corr()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 1
----> 1 df.corr()

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:10054, in DataFrame.corr(self, method, min_periods, numeric_only)
  10052 cols = data.columns
  10053 idx = cols.copy()
> 10054 mat = data.to_numpy(dtype=float, na_value=np.nan, copy=False)
  10056 if method == "pearson":
  10057     correl = libalgos.nancorr(mat, minp=min_periods)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/frame.py:1838, in DataFrame.to_numpy(self, dtype, copy, na_value)
   1836 if dtype is not None:
   1837     dtype = np.dtype(dtype)
-> 1838 result = self._mgr.as_array(dtype=dtype, copy=copy, na_value=na_value)
   1839 if result.dtype is not dtype:
   1840     result = np.array(result, dtype=dtype, copy=False)

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/internals/managers.py:1732, in BlockManager.as_array(self, dtype, copy, na_value)
   1730         arr.flags.writeable = False
   1731 else:
-> 1732     arr = self._interleave(dtype=dtype, na_value=na_value)
   1733     # The underlying data was copied within _interleave, so no need
   1734     # to further copy if copy=True or setting na_value
   1736 if na_value is not lib.no_default:

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/internals/managers.py:1794, in BlockManager._interleave(self, dtype, na_value)
   1792     else:
   1793         arr = blk.get_values(dtype)
-> 1794     result[rl.indexer] = arr
   1795     itemmask[rl.indexer] = 1
   1797 if not itemmask.all():

ValueError: could not convert string to float: 'NEAR BAY'

df.total_rooms.corr(df.households)

0.9184844926543082

More than linear correlation#

from discover_feature_relationships import discover

rel = discover.discover(df.sample(500))

beyond_corr = rel.pivot(index="target", columns="feature", values="score").fillna(1)
beyond_corr

feature	households	housing_median_age	latitude	longitude	median_house_value	median_income	ocean_proximity	population	total_bedrooms	total_rooms
target
households	1.000000	0.037340	-0.537450	-0.518198	-0.901268	-0.949288	-0.004915	0.657631	0.893712	0.760877
housing_median_age	-0.314722	1.000000	0.050557	0.010672	-0.340596	-0.487119	0.066056	-0.333395	-0.454009	-0.134160
latitude	-0.336849	-0.175834	1.000000	0.886597	-0.386434	-0.400958	0.371487	-0.377302	-0.496807	-0.440968
longitude	-0.379972	-0.127651	0.857229	1.000000	-0.477218	-0.420772	0.296853	-0.356622	-0.531754	-0.491624
median_house_value	-0.489163	-0.089184	-0.201703	0.109102	1.000000	0.211742	0.227511	-0.497541	-0.571927	-0.362765
median_income	-0.471860	-0.103471	-0.293770	-0.129196	0.320258	1.000000	0.026960	-0.475921	-0.481412	-0.217507
ocean_proximity	-0.521369	-0.212248	0.210688	0.168251	-0.309688	-0.440625	1.000000	-0.511484	-0.525493	-0.474652
population	0.691419	0.070601	-0.419086	-0.452841	-0.440496	-0.772012	0.024636	1.000000	0.627696	0.542372
total_bedrooms	0.888689	0.048452	-0.554363	-0.538626	-0.782471	-0.793926	-0.003064	0.605948	1.000000	0.749222
total_rooms	0.726035	0.050555	-0.597281	-0.582823	-0.466228	-0.737790	-0.000413	0.513196	0.737515	1.000000

import seaborn as sns
sns.heatmap(beyond_corr, vmin=-1, vmax=1)

<Axes: xlabel='feature', ylabel='target'>

../_images/157f14658abefe243099615a67031f6a9495e8c28022da2724727f2a787dcd8f.png

Finding and understanding relationships in data

Contents

Finding and understanding relationships in data#

How To#

More than linear correlation#

Exercise#

Additional Resources#