Visualizing the data for EDA
Visualizations are an excellent starting point for exploring data.
They offer an intuitive, easily digestible way to inspect complex datasets and to spot patterns and relationships between input features. Scatterplots, histograms, and heatmaps can reveal trends, outliers, and correlations that would otherwise surface only after a more formal statistical analysis, and they help us communicate our findings to others.
That makes visualization a natural first step in any data analysis project.
How To
import pandas as pd
import seaborn as sns
df = pd.read_csv("data/housing.csv")
df.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
sns.pairplot(df.sample(1000))
<seaborn.axisgrid.PairGrid at 0x7f827afebbe0>

sns.pairplot(
    df.sample(1000).drop(["latitude", "longitude"], axis=1),
    hue="ocean_proximity",
)
<seaborn.axisgrid.PairGrid at 0x7f82ac6e26d0>

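for cls in df.ocean_proximity.unique():
    # One density curve per ocean_proximity category, labelled for the legend
    ax = sns.kdeplot(df[df.ocean_proximity == cls].median_house_value, label=cls)
ax.legend()  # the labels only appear once the legend is drawn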

sns.jointplot(data=df, x="households", y="total_bedrooms")
<seaborn.axisgrid.JointGrid at 0x7f827389a730>

sns.jointplot(data=df, x="population", y="total_bedrooms", kind="reg")
<seaborn.axisgrid.JointGrid at 0x7f827870f070>

sns.jointplot(data=df, x="households", y="total_bedrooms", kind="reg")
<seaborn.axisgrid.JointGrid at 0x7f8273971340>

sns.heatmap(df.corr(numeric_only=True), square=True)
<AxesSubplot:>

sns.heatmap(df.corr(numeric_only=True).abs().round(1), square=True, annot=True)
<AxesSubplot:>
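A heatmap shows every pair at once, but it can also help to rank features by their correlation with a single column. The sketch below does this for `median_house_value` on a tiny made-up frame (the numbers are illustrative, not from the dataset). `select_dtypes` drops the text column before `corr()`, which should work regardless of pandas version.

```python
import pandas as pd

# Hypothetical miniature of the housing data (values are made up)
demo = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "housing_median_age": [30.0, 10.0, 50.0, 20.0, 40.0, 25.0],
    "median_house_value": [100.0, 210.0, 290.0, 405.0, 500.0, 610.0],
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY",
                        "NEAR BAY", "NEAR OCEAN", "NEAR OCEAN"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = demo.select_dtypes(include="number").corr()

# Absolute correlation with the target, strongest first
ranked = (
    corr["median_house_value"]
    .drop("median_house_value")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```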

Exercise
Explore the data further, maybe try a bar chart!
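One possible starting point, as a sketch rather than a model answer: a bar chart of the mean house value per `ocean_proximity` category. The frame below is made-up stand-in data. With the real dataset, `df` would take the place of `demo`.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the housing data (values are made up)
demo = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "INLAND",
                        "NEAR BAY", "NEAR BAY",
                        "NEAR OCEAN", "NEAR OCEAN"],
    "median_house_value": [100000.0, 120000.0, 110000.0,
                           300000.0, 340000.0,
                           250000.0, 270000.0],
})

# The numbers behind the bars: mean house value per category
means = demo.groupby("ocean_proximity")["median_house_value"].mean()

# barplot aggregates with the mean by default and draws one bar per category
ax = sns.barplot(data=demo, x="ocean_proximity", y="median_house_value")
```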