Clustering for deeper data insights

Clustering exploits the inherent structure of data to find relationships and group memberships in an unsupervised way, that is, without labelled examples. It can be used in data mining to generate additional insights.

Clustering is a popular data analysis technique that groups similar data points together based on the patterns they share.

It enables researchers and analysts to uncover hidden relationships within data sets and better understand complex systems. As organizations and individuals generate ever more data, clustering has become a valuable tool for data mining and exploration. Clustering algorithms can identify distinct subgroups within large data sets and give deeper insight into areas such as customer behaviour, market trends, and scientific observations.

In this article, we apply three popular clustering algorithms from scikit-learn (K-Means, DBSCAN, and spectral clustering) to the coordinates in a housing data set and compare the groups they find.

How To#

import pandas as pd

# Load the housing data; the path is relative to the notebook.
df = pd.read_csv("data/housing.csv")
df.head()

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY

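from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Ask K-Means for three clusters, using only the coordinates.
km_cl = KMeans(n_clusters=3)

# fit_predict fits the model and returns one cluster label per row.
labels = km_cl.fit_predict(df[["longitude", "latitude"]])

# Colour each point by its cluster label.
plt.scatter(df.longitude, df.latitude, c=labels)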

../_images/5e62f17ad19fe84c83c87d88d47b4a63129fdb7f94db847ca9ecbf776fb7d8e8.png
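To make the K-Means step less of a black box, here is a minimal pure-Python sketch of the assign-and-update loop (Lloyd's algorithm) that K-Means is built on. It is an illustration only: the function name `kmeans`, its parameters, and the toy coordinates are made up for this example, and scikit-learn's `KMeans` adds smarter initialisation and convergence checks.

```python
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Naive Lloyd's algorithm on a list of (x, y) tuples (illustrative only)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centre.
        labels = [
            min(
                range(k),
                key=lambda j: (x - centers[j][0]) ** 2 + (y - centers[j][1]) ** 2,
            )
            for x, y in points
        ]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster ends up empty
                centers[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centers


# Two made-up groups of coordinates, far apart from each other.
toy_points = [(-122.2, 37.8), (-122.3, 37.9), (-122.1, 37.7),
              (-117.1, 32.7), (-117.2, 32.8), (-117.0, 32.6)]
toy_labels, toy_centers = kmeans(toy_points, k=2)
print(toy_labels)
print(toy_centers)
```

The label numbers themselves are arbitrary; what matters is which points share a label.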

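# Drop every row between -121 and -118 degrees longitude,
# which leaves two groups of points separated by a gap.
split_data = df[~df.longitude.between(-121, -118)]
plt.scatter(split_data.longitude, split_data.latitude)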

../_images/3cc9b3843c07bdce27ea235d98d54d8b28588e4ffc613e10e247d02e96a9481f.png

# Re-run K-Means on the split data, this time asking for five clusters.
km_cl = KMeans(n_clusters=5)
labels = km_cl.fit_predict(split_data[["longitude", "latitude"]])
plt.scatter(split_data.longitude, split_data.latitude, c=labels)

../_images/6a183de7f3ef93f4a2de13eb1927f1a609469e271e221ed48ee539bbb4167809.png

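from sklearn.cluster import DBSCAN

# DBSCAN groups points by density and needs no cluster count.
# The defaults (eps=0.5, min_samples=5) are used here.
db = DBSCAN()

# Points that belong to no dense region get the label -1 (noise).
labels = db.fit_predict(split_data[["longitude", "latitude"]])
plt.scatter(split_data.longitude, split_data.latitude, c=labels)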

../_images/30eea76015e20bf8b8362841de874768ae90292c3172fac7bfe91e4178c09b1f.png
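Unlike K-Means, DBSCAN decides the number of clusters itself and marks points it cannot place as noise with the label -1, so it helps to summarise the labels before reading the plot. The helper below is a small sketch of that; `summarise_labels` and the example labels are hypothetical and stand in for the array returned by `fit_predict`.

```python
from collections import Counter


def summarise_labels(labels):
    """Count clusters and noise points in a DBSCAN-style label sequence."""
    counts = Counter(int(lab) for lab in labels)
    n_noise = counts.pop(-1, 0)  # DBSCAN marks noise points with -1
    return {"n_clusters": len(counts), "n_noise": n_noise, "sizes": dict(counts)}


# Hypothetical labels standing in for the output of db.fit_predict(...).
example_labels = [0, 0, 1, -1, 1, 1, -1, 0]
summary = summarise_labels(example_labels)
print(summary)
```

If the summary reports a single large cluster, the default `eps` is probably too coarse for the scale of the data and is worth tuning.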

from sklearn.cluster import SpectralClustering

sp = SpectralClustering(n_clusters=4)

# Spectral clustering scales poorly with the number of points,
# so work on a random sample of 1,000 rows.
split_data = split_data.sample(1000)

labels = sp.fit_predict(split_data[["longitude", "latitude"]])
plt.scatter(split_data.longitude, split_data.latitude, c=labels)

../_images/746e772a673426baac1ff9a531a112bdbffef216eaa4c2cecb4c1eb93c91978e.png
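One plausible reason for sampling 1,000 rows first: with its default `affinity="rbf"`, `SpectralClustering` works on a dense pairwise affinity matrix, so memory grows with the square of the number of points. The back-of-the-envelope sketch below assumes 8-byte float64 entries; the helper name and the 20,000-point figure are illustrative and not taken from the data.

```python
def dense_affinity_bytes(n_points, bytes_per_value=8):
    """Memory needed for a dense n x n matrix of float64 values."""
    return n_points * n_points * bytes_per_value


# Compare the sampled size with a hypothetical larger data set.
for n in (1_000, 20_000):
    mib = dense_affinity_bytes(n) / 2**20
    print(f"{n:>6} points -> {mib:,.0f} MiB")
```

Twenty times as many points needs four hundred times the memory, before any eigenvector computation starts.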

Exercise#

Try different clustering algorithms. Venture out and explore HDBSCAN.
