Clustering for deeper data insights
Clustering groups similar data points together based on the inherent structure of the data. It works in an unsupervised way, without labels, which makes it useful for data mining and for generating additional insights.
It helps researchers and analysts uncover hidden relationships within data sets and better understand complex systems. As organizations and individuals generate more data, clustering has become a valuable tool for mining and exploring it. Clustering algorithms can identify distinct subgroups within large data sets, which gives deeper insight into areas such as customer behaviour, market trends, and scientific phenomena.
In this article, we explore what clustering can reveal about a data set and try out three popular clustering algorithms: k-means, DBSCAN, and spectral clustering.
How To
import pandas as pd
# Load the housing data and preview the first five rows
df = pd.read_csv("data/housing.csv")
df.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
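from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Cluster the districts into three groups using only their coordinates
km_cl = KMeans(n_clusters=3)
labels = km_cl.fit_predict(df[["longitude", "latitude"]])
# Plot the districts on a map-like scatter plot, coloured by cluster label
plt.scatter(df.longitude, df.latitude, c=labels)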
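To make the `KMeans` call above less of a black box, here is a minimal pure-Python sketch of Lloyd's algorithm, the classic k-means procedure that alternates an assignment step with a centroid update. The function name `kmeans`, its parameters, and the six sample coordinates are illustrative assumptions and are not part of the housing data. scikit-learn's implementation differs in its details, for example it uses k-means++ initialisation by default.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster 2-D points with Lloyd's algorithm; return (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct points drawn at random
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(
                range(k),
                key=lambda j: (x - centroids[j][0]) ** 2 + (y - centroids[j][1]) ** 2,
            )
            for x, y in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                mean_x = sum(p[0] for p in members) / len(members)
                mean_y = sum(p[1] for p in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                # An empty cluster keeps its previous centroid
                new_centroids.append(centroids[j])
        # Stop once the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical (longitude, latitude) pairs forming two well-separated groups
points = [
    (-122.2, 37.8), (-122.3, 37.9), (-122.1, 37.7),
    (-118.2, 34.0), (-118.3, 34.1), (-118.1, 33.9),
]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary: which group is called 0 and which is called 1 depends on the random initialisation, so only the grouping carries meaning.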
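# Remove the districts between longitude -121 and -118 (bounds inclusive) to open a gap in the data
split_data = df[~df.longitude.between(-121, -118)]
plt.scatter(split_data.longitude, split_data.latitude)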
# Re-run k-means with five clusters on the remaining districts
km_cl = KMeans(n_clusters=5)
labels = km_cl.fit_predict(split_data[["longitude", "latitude"]])
plt.scatter(split_data.longitude, split_data.latitude, c=labels)
from sklearn.cluster import DBSCAN
# DBSCAN with its defaults (eps=0.5, min_samples=5); the label -1 marks noise points
db = DBSCAN()
labels = db.fit_predict(split_data[["longitude", "latitude"]])
plt.scatter(split_data.longitude, split_data.latitude, c=labels)
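Unlike k-means, DBSCAN does not need the number of clusters up front: it grows clusters outward from densely populated neighbourhoods and labels isolated points as noise. The following is a rough pure-Python sketch of that idea, not scikit-learn's implementation. The function name `dbscan` and the toy coordinates are assumptions made for illustration, while the defaults `eps=0.5` and `min_samples=5` mirror those of `sklearn.cluster.DBSCAN`.

```python
from math import dist

NOISE = -1


def dbscan(points, eps=0.5, min_samples=5):
    """Label points by density; points in no cluster get the label -1."""
    n = len(points)
    # Neighbourhood of each point: every point within eps, itself included
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            # Not a core point; may still be claimed later as a border point
            labels[i] = NOISE
            continue
        # Start a new cluster from this core point and expand it
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


# Hypothetical coordinates: two dense groups of five points and one lone point
points = [
    (-122.2, 37.8), (-122.3, 37.8), (-122.2, 37.9), (-122.1, 37.8), (-122.2, 37.7),
    (-118.2, 34.0), (-118.3, 34.0), (-118.2, 34.1), (-118.1, 34.0), (-118.2, 33.9),
    (-120.0, 36.0),
]
labels = dbscan(points)
print(labels)  # the lone point at the end is labelled -1 (noise)
```

Because `eps` is measured in the units of the input, the default of 0.5 here means half a degree of longitude or latitude. This sketch compares all pairs of points, which is fine for a toy example but far too slow for the full data set.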
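from sklearn.cluster import SpectralClustering
sp = SpectralClustering(n_clusters=4)
# Work on a random sample of 1,000 districts (spectral clustering builds an n-by-n affinity matrix)
# Note that this overwrites split_data with the sample
split_data = split_data.sample(1000)
labels = sp.fit_predict(split_data[["longitude", "latitude"]])
plt.scatter(split_data.longitude, split_data.latitude, c=labels)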
Exercise
Try different clustering algorithms. Venture out and explore HDBSCAN.