Selecting subsets#

Selecting subsets of data to analyze can give deeper insights.

When dealing with large datasets, selecting subsets of data to analyze can provide more focused and meaningful insights than analyzing the entire dataset.

This approach allows researchers to identify patterns and trends within specific subgroups and gain a deeper understanding of the data. Additionally, selecting subsets of data can help to reduce the amount of noise and irrelevant information in the analysis, making it easier to draw accurate conclusions.

Whether analyzing data for scientific research or business intelligence, selecting the correct subset of data can be a crucial step in unlocking deeper insights and driving successful outcomes.

How To#

import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
df.longitude < -122
0         True
1         True
2         True
3         True
4         True
         ...  
20635    False
20636    False
20637    False
20638    False
20639    False
Name: longitude, Length: 20640, dtype: bool
df[df.longitude < -122]
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
... ... ... ... ... ... ... ... ... ... ...
20573 -122.05 38.56 20.0 1005.0 168.0 457.0 157.0 5.6790 225000.0 INLAND
20581 -122.21 38.83 20.0 1138.0 221.0 459.0 209.0 3.1534 123400.0 INLAND
20582 -122.16 38.90 33.0 1221.0 236.0 488.0 199.0 3.7574 92700.0 INLAND
20585 -122.04 38.68 26.0 1113.0 222.0 689.0 234.0 3.0486 83600.0 INLAND
20586 -122.03 38.69 23.0 1796.0 380.0 939.0 330.0 2.7955 96300.0 INLAND

3967 rows × 10 columns

df.shape
(20640, 10)
df[df.ocean_proximity.isin(["NEAR BAY", "INLAND"]) & (df.longitude < -122)]
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
... ... ... ... ... ... ... ... ... ... ...
20573 -122.05 38.56 20.0 1005.0 168.0 457.0 157.0 5.6790 225000.0 INLAND
20581 -122.21 38.83 20.0 1138.0 221.0 459.0 209.0 3.1534 123400.0 INLAND
20582 -122.16 38.90 33.0 1221.0 236.0 488.0 199.0 3.7574 92700.0 INLAND
20585 -122.04 38.68 26.0 1113.0 222.0 689.0 234.0 3.0486 83600.0 INLAND
20586 -122.03 38.69 23.0 1796.0 380.0 939.0 330.0 2.7955 96300.0 INLAND

2718 rows × 10 columns

df_subset = df[df.ocean_proximity.isin(["NEAR BAY", "INLAND"]) | (df.longitude < -122)]
df_subset.ocean_proximity.unique()
array(['NEAR BAY', 'INLAND', 'NEAR OCEAN', '<1H OCEAN'], dtype=object)

.loc#

df.loc[:, ["longitude", "latitude"]]
longitude latitude
0 -122.23 37.88
1 -122.22 37.86
2 -122.24 37.85
3 -122.25 37.85
4 -122.25 37.85
... ... ...
20635 -121.09 39.48
20636 -121.21 39.49
20637 -121.22 39.43
20638 -121.32 39.43
20639 -121.24 39.37

20640 rows × 2 columns

df.loc[5:500, ["longitude", "latitude"]]
longitude latitude
5 -122.25 37.85
6 -122.25 37.84
7 -122.25 37.84
8 -122.26 37.84
9 -122.25 37.84
... ... ...
496 -122.26 37.85
497 -122.27 37.85
498 -122.27 37.85
499 -122.27 37.85
500 -122.27 37.85

496 rows × 2 columns

df
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
... ... ... ... ... ... ... ... ... ... ...
20635 -121.09 39.48 25.0 1665.0 374.0 845.0 330.0 1.5603 78100.0 INLAND
20636 -121.21 39.49 18.0 697.0 150.0 356.0 114.0 2.5568 77100.0 INLAND
20637 -121.22 39.43 17.0 2254.0 485.0 1007.0 433.0 1.7000 92300.0 INLAND
20638 -121.32 39.43 18.0 1860.0 409.0 741.0 349.0 1.8672 84700.0 INLAND
20639 -121.24 39.37 16.0 2785.0 616.0 1387.0 530.0 2.3886 89400.0 INLAND

20640 rows × 10 columns

Indexing#

df = df.set_index("latitude")
df.index
Index([37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37.84, 37.84, 37.84,
       ...
       39.29, 39.33, 39.26, 39.19, 39.27, 39.48, 39.49, 39.43, 39.43, 39.37],
      dtype='float64', name='latitude', length=20640)
df.loc[37.85, :].head()
longitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
latitude
37.85 -122.24 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
37.85 -122.25 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
37.85 -122.25 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
37.85 -122.25 52.0 919.0 213.0 413.0 193.0 4.0368 269700.0 NEAR BAY
37.85 -122.26 52.0 2202.0 434.0 910.0 402.0 3.2031 281500.0 NEAR BAY
df[["longitude", "population"]].reset_index()
latitude longitude population
0 37.88 -122.23 322.0
1 37.86 -122.22 2401.0
2 37.85 -122.24 496.0
3 37.85 -122.25 558.0
4 37.85 -122.25 565.0
... ... ... ...
20635 39.48 -121.09 845.0
20636 39.49 -121.21 356.0
20637 39.43 -121.22 1007.0
20638 39.43 -121.32 741.0
20639 39.37 -121.24 1387.0

20640 rows × 3 columns

Index Slicing#

df.iloc[5:500, 3:8]
total_bedrooms population households median_income median_house_value
latitude
37.85 213.0 413.0 193.0 4.0368 269700.0
37.84 489.0 1094.0 514.0 3.6591 299200.0
37.84 687.0 1157.0 647.0 3.1200 241400.0
37.84 665.0 1206.0 595.0 2.0804 226700.0
37.84 707.0 1551.0 714.0 3.6912 261100.0
... ... ... ... ... ...
37.86 663.0 1316.0 590.0 5.3794 376900.0
37.85 768.0 1508.0 755.0 3.2619 309600.0
37.85 920.0 1800.0 815.0 2.7054 182300.0
37.85 400.0 719.0 326.0 2.2431 172700.0
37.85 300.0 675.0 255.0 1.9028 150800.0

495 rows × 5 columns

Selecting Columns and Indices#

df[["longitude", "population"]]
longitude population
latitude
37.88 -122.23 322.0
37.86 -122.22 2401.0
37.85 -122.24 496.0
37.85 -122.25 558.0
37.85 -122.25 565.0
... ... ...
39.48 -121.09 845.0
39.49 -121.21 356.0
39.43 -121.22 1007.0
39.43 -121.32 741.0
39.37 -121.24 1387.0

20640 rows × 2 columns

df.drop(["longitude", "population"], axis=1)
housing_median_age total_rooms total_bedrooms households median_income median_house_value ocean_proximity
latitude
37.88 41.0 880.0 129.0 126.0 8.3252 452600.0 NEAR BAY
37.86 21.0 7099.0 1106.0 1138.0 8.3014 358500.0 NEAR BAY
37.85 52.0 1467.0 190.0 177.0 7.2574 352100.0 NEAR BAY
37.85 52.0 1274.0 235.0 219.0 5.6431 341300.0 NEAR BAY
37.85 52.0 1627.0 280.0 259.0 3.8462 342200.0 NEAR BAY
... ... ... ... ... ... ... ...
39.48 25.0 1665.0 374.0 330.0 1.5603 78100.0 INLAND
39.49 18.0 697.0 150.0 114.0 2.5568 77100.0 INLAND
39.43 17.0 2254.0 485.0 433.0 1.7000 92300.0 INLAND
39.43 18.0 1860.0 409.0 349.0 1.8672 84700.0 INLAND
39.37 16.0 2785.0 616.0 530.0 2.3886 89400.0 INLAND

20640 rows × 7 columns

df.drop([37.85], axis=0)
longitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
latitude
37.88 -122.23 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
37.86 -122.22 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
37.84 -122.25 52.0 2535.0 489.0 1094.0 514.0 3.6591 299200.0 NEAR BAY
37.84 -122.25 52.0 3104.0 687.0 1157.0 647.0 3.1200 241400.0 NEAR BAY
37.84 -122.26 42.0 2555.0 665.0 1206.0 595.0 2.0804 226700.0 NEAR BAY
... ... ... ... ... ... ... ... ... ...
39.48 -121.09 25.0 1665.0 374.0 845.0 330.0 1.5603 78100.0 INLAND
39.49 -121.21 18.0 697.0 150.0 356.0 114.0 2.5568 77100.0 INLAND
39.43 -121.22 17.0 2254.0 485.0 1007.0 433.0 1.7000 92300.0 INLAND
39.43 -121.32 18.0 1860.0 409.0 741.0 349.0 1.8672 84700.0 INLAND
39.37 -121.24 16.0 2785.0 616.0 1387.0 530.0 2.3886 89400.0 INLAND

20595 rows × 9 columns

Why?#

df.describe()
longitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000
df.loc[df["longitude"] < -122].describe()
longitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 3967.000000 3967.000000 3967.000000 3927.000000 3967.000000 3967.000000 3967.000000 3967.000000
mean -122.408072 33.831107 2461.945551 497.507003 1203.207714 468.308546 4.103604 244230.881775
std 0.423863 13.270992 1710.886064 341.984939 842.230630 325.648578 2.048455 127889.888450
min -124.350000 2.000000 8.000000 1.000000 8.000000 1.000000 0.499900 14999.000000
25% -122.470000 23.000000 1440.000000 287.000000 699.500000 270.000000 2.708300 140900.000000
50% -122.300000 34.000000 2104.000000 417.000000 1019.000000 395.000000 3.729200 226500.000000
75% -122.160000 46.000000 3010.000000 612.500000 1479.000000 575.500000 5.001450 331500.000000
max -122.010000 52.000000 18634.000000 3226.000000 8276.000000 3589.000000 15.000100 500001.000000

Exercise#

Select data from a specific population size.

df.loc[...]
longitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
latitude
37.88 -122.23 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
37.86 -122.22 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
37.85 -122.24 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
37.85 -122.25 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
37.85 -122.25 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
... ... ... ... ... ... ... ... ... ...
39.48 -121.09 25.0 1665.0 374.0 845.0 330.0 1.5603 78100.0 INLAND
39.49 -121.21 18.0 697.0 150.0 356.0 114.0 2.5568 77100.0 INLAND
39.43 -121.22 17.0 2254.0 485.0 1007.0 433.0 1.7000 92300.0 INLAND
39.43 -121.32 18.0 1860.0 409.0 741.0 349.0 1.8672 84700.0 INLAND
39.37 -121.24 16.0 2785.0 616.0 1387.0 530.0 2.3886 89400.0 INLAND

20640 rows × 9 columns

Additional Resources#