Selecting subsets#
Selecting subsets of data to analyze can give deeper insights.
When dealing with large datasets, selecting subsets of data to analyze can provide more focused and meaningful insights than analyzing the entire dataset.
This approach allows researchers to identify patterns and trends within specific subgroups and gain a deeper understanding of the data. Additionally, selecting subsets of data can help to reduce the amount of noise and irrelevant information in the analysis, making it easier to draw accurate conclusions.
Whether analyzing data for scientific research or business intelligence, selecting the correct subset of data can be a crucial step in unlocking deeper insights and driving successful outcomes.
How To#
import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
df.longitude < -122
0 True
1 True
2 True
3 True
4 True
...
20635 False
20636 False
20637 False
20638 False
20639 False
Name: longitude, Length: 20640, dtype: bool
df[df.longitude < -122]
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20573 | -122.05 | 38.56 | 20.0 | 1005.0 | 168.0 | 457.0 | 157.0 | 5.6790 | 225000.0 | INLAND |
20581 | -122.21 | 38.83 | 20.0 | 1138.0 | 221.0 | 459.0 | 209.0 | 3.1534 | 123400.0 | INLAND |
20582 | -122.16 | 38.90 | 33.0 | 1221.0 | 236.0 | 488.0 | 199.0 | 3.7574 | 92700.0 | INLAND |
20585 | -122.04 | 38.68 | 26.0 | 1113.0 | 222.0 | 689.0 | 234.0 | 3.0486 | 83600.0 | INLAND |
20586 | -122.03 | 38.69 | 23.0 | 1796.0 | 380.0 | 939.0 | 330.0 | 2.7955 | 96300.0 | INLAND |
3967 rows × 10 columns
df.shape
(20640, 10)
df[df.ocean_proximity.isin(["NEAR BAY", "INLAND"]) & (df.longitude < -122)]
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20573 | -122.05 | 38.56 | 20.0 | 1005.0 | 168.0 | 457.0 | 157.0 | 5.6790 | 225000.0 | INLAND |
20581 | -122.21 | 38.83 | 20.0 | 1138.0 | 221.0 | 459.0 | 209.0 | 3.1534 | 123400.0 | INLAND |
20582 | -122.16 | 38.90 | 33.0 | 1221.0 | 236.0 | 488.0 | 199.0 | 3.7574 | 92700.0 | INLAND |
20585 | -122.04 | 38.68 | 26.0 | 1113.0 | 222.0 | 689.0 | 234.0 | 3.0486 | 83600.0 | INLAND |
20586 | -122.03 | 38.69 | 23.0 | 1796.0 | 380.0 | 939.0 | 330.0 | 2.7955 | 96300.0 | INLAND |
2718 rows × 10 columns
df_subset = df[df.ocean_proximity.isin(["NEAR BAY", "INLAND"]) | (df.longitude < -122)]
df_subset.ocean_proximity.unique()
array(['NEAR BAY', 'INLAND', 'NEAR OCEAN', '<1H OCEAN'], dtype=object)
.loc#
df.loc[:, ["longitude", "latitude"]]
longitude | latitude | |
---|---|---|
0 | -122.23 | 37.88 |
1 | -122.22 | 37.86 |
2 | -122.24 | 37.85 |
3 | -122.25 | 37.85 |
4 | -122.25 | 37.85 |
... | ... | ... |
20635 | -121.09 | 39.48 |
20636 | -121.21 | 39.49 |
20637 | -121.22 | 39.43 |
20638 | -121.32 | 39.43 |
20639 | -121.24 | 39.37 |
20640 rows × 2 columns
df.loc[5:500, ["longitude", "latitude"]]
longitude | latitude | |
---|---|---|
5 | -122.25 | 37.85 |
6 | -122.25 | 37.84 |
7 | -122.25 | 37.84 |
8 | -122.26 | 37.84 |
9 | -122.25 | 37.84 |
... | ... | ... |
496 | -122.26 | 37.85 |
497 | -122.27 | 37.85 |
498 | -122.27 | 37.85 |
499 | -122.27 | 37.85 |
500 | -122.27 | 37.85 |
496 rows × 2 columns
df
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |
20640 rows × 10 columns
Indexing#
df = df.set_index("latitude")
df.index
Index([37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37.84, 37.84, 37.84,
...
39.29, 39.33, 39.26, 39.19, 39.27, 39.48, 39.49, 39.43, 39.43, 39.37],
dtype='float64', name='latitude', length=20640)
df.loc[37.85, :].head()
longitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|
latitude | |||||||||
37.85 | -122.24 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
37.85 | -122.25 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
37.85 | -122.25 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
37.85 | -122.25 | 52.0 | 919.0 | 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 | NEAR BAY |
37.85 | -122.26 | 52.0 | 2202.0 | 434.0 | 910.0 | 402.0 | 3.2031 | 281500.0 | NEAR BAY |
df[["longitude", "population"]].reset_index()
latitude | longitude | population | |
---|---|---|---|
0 | 37.88 | -122.23 | 322.0 |
1 | 37.86 | -122.22 | 2401.0 |
2 | 37.85 | -122.24 | 496.0 |
3 | 37.85 | -122.25 | 558.0 |
4 | 37.85 | -122.25 | 565.0 |
... | ... | ... | ... |
20635 | 39.48 | -121.09 | 845.0 |
20636 | 39.49 | -121.21 | 356.0 |
20637 | 39.43 | -121.22 | 1007.0 |
20638 | 39.43 | -121.32 | 741.0 |
20639 | 39.37 | -121.24 | 1387.0 |
20640 rows × 3 columns
Index Slicing#
df.iloc[5:500, 3:8]
total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|
latitude | |||||
37.85 | 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 |
37.84 | 489.0 | 1094.0 | 514.0 | 3.6591 | 299200.0 |
37.84 | 687.0 | 1157.0 | 647.0 | 3.1200 | 241400.0 |
37.84 | 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 |
37.84 | 707.0 | 1551.0 | 714.0 | 3.6912 | 261100.0 |
... | ... | ... | ... | ... | ... |
37.86 | 663.0 | 1316.0 | 590.0 | 5.3794 | 376900.0 |
37.85 | 768.0 | 1508.0 | 755.0 | 3.2619 | 309600.0 |
37.85 | 920.0 | 1800.0 | 815.0 | 2.7054 | 182300.0 |
37.85 | 400.0 | 719.0 | 326.0 | 2.2431 | 172700.0 |
37.85 | 300.0 | 675.0 | 255.0 | 1.9028 | 150800.0 |
495 rows × 5 columns
Selecting Columns and Indices#
df[["longitude", "population"]]
longitude | population | |
---|---|---|
latitude | ||
37.88 | -122.23 | 322.0 |
37.86 | -122.22 | 2401.0 |
37.85 | -122.24 | 496.0 |
37.85 | -122.25 | 558.0 |
37.85 | -122.25 | 565.0 |
... | ... | ... |
39.48 | -121.09 | 845.0 |
39.49 | -121.21 | 356.0 |
39.43 | -121.22 | 1007.0 |
39.43 | -121.32 | 741.0 |
39.37 | -121.24 | 1387.0 |
20640 rows × 2 columns
df.drop(["longitude", "population"], axis=1)
housing_median_age | total_rooms | total_bedrooms | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|
latitude | |||||||
37.88 | 41.0 | 880.0 | 129.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
37.86 | 21.0 | 7099.0 | 1106.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
37.85 | 52.0 | 1467.0 | 190.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
37.85 | 52.0 | 1274.0 | 235.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
37.85 | 52.0 | 1627.0 | 280.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
... | ... | ... | ... | ... | ... | ... | ... |
39.48 | 25.0 | 1665.0 | 374.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
39.49 | 18.0 | 697.0 | 150.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
39.43 | 17.0 | 2254.0 | 485.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
39.43 | 18.0 | 1860.0 | 409.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
39.37 | 16.0 | 2785.0 | 616.0 | 530.0 | 2.3886 | 89400.0 | INLAND |
20640 rows × 7 columns
df.drop([37.85], axis=0)
longitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|
latitude | |||||||||
37.88 | -122.23 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
37.86 | -122.22 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
37.84 | -122.25 | 52.0 | 2535.0 | 489.0 | 1094.0 | 514.0 | 3.6591 | 299200.0 | NEAR BAY |
37.84 | -122.25 | 52.0 | 3104.0 | 687.0 | 1157.0 | 647.0 | 3.1200 | 241400.0 | NEAR BAY |
37.84 | -122.26 | 42.0 | 2555.0 | 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 | NEAR BAY |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
39.48 | -121.09 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
39.49 | -121.21 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
39.43 | -121.22 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
39.43 | -121.32 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
39.37 | -121.24 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |
20595 rows × 9 columns
Why?#
df.describe()
longitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|
count | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
mean | -119.569704 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
std | 2.003532 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
min | -124.350000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
25% | -121.800000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
50% | -118.490000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
75% | -118.010000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
max | -114.310000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
df.loc[df["longitude"] < -122].describe()
longitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|
count | 3967.000000 | 3967.000000 | 3967.000000 | 3927.000000 | 3967.000000 | 3967.000000 | 3967.000000 | 3967.000000 |
mean | -122.408072 | 33.831107 | 2461.945551 | 497.507003 | 1203.207714 | 468.308546 | 4.103604 | 244230.881775 |
std | 0.423863 | 13.270992 | 1710.886064 | 341.984939 | 842.230630 | 325.648578 | 2.048455 | 127889.888450 |
min | -124.350000 | 2.000000 | 8.000000 | 1.000000 | 8.000000 | 1.000000 | 0.499900 | 14999.000000 |
25% | -122.470000 | 23.000000 | 1440.000000 | 287.000000 | 699.500000 | 270.000000 | 2.708300 | 140900.000000 |
50% | -122.300000 | 34.000000 | 2104.000000 | 417.000000 | 1019.000000 | 395.000000 | 3.729200 | 226500.000000 |
75% | -122.160000 | 46.000000 | 3010.000000 | 612.500000 | 1479.000000 | 575.500000 | 5.001450 | 331500.000000 |
max | -122.010000 | 52.000000 | 18634.000000 | 3226.000000 | 8276.000000 | 3589.000000 | 15.000100 | 500001.000000 |
Exercise#
Select data from a specific population size.
df.loc[...]
longitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|
latitude | |||||||||
37.88 | -122.23 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
37.86 | -122.22 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
37.85 | -122.24 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
37.85 | -122.25 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
37.85 | -122.25 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
39.48 | -121.09 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
39.49 | -121.21 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
39.43 | -121.22 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
39.43 | -121.32 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
39.37 | -121.24 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |
20640 rows × 9 columns