Dealing with huge datasets#
Some data is too large for our small laptop. Some data is even too large for our terabyte servers. Being smart about loading data helps us build better data science pipelines.
As data science and machine learning have grown in popularity, so has the size of the datasets we analyze. Sometimes the data is so large that it cannot be loaded into memory on a single machine at all.
In such cases, loading data becomes a bottleneck that slows down analysis and decision-making. To overcome this, we have to be deliberate about how we load data and how we build our data pipelines. A range of techniques and tools can help, including parallel computing, distributed systems, cloud computing, and data streaming.
This notebook discusses the challenges of loading large datasets and walks through some practical techniques for keeping memory usage in check with pandas, before pointing to tools built for data that does not fit in memory at all.
How To#
import pandas as pd

random_state = 42  # fixed seed so the sample drawn later is reproducible

df = pd.read_csv("data/housing.csv")
df.head(5)
|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
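One quick, optional check is how large the file is on disk, since the in-memory footprint of the parsed DataFrame can look quite different:

import os

os.path.getsize("data/housing.csv") / 1e6  # size of the raw CSV on disk, in MB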
df.memory_usage(deep=True)
Index 128
longitude 165120
latitude 165120
housing_median_age 165120
total_rooms 165120
total_bedrooms 165120
population 165120
households 165120
median_income 165120
median_house_value 165120
ocean_proximity 1342940
dtype: int64
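The numbers above are bytes per column; summing them gives the total footprint. Passing `deep=True` matters here because it also counts the Python strings behind the `ocean_proximity` object column, not just the pointers to them:

df.memory_usage(deep=True).sum() / 1e6  # total DataFrame size in MB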
df.dtypes
longitude float64
latitude float64
housing_median_age float64
total_rooms float64
total_bedrooms float64
population float64
households float64
median_income float64
median_house_value float64
ocean_proximity object
dtype: object
df["ocean_proximity"] = df["ocean_proximity"].astype("category")
df.memory_usage(deep=True)
Index 128
longitude 165120
latitude 165120
housing_median_age 165120
total_rooms 165120
total_bedrooms 165120
population 165120
households 165120
median_income 165120
median_house_value 165120
ocean_proximity 21136
dtype: int64
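The categorical dtype pays off because `ocean_proximity` contains only a handful of distinct labels: pandas stores each label once and keeps a small integer code per row. You can peek at both pieces:

df["ocean_proximity"].cat.categories     # the distinct labels, stored once
df["ocean_proximity"].cat.codes.head()   # per-row integer codes that point at those labels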
# The same conversion can be requested when reading the file,
# so the frame is already small once it is loaded
df_small = pd.read_csv("data/housing.csv", dtype={"ocean_proximity": "category"})
df_small.memory_usage(deep=True)
Index 128
longitude 165120
latitude 165120
housing_median_age 165120
total_rooms 165120
total_bedrooms 165120
population 165120
households 165120
median_income 165120
median_house_value 165120
ocean_proximity 21136
dtype: int64
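The `dtype` argument works for more than categories. If float32 precision is acceptable for a column (an assumption you should verify for your own data), numeric columns can be downcast at read time as well, roughly halving their memory:

df_smaller = pd.read_csv(
    "data/housing.csv",
    dtype={
        "ocean_proximity": "category",
        "housing_median_age": "float32",
        "median_income": "float32",
        "median_house_value": "float32",
    },
)
df_smaller.memory_usage(deep=True)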
# Only read the columns that are actually needed for the analysis
df_columns = pd.read_csv("data/housing.csv", usecols=["longitude", "latitude", "ocean_proximity"])
df_columns.head()
|  | longitude | latitude | ocean_proximity |
| --- | --- | --- | --- |
| 0 | -122.23 | 37.88 | NEAR BAY |
| 1 | -122.22 | 37.86 | NEAR BAY |
| 2 | -122.24 | 37.85 | NEAR BAY |
| 3 | -122.25 | 37.85 | NEAR BAY |
| 4 | -122.25 | 37.85 | NEAR BAY |
# Prototype on a reproducible random subsample instead of the full frame
df_columns.sample(100, random_state=random_state)
|  | longitude | latitude | ocean_proximity |
| --- | --- | --- | --- |
| 20046 | -119.01 | 36.06 | INLAND |
| 3024 | -119.46 | 35.14 | INLAND |
| 15663 | -122.44 | 37.80 | NEAR BAY |
| 20484 | -118.72 | 34.28 | <1H OCEAN |
| 9814 | -121.93 | 36.62 | NEAR OCEAN |
| ... | ... | ... | ... |
| 6052 | -117.76 | 34.04 | INLAND |
| 15975 | -122.45 | 37.77 | NEAR BAY |
| 14331 | -117.15 | 32.72 | NEAR OCEAN |
| 1606 | -122.08 | 37.88 | NEAR BAY |
| 10915 | -117.87 | 33.73 | <1H OCEAN |

100 rows × 3 columns
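If even a trimmed-down frame will not fit in memory, pandas can also stream the file in chunks so that only one block of rows is loaded at a time. A minimal sketch that aggregates counts chunk by chunk:

counts = None
for chunk in pd.read_csv("data/housing.csv", chunksize=5_000):
    chunk_counts = chunk["ocean_proximity"].value_counts()
    # combine the partial counts from each chunk
    counts = chunk_counts if counts is None else counts.add(chunk_counts, fill_value=0)
counts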
Exercise#
Check out the Dask playground for lazy dataframes.
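As a starting point before opening the playground: dask.dataframe mirrors much of the pandas API but stays lazy, only reading and computing when asked. A minimal sketch, assuming Dask is installed, that repeats the chunked count from above:

import dask.dataframe as dd

ddf = dd.read_csv("data/housing.csv")             # lazy: builds a task graph, loads nothing yet
ddf["ocean_proximity"].value_counts().compute()   # work happens only when .compute() is called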