Dealing with huge datasets#

Some data is too large for our small laptops. Some data is even too large for our terabyte servers. Being smart about how we load data helps us build better data science pipelines.

As data science and machine learning have become increasingly popular, the datasets used for analysis have grown dramatically. Sometimes the data is so large that it cannot be loaded into memory on a single machine, even on a terabyte server.

In such cases, loading data can become a bottleneck, hindering data analysis and decision-making. To overcome this challenge, data scientists must be innovative about loading data and building efficient data pipelines. Various techniques and tools are available to process and analyze large datasets, including parallel computing, distributed systems, cloud computing, and data streaming.

This notebook discusses the challenges of loading large datasets and explores some best practices for building efficient data science pipelines that can handle big data. We will also look at popular tools and techniques for processing and analyzing large datasets.
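As a first taste of the streaming idea mentioned above, pandas can iterate over a large CSV in fixed-size chunks instead of loading it whole. A minimal sketch (the chunk size of 10,000 rows and the row count are just for illustration):

import pandas as pd

# Stream the file in chunks of 10,000 rows; only one chunk is in memory at a time.
total_rows = 0
for chunk in pd.read_csv("data/housing.csv", chunksize=10_000):
    total_rows += len(chunk)
total_rows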

How To#

import pandas as pd

random_state = 42

# Load the full dataset with pandas' default dtypes
df = pd.read_csv("data/housing.csv")
df.head(5)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
# deep=True also counts the memory used by the Python strings in object columns
df.memory_usage(deep=True)
Index                     128
longitude              165120
latitude               165120
housing_median_age     165120
total_rooms            165120
total_bedrooms         165120
population             165120
households             165120
median_income          165120
median_house_value     165120
ocean_proximity       1342940
dtype: int64
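The ocean_proximity column dwarfs the numeric ones because an object column stores a full Python string for every row. The per-column numbers can be summed to get the total footprint in bytes:

df.memory_usage(deep=True).sum()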
df.dtypes
longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
ocean_proximity        object
dtype: object
df["ocean_proximity"] = df["ocean_proximity"].astype("category")
df.memory_usage(deep=True)
Index                    128
longitude             165120
latitude              165120
housing_median_age    165120
total_rooms           165120
total_bedrooms        165120
population            165120
households            165120
median_income         165120
median_house_value    165120
ocean_proximity        21136
dtype: int64
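A categorical column keeps one small integer code per row plus a single lookup table of the unique strings, which is why the drop is so dramatic. The same saving can be requested up front by declaring the dtype when the file is read: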
# Apply the categorical dtype while reading the file
df_small = pd.read_csv("data/housing.csv", dtype={"ocean_proximity": "category"})
df_small.memory_usage(deep=True)
Index                    128
longitude             165120
latitude              165120
housing_median_age    165120
total_rooms           165120
total_bedrooms        165120
population            165120
households            165120
median_income         165120
median_house_value    165120
ocean_proximity        21136
dtype: int64
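The numeric columns can be shrunk too. A sketch, assuming float32 precision is acceptable for this analysis (the df_compact name is just for illustration; each float column should take roughly half the memory shown above):

float_cols = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
    "median_house_value",
]
dtypes = {col: "float32" for col in float_cols}  # halve each 8-byte float column
dtypes["ocean_proximity"] = "category"

df_compact = pd.read_csv("data/housing.csv", dtype=dtypes)
df_compact.memory_usage(deep=True)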
# Read only the columns we actually need
df_columns = pd.read_csv("data/housing.csv", usecols=["longitude", "latitude", "ocean_proximity"])
df_columns.head()
longitude latitude ocean_proximity
0 -122.23 37.88 NEAR BAY
1 -122.22 37.86 NEAR BAY
2 -122.24 37.85 NEAR BAY
3 -122.25 37.85 NEAR BAY
4 -122.25 37.85 NEAR BAY
# Take a reproducible random sample of 100 rows
df_columns.sample(100, random_state=random_state)
longitude latitude ocean_proximity
20046 -119.01 36.06 INLAND
3024 -119.46 35.14 INLAND
15663 -122.44 37.80 NEAR BAY
20484 -118.72 34.28 <1H OCEAN
9814 -121.93 36.62 NEAR OCEAN
... ... ... ...
6052 -117.76 34.04 INLAND
15975 -122.45 37.77 NEAR BAY
14331 -117.15 32.72 NEAR OCEAN
1606 -122.08 37.88 NEAR BAY
10915 -117.87 33.73 <1H OCEAN

100 rows × 3 columns
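Note that sample() only helps after the full file has already been parsed into memory. If even that is too much, one alternative is to sample while reading by passing a callable as skiprows. This sketch keeps the header plus roughly 1% of the data rows (the 1% fraction and the use of the random module are illustrative choices):

import random

random.seed(random_state)

# skiprows is called with each row index; returning True drops that row,
# so we keep row 0 (the header) and about 1% of the remaining rows.
df_sampled = pd.read_csv(
    "data/housing.csv",
    usecols=["longitude", "latitude", "ocean_proximity"],
    skiprows=lambda i: i > 0 and random.random() > 0.01,
)
df_sampled.shape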

Exercise#

Check out the Dask playground for lazy dataframes.
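If you want a taste before opening the playground, a minimal sketch of the lazy pattern with Dask (assuming dask is installed) looks like this:

import dask.dataframe as dd

# Lazy: this builds a task graph over partitions; it does not load the full dataset.
ddf = dd.read_csv("data/housing.csv")

# The actual reading and computation happen here, one partition at a time.
ddf["median_house_value"].mean().compute()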

Additional Resources#