Advanced Strategies (Schemas)#

Validating that data falls within expected ranges is an advanced strategy for automating data science processes.

A critical challenge in automating data science workflows is ensuring that the data being used is accurate and valid. Range validation addresses this: you set predetermined limits, or a set of allowed values, for specific columns and verify that every value falls within them.

With these checks in place, an automated workflow can reject bad data before it reaches later steps, which makes its results more reliable.

How To#

import pandas as pd
import pandera as pa
df = pd.read_csv("data/housing.csv")
df.head()
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0        -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1        -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2        -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3        -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4        -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY
schema = pa.DataFrameSchema({"ocean_proximity": pa.Column(pa.String)})
schema.validate(df)
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0        -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1        -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2        -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3        -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4        -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY
...          ...       ...                 ...          ...             ...         ...         ...            ...                 ...              ...
20635    -121.09     39.48                25.0       1665.0           374.0       845.0       330.0         1.5603             78100.0           INLAND
20636    -121.21     39.49                18.0        697.0           150.0       356.0       114.0         2.5568             77100.0           INLAND
20637    -121.22     39.43                17.0       2254.0           485.0      1007.0       433.0         1.7000             92300.0           INLAND
20638    -121.32     39.43                18.0       1860.0           409.0       741.0       349.0         1.8672             84700.0           INLAND
20639    -121.24     39.37                16.0       2785.0           616.0      1387.0       530.0         2.3886             89400.0           INLAND

20640 rows × 10 columns

schema = pa.DataFrameSchema({"ocean_proximity": pa.Column(pa.String,
                                                         pa.Check.isin(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND']))})
schema.validate(df)
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0        -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1        -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2        -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3        -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4        -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY
...          ...       ...                 ...          ...             ...         ...         ...            ...                 ...              ...
20635    -121.09     39.48                25.0       1665.0           374.0       845.0       330.0         1.5603             78100.0           INLAND
20636    -121.21     39.49                18.0        697.0           150.0       356.0       114.0         2.5568             77100.0           INLAND
20637    -121.22     39.43                17.0       2254.0           485.0      1007.0       433.0         1.7000             92300.0           INLAND
20638    -121.32     39.43                18.0       1860.0           409.0       741.0       349.0         1.8672             84700.0           INLAND
20639    -121.24     39.37                16.0       2785.0           616.0      1387.0       530.0         2.3886             89400.0           INLAND

20640 rows × 10 columns

df.ocean_proximity.unique()
array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)
import numpy as np
df = pd.read_csv("data/housing.csv", dtype={"total_rooms": np.int64})

schema = pa.DataFrameSchema({"ocean_proximity": pa.Column(pa.String,
                                                         pa.Check.isin(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'])),
                            "total_rooms": pa.Column(pa.Int)})
schema.validate(df)
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0        -122.23     37.88                41.0          880           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1        -122.22     37.86                21.0         7099          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2        -122.24     37.85                52.0         1467           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3        -122.25     37.85                52.0         1274           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4        -122.25     37.85                52.0         1627           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY
...          ...       ...                 ...          ...             ...         ...         ...            ...                 ...              ...
20635    -121.09     39.48                25.0         1665           374.0       845.0       330.0         1.5603             78100.0           INLAND
20636    -121.21     39.49                18.0          697           150.0       356.0       114.0         2.5568             77100.0           INLAND
20637    -121.22     39.43                17.0         2254           485.0      1007.0       433.0         1.7000             92300.0           INLAND
20638    -121.32     39.43                18.0         1860           409.0       741.0       349.0         1.8672             84700.0           INLAND
20639    -121.24     39.37                16.0         2785           616.0      1387.0       530.0         2.3886             89400.0           INLAND

20640 rows × 10 columns

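# A custom check receives the whole column as a pandas Series and returns a boolean Series.
# Note: n**2 > 0 only rejects zeros; negative values would still pass this check.
schema = pa.DataFrameSchema({"ocean_proximity": pa.Column(pa.String,
                                                         pa.Check.isin(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'])),
                            "total_rooms": pa.Column(pa.Int),
                            "housing_median_age": pa.Column(pa.Float, pa.Check(lambda n: n**2 > 0))})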
schema.validate(df)
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0        -122.23     37.88                41.0          880           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1        -122.22     37.86                21.0         7099          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2        -122.24     37.85                52.0         1467           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3        -122.25     37.85                52.0         1274           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4        -122.25     37.85                52.0         1627           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY
...          ...       ...                 ...          ...             ...         ...         ...            ...                 ...              ...
20635    -121.09     39.48                25.0         1665           374.0       845.0       330.0         1.5603             78100.0           INLAND
20636    -121.21     39.49                18.0          697           150.0       356.0       114.0         2.5568             77100.0           INLAND
20637    -121.22     39.43                17.0         2254           485.0      1007.0       433.0         1.7000             92300.0           INLAND
20638    -121.32     39.43                18.0         1860           409.0       741.0       349.0         1.8672             84700.0           INLAND
20639    -121.24     39.37                16.0         2785           616.0      1387.0       530.0         2.3886             89400.0           INLAND

20640 rows × 10 columns

Simple Example: Why Validate?#

df_simple = pd.DataFrame({"percentages": [0.1, 0.3, 25.3, 4.1, 0.21, 99]})
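# Rescale values entered as 0-100 percentages; .loc avoids chained assignment,
# which may silently fail to modify df_simple.
df_simple.loc[df_simple.percentages > 1, "percentages"] /= 100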
schema = pa.DataFrameSchema({"percentages": pa.Column(pa.Float,
                                                     pa.Check.less_than_or_equal_to(1))})
schema.validate(df_simple)
   percentages
0        0.100
1        0.300
2        0.253
3        0.041
4        0.210
5        0.990

Exercise#

Explore custom validations and loading data.

schema = pa.DataFrameSchema(...)

Additional Resources#