{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Advanced Strategies (Schemas)\n", "\n", "Validating that data falls within expected ranges is an advanced strategy for automating data science processes.\n", "\n", "One of the critical challenges in automating data science workflows is ensuring that the data being used is accurate and valid. Automated steps run without anyone looking at the data, so bad values can pass through unnoticed.\n", "\n", "A schema addresses this by setting predetermined types, limits, or allowed values for specific columns and verifying that the data falls within them. If a value falls outside these parameters, validation fails instead of letting the bad data continue through the workflow.\n", "\n", "This makes automated data science workflows more reliable: problems surface at the validation step rather than in downstream results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How To" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import pandera as pa" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>longitude</th><th>latitude</th><th>housing_median_age</th><th>total_rooms</th><th>total_bedrooms</th><th>population</th><th>households</th><th>median_income</th><th>median_house_value</th><th>ocean_proximity</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>-122.23</td><td>37.88</td><td>41.0</td><td>880.0</td><td>129.0</td><td>322.0</td><td>126.0</td><td>8.3252</td><td>452600.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>1</th><td>-122.22</td><td>37.86</td><td>21.0</td><td>7099.0</td><td>1106.0</td><td>2401.0</td><td>1138.0</td><td>8.3014</td><td>358500.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>2</th><td>-122.24</td><td>37.85</td><td>52.0</td><td>1467.0</td><td>190.0</td><td>496.0</td><td>177.0</td><td>7.2574</td><td>352100.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>3</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1274.0</td><td>235.0</td><td>558.0</td><td>219.0</td><td>5.6431</td><td>341300.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>4</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1627.0</td><td>280.0</td><td>565.0</td><td>259.0</td><td>3.8462</td><td>342200.0</td><td>NEAR BAY</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value ocean_proximity \n", "0 322.0 126.0 8.3252 452600.0 NEAR BAY \n", "1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n", "2 496.0 177.0 7.2574 352100.0 NEAR BAY \n", "3 558.0 219.0 5.6431 341300.0 NEAR BAY \n", "4 565.0 259.0 3.8462 342200.0 NEAR BAY " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"data/housing.csv\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>longitude</th><th>latitude</th><th>housing_median_age</th><th>total_rooms</th><th>total_bedrooms</th><th>population</th><th>households</th><th>median_income</th><th>median_house_value</th><th>ocean_proximity</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>-122.23</td><td>37.88</td><td>41.0</td><td>880.0</td><td>129.0</td><td>322.0</td><td>126.0</td><td>8.3252</td><td>452600.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>1</th><td>-122.22</td><td>37.86</td><td>21.0</td><td>7099.0</td><td>1106.0</td><td>2401.0</td><td>1138.0</td><td>8.3014</td><td>358500.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>2</th><td>-122.24</td><td>37.85</td><td>52.0</td><td>1467.0</td><td>190.0</td><td>496.0</td><td>177.0</td><td>7.2574</td><td>352100.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>3</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1274.0</td><td>235.0</td><td>558.0</td><td>219.0</td><td>5.6431</td><td>341300.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>4</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1627.0</td><td>280.0</td><td>565.0</td><td>259.0</td><td>3.8462</td><td>342200.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>20635</th><td>-121.09</td><td>39.48</td><td>25.0</td><td>1665.0</td><td>374.0</td><td>845.0</td><td>330.0</td><td>1.5603</td><td>78100.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20636</th><td>-121.21</td><td>39.49</td><td>18.0</td><td>697.0</td><td>150.0</td><td>356.0</td><td>114.0</td><td>2.5568</td><td>77100.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20637</th><td>-121.22</td><td>39.43</td><td>17.0</td><td>2254.0</td><td>485.0</td><td>1007.0</td><td>433.0</td><td>1.7000</td><td>92300.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20638</th><td>-121.32</td><td>39.43</td><td>18.0</td><td>1860.0</td><td>409.0</td><td>741.0</td><td>349.0</td><td>1.8672</td><td>84700.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20639</th><td>-121.24</td><td>39.37</td><td>16.0</td><td>2785.0</td><td>616.0</td><td>1387.0</td><td>530.0</td><td>2.3886</td><td>89400.0</td><td>INLAND</td></tr>\n",
"  </tbody>\n", "</table>\n", "<p>20640 rows × 10 columns</p>\n", "</div>
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "... ... ... ... ... ... \n", "20635 -121.09 39.48 25.0 1665.0 374.0 \n", "20636 -121.21 39.49 18.0 697.0 150.0 \n", "20637 -121.22 39.43 17.0 2254.0 485.0 \n", "20638 -121.32 39.43 18.0 1860.0 409.0 \n", "20639 -121.24 39.37 16.0 2785.0 616.0 \n", "\n", " population households median_income median_house_value \\\n", "0 322.0 126.0 8.3252 452600.0 \n", "1 2401.0 1138.0 8.3014 358500.0 \n", "2 496.0 177.0 7.2574 352100.0 \n", "3 558.0 219.0 5.6431 341300.0 \n", "4 565.0 259.0 3.8462 342200.0 \n", "... ... ... ... ... \n", "20635 845.0 330.0 1.5603 78100.0 \n", "20636 356.0 114.0 2.5568 77100.0 \n", "20637 1007.0 433.0 1.7000 92300.0 \n", "20638 741.0 349.0 1.8672 84700.0 \n", "20639 1387.0 530.0 2.3886 89400.0 \n", "\n", " ocean_proximity \n", "0 NEAR BAY \n", "1 NEAR BAY \n", "2 NEAR BAY \n", "3 NEAR BAY \n", "4 NEAR BAY \n", "... ... \n", "20635 INLAND \n", "20636 INLAND \n", "20637 INLAND \n", "20638 INLAND \n", "20639 INLAND \n", "\n", "[20640 rows x 10 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "schema = pa.DataFrameSchema({\"ocean_proximity\": pa.Column(pa.String)})\n", "schema.validate(df)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>longitude</th><th>latitude</th><th>housing_median_age</th><th>total_rooms</th><th>total_bedrooms</th><th>population</th><th>households</th><th>median_income</th><th>median_house_value</th><th>ocean_proximity</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>-122.23</td><td>37.88</td><td>41.0</td><td>880.0</td><td>129.0</td><td>322.0</td><td>126.0</td><td>8.3252</td><td>452600.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>1</th><td>-122.22</td><td>37.86</td><td>21.0</td><td>7099.0</td><td>1106.0</td><td>2401.0</td><td>1138.0</td><td>8.3014</td><td>358500.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>2</th><td>-122.24</td><td>37.85</td><td>52.0</td><td>1467.0</td><td>190.0</td><td>496.0</td><td>177.0</td><td>7.2574</td><td>352100.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>3</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1274.0</td><td>235.0</td><td>558.0</td><td>219.0</td><td>5.6431</td><td>341300.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>4</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1627.0</td><td>280.0</td><td>565.0</td><td>259.0</td><td>3.8462</td><td>342200.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>20635</th><td>-121.09</td><td>39.48</td><td>25.0</td><td>1665.0</td><td>374.0</td><td>845.0</td><td>330.0</td><td>1.5603</td><td>78100.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20636</th><td>-121.21</td><td>39.49</td><td>18.0</td><td>697.0</td><td>150.0</td><td>356.0</td><td>114.0</td><td>2.5568</td><td>77100.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20637</th><td>-121.22</td><td>39.43</td><td>17.0</td><td>2254.0</td><td>485.0</td><td>1007.0</td><td>433.0</td><td>1.7000</td><td>92300.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20638</th><td>-121.32</td><td>39.43</td><td>18.0</td><td>1860.0</td><td>409.0</td><td>741.0</td><td>349.0</td><td>1.8672</td><td>84700.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20639</th><td>-121.24</td><td>39.37</td><td>16.0</td><td>2785.0</td><td>616.0</td><td>1387.0</td><td>530.0</td><td>2.3886</td><td>89400.0</td><td>INLAND</td></tr>\n",
"  </tbody>\n", "</table>\n", "<p>20640 rows × 10 columns</p>\n", "</div>
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "... ... ... ... ... ... \n", "20635 -121.09 39.48 25.0 1665.0 374.0 \n", "20636 -121.21 39.49 18.0 697.0 150.0 \n", "20637 -121.22 39.43 17.0 2254.0 485.0 \n", "20638 -121.32 39.43 18.0 1860.0 409.0 \n", "20639 -121.24 39.37 16.0 2785.0 616.0 \n", "\n", " population households median_income median_house_value \\\n", "0 322.0 126.0 8.3252 452600.0 \n", "1 2401.0 1138.0 8.3014 358500.0 \n", "2 496.0 177.0 7.2574 352100.0 \n", "3 558.0 219.0 5.6431 341300.0 \n", "4 565.0 259.0 3.8462 342200.0 \n", "... ... ... ... ... \n", "20635 845.0 330.0 1.5603 78100.0 \n", "20636 356.0 114.0 2.5568 77100.0 \n", "20637 1007.0 433.0 1.7000 92300.0 \n", "20638 741.0 349.0 1.8672 84700.0 \n", "20639 1387.0 530.0 2.3886 89400.0 \n", "\n", " ocean_proximity \n", "0 NEAR BAY \n", "1 NEAR BAY \n", "2 NEAR BAY \n", "3 NEAR BAY \n", "4 NEAR BAY \n", "... ... \n", "20635 INLAND \n", "20636 INLAND \n", "20637 INLAND \n", "20638 INLAND \n", "20639 INLAND \n", "\n", "[20640 rows x 10 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "schema = pa.DataFrameSchema({\"ocean_proximity\": pa.Column(pa.String,\n", " pa.Check.isin(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND']))})\n", "schema.validate(df)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],\n", " dtype=object)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.ocean_proximity.unique()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>longitude</th><th>latitude</th><th>housing_median_age</th><th>total_rooms</th><th>total_bedrooms</th><th>population</th><th>households</th><th>median_income</th><th>median_house_value</th><th>ocean_proximity</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>-122.23</td><td>37.88</td><td>41.0</td><td>880</td><td>129.0</td><td>322.0</td><td>126.0</td><td>8.3252</td><td>452600.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>1</th><td>-122.22</td><td>37.86</td><td>21.0</td><td>7099</td><td>1106.0</td><td>2401.0</td><td>1138.0</td><td>8.3014</td><td>358500.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>2</th><td>-122.24</td><td>37.85</td><td>52.0</td><td>1467</td><td>190.0</td><td>496.0</td><td>177.0</td><td>7.2574</td><td>352100.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>3</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1274</td><td>235.0</td><td>558.0</td><td>219.0</td><td>5.6431</td><td>341300.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>4</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1627</td><td>280.0</td><td>565.0</td><td>259.0</td><td>3.8462</td><td>342200.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>20635</th><td>-121.09</td><td>39.48</td><td>25.0</td><td>1665</td><td>374.0</td><td>845.0</td><td>330.0</td><td>1.5603</td><td>78100.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20636</th><td>-121.21</td><td>39.49</td><td>18.0</td><td>697</td><td>150.0</td><td>356.0</td><td>114.0</td><td>2.5568</td><td>77100.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20637</th><td>-121.22</td><td>39.43</td><td>17.0</td><td>2254</td><td>485.0</td><td>1007.0</td><td>433.0</td><td>1.7000</td><td>92300.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20638</th><td>-121.32</td><td>39.43</td><td>18.0</td><td>1860</td><td>409.0</td><td>741.0</td><td>349.0</td><td>1.8672</td><td>84700.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20639</th><td>-121.24</td><td>39.37</td><td>16.0</td><td>2785</td><td>616.0</td><td>1387.0</td><td>530.0</td><td>2.3886</td><td>89400.0</td><td>INLAND</td></tr>\n",
"  </tbody>\n", "</table>\n", "<p>20640 rows × 10 columns</p>\n", "</div>
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880 129.0 \n", "1 -122.22 37.86 21.0 7099 1106.0 \n", "2 -122.24 37.85 52.0 1467 190.0 \n", "3 -122.25 37.85 52.0 1274 235.0 \n", "4 -122.25 37.85 52.0 1627 280.0 \n", "... ... ... ... ... ... \n", "20635 -121.09 39.48 25.0 1665 374.0 \n", "20636 -121.21 39.49 18.0 697 150.0 \n", "20637 -121.22 39.43 17.0 2254 485.0 \n", "20638 -121.32 39.43 18.0 1860 409.0 \n", "20639 -121.24 39.37 16.0 2785 616.0 \n", "\n", " population households median_income median_house_value \\\n", "0 322.0 126.0 8.3252 452600.0 \n", "1 2401.0 1138.0 8.3014 358500.0 \n", "2 496.0 177.0 7.2574 352100.0 \n", "3 558.0 219.0 5.6431 341300.0 \n", "4 565.0 259.0 3.8462 342200.0 \n", "... ... ... ... ... \n", "20635 845.0 330.0 1.5603 78100.0 \n", "20636 356.0 114.0 2.5568 77100.0 \n", "20637 1007.0 433.0 1.7000 92300.0 \n", "20638 741.0 349.0 1.8672 84700.0 \n", "20639 1387.0 530.0 2.3886 89400.0 \n", "\n", " ocean_proximity \n", "0 NEAR BAY \n", "1 NEAR BAY \n", "2 NEAR BAY \n", "3 NEAR BAY \n", "4 NEAR BAY \n", "... ... \n", "20635 INLAND \n", "20636 INLAND \n", "20637 INLAND \n", "20638 INLAND \n", "20639 INLAND \n", "\n", "[20640 rows x 10 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "df = pd.read_csv(\"data/housing.csv\", dtype={\"total_rooms\": np.int64})\n", "\n", "schema = pa.DataFrameSchema({\"ocean_proximity\": pa.Column(pa.String,\n", " pa.Check.isin(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'])),\n", " \"total_rooms\": pa.Column(pa.Int)})\n", "schema.validate(df)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>longitude</th><th>latitude</th><th>housing_median_age</th><th>total_rooms</th><th>total_bedrooms</th><th>population</th><th>households</th><th>median_income</th><th>median_house_value</th><th>ocean_proximity</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>-122.23</td><td>37.88</td><td>41.0</td><td>880</td><td>129.0</td><td>322.0</td><td>126.0</td><td>8.3252</td><td>452600.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>1</th><td>-122.22</td><td>37.86</td><td>21.0</td><td>7099</td><td>1106.0</td><td>2401.0</td><td>1138.0</td><td>8.3014</td><td>358500.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>2</th><td>-122.24</td><td>37.85</td><td>52.0</td><td>1467</td><td>190.0</td><td>496.0</td><td>177.0</td><td>7.2574</td><td>352100.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>3</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1274</td><td>235.0</td><td>558.0</td><td>219.0</td><td>5.6431</td><td>341300.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>4</th><td>-122.25</td><td>37.85</td><td>52.0</td><td>1627</td><td>280.0</td><td>565.0</td><td>259.0</td><td>3.8462</td><td>342200.0</td><td>NEAR BAY</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>20635</th><td>-121.09</td><td>39.48</td><td>25.0</td><td>1665</td><td>374.0</td><td>845.0</td><td>330.0</td><td>1.5603</td><td>78100.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20636</th><td>-121.21</td><td>39.49</td><td>18.0</td><td>697</td><td>150.0</td><td>356.0</td><td>114.0</td><td>2.5568</td><td>77100.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20637</th><td>-121.22</td><td>39.43</td><td>17.0</td><td>2254</td><td>485.0</td><td>1007.0</td><td>433.0</td><td>1.7000</td><td>92300.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20638</th><td>-121.32</td><td>39.43</td><td>18.0</td><td>1860</td><td>409.0</td><td>741.0</td><td>349.0</td><td>1.8672</td><td>84700.0</td><td>INLAND</td></tr>\n",
"    <tr><th>20639</th><td>-121.24</td><td>39.37</td><td>16.0</td><td>2785</td><td>616.0</td><td>1387.0</td><td>530.0</td><td>2.3886</td><td>89400.0</td><td>INLAND</td></tr>\n",
"  </tbody>\n", "</table>\n", "<p>20640 rows × 10 columns</p>\n", "</div>
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880 129.0 \n", "1 -122.22 37.86 21.0 7099 1106.0 \n", "2 -122.24 37.85 52.0 1467 190.0 \n", "3 -122.25 37.85 52.0 1274 235.0 \n", "4 -122.25 37.85 52.0 1627 280.0 \n", "... ... ... ... ... ... \n", "20635 -121.09 39.48 25.0 1665 374.0 \n", "20636 -121.21 39.49 18.0 697 150.0 \n", "20637 -121.22 39.43 17.0 2254 485.0 \n", "20638 -121.32 39.43 18.0 1860 409.0 \n", "20639 -121.24 39.37 16.0 2785 616.0 \n", "\n", " population households median_income median_house_value \\\n", "0 322.0 126.0 8.3252 452600.0 \n", "1 2401.0 1138.0 8.3014 358500.0 \n", "2 496.0 177.0 7.2574 352100.0 \n", "3 558.0 219.0 5.6431 341300.0 \n", "4 565.0 259.0 3.8462 342200.0 \n", "... ... ... ... ... \n", "20635 845.0 330.0 1.5603 78100.0 \n", "20636 356.0 114.0 2.5568 77100.0 \n", "20637 1007.0 433.0 1.7000 92300.0 \n", "20638 741.0 349.0 1.8672 84700.0 \n", "20639 1387.0 530.0 2.3886 89400.0 \n", "\n", " ocean_proximity \n", "0 NEAR BAY \n", "1 NEAR BAY \n", "2 NEAR BAY \n", "3 NEAR BAY \n", "4 NEAR BAY \n", "... ... \n", "20635 INLAND \n", "20636 INLAND \n", "20637 INLAND \n", "20638 INLAND \n", "20639 INLAND \n", "\n", "[20640 rows x 10 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "schema = pa.DataFrameSchema({\"ocean_proximity\": pa.Column(pa.String,\n", " pa.Check.isin(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'])),\n", " \"total_rooms\": pa.Column(pa.Int),\n", " \"housing_median_age\": pa.Column(pa.Float, pa.Check(lambda n: n**2 > 0))})\n", "schema.validate(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simple Example Why:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>percentages</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>0.100</td></tr>\n",
"    <tr><th>1</th><td>0.300</td></tr>\n",
"    <tr><th>2</th><td>0.253</td></tr>\n",
"    <tr><th>3</th><td>0.041</td></tr>\n",
"    <tr><th>4</th><td>0.210</td></tr>\n",
"    <tr><th>5</th><td>0.990</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " percentages\n", "0 0.100\n", "1 0.300\n", "2 0.253\n", "3 0.041\n", "4 0.210\n", "5 0.990" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_simple = pd.DataFrame({\"percentages\": [0.1, 0.3, 25.3, 4.1, 0.21, 99]})\n", "# Values above 1 appear to be on a 0-100 scale, so rescale them to fractions.\n", "# Use .loc to update the DataFrame directly instead of chained assignment.\n", "df_simple.loc[df_simple.percentages > 1, \"percentages\"] /= 100\n", "# After rescaling, every value should be at most 1; the schema enforces that.\n", "schema = pa.DataFrameSchema({\"percentages\": pa.Column(pa.Float,\n", "    pa.Check.less_than_or_equal_to(1))})\n", "schema.validate(df_simple)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise\n", "\n", "Explore custom validations and loading data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = pa.DataFrameSchema(...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional Resources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Pandera Schema Validation](https://pandera.readthedocs.io/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }