{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using descriptive statistics\n",
"\n",
"Descriptive statistics summarize important aspects of our data, often revealing deeper insights.\n",
"\n",
"Statistics is a branch of mathematics concerned with collecting, organizing, analyzing, interpreting, and presenting data.\n",
"\n",
"It plays a crucial role in many fields, from business and economics to healthcare and the social sciences. Using statistical techniques, we can describe essential aspects of our data and uncover patterns and trends that may not be immediately apparent. Statistics can help us make informed decisions, identify potential problems, and evaluate the effectiveness of interventions.\n",
"\n",
"In short, summarizing our data with statistics gives us information that can guide better decisions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How To"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" longitude latitude housing_median_age total_rooms total_bedrooms \\\n",
"0 -122.23 37.88 41.0 880.0 129.0 \n",
"1 -122.22 37.86 21.0 7099.0 1106.0 \n",
"2 -122.24 37.85 52.0 1467.0 190.0 \n",
"3 -122.25 37.85 52.0 1274.0 235.0 \n",
"4 -122.25 37.85 52.0 1627.0 280.0 \n",
"\n",
" population households median_income median_house_value ocean_proximity \n",
"0 322.0 126.0 8.3252 452600.0 NEAR BAY \n",
"1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n",
"2 496.0 177.0 7.2574 352100.0 NEAR BAY \n",
"3 558.0 219.0 5.6431 341300.0 NEAR BAY \n",
"4 565.0 259.0 3.8462 342200.0 NEAR BAY "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Load the housing data and preview the first five rows\n",
"df = pd.read_csv(\"data/housing.csv\")\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" longitude latitude housing_median_age total_rooms \\\n",
"count 20640.000000 20640.000000 20640.000000 20640.000000 \n",
"mean -119.569704 35.631861 28.639486 2635.763081 \n",
"std 2.003532 2.135952 12.585558 2181.615252 \n",
"min -124.350000 32.540000 1.000000 2.000000 \n",
"25% -121.800000 33.930000 18.000000 1447.750000 \n",
"50% -118.490000 34.260000 29.000000 2127.000000 \n",
"75% -118.010000 37.710000 37.000000 3148.000000 \n",
"max -114.310000 41.950000 52.000000 39320.000000 \n",
"\n",
" total_bedrooms population households median_income \\\n",
"count 20433.000000 20640.000000 20640.000000 20640.000000 \n",
"mean 537.870553 1425.476744 499.539680 3.870671 \n",
"std 421.385070 1132.462122 382.329753 1.899822 \n",
"min 1.000000 3.000000 1.000000 0.499900 \n",
"25% 296.000000 787.000000 280.000000 2.563400 \n",
"50% 435.000000 1166.000000 409.000000 3.534800 \n",
"75% 647.000000 1725.000000 605.000000 4.743250 \n",
"max 6445.000000 35682.000000 6082.000000 15.000100 \n",
"\n",
" median_house_value \n",
"count 20640.000000 \n",
"mean 206855.816909 \n",
"std 115395.615874 \n",
"min 14999.000000 \n",
"25% 119600.000000 \n",
"50% 179700.000000 \n",
"75% 264725.000000 \n",
"max 500001.000000 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
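{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the `describe()` output above, `total_bedrooms` has a count of 20433 while every other column has 20640, because `count` only includes non-missing values. The sketch below reproduces a few rows of `describe()` by hand on a small made-up Series. The name `toy` and its values are illustrative assumptions, not taken from `housing.csv`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with one missing value (not from housing.csv)\n",
"toy = pd.Series([2.0, 4.0, 4.0, 6.0, None], name='rooms')\n",
"\n",
"summary = toy.describe()   # count, mean, std, min, quartiles, max\n",
"count = toy.count()        # counts non-missing values only\n",
"mean = toy.mean()          # missing values are skipped by default\n",
"q1, q2, q3 = toy.quantile([0.25, 0.5, 0.75])\n",
"```\n",
"\n",
"Comparing `count` against `len(toy)` is a quick way to spot columns with missing values before relying on the other statistics."
]
},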
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" longitude latitude housing_median_age total_rooms \\\n",
"ocean_proximity \n",
"<1H OCEAN -118.275 34.03 30.0 2108.0 \n",
"INLAND -120.000 36.97 23.0 2131.0 \n",
"ISLAND -118.320 33.34 52.0 1675.0 \n",
"NEAR BAY -122.250 37.79 39.0 2083.0 \n",
"NEAR OCEAN -118.260 33.79 29.0 2195.0 \n",
"\n",
" total_bedrooms population households median_income \\\n",
"ocean_proximity \n",
"<1H OCEAN 438.0 1247.0 421.0 3.87500 \n",
"INLAND 423.0 1124.0 385.0 2.98770 \n",
"ISLAND 512.0 733.0 288.0 2.73610 \n",
"NEAR BAY 423.0 1033.5 406.0 3.81865 \n",
"NEAR OCEAN 464.0 1136.5 429.0 3.64705 \n",
"\n",
" median_house_value \n",
"ocean_proximity \n",
"<1H OCEAN 214850.0 \n",
"INLAND 108500.0 \n",
"ISLAND 414700.0 \n",
"NEAR BAY 233800.0 \n",
"NEAR OCEAN 229450.0 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby(\"ocean_proximity\").median()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
" longitude latitude total_rooms median_income\n",
"max -114.310000 41.950000 39320.0 NaN\n",
"mean -119.569704 35.631861 NaN NaN\n",
"median NaN NaN 2127.0 NaN\n",
"min -124.350000 32.540000 2.0 NaN\n",
"skew NaN NaN NaN 1.646657"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Request a different set of statistics per column; combinations not requested show up as NaN\n",
"df.agg({\"longitude\": [\"min\", \"max\", \"mean\"],\n",
"        \"latitude\": [\"min\", \"max\", \"mean\"],\n",
"        \"total_rooms\": [\"min\", \"max\", \"median\"],\n",
"        \"median_income\": [\"skew\"]})"
]
},
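{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `NaN` entries in the `agg` result are not missing data: they mark statistic and column combinations that were never requested. A minimal sketch on a made-up frame (the name `toy` and its values are assumptions for illustration) shows the same behavior at a smaller scale.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data (not from housing.csv)\n",
"toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 60]})\n",
"\n",
"# Ask for min and max of 'a', but only the median of 'b'\n",
"result = toy.agg({'a': ['min', 'max'], 'b': ['median']})\n",
"\n",
"# The median of 'a' was not requested, so that cell is NaN\n",
"median_of_a_missing = pd.isna(result.loc['median', 'a'])\n",
"```\n",
"\n",
"The result has one row per requested statistic and one column per key in the dictionary."
]
},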
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<1H OCEAN 9136\n",
"INLAND 6551\n",
"NEAR OCEAN 2658\n",
"NEAR BAY 2290\n",
"ISLAND 5\n",
"Name: ocean_proximity, dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"ocean_proximity\"].value_counts()"
]
},
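{
"cell_type": "markdown",
"metadata": {},
"source": [
"Raw counts are often easier to compare as shares of the whole. The sketch below, which assumes a small made-up Series named `labels` rather than the real column, uses `value_counts(normalize=True)` to return proportions instead of counts.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical category labels (not from housing.csv)\n",
"labels = pd.Series(['INLAND', 'INLAND', 'NEAR BAY', 'ISLAND'])\n",
"\n",
"counts = labels.value_counts()                # absolute frequencies\n",
"shares = labels.value_counts(normalize=True)  # fractions that sum to 1\n",
"```\n",
"\n",
"Applied to the housing data, the equivalent call would be `df['ocean_proximity'].value_counts(normalize=True)`."
]
},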
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" longitude latitude housing_median_age total_rooms \\\n",
"longitude 1.000000 -0.879203 -0.150752 0.040120 \n",
"latitude -0.879203 1.000000 0.032440 -0.018435 \n",
"housing_median_age -0.150752 0.032440 1.000000 -0.357162 \n",
"total_rooms 0.040120 -0.018435 -0.357162 1.000000 \n",
"total_bedrooms 0.063879 -0.056636 -0.306544 0.915021 \n",
"population 0.123527 -0.123626 -0.283879 0.816185 \n",
"households 0.060020 -0.074299 -0.281989 0.906734 \n",
"median_income -0.009928 -0.088029 -0.147308 0.271321 \n",
"median_house_value -0.069667 -0.165739 0.074855 0.205952 \n",
"\n",
" total_bedrooms population households median_income \\\n",
"longitude 0.063879 0.123527 0.060020 -0.009928 \n",
"latitude -0.056636 -0.123626 -0.074299 -0.088029 \n",
"housing_median_age -0.306544 -0.283879 -0.281989 -0.147308 \n",
"total_rooms 0.915021 0.816185 0.906734 0.271321 \n",
"total_bedrooms 1.000000 0.870937 0.975627 -0.006196 \n",
"population 0.870937 1.000000 0.903872 0.006268 \n",
"households 0.975627 0.903872 1.000000 0.030305 \n",
"median_income -0.006196 0.006268 0.030305 1.000000 \n",
"median_house_value 0.086259 0.003839 0.112737 0.676778 \n",
"\n",
" median_house_value \n",
"longitude -0.069667 \n",
"latitude -0.165739 \n",
"housing_median_age 0.074855 \n",
"total_rooms 0.205952 \n",
"total_bedrooms 0.086259 \n",
"population 0.003839 \n",
"households 0.112737 \n",
"median_income 0.676778 \n",
"median_house_value 1.000000 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
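"# Spearman rank correlation on the numeric columns only (ocean_proximity is text)\n",
"df.select_dtypes(include=\"number\").corr(method=\"spearman\")"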
]
},
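{
"cell_type": "markdown",
"metadata": {},
"source": [
"Spearman correlation works on ranks, so it measures how consistently two columns rise or fall together rather than how linear the relationship is. A minimal sketch, assuming a made-up frame `toy` where `y` is the cube of `x`, contrasts it with the default Pearson method.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: y grows monotonically but not linearly with x\n",
"toy = pd.DataFrame({'x': [1, 2, 3, 4, 5]})\n",
"toy['y'] = toy['x'] ** 3\n",
"\n",
"pearson = toy.corr(method='pearson').loc['x', 'y']    # linear association\n",
"spearman = toy.corr(method='spearman').loc['x', 'y']  # rank-based association\n",
"```\n",
"\n",
"Here the ranks of `x` and `y` agree perfectly, so the Spearman coefficient reaches 1 while the Pearson coefficient stays below 1."
]
},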
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Additional Resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [pandas documentation: How to calculate summary statistics](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}