{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Learning goals\n",
    "\n",
    "Understand the following:\n",
    "\n",
    "- Probability distribution\n",
    "- Probability density function (PDF)\n",
    "- Gaussian distribution\n",
    "- Histogram\n",
    "- Mean\n",
    "- Standard deviation\n",
    "- Variance\n",
    "- Mode\n",
    "\n",
    "We will discuss these as we work through this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import scipy.stats as stats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_size = 30  # number of fish measured\n",
    "\n",
    "# parameters of the distribution the fish lengths are drawn from\n",
    "true_mean = 10.0\n",
    "true_variance = 4.0\n",
    "true_std = np.sqrt(true_variance)\n",
    "\n",
    "# fixed seed so that the same sample is drawn on every run\n",
    "rng = np.random.RandomState(123)\n",
    "fish_sample = rng.normal(loc=true_mean, scale=true_std, size=sample_size)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fish_sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plt.hist returns a tuple: (counts per bin, bin edges, drawn patches)\n",
    "x = plt.hist(fish_sample)\n",
    "plt.xlabel(\"fish length (cm)\")\n",
    "plt.ylabel(\"counts\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is the mean of the sample of fish lengths?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_mean = np.mean(fish_sample)\n",
    "print(sample_mean)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is the [mode](https://en.wikipedia.org/wiki/Mode_(statistics)) of the sample of fish lengths? (\"[modus](https://de.wikipedia.org/wiki/Modus_(Statistik))\" in German.)"
   ]
  },
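  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because fish length is a continuous measurement, every value in the sample is typically unique, so the mode of such a sample is usually estimated from binned data. The cell below is a minimal sketch of one way to do that, assuming 10 equal-width bins (the default of both `plt.hist` and `np.histogram`). It relies on `np` and `fish_sample` from the cells above; the names `counts`, `bin_edges`, `fullest_bin` and `mode_estimate` are introduced here for illustration only. The result depends on the choice of bins."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count how many fish fall into each of 10 equal-width bins\n",
    "counts, bin_edges = np.histogram(fish_sample, bins=10)\n",
    "\n",
    "# Index of the fullest bin (np.argmax returns the first one if there is a tie)\n",
    "fullest_bin = np.argmax(counts)\n",
    "\n",
    "# Use the center of the fullest bin as a rough estimate of the mode\n",
    "mode_estimate = 0.5 * (bin_edges[fullest_bin] + bin_edges[fullest_bin + 1])\n",
    "print(mode_estimate)"
   ]
  },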
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is the sample variance of fish lengths?\n",
    "\n",
    "We put the result in the variable `sample_variance`. We use the `ddof=1` keyword argument to `np.var()` because we are estimating the variance from a sample: with `ddof=1` the sum of squared deviations is divided by $n-1$ instead of $n$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_variance = np.var(fish_sample, ddof=1)\n",
    "print(sample_variance)"
   ]
  },
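  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what `ddof=1` does, the sample variance can also be written out by hand:\n",
    "\n",
    "$$s^2 = \\frac{1}{n-1} \\sum_{i=1}^{n} (x_i - \\bar{x})^2$$\n",
    "\n",
    "The cell below is a small sketch that evaluates this formula directly and prints it next to the NumPy result; the two numbers in each row should agree. It relies on `np` and `fish_sample` from the cells above; the names `n` and `manual_variance` are introduced here for illustration only. The square root of the sample variance is the sample standard deviation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n = len(fish_sample)\n",
    "\n",
    "# Sum of squared deviations from the sample mean, divided by n - 1\n",
    "manual_variance = np.sum((fish_sample - np.mean(fish_sample))**2) / (n - 1)\n",
    "\n",
    "# Variance: hand-written formula vs. np.var with ddof=1\n",
    "print(manual_variance, np.var(fish_sample, ddof=1))\n",
    "\n",
    "# Standard deviation: square root of the variance vs. np.std with ddof=1\n",
    "print(np.sqrt(manual_variance), np.std(fish_sample, ddof=1))"
   ]
  },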
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Probability Density Functions (PDFs)\n",
    "\n",
    "- for working with continuous variables (vs. probability mass functions for discrete variables)\n",
    "- here, the area under the curve over an interval gives the probability that the variable falls in that interval (in contrast to probability mass functions, where we have a probability for every single value)\n",
    "- the area under the whole curve is 1. In other words, some outcome must happen.\n",
    "\n",
    "## Normal Distribution (Gaussian Distribution)\n",
    "\n",
    "- unimodal and symmetric\n",
    "- many algorithms in machine learning & statistics have normality assumptions\n",
    "- two parameters: mean $\\mu$ (center of the peak) and standard deviation $\\sigma$ (spread); written $N(\\mu, \\sigma^2)$, i.e. in terms of the variance $\\sigma^2$\n",
    "- we can estimate the parameters $\\mu$ and $\\sigma^2$ by the sample mean ($\\bar{x}$) and sample variance ($s^2$)\n",
    "- PDF of the univariate normal distribution $N(\\mu, \\sigma^2)$:\n",
    "\n",
    "$$f(x \\mid \\mu, \\sigma^2) = \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\; \\exp\\bigg(-\\frac{(x-\\mu)^2}{2\\sigma^2}\\bigg)$$\n",
    "\n",
    "- standard normal distribution with zero mean and unit variance, $N(0, 1)$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def univariate_gaussian_pdf(x, mean, variance):\n",
    "    \"\"\"Density of N(mean, variance) evaluated at x.\"\"\"\n",
    "    return (1. / np.sqrt(2*np.pi*variance) *\n",
    "            np.exp(- ((x - mean)**2 / (2.*variance))))"
   ]
  },
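  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The properties listed above can be checked numerically. The cell below is a minimal sketch, not part of the original exercise: it compares `univariate_gaussian_pdf` with `scipy.stats.norm.pdf`, approximates the area under the whole curve with a simple Riemann sum, and uses `scipy.stats.norm.cdf` to get the probability of an interval. It relies on `np`, `stats` and `univariate_gaussian_pdf` from the cells above; the names `x_grid`, `dx` and `pdf_values` are introduced here for illustration only. Note that SciPy parametrizes the normal distribution by the standard deviation (`scale`), while our function takes the variance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Grid from -8 to 8 with spacing 0.01; the density is negligible outside this range\n",
    "x_grid = np.linspace(-8.0, 8.0, 1601)\n",
    "dx = x_grid[1] - x_grid[0]\n",
    "pdf_values = univariate_gaussian_pdf(x_grid, mean=0.0, variance=1.0)\n",
    "\n",
    "# Our function should match SciPy's implementation (scale is the standard deviation)\n",
    "print(np.allclose(pdf_values, stats.norm.pdf(x_grid, loc=0.0, scale=1.0)))\n",
    "\n",
    "# Riemann-sum approximation of the area under the whole curve (should be close to 1)\n",
    "print(np.sum(pdf_values) * dx)\n",
    "\n",
    "# Probability of falling within one standard deviation of the mean,\n",
    "# i.e. the area under the curve between -1 and 1\n",
    "print(stats.norm.cdf(1.0) - stats.norm.cdf(-1.0))"
   ]
  },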
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean = 0\n",
    "stddev = 1\n",
    "x = np.arange(-5, 5, 0.01)\n",
    "y = univariate_gaussian_pdf(x, mean, stddev**2)\n",
    "plt.plot(x, y)\n",
    "plt.xlabel('x')\n",
    "plt.ylabel('Probability Density Function (PDF)')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean = 10\n",
    "stddev = 2.0\n",
    "x = np.arange(5, 15, 0.01)\n",
    "y = univariate_gaussian_pdf(x, mean, stddev**2)\n",
    "plt.plot(x, y)\n",
    "plt.xlabel('fish length (cm)')\n",
    "plt.ylabel('Probability Density Function (PDF)')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# density=True rescales the bars so that their total area is 1\n",
    "plt.hist(fish_sample, density=True, label='samples')\n",
    "\n",
    "x = np.linspace(5.0, 15.0, 100)\n",
    "y = univariate_gaussian_pdf(x, true_mean, true_variance)\n",
    "plt.plot(x, y, lw=5, label='theoretical (mean=%.1f, variance=%.1f)' % (true_mean, true_variance))\n",
    "\n",
    "sample_mean = np.mean(fish_sample)\n",
    "# ddof=1 gives the unbiased sample variance (divide by n-1). One explanation why:\n",
    "# https://mortada.net/computing-sample-variance-why-divide-by-n-1.html\n",
    "sample_variance = np.var(fish_sample, ddof=1)\n",
    "y2 = univariate_gaussian_pdf(x, sample_mean, sample_variance)\n",
    "plt.plot(x, y2, label='empirical (mean=%.1f, variance=%.1f)' % (sample_mean, sample_variance))\n",
    "\n",
    "plt.xlabel(\"length (cm)\")\n",
    "plt.ylabel(\"probability density\");\n",
    "plt.legend();\n",
    "# plt.savefig('fish-gaussian.png');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Discussion points\n",
    "\n",
    "Here are a few questions you should be able to answer:\n",
    "\n",
    " - what is plotted on the x axis?\n",
    " - what is plotted on the y axis?\n",
    " - what is the area under the curve from -∞ to ∞?\n",
    " - according to our Gaussian distribution, what is the most likely fish length?\n",
    " - according to our histogram, what is the most likely fish length?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}