{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Learning goals\n", "\n", "Understand the following:\n", "\n", "- Probability distribution\n", "- Probability density function (PDF)\n", "- Gaussian distribution\n", "- Histogram\n", "- Mean\n", "- Standard deviation\n", "- Variance\n", "- Mode\n", "\n", "We will discuss these as we work through this example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import scipy.stats as stats" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_size = 30\n", "\n", "true_mean = 10.0\n", "true_variance = 4.0\n", "true_std = np.sqrt(true_variance)\n", "\n", "rng = np.random.RandomState(123)\n", "fish_sample = rng.normal(loc=true_mean, scale=true_std, size=sample_size)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fish_sample" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x=plt.hist(fish_sample)\n", "plt.xlabel(\"fish length (cm)\")\n", "plt.ylabel(\"counts\");" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is the mean of the sample of fish lengths?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_mean = np.mean(fish_sample)\n", "print(sample_mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is the [mode](https://en.wikipedia.org/wiki/Mode_(statistics)) of the sample of fish lengths? (\"[modus](https://de.wikipedia.org/wiki/Modus_(Statistik))\" in German.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is the sample variance of fish lengths?\n", "\n", "We put the result in the variable `sample_variance`. 
We use the `ddof=1` keyword argument to `np.var()` because we are estimating the population variance from a sample, so we divide by $n-1$ rather than $n$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_variance = np.var(fish_sample, ddof=1)\n", "print(sample_variance)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Probability Density Functions (PDFs)\n", "\n", "- for working with continuous variables (vs. probability mass functions for discrete variables)\n", "- the area under the curve over an interval gives the probability of falling in that interval (in contrast to probability mass functions, which assign a probability to every single value)\n", "- the area under the whole curve is 1. In other words, some outcome must happen.\n", "\n", "## Normal Distribution (Gaussian Distribution)\n", "\n", "- unimodal and symmetric\n", "- many algorithms in machine learning & statistics have normality assumptions\n", "- two parameters: mean $\\mu$ (center of the peak) and standard deviation $\\sigma$ (spread); written $N(\\mu, \\sigma^2)$ using the variance $\\sigma^2$\n", "- we can estimate the parameters $\\mu$ and $\\sigma^2$ by the sample mean ($\\bar{x}$) and sample variance ($s^2$)\n", "- univariate Normal distribution:\n", "\n", "$$N(\\mu, \\sigma^2) = f(x \\mid \\mu, \\sigma^2) = \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\; \\exp\\bigg(-\\frac{(x-\\mu)^2}{2\\sigma^2}\\bigg)$$\n", "\n", "- standard normal distribution with zero mean and unit variance, $N(0, 1)$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def univariate_gaussian_pdf(x, mean, variance):\n", " return (1. 
/ np.sqrt(2*np.pi*variance) * \n", " np.exp(- ((x - mean)**2 / (2.*variance))))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mean = 0\n", "stddev = 1\n", "x = np.arange(-5, 5, 0.01)\n", "y = univariate_gaussian_pdf(x, mean, stddev**2)\n", "plt.plot(x, y)\n", "plt.xlabel('x')\n", "plt.ylabel('Probability Density Function (PDF)')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mean = 10\n", "stddev = 2.0\n", "x = np.arange(5, 15, 0.01)\n", "y = univariate_gaussian_pdf(x, mean, stddev**2)\n", "plt.plot(x, y)\n", "plt.xlabel('fish length (cm)')\n", "plt.ylabel('Probability Density Function (PDF)')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.hist(fish_sample, density=True, label='samples')\n", "\n", "x = np.linspace(5.0, 15.0, 100)\n", "y = univariate_gaussian_pdf(x, true_mean, true_variance)\n", "plt.plot(x, y, lw=5, label='theoretical (mean=%.1f, variance=%.1f)'%(true_mean,true_variance))\n", "\n", "sample_mean = np.mean(fish_sample)\n", "sample_variance = np.var(fish_sample, ddof=1) # for a sample variance, always use ddof=1. 
Here is one explanation why: https://mortada.net/computing-sample-variance-why-divide-by-n-1.html\n", "y2 = univariate_gaussian_pdf(x, sample_mean, sample_variance)\n", "plt.plot(x, y2, label='empirical (mean=%.1f, variance=%.1f)'%(sample_mean,sample_variance))\n", "\n", "plt.xlabel(\"length (cm)\")\n", "plt.ylabel(\"probability density\");\n", "plt.legend();\n", "# plt.savefig('fish-gaussian.png');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discussion points\n", "\n", "Here are a few questions you should be able to answer:\n", "\n", " - what is plotted on the x axis?\n", " - what is plotted on the y axis?\n", " - what is the area under the curve from -∞ to ∞?\n", " - according to our Gaussian distribution, what is the most likely fish length?\n", " - according to our histogram, what is the most likely fish length?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" } }, "nbformat": 4, "nbformat_minor": 4 }