Introduction to distributions and random variables

Learning goals

Understand the following:

  • Probability distribution
  • Probability density function (PDF)
  • Gaussian distribution
  • Histogram
  • Mean
  • Standard deviation
  • Variance
  • Mode

We will discuss these terms as we work through the following example.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
In [ ]:
sample_size = 30

true_mean = 10.0
true_variance = 4.0
true_std = np.sqrt(true_variance)

rng = np.random.RandomState(123)
fish_sample = rng.normal(loc=true_mean, scale=true_std, size=sample_size)
In [ ]:
fish_sample
In [ ]:
x = plt.hist(fish_sample)
plt.xlabel("fish length (cm)")
plt.ylabel("counts");
In [ ]:
x
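plt.hist() returns a tuple of (counts, bin edges, patches), which is what inspecting x above shows. To get the same numbers without drawing a plot, we can use np.histogram(); a minimal sketch (the variable names are our own):

In [ ]:
# counts per bin and the bin edges, without plotting anything
counts, bin_edges = np.histogram(fish_sample)
print(counts)
print(bin_edges)  # has one more entry than counts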

What is the mean of the sample of fish lengths?

In [ ]:
sample_mean = np.mean(fish_sample)
print(sample_mean)
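As a sanity check, the mean is just the sum of all lengths divided by the number of fish, $\bar{x} = \frac{1}{n}\sum_i x_i$; a minimal sketch of computing it by hand:

In [ ]:
# manual mean: sum of all observations divided by their number
manual_mean = fish_sample.sum() / len(fish_sample)
print(manual_mean)  # should match np.mean(fish_sample)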

What is the mode of the sample of fish lengths? ("Modus" in German.)
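For continuous measurements, every value in the sample is typically unique, so the mode is usually read off a histogram as the fullest bin. A minimal sketch of that idea (bins=10 matches the plt.hist() default; this is one common convention, not the only one):

In [ ]:
# approximate the mode as the center of the fullest histogram bin
counts, bin_edges = np.histogram(fish_sample, bins=10)
i = np.argmax(counts)
mode_estimate = 0.5 * (bin_edges[i] + bin_edges[i + 1])
print(mode_estimate)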

What is the sample variance of fish lengths?

We put the result in the variable sample_variance. We use the ddof=1 keyword argument to np.var() because we are estimating the variance from a sample.

In [ ]:
sample_variance = np.var(fish_sample, ddof=1)
print(sample_variance)
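To see what ddof=1 does, here is the same quantity computed by hand: we sum the squared deviations from the sample mean and divide by $n-1$ instead of $n$ (a minimal sketch):

In [ ]:
n = len(fish_sample)
deviations = fish_sample - np.mean(fish_sample)
manual_variance = np.sum(deviations**2) / (n - 1)  # divide by n-1, not n
print(manual_variance)  # should match np.var(fish_sample, ddof=1)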

Probability Density Functions (PDFs)

  • for working with continuous variables (vs. probability mass functions for discrete variables)
  • here, the area under the curve gives the probability (in contrast to probability mass functions where we have probabilities for every single value)
  • the area under the whole curve is 1. In other words, some outcome must happen (we check this numerically below).
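As a quick numerical check of the last point, we can integrate a normal PDF over the whole real line using scipy (imported above as stats); the mean of 10 and standard deviation of 2 here are just for illustration:

In [ ]:
from scipy.integrate import quad

# integrate the PDF of N(10, 2) from -inf to +inf
area, abs_error = quad(stats.norm(loc=10, scale=2).pdf, -np.inf, np.inf)
print(area)  # should be very close to 1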

Normal Distribution (Gaussian Distribution)

  • unimodal and symmetric
  • many algorithms in machine learning & statistics have normality assumptions
  • two parameters: mean (center of the peak) and standard deviation (spread); $N(\mu, \sigma)$
  • we can estimate the parameters $\mu$ and $\sigma^2$ by the sample mean ($\bar{x}$) and the sample variance ($s^2$)
  • univariate Normal distribution (PDF):
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \; \exp\bigg(-\frac{(x-\mu)^2}{2\sigma^2}\bigg)$$
  • standard normal distribution with zero mean and unit variance, $N(0, 1)$:
In [ ]:
def univariate_gaussian_pdf(x, mean, variance):
    return (1. / np.sqrt(2*np.pi*variance) * 
            np.exp(- ((x - mean)**2 / (2.*variance))))
In [ ]:
mean = 0
stddev = 1
x = np.arange(-5, 5, 0.01)
y = univariate_gaussian_pdf(x, mean, stddev**2)
plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('Probability Density Function (PDF)')
plt.show()
In [ ]:
mean = 10
stddev = 2.0
x = np.arange(5, 15, 0.01)
y = univariate_gaussian_pdf(x, mean, stddev**2)
plt.plot(x, y)
plt.xlabel('fish length (cm)')
plt.ylabel('Probability Density Function (PDF)')
plt.show()
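To see how the standard deviation acts as the "spread" parameter, here is a sketch overlaying a few normal PDFs with the same mean but different standard deviations (the particular values are arbitrary); larger $\sigma$ gives a wider, flatter curve:

In [ ]:
x = np.arange(0, 20, 0.01)
for stddev in [0.5, 1.0, 2.0, 4.0]:
    y = univariate_gaussian_pdf(x, 10.0, stddev**2)
    plt.plot(x, y, label='stddev = %.1f' % stddev)
plt.xlabel('x')
plt.ylabel('Probability Density Function (PDF)')
plt.legend()
plt.show()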
In [ ]:
plt.hist(fish_sample, density=True, label='samples')

x = np.linspace(5.0, 15.0, 100)
y = univariate_gaussian_pdf(x, true_mean, true_variance)
plt.plot(x, y, lw=5, label='theoretical (mean=%.1f, variance=%.1f)'%(true_mean,true_variance))

sample_mean = np.mean(fish_sample)
sample_variance = np.var(fish_sample, ddof=1) # for a sample variance, always use ddof=1. Here is one explanation why https://mortada.net/computing-sample-variance-why-divide-by-n-1.html
y2 = univariate_gaussian_pdf(x, sample_mean, sample_variance)
plt.plot(x, y2, label='empirical (mean=%.1f, variance=%.1f)'%(sample_mean,sample_variance))

plt.xlabel("length (cm)")
plt.ylabel("probability");
plt.legend();
# plt.savefig('fish-gaussian.png');
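Because areas under the PDF are probabilities, we can ask concrete questions such as: how likely is a fish between 8 cm and 12 cm under the true distribution? A minimal sketch using scipy's cumulative distribution function (the interval 8 to 12 cm is our own choice):

In [ ]:
dist = stats.norm(loc=true_mean, scale=true_std)
# P(8 < length < 12) = area under the PDF between 8 and 12
p = dist.cdf(12.0) - dist.cdf(8.0)
print(p)  # about 0.68: within one standard deviation of the mean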

Discussion points

Here are a few questions you should be able to answer:

  • what is plotted on the x axis?
  • what is plotted on the y axis?
  • what is the area under the curve from -∞ to ∞?
  • according to our Gaussian distribution, what is the most likely fish length?
  • according to our histogram, what is the most likely fish length?