Learning goals¶
Understand the following:
- Probability distribution
- Probability density function (PDF)
- Gaussian distribution
- Histogram
- Mean
- Standard deviation
- Variance
- Mode
We will discuss these as we work through this example.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
In [ ]:
sample_size = 30
true_mean = 10.0
true_variance = 4.0
true_std = np.sqrt(true_variance)
rng = np.random.RandomState(123)
fish_sample = rng.normal(loc=true_mean, scale=true_std, size=sample_size)
In [ ]:
fish_sample
In [ ]:
x = plt.hist(fish_sample)
plt.xlabel("fish length (cm)")
plt.ylabel("counts");
In [ ]:
x
What is the mean of the sample of fish lengths?¶
In [ ]:
sample_mean = np.mean(fish_sample)
print(sample_mean)
What is the sample variance of fish lengths?¶
We put the result in the variable sample_variance. We use the ddof=1 keyword argument to np.var() because we are estimating the variance from a sample.
In [ ]:
sample_variance = np.var(fish_sample, ddof=1)
print(sample_variance)
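To see what ddof=1 actually changes, here is a small sketch (using the same distribution parameters as above, but its own data) comparing the biased estimator (divide by $n$) with the unbiased one (divide by $n-1$, Bessel's correction):

```python
import numpy as np

rng = np.random.RandomState(123)
data = rng.normal(loc=10.0, scale=2.0, size=30)

n = len(data)
biased = np.var(data, ddof=0)    # divides by n
unbiased = np.var(data, ddof=1)  # divides by n - 1

# The two estimates differ exactly by the factor n / (n - 1)
print(biased, unbiased, biased * n / (n - 1))
```

For small samples the correction matters (here the factor is 30/29 ≈ 1.034); for large $n$ the two estimators converge.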
Probability Density Functions (PDFs)¶
- for working with continuous variables (vs. probability mass functions for discrete variables)
- here, the area under the curve gives the probability (in contrast to probability mass functions where we have probabilities for every single value)
- the area under the whole curve is 1. In other words, some outcome must happen.
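We can check the last point numerically: integrating a PDF over the whole real line should give 1. A quick sketch using scipy's standard normal PDF (scipy.stats is imported above):

```python
import numpy as np
from scipy import integrate, stats

# Numerically integrate the standard normal PDF over (-inf, inf);
# the total area under the curve should be (very close to) 1
area, abs_err = integrate.quad(stats.norm.pdf, -np.inf, np.inf)
print(area)
```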
Normal Distribution (Gaussian Distribution)¶
- unimodal and symmetric
- many algorithms in machine learning & statistics have normality assumptions
- two parameters: mean (center of the peak) and standard deviation (spread); $N(\mu, \sigma)$
- we can estimate the parameters $\mu$ and $\sigma^2$ by the sample mean ($\bar{x}$) and sample variance ($s^2$)
- univariate Normal distribution: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
- standard normal distribution with zero mean and unit variance, $N(0, 1)$: $p(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$
In [ ]:
def univariate_gaussian_pdf(x, mean, variance):
return (1. / np.sqrt(2*np.pi*variance) *
np.exp(- ((x - mean)**2 / (2.*variance))))
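As a sanity check, our hand-written PDF should agree with scipy's built-in implementation. Note that stats.norm.pdf takes the standard deviation (scale), not the variance:

```python
import numpy as np
import scipy.stats as stats

def univariate_gaussian_pdf(x, mean, variance):
    return (1. / np.sqrt(2*np.pi*variance) *
            np.exp(- ((x - mean)**2 / (2.*variance))))

x = np.linspace(-5, 5, 101)
ours = univariate_gaussian_pdf(x, 0.0, 1.0)
# scipy parameterizes by loc (mean) and scale (standard deviation)
theirs = stats.norm.pdf(x, loc=0.0, scale=1.0)
print(np.allclose(ours, theirs))
```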
In [ ]:
mean = 0
stddev = 1
x = np.arange(-5, 5, 0.01)
y = univariate_gaussian_pdf(x, mean, stddev**2)
plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('Probability Density Function (PDF)')
plt.show()
In [ ]:
mean = 10
stddev = 2.0
x = np.arange(5, 15, 0.01)
y = univariate_gaussian_pdf(x, mean, stddev**2)
plt.plot(x, y)
plt.xlabel('fish length (cm)')
plt.ylabel('Probability Density Function (PDF)')
plt.show()
In [ ]:
plt.hist(fish_sample, density=True, label='samples')
x = np.linspace(5.0, 15.0, 100)
y = univariate_gaussian_pdf(x, true_mean, true_variance)
plt.plot(x, y, lw=5, label='theoretical (mean=%.1f, variance=%.1f)'%(true_mean,true_variance))
sample_mean = np.mean(fish_sample)
sample_variance = np.var(fish_sample, ddof=1) # for a sample variance, always use ddof=1. Here is one explanation why https://mortada.net/computing-sample-variance-why-divide-by-n-1.html
y2 = univariate_gaussian_pdf(x, sample_mean, sample_variance)
plt.plot(x, y2, label='empirical (mean=%.1f, variance=%.1f)'%(sample_mean,sample_variance))
plt.xlabel("length (cm)")
plt.ylabel("probability density");
plt.legend();
# plt.savefig('fish-gaussian.png');
Discussion points¶
Here are a few questions you should be able to answer:
- what is plotted on the x axis?
- what is plotted on the y axis?
- what is the area under the curve from -∞ to ∞?
- according to our Gaussian distribution, what is the most likely fish length?
- according to our histogram, what is the most likely fish length?
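One way to check the last two answers numerically: for a Gaussian the mode coincides with the mean, while the histogram's mode is the center of its tallest bin. A sketch (regenerating the same sample as above):

```python
import numpy as np

rng = np.random.RandomState(123)
fish_sample = rng.normal(loc=10.0, scale=2.0, size=30)

# Mode of the histogram: the center of the bin with the highest count
counts, edges = np.histogram(fish_sample)
i = np.argmax(counts)
hist_mode = (edges[i] + edges[i + 1]) / 2
print(hist_mode)

# Mode of the fitted Gaussian: simply the sample mean
print(np.mean(fish_sample))
```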