13  Statistics

Code
import numpy as np
import sympy as sp
import matplotlib.pyplot as plt
from utils.math_ml import *

13.1 Statistics

Statistics is at its root the use of probability to study data. Data can be any real-world measurement, in essentially any form. Statistics treats data as either fixed values or random variables depending on the situation. If the data has already been observed we assume it’s fixed. If not, we assume it’s random and try to model it with a probability distribution. For the purposes of this lesson we’ll assume data comes in the form of some 1D or 2D array and that each data point takes on a numerical value, either an integer or real number.

Let’s start with a simple 1D array of \(m\) samples \(\mathbf{x} = (x_0,x_1,\cdots,x_{m-1})\). We’d like to know what univariate distribution \(p(x)\) would generate the samples \(x_0,x_1,\cdots,x_{m-1}\). If we can get this probability distribution even approximately, we can gain a lot of insight into the nature of the data itself.

Unfortunately, it’s really hard to figure out what distribution data is coming from with only a finite amount of data in all but the simplest cases. What we often settle for instead is to assume the data come from a certain class of distributions, and then try to estimate what the parameters of that distribution would have to be for it to fit the data well. In this sense, a whole lot of statistics boils down to how to estimate the parameters of a given distribution from the data.

Some of the most important parameters to estimate from an array of data are its moments. The hope is to find, without knowing the data’s distribution, the best estimate of that distribution’s moments from the data given. This is where the formulas you’re probably used to for things like the mean, variance, and standard deviation come in. Traditionally, to avoid getting these moment estimates mixed up with the true distribution’s moments, we call them sample moments. I’ll define them below for the univariate case.

Sample Mean: \[\overline{x} = \frac{1}{m}\sum_{i=0}^{m-1} x_i = \frac{1}{m}(x_0 + x_1 + \cdots + x_{m-1})\]

Sample Variance: \[s^2 = \frac{1}{m}\sum_{i=0}^{m-1} (x_i-\overline{x})^2 = \frac{1}{m}\big((x_0-\overline{x})^2 + \cdots + (x_{m-1}-\overline{x})^2\big)\]

Sample Standard Deviation: \[s = \sqrt{s^2} = \sqrt{\frac{1}{m}\sum_{i=0}^{m-1} (x_i-\overline{x})^2}\]
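As a quick illustration (the data values below are made up), here’s how these sample moments can be computed with numpy. Note that numpy’s ddof=0 option matches the \(1/m\) convention used above.

Code
import numpy as np

# a made-up 1D array of m = 8 samples, purely for illustration
x = np.array([2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 5.0, 2.0])
m = len(x)

x_bar = x.sum() / m                  # sample mean
s2 = ((x - x_bar) ** 2).sum() / m    # sample variance, 1/m convention
s = np.sqrt(s2)                      # sample standard deviation

# numpy's built-ins agree when ddof=0, i.e. dividing by m rather than m-1
assert np.isclose(x_bar, x.mean())
assert np.isclose(s2, x.var(ddof=0))
assert np.isclose(s, x.std(ddof=0))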

Other quantities that might be of interest to estimate aren’t moments at all. One example is the median, which is defined as the midpoint of a distribution, i.e. the \(x\) at which half of the probability lies at or below \(x\) and half lies above it. The median is another way of estimating the center of a distribution, but it has slightly different properties than the mean. One of those properties is that the median depends only on the rank order of the values, not on what numbers those values take on. This implies that, unlike the mean, the median is insensitive to points “far away from the center”, called outliers.

We can estimate the sample median, call it \(M\), of an array \(x_0,x_1,\cdots,x_{m-1}\) by sorting the values in ascending order and plucking out the midpoint. If \(m\) is odd, this is just the middle value of the sorted array, \(M = x_{m//2}\). If \(m\) is even, we by convention take the median to be the average of the two middle values of the sorted array, \(M = \frac{1}{2}\big(x_{m//2-1} + x_{m//2}\big)\).

Other quantities we might want to estimate from the data are the sample minimum, the sample maximum, and the sample range, which is defined as the difference between the sample max and min.
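Here’s a minimal numpy sketch of these order-based summaries on a made-up array; np.median and np.ptp do the same jobs directly.

Code
import numpy as np

# made-up array with m = 7 (odd) samples
x = np.array([2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 5.0])
x_sorted = np.sort(x)
m = len(x)

# sample median: middle value if m is odd, average of the two middle values if m is even
M = x_sorted[m // 2] if m % 2 == 1 else 0.5 * (x_sorted[m // 2 - 1] + x_sorted[m // 2])
assert np.isclose(M, np.median(x))

x_min, x_max = x.min(), x.max()    # sample minimum and maximum
sample_range = x_max - x_min       # sample range, same as np.ptp(x)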

13.2 Moments of a Distribution

Probability distributions have special quantities that are worth keeping track of, called moments. The most important moments for practical purposes are the mean and variance, but there are higher-order moments as well like skewness and kurtosis that sometimes become important.

13.2.1 Univariate Moments

Let’s start with a boring term that I’ll call the “zeroth moment”. It’s just a restatement of the fact that probabilities sum to one from before,

\[ 1 = \begin{cases} \sum_{j=0}^{k-1} p(x_j), & x \text{ is discrete}, \\ \\ \int_{-\infty}^\infty p(x) dx, & x \text{ is continuous}. \end{cases} \]

Moving on to the first moment, define the mean (or expected value) of a distribution \(p(x)\), usually denoted by symbols like \(\langle x \rangle\) or \(\mathbb{E}(x)\) or \(\mu\), by

\[ \langle x \rangle = \mathbb{E}(x) = \begin{cases} \sum_{j=0}^{k-1} x_j p(x_j), & x \text{ is discrete}, \\ \\ \int_{-\infty}^\infty x p(x) dx, & x \text{ is continuous}. \end{cases} \]

The mean of a univariate distribution \(p(x)\) is an estimate of the “center of mass” or the “balancing point” of the distribution. It will be a single number.
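As a small worked example (not tied to any data above), here’s the mean of a fair six-sided die computed straight from the discrete formula.

Code
import numpy as np

# fair six-sided die: x takes the values 1,...,6, each with probability 1/6
x_vals = np.arange(1, 7)
p_vals = np.full(6, 1 / 6)

mu = np.sum(x_vals * p_vals)    # sum_j x_j p(x_j)
print(mu)                       # 3.5, the "balancing point" of the die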

Moving on to the second moment, let \(\mu = \langle x \rangle\) for convenience. Define the variance (or mean squared deviation) of a distribution \(p(x)\), usually denoted \(\text{Var}(x)\) or \(\sigma^2\), as the mean of the squared difference \((x-\mu)^2\), that is

\[ \text{Var}(x) = \big\langle (x-\mu)^2 \big\rangle = \begin{cases} \sum_{j=0}^{k-1} (x_j-\mu)^2 p(x_j), & x \text{ is discrete}, \\ \\ \int_{-\infty}^\infty (x-\mu)^2 p(x) dx, & x \text{ is continuous}. \end{cases} \]

Similar to the mean, the variance of a univariate distribution will also be a single number, one indicating the “spread” of the distribution. More commonly, when talking about the “spread” of a distribution, we like to talk about the square root of the variance, called the standard deviation (or root mean squared deviation) and usually denoted \(\sigma\), \[\sigma = \sqrt{\text{Var}(x)}.\]

The standard deviation has the advantage that it’s on the same scale as \(x\), and hence has the same units.
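Continuing the same hypothetical die example, the variance and standard deviation follow the same pattern.

Code
import numpy as np

# the same hypothetical fair die as above
x_vals = np.arange(1, 7)
p_vals = np.full(6, 1 / 6)

mu = np.sum(x_vals * p_vals)                 # mean, 3.5
var = np.sum((x_vals - mu) ** 2 * p_vals)    # variance, about 2.917
sigma = np.sqrt(var)                         # standard deviation, about 1.708
print(mu, var, sigma)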

I won’t prove it, but this pattern is very general. We can talk about taking the mean of any function of a random variable just as well. Suppose \(f(x)\) is some reasonably well-behaved function. Then we can talk about the mean of \(f(x)\) by using

\[ \langle f(x) \rangle = \mathbb{E}f(x) = \begin{cases} \sum_{j=0}^{k-1} f(x_j) p(x_j), & x \text{ is discrete}, \\ \\ \int_{-\infty}^\infty f(x) p(x) dx, & x \text{ is continuous}. \end{cases} \]

This trick is sometimes called the Law of the Unconscious Statistician.
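For instance, here’s a small sympy sketch of the continuous case, taking \(f(x) = e^x\) and \(x \sim U(0,1)\) purely as an illustration.

Code
import sympy as sp

x = sp.symbols('x')
p = sp.Integer(1)    # density of U(0,1) on its support [0, 1]
f = sp.exp(x)        # some reasonably well-behaved function of x

Ef = sp.integrate(f * p, (x, 0, 1))    # <f(x)> = integral of f(x) p(x) dx
print(Ef)                              # E - 1, roughly 1.718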

The variance can be written in a different form by doing some algebraic manipulations, \[\text{Var}(x) = \langle x^2 \rangle - \mu^2.\] You can pretty easily derive this formula if you like, but I won’t do so here. This form is sometimes easier to use to do calculations if we’ve already calculated the mean \(\mu\), since we only need to calculate \(\langle x^2 \rangle\), which is often easier to do.
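As a quick sanity check of this shortcut formula, again using \(U(0,1)\) only as an example:

Code
import sympy as sp

x = sp.symbols('x')
p = sp.Integer(1)    # density of U(0,1)

mu = sp.integrate(x * p, (x, 0, 1))                      # mean, 1/2
second_moment = sp.integrate(x**2 * p, (x, 0, 1))        # <x^2>, 1/3
var_direct = sp.integrate((x - mu)**2 * p, (x, 0, 1))    # definition of the variance
print(var_direct, second_moment - mu**2)                 # both 1/12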

It’s worth noting that the mean operation is linear, which means two things:

- if \(x\) and \(y\) are two random variables, then \(\langle x + y \rangle = \langle x \rangle + \langle y \rangle\),
- if \(x\) is a random variable and \(c\) is some constant, then \(\langle cx \rangle = c\langle x \rangle\).

This fact follows immediately from the fact that the mean is just a sum (or integral), and sums are linear. It holds not just for two random variables, but for any number of them. It won’t, however, hold for the variance in general: variances only add when the variables are uncorrelated, while constants come out squared, \[\text{Var}(x+y) \neq \text{Var}(x) + \text{Var}(y) \ \text{in general}, \qquad \text{Var}(cx) = c^2\text{Var}(x).\]
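A short Monte Carlo illustration of these facts with numpy; the distributions, sample size, and constant below are arbitrary choices.

Code
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=100_000)    # an arbitrary random variable
y = 0.5 * x + rng.normal(size=100_000)    # a second variable, correlated with x
c = 3.0

print(np.mean(x + y), np.mean(x) + np.mean(y))    # approximately equal
print(np.mean(c * x), c * np.mean(x))             # approximately equal
print(np.var(c * x), c ** 2 * np.var(x))          # approximately equal
print(np.var(x + y), np.var(x) + np.var(y))       # noticeably different, since x and y are correlated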

13.2.2 Multivariate Moments

Moments also extend to multivariate distributions, but it becomes more complicated. The mean for univariate distributions becomes the mean vector for multivariate distributions, while variance becomes the covariance matrix. Other moments can be even more complicated in the multivariate case by becoming higher-rank tensors. I’ll briefly spell out the definition of mean and covariance below and stop there.

The mean vector of an \(n\) dimensional multivariate distribution \(p(\mathbf{x})\), denoted \(\langle \mathbf{x} \rangle\) or \(\mathbb{E}(\mathbf{x})\) or \(\boldsymbol{\mu}\), is defined by

\[ \langle \mathbf{x} \rangle = \mathbb{E}(\mathbf{x}) = \begin{cases} \sum_{j=0}^{k-1} \mathbf{x}_j p(\mathbf{x}_j), & \mathbf{x} \text{ is discrete}, \\ \\ \int_{\mathbb{R}^n} \mathbf{x} p(\mathbf{x}) d^n\mathbf{x}, & \mathbf{x} \text{ is continuous}. \end{cases} \]

If you look carefully, you’ll see the mean vector is just the vector whose components are the ordinary means, \(\mu_i = \langle x_i \rangle\). Just as in the univariate case, the mean vector again represents the “center of mass” of the distribution, except now in \(n\) dimensional space.

The covariance matrix of an \(n\) dimensional multivariate distribution \(p(\mathbf{x})\), denoted either \(\text{Cov}(\mathbf{x})\) or \(\boldsymbol{\Sigma}\), is defined as the mean of the outer product of the difference \(\mathbf{x} - \boldsymbol{\mu}\) with itself, i.e. \[\boldsymbol{\Sigma} = \text{Cov}(\mathbf{x}) = \big\langle (\mathbf{x}-\boldsymbol{\mu}) (\mathbf{x}-\boldsymbol{\mu})^\top \big\rangle \qquad \Rightarrow \qquad \sigma_{i,j} = \langle (x_i-\mu_i)(x_j-\mu_j) \rangle.\]

This gives a symmetric, positive semidefinite \((n,n)\) matrix whose elements \(\sigma_{i,j}\) represent how much any two variables “depend on each other”. The diagonal elements \(\sigma_{i,i}\) are the variances of each \(x_i\), while the off-diagonal elements \(\sigma_{i,j}\), \(i \neq j\), are the covariances of \(x_i\) with \(x_j\). The larger \(|\sigma_{i,j}|\) is, the more strongly the two variables are said to correlate. If \(\sigma_{i,j}=0\) the two variables are said to be uncorrelated.
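Here’s how the mean vector and covariance matrix might be estimated from a 2D array of samples with numpy; the true parameters below are made up purely for illustration.

Code
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])         # made-up mean vector
Sigma_true = np.array([[2.0, 0.8],
                       [0.8, 1.0]])     # made-up covariance matrix
X = rng.multivariate_normal(mu_true, Sigma_true, size=100_000)   # shape (m, n)

mu_hat = X.mean(axis=0)               # estimated mean vector
Sigma_hat = np.cov(X, rowvar=False)   # estimated covariance matrix
print(mu_hat)       # close to mu_true
print(Sigma_hat)    # close to Sigma_true; diagonal is variances, off-diagonal is covariances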

13.2.3 Example: Bernoulli Distribution

Just so you can see how this process is done, I’ll work one simple example, the Bernoulli distribution. This one should be easy since it only takes on two values \(x=0,1\). The other ones, not so much.

Recall for Bernoulli we have the probability function

\[ p(x) = \begin{cases} p_0 & x = 1, \\ (1-p_0) & x = 0. \end{cases} \]

For the mean, we thus have

\[ \begin{align} \langle x \rangle &= \sum_{x=0}^{1} x p(x) \\ &= (0 \cdot (1-p_0)) + (1 \cdot p_0) \\ &= p_0. \end{align} \]

That is, \(\mu = \langle x \rangle = p_0\). The mean of Bernoulli is just the probability that \(x=1\).

For the variance, we have

\[ \begin{align} \text{Var}(x) &= \langle (x-\mu)^2 \rangle \\ &= \sum_{x=0}^{1} (x - \mu)^2 p(x) \\ &= ((0-p_0)^2 \cdot (1-p_0)) + ((1-p_0)^2 \cdot p_0) \\ &= p_0^2 \cdot (1-p_0) + (1-p_0)^2 \cdot p_0 \\ &= p_0 (1-p_0) \big(p_0 + (1-p_0)\big) \\ &= p_0 (1 - p_0). \end{align} \]

Thus, the variance of Bernoulli is \(\text{Var}(x) = p_0(1-p_0)\), i.e. the probability \(x=1\) times the probability \(x=0\). Taking the square root then gives the standard deviation, \(\sigma = \sqrt{p_0(1-p_0)}\).
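Here’s a quick symbolic check of both results with sympy, treating \(p_0\) as a symbol; it just repeats the sums above.

Code
import sympy as sp

p0 = sp.symbols('p0')
values = [0, 1]
probs = [1 - p0, p0]     # the Bernoulli probability function from above

mean = sum(x * p for x, p in zip(values, probs))
var = sp.expand(sum((x - mean)**2 * p for x, p in zip(values, probs)))
print(mean)                               # p0
print(sp.simplify(var - p0 * (1 - p0)))   # 0, confirming Var(x) = p0*(1 - p0)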

13.2.4 List of Means and Variances

Rather than put you through the tedium of any more boring and messy calculations, I’ll just list the mean and variance of several common distributions below for reference. Notice they won’t be a function of the random variable, but they will be a function of the distribution’s parameters.

| Distribution | Support | Mean | Variance |
| --- | --- | --- | --- |
| Discrete Uniform \(\text{DU}(a,b)\) | \(\{a,a+1,\cdots,b-1\}\) | \(\frac{1}{2}(a+b-1)\) | \(\frac{1}{12}\big((b-a)^2-1\big)\) |
| Bernoulli \(\text{Ber}(\text{p})\) | \(\{0,1\}\) | \(\text{p}\) | \(\text{p}(1-\text{p})\) |
| Binomial \(\text{Bin}(n, \text{p})\) | \(\{0,1,\cdots,n\}\) | \(n\text{p}\) | \(n\text{p}(1-\text{p})\) |
| Categorical \(\text{Cat}(\mathbf{p})\) | \(\{0,1,\cdots,k-1\}\) | \(p_j\) for all \(j\) | \(p_j (1 - p_j)\) for all \(j\) |
| Uniform \(U(a,b)\) | \([a,b]\) | \(\frac{1}{2}(a+b)\) | \(\frac{1}{12}(b-a)^2\) |
| Gaussian \(\mathcal{N}(\mu,\sigma^2)\) | \(\mathbb{R}\) | \(\mu\) | \(\sigma^2\) |
| Laplace \(L(\mu,\sigma)\) | \(\mathbb{R}\) | \(\mu\) | \(2\sigma^2\) |
| Multivariate Gaussian \(\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})\) | \(\mathbb{R}^n\) | \(\boldsymbol{\mu}\) | \(\boldsymbol{\Sigma}\) |
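If you want to spot check any row of the table, a quick Monte Carlo comparison is easy to set up, e.g. for the uniform \(U(2,5)\) (parameters chosen arbitrarily):

Code
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 5.0                             # arbitrary parameters for U(a, b)
samples = rng.uniform(a, b, size=1_000_000)

print(samples.mean(), (a + b) / 2)          # both about 3.5
print(samples.var(), (b - a) ** 2 / 12)     # both about 0.75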