Mathematics of the t-test
There are times when our sample size is too small for asymptotic results to be valid. We cannot simply use Slutsky’s theorem to replace the true variance with the sample variance at such a small sample size, but if our data are themselves drawn from a Gaussian distribution then, as we will see, we can still make progress.
Proof of Cochran’s theorem and the t-test
Assume we have data:
\[X_1, \dots, X_n \stackrel{iid}{\sim} \mathcal{N}\left(\mu, \sigma^2\right)\]i.e. an independent, identically distributed set of $n$ samples drawn from a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, where $(\mu, \sigma^2) \in \mathbb{R}\times (0, \infty)$.
Recall that a “pivot” is a quantity built from the data (and possibly the parameters) whose distribution does not depend on the parameters of the underlying distribution that we want to estimate, e.g.
\[\widetilde{T} = \sqrt{n}\frac{\mu-\bar{X}}{\sqrt{\sigma^2}} \sim \mathcal{N}(0, 1)\]Note that this statement is non-asymptotic because our data is drawn from a Gaussian distribution: it holds for any sample size $n$. The distribution of this statistic $\widetilde{T}$ therefore does not depend on $\mu$ or $\sigma^2$, which makes it a “pivot”. We can compute its quantiles and apply the usual machinery of hypothesis testing to such a quantity.
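As a quick numerical sanity check (a minimal numpy/scipy sketch; the values of $\mu$, $\sigma$ and the deliberately tiny $n$ are arbitrary choices of mine), the quantiles of $\widetilde{T}$ match a standard normal even at $n=5$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n = 2.0, 3.0, 5          # deliberately tiny sample size
reps = 100_000

# Draw `reps` datasets of size n and form the pivot for each one
X = rng.normal(mu, sigma, size=(reps, n))
T_tilde = np.sqrt(n) * (mu - X.mean(axis=1)) / sigma

# The pivot should match N(0, 1) exactly, even for n = 5
for q in (0.025, 0.5, 0.975):
    print(q, np.quantile(T_tilde, q), stats.norm.ppf(q))
```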
Typically in hypothesis testing under the null hypothesis we have a value for $\mu$, the true mean, e.g.
\[H_0: \mu=0, \quad H_1: \mu\neq 0\]and we can use that value in the test statistic. The problem, however, is that we do not know the ground-truth variance $\sigma^2$. In the asymptotic case Slutsky’s theorem lets us justify replacing $\sigma^2$ with the sample variance, but for small $n$ we cannot simply do that.
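To see concretely why this matters, here is a small simulation of my own (illustrative choices: $n=5$, nominal level $0.05$) of what happens if we naively plug the sample variance into the Gaussian pivot and keep using normal quantiles: the test rejects a true null noticeably more often than the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, alpha, reps = 5, 0.05, 200_000

# Data generated under H0: mu = 0
X = rng.normal(0.0, 1.0, size=(reps, n))
xbar = X.mean(axis=1)
s2_ub = X.var(axis=1, ddof=1)          # unbiased sample variance

# Naive test: plug in the sample variance but keep N(0,1) quantiles
Z_hat = np.sqrt(n) * xbar / np.sqrt(s2_ub)
z_crit = stats.norm.ppf(1 - alpha / 2)
print("naive rejection rate:", np.mean(np.abs(Z_hat) > z_crit))  # noticeably above 0.05
```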
In order to do non-asymptotics properly we need to genuinely know something about the distribution of the sample variance $\widehat{\sigma}^2$.
Recap: orthogonal matrices and projections
In order to understand the distribution of the sample variance, we need to take a quick detour into linear algebra: orthogonal matrices and projection operators. This may seem irrelevant, but we will get back to statistics very soon, I promise. A great overview of this topic can be found in Strang Lecture #15.
If we have some orthogonal $n \times n$ matrix $\mathbf{V}$, we can think of it as consisting of columns of orthonormal vectors $\mathbf{v_i} \in \mathbb{R}^n$:
\[\mathbf{V}= \begin{pmatrix} \vert & \dots &\vert \\ \mathbf{v_1} & \dots & \mathbf{v_n} \\ \vert & \dots & \vert \end{pmatrix}\]These vectors satisfy $\mathbf{v_i^T}\mathbf{v_j}=\delta_{ij}$, and the matrix itself satisfies $\mathbf{V^T}\mathbf{V}=\mathbf{I_n}=\mathbf{V}\mathbf{V^T}$ where $\mathbf{I_n}$ is the $n\times n$ identity matrix.
Transformation of vectors by orthogonal matrices has the special property that lengths are invariant (if you rotate a vector then the length of the vector doesn’t change - we did not stretch the vector). In other words, if $\mathbf{x} \in \mathbb{R}^n$, then
\[\begin{aligned} ||\mathbf{V}\mathbf{x}||^2_2&=\left(\mathbf{V}\mathbf{x}\right)^T\left(\mathbf{V}\mathbf{x}\right)\\ &=\mathbf{x}^T\mathbf{V}^T\mathbf{V}\mathbf{x}\\ &=\mathbf{x}^T\mathbf{x}\\ &=||\mathbf{x}||^2_2 \end{aligned}\]
Projections
If we have a $k$-dimensional subspace of $\mathbb{R}^n$ (with $k < n$), spanned by orthonormal basis vectors collected into the $n\times k$ matrix $\mathbf{W}$:
\[\mathbf{W}= \begin{pmatrix} \vert & \dots &\vert \\ \mathbf{w_1} & \dots & \mathbf{w_k} \\ \vert & \dots & \vert \end{pmatrix}\]and $\mathbf{w_i^T}\mathbf{w_j}=\delta_{ij}$, then we can think about the projection of the n-dimensional vector $\mathbf{x}\in \mathbb{R}^n$ onto that subspace.
Let’s consider the simplest case: projecting a vector $\mathbf{b}$ onto a vector $\mathbf{a}$ ($\mathbf{a}$ plays the role of a 1D subspace). It’s easy to see visually that to minimize the error vector $\mathbf{e}$, the projection should be such that this error is orthogonal (perpendicular) to $\mathbf{a}$. If you imagine moving the projected point to the left or right, so that the angle is no longer a right angle, the error obviously gets larger.
The projection can be written as some multiple of $\mathbf{a}$,
\[\mathbf{p}=x\mathbf{a}\]where $x$ is just a scalar number.
The error is then
\[\mathbf{e}=\mathbf{b}-\mathbf{p}\]and in order for it to be as small as possible, the orthogonality condition tells us that
\[\mathbf{a}^T(\mathbf{b}-x\mathbf{a})=0\]which we can re-write as
\[x\mathbf{a}^T\mathbf{a}=\mathbf{a}^T\mathbf{b}\]or
\[x=(\mathbf{a}^T\mathbf{a})^{-1}\mathbf{a}^T\mathbf{b}\]The projection is then
\[\mathbf{p}=\mathbf{a}(\mathbf{a}^T\mathbf{a})^{-1}\mathbf{a}^T\mathbf{b}\]We can think of the projection matrix as operating on some vector $\mathbf{b}$ to project it down onto the subspace spanned by $\mathbf{a}$ (just a line in this case). Recall that $\mathbf{a}^T\mathbf{a}$ is just a scalar, but $\mathbf{a}\mathbf{a}^T$ is a full $n\times n$ matrix (it is a column vector times a row vector):
\[\mathbb{P} = \mathbf{a}(\mathbf{a}^T\mathbf{a})^{-1}\mathbf{a}^T\]This has the properties you would expect from projection, such as doing it twice is same as doing it once (more fancily called idempotency):
\[\begin{aligned} \mathbb{P}^2 &= \mathbf{a}(\mathbf{a}^T\mathbf{a})^{-1}\mathbf{a}^T\mathbf{a}(\mathbf{a}^T\mathbf{a})^{-1}\mathbf{a}^T\\ &=\mathbf{a}(\mathbf{a}^T\mathbf{a})^{-1}\mathbf{a}^T\\ &=\mathbb{P} \end{aligned}\]Moving to higher dimensions, let’s imagine we now have some matrix with just two columns ($k=2$) for simplicity:
\[\mathbf{A}=\begin{pmatrix} \vert & \vert \\ a_1 & a_2 \\ \vert & \vert \end{pmatrix}\]the column-space of this matrix is now a plane, and we want to project some vector $\mathbf{b}$ onto that plane.
Whenever we have
\[\mathbf{y}=\mathbf{A}\mathbf{x}= \begin{pmatrix} \vert & \vert \\ a_1 & a_2 \\ \vert & \vert \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\]The resulting vector $\mathbf{y}$ can be expressed as
\[\mathbf{y}=\mathbf{A}\mathbf{x}= x_1\begin{pmatrix} \vert \\ a_1 \\ \vert \\ \end{pmatrix} + x_2\begin{pmatrix} \vert \\ a_2 \\ \vert \\ \end{pmatrix}\]You can see this is true just by the rules of matrix multiplication.
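A one-line numpy check of this column picture (an arbitrary small example of mine):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])           # three rows, two columns a_1 and a_2
x = np.array([0.5, -2.0])

# A @ x is exactly x_1 * a_1 + x_2 * a_2
print(np.allclose(A @ x, x[0] * A[:, 0] + x[1] * A[:, 1]))  # True
```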
Any vector in the column space of $\mathbf{A}$ can always be expressed as $\mathbf{p}=\mathbf{A}\mathbf{\hat{x}}$. To find this $\mathbf{\hat{x}}$ we use exactly the same condition as in 1D: we require the residual error to be orthogonal to the subspace (the column space of $\mathbf{A}$):
\[\mathbf{A}^T \mathbf{e} = \begin{pmatrix} \text{---} & a_1 & \text{---}\\ \text{---} & a_2 & \text{---} \end{pmatrix} \begin{pmatrix} \vert \\ e \\ \vert \\ \end{pmatrix}=\mathbf{0}\]This means the residual error $\mathbf{e}$ lives in the nullspace of $\mathbf{A}^T$, which makes sense, as vectors in that space are orthogonal to the column space of $\mathbf{A}$.
We are led to the condition for a projection that minimizes the residual error to be
\[\mathbf{A}^T(\mathbf{b}-\mathbf{A}\mathbf{\hat{x}})=\mathbf{0}\]which we can write as
\[\begin{aligned} \mathbf{A}^T\mathbf{A}\mathbf{\hat{x}}&=\mathbf{A}^T\mathbf{b}\\ \mathbf{\hat{x}}&=(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b}\\ \end{aligned}\]and the projection itself ($\mathbf{p}=\mathbf{A}\mathbf{\hat{x}}$):
\[\mathbf{p}=\mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b}\]so we find
\[\mathbb{P}=\mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\]Note that $\mathbb{P}=\mathbb{P}^T$ and $\mathbb{P}^2=\mathbb{P}$ (symmetric and idempotent).
If the columns of $\mathbf{A}$ are orthonormal then $\mathbf{A^T}\mathbf{A}=\mathbf{I}$ and the projection simplifies to
\[\mathbb{P}=\mathbf{A}\mathbf{A}^T\]Note that $\mathbf{A}^T\mathbf{A}=\mathbf{I}$ but, since $\mathbf{A}$ is not square, $\mathbf{A}\mathbf{A}^T\neq\mathbf{I}$.
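Here is a short numpy sketch of these facts, using a random $n \times k$ matrix $\mathbf{A}$ of my own choosing and the orthonormal basis of its column space obtained from a QR factorization:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 2
A = rng.normal(size=(n, k))
b = rng.normal(size=n)

# General projection onto the column space of A
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(P @ P, P), np.allclose(P, P.T))      # idempotent and symmetric
print(np.allclose(A.T @ (b - P @ b), 0))               # residual is orthogonal to col(A)

# With orthonormal columns (Q from a QR factorization) the formula collapses to Q Q^T
Q, _ = np.linalg.qr(A)
print(np.allclose(Q.T @ Q, np.eye(k)))                 # Q^T Q = I_k ...
print(np.allclose(Q @ Q.T, np.eye(n)))                 # ... but Q Q^T is not I_n (prints False)
print(np.allclose(Q @ Q.T, P))                         # same projection matrix either way
```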
As a side note: this is really relevant mathematics to other statistical topics such as PCA and least-squares, so projections are a useful thing to get to know.
Remember we had a $k$-dimensional subspace, with orthonormal basis vectors, embedded in the $n$-dimensional “ambient” space
\[\mathbf{W}= \begin{pmatrix} \vert & \dots &\vert \\ \mathbf{w_1} & \dots & \mathbf{w_k} \\ \vert & \dots & \vert \end{pmatrix}\]If we consider projecting a vector $\mathbf{x}\in \mathbb{R}^n$ down onto that space, $\mathbb{P}\mathbf{x}$, what length would it have?
\[\begin{aligned} ||\mathbb{P}\mathbf{x}||_2^2&= ||\mathbf{W}\mathbf{W}^T\mathbf{x}||_2^2\\ &=\left(\mathbf{W}\mathbf{W}^T\mathbf{x}\right)^T\left(\mathbf{W}\mathbf{W}^T\mathbf{x}\right)\\ &=\mathbf{x}^T\mathbf{W}\mathbf{W}^T\mathbf{W}\mathbf{W}^T\mathbf{x}\\ &=\mathbf{x}^T\mathbf{W}\mathbf{W}^T\mathbf{x}\\ &=||\mathbf{W}^T\mathbf{x}||^2_2 \end{aligned}\]where I used $\mathbf{W}^T\mathbf{W}=\mathbf{I}$ to get the fourth equality above.
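Numerically (a small sketch, with a random subspace built via QR; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 8, 3
W, _ = np.linalg.qr(rng.normal(size=(n, k)))   # n x k matrix with orthonormal columns
x = rng.normal(size=n)

proj = W @ W.T @ x
print(np.allclose(np.sum(proj**2), np.sum((W.T @ x)**2)))  # ||P x||^2 == ||W^T x||^2
```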
How is this relevant?
Consider the sample variance
\[\widehat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n \left(X_i-\hat{\mu}\right)^2\]We can also write $n\widehat{\sigma}^2$ as the squared length ($\ell_2$ norm) of a vector
\[n\widehat{\sigma}^2=\Bigg\|\begin{pmatrix} \vert \\ \mathbf{x} \\ \vert \end{pmatrix} -\begin{pmatrix} \vert \\ \mathbf{\hat{\mu}} \\ \vert \end{pmatrix}\Bigg\|_2^2\]where
\[\mathbf{x}=(X_1, \dots, X_n)^T\]and
\[\mathbf{\hat{\mu}}=(\hat{\mu}, \dots, \hat{\mu})^T\]We can also think of the sample mean in terms of vectors too
\[\hat{\mu}=\frac{1}{n}\sum_{i=1}^n X_i=\frac{1}{n}\mathbb{\mathbf{1}}^T\mathbf{x}\]where $\mathbb{1}=(1, \dots, 1)^T$, so that now
\[\mathbf{\hat{\mu}}=\begin{pmatrix} \vert\\ \hat{\mu}\\ \vert \end{pmatrix}=\frac{1}{n}\mathbb{\mathbf{1}}\mathbb{\mathbf{1}}^T \mathbf{x}\]i.e. it’s a projection of our sample vector $\mathbf{x}$ onto the 1D subspace spanned by the vector $\mathbb{1}=(1, \dots, 1)^T$.
If we define a rescaled version of this vector
\[\mathbf{w_1}=\frac{1}{\sqrt{n}}\mathbb{\mathbf{1}}\](which we will soon think of as the first of our orthonormal basis vectors in the subspace) then we can write
\[\mathbf{\hat{\mu}}=\begin{pmatrix} \vert\\ \hat{\mu}\\ \vert \end{pmatrix}=\mathbf{w_1}\mathbf{w_1}^T \mathbf{x}\]and we arrive at the following vectorized expression for the sample variance
\[n\widehat{\sigma}^2=||\mathbf{x}-\mathbf{w_1}\mathbf{w_1}^T\mathbf{x}||^2_2\]One way to think of this is that we are subtracting from the vector $\mathbf{x}$ (representing our sample observations) its projection onto the 1D subspace spanned by the basis vector $\mathbf{w_1}$, which points along the all-ones direction.
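A quick numerical check that this vectorized expression agrees with the usual formula (illustrative numbers only):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 7
x = rng.normal(1.5, 2.0, size=n)

w1 = np.ones(n) / np.sqrt(n)
resid = x - w1 * (w1 @ x)                        # x - w1 w1^T x
print(np.allclose(resid @ resid, n * x.var()))   # equals n * sigma_hat^2 (1/n version)
```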
Next we will add and subtract a vector representing the true (population) mean
\[\mathbf{\mu}=\begin{pmatrix} \vert\\ \mu\\ \vert \end{pmatrix}=\mu \mathbb{\mathbf{1}}\]which gives
\[n\widehat{\sigma}^2=||\mathbf{x}-\mu \mathbb{\mathbf{1}} - \left(\mathbf{w_1}\mathbf{w_1}^T\mathbf{x}-\mu \mathbb{\mathbf{1}}\right)||^2_2\]Note that
\[\begin{aligned} \mathbf{w_1}\mathbf{w_1}^T\mu \mathbb{\mathbf{1}}&=\frac{1}{\sqrt{n}}\mathbb{\mathbf{1}}\frac{1}{\sqrt{n}}\mathbb{\mathbf{1}}^T\mu\mathbb{\mathbf{1}}\\ &=\frac{1}{n}\mu\mathbb{\mathbf{1}}\times n\\ &=\mu\mathbb{\mathbf{1}} \end{aligned}\]which makes sense because $\mu \mathbb{\mathbf{1}}$ already lives in the 1D space spanned by $\mathbf{w_1}$, so projecting down again does nothing to the vector.
Hence
\[\begin{aligned} n\widehat{\sigma}^2&=||\mathbf{x}-\mu \mathbb{\mathbf{1}} - \left(\mathbf{w_1}\mathbf{w_1}^T\mathbf{x}-\mathbf{w_1}\mathbf{w_1}^T\mu \mathbb{\mathbf{1}}\right)||^2_2\\ &=||\mathbf{x}-\mu \mathbb{\mathbf{1}} - \mathbf{w_1}\mathbf{w_1}^T\left(\mathbf{x}-\mu \mathbb{\mathbf{1}}\right)||^2_2\\ \end{aligned}\]Now we are getting close. If we define the random vector
\[\mathbf{y}=\frac{\mathbf{x}-\mu\mathbb{\mathbf{1}}}{\sigma}\]then $\mathbf{y}\sim \mathcal{N}(0, \mathbf{I}_n)$ by independence of the samples.
This allows us to write the ratio of the sample variance and true variance as
\[\begin{aligned} n\frac{\widehat{\sigma}^2}{\sigma^2}&=||\mathbf{y}-\mathbf{w_1}\mathbf{w_1}^T\mathbf{y}||^2_2\\ \end{aligned}\]Now consider an orthonormal basis $\mathbf{w_1}, \dots, \mathbf{w_n}$ of the full $n$-dimensional space (with $\mathbf{w_1}$ as above), collected into the matrix $\mathbf{W}$. Projecting $\mathbf{y}$ onto the column space of $\mathbf{W}$ would give
\[\begin{aligned} \mathbb{P}\mathbf{y}&=\mathbf{W}\mathbf{W}^T\mathbf{y}\\ &=\begin{pmatrix} \vert & \dots &\vert \\ \mathbf{w_1} & \dots & \mathbf{w_n} \\ \vert & \dots & \vert \end{pmatrix} \begin{pmatrix} \text{---} & \mathbf{w}_1^T & \text{---}\\ \vdots & \vdots & \vdots \\ \text{---} & \mathbf{w}_n^T & \text{---} \end{pmatrix} \begin{pmatrix} \vert \\ \mathbf{y} \\ \vert \end{pmatrix}\\ &=\begin{pmatrix} \vert & \dots &\vert \\ \mathbf{w_1} & \dots & \mathbf{w_n} \\ \vert & \dots & \vert \end{pmatrix} \begin{pmatrix} \mathbf{w_1}^T\mathbf{y} \\ \vdots \\ \mathbf{w_n}^T\mathbf{y} \end{pmatrix}\\ &=\begin{pmatrix} \vert \\ \mathbf{w_1} \\ \vert \end{pmatrix}\mathbf{w_1}^T\mathbf{y}+\dots +\begin{pmatrix} \vert \\ \mathbf{w_n} \\ \vert \end{pmatrix}\mathbf{w_n}^T\mathbf{y}\\ &=\sum_{i=1}^n \mathbf{w_i}\mathbf{w_i}^T\mathbf{y} \end{aligned}\]This is just how we represent $\mathbf{y}$ in that basis (which in practice we could build with Gram-Schmidt).
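Concretely, a minimal numpy sketch (my own illustration, using `np.linalg.qr` to do the orthonormalization) of building such a basis with $\mathbf{w_1}=\frac{1}{\sqrt{n}}\mathbb{1}$ as the first column:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5

# First column = all-ones vector, remaining columns random; QR orthonormalizes them
M = np.column_stack([np.ones(n), rng.normal(size=(n, n - 1))])
W, _ = np.linalg.qr(M)
W[:, 0] *= np.sign(W[0, 0])      # QR fixes w_1 only up to sign; make it +1/sqrt(n)

print(np.allclose(W[:, 0], np.ones(n) / np.sqrt(n)))   # w_1 = 1/sqrt(n) * (1,...,1)
print(np.allclose(W.T @ W, np.eye(n)))                 # orthonormal basis of R^n
```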
Then
\[\begin{aligned} n\frac{\widehat{\sigma}^2}{\sigma^2}&=||\sum_{i=1}^n \mathbf{w_i}\mathbf{w_i}^T\mathbf{y}-\mathbf{w_1}\mathbf{w_1}^T\mathbf{y}||^2_2\\ &=\Big|\Big|\sum_{i=2}^n \mathbf{w_i}\mathbf{w_i}^T\mathbf{y}\Big|\Big|^2_2\\ \end{aligned}\]which we can think of as a projection of $\mathbf{y}$ onto the $(n-1)$-dimensional subspace spanned by the columns of the matrix
\[\mathbf{V}= \begin{pmatrix} \vert & \dots &\vert \\ \mathbf{w_2} & \dots & \mathbf{w_n} \\ \vert & \dots & \vert \end{pmatrix}\]i.e.
\[\begin{aligned} n\frac{\widehat{\sigma}^2}{\sigma^2} &=\Big|\Big|\mathbf{V}\mathbf{V}^T\mathbf{y}\Big|\Big|^2_2\\ &=\Big|\Big|\mathbf{V}^T\mathbf{y}\Big|\Big|^2_2\\ \end{aligned}\]where the last equality follows from the result we showed earlier about length under a projection.
What can we say about the vector $\tilde{\mathbf{y}}=\mathbf{V}^T\mathbf{y}$?
This is an affine transformation and for such transformations I showed in another blog post that
\[\mathbf{\mu}_{\tilde{\mathbf{Y}}} =\mathbf{V}^T\mu_{\mathbf{Y}}=\mathbf{0}\]and the covariance
\[\begin{aligned} \Sigma_{\tilde{\mathbf{Y}}}&=\mathbf{V}^T\Sigma_{\mathbf{Y}}\mathbf{V}\\ &=\mathbf{V}^T\mathbf{I}\mathbf{V}\\ &=\mathbf{I}_{n-1} \end{aligned}\]In other words $\widetilde{\mathbf{y}} \sim \mathcal{N}_{n-1}(\mathbf{0}, \mathbf{I}_{n-1})$ and
\[\begin{aligned} n\frac{\widehat{\sigma}^2}{\sigma^2} &\sim \chi^2_{n-1} \end{aligned}\]Recalling that if
\[Z_1, \dots, Z_n \stackrel{iid}{\sim} \mathcal{N}\left(0, 1\right)\]then $X \sim \chi^2_n$ is given by
\[X=Z_1^2+\dots+Z_n^2=||\mathbf{Z}_n||^2_2\]We’ve recovered the amazing fact that if our samples are drawn from a Gaussian distribution, then $n$ times the ratio of the sample variance to the true variance is distributed as a $\chi^2$ distribution with $n-1$ degrees of freedom. This is not an asymptotic result and holds for any sample size.
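Here is a quick Monte Carlo check of this fact (with arbitrary $\mu$, $\sigma$ and a small $n$ of my choosing), comparing the empirical quantiles of $n\widehat{\sigma}^2/\sigma^2$ with those of a $\chi^2_{n-1}$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
mu, sigma, n, reps = -1.0, 2.5, 6, 200_000

X = rng.normal(mu, sigma, size=(reps, n))
stat = n * X.var(axis=1) / sigma**2        # n * sigma_hat^2 / sigma^2 (1/n variance)

for q in (0.05, 0.5, 0.95):
    print(q, np.quantile(stat, q), stats.chi2.ppf(q, df=n - 1))
```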
Replacing the variance with the sample variance
As discussed already, we know that
\[\sqrt{n}\frac{\mu-\hat{\mu}}{\sqrt{\sigma^2}}\sim \mathcal{N}(0, 1)\]and we wanted to be able to replace $\sigma^2$ by the sample variance $\widehat{\sigma}^2$. We now also know that
\[n\frac{\widehat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-1}\]If we now consider the unbiased sample variance $\widehat{\sigma}^2_{\text{ub}}$, which is related to the (biased) sample variance $\widehat{\sigma}^2$ by
\[\widehat{\sigma}^2_{\text{ub}}=\frac{n}{n-1}\widehat{\sigma}^2\]and so it also follows that
\[\frac{\widehat{\sigma}_{ub}^2}{\sigma^2} = \frac{1}{n-1}\frac{n\widehat{\sigma}^2}{\sigma^2}\sim \chi^2_{n-1}/(n-1)\]and if we now take the ratio of our standard normal to the square root of this we get
\[\begin{aligned} T&=\frac{\sqrt{n}\frac{\mu-\hat{\mu}}{\sqrt{\sigma^2}}}{\sqrt{\frac{\widehat{\sigma}_{ub}^2}{\sigma^2}}}\\ &=\sqrt{n}\frac{\mu-\hat{\mu}}{\sqrt{\widehat{\sigma}_{ub}^2}} \end{aligned}\]This is the ratio of a standard normal to the square root of a $\chi^2_{n-1}$ divided by its $n-1$ degrees of freedom, and it doesn’t depend at all on the population variance $\sigma^2$.
If we can show that the standard normal in the numerator is indeed independent of the chi-squared in the denominator, then this test statistic $T$ will be distributed as a Student t-distribution with $n-1$ degrees of freedom.
Independence of the $\chi^2$ and standard normal
Consider the standard normal from the numerator
\[\sqrt{n}\frac{\mu-\hat{\mu}}{\sqrt{\sigma^2}}\sim \mathcal{N}(0, 1)\]and recall
\[\hat{\mu}=\frac{1}{\sqrt{n}}\mathbf{w_1}^T\mathbf{x}\]where
\[\mathbf{w_1}=\frac{1}{\sqrt{n}}(1, \dots, 1)^T\]Using
\[\mathbf{x}=\mathbf{\mu}+\sigma\mathbf{y}\]we see that the stochastic part of the numerator enters solely through $\hat{\mu}$
\[\begin{aligned} \hat{\mu}&=\frac{1}{\sqrt{n}}\mathbf{w_1}^T\left(\mathbf{\mu}+\sigma\mathbf{y}\right)\\ &=\mu+\frac{\sigma}{\sqrt{n}}\mathbf{w_1}^T\mathbf{y} \end{aligned}\]and the stochastic part of the denominator was via
\[\mathbf{V}^T\mathbf{y}\]Let’s look at the covariance of these things. Since they both have mean zero, this reduces to
\[\begin{aligned} &\text{cov}\left(\mathbf{w_1}^T\mathbf{y}, \mathbf{V}^T\mathbf{y}\right)\\ &=\mathbb{E}\left[\mathbf{w_1}^T\mathbf{y}\mathbf{V}^T\mathbf{y}\right] \end{aligned}\]and $\mathbf{w_1}^T\mathbf{y}$ is just a scalar $\in \mathbb{R}$ so it’s also equal to $\mathbf{y}^T\mathbf{w_1}$
\[\begin{aligned} &\text{cov}\left(\mathbf{w_1}^T\mathbf{y}, \mathbf{V}^T\mathbf{y}\right)\\ &=\mathbb{E}\left[(\mathbf{y}^T\mathbf{w_1})\mathbf{V}^T\mathbf{y}\right]\\ &=\mathbb{E}\left[\mathbf{V}^T\mathbf{y}\mathbf{y}^T\mathbf{w_1}\right]\\ &=\mathbf{V}^T\mathbb{E}\left[\mathbf{y}\mathbf{y}^T\right]\mathbf{w_1}\\ &=\mathbf{V}^T\Sigma_{Y}\mathbf{w_1}\\ &=\mathbf{V}^T\mathbf{I}\mathbf{w_1}\\ &=\mathbf{V}^T\mathbf{w_1}\\ &=\mathbf{0} \end{aligned}\]where the final equality follows from the fact that the columns of $\mathbf{V}$ are $\mathbf{w_2}, \dots, \mathbf{w_n}$, each of which is orthogonal to $\mathbf{w_1}$.
This means they are uncorrelated, but for jointly Gaussian vectors it also means they are independent.
Hence the ratio of the standard normal to the square root of the $\chi^2_{n-1}$ divided by its degrees of freedom is distributed as a Student t-distribution.
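As a final sanity check (my own Monte Carlo, with arbitrary $\mu$, $\sigma$ and a small $n$), the empirical quantiles of this statistic match $t_{n-1}$ rather than $\mathcal{N}(0,1)$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
mu, sigma, n, reps = 0.7, 1.3, 5, 200_000

X = rng.normal(mu, sigma, size=(reps, n))
T = np.sqrt(n) * (mu - X.mean(axis=1)) / np.sqrt(X.var(axis=1, ddof=1))

# Empirical quantiles vs t_{n-1} vs N(0,1): the t quantiles are the ones that match
for q in (0.025, 0.975):
    print(q, np.quantile(T, q), stats.t.ppf(q, df=n - 1), stats.norm.ppf(q))
```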
T-test
Finally we have that
\[T_n=\sqrt{n}\frac{\mu-\hat{\mu}}{\sqrt{\widehat{\sigma}^2_{ub}}}=\sqrt{n-1}\frac{\mu-\hat{\mu}}{\sqrt{\widehat{\sigma}^2}}\sim t_{n-1}\]and we can construct the usual statistical tests of the form
\[\psi=\mathbb{\mathbf{1}}\{|T_n|> s\}\]where $s$ is some threshold determined by whatever significance level $\alpha$ we demand (the absolute value gives the two-sided test matching $H_1: \mu\neq 0$).
This is a non-asymptotic result, but otherwise, just as in the asymptotic case with the CLT, we simply read off quantiles of the $t_{n-1}$ distribution from a table.
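To make this concrete, here is a minimal worked example on fabricated data of my own, computing $T_n$ and the threshold $s$ by hand and cross-checking against `scipy.stats.ttest_1samp` (which uses the opposite sign convention, $\hat{\mu}-\mu_0$, so it is the absolute value and the p-value that should agree):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, alpha = 8, 0.05
x = rng.normal(0.4, 1.0, size=n)     # fabricated data; we test H0: mu = 0

# Test statistic as in the text: sqrt(n) (mu_0 - mu_hat) / sigma_hat_ub
T_n = np.sqrt(n) * (0.0 - x.mean()) / x.std(ddof=1)
s = stats.t.ppf(1 - alpha / 2, df=n - 1)        # two-sided threshold at level alpha
print("T_n =", T_n, "reject H0:", abs(T_n) > s)

# Cross-check against scipy's one-sample t-test
t_scipy, p_scipy = stats.ttest_1samp(x, popmean=0.0)
print("scipy:", t_scipy, p_scipy, "reject H0:", p_scipy < alpha)
```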