5 Least Squares: Finite Sample Theory

We continue with properties of OLS. We will show that OLS coincides with the maximum likelihood estimator if the error term follows a normal distribution. We derive its finite-sample exact distribution which can be used for statistical inference. The Gauss-Markov theorem justifies the optimality of OLS under the classical assumptions.

Suppose the data are generated from a parametric model. Statistical estimation seeks the unknown parameter from the observed data. A principle is a philosophy about the proper way to estimate. Over the history of statistics, only a few principles have become widely accepted; among them, maximum likelihood is the most important and fundamental. The maximum likelihood principle dictates that the unknown parameter be found as the maximizer of the log-likelihood function.

5.1 Maximum Likelihood

In this chapter, we first give an introduction to maximum likelihood estimation. Consider a random sample \(Z=\left(z_{1},z_{2},\ldots,z_{n}\right)\) drawn from a parametric distribution with density \(f_{z}\left(z_{i};\theta\right)\), where \(z_{i}\) is either a scalar random variable or a random vector. A parametric distribution is completely characterized by a finite-dimensional parameter \(\theta\), which we know belongs to a parameter space \(\Theta\). We use the data to estimate \(\theta\).

The log-likelihood of observing the entire sample \(Z\) is \[L_{n}\left(\theta;Z\right):=\log\left(\prod_{i=1}^{n}f_{z}\left(z_{i};\theta\right)\right)=\sum_{i=1}^{n}\log f_{z}\left(z_{i};\theta\right).\label{eq:raw_likelihood}\] In reality the sample \(Z\) is given, and for each \(\theta\in\Theta\) we can evaluate \(L_{n}\left(\theta;Z\right)\). The maximum likelihood estimator is \[\widehat{\theta}_{MLE}:=\arg\max_{\theta\in\Theta}L_{n}\left(\theta;Z\right).\] Why is maximizing the log-likelihood function desirable? An intuitive explanation is that \(\widehat{\theta}_{MLE}\) makes observing \(Z\) the “most likely” outcome over the entire parameter space.

A more formal justification requires an explicitly defined distance. Suppose that the true parameter value that generates the data is \(\theta_{0}\), so that the true distribution is \(f_{z}\left(z_{i};\theta_{0}\right)\). Any generic point \(\theta\in\Theta\) produces \(f_{z}\left(z_{i};\theta\right)\). To measure their difference, we introduce the Kullback-Leibler divergence, or the Kullback-Leibler distance, defined as the expected log-likelihood ratio \[\begin{aligned} D_{f}\left(\theta_{0}\Vert\theta\right) & =D\left(f_{z}\left(z_{i};\theta_{0}\right)\Vert f_{z}\left(z_{i};\theta\right)\right):=E_{\theta_{0}}\left[\log\frac{f_{z}\left(z_{i};\theta_{0}\right)}{f_{z}\left(z_{i};\theta\right)}\right]\\ & =E_{\theta_{0}}\left[\log f_{z}\left(z_{i};\theta_{0}\right)\right]-E_{\theta_{0}}\left[\log f_{z}\left(z_{i};\theta\right)\right].\end{aligned}\] We call it a “distance” because it is non-negative, although it is not symmetric, in that \(D_{f}\left(\theta_{1}\Vert\theta_{2}\right)\neq D_{f}\left(\theta_{2}\Vert\theta_{1}\right)\), and it does not satisfy the triangle inequality. To see that \(D_{f}\left(\theta_{0}\Vert\theta\right)\) is non-negative, notice that \(-\log\left(\cdot\right)\) is strictly convex, so by Jensen’s inequality \[\begin{aligned} E_{\theta_{0}}\left[\log\frac{f_{z}\left(z_{i};\theta_{0}\right)}{f_{z}\left(z_{i};\theta\right)}\right] & =E_{\theta_{0}}\left[-\log\frac{f_{z}\left(z_{i};\theta\right)}{f_{z}\left(z_{i};\theta_{0}\right)}\right]\geq-\log\left(E_{\theta_{0}}\left[\frac{f_{z}\left(z_{i};\theta\right)}{f_{z}\left(z_{i};\theta_{0}\right)}\right]\right)\\ & =-\log\left(\int\frac{f_{z}\left(z_{i};\theta\right)}{f_{z}\left(z_{i};\theta_{0}\right)}f_{z}\left(z_{i};\theta_{0}\right)dz_{i}\right)=-\log\left(\int f_{z}\left(z_{i};\theta\right)dz_{i}\right)\\ & =-\log1=0,\end{aligned}\] where \(\int f_{z}\left(z_{i};\theta\right)dz_{i}=1\) because \(f_{z}\left(\cdot;\theta\right)\) is a pdf. The equality holds if and only if \(f_{z}\left(z_{i};\theta\right)=f_{z}\left(z_{i};\theta_{0}\right)\) almost everywhere. Furthermore, if there is a one-to-one mapping between \(\theta\) and \(f_{z}\left(z_{i};\theta\right)\) on \(\Theta\) (identification), then \(\theta_{0}=\arg\min_{\theta\in\Theta}D_{f}\left(\theta_{0}\Vert\theta\right)\) is the unique minimizer.
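As a quick numerical illustration of non-negativity and asymmetry, the sketch below (our own R example, not part of the original notes; the two normal densities and the use of `integrate` are illustrative choices) computes the Kullback-Leibler divergence between two normal distributions in both directions.

```r
# Numerical check of the KL divergence between two normal densities:
# D(f0 || f1) = E_{f0}[ log f0(z) - log f1(z) ], computed by quadrature.
kl_normal <- function(m0, s0, m1, s1) {
  integrand <- function(z) {
    dnorm(z, m0, s0) * (dnorm(z, m0, s0, log = TRUE) - dnorm(z, m1, s1, log = TRUE))
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

kl_normal(0, 1, 0, 2)  # D( N(0,1) || N(0,4) ): non-negative
kl_normal(0, 2, 0, 1)  # D( N(0,4) || N(0,1) ): also non-negative, but different, so not symmetric
```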

In information theory, \(-E_{\theta_{0}}\left[\log f_{z}\left(z_{i};\theta_{0}\right)\right]\) is the entropy of the continuous distribution \(f_{z}\left(z_{i};\theta_{0}\right)\). Entropy measures the uncertainty of a random variable: the larger the value, the more uncertain the random variable. The Kullback-Leibler distance is the relative entropy between the distributions \(f_{z}\left(z_{i};\theta_{0}\right)\) and \(f_{z}\left(z_{i};\theta\right)\). It measures the inefficiency of assuming that the distribution is \(f_{z}\left(z_{i};\theta\right)\) when the true distribution is \(f_{z}\left(z_{i};\theta_{0}\right)\) (Cover and Thomas 2006, 19).

Consider the Gaussian location model \(z_{i}\sim N\left(\mu,1\right)\), where \(\mu\) is the unknown parameter to be estimated. The likelihood of observing \(z_{i}\) is \(f_{z}\left(z_{i};\mu\right)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(z_{i}-\mu\right)^{2}\right)\). The likelihood of observing the sample \(Z\) is \[f_{Z}\left(Z;\mu\right)=\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(z_{i}-\mu\right)^{2}\right)\] and the log-likelihood is \[L_{n}\left(\mu;Z\right)=-\frac{n}{2}\log\left(2\pi\right)-\frac{1}{2}\sum_{i=1}^{n}\left(z_{i}-\mu\right)^{2}.\] The (averaged) log-likelihood function for the \(n\) observations is \[\begin{aligned} \ell_{n}\left(\mu\right) & =-\frac{1}{2}\log\left(2\pi\right)-\frac{1}{2n}\sum_{i=1}^{n}\left(z_{i}-\mu\right)^{2}.\end{aligned}\] We work with the averaged log-likelihood \(\ell_{n}\), instead of the (raw) log-likelihood \(L_{n}\), to make it directly comparable with the expected log density \[\begin{aligned} E_{\mu_{0}}\left[\log f_{z}\left(z;\mu\right)\right] & =E_{\mu_{0}}\left[\ell_{n}\left(\mu\right)\right]\\ & =-\frac{1}{2}\log\left(2\pi\right)-\frac{1}{2}E_{\mu_{0}}\left[\left(z_{i}-\mu\right)^{2}\right]\\ & =-\frac{1}{2}\log\left(2\pi\right)-\frac{1}{2}E_{\mu_{0}}\left[\left(\left(z_{i}-\mu_{0}\right)+\left(\mu_{0}-\mu\right)\right)^{2}\right]\\ & =-\frac{1}{2}\log\left(2\pi\right)-\frac{1}{2}E_{\mu_{0}}\left[\left(z_{i}-\mu_{0}\right)^{2}\right]-E_{\mu_{0}}\left[z_{i}-\mu_{0}\right]\left(\mu_{0}-\mu\right)-\frac{1}{2}\left(\mu_{0}-\mu\right)^{2}\\ & =-\frac{1}{2}\log\left(2\pi\right)-\frac{1}{2}-\frac{1}{2}\left(\mu-\mu_{0}\right)^{2}.\end{aligned}\] where the first equality holds because of random sampling. Obviously, \(\ell_{n}\left(\mu\right)\) is maximized at \(\bar{z}=\frac{1}{n}\sum_{i=1}^{n}z_{i}\) while \(E_{\mu_{0}}\left[\ell_{n}\left(\mu\right)\right]\) is maximized at \(\mu=\mu_{0}\). The Kullback-Leibler divergence in this example is \[D\left(\mu_{0}\Vert\mu\right)=E_{\mu_{0}}\left[\ell_{n}\left(\mu_{0}\right)\right]-E_{\mu_{0}}\left[\ell_{n}\left(\mu\right)\right]=\frac{1}{2}\left(\mu-\mu_{0}\right)^{2},\] where \(-E_{\mu_{0}}\left[\ell_{n}\left(\mu_{0}\right)\right]=\frac{1}{2}\left(\log\left(2\pi\right)+1\right)\) is the entropy of the normal distribution with unit variance.

We use the following code to demonstrate the population log-likelihood \(E\left[\ell_{n}\left(\mu\right)\right]\) when \(\mu_{0}=2\) (solid line) and three sample realizations of \(\ell_{n}\left(\mu\right)\) with \(n=4\) (dashed lines).

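The original knitr chunk is not reproduced here; the following is a minimal R sketch of such a demonstration (the random seed and plotting details are our own choices).

```r
# Population (expected) average log-likelihood under mu_0 = 2, plus three sample
# realizations of the averaged log-likelihood with n = 4.
mu0 <- 2; n <- 4
mu_grid <- seq(mu0 - 3, mu0 + 3, by = 0.01)
pop_ll <- -0.5 * log(2 * pi) - 0.5 - 0.5 * (mu_grid - mu0)^2

set.seed(1)                               # assumed seed, for reproducibility
samp_ll <- replicate(3, {
  z <- rnorm(n, mean = mu0, sd = 1)
  sapply(mu_grid, function(mu) -0.5 * log(2 * pi) - 0.5 * mean((z - mu)^2))
})

plot(mu_grid, pop_ll, type = "l", lwd = 2, ylim = range(pop_ll, samp_ll),
     xlab = expression(mu), ylab = "average log-likelihood")
matlines(mu_grid, samp_ll, lty = 2, col = "black")   # dashed sample curves
```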

5.2 Likelihood Estimation for Regression

Notation: \(y_{i}\) is a scalar, and \(x_{i}=\left(x_{i1},\ldots,x_{iK}\right)'\) is a \(K\times1\) vector. \(Y\) is an \(n\times1\) vector, and \(X\) is an \(n\times K\) matrix.

In this chapter we employ the classical statistical framework under the restrictive distributional assumption \[y_{i}|x_{i}\sim N\left(x_{i}'\beta,\gamma\right),\label{eq:normal_yx}\] where we write \(\gamma=\sigma^{2}\) to ease the differentiation. This assumption is equivalent to \(e_{i}|x_{i}=\left(y_{i}-x_{i}'\beta\right)|x_{i}\sim N\left(0,\gamma\right)\). Because the conditional distribution of \(e_{i}\) does not vary with \(x_{i}\), the error term \(e_{i}\sim N\left(0,\gamma\right)\) unconditionally and is statistically independent of \(x_{i}\). This is a very strong assumption.

The likelihood of observing a pair \(\left(y_{i},x_{i}\right)\) is \[\begin{aligned} f_{yx}\left(y_{i},x_{i}\right) & =f_{y|x}\left(y_{i}|x_{i}\right)f_{x}\left(x_{i}\right)\\ & =\frac{1}{\sqrt{2\pi\gamma}}\exp\left(-\frac{1}{2\gamma}\left(y_{i}-x_{i}'\beta\right)^{2}\right)\times f_{x}\left(x_{i}\right),\end{aligned}\] where \(f_{yx}\) is the joint pdf, \(f_{y|x}\) is the conditional pdf, and \(f_{x}\) is the marginal pdf of \(x_{i}\), and the second equality holds under (\[eq:normal\_yx\]). The likelihood of the random sample \(\left(y_{i},x_{i}\right)_{i=1}^{n}\) is \[\begin{aligned} \prod_{i=1}^{n}f_{yx}\left(y_{i},x_{i}\right) & =\prod_{i=1}^{n}f_{y|x}\left(y_{i}|x_{i}\right)f_{x}\left(x_{i}\right)\\ & =\prod_{i=1}^{n}f_{y|x}\left(y_{i}|x_{i}\right)\times\prod_{i=1}^{n}f_{x}\left(x_{i}\right)\\ & =\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\gamma}}\exp\left(-\frac{1}{2\gamma}\left(y_{i}-x_{i}'\beta\right)^{2}\right)\times\prod_{i=1}^{n}f_{x}\left(x_{i}\right).\end{aligned}\] The parameters of interest \(\left(\beta,\gamma\right)\) do not enter the second term \(\prod_{i=1}^{n}f_{x}\left(x_{i}\right)\); they appear only in the conditional likelihood \[\prod_{i=1}^{n}f_{y|x}\left(y_{i}|x_{i}\right)=\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\gamma}}\exp\left(-\frac{1}{2\gamma}\left(y_{i}-x_{i}'\beta\right)^{2}\right).\] We therefore focus on the conditional likelihood. To facilitate the derivation, we work with the (averaged) conditional log-likelihood function \[\ell_{n}\left(\beta,\gamma\right)=-\frac{1}{2}\log2\pi-\frac{1}{2}\log\gamma-\frac{1}{2n\gamma}\sum_{i=1}^{n}\left(y_{i}-x_{i}'\beta\right)^{2},\] since taking the logarithm is a monotonic transformation and rescaling by \(1/n\) does not change the maximizer. The maximum likelihood estimators can be found from the first-order conditions (FOC): \[\begin{aligned} \frac{\partial}{\partial\beta}\ell_{n}\left(\beta,\gamma\right) & =\frac{1}{n\gamma}\sum_{i=1}^{n}x_{i}\left(y_{i}-x_{i}'\beta\right)=0\\ \frac{\partial}{\partial\gamma}\ell_{n}\left(\beta,\gamma\right) & =-\frac{1}{2\gamma}+\frac{1}{2n\gamma^{2}}\sum_{i=1}^{n}\left(y_{i}-x_{i}'\beta\right)^{2}=0.\end{aligned}\] Rearranging the above equations in matrix form: \[\begin{aligned} X'X\beta & =X'Y\\ \gamma & =\frac{1}{n}\left(Y-X\beta\right)'\left(Y-X\beta\right).\end{aligned}\] When \(X'X\) is invertible, we solve \[\begin{aligned} \widehat{\beta}_{MLE} & =(X'X)^{-1}X'Y\\ \widehat{\gamma}_{MLE} & =\frac{1}{n}\left(Y-X\widehat{\beta}_{MLE}\right)'\left(Y-X\widehat{\beta}_{MLE}\right)=\widehat{e}'\widehat{e}/n.\end{aligned}\] The MLE of the slope coefficient \(\widehat{\beta}_{MLE}\) coincides with the OLS estimator, and \(\widehat{e}\) is exactly the OLS residual.
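A minimal R sketch (our own simulated design; the seed, sample size, and coefficients are illustrative) confirming that the closed-form MLE of \(\beta\) equals the OLS estimator computed by `lm`, and that \(\widehat{\gamma}_{MLE}\) divides by \(n\) rather than \(n-K\):

```r
# Simulate a small dataset and confirm that the MLE of beta equals the OLS estimator
# and that the MLE of gamma divides the sum of squared residuals by n.
set.seed(42)                          # assumed seed, for reproducibility
n <- 100
X <- cbind(1, rnorm(n))               # intercept plus one regressor
beta <- c(1, 0.5)
y <- drop(X %*% beta) + rnorm(n)      # gamma = sigma^2 = 1

beta_hat  <- drop(solve(t(X) %*% X, t(X) %*% y))   # (X'X)^{-1} X'Y
e_hat     <- y - drop(X %*% beta_hat)
gamma_mle <- sum(e_hat^2) / n                      # divides by n, not n - K

cbind(closed_form = beta_hat, lm = coef(lm(y ~ X - 1)))   # identical up to rounding
gamma_mle
```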

5.3 Finite Sample Distribution

We can show the finite-sample exact distribution of \(\widehat{\beta}\) assuming the error term follows a Gaussian distribution. Finite sample distribution means that the distribution holds for any \(n\); it is in contrast to asymptotic distribution, which is a large sample approximation to the finite sample distribution. We first review some properties of a generic jointly normal random vector.

\[fact31\] Let \(z\sim N\left(\mu,\Omega\right)\) be an \(l\times1\) random vector with a positive definite variance-covariance matrix \(\Omega\). Let \(A\) be an \(m\times l\) non-random matrix where \(m\leq l\). Then \(Az\sim N\left(A\mu,A\Omega A'\right)\).

\[fact32\] If \(z\sim N\left(0,1\right)\), \(w\sim\chi^{2}\left(d\right)\), and \(z\) and \(w\) are independent, then \(\frac{z}{\sqrt{w/d}}\sim t\left(d\right)\).

The OLS estimator can be written as \[\widehat{\beta}=\left(X'X\right)^{-1}X'Y=\left(X'X\right)^{-1}X'\left(X\beta+e\right)=\beta+\left(X'X\right)^{-1}X'e,\] and its conditional distribution is \[\begin{aligned} \widehat{\beta}|X & =\beta+\left(X'X\right)^{-1}X'e|X\\ & \sim\beta+\left(X'X\right)^{-1}X'\cdot N\left(0_{n},\gamma I_{n}\right)\\ & \sim N\left(\beta,\gamma\left(X'X\right)^{-1}X'X\left(X'X\right)^{-1}\right)\sim N\left(\beta,\gamma\left(X'X\right)^{-1}\right)\end{aligned}\] by Fact \[fact31\]. The \(k\)-th element of the coefficient vector is \[\widehat{\beta}_{k}|X=\eta_{k}'\widehat{\beta}|X\sim N\left(\beta_{k},\gamma\eta_{k}'\left(X'X\right)^{-1}\eta_{k}\right)\sim N\left(\beta_{k},\gamma\left[\left(X'X\right)^{-1}\right]_{kk}\right),\] where \(\eta_{k}=\left(1\left\{ l=k\right\} \right)_{l=1,\ldots,K}\) is the selector of the \(k\)-th element.
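A small Monte Carlo sketch (our own choices of design, seed, and number of replications) that fixes \(X\), redraws normal errors, and compares the simulated standard deviation of \(\widehat{\beta}_{2}\) with the exact formula \(\sqrt{\gamma\left[\left(X'X\right)^{-1}\right]_{22}}\):

```r
# Fix X, redraw normal errors many times, and compare the simulated spread of
# beta_hat_2 with the exact N(beta_2, gamma * [(X'X)^{-1}]_{22}) distribution.
set.seed(7)                          # assumed seed
n <- 20; R <- 5000; gamma <- 1
X <- cbind(1, rnorm(n)); beta <- c(1, 0.5)
XtX_inv <- solve(t(X) %*% X)

b2 <- replicate(R, {
  e <- rnorm(n, sd = sqrt(gamma))
  (XtX_inv %*% t(X) %*% (X %*% beta + e))[2]    # second coefficient
})

c(sim_sd = sd(b2), exact_sd = sqrt(gamma * XtX_inv[2, 2]))  # should be close
```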

In reality, \(\gamma=\sigma^{2}\) is an unknown parameter, and \[s^{2}=\widehat{e}'\widehat{e}/\left(n-K\right)=e'M_{X}e/\left(n-K\right)\] is an unbiased estimator of \(\gamma\), because \[\begin{aligned} E\left[s^{2}|X\right] & =\frac{1}{n-K}E\left[e'M_{X}e|X\right]=\frac{1}{n-K}\mathrm{trace}\left(E\left[e'M_{X}e|X\right]\right)\\ & =\frac{1}{n-K}\mathrm{trace}\left(E\left[M_{X}ee'|X\right]\right)=\frac{1}{n-K}\mathrm{trace}\left(M_{X}E\left[ee'|X\right]\right)\\ & =\frac{1}{n-K}\mathrm{trace}\left(M_{X}\gamma I_{n}\right)=\frac{\gamma}{n-K}\mathrm{trace}\left(M_{X}\right)=\frac{\gamma}{n-K}\left(n-K\right)=\gamma,\end{aligned}\] where we use the property of the trace that \(\mathrm{trace}\left(AB\right)=\mathrm{trace}\left(BA\right)\) and the fact that \(\mathrm{trace}\left(M_{X}\right)=\mathrm{trace}\left(I_{n}\right)-\mathrm{trace}\left(X\left(X'X\right)^{-1}X'\right)=n-K\).

Consider the null hypothesis \(H_{0}:\beta_{k}=\beta_{k}^{*}\), where \(\beta_{k}^{*}\) is the hypothesized value we want to test. We can construct the \(t\)-statistic \[T_{k}=\frac{\widehat{\beta}_{k}-\beta_{k}^{*}}{\sqrt{s^{2}\left[\left(X'X\right)^{-1}\right]_{kk}}},\] which is feasible in the sense that it can be directly computed from the data: there is no unknown object in this statistic. When the null hypothesis is true, \(\beta_{k}=\beta_{k}^{*}\) and thus \[\begin{aligned} T_{k} & =\frac{\widehat{\beta}_{k}-\beta_{k}}{\sqrt{s^{2}\left[\left(X'X\right)^{-1}\right]_{kk}}}\nonumber \\ & =\frac{\widehat{\beta}_{k}-\beta_{k}}{\sqrt{\sigma^{2}\left[\left(X'X\right)^{-1}\right]_{kk}}}\cdot\frac{\sqrt{\sigma^{2}}}{\sqrt{s^{2}}}\nonumber \\ & =\frac{\left(\widehat{\beta}_{k}-\beta_{k}\right)/\sqrt{\sigma^{2}\left[\left(X'X\right)^{-1}\right]_{kk}}}{\sqrt{\frac{e'}{\sigma}M_{X}\frac{e}{\sigma}/\left(n-K\right)}},\label{eq:t-stat}\end{aligned}\] where we introduce the population quantity \(\sigma^{2}\) into the second equality to help derive the distributions of the numerator and the denominator of the last expression. The numerator \[\left(\widehat{\beta}_{k}-\beta_{k}\right)/\sqrt{\sigma^{2}\left[\left(X'X\right)^{-1}\right]_{kk}}\sim N\left(0,1\right),\] and the denominator \(\sqrt{\frac{e'}{\sigma}M_{X}\frac{e}{\sigma}/\left(n-K\right)}\) follows \(\sqrt{\frac{1}{n-K}\chi^{2}\left(n-K\right)}\) because \(M_{X}\) is an idempotent matrix of rank \(n-K\). Moreover, because \[\begin{aligned} \begin{bmatrix}\widehat{\beta}-\beta\\ \widehat{e} \end{bmatrix} & =\begin{bmatrix}\left(X'X\right)^{-1}X'e\\ M_{X}e \end{bmatrix}=\begin{bmatrix}\left(X'X\right)^{-1}X'\\ M_{X} \end{bmatrix}e\\ & \sim\begin{bmatrix}\left(X'X\right)^{-1}X'\\ M_{X} \end{bmatrix}\cdot N\left(0,\gamma I_{n}\right)\sim N\left(0,\gamma\begin{bmatrix}\left(X'X\right)^{-1} & 0\\ 0 & M_{X} \end{bmatrix}\right)\end{aligned}\] are jointly normal with zero off-diagonal blocks, \(\left(\widehat{\beta}-\beta\right)\) and \(\widehat{e}\) are statistically independent. (This claim is true even though the covariance matrix of \(\widehat{e}\) is singular.) Conditional on \(X\), which is treated as if non-random, the numerator and the denominator of (\[eq:t-stat\]) are statistically independent as well, since the former is a function of \(\left(\widehat{\beta}-\beta\right)\) and the latter is a function of \(\widehat{e}\). (Alternatively, the statistical independence can be verified by Basu's theorem; see Appendix \[subsec:Basu's-Theorem\].) As a result, we conclude \(T_{k}\sim t\left(n-K\right)\) by Fact \[fact32\]. This finite sample distribution allows us to conduct statistical inference.
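A minimal R sketch (simulated data with our own seed and parameters) that builds \(T_{k}\) by hand and checks it against the `t value` reported by `summary(lm())`:

```r
# Construct the t-statistic for H0: beta_2 = 0 by hand and compare with summary(lm()).
set.seed(123)                         # assumed seed
n <- 50
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
X <- cbind(1, x); K <- ncol(X)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)
e_hat    <- y - X %*% beta_hat
s2       <- sum(e_hat^2) / (n - K)                    # unbiased variance estimator
se_k     <- sqrt(s2 * solve(t(X) %*% X)[2, 2])        # standard error of beta_hat_2
T_k      <- beta_hat[2] / se_k                        # t-statistic for beta_2^* = 0

c(by_hand = T_k, lm = summary(lm(y ~ x))$coefficients["x", "t value"])
2 * pt(-abs(T_k), df = n - K)                         # p-value from the t(n - K) distribution
```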

5.4 Mean and Variance

Now we relax the normality assumption and the statistical independence between \(e_{i}\) and \(x_{i}\). Instead, we represent the regression model as \(Y=X\beta+e\) with \[\begin{aligned} E[e|X] & =0_{n},\\ \mathrm{var}\left[e|X\right] & =E\left[ee'|X\right]=\sigma^{2}I_{n},\end{aligned}\] where the first condition is the mean independence assumption and the second is the homoskedasticity assumption. These assumptions restrict only the first and second moments of \(e_{i}\) conditional on \(x_{i}\); unlike the normality assumption, they do not restrict the shape of the distribution of \(e_{i}\).

  • Unbiasedness: \[\begin{aligned} E\left[\widehat{\beta}|X\right] & =E\left[\left(X'X\right)^{-1}X'Y|X\right]=E\left[\left(X'X\right)^{-1}X'\left(X\beta+e\right)|X\right]\\ & =\beta+\left(X'X\right)^{-1}X'E\left[e|X\right]=\beta.\end{aligned}\] By the law of iterated expectations, the unconditional expectation \(E\left[\widehat{\beta}\right]=E\left[E\left[\widehat{\beta}|X\right]\right]=\beta.\) Unbiasedness does not rely on homoskedasticity.

  • Variance: \[\begin{aligned}\mathrm{var}\left[\widehat{\beta}|X\right] & =E\left[\left(\widehat{\beta}-E\left[\widehat{\beta}|X\right]\right)\left(\widehat{\beta}-E\left[\widehat{\beta}|X\right]\right)'|X\right]\\ & =E\left[\left(\widehat{\beta}-\beta\right)\left(\widehat{\beta}-\beta\right)'|X\right]\\ & =E\left[\left(X'X\right)^{-1}X'ee'X\left(X'X\right)^{-1}|X\right]\\ & =\left(X'X\right)^{-1}X'E\left[ee'|X\right]X\left(X'X\right)^{-1}, \end{aligned}\] where the second equality holds by unbiasedness, \(E\left[\widehat{\beta}|X\right]=\beta\), and the third equality uses \(\widehat{\beta}-\beta=\left(X'X\right)^{-1}X'e\).

  • Under the assumption of homoskedasticity, it can be simplified as \[\begin{aligned}\mathrm{var}\left[\widehat{\beta}|X\right] & =\left(X'X\right)^{-1}X'\left(\sigma^{2}I_{n}\right)X\left(X'X\right)^{-1}\\ & =\sigma^{2}\left(X'X\right)^{-1}X'I_{n}X\left(X'X\right)^{-1}\\ & =\sigma^{2}\left(X'X\right)^{-1}. \end{aligned}\]

(Heteroskedasticity) Suppose \(e_{i}=x_{i}u_{i}\), where \(x_{i}\) is a scalar random variable, \(u_{i}\) is statistically independent of \(x_{i}\), \(E\left[u_{i}\right]=0\), and \(E\left[u_{i}^{2}\right]=\sigma_{u}^{2}\). Then \(E\left[e_{i}|x_{i}\right]=E\left[x_{i}u_{i}|x_{i}\right]=x_{i}E\left[u_{i}|x_{i}\right]=0\), but \(E\left[e_{i}^{2}|x_{i}\right]=E\left[x_{i}^{2}u_{i}^{2}|x_{i}\right]=x_{i}^{2}E\left[u_{i}^{2}|x_{i}\right]=\sigma_{u}^{2}x_{i}^{2}\) is a function of \(x_{i}\). We say \(e_{i}\) is a heteroskedastic error.

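The original knitr chunk is not reproduced here; a minimal R sketch of such a demonstration (our own seed and plotting choices) visualizes the heteroskedastic error from the example above:

```r
# Visualize the heteroskedastic error e_i = x_i * u_i: its spread widens with |x_i|.
set.seed(2023)                      # assumed seed
n <- 200
x <- rnorm(n)
u <- rnorm(n)                       # independent of x, mean 0, variance 1
e <- x * u                          # E[e | x] = 0 but var(e | x) = x^2

plot(x, e, pch = 20, xlab = "x", ylab = "e = x * u")
abline(h = 0, lty = 2)
```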

It is important to notice that an independently and identically distributed (iid) sample \(\left(y_{i},x_{i}\right)\) does not imply homoskedasticity. Homoskedasticity or heteroskedasticity is about the relationship between \(x_{i}\) and \(e_{i}=y_{i}-x_{i}'\beta\) within an observation, whereas iid is about the relationship between \(\left(y_{i},x_{i}\right)\) and \(\left(y_{j},x_{j}\right)\) for \(i\neq j\) across observations.

5.5 Gauss-Markov Theorem

The Gauss-Markov theorem concerns the optimality of OLS. It justifies OLS as the efficient estimator among all linear unbiased estimators, where efficiency means having the smallest variance within that family.

We have shown that OLS is unbiased in that \(E\left[\widehat{\beta}\right]=\beta\). There are numerous linear unbiased estimators. For example, \(\left(Z'X\right)^{-1}Z'y\) with \(z_{i}=x_{i}^{2}\) is unbiased because \(E\left[\left(Z'X\right)^{-1}Z'y|X\right]=E\left[\left(Z'X\right)^{-1}Z'\left(X\beta+e\right)|X\right]=\beta+\left(Z'X\right)^{-1}Z'E\left[e|X\right]=\beta\). We cannot rank OLS against these other estimators by unbiasedness alone, since in this respect they are equally good. We therefore turn to the second-order property, the variance: an estimator is better if its variance is smaller.

For two generic random vectors \(X\) and \(Y\) of the same size, we say \(X\)’s variance is smaller or equal to \(Y\)’s variance if \(\left(\Omega_{Y}-\Omega_{X}\right)\) is a positive semi-definite matrix. The comparison is defined this way because for any non-zero constant vector \(c\), the variance of the linear combination of \(X\) \[\mathrm{var}\left(c'X\right)=c'\Omega_{X}c\leq c'\Omega_{Y}c=\mathrm{var}\left(c'Y\right)\] is no bigger than the same linear combination of \(Y\).

Let \(\tilde{\beta}=A'y\) be a generic linear estimator, where \(A\) is any \(n\times K\) matrix that may be a function of \(X\). Since \[E\left[A'y|X\right]=E\left[A'\left(X\beta+e\right)|X\right]=A'X\beta,\] linearity and unbiasedness of \(\tilde{\beta}\) require \(A'X=I_{K}\). Moreover, using \(A'y-\beta=A'e\), the variance is \[\mbox{var}\left(A'y|X\right)=E\left[\left(A'y-\beta\right)\left(A'y-\beta\right)'|X\right]=E\left[A'ee'A|X\right]=\sigma^{2}A'A.\] Let \(C=A-X\left(X'X\right)^{-1}\). Then \[\begin{aligned}A'A-\left(X'X\right)^{-1} & =\left(C+X\left(X'X\right)^{-1}\right)'\left(C+X\left(X'X\right)^{-1}\right)-\left(X'X\right)^{-1}\\ & =C'C+\left(X'X\right)^{-1}X'C+C'X\left(X'X\right)^{-1}\\ & =C'C, \end{aligned}\] where the last equality follows because \(X'A=\left(A'X\right)'=I_{K}\) and thus \[\left(X'X\right)^{-1}X'C=\left(X'X\right)^{-1}X'\left(A-X\left(X'X\right)^{-1}\right)=\left(X'X\right)^{-1}X'A-\left(X'X\right)^{-1}X'X\left(X'X\right)^{-1}=\left(X'X\right)^{-1}-\left(X'X\right)^{-1}=0.\] Therefore \(A'A-\left(X'X\right)^{-1}\) is a positive semi-definite matrix, and the variance of any such \(\tilde{\beta}\) is no smaller than that of the OLS estimator \(\widehat{\beta}\). The above derivation shows that OLS achieves the smallest variance among all linear unbiased estimators.
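A small Monte Carlo sketch (our own design and seed) comparing OLS with the alternative linear unbiased estimator \(\left(Z'X\right)^{-1}Z'y\), \(z_{i}=x_{i}^{2}\), in a simple regression without intercept:

```r
# Monte Carlo comparison of OLS with the alternative linear unbiased estimator
# (Z'X)^{-1} Z'y, where z_i = x_i^2, in a simple regression without intercept.
set.seed(99)                        # assumed seed
n <- 50; R <- 5000; beta <- 0.5
x <- rnorm(n) + 2                   # fixed design across replications
z <- x^2

est <- replicate(R, {
  e <- rnorm(n)                     # homoskedastic errors
  y <- x * beta + e
  c(ols = sum(x * y) / sum(x^2),    # (X'X)^{-1} X'y with a single regressor
    alt = sum(z * y) / sum(z * x))  # (Z'X)^{-1} Z'y
})

rowMeans(est)                        # both approximately unbiased for beta = 0.5
apply(est, 1, var)                   # OLS has the smaller variance, as Gauss-Markov predicts
```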

Homoskedasticity is a restrictive assumption. Under homoskedasticity, \(\mathrm{var}\left[\widehat{\beta}|X\right]=\sigma^{2}\left(X'X\right)^{-1}\). A popular estimator of \(\sigma^{2}\) is the sample average of the squared residuals, \(\widehat{\sigma}^{2}=\frac{1}{n}\widehat{e}'\widehat{e}\), or its unbiased counterpart \(s^{2}=\frac{1}{n-K}\widehat{e}'\widehat{e}\). Under heteroskedasticity, the Gauss-Markov theorem does not apply.

5.6 Summary

The exact distribution under the normality assumption on the error term is a classical statistical result. The Gauss-Markov theorem holds under two crucial assumptions: a linear CEF and homoskedasticity.

Historical notes: MLE was promulgated and popularized by Ronald Fisher (1890–1962). He was a major contributor to the frequentist approach that dominates mathematical statistics today, and he sharply criticized the Bayesian approach. The iris flower dataset of 150 observations, collected by the botanist Edgar Anderson, was used by Fisher in his 1936 taxonomic study; it can be displayed in R by typing `iris`. Fisher invented many of the concepts in classical mathematical statistics, such as the sufficient statistic, the ancillary statistic, completeness, and the exponential family.

Further reading: Phillips (1983) offered a comprehensive treatment of exact small sample theory in econometrics. After that, theoretical studies in econometrics swiftly shifted to large sample theory, which we will introduce in the next chapter.

5.7 Appendix

5.7.1 Joint Normal Distribution

It is arguable that the normal distribution is the most frequently encountered distribution in statistical inference, as it is the asymptotic distribution of many popular estimators. Moreover, it boasts some unique features that facilitate the calculation of objects of interest. This appendix summarizes a few of them.

An \(n\times1\) random vector \(Y\) follows a joint normal distribution \(N\left(\mu,\Sigma\right)\), where \(\mu\) is an \(n\times1\) vector and \(\Sigma\) is an \(n\times n\) symmetric positive definite matrix. The probability density function is \[f_{y}\left(y\right)=\left(2\pi\right)^{-n/2}\left(\mathrm{det}\left(\Sigma\right)\right)^{-1/2}\exp\left(-\frac{1}{2}\left(y-\mu\right)'\Sigma^{-1}\left(y-\mu\right)\right)\] where \(\mathrm{det}\left(\cdot\right)\) is the determinant of a matrix. The moment generating function is \[M_{y}\left(t\right)=\exp\left(t'\mu+\frac{1}{2}t'\Sigma t\right).\]

We will discuss the relationship between two components of a random vector. To fix notation, write \[Y=\left(\begin{array}{c} Y_{1}\\ Y_{2} \end{array}\right)\sim N\left(\left(\begin{array}{c} \mu_{1}\\ \mu_{2} \end{array}\right),\left(\begin{array}{cc} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{array}\right)\right),\] where \(Y_{1}\) is an \(m\times1\) vector and \(Y_{2}\) is an \(\left(n-m\right)\times1\) vector, \(\mu_{1}\) and \(\mu_{2}\) are the corresponding mean vectors, and \(\Sigma_{ij}\), \(i,j=1,2\), are the corresponding variance and covariance blocks. From now on, we always maintain the assumption that \(Y=\left(Y_{1}',Y_{2}'\right)'\) is jointly normal.

Fact \[fact31\] immediately implies a convenient feature of the normal distribution. Generally speaking, if we are given the joint pdf of two random variables and want the marginal distribution of one of them, we need to integrate out the other variable from the joint pdf. However, if the variables are jointly normal, the information about the other component is irrelevant to the marginal distribution of the component of interest: we only need its own mean \(\mu_{1}\) and variance \(\Sigma_{11}\) to determine the marginal distribution of \(Y_{1}\).

\[fact:marginal\]The marginal distribution \(Y_{1}\sim N\left(\mu_{1},\Sigma_{11}\right)\).

This result is very convenient if we are interested in some components of an estimator, but not the entire vector. For example, the OLS estimator of the linear regression model \(y_{i}=x_{i}'\beta+e_{i}\), under the classical assumptions of (i) random sampling, (ii) independence of \(x_{i}\) and \(e_{i}\), and (iii) \(e_{i}\sim N\left(0,\gamma\right)\), is \[\widehat{\beta}=\left(X'X\right)^{-1}X'Y,\] and the finite sample exact distribution of \(\widehat{\beta}\) is \[\left(\widehat{\beta}-\beta\right)|X\sim N\left(0,\gamma\left(X'X\right)^{-1}\right).\] If we are interested in inference on only the \(k\)-th component \(\beta_{k}\), then from Fact \[fact:marginal\], \[\left(\widehat{\beta}_{k}-\beta_{k}\right)|X\sim N\left(0,\gamma\left[\left(X'X\right)^{-1}\right]_{kk}\right),\] where \(\left[\left(X'X\right)^{-1}\right]_{kk}\) is the \(k\)-th diagonal element of \(\left(X'X\right)^{-1}\). The marginal distribution does not involve the other components. This saves us from integrating out the other components, which could be troublesome if the dimension of the vector is high.

Generally, zero covariance of two random variables only indicates that they are uncorrelated, whereas full statistical independence is a much stronger requirement. However, if \(Y_{1}\) and \(Y_{2}\) are jointly normal, then zero covariance is equivalent to full independence.

If \(\Sigma_{12}=0\), then \(Y_{1}\) and \(Y_{2}\) are independent.

If \(\Sigma\) is invertible, then \(\left(Y-\mu\right)'\Sigma^{-1}\left(Y-\mu\right)\sim\chi^{2}\left(n\right)\).

The last result, which is useful in linear regression, is that if \(Y_{1}\) and \(Y_{2}\) are jointly normal, the conditional distribution of \(Y_{1}\) given \(Y_{2}\) is still normal, with the mean and variance specified in the following fact.

\(Y_{1}|Y_{2}\sim N\left(\mu_{1}+\Sigma_{12}\Sigma_{22}^{-1}\left(Y_{2}-\mu_{2}\right),\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)\).
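A simulation sketch (our own parameter values and seed) that checks the conditional mean and variance formula in the bivariate case by conditioning on a narrow bin of \(Y_{2}\):

```r
# Simulate a bivariate normal and check the conditional-distribution formula
# Y1 | Y2 ~ N( mu1 + S12/S22 * (Y2 - mu2), S11 - S12^2/S22 ) in the scalar case.
set.seed(8)                          # assumed seed
mu <- c(1, 2)
Sigma <- matrix(c(2, 0.8, 0.8, 1), 2, 2)
R <- 200000
Z <- matrix(rnorm(2 * R), R, 2)
Y <- sweep(Z %*% chol(Sigma), 2, mu, "+")   # rows are draws of (Y1, Y2)

# condition on Y2 falling in a narrow bin around y2 = 2.5
y2 <- 2.5
sel <- abs(Y[, 2] - y2) < 0.05
c(sim_mean = mean(Y[sel, 1]),
  theory_mean = mu[1] + Sigma[1, 2] / Sigma[2, 2] * (y2 - mu[2]))
c(sim_var = var(Y[sel, 1]),
  theory_var = Sigma[1, 1] - Sigma[1, 2]^2 / Sigma[2, 2])
```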

5.7.2 Basu's Theorem*\[subsec:Basu's-Theorem\]

\(Y=\left(y_{1},\ldots,y_{n}\right)\) consists of \(n\) iid observations. We say \(T\left(Y\right)\) is a sufficient statistic for a parameter \(\theta\) if the conditional probability \(f\left(Y|T\left(Y\right)\right)\) does not depend on \(\theta\). We say \(S\left(Y\right)\) is an ancillary statistic for \(\theta\) if its distribution does not depend on \(\theta\).

Basu’s theorem says that a complete sufficient statistic is statistically independent from any ancillary statistic.

The sufficient statistic is closely related to the exponential family in classical mathematical statistics. A parametric distribution indexed by \(\theta\) is a member of the exponential family if its pdf can be written as \[f\left(Y|\theta\right)=h\left(Y\right)g\left(\theta\right)\exp\left(\eta\left(\theta\right)'T\left(Y\right)\right),\] where \(g\left(\theta\right)\) and \(\eta\left(\theta\right)\) are functions that depend only on \(\theta\), and \(h\left(Y\right)\) and \(T\left(Y\right)\) are functions that depend only on \(Y\).

(Univariate Gaussian location model.) For a normal distribution \(y_{i}\sim N\left(\mu,\gamma\right)\) with known \(\gamma\) and unknown \(\mu\), the sample mean \(\bar{y}\) is a sufficient statistic for \(\mu\) and the sample variance \(s^{2}\) is an ancillary statistic.

We first verify that the sample mean \(\bar{y}=n^{-1}\sum_{i=1}^{n}y_{i}\) is a sufficient statistic for \(\mu\). Notice that the joint density of \(Y\) is \[\begin{aligned} f\left(Y\right) & =\left(2\pi\gamma\right)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\gamma}\sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}\right)\\ & =\left(2\pi\gamma\right)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\gamma}\sum_{i=1}^{n}\left(\left(y_{i}-\bar{y}\right)+\left(\bar{y}-\mu\right)\right)^{2}\right)\\ & =\left(2\pi\gamma\right)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\gamma}\sum_{i=1}^{n}\left(\left(y_{i}-\bar{y}\right)^{2}+2\left(y_{i}-\bar{y}\right)\left(\bar{y}-\mu\right)+\left(\bar{y}-\mu\right)^{2}\right)\right)\\ & =\left(2\pi\gamma\right)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\gamma}\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}\right)\exp\left(-\frac{n}{2\gamma}\left(\bar{y}-\mu\right)^{2}\right).\end{aligned}\] Because \(\bar{y}\sim N\left(\mu,\gamma/n\right),\) the marginal density is \[f\left(\bar{y}\right)=\left(2\pi\gamma/n\right)^{-1/2}\exp\left(-\frac{n}{2\gamma}\left(\bar{y}-\mu\right)^{2}\right).\] Because \(\bar{y}\) is a function of \(Y\), we have \(f\left(Y,\bar{y}\right)=f\left(Y\right)\). The conditional density \[f\left(Y|\bar{y}\right)=\frac{f\left(Y,\bar{y}\right)}{f\left(\bar{y}\right)}=\frac{f\left(Y\right)}{f\left(\bar{y}\right)}=\frac{1}{\sqrt{n}}\left(2\pi\gamma\right)^{-\frac{n-1}{2}}\exp\left(-\frac{1}{2\gamma}\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}\right)\] does not depend on \(\mu\), and thus \(\bar{y}\) is a sufficient statistic for \(\mu\). Meanwhile, the sample variance \(s^{2}=\frac{1}{n-1}\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}\) is an ancillary statistic for \(\mu\), because the distribution of \(s^{2}\) does not depend on \(\mu.\)
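A simulation sketch (our own seed and sample sizes) consistent with Basu's theorem: \(\bar{y}\) and \(s^{2}\) are uncorrelated, and the distribution of \(s^{2}\) is unaffected by \(\mu\):

```r
# Illustrate Basu's theorem in the Gaussian location model: the sufficient statistic
# ybar and the ancillary statistic s^2 are independent, and the distribution of s^2
# does not depend on mu.
set.seed(2024)                      # assumed seed
n <- 10; R <- 10000

draw_stats <- function(mu) {
  t(replicate(R, {
    y <- rnorm(n, mean = mu, sd = 1)
    c(ybar = mean(y), s2 = var(y))
  }))
}

stats_mu0 <- draw_stats(mu = 0)
stats_mu5 <- draw_stats(mu = 5)

cor(stats_mu0[, "ybar"], stats_mu0[, "s2"])        # approximately zero
c(mean_s2_mu0 = mean(stats_mu0[, "s2"]),           # the average of s^2 ...
  mean_s2_mu5 = mean(stats_mu5[, "s2"]))           # ... is unaffected by mu (both near 1)
```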

The normal distribution with known \(\gamma\) and unknown \(\mu\) belongs to the exponential family in view of the decomposition \[\begin{aligned} f(Y) & =\left(2\pi\gamma\right)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\gamma}\sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}\right)\\ & =\underbrace{\exp\left(-\sum_{i=1}^{n}\frac{y_{i}^{2}}{2\gamma}\right)}_{h\left(Y\right)}\cdot\underbrace{\left(2\pi\gamma\right)^{-\frac{n}{2}}\exp\left(-\frac{n}{2\gamma}\mu^{2}\right)}_{g\left(\theta\right)}\cdot\underbrace{\exp\left(\frac{\mu}{\gamma}n\bar{y}\right)}_{\exp\left(\eta\left(\theta\right)'T\left(Y\right)\right)}.\end{aligned}\] The exponential family is a class of distributions with a special functional form that is convenient for deriving sufficient statistics, as well as other desirable properties in classical mathematical statistics.

(Conditional Gaussian location model.) If \(y_{i}|x_{i}\sim N\left(x_{i}'\beta,\gamma\right)\) with known \(\gamma\) and unknown \(\beta\), we verify that the OLS estimator \(\widehat{\beta}\) is a sufficient statistic for \(\beta\). Notice that the joint density of \(Y\) given \(X\) is \[\begin{aligned} f\left(Y|X\right) & =\left(2\pi\gamma\right)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\gamma}\sum_{i=1}^{n}\left(y_{i}-x_{i}'\beta\right)^{2}\right)\\ & =\left(2\pi\gamma\right)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\gamma}\left(Y-X\widehat{\beta}\right)'\left(Y-X\widehat{\beta}\right)\right)\exp\left(-\frac{1}{2\gamma}\left(\widehat{\beta}-\beta\right)'X'X\left(\widehat{\beta}-\beta\right)\right).\end{aligned}\] Because \(\widehat{\beta}\sim N\left(\beta,\gamma\left(X'X\right)^{-1}\right),\) the marginal density is \[f\left(\widehat{\beta}|X\right)=\left(2\pi\gamma\right)^{-\frac{K}{2}}\left(\mathrm{det}\left(\left(X'X\right)^{-1}\right)\right)^{-1/2}\exp\left(-\frac{1}{2\gamma}\left(\widehat{\beta}-\beta\right)'X'X\left(\widehat{\beta}-\beta\right)\right).\] The conditional density \[\begin{aligned} f\left(Y|\widehat{\beta},X\right) & =\frac{f\left(Y|X\right)}{f\left(\widehat{\beta}|X\right)}\\ & =\left(2\pi\gamma\right)^{-\frac{n-K}{2}}\left(\mathrm{det}\left(\left(X'X\right)^{-1}\right)\right)^{1/2}\exp\left(-\frac{1}{2\gamma}\left(Y-X\widehat{\beta}\right)'\left(Y-X\widehat{\beta}\right)\right)\end{aligned}\] does not depend on \(\beta\), and thus \(\widehat{\beta}\) is a sufficient statistic for \(\beta\).

Meanwhile, \(s^{2}=\frac{1}{n-K}\sum_{i=1}^{n}\left(y_{i}-x_{i}'\widehat{\beta}\right)^{2}\) is an ancillary statistic for \(\beta\), because the distribution of \(s^{2}\) does not depend on \(\beta.\)

Zhentao Shi. Oct 10.