4 Least Squares: Linear Algebra

Ordinary least squares (OLS) is the most basic estimation technique in econometrics. It is simple and transparent, and understanding it thoroughly paves the way to study more sophisticated linear estimators. Moreover, many nonlinear estimators, once linearized around a point, behave much like linear estimators in a neighborhood of the true value. In this lecture, we work through a series of linear algebra properties of the least squares estimator.

Least squares is a general-purpose statistical method: its popularity goes far beyond our dismal science, and it is widely used in the other branches of statistics as well. To paraphrase Leopold Kronecker's famous saying "God made the integers; all else is the work of man," I would say "Gauss made OLS; all else is the work of applied researchers." But be aware that OLS is a purely statistical, or supervised machine learning, operation: it reveals correlation, not causality. It is economic theory that hypothesizes causality, while data are collected to test the theory or to quantify the effect.

Notation: \(y_{i}\) is a scalar. \(x_{i}=\left(x_{i1},\ldots,x_{iK}\right)'\) is a \(K\times1\) vector. \(Y=\left(y_{1},\ldots,y_{n}\right)'\) is an \(n\times1\) vector. \[ X=\left[\begin{array}{c} x_{1}'\\ x_{2}'\\ \vdots\\ x_{n}' \end{array}\right]=\left[\begin{array}{cccc} x_{11} & x_{12} & \cdots & x_{1K}\\ x_{21} & x_{22} & \cdots & x_{2K}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \cdots & x_{nK} \end{array}\right] \] is an \(n\times K\) matrix. \(I_{n}\) is the \(n\times n\) identity matrix.

4.1 Estimator

As we have learned from the linear projection model in the previous lecture, the projection coefficient \(\beta\) in the regression \[ \begin{aligned}y & =x'\beta+e\end{aligned} \] can be written as \[ \beta=\left(E\left[xx'\right]\right)^{-1}E\left[xy\right].\label{eq:pop_OLS} \] We draw a pair of observations \(\left(y_{i},x_{i}\right)\) from the joint distribution of \(\left(y,x\right)\) in each of \(i=1,\ldots,n\) repeated experiments. We thus possess a sample \(\left(y_{i},x_{i}\right)_{i=1}^{n}\).

  • 1.1. Is \(\left(y_{i},x_{i}\right)\) random or deterministic? Before we make the observation, they are treated as random variables whose realized values are uncertain; after we make the observation, they become deterministic values which cannot vary anymore. We treat \(\left(y_{i},x_{i}\right)\) as random whenever we talk about statistical properties, because statistical properties of a fixed number are meaningless.

  • 1.2. In reality, we have at hand fixed numbers (more recently, also words, photos, audio clips, video clips, and so on, all of which can be represented digitally with 0s and 1s) to feed into a computational operation, and the operation returns one or several numbers. All statistical interpretation of these numbers is drawn from probabilistic thought experiments. A *thought experiment* is academic jargon for a story told in plain language. Under the axiomatic approach to probability theory, such stories are mathematically consistent and coherent. But mathematics is a tautological system, not science. The scientific value of a probability model depends on how close it comes to the truth, or to implications of the truth, and on whether it helps us predict. In this course, we suppose that the data are generated from some mechanism, which is taken as the truth. In the linear regression model, for example, the joint distribution of \(\left(y,x\right)\) is the truth, while we are interested in the linear projection coefficient \(\beta\), which is an implication of the truth as in (\[eq:pop_OLS\]).

The sample mean is a natural estimator of the population mean. Replace the population mean \(E\left[\cdot\right]\) in (\[eq:pop_OLS\]) by the sample mean \(\frac{1}{n}\sum_{i=1}^{n}\cdot\), and the resulting estimator is \[\begin{aligned} \widehat{\beta} & =\left(\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}'\right)^{-1}\frac{1}{n}\sum_{i=1}^{n}x_{i}y_{i}\\ & =\left(\frac{X'X}{n}\right)^{-1}\frac{X'Y}{n}=\left(X'X\right)^{-1}X'Y\end{aligned}\] if \(X'X\) is invertible. This is one way to motivate the OLS estimator.
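A quick numerical check helps fix ideas. The sketch below (simulated data; the design, sample size, and seed are my own choices, not from the text) computes \(\widehat{\beta}\) by the closed-form formula and compares it with R's built-in `lm()`:

```r
# Minimal sketch: compare the closed-form OLS formula with lm().
# The simulated design, sample size, and seed are illustrative choices.
set.seed(2020)
n <- 100
K <- 3
X <- cbind(1, matrix(rnorm(n * (K - 1)), n, K - 1))  # first column is the constant
beta_true <- c(0.5, 2, -1)
Y <- drop(X %*% beta_true) + rnorm(n)

b_formula <- solve(t(X) %*% X, t(X) %*% Y)  # (X'X)^{-1} X'Y
b_lm <- coef(lm(Y ~ X - 1))                 # "-1" because X already contains a constant

cbind(b_formula, b_lm)                      # the two columns coincide
```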

Alternatively, we can derive the OLS estimator by minimizing the sum of squared residuals \(\sum_{i=1}^{n}\left(y_{i}-x_{i}'b\right)^{2}\), or equivalently \[ Q\left(b\right)=\frac{1}{2n}\sum_{i=1}^{n}\left(y_{i}-x_{i}'b\right)^{2}=\frac{1}{2n}\left(Y-Xb\right)'\left(Y-Xb\right)=\frac{1}{2n}\left\Vert Y-Xb\right\Vert ^{2}, \] where the factor \(\frac{1}{2n}\) is nonrandom and does not change the minimizer, and \(\left\Vert \cdot\right\Vert\) is the Euclidean norm of a vector. Solve the first-order condition \[ \frac{\partial}{\partial b}Q\left(b\right)=\left[\begin{array}{c} \partial Q\left(b\right)/\partial b_{1}\\ \partial Q\left(b\right)/\partial b_{2}\\ \vdots\\ \partial Q\left(b\right)/\partial b_{K} \end{array}\right]=-\frac{1}{n}X'\left(Y-Xb\right)=0. \] This necessary condition for optimality gives exactly the same \(\widehat{\beta}=\left(X'X\right)^{-1}X'Y\). Moreover, the second-order condition \[ \frac{\partial^{2}}{\partial b\partial b'}Q\left(b\right)=\left[\begin{array}{cccc} \frac{\partial^{2}}{\partial b_{1}^{2}}Q\left(b\right) & \frac{\partial^{2}}{\partial b_{2}\partial b_{1}}Q\left(b\right) & \cdots & \frac{\partial^{2}}{\partial b_{K}\partial b_{1}}Q\left(b\right)\\ \frac{\partial^{2}}{\partial b_{1}\partial b_{2}}Q\left(b\right) & \frac{\partial^{2}}{\partial b_{2}^{2}}Q\left(b\right) & \cdots & \frac{\partial^{2}}{\partial b_{K}\partial b_{2}}Q\left(b\right)\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^{2}}{\partial b_{1}\partial b_{K}}Q\left(b\right) & \frac{\partial^{2}}{\partial b_{2}\partial b_{K}}Q\left(b\right) & \cdots & \frac{\partial^{2}}{\partial b_{K}^{2}}Q\left(b\right) \end{array}\right]=\frac{1}{n}X'X \] shows that \(Q\left(b\right)\) is convex in \(b\) because the matrix \(X'X/n\) is positive semi-definite. (The function \(Q\left(b\right)\) is strictly convex in \(b\) if \(X'X/n\) is positive definite.)
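The first-order condition can also be verified numerically: minimizing \(Q\left(b\right)\) with a generic optimizer should return, up to numerical tolerance, the same \(\widehat{\beta}\) as the closed form. A minimal sketch, again with simulated data of my own choosing:

```r
# Sketch: minimize Q(b) = (2n)^{-1} ||Y - Xb||^2 numerically and compare
# with the closed-form solution. Simulated, illustrative data.
set.seed(2020)
n <- 200; K <- 3
X <- cbind(1, matrix(rnorm(n * (K - 1)), n, K - 1))
Y <- drop(X %*% c(0.5, 2, -1)) + rnorm(n)

Q     <- function(b) sum((Y - X %*% b)^2) / (2 * n)      # objective
gradQ <- function(b) drop(-t(X) %*% (Y - X %*% b) / n)   # first-order condition

b_numeric <- optim(par = rep(0, K), fn = Q, gr = gradQ, method = "BFGS")$par
b_closed  <- drop(solve(t(X) %*% X, t(X) %*% Y))

rbind(numerical = b_numeric, closed_form = b_closed)     # agree up to tolerance
```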

  • 1.3. In the derivation of OLS we presume that the \(K\) columns of \(X\) are linearly independent, meaning that there is no \(K\times1\) vector \(b\) such that \(b\neq0_{K}\) and \(Xb=0_{n}\). Linear independence of the columns implies \(n\geq K\) and the invertibility of \(X'X/n\). Linear independence is violated when some regressors are perfectly collinear, for example when we use dummy variables to indicate categorical variables and put all of the categories into the regression. Modern econometric software automatically detects and reports perfect collinearity. What is treacherous is *near collinearity*, meaning that the minimal eigenvalue of \(X'X/n\) is close to 0, though not exactly equal to 0. We will discuss the consequences of near collinearity in the chapter on asymptotic theory.
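The following sketch contrasts perfect and near collinearity through the smallest eigenvalue of \(X'X/n\); the design (one regressor equal to twice another, plus a tiny noise in the "near" case) is purely illustrative:

```r
# Sketch: perfect versus near collinearity, seen through the smallest
# eigenvalue of X'X/n. The design is purely illustrative.
set.seed(2020)
n <- 100
x1 <- rnorm(n)
x2_perfect <- 2 * x1                        # exactly collinear with x1
x2_near    <- 2 * x1 + rnorm(n, sd = 1e-4)  # nearly collinear with x1

min_eigen <- function(X) min(eigen(crossprod(X) / nrow(X))$values)
min_eigen(cbind(1, x1, x2_perfect))  # numerically zero
min_eigen(cbind(1, x1, x2_near))     # tiny but positive

# lm() detects exact collinearity and reports NA for the dropped column;
# under near collinearity it returns estimates, but they are imprecise.
y <- 1 + x1 + rnorm(n)
coef(lm(y ~ x1 + x2_perfect))
coef(lm(y ~ x1 + x2_near))
```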

Here are some definitions and properties of the OLS estimator.

  • Fitted value: \(\widehat{Y}=X\widehat{\beta}\).

  • Projection matrix: \(P_{X}=X\left(X'X\right)^{-1}X'\); Residual maker matrix: \(M_{X}=I_{n}-P_{X}\).

  • \(P_{X}X=X\); \(X'P_{X}=X'\).

  • \(M_{X}X=0_{n\times K}\); \(X'M_{X}=0_{K\times n}\).

  • \(P_{X}M_{X}=M_{X}P_{X}=0_{n\times n}\).

  • If \(AA=A\), we call \(A\) an idempotent matrix. Both \(P_{X}\) and \(M_{X}\) are idempotent. All eigenvalues of an idempotent matrix must be either 1 or 0.

  • \(\mathrm{rank}\left(P_{X}\right)=K\), and \(\mathrm{rank}\left(M_{X}\right)=n-K\) (See the Appendix of this chapter).

  • Residual: \(\widehat{e}=Y-\widehat{Y}=Y-X\widehat{\beta}=Y-X(X'X)^{-1}X'Y=(I_{n}-P_{X})Y=M_{X}Y=M_{X}\left(X\beta+e\right)=M_{X}e\). Notice \(\widehat{e}\) and \(e\) are two different objects.

  • \(X'\widehat{e}=X'M_{X}e=0_{K}\).

  • \(\sum_{i=1}^{n}\widehat{e}_{i}=0\) if \(x_{i}\) contains a constant.

    (Because \(X'\widehat{e}=\left[\begin{array}{cccc} 1 & 1 & \cdots & 1\\ \heartsuit & \heartsuit & \cdots & \heartsuit\\ \vdots & \vdots & \ddots & \vdots\\ \heartsuit & \heartsuit & \cdots & \heartsuit \end{array}\right]\left[\begin{array}{c} \widehat{e}_{1}\\ \widehat{e}_{2}\\ \vdots\\ \widehat{e}_{n} \end{array}\right]=\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 0 \end{array}\right]\) , the first row implies \(\sum_{i=1}^{n}\widehat{e}_{i}=0\). "\(\heartsuit\)" indicates the entries irrelevant to our purpose.)

The operation of OLS bears a natural geometric interpretation. Notice that \(\mathcal{X}=\left\{ Xb:b\in\mathbb{R}^{K}\right\}\) is the linear space spanned by the \(K\) columns of \(X=\left[X_{\cdot1},\ldots,X_{\cdot K}\right]\), which is \(K\)-dimensional if the columns are linearly independent. The OLS estimator solves \(\min_{b\in\mathbb{R}^{K}}\left\Vert Y-Xb\right\Vert\) (whether or not we square the Euclidean norm does not change the minimizer, because \(a\mapsto a^{2}\) is a monotonic transformation for \(a\geq0\)). In other words, \(X\widehat{\beta}\) is the point in \(\mathcal{X}\) closest to the vector \(Y\) in terms of the Euclidean norm.

The relationship \(Y=X\widehat{\beta}+\widehat{e}\) decomposes \(Y\) into two orthogonal vectors \(X\widehat{\beta}\) and \(\widehat{e}\), since \(\left\langle X\widehat{\beta},\widehat{e}\right\rangle =\widehat{\beta}'X'\widehat{e}=0\), where \(\left\langle \cdot,\cdot\right\rangle\) is the inner product of two vectors. Therefore \(X\widehat{\beta}\) is the projection of \(Y\) onto \(\mathcal{X}\), and \(\widehat{e}\) is the corresponding projection residual. The Pythagorean theorem implies \[\left\Vert Y\right\Vert ^{2}=\Vert X\widehat{\beta}\Vert^{2}+\left\Vert \widehat{e}\right\Vert ^{2}.\]
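The algebraic properties above are easy to confirm numerically. A minimal sketch with simulated data (the design and all names are my own) checks the projection identities and the Pythagorean decomposition:

```r
# Sketch: numerical check of the projection identities and the Pythagorean
# decomposition listed above. Simulated, illustrative data with a constant.
set.seed(2020)
n <- 50; K <- 3
X <- cbind(1, matrix(rnorm(n * (K - 1)), n, K - 1))
Y <- drop(X %*% c(0.5, 2, -1)) + rnorm(n)

P <- X %*% solve(t(X) %*% X) %*% t(X)    # projection matrix P_X
M <- diag(n) - P                         # residual maker M_X
b_hat <- solve(t(X) %*% X, t(X) %*% Y)
e_hat <- drop(M %*% Y)                   # residual

max(abs(P %*% X - X))      # P_X X = X
max(abs(M %*% X))          # M_X X = 0
max(abs(P %*% M))          # P_X M_X = 0
max(abs(t(X) %*% e_hat))   # X' e_hat = 0
sum(e_hat)                 # sums to zero because X contains a constant
sum(Y^2) - sum((X %*% b_hat)^2) - sum(e_hat^2)  # Pythagorean theorem, ~ 0
```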

**Example 4.1**. Here is a simple simulated example to demonstrate the properties of OLS. Given \(\left(x_{1i},x_{2i},x_{3i},e_{i}\right)^{\prime}\sim N\left(0_{4},I_{4}\right)\), the dependent variable \(y_{i}\) is generated from \[ y_{i}=0.5+2\cdot x_{1i}-1\cdot x_{2i}+e_{i} \]

The researcher does not know \(x_{3i}\) is redundant, and he regresses \(y_{i}\) on \(\left(1,x_{1i},x_{2i},x_{3i}\right)\).

The estimated coefficient \(\widehat{\beta}\) is (0.315, 1.955, -0.852, 0.151). It is close to the true value, but not very accurate due to the small sample size.
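A sketch that replays this example is given below. The text does not report the sample size or the random seed, so \(n=50\) is an assumption; the resulting estimates will be in the same spirit but will not match the numbers above exactly.

```r
# Sketch of Example 4.1. The sample size and seed are not reported in the
# text; n = 50 is an assumption, so the estimates below will differ from
# the numbers quoted above.
set.seed(2020)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); e <- rnorm(n)
y <- 0.5 + 2 * x1 - 1 * x2 + e     # x3 is redundant by construction

coef(lm(y ~ x1 + x2 + x3))         # compare with the truth (0.5, 2, -1, 0)
```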

4.2 Subvector

The Frisch-Waugh-Lovell (FWL) theorem is an algebraic fact about the formula of a subvector of the OLS estimator. To derive the FWL theorem we need the inverse of a partitioned matrix. For a positive definite symmetric matrix \(A=\begin{pmatrix}A_{11} & A_{12}\\ A_{12}' & A_{22} \end{pmatrix}\), the inverse can be written as \[A^{-1}=\begin{pmatrix}\left(A_{11}-A_{12}A_{22}^{-1}A_{12}'\right)^{-1} & -\left(A_{11}-A_{12}A_{22}^{-1}A_{12}'\right)^{-1}A_{12}A_{22}^{-1}\\ -A_{22}^{-1}A_{12}'\left(A_{11}-A_{12}A_{22}^{-1}A_{12}'\right)^{-1} & \left(A_{22}-A_{12}'A_{11}^{-1}A_{12}\right)^{-1} \end{pmatrix}.\] In the context of the OLS estimator, partition \(X=\left(\begin{array}{cc} X_{1} & X_{2}\end{array}\right)\) and write

\[\begin{aligned} \begin{pmatrix}\widehat{\beta}_{1}\\ \widehat{\beta}_{2} \end{pmatrix} & =\widehat{\beta}=(X'X)^{-1}X'Y\\ & =\left(\begin{pmatrix}X_{1}'\\ X_{2}' \end{pmatrix}\begin{pmatrix}X_{1} & X_{2}\end{pmatrix}\right)^{-1}\begin{pmatrix}X_{1}'Y\\ X_{2}'Y \end{pmatrix}\\ & =\begin{pmatrix}X_{1}'X_{1} & X_{1}'X_{2}\\ X_{2}'X_{1} & X_{2}'X_{2} \end{pmatrix}^{-1}\begin{pmatrix}X_{1}'Y\\ X_{2}'Y \end{pmatrix}\\ & =\begin{pmatrix}\left(X_{1}'M_{X_{2}}X_{1}\right)^{-1} & -\left(X_{1}'M_{X_{2}}X_{1}\right)^{-1}X_{1}'X_{2}\left(X_{2}'X_{2}\right)^{-1}\\ \heartsuit & \heartsuit \end{pmatrix}\begin{pmatrix}X_{1}'Y\\ X_{2}'Y \end{pmatrix},\end{aligned}\] where the \((1,1)\) block follows from \(X_{1}'X_{1}-X_{1}'X_{2}\left(X_{2}'X_{2}\right)^{-1}X_{2}'X_{1}=X_{1}'M_{X_{2}}X_{1}\).

The subvector is \[\begin{aligned} \widehat{\beta}_{1} & =\left(X_{1}'M_{X_{2}}X_{1}\right)^{-1}X_{1}'Y-\left(X_{1}'M_{X_{2}}X_{1}\right)^{-1}X_{1}'X_{2}\left(X_{2}'X_{2}\right)^{-1}X_{2}'Y\\ & =\left(X_{1}'M_{X_{2}}X_{1}\right)^{-1}X_{1}'Y-\left(X_{1}'M_{X_{2}}X_{1}\right)^{-1}X_{1}'P_{X_{2}}Y\\ & =\left(X_{1}'M_{X_{2}}X_{1}\right)^{-1}\left(X_{1}'Y-X_{1}'P_{X_{2}}Y\right)\\ & =\left(X_{1}'M_{X_{2}}X_{1}\right)^{-1}X_{1}'M_{X_{2}}Y.\end{aligned}\]

Notice that \(\widehat{\beta}_{1}\) can be obtained by the following:

  1. Regress \(Y\) on \(X_{2}\), obtain the residual \(\tilde{Y}\);

  2. Regress \(X_{1}\) on \(X_{2}\), obtain the residual \(\tilde{X}_{1}\);

  3. Regress \(\tilde{Y}\) on \(\tilde{X}_{1}\), and obtain the OLS estimate \(\widehat{\beta}_{1}\).
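The three-step procedure can be checked numerically. In the sketch below, the split of the regressors into \(X_{1}\) and \(X_{2}\) and the simulated design are my own choices:

```r
# Sketch: verify the Frisch-Waugh-Lovell theorem. The split of the
# regressors into X1 and X2 and the simulated design are illustrative.
set.seed(2020)
n <- 100
X1 <- cbind(rnorm(n), rnorm(n))    # coefficients of interest
X2 <- cbind(1, rnorm(n))           # regressors to be partialled out
Y  <- drop(X1 %*% c(2, -1) + X2 %*% c(0.5, 1)) + rnorm(n)
X  <- cbind(X1, X2)

b_full <- drop(solve(t(X) %*% X, t(X) %*% Y))[1:2]   # subvector from the full regression

# The three-step procedure
M2       <- diag(n) - X2 %*% solve(t(X2) %*% X2) %*% t(X2)
Y_tilde  <- M2 %*% Y               # step 1: residual of Y on X2
X1_tilde <- M2 %*% X1              # step 2: residuals of X1 on X2
b_fwl    <- drop(solve(t(X1_tilde) %*% X1_tilde, t(X1_tilde) %*% Y_tilde))  # step 3

rbind(full = b_full, fwl = b_fwl)  # identical rows
```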

A similar derivation can also be carried out for the population linear projection; see Hansen (2020) \[E\], Sections 2.22-2.23.

4.3 Goodness of Fit

Consider the regression with an intercept, \(Y=X_{1}\beta_{1}+\beta_{2}\iota+e\), where \(\iota\) (the Greek letter iota) is an \(n\times1\) vector of ones. The OLS estimator gives \[Y=\widehat{Y}+\widehat{e}=\left(X_{1}\widehat{\beta}_{1}+\widehat{\beta}_{2}\iota\right)+\widehat{e}.\label{eq:decomp_1}\] Apply the FWL theorem with \(X_{2}=\iota\), so that \(M_{X_{2}}=M_{\iota}=I_{n}-\frac{1}{n}\iota\iota'\). Notice that \(M_{\iota}\) is the demeaner: \(M_{\iota}z=z-\bar{z}\iota\), that is, it subtracts the sample mean \(\bar{z}=\frac{1}{n}\sum_{i=1}^{n}z_{i}\) from each entry of the vector \(z\). The three-step procedure above becomes

  1. Regress \(Y\) on \(\iota\), and the residual is \(M_{\iota}Y\);

  2. Regress \(X_{1}\) on \(\iota\), and the residual is \(M_{\iota}X_{1}\);

  3. Regress \(M_{\iota}Y\) on \(M_{\iota}X_{1}\), and the OLS estimate is exactly the same as \(\widehat{\beta}_{1}\) in (\[eq:decomp_1\]).

The last step gives the decomposition \[M_{\iota}Y=M_{\iota}X_{1}\widehat{\beta}_{1}+\tilde{e},\label{eq:decomp_2}\] and the Pythagorean theorem implies \[\left\Vert M_{\iota}Y\right\Vert ^{2}=\Vert M_{\iota}X_{1}\widehat{\beta}_{1}\Vert^{2}+\left\Vert \tilde{e}\right\Vert ^{2}.\]
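As a small numerical aside, the demeaning property \(M_{\iota}z=z-\bar{z}\iota\) used in the steps above can be confirmed directly (the vector below is arbitrary):

```r
# Sketch: the annihilator M_iota demeans a vector, M_iota z = z - mean(z).
n <- 5
z <- c(3, 1, 4, 1, 5)                      # arbitrary vector
iota <- rep(1, n)
M_iota <- diag(n) - iota %*% t(iota) / n
cbind(M_iota %*% z, z - mean(z))           # identical columns
```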

**1.1**. Show that \(\widehat{e}\) in (\[eq:decomp_1\]) is exactly the same as \(\tilde{e}\) in (\[eq:decomp_2\]).

R-squared is a popular measure of goodness of fit in linear regression. The (in-sample) R-squared \[R^{2}=\frac{\Vert M_{\iota}X_{1}\widehat{\beta}_{1}\Vert^{2}}{\left\Vert M_{\iota}Y\right\Vert ^{2}}=1-\frac{\left\Vert \tilde{e}\right\Vert ^{2}}{\left\Vert M_{\iota}Y\right\Vert ^{2}}\] is well defined only when a constant is included in the regressors.
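The sketch below (simulated data, my own design) computes the in-sample R-squared from this definition and compares it with the value reported by `summary()` of an `lm()` fit:

```r
# Sketch: in-sample R-squared computed from the demeaned decomposition,
# compared with the value reported by lm(). Simulated, illustrative data.
set.seed(2020)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 0.5 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
y_hat <- fitted(fit)
R2_manual <- sum((y_hat - mean(y))^2) / sum((y - mean(y))^2)
c(manual = R2_manual, lm = summary(fit)$r.squared)   # identical values
```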

**1.2**. Show \[R^{2}=\frac{\widehat{Y}'M_{\iota}\widehat{Y}}{Y'M_{\iota}Y}=\frac{\sum_{i=1}^{n}\left(\widehat{y}_{i}-\overline{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\overline{y}\right)^{2}}\] as in the decomposition (\[eq:decomp_1\]). In other words, it is the ratio between the sample variance of \(\widehat{Y}\) and the sample variance of \(Y\).

The magnitude of R-squared varies across contexts. In macro models with lagged dependent variables, it is not unusual to observe an R-squared larger than 90%. In cross-sectional regressions it is often below 20%.

**1.3**. Consider a short regression "regress \(y_{i}\) on \(x_{1i}\)" and a long regression "regress \(y_{i}\) on \(\left(x_{1i},x_{2i}\right)\)". Given the same dataset \(\left(Y,X_{1},X_{2}\right)\), show that the R-squared from the short regression is no larger than that from the long regression. In other words, we can always (weakly) increase \(R^{2}\) by adding more regressors.

Conventionally we consider regressions in which the number of regressors \(K\) is much smaller than the sample size \(n\). In the era of big data, it can happen that we have more potential regressors than observations.

**1.4**. Show \(R^{2}=1\) when \(K\geq n\). (When \(K>n\), the matrix \(X'X\) must be rank deficient. We can generalize the definition of OLS fitting to any vector that minimizes \(\left\Vert Y-Xb\right\Vert ^{2}\), though the minimizer is not unique.)
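Code along the following lines could have produced the `lm()` output shown below; since the actual seed and draws are unknown, the printed numbers there differ from what this sketch returns, but the qualitative features (all residuals zero, dropped columns, R-squared equal to 1) are the same:

```r
# Sketch: more regressors (an intercept plus six columns) than the n = 5
# observations. lm() fits perfectly, drops the surplus columns, and reports
# R-squared = 1. The seed is arbitrary, so the estimates differ from the
# output reproduced below.
set.seed(2020)
n <- 5
X <- matrix(rnorm(n * 6), n, 6)
Y <- rnorm(n)
summary(lm(Y ~ X))
```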

##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## ALL 5 residuals are 0: no residual degrees of freedom!
##
## Coefficients: (2 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.2229         NA      NA       NA
## X1           -0.6422         NA      NA       NA
## X2            0.1170         NA      NA       NA
## X3            1.1844         NA      NA       NA
## X4            0.5883         NA      NA       NA
## X5                NA         NA      NA       NA
## X6                NA         NA      NA       NA
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN
## F-statistic:   NaN on 4 and 0 DF,  p-value: NA

With a new dataset \(\left(Y^{\mathrm{new}},X^{\mathrm{new}}\right)\), the out-of-sample (OOS) R-squared is \[OOS\ R^{2}=\frac{\widehat{\beta}^{\prime}X^{\mathrm{new}\prime}M_{\iota}X^{\mathrm{new}}\widehat{\beta}}{Y^{\mathrm{new}\prime}M_{\iota}Y^{\mathrm{new}}}.\] The OOS R-squared measures the goodness of fit in a new dataset given the coefficient estimated from the original dataset. In short-term predictive models for financial markets, a person may easily become rich if he can systematically achieve an OOS R-squared of 2%.
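A minimal sketch of the OOS R-squared as defined above, with \(\widehat{\beta}\) estimated on one simulated sample and evaluated on a fresh one (the design and sample sizes are my own):

```r
# Sketch: out-of-sample R-squared as defined above. beta is estimated on
# one simulated sample and evaluated on a fresh one; the design is illustrative.
set.seed(2020)
n <- 100
gen <- function(n) {
  X <- cbind(1, rnorm(n), rnorm(n))
  list(X = X, Y = drop(X %*% c(0.5, 2, -1)) + rnorm(n))
}
d_old <- gen(n)   # original sample: estimate beta
d_new <- gen(n)   # new sample: evaluate the fit

b_hat   <- solve(t(d_old$X) %*% d_old$X, t(d_old$X) %*% d_old$Y)
M_iota  <- diag(n) - matrix(1, n, n) / n          # demeaner
fit_new <- d_new$X %*% b_hat

oos_r2 <- drop((t(fit_new) %*% M_iota %*% fit_new) /
               (t(d_new$Y) %*% M_iota %*% d_new$Y))
oos_r2
```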

4.4 Summary

The linear algebraic properties hold in finite samples, regardless of whether the data are taken as fixed numbers or as random variables. The Gauss-Markov theorem holds under two crucial assumptions: a linear CEF and homoskedasticity.

Historical notes: Carl Friedrich Gauss (1777–1855) claimed that he had come up with the operation of OLS in 1795. With only three data points at hand, Gauss successfully applied his method to predict the location of the dwarf planet Ceres in 1801. While Gauss did not publish his work on OLS until 1809, Adrien-Marie Legendre (1752–1833) had presented the same method in 1805. Today people tend to attribute OLS to Gauss, on the grounds that a mathematical giant like Gauss had no need to tell a lie in order to steal Legendre's discovery.