11 Endogeneity

In microeconomic analysis, exogenous variables are the factors determined outside of the economic system under consideration, and endogenous variables are those decided within the economic system.

A microeconomic exercise that we have encountered many times goes as follows. A person has a utility function \(u\left(q_{1},q_{2}\right)\), where \(q_{1}\) and \(q_{2}\) are the quantities of two goods. He faces a budget constraint \(p_{1}q_{1}+p_{2}q_{2}\leq C\), where \(p_{1}\) and \(p_{2}\) are the prices of the two goods, respectively. What are the optimal quantities \(q_{1}^{*}\) and \(q_{2}^{*}\) he will purchase? In this question the utility function \(u\left(\cdot,\cdot\right)\), the prices \(p_{1}\) and \(p_{2}\), and the budget \(C\) are exogenous. The optimal purchases \(q_{1}^{*}\) and \(q_{2}^{*}\) are endogenous.

The terms “endogenous” and “exogenous” in microeconomics carry over into multiple-equation econometric models, in which a single-equation regression model \[y_{i}=x_{i}'\beta+e_{i}\label{eq:generative}\] is only part of the equation system. To keep things simple, in the single-equation model we say that \(x_{ik}\) is endogenous, or is an endogenous variable, if \(\mathrm{cov}\left(x_{ik},e_{i}\right)\neq0\); otherwise \(x_{ik}\) is an exogenous variable.

Empirical work using linear regressions is routinely challenged by questions about endogeneity. Such questions plague economic seminars and referee reports. To defend empirical strategies in quantitative economic studies, it is important to understand the sources of potential endogeneity and to discuss thoroughly the attempts to resolve it.

11.1 Identification

Endogeneity usually implies difficulty in identifying the parameter of interest with only \(\left(y_{i},x_{i}\right)\). Identification is critical for the interpretation of empirical economic research. We say a parameter is identified if the mapping between the parameter in the model and the distribution of the observed variables is one-to-one; otherwise we say the parameter is under-identified. This is an abstract definition, so let us discuss it in the familiar linear regression context.

The linear projection model implies the moment equation \[\mathbb{E}\left[x_{i}x_{i}'\right]\beta=\mathbb{E}\left[x_{i}y_{i}\right].\label{eq:k-equation-FOC}\] If \(\mathbb{E}\left[x_{i}x_{i}'\right]\) is of full rank, then \(\beta=\left(\mathbb{E}\left[x_{i}x_{i}'\right]\right)^{-1}\mathbb{E}\left[x_{i}y_{i}\right]\) is a function of population moments of the observable variables, and it is identified. On the contrary, if some \(x_{k}\)’s are perfectly collinear so that \(\mathbb{E}\left[x_{i}x_{i}'\right]\) is rank deficient, there are multiple \(\beta\) that satisfy the \(k\)-equation system (\[eq:k-equation-FOC\]). Identification fails.

Suppose \(x_{i}\) is a scalar random variable and \(\left(x_{i},e_{i}\right)\) follows the joint normal distribution \[\begin{pmatrix}x_{i}\\ e_{i} \end{pmatrix}\sim N\left(\begin{pmatrix}0\\ 0 \end{pmatrix},\begin{pmatrix}1 & \sigma_{xe}\\ \sigma_{xe} & 1 \end{pmatrix}\right),\] while the dependent variable \(y_{i}\) is generated from (\[eq:generative\]). Because \(\mathrm{var}\left[x_{i}\right]=1\), joint normality gives \(\mathbb{E}\left[e_{i}|x_{i}\right]=\sigma_{xe}x_{i}\), so the conditional mean \[\mathbb{E}\left[y_{i}|x_{i}\right]=\beta x_{i}+\mathbb{E}\left[e_{i}|x_{i}\right]=\left(\beta+\sigma_{xe}\right)x_{i}\] coincides with the linear projection model, and \(\beta+\sigma_{xe}\) is the linear projection coefficient. From the observable random variables \(\left(y_{i},x_{i}\right)\), we can only learn \(\beta+\sigma_{xe}\). As we cannot learn \(\sigma_{xe}\) from the data due to the unobservable \(e_{i}\), there is no way to recover \(\beta\). This is exactly the omitted variable bias that we have discussed earlier in this course. The gap lies between the available data \(\left(y_{i},x_{i}\right)\) and the identification of the model. In the special case that we assume \(\sigma_{xe}=0\), the endogeneity vanishes and \(\beta\) is identified.
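A small simulation can make the identification failure concrete. The sketch below draws \(\left(x_{i},e_{i}\right)\) from the joint normal distribution above and checks that the OLS slope settles near \(\beta+\sigma_{xe}\) rather than \(\beta\). The parameter values, the sample size and the function name are illustrative assumptions, not part of the original example.

```python
import numpy as np


def ols_slope_under_endogeneity(beta=1.0, sigma_xe=0.5, n=200_000, seed=0):
    """Simulate y = beta * x + e with cov(x, e) = sigma_xe; return the OLS slope."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, sigma_xe], [sigma_xe, 1.0]])
    # Each row of the draw is one (x_i, e_i) pair; transpose to split the columns.
    x, e = rng.multivariate_normal(np.zeros(2), cov, size=n).T
    y = beta * x + e
    # No-intercept OLS slope, the sample analogue of E[x y] / E[x^2].
    return float(x @ y / (x @ x))


slope = ols_slope_under_endogeneity()
# The population projection coefficient is beta + sigma_xe = 1.5, not beta = 1.0.
print(f"OLS slope: {slope:.3f}")
```

Increasing `n` does not help: the estimate becomes more precise, but it is precise about \(\beta+\sigma_{xe}\).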

The linear projection model is so far the most general model in this course that justifies OLS. OLS is consistent for the linear projection coefficient. By the definition of the linear projection model, \(\mathbb{E}\left[x_{i}e_{i}\right]=0\), so there is no room for endogeneity in the linear projection model. In other words, if we talk about endogeneity, we must not be working with the linear projection model, and the coefficients we pursue are the structural parameters rather than the linear projection coefficients.

In econometrics we are often interested in a model with economic interpretation. The common practice in empirical research assumes that the observed data are generated from a parsimonious model, and the next step is to estimate the unknown parameters in the model. Since it is often possible to name some factors that are not included among the regressors but are correlated with the included regressors and at the same time also affect \(y_{i}\), endogeneity becomes a fundamental problem.

To resolve endogeneity, we seek extra variables or data structure that may guarantee the identification of the model. The most often used methods are (i) the fixed effect model and (ii) instrumental variables:

  • The fixed effect model requires that multiple observations, often across time, are collected for each individual \(i\). Moreover, the source of endogeneity is time invariant and enters the model additively in the form \[y_{it}=x_{it}'\beta+u_{it},\] where \(u_{it}=\alpha_{i}+\epsilon_{it}\) is the composite error. The panel data approach extends \(\left(y_{i},x_{i}\right)\) to \(\left(y_{it},x_{it}\right)_{t=1}^{T}\) if data are available along the time dimension.

  • The instrumental variable approach extends \(\left(y_{i},x_{i}\right)\) to \(\left(y_{i},x_{i},z_{i}\right)\), where the extra random variable \(z_{i}\) is called the instrumental variable. It is assumed that \(z_{i}\) is orthogonal to the error \(e_{i}\). Thus the approach adds both an extra variable \(z_{i}\) and an orthogonality restriction to the model.

Either the panel data approach or the instrumental variable approach entails extra information beyond \(\left(y_{i},x_{i}\right)\). Without such extra data, there is no way to resolve the identification failure. Just as the linear projection model is available for any joint distribution of \(\left(y_{i},x_{i}\right)\) with suitable moments, from a purely statistical point of view a linear IV model is an artifact that depends only on the choice of \(\left(y_{i},x_{i},z_{i}\right)\), without reference to any economics. In essence, the linear IV model seeks a linear combination \(y_{i}-\beta x_{i}\) that is orthogonal to the linear space spanned by \(z_{i}\).
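To see how the extra variable \(z_{i}\) restores identification, consider a minimal sketch with scalar \(x_{i}\) and \(z_{i}\). We solve the sample version of \(\mathbb{E}\left[z_{i}\left(y_{i}-\beta x_{i}\right)\right]=0\) for \(\beta\) and compare it with OLS. The data generating process, in particular the first-stage coefficient `pi` and all numerical values, is a hypothetical choice made only for illustration.

```python
import numpy as np


def simple_iv_demo(beta=1.0, sigma_xe=0.5, pi=0.8, n=200_000, seed=1):
    """Return (OLS slope, IV slope) when x is endogenous and z is a valid instrument."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)  # instrument: independent of the error e
    e = rng.standard_normal(n)  # structural error
    # v has unit variance and cov(v, e) = sigma_xe, which makes x endogenous.
    v = sigma_xe * e + np.sqrt(1.0 - sigma_xe**2) * rng.standard_normal(n)
    x = pi * z + v              # relevance: cov(z, x) = pi, nonzero
    y = beta * x + e
    ols = float(x @ y / (x @ x))  # solves sum x_i (y_i - b x_i) = 0
    iv = float(z @ y / (z @ x))   # solves sum z_i (y_i - b x_i) = 0
    return ols, iv


ols, iv = simple_iv_demo()
print(f"OLS: {ols:.3f}   IV: {iv:.3f}   true beta: 1.000")
```

Under this design OLS converges to \(\beta+\sigma_{xe}/\left(\pi^{2}+1\right)\), while the IV estimate converges to \(\beta\). If `pi` were set to zero, the denominator `z @ x` would fluctuate around zero and the IV ratio would be meaningless.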

11.2 Instruments

There are two requirements for valid IVs: orthogonality and relevance. Orthogonality entails that the model is correctly specified. If relevance is violated, meaning that the IVs are not correlated with the endogenous variable, then multiple parameters can generate the observable data. Identification, as in the standard definition in econometrics, breaks down.

A structural equation is a model of economic interest. Consider the following linear structural model \[y_{i}=x_{1i}'\beta_{1}+z_{1i}'\beta_{2}+\epsilon_{i},\label{eq:basic_1}\] where \(x_{1i}\) is a \(k_{1}\)-dimensional vector of endogenous explanatory variables and \(z_{1i}\) is a \(k_{2}\)-dimensional vector of exogenous explanatory variables with the intercept included. In addition, we have \(z_{2i}\), a \(k_{3}\)-dimensional vector of excluded exogenous variables. Let \(K=k_{1}+k_{2}\) and \(L=k_{2}+k_{3}\). Denote \(x_{i}=\left(x_{1i}',z_{1i}'\right)'\) as a \(K\)-dimensional explanatory variable, and \(z_{i}=\left(z_{1i}',z_{2i}'\right)'\) as an \(L\)-dimensional exogenous vector.

We call the exogenous variables instrumental variables, or simply instruments. Let \(\beta=\left(\beta_{1}',\beta_{2}'\right)'\) be a \(K\)-dimensional parameter of interest. From now on, we rewrite (\[eq:basic\_1\]) as \[y_{i}=x_{i}'\beta+\epsilon_{i},\label{eq:basic_2}\] and we have a vector of instruments \(z_{i}\).

Before estimating any structural econometric model, we must check identification. In the context of (\[eq:basic\_2\]), identification requires that the true value \(\beta_{0}\) is the only value in the parameter space that satisfies the moment condition \[\mathbb{E}\left[z_{i}\left(y_{i}-x_{i}'\beta\right)\right]=0_{L}.\label{eq:moment}\] The following rank condition is necessary and sufficient for identification.

\(\mathrm{rank}\left(\mathbb{E}\left[z_{i}x_{i}'\right]\right)=K\).

Note that \(\mathbb{E}\left[z_{i}x_{i}'\right]\) is an \(L\times K\) matrix. The rank condition implies the order condition \(L\geq K\), which says that the number of excluded instruments must be no fewer than the number of endogenous variables (\(k_{3}\geq k_{1}\)).

The parameter in (\[eq:moment\]) is identified if and only if the rank condition holds.

(The “if” direction). For any \(\tilde{\beta}\) such that \(\tilde{\beta}\neq\beta_{0}\), \[\begin{aligned} \mathbb{E}\left[z_{i}\left(y_{i}-x_{i}'\tilde{\beta}\right)\right] & =\mathbb{E}\left[z_{i}\left(y_{i}-x_{i}'\beta_{0}\right)\right]+\mathbb{E}\left[z_{i}x_{i}'\right]\left(\beta_{0}-\tilde{\beta}\right)\\ & =0_{L}+\mathbb{E}\left[z_{i}x_{i}'\right]\left(\beta_{0}-\tilde{\beta}\right).\end{aligned}\] Because \(\mathrm{rank}\left(\mathbb{E}\left[z_{i}x_{i}'\right]\right)=K\), the matrix has full column rank, so \(\mathbb{E}\left[z_{i}x_{i}'\right]\left(\beta_{0}-\tilde{\beta}\right)=0_{L}\) if and only if \(\beta_{0}-\tilde{\beta}=0_{K}\), which contradicts \(\tilde{\beta}\neq\beta_{0}\). Hence no \(\tilde{\beta}\neq\beta_{0}\) satisfies the moment condition, and \(\beta_{0}\) is the unique value that satisfies (\[eq:moment\]).

(The “only if” direction is left as an exercise. Hint: By contraposition, it suffices to show that if the rank condition fails, then the model is not identified. We can easily prove the claim by constructing an example.)

11.3 Sources of Endogeneity

As econometricians mostly work with non-experimental data, we cannot overstate the importance of the endogeneity problem. We go over a few examples.

We know that the first-difference (FD) estimator is consistent for the (static) panel data model. Nevertheless, the FD estimator encounters difficulty in a dynamic panel model \[y_{it}=\beta_{1}+\beta_{2}y_{i,t-1}+\beta_{3}x_{it}+\alpha_{i}+\epsilon_{it},\label{eq:dymPanel}\] even if we assume \[\mathbb{E}\left[\epsilon_{is}|\alpha_{i},x_{i1},\ldots,x_{iT},y_{i,t-1},y_{i,t-2},\ldots,y_{i0}\right]=0,\ \ \forall s\geq t.\label{eq:dyn_mean_0}\] Taking the difference of equation (\[eq:dymPanel\]) between periods \(t\) and \(t-1\), we have \[\left(y_{it}-y_{i,t-1}\right)=\beta_{2}\left(y_{i,t-1}-y_{i,t-2}\right)+\beta_{3}\left(x_{it}-x_{i,t-1}\right)+\left(\epsilon_{it}-\epsilon_{i,t-1}\right).\label{eq:dyn_mean_1}\] Under (\[eq:dyn\_mean\_0\]), \(\mathbb{E}\left[\left(x_{it}-x_{i,t-1}\right)\left(\epsilon_{it}-\epsilon_{i,t-1}\right)\right]=0\), but because \(y_{i,t-1}\) contains \(\epsilon_{i,t-1}\), \[\mathbb{E}\left[\left(y_{i,t-1}-y_{i,t-2}\right)\left(\epsilon_{it}-\epsilon_{i,t-1}\right)\right]=-\mathbb{E}\left[y_{i,t-1}\epsilon_{i,t-1}\right]=-\mathbb{E}\left[\epsilon_{i,t-1}^{2}\right]\neq0.\] Therefore the coefficients \(\beta_{2}\) and \(\beta_{3}\) cannot be identified from the linear regression model (\[eq:dyn\_mean\_1\]).

Instruments for the above example are easy to find. Notice that the linear relationship (\[eq:dymPanel\]) implies \[\begin{aligned} & & \mathbb{E}\left[\epsilon_{i,t}-\epsilon_{i,t-1}|\alpha_{i},x_{i1},\ldots,x_{iT},\epsilon_{i,t-2},\epsilon_{i,t-3},\ldots,\epsilon_{i1},y_{i0}\right]\\ & = & \mathbb{E}\left[\epsilon_{i,t}-\epsilon_{i,t-1}|\alpha_{i},x_{i1},\ldots,x_{iT},y_{i,t-2},y_{i,t-3},\ldots,y_{i0}\right]=0\end{aligned}\] according to the assumption (\[eq:dyn\_mean\_0\]), because the two conditioning sets carry the same information. The above relationship gives orthogonality conditions of the form \[\mathbb{E}\left[\left(\epsilon_{i,t}-\epsilon_{i,t-1}\right)f\left(\epsilon_{i,t-2},\epsilon_{i,t-3},\ldots,\epsilon_{i1}\right)\right]=0.\] Equivalently, any function of \(y_{i,t-2},y_{i,t-3},\ldots,y_{i1}\) is orthogonal to the differenced error term \(\left(\epsilon_{i,t}-\epsilon_{i,t-1}\right)\). Here the excluded IVs are naturally generated from the model itself.
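The dynamic panel argument can be checked numerically. The sketch below simulates a simplified version of the model without the regressor \(x_{it}\) (an assumption made to keep the code short), runs OLS on the first-differenced equation, and then uses the lagged level \(y_{i,t-2}\) as the instrument for \(y_{i,t-1}-y_{i,t-2}\). The level \(y_{i,t-2}\) is only one of many valid functions of past outcomes; the sample sizes and parameter values are illustrative.

```python
import numpy as np


def dynamic_panel_demo(beta2=0.5, n=20_000, T=6, burn=50, seed=2):
    """Return (FD-OLS, FD-IV) estimates of beta2 in y_it = beta2*y_{i,t-1} + alpha_i + eps_it."""
    rng = np.random.default_rng(seed)
    alpha = rng.standard_normal(n)  # individual fixed effects
    y = np.zeros((n, burn + T))     # start at zero; the burn-in removes its influence
    for t in range(1, burn + T):
        y[:, t] = beta2 * y[:, t - 1] + alpha + rng.standard_normal(n)
    y = y[:, burn:]                 # keep the last T periods only

    dy = np.diff(y, axis=1)         # dy[:, j] = y[:, j + 1] - y[:, j]
    lhs = dy[:, 1:].ravel()         # y_t - y_{t-1}
    rhs = dy[:, :-1].ravel()        # y_{t-1} - y_{t-2}, the endogenous regressor
    inst = y[:, :-2].ravel()        # y_{t-2}, orthogonal to eps_t - eps_{t-1}

    fd_ols = float(rhs @ lhs / (rhs @ rhs))
    fd_iv = float(inst @ lhs / (inst @ rhs))
    return fd_ols, fd_iv


fd_ols, fd_iv = dynamic_panel_demo()
print(f"FD-OLS: {fd_ols:.3f}   FD-IV: {fd_iv:.3f}   true beta2: 0.500")
```

In this design the FD-OLS estimate is not merely biased toward zero but takes the wrong sign, while the IV estimate stays close to \(\beta_{2}\).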

Another classical source of endogeneity is the measurement error.

Endogeneity also emerges when an explanatory variable is not directly observable but is replaced by a measurement with error. Suppose the true linear model is \[y_{i}=\beta_{1}+\beta_{2}x_{i}^{*}+u_{i},\label{eq:measurement_error}\] with \(\mathbb{E}\left[u_{i}|x_{i}^{*}\right]=0\). We cannot observe \(x_{i}^{*}\) but we observe \(x_{i}\), a measurement of \(x_{i}^{*}\), and they are linked by \[x_{i}=x_{i}^{*}+v_{i}\] with \(\mathbb{E}\left[v_{i}|x_{i}^{*},u_{i}\right]=0\). Such a formulation of the measurement error is called the classical measurement error. Substituting out the unobservable \(x_{i}^{*}\) in (\[eq:measurement\_error\]), we have \[y_{i}=\beta_{1}+\beta_{2}\left(x_{i}-v_{i}\right)+u_{i}=\beta_{1}+\beta_{2}x_{i}+e_{i}\label{eq:measurement_error2}\] where \(e_{i}=u_{i}-\beta_{2}v_{i}\). Whenever \(\beta_{2}\neq0\), the regressor is correlated with the error: \[\mathbb{E}\left[x_{i}e_{i}\right]=\mathbb{E}\left[\left(x_{i}^{*}+v_{i}\right)\left(u_{i}-\beta_{2}v_{i}\right)\right]=-\beta_{2}\mathbb{E}\left[v_{i}^{2}\right]\neq0.\] OLS applied to (\[eq:measurement\_error2\]) would not deliver a consistent estimator.

Alternatively, we can look at the above problem of classical measurement error from the expression of the linear projection coefficient. We know that in (\[eq:measurement\_error\]) \(\beta_{2}^{\mathrm{infeasible}}=\mathrm{cov}\left[x_{i}^{*},y_{i}\right]/\mathrm{var}\left[x_{i}^{*}\right].\) In contrast, when we regress \(y_{i}\) on the observable \(x_{i}\), the corresponding linear projection coefficient is \[\beta_{2}^{\mathrm{feasible}}=\frac{\mathrm{cov}\left[x_{i},y_{i}\right]}{\mathrm{var}\left[x_{i}\right]}=\frac{\mathrm{cov}\left[x_{i}^{*}+v_{i},y_{i}\right]}{\mathrm{var}\left[x_{i}^{*}+v_{i}\right]}=\frac{\mathrm{cov}\left[x_{i}^{*},y_{i}\right]}{\mathrm{var}\left[x_{i}^{*}\right]+\mathrm{var}\left[v_{i}\right]},\] where the last equality uses the fact that \(v_{i}\) is uncorrelated with both \(x_{i}^{*}\) and \(u_{i}\). It is clear that \(|\beta_{2}^{\mathrm{feasible}}|\leq|\beta_{2}^{\mathrm{infeasible}}|\), and for \(\beta_{2}\neq0\) the equality holds only if \(\mathrm{var}\left[v_{i}\right]=0\) (no measurement error). This is called the attenuation bias due to the measurement error.
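The attenuation factor \(\mathrm{var}\left[x_{i}^{*}\right]/\left(\mathrm{var}\left[x_{i}^{*}\right]+\mathrm{var}\left[v_{i}\right]\right)\) is easy to verify by simulation. The sketch below generates classical measurement error with hypothetical values \(\beta_{2}=2\), \(\mathrm{var}\left[x_{i}^{*}\right]=1\) and \(\mathrm{var}\left[v_{i}\right]=0.5\), and compares the infeasible and feasible regression slopes.

```python
import numpy as np


def slope(regressor, outcome):
    """Sample linear projection slope cov(regressor, outcome) / var(regressor)."""
    c = np.cov(regressor, outcome)
    return float(c[0, 1] / c[0, 0])


def attenuation_demo(beta2=2.0, var_v=0.5, n=200_000, seed=3):
    """Return (infeasible slope, feasible slope, theoretical feasible slope)."""
    rng = np.random.default_rng(seed)
    x_star = rng.standard_normal(n)              # latent regressor, var[x*] = 1
    u = rng.standard_normal(n)                   # structural error
    v = np.sqrt(var_v) * rng.standard_normal(n)  # classical measurement error
    y = 1.0 + beta2 * x_star + u
    x = x_star + v                               # what the econometrician observes
    theory = beta2 * 1.0 / (1.0 + var_v)         # beta2 * var[x*] / (var[x*] + var[v])
    return slope(x_star, y), slope(x, y), theory


infeasible, feasible, theory = attenuation_demo()
print(f"infeasible: {infeasible:.3f}   feasible: {feasible:.3f}   theory: {theory:.3f}")
```

With these values the feasible slope is shrunk toward zero by the factor \(1/1.5\), regardless of the sample size.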

Next, we give two examples of equation systems, one from microeconomics and the other from macroeconomics.

Let \(p_{i}\) and \(q_{i}\) be a good’s log-price and log-quantity on the \(i\)-th market, and they are iid across markets. We are interested in the demand curve \[p_{i}=\alpha_{d}-\beta_{d}q_{i}+e_{di}\label{eq:demand}\] for some \(\beta_{d}\geq0\) and the supply curve \[p_{i}=\alpha_{s}+\beta_{s}q_{i}+e_{si}\label{eq:supply}\] for some \(\beta_{s}\geq0\). We use a simple linear specification so that the coefficient \(\beta_{d}\) can be interpreted as demand elasticity and \(\beta_{s}\) as supply elasticity. Undergraduate microeconomics teaches the deterministic form, but we add an error term to cope with the data. Can we learn the elasticities by regressing \(p_{i}\) on \(q_{i}\)?

The two equations can be written in a matrix form \[\begin{pmatrix}1 & \beta_{d}\\ 1 & -\beta_{s} \end{pmatrix}\begin{pmatrix}p_{i}\\ q_{i} \end{pmatrix}=\begin{pmatrix}\alpha_{d}\\ \alpha_{s} \end{pmatrix}+\begin{pmatrix}e_{di}\\ e_{si} \end{pmatrix}.\label{eq:structural}\] Microeconomic terminology calls \(\left(p_{i},q_{i}\right)\) endogenous variables and \(\left(e_{di},e_{si}\right)\) exogenous variables. (\[eq:structural\]) is a structural equation because it is motivated from economic theory so that the coefficients bear economic meaning. If we rule out the trivial case \(\beta_{d}=\beta_{s}=0\), we can solve \[\begin{aligned} \begin{pmatrix}p_{i}\\ q_{i} \end{pmatrix} & =\begin{pmatrix}1 & \beta_{d}\\ 1 & -\beta_{s} \end{pmatrix}^{-1}\left[\begin{pmatrix}\alpha_{d}\\ \alpha_{s} \end{pmatrix}+\begin{pmatrix}e_{di}\\ e_{si} \end{pmatrix}\right]\nonumber \\ & =\frac{1}{\beta_{s}+\beta_{d}}\begin{pmatrix}\beta_{s} & \beta_{d}\\ 1 & -1 \end{pmatrix}\left[\begin{pmatrix}\alpha_{d}\\ \alpha_{s} \end{pmatrix}+\begin{pmatrix}e_{di}\\ e_{si} \end{pmatrix}\right].\label{eq:reduced}\end{aligned}\] This equation (\[eq:reduced\]) is called the reduced form: the endogenous variables are expressed as explicit functions of the parameters and the exogenous variables. In particular, \[q_{i}=\left(\alpha_{d}+e_{di}-\alpha_{s}-e_{si}\right)/\left(\beta_{s}+\beta_{d}\right)\] so that the log-quantity is correlated with both \(e_{si}\) and \(e_{di}\). As \(q_{i}\) is endogenous (in the econometric sense) in both (\[eq:demand\]) and (\[eq:supply\]), neither the demand elasticity nor the supply elasticity is identified with \(\left(p_{i},q_{i}\right)\). 
Indeed, as \[p_{i}=\left(\beta_{s}\alpha_{d}+\beta_{d}\alpha_{s}+\beta_{s}e_{di}+\beta_{d}e_{si}\right)/\left(\beta_{s}+\beta_{d}\right)\] from (\[eq:reduced\]), the linear projection coefficient of \(p_{i}\) on \(q_{i}\) is \[\frac{\mathrm{cov}\left[p_{i},q_{i}\right]}{\mathrm{var}\left[q_{i}\right]}=\frac{\beta_{s}\sigma_{d}^{2}-\beta_{d}\sigma_{s}^{2}+\left(\beta_{d}-\beta_{s}\right)\sigma_{sd}}{\sigma_{d}^{2}+\sigma_{s}^{2}-2\sigma_{sd}},\] where \(\sigma_{d}^{2}=\mathrm{var}\left[e_{di}\right]\), \(\sigma_{s}^{2}=\mathrm{var}\left[e_{si}\right]\) and \(\sigma_{sd}=\mathrm{cov}\left[e_{di},e_{si}\right]\). The common factor \(1/\left(\beta_{s}+\beta_{d}\right)^{2}\) cancels between the numerator and the denominator.

This is a classical example of the demand-supply system. The structural parameter cannot be directly identified because the observed \(\left(p_{i},q_{i}\right)\) is the outcome of an equilibrium—the crossing of the demand curve and the supply curve. To identify the demand curve, we will need an instrument that shifts the supply curve only; and vice versa.
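The following sketch simulates the equilibrium outcomes and illustrates both points: the regression of \(p_{i}\) on \(q_{i}\) recovers neither slope, and a variable that shifts only the supply curve identifies the demand slope. The observed supply shifter `w`, its loading of 0.8 on the supply shock, and all parameter values are hypothetical; the population projection coefficient is computed directly from the covariance matrix implied by the reduced form.

```python
import numpy as np


def demand_supply_demo(alpha_d=5.0, beta_d=1.0, alpha_s=1.0, beta_s=0.5,
                       n=200_000, seed=4):
    """Return (OLS slope of p on q, its population value, IV slope using w)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)    # hypothetical observed supply shifter
    e_d = rng.standard_normal(n)  # demand shock, independent of w
    e_s = 0.8 * w + 0.6 * rng.standard_normal(n)  # supply shock with unit variance

    # Equilibrium (reduced form) quantity, then price read off the demand curve.
    q = (alpha_d + e_d - alpha_s - e_s) / (beta_s + beta_d)
    p = alpha_d - beta_d * q + e_d

    c = np.cov(p, q)
    ols = float(c[0, 1] / c[1, 1])
    # w moves q through supply only, so cov(w, p) = -beta_d * cov(w, q).
    iv = float(np.cov(w, p)[0, 1] / np.cov(w, q)[0, 1])

    # Population covariance of (p, q): A^{-1} Sigma A^{-T}, here with Sigma = I.
    a_inv = np.linalg.inv(np.array([[1.0, beta_d], [1.0, -beta_s]]))
    v = a_inv @ np.eye(2) @ a_inv.T
    population = float(v[0, 1] / v[1, 1])
    return ols, population, iv


ols, population, iv = demand_supply_demo()
print(f"OLS: {ols:.3f}   population projection: {population:.3f}   IV: {iv:.3f}")
```

With \(\beta_{d}=1\) and \(\beta_{s}=0.5\), the OLS slope settles near \(-0.25\), which is neither \(-\beta_{d}\) nor \(\beta_{s}\), whereas the IV ratio is close to \(-\beta_{d}=-1\).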

This is a model borrowed from Hayashi (2000, p.193) but originated from Haavelmo (1943). An econometrician is interested in learning \(\beta_{2}\), the marginal propensity to consume, in the Keynesian-type equation \[C_{i}=\beta_{1}+\beta_{2}Y_{i}+u_{i}\label{eq:keynes}\] where \(C_{i}\) is household consumption, \(Y_{i}\) is the GNP, and \(u_{i}\) is the unobservable error. However, \(Y_{i}\) and \(C_{i}\) are connected by an accounting equality (with no error) \[Y_{i}=C_{i}+I_{i},\] where \(I_{i}\) is investment. We assume \(\mathbb{E}\left[u_{i}|I_{i}\right]=0\) as investment is determined in advance. In this example, \(\left(Y_{i},C_{i}\right)\) are endogenous and \(\left(I_{i},u_{i}\right)\) are exogenous. Put the two equations together as the structural form \[\begin{pmatrix}1 & -\beta_{2}\\ -1 & 1 \end{pmatrix}\begin{pmatrix}C_{i}\\ Y_{i} \end{pmatrix}=\begin{pmatrix}\beta_{1}\\ 0 \end{pmatrix}+\begin{pmatrix}u_{i}\\ I_{i} \end{pmatrix}.\] The corresponding reduced form is \[\begin{aligned} \begin{pmatrix}C_{i}\\ Y_{i} \end{pmatrix} & =\begin{pmatrix}1 & -\beta_{2}\\ -1 & 1 \end{pmatrix}^{-1}\left[\begin{pmatrix}\beta_{1}\\ 0 \end{pmatrix}+\begin{pmatrix}u_{i}\\ I_{i} \end{pmatrix}\right]\\ & =\frac{1}{1-\beta_{2}}\begin{pmatrix}1 & \beta_{2}\\ 1 & 1 \end{pmatrix}\left[\begin{pmatrix}\beta_{1}\\ 0 \end{pmatrix}+\begin{pmatrix}u_{i}\\ I_{i} \end{pmatrix}\right]\\ & =\frac{1}{1-\beta_{2}}\begin{pmatrix}\beta_{1}+u_{i}+\beta_{2}I_{i}\\ \beta_{1}+u_{i}+I_{i} \end{pmatrix}.\end{aligned}\] OLS applied to (\[eq:keynes\]) will be inconsistent because the reduced form \(Y_{i}=\frac{1}{1-\beta_{2}}\left(\beta_{1}+u_{i}+I_{i}\right)\) implies \(\mathbb{E}\left[Y_{i}u_{i}\right]=\mathbb{E}\left[u_{i}^{2}\right]/\left(1-\beta_{2}\right)\neq0\).
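A short simulation illustrates the size of the simultaneity bias. The sketch below generates \(\left(C_{i},Y_{i}\right)\) from the reduced form and compares OLS with an IV estimate that uses investment \(I_{i}\) as the instrument, which is a natural choice under \(\mathbb{E}\left[u_{i}|I_{i}\right]=0\) although it is our own illustration. The values \(\beta_{2}=0.6\) and unit variances for \(u_{i}\) and \(I_{i}\) are assumptions.

```python
import numpy as np


def keynesian_demo(beta1=1.0, beta2=0.6, n=200_000, seed=5):
    """Return (OLS, IV) estimates of the marginal propensity to consume."""
    rng = np.random.default_rng(seed)
    invest = 2.0 + rng.standard_normal(n)  # exogenous investment I_i
    u = rng.standard_normal(n)             # consumption shock, independent of I_i
    # Reduced form: Y = (beta1 + u + I) / (1 - beta2), and C = Y - I by accounting.
    income = (beta1 + u + invest) / (1.0 - beta2)
    consumption = income - invest

    c = np.cov(income, consumption)
    ols = float(c[0, 1] / c[0, 0])         # slope of C on Y
    iv = float(np.cov(invest, consumption)[0, 1] / np.cov(invest, income)[0, 1])
    return ols, iv


ols, iv = keynesian_demo()
print(f"OLS: {ols:.3f}   IV: {iv:.3f}   true beta2: 0.600")
```

In this design OLS converges to \(\beta_{2}+\left(1-\beta_{2}\right)\mathrm{var}\left[u_{i}\right]/\left(\mathrm{var}\left[u_{i}\right]+\mathrm{var}\left[I_{i}\right]\right)=0.8\), overstating the marginal propensity to consume, while the IV ratio recovers \(\beta_{2}\).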

11.4 Summary

Even though we often deal with a single-equation model with potentially endogenous variables, the underlying structural system may involve multiple equations. The simultaneous equation model is a classical econometric modeling approach, and it is still actively applied in structural economic studies. When our economic model is “structural”, we keep in mind a causal mechanism. Instead of identifying the causal effect by comparing a control group and a treatment group as in Chapter 2, here we look at causality from the economic structural perspective.

Historical notes: Instruments originally appeared in Philip Wright (1928) for identifying the coefficient of an endogenous variable. It is believed to be a collaborative idea with Philip’s son Sewall Wright. The demand and supply analysis is attributed to Working (1927), and the study of measurement error dates back to Frisch (1934).

Further reading: Causality is the holy grail of econometrics. Pearl and Mackenzie (2018) is a popular book with philosophical depth. It is a delight to read. Chen, Hong, and Nekipelov (2011) is a survey of modern nonlinear measurement error models.