3 Conditional Expectation

Notation: In this note, \(y\) is a scalar random variable, and \(x=\left(x_{1},\ldots,x_{K}\right)'\) is a \(K\times1\) random vector. Throughout this course, a vector is a column vector, i.e. a one-column matrix.


Machine learning is a big basket that contains regression models. We motivate the conditional expectation model from the perspective of prediction, viewing regression as a form of supervised learning. Supervised learning uses a function of \(x\), say \(g\left(x\right)\), to predict \(y\). In general \(x\) cannot perfectly predict \(y\); otherwise their relationship would be deterministic. The prediction error \(y-g\left(x\right)\) depends on the choice of \(g\). There are numerous possible choices of \(g\). Which one is the best? Notice that this question is not concerned with the underlying data generating process (DGP) of the joint distribution of \(\left(y,x\right)\). We want a general rule that predicts \(y\) accurately given \(x\), no matter how this pair of variables is generated.

To answer this question, we need to decide on a criterion to compare different choices of \(g\). Such a criterion is called the loss function \(L\left(y,g\left(x\right)\right)\). A particularly convenient one is the quadratic loss, defined as \[L\left(y,g\left(x\right)\right)=\left(y-g\left(x\right)\right)^{2}.\] Since the data are random, \(L\left(y,g\left(x\right)\right)\) is also random. “Random” means uncertainty: sometimes this happens, and sometimes that happens. To get rid of the uncertainty, we average the loss function with respect to the joint distribution of \(\left(y,x\right)\) as \(R\left(y,g\left(x\right)\right)=E\left[L\left(y,g\left(x\right)\right)\right]\), which is called the risk. The risk is a deterministic quantity. For the quadratic loss function, the corresponding risk \[R\left(y,g\left(x\right)\right)=E\left[\left(y-g\left(x\right)\right)^{2}\right]\] is called the mean squared error (MSE). MSE is the most widely used risk measure, although many alternatives exist, for example the mean absolute error (MAE) \(E\left[\left|y-g\left(x\right)\right|\right]\). The popularity of MSE comes from its convenience for closed-form analysis, which MAE does not enjoy due to its nondifferentiability. This is similar to the choice of utility functions in economics. There are only a few commonly used functional forms for utility, for example CRRA and CARA. They are popular because they lead to closed-form solutions that are easy to handle. Now our quest is narrowed down to: what is the optimal choice of \(g\) if we minimize the MSE?
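
As a concrete illustration, the following minimal Python sketch approximates the risk of two candidate predictors under both the quadratic loss and the absolute loss by simulation. The linear DGP and the two candidate predictors are purely hypothetical choices for illustration.

```python
# A minimal sketch (hypothetical DGP): approximate the risk of two candidate
# predictors by Monte Carlo, under quadratic loss (MSE) and absolute loss (MAE).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # illustrative DGP

def mse(y, g):
    return np.mean((y - g) ** 2)

def mae(y, g):
    return np.mean(np.abs(y - g))

g1 = 1.0 + 2.0 * x      # the conditional mean under this DGP
g2 = 2.5 * x            # an arbitrary competitor

print("MSE:", mse(y, g1), mse(y, g2))   # g1 attains the smaller MSE
print("MAE:", mae(y, g1), mae(y, g2))
```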

Proposition: The conditional expectation function (CEF), also called the conditional mean function, \(m\left(x\right)=E\left[y|x\right]=\int yf\left(y|x\right)\mathrm{d}y\) minimizes the MSE.

Before we prove the above proposition, we first discuss some properties of the conditional mean function. Obviously \[y=m\left(x\right)+\left(y-m\left(x\right)\right)=m\left(x\right)+\epsilon,\] where \(\epsilon:=y-m\left(x\right)\) is called the regression error. This equation holds for \(\left(y,x\right)\) following any joint distribution, as long as \(E\left[y|x\right]\) exists. The error term \(\epsilon\) satisfies these properties:

  • \(E\left[\epsilon|x\right]=E\left[y-m\left(x\right)|x\right]=E\left[y|x\right]-m(x)=0\),

  • \(E\left[\epsilon\right]=E\left[E\left[\epsilon|x\right]\right]=E\left[0\right]=0\),

  • For any function \(h\left(x\right)\), we have \[E\left[h\left(x\right)\epsilon\right]=E\left[E\left[h\left(x\right)\epsilon|x\right]\right]=E\left[h(x)E\left[\epsilon|x\right]\right]=0.\label{eq:uncorr}\]

The last property implies that \(\epsilon\) is uncorrelated with any function of \(x\). In particular, when \(h\) is the identity function \(h\left(x\right)=x\), we have \(E\left[x\epsilon\right]=\mathrm{cov}\left(x,\epsilon\right)=0\).

The optimality of the CEF can be confirmed by “guess-and-verify.” For an arbitrary \(g\left(x\right)\), the MSE can be decomposed into three terms \[\begin{aligned} & & E\left[\left(y-g\left(x\right)\right)^{2}\right]\\ & = & E\left[\left(y-m(x)+m(x)-g(x)\right)^{2}\right]\\ & = & E\left[\left(y-m\left(x\right)\right)^{2}\right]+2E\left[\left(y-m\left(x\right)\right)\left(m\left(x\right)-g\left(x\right)\right)\right]+E\left[\left(m\left(x\right)-g\left(x\right)\right)^{2}\right].\end{aligned}\] The first term does not involve \(g\left(x\right)\). The second term \[\begin{aligned}2E\left[\left(y-m\left(x\right)\right)\left(m\left(x\right)-g\left(x\right)\right)\right] & =2E\left[\epsilon\left(m\left(x\right)-g\left(x\right)\right)\right]=0\end{aligned}\] by invoking (\[eq:uncorr\]) with \(h\left(x\right)=m\left(x\right)-g\left(x\right)\), so it vanishes regardless of the choice of \(g\left(x\right)\). The third term, obviously, is minimized at \(g\left(x\right)=m\left(x\right)\).
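
The decomposition can be checked numerically. The sketch below is purely illustrative: the sine-shaped CEF and the competing linear predictor are my own assumptions, chosen so that \(m\left(x\right)\) is known by construction.

```python
# A minimal simulation sketch (illustrative DGP) of the MSE decomposition:
# the cross term is ~0 and the total MSE exceeds the CEF's MSE by E[(m-g)^2].
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.uniform(-2, 2, size=n)
m = np.sin(x)                          # m(x) = E[y|x] by construction
y = m + rng.normal(scale=0.5, size=n)  # y = m(x) + epsilon with E[epsilon|x] = 0
g = 0.8 * x                            # an arbitrary competing predictor

term1 = np.mean((y - m) ** 2)          # irreducible error
cross = 2 * np.mean((y - m) * (m - g)) # close to 0 by the uncorrelatedness property
term3 = np.mean((m - g) ** 2)          # penalty for deviating from the CEF

print(np.mean((y - g) ** 2), term1 + cross + term3)  # the two agree
print("cross term:", cross)                          # approximately 0
```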

Our perspective so far deviates from many econometric textbooks that assume that the dependent variable \(y\) is generated as \(g\left(x\right)+\epsilon\) for some unknown function \(g\left(\cdot\right)\) and error term \(\epsilon\) such that \(E\left[\epsilon|x\right]=0\). Instead, we take a predictive approach regardless of the DGP. What we observe are \(y\) and \(x\), and we are solely interested in seeking a function \(g\left(x\right)\) to predict \(y\) as accurately as possible under the MSE criterion.

3.1 Linear Projection

The CEF \(m(x)\) is the function that minimizes the MSE. However, \(m\left(x\right)=E\left[y|x\right]\) is a complex function of \(x\), for it depends on the joint distribution of \(\left(y,x\right)\), which is mostly unknown in practice. Now let us make the prediction task even simpler. What if we minimize the MSE within all linear functions of the form \(h\left(x\right)=h\left(x;b\right)=x'b\) for \(b\in\mathbb{R}^{K}\)? The minimization problem is \[\min_{b\in\mathbb{R}^{K}}E\left[\left(y-x'b\right)^{2}\right].\label{eq:linear_MSE}\] Take the first-order condition of the MSE \[\frac{\partial}{\partial b}E\left[\left(y-x'b\right)^{2}\right]=E\left[\frac{\partial}{\partial b}\left(y-x'b\right)^{2}\right]=-2E\left[x\left(y-x'b\right)\right],\] where the first equality holds if \(E\left[\left(y-x'b\right)^{2}\right]<\infty\) so that the expectation and partial differentiation are interchangeable, and the second equality holds by the chain rule and the linearity of expectation. Setting the first-order condition to 0, we solve \[\beta=\arg\min_{b\in\mathbb{R}^{K}}E\left[\left(y-x'b\right)^{2}\right]\] in closed form as \[\beta=\left(E\left[xx'\right]\right)^{-1}E\left[xy\right]\] if \(E\left[xx'\right]\) is invertible. Notice here that \(b\) is an arbitrary \(K\)-vector, while \(\beta\) is the optimizer. The function \(x'\beta\) is called the best linear projection (BLP) of \(y\) on \(x\), and the vector \(\beta\) is called the linear projection coefficient.
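
In practice the population moments are unknown, but they can be approximated by sample averages. The following sketch, with a hypothetical DGP of my own choosing, computes the sample analogue of \(\beta=\left(E\left[xx'\right]\right)^{-1}E\left[xy\right]\).

```python
# A minimal sketch (hypothetical DGP): approximate the projection coefficient
# beta = (E[x x'])^{-1} E[x y] by replacing population moments with sample means.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x1 = rng.normal(size=n)
x = np.column_stack([np.ones(n), x1, x1 ** 2])        # regressors: constant, x1, x1^2
y = 0.5 + 1.0 * x1 - 0.3 * x1 ** 2 + rng.normal(size=n)

Exx = x.T @ x / n          # sample analogue of E[x x']
Exy = x.T @ y / n          # sample analogue of E[x y]
beta = np.linalg.solve(Exx, Exy)
print(beta)                # close to (0.5, 1.0, -0.3)
```

Note that including \(x_{1}^{2}\) as a regressor already illustrates the next point: the projection is linear in the coefficients, not necessarily in the original variables.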

The linear function is not as restrictive as one might think. It can be used to produce nonlinear (in the random variables) effects if we re-define \(x\). For example, if \[y=x_{1}\beta_{1}+x_{2}\beta_{2}+x_{1}^{2}\beta_{3}+e,\] then \(\frac{\partial}{\partial x_{1}}m\left(x_{1},x_{2}\right)=\beta_{1}+2x_{1}\beta_{3}\), which is nonlinear in \(x_{1}\), while the model is still linear in the parameter \(\beta=\left(\beta_{1},\beta_{2},\beta_{3}\right)\) if we define a set of new regressors as \(\left(\tilde{x}_{1},\tilde{x}_{2},\tilde{x}_{3}\right)=\left(x_{1},x_{2},x_{1}^{2}\right)\).

If \(\left(y,x\right)\) is jointly normal in the form \[\begin{pmatrix}y\\ x \end{pmatrix}\sim\mathrm{N}\left(\begin{pmatrix}\mu_{y}\\ \mu_{x} \end{pmatrix},\begin{pmatrix}\sigma_{y}^{2} & \rho\sigma_{y}\sigma_{x}\\ \rho\sigma_{y}\sigma_{x} & \sigma_{x}^{2} \end{pmatrix}\right)\] where \(\rho\) is the correlation coefficient, then \[E\left[y|x\right]=\mu_{y}+\rho\frac{\sigma_{y}}{\sigma_{x}}\left(x-\mu_{x}\right)=\left(\mu_{y}-\rho\frac{\sigma_{y}}{\sigma_{x}}\mu_{x}\right)+\rho\frac{\sigma_{y}}{\sigma_{x}}x\] is a linear function of \(x\). In this example, the CEF is linear.
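
For concreteness, take the hypothetical numbers \(\mu_{y}=2\), \(\mu_{x}=1\), \(\sigma_{y}=2\), \(\sigma_{x}=1\) and \(\rho=0.5\). Then \[E\left[y|x\right]=2+0.5\cdot\frac{2}{1}\left(x-1\right)=1+x,\] a straight line with slope \(\rho\sigma_{y}/\sigma_{x}=1\).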

Even though in general \(m\left(x\right)\neq x'\beta\), the linear form \(x'\beta\) is still useful in approximating \(m\left(x\right)\). That is, \(\beta=\arg\min\limits _{b\in\mathbb{R}^{K}}E\left[\left(m(x)-x'b\right)^{2}\right]\).

The first-order condition gives \(\frac{\partial}{\partial b}E\left[\left(m(x)-x'b\right)^{2}\right]=-2E[x(m(x)-x'b)]=0\). Rearrange the terms and obtain \(E[x\cdot m(x)]=E[xx']b\). When \(E[xx']\) is invertible, we solve \[\left(E\left[xx'\right]\right){}^{-1}E[x\cdot m(x)]=\left(E\left[xx'\right]\right){}^{-1}E[E[xy|x]]=\left(E\left[xx'\right]\right){}^{-1}E[xy]=\beta.\] Thus \(\beta\) is also the best linear approximation to \(m\left(x\right)\) under MSE.

We may rewrite the linear regression model, or the linear projection model, as \[\begin{array}[t]{c} y=x'\beta+e\\ E[xe]=0, \end{array}\] where \(e=y-x'\beta\) is called the linear projection error, to be distinguished from \(\epsilon=y-m(x).\)

Exercise: Show (a) \(E\left[xe\right]=0\); (b) if \(x\) contains a constant, then \(E\left[e\right]=0\).

3.1.1 Omitted Variable Bias

We write the long regression as \[y=x_{1}'\beta_{1}+x_{2}'\beta_{2}+\beta_{3}+e_{\beta},\] and the short regression as \[y=x_{1}'\gamma_{1}+\gamma_{2}+e_{\gamma},\] where \(e_{\beta}\) and \(e_{\gamma}\) are the respective projection errors. If \(\beta_{1}\) in the long regression is the parameter of interest, omitting \(x_{2}\) as in the short regression will result in omitted variable bias (meaning \(\gamma_{1}\neq\beta_{1}\)) unless \(x_{1}\) and \(x_{2}\) are uncorrelated.

We first demean all the variables in the two regressions, which is equivalent to projecting out the effect of the constant. The long regression becomes \[\tilde{y}=\tilde{x}_{1}'\beta_{1}+\tilde{x}_{2}'\beta_{2}+\tilde{e}_{\beta},\] and the short regression becomes \[\tilde{y}=\tilde{x}_{1}'\gamma_{1}+\tilde{e}_{\gamma},\] where a tilde denotes a demeaned variable.

Exercise: Show \(\tilde{e}_{\beta}=e_{\beta}\) and \(\tilde{e}_{\gamma}=e_{\gamma}\).

After demeaning, the cross moment equals the covariance. The short regression coefficient \[\begin{aligned}\gamma_{1} & =\left(E\left[\tilde{x}_{1}\tilde{x}_{1}'\right]\right)^{-1}E\left[\tilde{x}_{1}\tilde{y}\right]\\ & =\left(E\left[\tilde{x}_{1}\tilde{x}_{1}'\right]\right)^{-1}E\left[\tilde{x}_{1}\left(\tilde{x}_{1}'\beta_{1}+\tilde{x}_{2}'\beta_{2}+\tilde{e}_{\beta}\right)\right]\\ & =\left(E\left[\tilde{x}_{1}\tilde{x}_{1}'\right]\right)^{-1}E\left[\tilde{x}_{1}\tilde{x}_{1}'\right]\beta_{1}+\left(E\left[\tilde{x}_{1}\tilde{x}_{1}'\right]\right)^{-1}E\left[\tilde{x}_{1}\tilde{x}_{2}'\right]\beta_{2}\\ & =\beta_{1}+\left(E\left[\tilde{x}_{1}\tilde{x}_{1}'\right]\right)^{-1}E\left[\tilde{x}_{1}\tilde{x}_{2}'\right]\beta_{2}, \end{aligned}\] where the third line holds as \(E\left[\tilde{x}_{1}\tilde{e}_{\beta}\right]=0\). Therefore, \(\gamma_{1}=\beta_{1}\) if and only if \(E\left[\tilde{x}_{1}\tilde{x}_{2}'\right]\beta_{2}=0\), which demands either \(E\left[\tilde{x}_{1}\tilde{x}_{2}'\right]=0\) or \(\beta_{2}=0\).
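
The bias formula can be illustrated by simulation. In the sketch below all coefficients and the correlation between the regressors are hypothetical numbers; the variables are constructed with zero means so that no demeaning is needed.

```python
# A minimal sketch (hypothetical numbers) of the omitted variable bias formula:
# gamma1 = beta1 + (E[x1 x1'])^{-1} E[x1 x2'] beta2, here with scalar regressors.
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
beta1, beta2 = 1.0, 2.0
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)                 # x1 and x2 are correlated
y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)   # zero-mean variables, no constant

gamma1_short = np.mean(x1 * y) / np.mean(x1 ** 2)  # short regression slope
bias = np.mean(x1 * x2) / np.mean(x1 ** 2) * beta2 # predicted bias term
print(gamma1_short, beta1 + bias)                  # the two agree
```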

Exercise: Show that \(E\left[\left(y-x_{1}'\beta_{1}-x_{2}'\beta_{2}-\beta_{3}\right)^{2}\right]\leq E\left[\left(y-x_{1}'\gamma_{1}-\gamma_{2}\right)^{2}\right]\).

Obviously we prefer to run the long regression to attain \(\beta_{1}\) if possible, for it is a more general model than the short regression and achieves no larger variance in the projection error. However, sometimes \(x_{2}\) is unobservable, so the long regression is unavailable. This example of omitted variable bias is ubiquitous in applied econometrics. Ideally we would like to observe certain regressors directly, but in reality we do not have them at hand. We should be aware of the potential consequences when the data are not as ideal as we would wish. When only the short regression is available, in some cases we are able to sign the bias, meaning that we can argue whether \(\gamma_{1}\) is bigger or smaller than \(\beta_{1}\) based on our knowledge.

3.2 Causality

3.2.1 Structure and Identification

Unlike physical laws such as Einstein’s mass–energy equivalence \(E=mc^{2}\) and Newton’s universal gravitation \(F=Gm_{1}m_{2}/r^{2}\), economic phenomena can rarely be summarized in such a minimalistic style. When using experiments to verify physical laws, scientists often manage to come up with smart designs in which the signal-to-noise ratio is so high that small disturbances are kept at a negligible level. In contrast, economic behavior rarely fits into a laboratory for experimentation. What is worse, the subjects in economic studies — human beings — are heterogeneous, with many features that are hard to control. People from distinctive cultural and family backgrounds respond to the same issue differently, and researchers can do little to homogenize them. As a result, the signal-to-noise ratio in economic relationships is often much lower than in physical laws, mainly due to the lack of a laboratory setting and the heterogeneous nature of the subjects.

The return to education and the demand-supply system are two classical topics in econometrics. A person’s income is determined by so many random factors along the academic and career path that it is impossible to exhaustively observe and control them. The observable prices and quantities are equilibrium outcomes, so demand and supply affect each other.

Generations of thinkers have been debating the definitions of causality. In economics, an accepted definition is structural causality. Structural causality is a thought experiment. It assumes that there is a DGP that produces the observational data. If we can use data to recover the DGP or some features of the DGP, then we have learned causality or some implications of causality.

A key issue to resolve before looking at the realized sample is identification. We say a model or DGP is identified if each possible parameter of the model under consideration generates distinctive features of the observable data. A model is under-identified if more than one parameter in the model can generate exactly the same features of the observable data. In other words, a model is under-identified if from the observable data we cannot trace back to a unique parameter in the model. A correctly specified model is the prerequisite for any discussion of identification. In reality, all models are wrong. Thus when talking about identification, we indulge in an imaginary world. If in such a thought experiment we still cannot uniquely distinguish the true parameter of the data generating process, then identification fails: we cannot determine the true model no matter how large the sample is.

3.2.2 Treatment Effect

We narrow down to the framework of the relationship between \(y\) and \(x\). One question of particular interest is the treatment effect: how much \(y\) would change if we changed a variable of interest, say \(d\), by one unit while keeping all other variables (including the unobservable ones) the same. The Latin phrase ceteris paribus means “keeping all other things constant.”

During the 2020 COVID-19 pandemic, Hong Kong’s unemployment rate rose to a high level and consumption collapsed. In order to boost the economy, some Hong Kong residents were eligible to receive a 10,000 HKD cash allowance from the government. We are interested in learning how much the 10,000 HKD allowance increases people’s consumption. For an individual, we imagine two parallel worlds: one with the cash allowance and one without. The difference between the consumption in the world with the allowance, denoted \(Y\left(1\right)\), and that in the world without the allowance, denoted \(Y\left(0\right)\), is the treatment effect for that particular person. This thought experiment is called the potential outcome framework.

However, in reality one and only one scenario happens, which echoes the saying of the ancient Greek philosopher Heraclitus (c. 535 BC--475 BC): “You cannot step into the same river twice.” The individual treatment effect is not operational (operational means it can be computed from data at the population level), because one and only one outcome is realized. With many people available, we can define the average treatment effect (ATE) as \[ATE=E\left[Y\left(1\right)-Y\left(0\right)\right]=E\left[Y\left(1\right)\right]-E\left[Y\left(0\right)\right].\] Notice that \(E\left[Y\left(1\right)\right]\) and \(E\left[Y\left(0\right)\right]\) are still not operational before we observe a companion variable \[D=1\left\{ \mbox{treatment received}\right\} .\] Once each individual’s treatment status is observable, \(E\left[Y\left(1\right)|D=1\right]\) and \(E\left[Y\left(0\right)|D=0\right]\) are operational from the data.

If the two potential outcomes \(\left(Y\left(1\right),Y\left(0\right)\right)\) are independent of the assignment \(D\), then \(E\left[Y\left(1\right)\right]=E\left[Y\left(1\right)|D=1\right]\) and \(E\left[Y\left(0\right)\right]=E\left[Y\left(0\right)|D=0\right]\), so that the ATE can be estimated from the data in an operational way as \[ATE=E\left[Y\left(1\right)|D=1\right]-E\left[Y\left(0\right)|D=0\right].\] Therefore, to evaluate the ATE we would ideally like to use a lottery to randomly decide that some people receive the treatment (the treatment group, with \(D=1\)) and the others do not (the control group, with \(D=0\)).
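
A minimal simulation sketch of this lottery logic, with hypothetical potential outcomes: under random assignment the difference of the two group means recovers the ATE.

```python
# A minimal sketch (hypothetical potential outcomes): with random assignment,
# the difference of group means is an operational estimate of the ATE.
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
y0 = rng.normal(loc=1.0, size=n)                     # potential outcome without treatment
y1 = y0 + rng.normal(loc=2.0, scale=0.5, size=n)     # treatment adds 2 on average
d = rng.integers(0, 2, size=n)                       # lottery: random assignment
y = np.where(d == 1, y1, y0)                         # only one outcome is observed

ate_true = np.mean(y1 - y0)                          # infeasible with real data
ate_hat = y[d == 1].mean() - y[d == 0].mean()        # operational difference of means
print(ate_true, ate_hat)                             # both close to 2
```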

When we have other control variables, we can also define a finer treatment effect conditional on \(x\): \[ATE\left(x\right)=E\left[Y\left(1\right)|x\right]-E\left[Y\left(0\right)|x\right].\] It is the average effect in the population of individuals if we hypothetically gave them the treatment, keeping all other factors \(x\) constant. If, conditional on \(x\), the treatment \(D\) is independent of \(\left(Y\left(1\right),Y\left(0\right)\right)\), then \(ATE\left(x\right)\) becomes operational: \[ATE\left(x\right)=E\left[Y\left(1\right)|D=1,x\right]-E\left[Y\left(0\right)|D=0,x\right].\] The important condition \(\left(\left(Y\left(1\right),Y\left(0\right)\right)\perp D\right)|x\) is called the conditional independence assumption (CIA).

CIA is more plausible than full independence. Consider the example \(Y\left(1\right)=x+u\left(1\right)\), \(Y\left(0\right)=x+u\left(0\right)\), and \(D=1\left\{ x+u_{d}\geq0\right\}\). If \(\left(\left(u\left(0\right),u\left(1\right)\right)\perp u_{d}\right)|x\), then CIA is satisfied. Nevertheless, \(\left(Y\left(1\right),Y\left(0\right)\right)\) and \(D\) are statistically dependent, since \(x\) enters all of these random variables.
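
The following sketch simulates this example with illustrative numbers (a binary \(x\) to ease the conditioning): the naive difference of means is biased because \(D\) and the potential outcomes share \(x\), while within each value of \(x\) the difference of means recovers the treatment effect.

```python
# A minimal sketch of the CIA example above (all numbers are illustrative):
# D depends on x, so (Y(1), Y(0)) and D are dependent unconditionally, yet
# conditional on x the assignment is as good as random.
import numpy as np

rng = np.random.default_rng(5)
n = 500_000
x = rng.choice([-1.0, 1.0], size=n)            # binary covariate for easy conditioning
u0 = rng.normal(size=n)
u1 = rng.normal(size=n) + 1.0                  # true ATE(x) = 1 for every x
ud = rng.normal(size=n)
y0, y1 = x + u0, x + u1
d = (x + ud >= 0).astype(int)
y = np.where(d == 1, y1, y0)

naive = y[d == 1].mean() - y[d == 0].mean()    # biased: D is correlated with x
by_x = [y[(d == 1) & (x == v)].mean() - y[(d == 0) & (x == v)].mean()
        for v in (-1.0, 1.0)]
print(naive, by_x)                             # naive != 1, each stratum close to 1
```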

3.2.3 ATE and CEF

In the previous section the treatment \(D\) is binary. Now we consider a continuous treatment \(D\). Suppose the DGP, or the structural model, is \(Y=h\left(D,x,u\right)\) where \(D\) and \(x\) are observable and \(u\) is unobservable. It is natural to define the ATE with a continuous treatment (Hansen’s book Chapter 2.30 calls it the average causal effect) as \[ATE\left(d,x\right)=E\left[\lim_{\Delta\to0}\frac{h\left(d+\Delta,x,u\right)-h\left(d,x,u\right)}{\Delta}\right]=E\left[\frac{\partial}{\partial d}h\left(d,x,u\right)\right],\] where the continuous differentiability of \(h\left(d,x,u\right)\) at \(d\) is implicitly assumed. Unlike in the binary treatment case, here \(d\) explicitly shows up in \(ATE\left(d,x\right)\) because the effect can vary at different values of \(d\). The ATE here is the average effect in the population of individuals if we hypothetically move \(D\) a tiny bit around \(d\), keeping all other factors \(x\) constant.

In the previous sections, we focused on the CEF \(m\left(d,x\right)\), where \(d\) is added to \(x\) as an additional variable of interest. We did not intend to model the underlying economic mechanism \(h\left(D,x,u\right)\), which may be very complex. Can we learn \(ATE\left(d,x\right)\), which bears the structural causal interpretation, from the mechanical \(m\left(d,x\right)\), which merely cares about the best prediction? The answer is positive under CIA: \(\left(u\perp D\right)|x\). \[\begin{aligned} \frac{\partial}{\partial d}m\left(d,x\right) & =\frac{\partial}{\partial d}E\left[y|d,x\right]=\frac{\partial}{\partial d}E\left[h\left(d,x,u\right)|d,x\right]=\frac{\partial}{\partial d}\int h\left(d,x,u\right)f\left(u|d,x\right)du\\ & =\int\frac{\partial}{\partial d}\left[h\left(d,x,u\right)f\left(u|d,x\right)\right]du\\ & =\int\left[\frac{\partial}{\partial d}h\left(d,x,u\right)\right]f\left(u|d,x\right)du+\int h\left(d,x,u\right)\left[\frac{\partial}{\partial d}f\left(u|d,x\right)\right]du,\end{aligned}\] where the second line implicitly assumes the interchangeability of the integral and the partial derivative. Under CIA, \(\frac{\partial}{\partial d}f\left(u|d,x\right)=0\) and the second term drops out. Thus \[\frac{\partial}{\partial d}m\left(d,x\right)=\int\left[\frac{\partial}{\partial d}h\left(d,x,u\right)\right]f\left(u|d,x\right)du=E\left[\frac{\partial}{\partial d}h\left(d,x,u\right)\right]=ATE\left(d,x\right).\] This is an important result: if CIA holds, we can learn the causal effect of \(d\) on \(y\) from the partial derivative of the CEF, conditional on \(x\). In particular, if we further assume a linear CEF \(m\left(d,x\right)=\beta_{d}d+\beta_{x}'x\), then the causal effect is the coefficient \(\beta_{d}\).

CIA is the key condition that links the CEF and the causal effect. CIA is not an innocuous assumption. In applications, our causal results are credible only when we can convincingly defend CIA.

Let factories’ output be given by a Cobb-Douglas function \(Y=AK^{\alpha}L^{\beta}\), where the capital level \(K\), labor \(L\), and output \(Y\) are observable, while the “technology” \(A\) is unobservable. Take the logarithm of both sides of the equation: \[y=u+\alpha k+\beta l\label{eq:causal}\] where \(y=\log Y\), \(u=\log A\), \(k=\log K\) and \(l=\log L\). Suppose the true DGP is given by \(\begin{pmatrix}u\\ k\\ l \end{pmatrix}\sim N\left(\begin{pmatrix}1\\ 1\\ 1 \end{pmatrix},\begin{pmatrix}1 & 0.5 & 0\\ 0.5 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}\right)\) and \(\alpha=\beta=1/2\). Here \(u\) and \(k\) are correlated, because factories of larger scale can afford robots to facilitate automation.

  1. What is the partial derivative of CEF when we use \(k\) as a treatment variable for a fixed labor level \(l\)? (Hint: the CEF is a linear function thanks to the joint normality.)

  2. Does it coincide with \(\alpha=1/2\), the coefficient in the causal model (\[eq:causal\])? (Hint: No, because CIA is violated.)
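
A minimal simulation sketch for the exercise above, using only the DGP stated there: the linear CEF coefficient on \(k\) can be approximated by a least squares fit on a large simulated sample and compared with \(\alpha=1/2\).

```python
# A minimal sketch (DGP as stated in the exercise): fit the linear CEF of y on
# (k, l) and compare the coefficient on k with alpha = 0.5.
import numpy as np

rng = np.random.default_rng(6)
n = 500_000
mean = np.array([1.0, 1.0, 1.0])
cov = np.array([[1.0, 0.5, 0.0],
                [0.5, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
u, k, l = rng.multivariate_normal(mean, cov, size=n).T
y = u + 0.5 * k + 0.5 * l                       # causal model with alpha = beta = 1/2

X = np.column_stack([np.ones(n), k, l])
coef = np.linalg.solve(X.T @ X, X.T @ y)        # linear CEF coefficients
print(coef[1])                                  # differs from 0.5: u and k are correlated
```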

Sometimes applied researchers assume by brute force that \(y=m\left(d,x\right)+u\) is the DGP and \(E\left[u|d,x\right]=0\), where \(d\) is the variable of interest and \(x\) is the vector of other control variables. Under these assumptions, \[ATE\left(d,x\right)=E\left[\frac{\partial}{\partial d}\left(m\left(d,x\right)+u\right)|d,x\right]=\frac{\partial m\left(d,x\right)}{\partial d}+\frac{\partial}{\partial d}E\left[u|d,x\right]=\frac{\partial m\left(d,x\right)}{\partial d},\] where the second equality holds if \(\frac{\partial}{\partial d}E\left[u|d,x\right]=E\left[\frac{\partial}{\partial d}u|d,x\right]\). At first glance, it seems that the mean independence assumption \(E\left[u|d,x\right]=0\), which is weaker than CIA, implies the equivalence between \(ATE\left(d,x\right)\) and \(\partial m\left(d,x\right)/\partial d\) here. However, such a slight weakening is achieved by the very strong assumption that the DGP \(h\left(d,x,u\right)\) takes the additively separable form \(m\left(d,x\right)+u\). Without economic theory to defend the choice of the assumed DGP \(y=m\left(d,x\right)+u\), this is at best a reduced-form approach.

The structural approach here models the economic mechanism, guided by economic theory. The reduced-form approach is convenient and can document stylized facts when suitable economic theory is not immediately available. There are constant debates about the pros and cons of the two approaches; see Journal of Economic Perspectives Vol. 24, No. 2, Spring 2010. In macroeconomics, the so-called Phillips curve, attributed to A.W. Phillips and describing the negative correlation between inflation and unemployment, is a stylized fact learned from the reduced-form approach. The Lucas critique (Lucas 1976) exposed its lack of microfoundation and advocated modeling deep parameters that are invariant to policy changes; the latter is a structural approach. Ironically, more than 40 years have passed since the Lucas critique, yet equations with little microfoundation still dominate the analytical apparatus of central bankers.

3.3 Summary

In this lecture, we cover the conditional expectation function and causality. When we are faced with a pair of random variables \(\left(y,x\right)\) drawn from some joint distribution, the CEF is the best predictor under the MSE criterion. When we go further into the structural causality of some treatment \(d\) on the dependent variable \(y\), under CIA we can establish the equivalence between the ATE and the partial derivative of the CEF. All analyses are conducted at the population level. We have not touched the sample yet.

Historical notes: Regressions and conditional expectations are concepts from statistics, and they were imported into econometrics early on. Researchers at the Cowles Commission (now the Cowles Foundation for Research in Economics) — Jacob Marschak (1898–1977), Tjalling Koopmans (1910–1985, Nobel Prize 1975), Trygve Haavelmo (1911–1999, Nobel Prize 1989) and their colleagues — were trailblazers of the econometric structural approach.

The potential outcome framework is not peculiar to economics. It is widely used in other fields such as biostatistics and medical studies. It was initiated by Jerzy Neyman (1894–1981) and extended by Donald B. Rubin (1943– ), Professor of Statistics at Tsinghua University.

Further reading: Lewbel (2019) offers a comprehensive summary of identification in econometrics. Accounting is an applied field with many claimed causal inferences drawn from simple regressions; it is encouraging to see Gow, Larcker, and Reiss (2016) reflect on causality in their practices.

Zhentao Shi. Sep 17, 2020