Instrumental Variables

Zhentao Shi

Nov 15, 2021

Endogeneity

Consider the simple regression

\[ y_t = \beta_0 + \beta_1 x_{t} + u_t \]

A necessary condition for consistency is \(E[x_t u_t] = 0\) (orthogonality)
If \(\beta_1\) is the linear projection coefficient, by definition orthogonality automatically holds
If \(\beta_1\) is a causal coefficient, orthogonality may be violated
If the regressor \(x_{t}\) is correlated with \(u_t\), we say \(x_{t}\) is endogenous

Sources of endogeneity

Error in variables
- Phillips curve describes the relationship between unemployment and expected inflation. However, expected inflation is unobservable
- Realized inflation as proxy
Omitted variables
- The theoretical model of CAPM has only the market factor on the right-hand side
- Factor zoo
Simultaneity
- Demand and supply
- Ubiquitous in corporate finance: e.g., Venture capital and start-ups

Consequence of endogeneity

When an endogenous variable is present in a linear regression, OLS is inconsistent because the numerator on the right-hand side below, the sample covariance between \(x_t\) and \(u_t\), does not vanish in probability

\[ \hat{\beta}_{1, OLS} - \beta_1 = \frac{\hat{cov}[x_t, u_t]}{\hat{var}[{x_t}]} \stackrel{p}{\nrightarrow} 0 \]

In general

\[ \hat{\boldsymbol{\beta}}_{OLS} - \boldsymbol{\beta} = \left(\frac{\mathbf{X}'\mathbf{X}}{T}\right)^{-1} \frac{\mathbf{X}'\mathbf{u}}{T} \stackrel{p}{\nrightarrow} \mathbf{0}_p \]
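The inconsistency can be seen in a small simulation. The sketch below is illustrative only: the data-generating process, the true slope of 0.5 and the helper name `ols_slope` are assumptions, not part of the original example. By construction \(cov[x_t, u_t] = 1\) and \(var[x_t] = 2\), so the OLS slope should settle near \(0.5 + 1/2 = 1\) rather than 0.5, however large the sample.

```r
# Simulated illustration; every number here is an assumption, not from the slides
set.seed(2021)
beta1 <- 0.5                  # true causal slope

ols_slope <- function(n) {
  e <- rnorm(n)               # shock shared by x and u
  x <- e + rnorm(n)           # var(x) = 2
  u <- e + rnorm(n)           # cov(x, u) = 1, so x is endogenous
  y <- 1 + beta1 * x + u
  cov(x, y) / var(x)          # OLS slope in the simple regression
}

# OLS slope for increasing sample sizes: it does not approach 0.5
b_ols <- sapply(c(100, 10000, 1000000), ols_slope)
print(b_ols)
```

A larger sample reduces the sampling noise but leaves the gap between the estimate and the true slope unchanged.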

Instrumental variables

We call a variable \(z_t\) an instrumental variable for \(x_{t}\) if
- Exclusion: \(z_t\) is outside of the regression equation
- Relevance condition: \(cov[x_t, z_t] \neq 0\)
- Orthogonality condition: \(cov[u_t, z_t] = 0\)
An instrumental variable is called an instrument, or IV for short

Identifying the parameter of interest

Orthogonality implies

\[ \begin{align} 0 & = cov[u_t, z_t] \\ & = cov[y_t - \beta_0 - \beta_1 x_t, z_t] \\ & = cov[y_t - \beta_1 x_t, z_t] \\ & = cov[y_t, z_t] - \beta_1 cov[ x_t, z_t] \end{align} \]

Rearrange the above equation:

\[ \beta_1 = \frac{cov[y_t, z_t]} { cov[ x_t, z_t]} \]

if the denominator \(cov[ x_t, z_t]\neq 0\) (relevance)

Source of instruments

In microeconometrics, credible instruments are rare
Exclusion and relevance are at odds
Knowledge of the mechanisms
- Acemoglu, Johnson and Robinson (2001, AER): “The Colonial Origins of Comparative Development: An Empirical Investigation”. Institution -> economic growth. IV: early settlers’ mortality rates
- Miguel, Satyanath and Sergenti (2004, JPE): “Economic Shocks and Civil Conflict: An Instrumental Variables Approach”. Agriculture crises -> civil conflict. IV: rainfall
Economic structures
- Ross and Shi (2021): peer effects in college roommates. IV: 1st year’s roommates who no longer share the dormitory in the 2nd year
Time series lags
- CAPM \(y_t = \beta_0 + \beta_1 x_{t} + u_t\), and \(x_t = \gamma_0 + \gamma_1 x_{t-1} + v_t\). Endogeneity comes from missing factors. Potential IV: \(x_{t-1}\). (Demonstrative example in HMPY Chapter 8.3)
- Dynamic panel data regression (HMPY Chapter 11.5)

Estimation

Consistent estimation can be achieved by replacing the population covariance with the sample covariance
Method of moments

\[ \hat{\beta}_1 = \frac{\hat{cov}[y_t, z_t]} {\hat{cov}[ x_t, z_t]} \]

Valid only for the simple regression
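A minimal sketch of the method-of-moments estimator on simulated data follows. The data-generating process (one instrument `z`, a common shock `e` that makes `x` endogenous, true slope 0.5) is an assumption chosen for illustration.

```r
# Simulated illustration; the DGP is an assumption, not from the slides
set.seed(2021)
n <- 100000
z <- rnorm(n)                        # instrument: relevant and orthogonal to u
e <- rnorm(n)                        # common shock that creates endogeneity
x <- 1 + 0.8 * z + e + rnorm(n)
u <- e + rnorm(n)
y <- 2 + 0.5 * x + u                 # true beta_1 = 0.5

b_iv  <- cov(y, z) / cov(x, z)       # method-of-moments (IV) estimator
b_ols <- cov(y, x) / var(x)          # OLS slope, for comparison
print(c(iv = b_iv, ols = b_ols))
```

In this design the IV estimate should be close to 0.5, while OLS is pulled upward by the positive covariance between `x` and `u`.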

Structural equation and reduced-form equation

Structural equation

\[ y_t = \beta_0 + \beta_1 x_{t} + u_t \]

Reduced-form equation

\[ x_t = \gamma_0 + \gamma_1 z_{t} + v_t \]

The relevance condition entails \(\gamma_1\neq 0\)

Two stage least squares (2SLS)

One of the most popular estimators in econometrics
In order to consistently estimate \(\beta_1\), conduct the following two steps

1. Run OLS on the reduced-form equation and save the fitted value \(\hat{x}_t = \hat{\gamma}_0 + \hat{\gamma}_1 z_t\)
2. In the structural equation, replace the endogenous variable \(x_t\) with its fitted value \(\hat{x}_t\) from the 1st stage, then regress \(y_t\) on \(\hat{x}_t\) by OLS. The estimated coefficient on \(\hat{x}_t\) is consistent for \(\beta_1\)

Caution: the OLS variance estimate from the 2nd stage is invalid, because its residuals are computed with \(\hat{x}_t\) instead of \(x_t\). The 2SLS S.E. has its own formula
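The two steps and the caution about standard errors can be sketched on simulated data. The data-generating process is an assumption. The corrected standard error uses the usual homoskedastic formula, with residuals built from the original \(x_t\), which is offered here as one common choice rather than the only one.

```r
# Simulated illustration; the DGP is an assumption, not from the slides
set.seed(2021)
n <- 5000
z <- rnorm(n)
e <- rnorm(n)
x <- 1 + 0.8 * z + e + rnorm(n)
y <- 2 + 0.5 * x + e + rnorm(n)      # true beta_1 = 0.5

# 1st stage: regress x on z and keep the fitted values
x_hat  <- fitted(lm(x ~ z))
# 2nd stage: regress y on the fitted values
stage2 <- lm(y ~ x_hat)
b      <- unname(coef(stage2))

# Naive S.E.: residuals are y - b0 - b1 * x_hat (invalid for 2SLS)
se_naive <- summary(stage2)$coefficients["x_hat", "Std. Error"]

# 2SLS S.E. under homoskedasticity: residuals use the original x
res_2sls <- y - b[1] - b[2] * x
sigma2   <- sum(res_2sls^2) / (n - 2)
X_hat    <- cbind(1, x_hat)
se_2sls  <- sqrt((sigma2 * solve(crossprod(X_hat)))[2, 2])

print(c(estimate = b[2], se_naive = se_naive, se_2sls = se_2sls))
```

The point estimate is the same in both cases; only the residual variance, and hence the standard error, changes.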

Back to the simple regression

Start from 2SLS

\[ \begin{align} \hat{\beta}_1^{2SLS} & = \frac{\hat{cov}[ y_t, \hat{x}_t]} {\hat{var}[ \hat{x}_t]} \\ & = \frac{\hat{cov}[ y_t, \hat{\gamma}_0 + \hat{\gamma}_1 z_{t}]} {\hat{var}[\hat{\gamma}_0 + \hat{\gamma}_1 z_{t}]}\\ & = \frac{ \hat{\gamma}_1\hat{cov}[ y_t, z_{t}]} {\hat{cov}[ \hat{\gamma}_0+ \hat{\gamma}_1 z_{t} + \hat{v}_t, \hat{\gamma}_0+\hat{\gamma}_1 z_{t}]} \\ & = \frac{ \hat{\gamma}_1\hat{cov}[ y_t, z_{t}]} { \hat{\gamma}_1 \hat{cov}[ x_t, z_{t}]} \\ & = \hat {\beta}_1 \end{align} \]

This verifies the numerical equivalence between the 2SLS and the method-of-moments estimators
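The equivalence can also be checked numerically. The sketch below uses simulated data (the data-generating process is an assumption). It checks both the key step of the derivation, \(\hat{var}[\hat{x}_t] = \hat{cov}[x_t, \hat{x}_t]\), and the final equality.

```r
# Simulated illustration; the DGP is an assumption, not from the slides
set.seed(2021)
n <- 500
z <- rnorm(n)
e <- rnorm(n)
x <- 1 + 0.8 * z + e + rnorm(n)
y <- 2 + 0.5 * x + e + rnorm(n)

x_hat  <- fitted(lm(x ~ z))                       # 1st-stage fitted values
b_2sls <- unname(coef(lm(y ~ x_hat))["x_hat"])    # 2nd-stage slope
b_mm   <- cov(y, z) / cov(x, z)                   # method-of-moments estimator

# Key step: the 1st-stage residual is orthogonal to z in sample
same_denominator <- isTRUE(all.equal(var(x_hat), cov(x, x_hat)))
same_estimate    <- isTRUE(all.equal(b_2sls, b_mm))
print(c(same_denominator = same_denominator, same_estimate = same_estimate))
```

The two estimates agree up to floating-point error in any sample, not only asymptotically.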

Real data example

2SLS
- R package AER::ivreg
- R package ivreg::ivreg (more printed diagnostic outcomes)
Demonstration with familyfirms.csv
- 1999 data of 294 firms
- Dependent variable: firm performance in terms of log of Tobin’s Q
- Endogenous variable: family CEO
- Exogenous variables: firm size, age of firm, volatility
- IV: average age of the founders

d0 <- read.csv("familyfirms.csv", header = TRUE)
d99 <- d0[d0$year=="1999", ] # keep all the 1999 data. 294 firms
head(d99)

##    year company agefirm meanagef assets bs_volatility founderCEO Q digit2_in
## 8  1999    1045      65       95  24374             0          0 1        45
## 16 1999    1078      99       95  14471             0          0 4        28
## 23 1999    1164      31       51  91072             0          0 2        48
## 31 1999    1209      59       95   8236             0          0 1        28
## 38 1999    1213      31       95   1643             0          0 1        45
## 45 1999    1240      41       87  15701             0          0 1        54

OLS vs 2SLS

Observe the difference between the estimated coefficients

## OLS
ols <- lm( 
  log(Q) ~ founderCEO + log(assets) + log(agefirm) + bs_volatility, 
  data = d99 )
print(ols)

## 
## Call:
## lm(formula = log(Q) ~ founderCEO + log(assets) + log(agefirm) + 
##     bs_volatility, data = d99)
## 
## Coefficients:
##   (Intercept)     founderCEO    log(assets)   log(agefirm)  bs_volatility  
##      -0.13446        0.27179        0.08972       -0.02961       -0.05828

## 2sls
tsls <- ivreg::ivreg( 
  formula = log(Q) ~ founderCEO + log(assets) + log(agefirm) + bs_volatility, 
  instruments = ~ meanagef + log(assets) + log(agefirm) + bs_volatility, 
  data = d99 )
print(tsls)

## 
## Call:
## ivreg::ivreg(formula = log(Q) ~ founderCEO + log(assets) + log(agefirm) +     bs_volatility | meanagef + log(assets) + log(agefirm) + bs_volatility,     data = d99)
## 
## Coefficients:
##   (Intercept)     founderCEO    log(assets)   log(agefirm)  bs_volatility  
##      -0.72410        1.07827        0.10187        0.07683       -0.18784

2SLS in two stages

# 1st stage
stage1 <- lm( 
  founderCEO ~ meanagef + log(assets) + log(agefirm) + bs_volatility, 
  data = d99 )
print(stage1)

## 
## Call:
## lm(formula = founderCEO ~ meanagef + log(assets) + log(agefirm) + 
##     bs_volatility, data = d99)
## 
## Coefficients:
##   (Intercept)       meanagef    log(assets)   log(agefirm)  bs_volatility  
##      1.182186      -0.010731      -0.008662      -0.022756      -0.018084

# 2nd stage
CEO_hat <- predict(stage1) # fitted values of the endogenous variable from the 1st stage
stage2 <- lm( 
  log(Q) ~ CEO_hat + log(assets) + log(agefirm) + bs_volatility, 
  data = d99 )
print(summary(stage2))

## 
## Call:
## lm(formula = log(Q) ~ CEO_hat + log(assets) + log(agefirm) + 
##     bs_volatility, data = d99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0233 -0.4972 -0.1004  0.3321  1.9619 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.72410    0.38467  -1.882  0.06079 .  
## CEO_hat        1.07827    0.25586   4.214 3.35e-05 ***
## log(assets)    0.10187    0.03493   2.917  0.00381 ** 
## log(agefirm)   0.07683    0.05410   1.420  0.15665    
## bs_volatility -0.18784    0.13657  -1.375  0.17007    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6086 on 289 degrees of freedom
## Multiple R-squared:  0.08088,    Adjusted R-squared:  0.06816 
## F-statistic: 6.358 on 4 and 289 DF,  p-value: 6.468e-05

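Compare coefficients and S.E. with those of ivreg: the coefficients are identical, but the standard errors differ because the 2nd-stage OLS variance formula is invalid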

summary(tsls)

## 
## Call:
## ivreg::ivreg(formula = log(Q) ~ founderCEO + log(assets) + log(agefirm) + 
##     bs_volatility | meanagef + log(assets) + log(agefirm) + bs_volatility, 
##     data = d99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5630 -0.5003 -0.1148  0.4378  2.3621 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.72410    0.41803  -1.732  0.08431 .  
## founderCEO     1.07827    0.27805   3.878  0.00013 ***
## log(assets)    0.10187    0.03796   2.684  0.00769 ** 
## log(agefirm)   0.07683    0.05879   1.307  0.19233    
## bs_volatility -0.18784    0.14841  -1.266  0.20666    
## 
## Diagnostic tests:
##                  df1 df2 statistic  p-value    
## Weak instruments   1 289     98.89  < 2e-16 ***
## Wu-Hausman         1 288     13.29 0.000317 ***
## Sargan             0  NA        NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6614 on 289 degrees of freedom
## Multiple R-Squared: -0.08547,    Adjusted R-squared: -0.1005 
## Wald test: 5.383 on 4 and 289 DF,  p-value: 0.0003406

General case

2SLS can be used for general multivariate cases
1. OLS the reduced-form regression for each endogenous variable
2. OLS the structural form with each endogenous variable replaced by its corresponding fitted value from the 1st stage
2SLS is asymptotically normal under general conditions

\[ \sqrt{T} (\hat{\boldsymbol{\beta}}_{2SLS} - \boldsymbol{\beta}) \stackrel{d}{\rightarrow} N(\mathbf{0}_p, \Omega) \]

When there are multiple endogenous variables in a regression, a necessary condition for identification is that the number of excluded instruments is no fewer than the number of endogenous variables
- Otherwise, the 2nd stage OLS will suffer perfect collinearity
2SLS is a special case of the generalized method of moments (GMM) (HMPY Chapter 9)
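In matrix form the two steps collapse into \(\hat{\boldsymbol{\beta}}_{2SLS} = (\hat{\mathbf{X}}'\mathbf{X})^{-1}\hat{\mathbf{X}}'\mathbf{y}\), where \(\hat{\mathbf{X}}\) collects the 1st-stage fitted values of all regressors. A minimal sketch with two endogenous variables, three excluded instruments and one exogenous regressor is given below. The design and all coefficient values are assumptions made for illustration.

```r
# Simulated illustration; the DGP is an assumption, not from the slides
set.seed(2021)
n   <- 5000
Zex <- matrix(rnorm(n * 3), n, 3)         # three excluded instruments
w   <- rnorm(n)                           # exogenous regressor
e   <- rnorm(n)                           # common shock creating endogeneity
x1  <- Zex[, 1] + 0.5 * Zex[, 2] + w + e + rnorm(n)
x2  <- Zex[, 2] - 0.5 * Zex[, 3] + w - e + rnorm(n)
y   <- 1 + 0.5 * x1 - 1 * x2 + 2 * w + e + rnorm(n)

X <- cbind(1, x1, x2, w)                  # structural regressors
Z <- cbind(1, Zex, w)                     # all instruments, incl. exogenous regressors

# 1st stage for every column of X at once: X_hat = Z (Z'Z)^{-1} Z'X
X_hat  <- Z %*% solve(crossprod(Z), crossprod(Z, X))
# 2nd stage: (X_hat'X)^{-1} X_hat'y
b_2sls <- drop(solve(crossprod(X_hat, X), crossprod(X_hat, y)))
print(b_2sls)
```

With fewer excluded instruments than endogenous variables, the columns of `X_hat` would be linearly dependent and `solve()` would fail, which is the perfect collinearity mentioned above.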

OLS vs. 2SLS

Endogeneity: \(E[x_t u_t ] \neq 0\)
Exogeneity: \(E[x_t u_t ] = 0\)
OLS is consistent under exogeneity, and inconsistent under endogeneity
Given a valid IV, 2SLS is consistent whether \(x_t\) is endogenous or not
Under exogeneity, OLS is preferred as it’s “BLUE” under classical assumptions
Under endogeneity, 2SLS is preferred thanks to consistency
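The efficiency argument under exogeneity can be illustrated by a small Monte Carlo experiment. The design below (sample size, number of replications, coefficient values) is an assumption. Both estimators are centered near the true slope 0.5, but the IV estimator is noticeably more dispersed.

```r
# Simulated illustration; the DGP is an assumption, not from the slides
set.seed(2021)
n <- 400                                  # sample size per replication
R <- 1000                                 # number of replications

est <- replicate(R, {
  z <- rnorm(n)
  x <- 0.5 * z + rnorm(n)
  y <- 1 + 0.5 * x + rnorm(n)             # u is independent of x: exogeneity
  c(ols = cov(x, y) / var(x),
    iv  = cov(y, z) / cov(x, z))
})

mc_mean <- rowMeans(est)                  # both close to 0.5
mc_sd   <- apply(est, 1, sd)              # OLS has the smaller spread
print(rbind(mean = mc_mean, sd = mc_sd))
```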

Endogeneity test

Null hypothesis: \(E[x_t u_t ] = 0\) (exogeneity)
Alternative hypothesis: \(E[x_t u_t ] \neq 0\) (endogeneity)
In the two-equation system

\[ \begin{align} y_t & = \beta_0 + \beta_1 x_{t} + u_t \\ x_t & = \gamma_0 + \gamma_1 z_{t} + v_t \end{align} \]

\(x_t\) is endogenous if and only if \(E[u_t v_t] \neq 0\)

Idea of testing:
1. Run OLS in the reduced-form equation, save \(\hat{v}_t\)
2. Add \(\hat{v}_t\) as an extra regressor in the structural equation, and test whether its coefficient is 0 or not
Durbin-Wu-Hausman test
- Automatically reported in ivreg::ivreg
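The two-step testing idea can be sketched in base R on simulated data. The data-generating process is an assumption, and the t-test on \(\hat{v}_t\) below is a regression-based version of the test, which may differ in detail from what a particular package reports. A useful by-product is that the coefficient on \(x_t\) in the augmented regression coincides numerically with the 2SLS estimate.

```r
# Simulated illustration; the DGP is an assumption, not from the slides
set.seed(2021)
n <- 2000
z <- rnorm(n)
e <- rnorm(n)
x <- 1 + 0.8 * z + e + rnorm(n)           # v_t = e + noise
y <- 2 + 0.5 * x + e + rnorm(n)           # u_t = e + noise, so E[u_t v_t] != 0

# Step 1: reduced-form regression, save the residual
v_hat <- resid(lm(x ~ z))
# Step 2: add v_hat to the structural equation and test its coefficient
aug   <- lm(y ~ x + v_hat)
p_val <- summary(aug)$coefficients["v_hat", "Pr(>|t|)"]
print(summary(aug)$coefficients["v_hat", ])

# The coefficient on x equals the IV / 2SLS estimate
b_aug <- unname(coef(aug)["x"])
b_iv  <- cov(y, z) / cov(x, z)
```

A small p-value rejects exogeneity, as expected in this design where endogeneity is built in.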

Weak IV

In view of the formula

\[ \hat{\beta}_1 = \frac{\hat{cov}[y_t, z_t]} {\hat{cov}[ x_t, z_t]} \]

the validity of 2SLS counts on \(cov[ x_t, z_t] \neq 0\)

Weak IV, meaning \(cov[ x_t, z_t] \approx 0\), is not uncommon in practice
Solution
1. Always check the significance in the 1st stage
2. If weak, use weak-IV-robust tests, for example, the Anderson-Rubin test (R package ivmodel::AR.test)
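Checking the 1st stage amounts to looking at the F statistic on the instrument. The sketch below compares a strong and a weak instrument on simulated data. The data-generating process and the helper name `first_stage_F` are assumptions. A common rule of thumb treats a 1st-stage F below about 10 as a warning sign.

```r
# Simulated illustration; the DGP is an assumption, not from the slides
set.seed(2021)
n <- 500
z <- rnorm(n)
e <- rnorm(n)
x_strong <- 0.8  * z + e + rnorm(n)       # instrument clearly relevant
x_weak   <- 0.02 * z + e + rnorm(n)       # cov(x, z) close to zero

# F statistic of the 1st-stage regression of x on z (hypothetical helper)
first_stage_F <- function(x, z) {
  unname(summary(lm(x ~ z))$fstatistic["value"])
}

F_strong <- first_stage_F(x_strong, z)
F_weak   <- first_stage_F(x_weak, z)
print(c(strong = F_strong, weak = F_weak))
```

With a single instrument this F statistic is the square of the t statistic on `z` in the 1st stage.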

Screen print

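The Sargan test is available only when we have more IVs than endogenous variables. The model here is just-identified (one IV, one endogenous variable), so the Sargan row shows NA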

summary(tsls)

## 
## Call:
## ivreg::ivreg(formula = log(Q) ~ founderCEO + log(assets) + log(agefirm) + 
##     bs_volatility | meanagef + log(assets) + log(agefirm) + bs_volatility, 
##     data = d99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5630 -0.5003 -0.1148  0.4378  2.3621 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.72410    0.41803  -1.732  0.08431 .  
## founderCEO     1.07827    0.27805   3.878  0.00013 ***
## log(assets)    0.10187    0.03796   2.684  0.00769 ** 
## log(agefirm)   0.07683    0.05879   1.307  0.19233    
## bs_volatility -0.18784    0.14841  -1.266  0.20666    
## 
## Diagnostic tests:
##                  df1 df2 statistic  p-value    
## Weak instruments   1 289     98.89  < 2e-16 ***
## Wu-Hausman         1 288     13.29 0.000317 ***
## Sargan             0  NA        NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6614 on 289 degrees of freedom
## Multiple R-Squared: -0.08547,    Adjusted R-squared: -0.1005 
## Wald test: 5.383 on 4 and 289 DF,  p-value: 0.0003406
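For an over-identified model, the Sargan statistic can be computed by hand as \(T \cdot R^2\) from regressing the 2SLS residuals on all instruments. The sketch below uses simulated data with two valid instruments for one endogenous variable, so the degrees of freedom equal 1. The data-generating process is an assumption, and the code is a textbook-style illustration rather than a reproduction of any package's internal computation.

```r
# Simulated illustration; the DGP is an assumption, not from the slides
set.seed(2021)
n  <- 2000
z1 <- rnorm(n)
z2 <- rnorm(n)
e  <- rnorm(n)
x  <- 1 + 0.6 * z1 + 0.6 * z2 + e + rnorm(n)
y  <- 2 + 0.5 * x + e + rnorm(n)

# 2SLS with two instruments
x_hat <- fitted(lm(x ~ z1 + z2))
b     <- unname(coef(lm(y ~ x_hat)))
res   <- y - b[1] - b[2] * x              # 2SLS residuals use the original x

# Sargan statistic: n * R^2 from regressing the residuals on the instruments
aux     <- lm(res ~ z1 + z2)
sargan  <- n * summary(aux)$r.squared
df      <- 2 - 1                          # number of IVs minus endogenous variables
p_value <- pchisq(sargan, df = df, lower.tail = FALSE)
print(c(statistic = sargan, df = df, p_value = p_value))
```

A large p-value means the data do not reject the joint validity of the instruments. With exactly as many IVs as endogenous variables, `df` would be 0 and the test is undefined.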

Summary

Structural model and reduced-form model
Endogeneity and exogeneity
Consequence of endogeneity
2SLS
Test endogeneity
Weak IV