Instrumental Variables

Zhentao Shi

Nov 15, 2021

Endogeneity

\[ y_t = \beta_0 + \beta_1 x_{t} + u_t \]

Sources of endogeneity

Consequence of endogeneity

\[ \hat{\beta}_{1, OLS} - \beta_1 = \frac{\hat{cov}[x_t, u_t]}{\hat{var}[{x_t}]} \stackrel{p}{\nrightarrow} 0 \]

\[ \hat{\boldsymbol{\beta}}_{OLS} - \boldsymbol{\beta} = \left(\frac{\mathbf{X}'\mathbf{X}}{T}\right)^{-1} \frac{\mathbf{X}'\mathbf{u}}{T} \stackrel{p}{\nrightarrow} \mathbf{0}_p \]
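
The inconsistency can be seen in a small simulation. This is only a sketch under assumed values: the sample size, the coefficients, and the way `x` is tied to `u` are illustrative choices, not part of the original slides.

```r
# Simulated data; all names and parameter values are illustrative assumptions
set.seed(1)
n_obs <- 10000
beta0 <- 1
beta1 <- 2
u <- rnorm(n_obs)                 # structural error
x <- 0.8 * u + rnorm(n_obs)       # x depends on u, so cov[x, u] != 0
y <- beta0 + beta1 * x + u

b_ols <- unname(coef(lm(y ~ x))["x"])

# The estimation error of the OLS slope equals cov-hat[x, u] / var-hat[x]
bias_formula <- cov(x, u) / var(x)

print(c(ols_error = b_ols - beta1, formula = bias_formula))
```

In this design the population ratio is 0.8 / 1.64, so the OLS error does not shrink toward zero as the sample grows.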

Instrumental variables

Identifying the parameter of interest

\[ \begin{align} 0 & = cov[u_t, z_t] \\ & = cov[y_t - \beta_0 - \beta_1 x_t, z_t] \\ & = cov[y_t - \beta_1 x_t, z_t] \\ & = cov[y_t, z_t] - \beta_1 cov[ x_t, z_t] \end{align} \]

\[ \beta_1 = \frac{cov[y_t, z_t]} { cov[ x_t, z_t]} \]

provided the denominator \(cov[ x_t, z_t]\neq 0\) (relevance). The first line of the derivation, \(cov[u_t, z_t] = 0\), is the exogeneity condition.

Source of instruments

Estimation

\[ \hat{\beta}_1 = \frac{\hat{cov}[y_t, z_t]} {\hat{cov}[ x_t, z_t]} \]
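
A minimal sketch of this estimator on simulated data follows. The data-generating process (an instrument `z` that moves `x` but is independent of `u`) and all parameter values are assumptions made for illustration.

```r
# Simulated data; all names and parameter values are illustrative assumptions
set.seed(2)
n_obs <- 10000
beta0 <- 1
beta1 <- 2
z <- rnorm(n_obs)                        # instrument: independent of u
u <- rnorm(n_obs)                        # structural error
x <- 0.5 * z + 0.8 * u + rnorm(n_obs)    # relevant (0.5 * z) and endogenous (0.8 * u)
y <- beta0 + beta1 * x + u

b_iv  <- cov(y, z) / cov(x, z)           # IV estimator: ratio of sample covariances
b_ols <- cov(y, x) / var(x)              # OLS slope, for comparison

print(c(iv = b_iv, ols = b_ols, truth = beta1))
```

With a relevant and exogenous instrument the ratio should land near `beta1`, while the OLS slope stays away from it.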

Structural equation and reduced-form equation

\[ y_t = \beta_0 + \beta_1 x_{t} + u_t \]

\[ x_t = \gamma_0 + \gamma_1 z_{t} + v_t \]

Two stage least squares (2SLS)

  1. Run OLS on the reduced-form equation and save the fitted value \(\hat{x}_t = \hat{\gamma}_0 + \hat{\gamma}_1 z_t\).
  2. In the structural equation, replace the endogenous variable \(x_t\) with its fitted value \(\hat{x}_t\) from the 1st stage, and then regress \(y_t\) on \(\hat{x}_t\) by OLS. The estimated coefficient on \(\hat{x}_t\) is consistent for \(\beta_1\).

Back to the simple regression

\[ \begin{align} \hat{\beta}_1^{2SLS} & = \frac{\hat{cov}[ y_t, \hat{x}_t]} {\hat{var}[ \hat{x}_t]} \\ & = \frac{\hat{cov}[ y_t, \hat{\gamma}_0 + \hat{\gamma}_1 z_{t}]} {\hat{var}[\hat{\gamma}_0 + \hat{\gamma}_1 z_{t}]}\\ & = \frac{ \hat{\gamma}_1\hat{cov}[ y_t, z_{t}]} {\hat{cov}[ \hat{\gamma}_0+ \hat{\gamma}_1 z_{t} + \hat{v}_t, \hat{\gamma}_0+\hat{\gamma}_1 z_{t}]} \\ & = \frac{ \hat{\gamma}_1\hat{cov}[ y_t, z_{t}]} { \hat{\gamma}_1 \hat{cov}[ x_t, z_{t}]} \\ & = \hat {\beta}_1 \end{align} \]
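
The algebra above can be checked numerically. The sketch below uses simulated data (an assumed design, not the data from the slides) to run the two stages by hand and compare the result with the covariance ratio. It also checks the fact used in the third line of the derivation: the first-stage OLS residual \(\hat{v}_t\) has zero sample covariance with \(z_t\).

```r
# Simulated data; all names and parameter values are illustrative assumptions
set.seed(3)
n_obs <- 500
z <- rnorm(n_obs)
u <- rnorm(n_obs)
x <- 1 + 0.5 * z + 0.8 * u + rnorm(n_obs)
y <- 1 + 2 * x + u

# 1st stage: reduced-form regression of x on z
stage1 <- lm(x ~ z)
x_hat <- fitted(stage1)
v_hat <- resid(stage1)

# 2nd stage: regress y on the fitted value
stage2 <- lm(y ~ x_hat)
b_2sls <- unname(coef(stage2)["x_hat"])

# IV estimator as a ratio of sample covariances
b_iv <- cov(y, z) / cov(x, z)

print(c(two_stage = b_2sls, cov_ratio = b_iv, cov_vhat_z = cov(v_hat, z)))
```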

Real data example

d0 <- read.csv("familyfirms.csv", header = TRUE)
d99 <- d0[d0$year == 1999, ] # keep only the 1999 observations (294 firms)
head(d99)
##    year company agefirm meanagef assets bs_volatility founderCEO Q digit2_in
## 8  1999    1045      65       95  24374             0          0 1        45
## 16 1999    1078      99       95  14471             0          0 4        28
## 23 1999    1164      31       51  91072             0          0 2        48
## 31 1999    1209      59       95   8236             0          0 1        28
## 38 1999    1213      31       95   1643             0          0 1        45
## 45 1999    1240      41       87  15701             0          0 1        54

OLS vs 2SLS

## OLS
ols <- lm( 
  log(Q) ~ founderCEO + log(assets) + log(agefirm) + bs_volatility, 
  data = d99 )
print(ols)
## 
## Call:
## lm(formula = log(Q) ~ founderCEO + log(assets) + log(agefirm) + 
##     bs_volatility, data = d99)
## 
## Coefficients:
##   (Intercept)     founderCEO    log(assets)   log(agefirm)  bs_volatility  
##      -0.13446        0.27179        0.08972       -0.02961       -0.05828
## 2SLS: founderCEO is instrumented by meanagef
tsls <- ivreg::ivreg( 
  formula = log(Q) ~ founderCEO + log(assets) + log(agefirm) + bs_volatility, 
  instruments = ~ meanagef + log(assets) + log(agefirm) + bs_volatility, 
  data = d99 )
print(tsls)
## 
## Call:
## ivreg::ivreg(formula = log(Q) ~ founderCEO + log(assets) + log(agefirm) +     bs_volatility | meanagef + log(assets) + log(agefirm) + bs_volatility,     data = d99)
## 
## Coefficients:
##   (Intercept)     founderCEO    log(assets)   log(agefirm)  bs_volatility  
##      -0.72410        1.07827        0.10187        0.07683       -0.18784

2SLS in two stages

# 1st stage
stage1 <- lm( 
  founderCEO ~ meanagef + log(assets) + log(agefirm) + bs_volatility, 
  data = d99 )
print(stage1)
## 
## Call:
## lm(formula = founderCEO ~ meanagef + log(assets) + log(agefirm) + 
##     bs_volatility, data = d99)
## 
## Coefficients:
##   (Intercept)       meanagef    log(assets)   log(agefirm)  bs_volatility  
##      1.182186      -0.010731      -0.008662      -0.022756      -0.018084
# 2nd stage
CEO_hat <- predict(stage1) # fitted values of the endogenous variable from the 1st stage
stage2 <- lm( 
  log(Q) ~ CEO_hat + log(assets) + log(agefirm) + bs_volatility, 
  data = d99 )
print(summary(stage2))
## 
## Call:
## lm(formula = log(Q) ~ CEO_hat + log(assets) + log(agefirm) + 
##     bs_volatility, data = d99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0233 -0.4972 -0.1004  0.3321  1.9619 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.72410    0.38467  -1.882  0.06079 .  
## CEO_hat        1.07827    0.25586   4.214 3.35e-05 ***
## log(assets)    0.10187    0.03493   2.917  0.00381 ** 
## log(agefirm)   0.07683    0.05410   1.420  0.15665    
## bs_volatility -0.18784    0.13657  -1.375  0.17007    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6086 on 289 degrees of freedom
## Multiple R-squared:  0.08088,    Adjusted R-squared:  0.06816 
## F-statistic: 6.358 on 4 and 289 DF,  p-value: 6.468e-05
# same point estimates as stage2, but with valid 2SLS standard errors
summary(tsls)
## 
## Call:
## ivreg::ivreg(formula = log(Q) ~ founderCEO + log(assets) + log(agefirm) + 
##     bs_volatility | meanagef + log(assets) + log(agefirm) + bs_volatility, 
##     data = d99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5630 -0.5003 -0.1148  0.4378  2.3621 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.72410    0.41803  -1.732  0.08431 .  
## founderCEO     1.07827    0.27805   3.878  0.00013 ***
## log(assets)    0.10187    0.03796   2.684  0.00769 ** 
## log(agefirm)   0.07683    0.05879   1.307  0.19233    
## bs_volatility -0.18784    0.14841  -1.266  0.20666    
## 
## Diagnostic tests:
##                  df1 df2 statistic  p-value    
## Weak instruments   1 289     98.89  < 2e-16 ***
## Wu-Hausman         1 288     13.29 0.000317 ***
## Sargan             0  NA        NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6614 on 289 degrees of freedom
## Multiple R-Squared: -0.08547,    Adjusted R-squared: -0.1005 
## Wald test: 5.383 on 4 and 289 DF,  p-value: 0.0003406
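
The two printouts above share the same point estimates but report different standard errors. The reason is that the manual second stage computes residuals with the fitted value in place of the original regressor. The following sketch, on simulated data with assumed parameter values, shows how the second-stage standard errors can be rescaled using residuals built from the original `x`. It illustrates the idea and is not a description of the `ivreg` internals.

```r
# Simulated data; all names and parameter values are illustrative assumptions
set.seed(4)
n_obs <- 500
z <- rnorm(n_obs)
u <- rnorm(n_obs)
x <- 0.5 * z + 0.8 * u + rnorm(n_obs)
y <- 1 + 2 * x + u

x_hat  <- fitted(lm(x ~ z))
stage2 <- lm(y ~ x_hat)
b <- unname(coef(stage2))
k <- length(b)

# Residuals of the manual 2nd stage are built from x_hat
res_naive <- resid(stage2)
# 2SLS residuals are built from the original x
res_2sls <- y - b[1] - b[2] * x

sigma_naive <- sqrt(sum(res_naive^2) / (n_obs - k))
sigma_2sls  <- sqrt(sum(res_2sls^2) / (n_obs - k))

# Both variance formulas share the same (X_hat' X_hat)^{-1}; only sigma differs
se_naive <- summary(stage2)$coefficients[, "Std. Error"]
se_2sls  <- se_naive * sigma_2sls / sigma_naive

print(rbind(se_naive, se_2sls))
```

The direction of the discrepancy depends on the data. In the output above the ratio of the reported standard errors (0.27805 / 0.25586) matches the ratio of the residual standard errors (0.6614 / 0.6086), which is consistent with this rescaling.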

General case

\[ \sqrt{T} (\hat{\boldsymbol{\beta}}_{2SLS} - \boldsymbol{\beta}) \stackrel{d}{\rightarrow} N(\mathbf{0}_p, \Omega) \]
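
The slide leaves \(\Omega\) unspecified. As a sketch under an added assumption of conditional homoskedasticity, \(E[u_t^2 | \mathbf{z}_t] = \sigma_u^2\), a standard textbook form is given below. The vector notation \(\mathbf{x}_t\) (all regressors, including the constant) and \(\mathbf{z}_t\) (all instruments, including the exogenous regressors) is assumed here for illustration.

\[ \Omega = \sigma_u^2 \left( \mathbf{Q}_{xz} \mathbf{Q}_{zz}^{-1} \mathbf{Q}_{zx} \right)^{-1}, \quad \mathbf{Q}_{xz} = E[\mathbf{x}_t \mathbf{z}_t'], \quad \mathbf{Q}_{zz} = E[\mathbf{z}_t \mathbf{z}_t'], \quad \mathbf{Q}_{zx} = \mathbf{Q}_{xz}' \]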

OLS vs. 2SLS

Endogeneity test

\[ \begin{align} y_t & = \beta_0 + \beta_1 x_{t} + u_t \\ x_t & = \gamma_0 + \gamma_1 z_{t} + v_t \end{align} \]

Given an exogenous \(z_t\), \(x_t\) is endogenous if and only if \(cov[u_t, v_t] \neq 0\)
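
One common way to turn this condition into a test is a regression-based (control function) version of the Wu-Hausman test: add the first-stage residual \(\hat{v}_t\) to the structural regression and test whether its coefficient is zero. The sketch below uses simulated data with assumed parameter values. It is one possible implementation and not necessarily the exact computation behind the `Wu-Hausman` line printed by `ivreg`.

```r
# Simulated data; all names and parameter values are illustrative assumptions
set.seed(6)
n_obs <- 1000
z <- rnorm(n_obs)
u <- rnorm(n_obs)
v <- 0.8 * u + rnorm(n_obs)      # cov[u, v] != 0, so x is endogenous
x <- 0.5 * z + v
y <- 1 + 2 * x + u

# 1st stage residual estimates v
v_hat <- resid(lm(x ~ z))

# Augmented regression: a nonzero coefficient on v_hat signals endogeneity
aug <- lm(y ~ x + v_hat)
p_value <- summary(aug)$coefficients["v_hat", "Pr(>|t|)"]

print(p_value)
```

A small p-value rejects the null hypothesis that \(x_t\) is exogenous, in which case 2SLS is preferred to OLS.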

Weak IV

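\[ \hat{\beta}_1 = \frac{\hat{cov}[y_t, z_t]} {\hat{cov}[ x_t, z_t]} \]

The validity of 2SLS relies on \(cov[ x_t, z_t] \neq 0\). When this covariance is close to zero, the denominator is close to zero and the estimate becomes unstable.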
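
Instrument strength is usually gauged by the first-stage F statistic, and a widely used rule of thumb treats values below about 10 as a warning sign. The sketch below compares a strong and a nearly irrelevant instrument on simulated data. The helper `first_stage_F` and the values of `gamma1` are hypothetical and chosen for illustration.

```r
# Simulated data; the helper and all parameter values are illustrative assumptions
set.seed(7)
n_obs <- 500

# First-stage F statistic for a single instrument with slope gamma1
first_stage_F <- function(gamma1) {
  z <- rnorm(n_obs)
  u <- rnorm(n_obs)
  x <- gamma1 * z + 0.8 * u + rnorm(n_obs)
  unname(summary(lm(x ~ z))$fstatistic["value"])
}

F_strong <- first_stage_F(0.5)    # z clearly moves x
F_weak   <- first_stage_F(0.01)   # z barely moves x

print(c(strong = F_strong, weak = F_weak))
```

With a single instrument this F statistic is the square of the first-stage t statistic on \(z_t\).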

Screen print

summary(tsls)
## 
## Call:
## ivreg::ivreg(formula = log(Q) ~ founderCEO + log(assets) + log(agefirm) + 
##     bs_volatility | meanagef + log(assets) + log(agefirm) + bs_volatility, 
##     data = d99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5630 -0.5003 -0.1148  0.4378  2.3621 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.72410    0.41803  -1.732  0.08431 .  
## founderCEO     1.07827    0.27805   3.878  0.00013 ***
## log(assets)    0.10187    0.03796   2.684  0.00769 ** 
## log(agefirm)   0.07683    0.05879   1.307  0.19233    
## bs_volatility -0.18784    0.14841  -1.266  0.20666    
## 
## Diagnostic tests:
##                  df1 df2 statistic  p-value    
## Weak instruments   1 289     98.89  < 2e-16 ***
## Wu-Hausman         1 288     13.29 0.000317 ***
## Sargan             0  NA        NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6614 on 289 degrees of freedom
## Multiple R-Squared: -0.08547,    Adjusted R-squared: -0.1005 
## Wald test: 5.383 on 4 and 289 DF,  p-value: 0.0003406

Summary