数据实例1
5. 数据实例1#
在古代,中国北方曾是世界上最繁荣的地区之一,但现在的中国南方比北方富裕。 经济史学家通常认为,造成这个现象的主要原因之一是汉族的南移。历史上规模最大的汉族南移发生在宋代。 Bai(2021)发现,当时人口的迁入该地区当代的经济繁荣产生了显著的积极影响。
为了量化移民的长期影响,Bai(2021) 收集287个地区在1127–1130年间的移民数量和 该地2000年GDP。在简单回归中,采用对数GDP作为应变量(y),对数移民数(m)以及该地是否位于中国南方的虚拟变量(W)作为自变量。回归结果如下:
\[
\hat{y} = 1.577 + 0.258 m - 0.212 W
\]
从结果中可以看到,对数移民数系数为正,而位于中国南方则系数为负。
The following block is to import data and run OLS regression.
The standard deviation and the inference is based on homoskedastic error.
prefLevelTest <- read.csv(file = "prefLevelTest.csv", header = TRUE)
## regression 1
reg.ols <-lm(lngdppc2000~lnmig1127+south,data=prefLevelTest)
summary(reg.ols)
Call:
lm(formula = lngdppc2000 ~ lnmig1127 + south, data = prefLevelTest)
Residuals:
Min 1Q Median 3Q Max
-1.60701 -0.37418 0.03236 0.45080 1.71793
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.57726 0.05745 27.456 < 2e-16 ***
lnmig1127 0.25786 0.03541 7.282 3.25e-12 ***
south -0.21244 0.08222 -2.584 0.0103 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5998 on 284 degrees of freedom
Multiple R-squared: 0.1592, Adjusted R-squared: 0.1533
F-statistic: 26.89 on 2 and 284 DF, p-value: 2.012e-11
The following block uses the heteroskedastic-robust error.
robust.csv <- sandwich::vcovHC(reg.ols,type="HC1")
print( lmtest::coeftest(reg.ols, vcov. = robust.csv) ) # robust test
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.577260 0.060898 25.8999 < 2.2e-16 ***
lnmig1127 0.257864 0.035459 7.2721 3.453e-12 ***
south -0.212436 0.083374 -2.5480 0.01136 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The negative coefficient for the dummy variable south
is counterintuitive.
If we add more control variables, the significance is gone.
The following regression replicated Column (2) of Table II.
## regression 2
d1=fastDummies::dummy_cols(prefLevelTest,
select_columns = c("provgb"),
remove_first_dummy = TRUE)
## Fit the regression model(column 2)
reg.dummy<-lm(lngdppc2000~lnmig1127+south+lnhhden1080+lnarea+
provgb_14+provgb_32+provgb_33+provgb_34+provgb_35+provgb_36
+provgb_37+provgb_41+provgb_42+provgb_43+provgb_44+provgb_45
+provgb_50+provgb_61+provgb_62,
data=d1)
summary(reg.dummy)
Call:
lm(formula = lngdppc2000 ~ lnmig1127 + south + lnhhden1080 +
lnarea + provgb_14 + provgb_32 + provgb_33 + provgb_34 +
provgb_35 + provgb_36 + provgb_37 + provgb_41 + provgb_42 +
provgb_43 + provgb_44 + provgb_45 + provgb_50 + provgb_61 +
provgb_62, data = d1)
Residuals:
Min 1Q Median 3Q Max
-1.28984 -0.28922 0.00266 0.26672 1.32230
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.09221 0.34382 3.177 0.001665 **
lnmig1127 0.16010 0.04046 3.957 9.72e-05 ***
south -0.09760 0.12058 -0.809 0.418980
lnhhden1080 0.09468 0.02924 3.238 0.001357 **
lnarea 0.08966 0.03913 2.292 0.022713 *
provgb_14 -0.73763 0.14063 -5.245 3.18e-07 ***
provgb_32 0.07293 0.19041 0.383 0.702006
provgb_33 -0.15351 0.24118 -0.636 0.524996
provgb_34 -0.19587 0.18269 -1.072 0.284632
provgb_35 -0.20223 0.23784 -0.850 0.395928
provgb_36 -1.05992 0.21454 -4.940 1.38e-06 ***
provgb_37 -0.14206 0.14422 -0.985 0.325490
provgb_41 -0.25985 0.14873 -1.747 0.081762 .
provgb_42 -0.33977 0.20287 -1.675 0.095136 .
provgb_43 -0.62105 0.21101 -2.943 0.003534 **
provgb_44 -0.02682 0.19141 -0.140 0.888655
provgb_45 -0.54277 0.19007 -2.856 0.004633 **
provgb_50 -0.64765 0.16954 -3.820 0.000166 ***
provgb_61 -0.79993 0.14568 -5.491 9.29e-08 ***
provgb_62 -1.48178 0.16047 -9.234 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4444 on 267 degrees of freedom
Multiple R-squared: 0.5659, Adjusted R-squared: 0.5351
F-statistic: 18.32 on 19 and 267 DF, p-value: < 2.2e-16