5. Data Example 1

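In ancient times, northern China was one of the most prosperous regions in the world, yet today southern China is richer than the north. Economic historians generally regard the southward migration of the Han Chinese as one of the main causes of this reversal. The largest such migration in history took place during the Song dynasty. Bai (2021) finds that the population inflow at that time had a significant positive effect on the present-day economic prosperity of the receiving regions.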

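To quantify the long-run effect of migration, Bai (2021) collects, for 287 prefectures, the number of migrants in 1127–1130 and the prefecture's GDP in 2000. In a simple regression, log GDP is the dependent variable (y, `lngdppc2000` in the code), and the regressors are the log number of migrants (m, `lnmig1127`) and a dummy variable for whether the prefecture is located in southern China (W, `south`). The estimated equation is: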

\[ \hat{y} = 1.577 + 0.258 m - 0.212 W \]

The coefficient on the log number of migrants is positive, whereas the coefficient on the southern-China dummy is negative.
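Because both GDP and the number of migrants enter in logs, the coefficient on m can be read as an elasticity, and the coefficient on the dummy W converts to a percentage gap through exp(coefficient) - 1. The short R sketch below does this arithmetic with the rounded estimates from the fitted equation. It is a standard reading of a log-log specification rather than a calculation taken from Bai (2021), and the object names `b_m`, `b_W`, `effect_10pct`, and `south_gap` are illustrative.

```r
# Rounded point estimates copied from the fitted equation above
b_m <- 0.258    # coefficient on log migrants (m)
b_W <- -0.212   # coefficient on the south dummy (W)

# Percentage change in predicted GDP when the number of migrants
# rises by 10%, holding W fixed
effect_10pct <- (1.10^b_m - 1) * 100

# Percentage gap in predicted GDP between a southern (W = 1) and a
# northern (W = 0) prefecture with the same number of migrants
south_gap <- (exp(b_W) - 1) * 100

print(round(c(effect_10pct = effect_10pct, south_gap = south_gap), 2))
```

Under these estimates, a 10% larger migrant inflow is associated with roughly 2.5% higher GDP, and a southern location with roughly 19% lower GDP for the same migrant inflow.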

The following block imports the data and runs the OLS regression.

The standard errors and the inference assume homoskedastic errors.

# Load the prefecture-level data set
prefLevelTest <- read.csv(file = "prefLevelTest.csv", header = TRUE)

## regression 1: log GDP on log migrants and the south dummy
reg.ols <- lm(lngdppc2000 ~ lnmig1127 + south, data = prefLevelTest)
summary(reg.ols)
Call:
lm(formula = lngdppc2000 ~ lnmig1127 + south, data = prefLevelTest)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.60701 -0.37418  0.03236  0.45080  1.71793 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.57726    0.05745  27.456  < 2e-16 ***
lnmig1127    0.25786    0.03541   7.282 3.25e-12 ***
south       -0.21244    0.08222  -2.584   0.0103 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5998 on 284 degrees of freedom
Multiple R-squared:  0.1592,	Adjusted R-squared:  0.1533 
F-statistic: 26.89 on 2 and 284 DF,  p-value: 2.012e-11
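To make explicit what "based on homoskedastic errors" means, the sketch below rebuilds the standard errors that `summary()` reports from the textbook formula s² (X'X)⁻¹, where s² is the residual sum of squares divided by n - k. It runs on simulated stand-in data, not on `prefLevelTest.csv`, and the names `m`, `W`, `fit`, `se_manual`, and `se_lm` are assumptions made for the illustration.

```r
set.seed(123)

# Simulated stand-in data (NOT the Bai (2021) sample)
n <- 287
m <- rnorm(n)                    # plays the role of log migrants
W <- rbinom(n, 1, 0.5)           # plays the role of the south dummy
y <- 1.5 + 0.25 * m - 0.2 * W + rnorm(n, sd = 0.6)

fit <- lm(y ~ m + W)

X <- model.matrix(fit)           # n x k design matrix, intercept included
e <- resid(fit)                  # OLS residuals
k <- ncol(X)                     # number of estimated coefficients

sigma2_hat <- sum(e^2) / (n - k)                 # residual variance estimate
vcov_homo  <- sigma2_hat * solve(crossprod(X))   # s^2 (X'X)^{-1}
se_manual  <- sqrt(diag(vcov_homo))              # homoskedastic standard errors
t_manual   <- coef(fit) / se_manual              # t statistics for H0: beta = 0

# Compare with the standard errors reported by summary()
se_lm <- summary(fit)$coefficients[, "Std. Error"]
print(cbind(se_manual, se_lm))
```

The same bookkeeping can be checked in the output above: the 284 residual degrees of freedom are 287 observations minus 3 coefficients, and the t value for `lnmig1127` is 0.25786 / 0.03541 ≈ 7.28.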

The following block uses heteroskedasticity-robust (HC1) standard errors instead.

# HC1 heteroskedasticity-robust covariance matrix of the OLS coefficients
robust.vcov <- sandwich::vcovHC(reg.ols, type = "HC1")
print(lmtest::coeftest(reg.ols, vcov. = robust.vcov)) # robust t tests
t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  1.577260   0.060898 25.8999 < 2.2e-16 ***
lnmig1127    0.257864   0.035459  7.2721 3.453e-12 ***
south       -0.212436   0.083374 -2.5480   0.01136 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
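The point estimates are identical to those from `summary()`; only the standard errors change. As a sketch of what the HC1 option computes, the block below builds the White sandwich estimator (X'X)⁻¹ X' diag(e²) X (X'X)⁻¹ by hand and applies the n / (n - k) small-sample scaling that distinguishes HC1 from HC0. It uses simulated heteroskedastic data and base R only; the data-generating process and all object names are assumptions for illustration, not part of the original analysis.

```r
set.seed(1)

# Simulated data whose error variance depends on x (NOT the Bai (2021) sample)
n <- 200
x <- rnorm(n)
d <- rbinom(n, 1, 0.5)
y <- 1 + 0.5 * x - 0.2 * d + rnorm(n, sd = exp(0.5 * x))

fit <- lm(y ~ x + d)

X <- model.matrix(fit)           # n x k design matrix
e <- resid(fit)                  # OLS residuals
k <- ncol(X)

bread <- solve(crossprod(X))     # (X'X)^{-1}
meat  <- crossprod(X * e)        # X' diag(e^2) X: row i of X is scaled by e[i]

vcov_hc0 <- bread %*% meat %*% bread    # White (HC0) estimator
vcov_hc1 <- vcov_hc0 * n / (n - k)      # HC1: degrees-of-freedom scaling

se_hc1  <- sqrt(diag(vcov_hc1))
se_homo <- summary(fit)$coefficients[, "Std. Error"]
print(cbind(se_homo, se_hc1))
```

If the `sandwich` package is installed, `sandwich::vcovHC(fit, type = "HC1")` should agree with `vcov_hc1` up to numerical precision.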

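The negative coefficient on the dummy variable south is counterintuitive, given that southern China is richer today. Once more control variables are added, it is no longer statistically significant: in the regression below its estimate shrinks to -0.098 with a p-value of 0.42. The following regression replicates Column (2) of Table II.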

## regression 2

# Expand the province code provgb into 0/1 dummy columns,
# dropping the first category to serve as the reference group
d1 <- fastDummies::dummy_cols(prefLevelTest,
            select_columns = c("provgb"),
            remove_first_dummy = TRUE)

## Fit the regression model (column 2)
reg.dummy <- lm(lngdppc2000 ~ lnmig1127 + south + lnhhden1080 + lnarea +
    provgb_14 + provgb_32 + provgb_33 + provgb_34 + provgb_35 + provgb_36
    + provgb_37 + provgb_41 + provgb_42 + provgb_43 + provgb_44 + provgb_45
    + provgb_50 + provgb_61 + provgb_62,
    data = d1)
summary(reg.dummy)
Call:
lm(formula = lngdppc2000 ~ lnmig1127 + south + lnhhden1080 + 
    lnarea + provgb_14 + provgb_32 + provgb_33 + provgb_34 + 
    provgb_35 + provgb_36 + provgb_37 + provgb_41 + provgb_42 + 
    provgb_43 + provgb_44 + provgb_45 + provgb_50 + provgb_61 + 
    provgb_62, data = d1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.28984 -0.28922  0.00266  0.26672  1.32230 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.09221    0.34382   3.177 0.001665 ** 
lnmig1127    0.16010    0.04046   3.957 9.72e-05 ***
south       -0.09760    0.12058  -0.809 0.418980    
lnhhden1080  0.09468    0.02924   3.238 0.001357 ** 
lnarea       0.08966    0.03913   2.292 0.022713 *  
provgb_14   -0.73763    0.14063  -5.245 3.18e-07 ***
provgb_32    0.07293    0.19041   0.383 0.702006    
provgb_33   -0.15351    0.24118  -0.636 0.524996    
provgb_34   -0.19587    0.18269  -1.072 0.284632    
provgb_35   -0.20223    0.23784  -0.850 0.395928    
provgb_36   -1.05992    0.21454  -4.940 1.38e-06 ***
provgb_37   -0.14206    0.14422  -0.985 0.325490    
provgb_41   -0.25985    0.14873  -1.747 0.081762 .  
provgb_42   -0.33977    0.20287  -1.675 0.095136 .  
provgb_43   -0.62105    0.21101  -2.943 0.003534 ** 
provgb_44   -0.02682    0.19141  -0.140 0.888655    
provgb_45   -0.54277    0.19007  -2.856 0.004633 ** 
provgb_50   -0.64765    0.16954  -3.820 0.000166 ***
provgb_61   -0.79993    0.14568  -5.491 9.29e-08 ***
provgb_62   -1.48178    0.16047  -9.234  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4444 on 267 degrees of freedom
Multiple R-squared:  0.5659,	Adjusted R-squared:  0.5351 
F-statistic: 18.32 on 19 and 267 DF,  p-value: < 2.2e-16
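Listing every province dummy by name works but is easy to get wrong. As a sketch of an equivalent specification, the block below shows on simulated data that wrapping the code variable in `factor()` inside the formula yields the same fit as hand-built 0/1 columns with one category omitted. The province codes, coefficients, and object names are hypothetical and chosen only for illustration; the reference category here is the smallest code.

```r
set.seed(42)

# Simulated data with a hypothetical four-category province code
n <- 150
dat <- data.frame(
  x    = rnorm(n),
  prov = sample(c(13, 14, 32, 33), n, replace = TRUE)
)
dat$y <- 1 + 0.3 * dat$x - 0.5 * (dat$prov == 14) + rnorm(n)

# Manual 0/1 dummies, omitting code 13 as the reference category
dat$prov_14 <- as.integer(dat$prov == 14)
dat$prov_32 <- as.integer(dat$prov == 32)
dat$prov_33 <- as.integer(dat$prov == 33)

fit_manual <- lm(y ~ x + prov_14 + prov_32 + prov_33, data = dat)

# Same model, letting lm() expand the factor into dummies;
# the first level (13) is the omitted reference by default
fit_factor <- lm(y ~ x + factor(prov), data = dat)

print(cbind(manual = unname(coef(fit_manual)),
            factor = unname(coef(fit_factor))))
```

The two columns coincide, so the choice between the two styles is one of convenience: explicit columns make the omitted category visible in the data, while `factor()` keeps the formula short.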