Least Squares Dummy Variable Regression vs. Fixed Effect Model

In the panel regression setup, the coefficients in the Least Squares Dummy Variable (LSDV) model are identical to those in the Fixed Effect model. However, the Fixed Effect model needs much less computing time than the LSDV model.
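The equivalence follows from the Frisch-Waugh-Lovell theorem: regressing on a full set of individual dummies gives the same slope estimates as demeaning every variable within individuals first. A minimal sketch on simulated data (all variable names here are illustrative, not from the Males dataset):

```r
set.seed(42)
n <- 50; t <- 4                      # 50 individuals observed for 4 periods
id <- rep(1:n, each = t)
x  <- rnorm(n * t)
y  <- 0.5 * x + rep(rnorm(n), each = t) + rnorm(n * t)  # individual effects in the DGP

# LSDV: one dummy per individual
b_lsdv <- coef(lm(y ~ x + factor(id)))["x"]

# Within transformation: demean y and x within each individual, no intercept
y_dm <- y - ave(y, id)
x_dm <- x - ave(x, id)
b_within <- coef(lm(y_dm ~ x_dm - 1))["x_dm"]

all.equal(unname(b_lsdv), unname(b_within))  # TRUE up to floating point
```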

if (!require(plm)) install.packages("plm")
## Loading required package: plm
## Loading required package: Formula
library(plm)
if (!require(broom)) install.packages("broom")
## Loading required package: broom
library(broom)
data(package = "plm")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plm':
## 
##     between, lag, lead
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data Description

I use the Males dataset from plm as an example.

data(Males, package = "plm")
summary(Males)
##        nr             year          school          exper       
##  Min.   :   13   Min.   :1980   Min.   : 3.00   Min.   : 0.000  
##  1st Qu.: 2329   1st Qu.:1982   1st Qu.:11.00   1st Qu.: 4.000  
##  Median : 4569   Median :1984   Median :12.00   Median : 6.000  
##  Mean   : 5262   Mean   :1984   Mean   :11.77   Mean   : 6.515  
##  3rd Qu.: 8406   3rd Qu.:1985   3rd Qu.:12.00   3rd Qu.: 9.000  
##  Max.   :12548   Max.   :1987   Max.   :16.00   Max.   :18.000  
##                                                                 
##  union         ethn      married    health          wage       
##  no :3296   other:3176   no :2446   no :4286   Min.   :-3.579  
##  yes:1064   black: 504   yes:1914   yes:  74   1st Qu.: 1.351  
##             hisp : 680                         Median : 1.671  
##                                                Mean   : 1.649  
##                                                3rd Qu.: 1.991  
##                                                Max.   : 4.052  
##                                                                
##                              industry   
##  Manufacturing                   :1231  
##  Trade                           :1169  
##  Professional_and_Related Service: 333  
##  Business_and_Repair_Service     : 331  
##  Construction                    : 327  
##  Transportation                  : 286  
##  (Other)                         : 683  
##                                occupation            residence   
##  Craftsmen, Foremen_and_kindred     :934   rural_area     :  85  
##  Operatives_and_kindred             :881   north_east     : 733  
##  Service_Workers                    :509   nothern_central: 964  
##  Clerical_and_kindred               :486   south          :1333  
##  Professional, Technical_and_kindred:453   NA's           :1245  
##  Laborers_and_farmers               :401                         
##  (Other)                            :696
class(Males)
## [1] "data.frame"

In the Males dataset, nr identifies the respondent and year is the survey year, so this is a panel data set.
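Before estimating, it is worth checking the panel dimensions. plm's pdim() reports the number of individuals, the number of periods, and whether the panel is balanced (a sketch, assuming the data has been converted with pdata.frame):

```r
library(plm)
data(Males, package = "plm")

# n = individuals, T = periods per individual, plus a balanced/unbalanced flag
pdim(pdata.frame(Males, index = c("nr", "year")))
```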

LSDV

In the following comparison, I study the relationship between experience and wage: \(\log(wage) = \alpha + \beta\, exper + \gamma\, exper^2 + \theta\, Controls\), where the controls include union status (union), ethnicity (ethn), health, and marital status (married).

t <- proc.time()
lsdv <- lm(
  log(wage) ~ exper + I(exper^2) + I(as.integer(union)) + ethn + health + married + factor(nr),
  data = Males,
  subset = wage >= 1
)
## Warning in log(wage): NaNs produced
lsdv.time <- proc.time() - t
print(lsdv.time)
##    user  system elapsed 
##   1.299   0.008   1.307
lsdv.result <- tidy(lsdv)
dim(lsdv.result)
## [1] 549   5

In the LSDV model, the result includes a full set of individual dummies (one per respondent in the estimation sample, minus the reference level).

knitr::kable(head(lsdv.result, 9))
term estimate std.error statistic p.value
(Intercept) 0.2077293 0.0527951 3.9346308 0.0000850
exper 0.0500831 0.0034919 14.3425289 0.0000000
I(exper^2) -0.0017047 0.0002482 -6.8679477 0.0000000
I(as.integer(union)) 0.0379404 0.0078563 4.8292689 0.0000014
ethnblack 0.2615435 0.0703569 3.7173801 0.0002045
ethnhisp 0.1712365 0.0704787 2.4296203 0.0151653
healthyes 0.0049806 0.0195340 0.2549727 0.7987594
marriedyes 0.0134629 0.0073480 1.8321787 0.0670109
factor(nr)17 -0.0246613 0.0701815 -0.3513930 0.7253150
lsdv.residual_square <- sum((lsdv$residuals)^2)
glance(lsdv)
##   r.squared adj.r.squared     sigma statistic p.value  df   logLik
## 1 0.7123676     0.6666931 0.1353907  15.59663       0 549 2617.868
##         AIC       BIC deviance df.residual
## 1 -4135.736 -674.0087 63.25907        3451

Fixed Effect Model

Males.panel <- pdata.frame(Males, index = c("nr", "year"))
t <- proc.time()
fixed_effect <- plm(
  log(wage) ~ exper + I(exper^2) + I(as.integer(union)) + ethn + health + married,
  data = Males.panel,
  subset = wage >= 1,
  effect = "individual",
  model = "within"
)
## Warning in log(wage): NaNs produced
fixed_effect.time <- proc.time() - t
print(fixed_effect.time)
##    user  system elapsed 
##   0.277   0.008   0.286
fixed_effect.result <- tidy(fixed_effect)
dim(fixed_effect.result)
## [1] 5 5

In the fixed effect model we only have 5 coefficients, because the time-invariant variable (ethn) is cancelled out by the within transformation.

knitr::kable(fixed_effect.result)
term estimate std.error statistic p.value
exper 0.0500831 0.0034919 14.3425289 0.0000000
I(exper^2) -0.0017047 0.0002482 -6.8679477 0.0000000
I(as.integer(union)) 0.0379404 0.0078563 4.8292689 0.0000014
healthyes 0.0049806 0.0195340 0.2549727 0.7987594
marriedyes 0.0134629 0.0073480 1.8321787 0.0670109
fixed_effect.residual_square <- sum((fixed_effect$residuals)^2)
glance(fixed_effect)
##   r.squared adj.r.squared statistic       p.value deviance df.residual
## 1 0.2186536    0.09458001   193.147 6.996476e-182 63.25907        3451
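The within estimator can also be reproduced by hand: demean every variable at the individual level and run OLS on the demeaned data without an intercept. A sketch with dplyr (the subset and the union coding mirror the models above; the coefficients should match, though the naive lm standard errors will not, since lm does not know about the absorbed degrees of freedom):

```r
library(dplyr)

demeaned <- Males %>%
  filter(wage >= 1) %>%
  mutate(lwage    = log(wage),
         exper2   = exper^2,
         union_n  = as.integer(union),
         health_n = as.integer(health == "yes"),
         married_n = as.integer(married == "yes")) %>%
  group_by(nr) %>%                              # demean within each individual
  mutate(across(c(lwage, exper, exper2, union_n, health_n, married_n),
                ~ .x - mean(.x))) %>%
  ungroup()

# No intercept: every demeaned variable has mean zero within each individual
coef(lm(lwage ~ exper + exper2 + union_n + health_n + married_n - 1,
        data = demeaned))
```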

Comparison

knitr::kable(fixed_effect.result[,1:3] %>% left_join(lsdv.result[,1:3], by = "term", suffix = c(".fixed", ".lsdv")))
term estimate.fixed std.error.fixed estimate.lsdv std.error.lsdv
exper 0.0500831 0.0034919 0.0500831 0.0034919
I(exper^2) -0.0017047 0.0002482 -0.0017047 0.0002482
I(as.integer(union)) 0.0379404 0.0078563 0.0379404 0.0078563
healthyes 0.0049806 0.0195340 0.0049806 0.0195340
marriedyes 0.0134629 0.0073480 0.0134629 0.0073480

The coefficients and standard errors of the time-varying variables are (almost) numerically identical: plm adjusts the degrees of freedom for the absorbed individual effects by default, so its standard errors match the LSDV ones.
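In applied work one usually reports cluster-robust standard errors for panel models rather than the default ones. plm supports this through its vcovHC method with clustering by individual (a sketch, assuming the fixed_effect object estimated above):

```r
library(lmtest)

# Standard errors clustered at the individual (group) level
coeftest(fixed_effect,
         vcov = vcovHC(fixed_effect, type = "HC1", cluster = "group"))
```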

fixed_effect.residual_square - lsdv.residual_square
## [1] 7.105427e-15

The sums of squared residuals are identical too (up to floating-point error).

print(lsdv.time)
##    user  system elapsed 
##   1.299   0.008   1.307
print(fixed_effect.time)
##    user  system elapsed 
##   0.277   0.008   0.286
lsdv.time - fixed_effect.time
##         user       system      elapsed 
## 1.022000e+00 6.938894e-18 1.021000e+00
lsdv.time[3] / fixed_effect.time[3]
## elapsed 
## 4.56993

In this example, the fixed effect model is about 4.6 times faster than LSDV. Notice that there are 545 individuals in my sample; as the number of individuals grows, LSDV's computing time increases dramatically, because the design matrix (X, D) gains one dummy column per individual.
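The scaling can be checked on simulated panels of increasing size: LSDV must factor a design matrix with one column per individual, so its cost grows roughly cubically in the number of individuals, while the within estimator stays cheap. A rough benchmark sketch (sizes are illustrative):

```r
bench_one <- function(n, t = 8) {
  id <- rep(seq_len(n), each = t)
  x  <- rnorm(n * t)
  y  <- 0.5 * x + rep(rnorm(n), each = t) + rnorm(n * t)
  d  <- data.frame(id, year = rep(seq_len(t), n), y, x)

  lsdv   <- system.time(lm(y ~ x + factor(id), data = d))["elapsed"]
  within <- system.time(
    plm::plm(y ~ x, data = d, index = c("id", "year"), model = "within")
  )["elapsed"]

  c(n = n, lsdv = unname(lsdv), within = unname(within))
}

# Elapsed seconds for each estimator at several panel sizes
sapply(c(500, 1000, 2000), bench_one)
```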
