In the panle regression setup, the coefficients in the Least Square Dummy Variable model is identical to that in the Fixed Effect Model. However, the computing time needed is much less in the Fixed Effect Model than the time in the Least Square Dummy Variable Model.
if (!require(plm)) install.packages("plm")
## Loading required package: plm
## Loading required package: Formula
library(plm)
if (!require(broom)) install.packages("broom")
## Loading required package: broom
library(broom)
data(package = "plm")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plm':
##
## between, lag, lead
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Data Desciption
I use Males
dataset as an example.
data(Males, package = "plm")
summary(Males)
## nr year school exper
## Min. : 13 Min. :1980 Min. : 3.00 Min. : 0.000
## 1st Qu.: 2329 1st Qu.:1982 1st Qu.:11.00 1st Qu.: 4.000
## Median : 4569 Median :1984 Median :12.00 Median : 6.000
## Mean : 5262 Mean :1984 Mean :11.77 Mean : 6.515
## 3rd Qu.: 8406 3rd Qu.:1985 3rd Qu.:12.00 3rd Qu.: 9.000
## Max. :12548 Max. :1987 Max. :16.00 Max. :18.000
##
## union ethn married health wage
## no :3296 other:3176 no :2446 no :4286 Min. :-3.579
## yes:1064 black: 504 yes:1914 yes: 74 1st Qu.: 1.351
## hisp : 680 Median : 1.671
## Mean : 1.649
## 3rd Qu.: 1.991
## Max. : 4.052
##
## industry
## Manufacturing :1231
## Trade :1169
## Professional_and_Related Service: 333
## Business_and_Repair_Service : 331
## Construction : 327
## Transportation : 286
## (Other) : 683
## occupation residence
## Craftsmen, Foremen_and_kindred :934 rural_area : 85
## Operatives_and_kindred :881 north_east : 733
## Service_Workers :509 nothern_central: 964
## Clerical_and_kindred :486 south :1333
## Professional, Technical_and_kindred:453 NA's :1245
## Laborers_and_farmers :401
## (Other) :696
class(Males)
## [1] "data.frame"
In the Males
dataset, nr
represents the respondents id, and year
is the survey year. This dataset is a panel data set.
LSDV
In the following comparison, I study the relationship between experience and wage. \(\log(wage) = \alpha + \beta exp + \gamma exp^2 + \theta Controls\) where the controls include union
, ethnicity(ethn)
, health
and married
.
t <- proc.time()
lsdv <- lm(
log(wage) ~ exper + I(exper^2) + I(as.integer(union)) + ethn + health + married + factor(nr),
data = Males,
subset = wage >= 1
)
## Warning in log(wage): NaNs produced
lsdv.time <- proc.time() - t
print(lsdv.time)
## user system elapsed
## 1.299 0.008 1.307
lsdv.result <- tidy(lsdv)
dim(lsdv.result)
## [1] 549 5
In the LSDV model, the final result includes a set of individual dummies.
knitr::kable(head(lsdv.result, 9))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.2077293 | 0.0527951 | 3.9346308 | 0.0000850 |
exper | 0.0500831 | 0.0034919 | 14.3425289 | 0.0000000 |
I(exper^2) | -0.0017047 | 0.0002482 | -6.8679477 | 0.0000000 |
I(as.integer(union)) | 0.0379404 | 0.0078563 | 4.8292689 | 0.0000014 |
ethnblack | 0.2615435 | 0.0703569 | 3.7173801 | 0.0002045 |
ethnhisp | 0.1712365 | 0.0704787 | 2.4296203 | 0.0151653 |
healthyes | 0.0049806 | 0.0195340 | 0.2549727 | 0.7987594 |
marriedyes | 0.0134629 | 0.0073480 | 1.8321787 | 0.0670109 |
factor(nr)17 | -0.0246613 | 0.0701815 | -0.3513930 | 0.7253150 |
lsdv.residual_square <- sum((lsdv$residuals)^2)
glance(lsdv)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.7123676 0.6666931 0.1353907 15.59663 0 549 2617.868
## AIC BIC deviance df.residual
## 1 -4135.736 -674.0087 63.25907 3451
Fixed Effect Model
Males.panel <- plm.data(Males, indexes = c("nr", "year"))
## Warning: use of 'plm.data' is discouraged, better use 'pdata.frame' instead
t <- proc.time()
fixed_effect <- plm(
log(wage) ~ exper + I(exper^2) + I(as.integer(union)) + ethn + health + married,
data = Males.panel,
subset = wage >= 1,
effect = "individual",
model = "within"
)
## Warning in log(wage): NaNs produced
fixed_effect.time <- proc.time() - t
print(fixed_effect.time)
## user system elapsed
## 0.277 0.008 0.286
fixed_effect.result <- tidy(fixed_effect)
dim(fixed_effect.result)
## [1] 5 5
In the fixed effect model, we only have 5 coefficients because the time-invariant variable(ethn
) is cancalled out in the regression.
knitr::kable(fixed_effect.result)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
exper | 0.0500831 | 0.0034919 | 14.3425289 | 0.0000000 |
I(exper^2) | -0.0017047 | 0.0002482 | -6.8679477 | 0.0000000 |
I(as.integer(union)) | 0.0379404 | 0.0078563 | 4.8292689 | 0.0000014 |
healthyes | 0.0049806 | 0.0195340 | 0.2549727 | 0.7987594 |
marriedyes | 0.0134629 | 0.0073480 | 1.8321787 | 0.0670109 |
fixed_effect.residual_square <- sum((fixed_effect$residuals)^2)
glance(fixed_effect)
## r.squared adj.r.squared statistic p.value deviance df.residual
## 1 0.2186536 0.09458001 193.147 6.996476e-182 63.25907 3451
Comparison
knitr::kable(fixed_effect.result[,1:3] %>% left_join(lsdv.result[,1:3], by = "term", suffix = c(".fixed", ".lsdv")))
term | estimate.fixed | std.error.fixed | estimate.lsdv | std.error.lsdv |
---|---|---|---|---|
exper | 0.0500831 | 0.0034919 | 0.0500831 | 0.0034919 |
I(exper^2) | -0.0017047 | 0.0002482 | -0.0017047 | 0.0002482 |
I(as.integer(union)) | 0.0379404 | 0.0078563 | 0.0379404 | 0.0078563 |
healthyes | 0.0049806 | 0.0195340 | 0.0049806 | 0.0195340 |
marriedyes | 0.0134629 | 0.0073480 | 0.0134629 | 0.0073480 |
The results of continuous variables are (almost) numerically identical. plm
ajust the se by defult.
fixed_effect.residual_square - lsdv.residual_square
## [1] 7.105427e-15
The sum of residual squares are identical too.
print(lsdv.time)
## user system elapsed
## 1.299 0.008 1.307
print(fixed_effect.time)
## user system elapsed
## 0.277 0.008 0.286
lsdv.time - fixed_effect.time
## user system elapsed
## 1.022000e+00 6.938894e-18 1.021000e+00
lsdv.time[3] / fixed_effect.time[3]
## elapsed
## 4.56993
In this example, fixed effect model is 13 times faster than LSDV. Notice the individuals in my sample is 545. As individuals increases, the computing time increase dramatically. (The (X + D) matrix increases accordingly).