The score-based test framework for parameter instability has
been proposed for testing measurement invariance in measurement models.
Until now, the focus was on (a) testing the invariance of all parameters
simultaneously, or (b) testing the invariance of a single parameter
in the model. However, in educational and psychological assessments, the
appropriateness of each item is of interest. For instance, the
detection of differential item functioning (DIF) plays an important role in
validating new items. The scDIFtest package provides a
user-friendly method for detecting DIF by automatically and efficiently
applying the tests from the score-based test framework to the individual
items in the assessment. The main function of the scDIFtest
package is the scDIFtest function, which is a wrapper
around the strucchange::sctest-function.
To detect DIF with the scDIFtest package, the
appropriate Item Response Theory (IRT) or Factor Analysis (FA) model
should first be fitted using the mirt package. The
scDIFtest function can then be used directly on the resulting
mirt object. Hence, in addition to
scDIFtest, the mirt package will typically
also be loaded in the R session. For now,
scDIFtest only works for IRT/FA models that were fitted
using the mirt package, but we aim to extend this to other
packages that fit IRT/FA models using maximum likelihood estimation.
In order to fit the IRT model and analyze DIF with
scDIFtest, the following steps are necessary:
- Install the R-package(s)
- Fit the IRT model with the mirt or multipleGroup function implemented in the mirt package (Chalmers, 2012)
- Detect DIF with scDIFtest (Debeer, 2020)
In the sections that follow, these steps will be explained in detail.
The scDIFtest package is installed using the following
commands:
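A minimal sketch of this step, assuming the released version of the package is installed from CRAN (a development version may be hosted elsewhere):

```r
# Assumption: scDIFtest is available from CRAN.
# Install it only if it is not already present in the library.
if (!requireNamespace("scDIFtest", quietly = TRUE)) {
  install.packages("scDIFtest")
}
```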
Since the mirt package (Chalmers, 2012) is required for fitting the
IRT/FA model of interest, it should also be installed (using
install.packages("mirt")).
In this vignette, a subset of the SPISA data is used.
This data is part of the psychotree package and can be
accessed once the psychotree package is installed. To load
the SPISA dataset:
The SPISA data is a subsample from the general knowledge quiz
“Studentenpisa” conducted online by the German weekly news magazine
SPIEGEL (Trepte and Verbeet, 2010). The
data contain the quiz results from 45 questions as well as
socio-demographic data for 1075 university students from Bavaria. Although there were
45 questions addressing different topics, this illustration is limited
to the analysis of the nine science questions (items 37-45). To
analyze the data with mirt, the responses are converted to
a data frame.
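A minimal sketch of loading the data and extracting the nine science items. It assumes that the item responses are stored in the matrix column `spisa` of the SPISA data frame; the object name `resp` is the one used in the model-fitting code of this vignette.

```r
# Load the SPISA data shipped with the psychotree package
data("SPISA", package = "psychotree")

# Assumption: the responses to the 45 questions are stored in the
# matrix column `spisa`; columns 37 to 45 are the science questions.
resp <- as.data.frame(SPISA$spisa[, 37:45])

# Quick checks: one row per student, one column per science item
dim(resp)
names(resp)
```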
In addition to the responses, the SPISA data also contains five socio-demographic variables (i.e., person covariates):
summary(SPISA[,2:6])
#>     gender         age          semester   elite            spon
#>  female:417   Min.   :18.0   2      :173   no :836   never    :303
#>  male  :658   1st Qu.:21.0   4      :123   yes:239   <1/month :127
#>               Median :23.0   6      :116             1-3/month:107
#>               Mean   :23.1   1      :105             1/week   : 79
#>               3rd Qu.:25.0   5      : 99             2-3/week : 73
#>               Max.   :40.0   3      : 98             4-5/week : 60
#>                              (Other):361             daily    :326
In this illustration, we will try to detect DIF along the following three covariates:
- age of the student in years (numeric covariate)
- gender of the student (unordered categorical covariate)
- spon, which is the frequency of accessing the SPIEGEL ONline (SPON) magazine (ordered categorical covariate)
mirt or multipleGroup function
It is important to note that, for the package to work, the parameters
in the assumed IRT model need to be estimated using either the
mirt or the multipleGroup function from the
mirt package. The multipleGroup function can
model impact between groups of persons, which is not possible with the
mirt function. Modeling impact is important when the goal
is to detect DIF (DeMars, 2010). In this
illustration, for instance, we test whether there is impact with respect
to gender by comparing a model that allows ability differences between
male and female students with a model that assumes there are no group
differences in ability. The relative fit of these two models is compared,
and the best-fitting model is selected for the DIF analysis. The general
idea is that we want to avoid (a) false DIF detections that can
be attributed to ability differences and (b) failing to detect DIF that is
masked because ability differences are not modeled.
First, the mirt package is loaded in the R session:
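A minimal sketch of this step, assuming mirt has already been installed:

```r
# Attach mirt so that mirt() and multipleGroup() are available
library(mirt)
```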
Then the two models are fit and compared. Note that in general we do
not recommend using verbose = FALSE, but for this vignette
it is more convenient.
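# Single-group 2PL model: no ability differences between groups
fit_2PL <- mirt(data = resp,
                model = 1,
                itemtype = "2PL",
                verbose = FALSE)
# Multiple-group model: item parameters constrained equal across gender,
# latent mean and variance freely estimated (impact)
fit_multiGroup <- multipleGroup(
  data = resp, model = 1,
  group = SPISA$gender,
  invariance = c("free_means",
                 "slopes",
                 "intercepts",
                 "free_var"),
  verbose = FALSE)
The comparison of the two models with anova yields the
following results: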
anova(fit_2PL, fit_multiGroup)
#>                     AIC    SABIC       HQ      BIC    logLik     X2   df   p
#> fit_2PL        10161.68 10194.16 10195.64 10251.33 -5062.843
#> fit_multiGroup 10139.62 10175.69 10177.34 10239.22 -5049.808 26.069 -509 NaN
The multipleGroup model with ability differences between
male and female test takers fits the data best (lower AIC and BIC; a large
Likelihood Ratio Test statistic, X2 = 26.07, although the df and
\(p\)-value printed for this test are evidently not usable). It seems that
there are differences between male and female
students with respect to the assessed science knowledge. Therefore, the
multipleGroup model is used in the DIF detection
analysis.
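As a cross-check on this comparison, the likelihood ratio test can be recomputed by hand in base R from the printed log-likelihoods. The sketch assumes that the two models differ by two parameters (the freed latent mean and variance of the second group), which is consistent with the AIC values and log-likelihoods in the table; the variable names are illustrative.

```r
# Log-likelihoods as printed in the anova() output
logLik_2PL        <- -5062.843
logLik_multiGroup <- -5049.808

# Likelihood ratio statistic: twice the gain in log-likelihood
X2 <- 2 * (logLik_multiGroup - logLik_2PL)

# Assumption: the multiple-group model has 2 extra parameters
# (free latent mean and free latent variance in the second group)
df_diff <- 2

# Upper-tail chi-square probability
p_value <- pchisq(X2, df = df_diff, lower.tail = FALSE)

X2       # about 26.07, matching the X2 column
p_value  # far below .05
```

Under this assumption the test supports the same conclusion as AIC and BIC: the model with impact is preferred.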
In the (sub)sections that follow, DIF is tested for three different
covariates: gender, age and spon,
but only the DIF analysis for gender is explained in more detail. The
R commands used are the same for any covariate, however, and the
interpretation is given for all of the covariates.
gender
To test item-wise DIF along gender, the scDIFtest
function is used with the fitted model object and gender as
the DIF_covariate argument. Note that the
scDIFtest package has to be loaded first.
The resulting object is assigned to DIF_gender. For a
readable version of the results, the print method is
available. In addition, the summary method returns a
summary of the results as a data frame.
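A minimal sketch of these steps, mirroring the scDIFtest calls used for age and spon later in this vignette. To keep the block self-contained, the data and the multiple-group model are recreated first; the location of the responses in the `spisa` column and the name `summary_gender` are assumptions made for illustration.

```r
library(mirt)
library(scDIFtest)

# Recreate the objects from the previous sections
data("SPISA", package = "psychotree")
resp <- as.data.frame(SPISA$spisa[, 37:45])  # assumption: responses are in `spisa`
fit_multiGroup <- multipleGroup(
  data = resp, model = 1,
  group = SPISA$gender,
  invariance = c("free_means", "slopes", "intercepts", "free_var"),
  verbose = FALSE)

# Item-wise score-based DIF tests along gender
DIF_gender <- scDIFtest(fit_multiGroup, DIF_covariate = SPISA$gender)

print(DIF_gender)                      # readable overview of all items
summary_gender <- summary(DIF_gender)  # the same results as a data frame
```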
In the subsections that follow, the results regarding the
analyses of item-wise DIF by gender, age and
spon will be interpreted.
gender
For the gender covariate, the print method gives the following results:
DIF_gender
#>
#> Score Based DIF-tests for 9 items
#> Person covariate: SPISA$gender
#> Test statistic type: Lagrange Multiplier Test for Unordered Groups
#>
#>    item_type n_est_pars       stat      p_value        p_fdr
#> V1       2PL          2  0.4141290 8.129672e-01 9.145881e-01
#> V2       2PL          2  8.3160143 1.563869e-02 4.691608e-02
#> V3       2PL          2  4.8448186 8.870764e-02 1.995922e-01
#> V4       2PL          2 32.7336800 7.797793e-08 7.018014e-07
#> V5       2PL          2  3.2678719 1.951599e-01 3.512879e-01
#> V6       2PL          2  0.4159239 8.122379e-01 9.145881e-01
#> V7       2PL          2 30.3498706 2.568085e-07 1.155638e-06
#> V8       2PL          2  0.1516963 9.269569e-01 9.269569e-01
#> V9       2PL          2  0.5925199 7.435941e-01 9.145881e-01
First, in three lines some general information is given:
- the number of items that were tested (9),
- the person covariate that was used (SPISA$gender), and
- the type of test statistic (here the Lagrange Multiplier Test for Unordered Groups, LMuo; Merkle and Zeileis, 2013; Merkle et al., 2014).
After these three lines, a table with the main results is printed with one line for each item that was included in the DIF detection analysis. The columns of the table represent:
- the item names as row names (in this case "V1" - "V9")
- item_type: the type of IRT model used for each item (in this case the two-Parameter Logistic Model (2PL))
- n_est_pars: the number of estimated parameters for each item
- stat: the value for the statistic per item (in this case the LMuo statistic)
- p_value: the \(p\)-value per item
- p_fdr: the False-Discovery-Rate corrected \(p\)-value (Benjamini and Hochberg, 1995)
The printed output indicates that, when a significance level of \(.05\) is used, DIF along
gender is most clearly detected in item V4 and in item V7: these two
items function differently, depending on the gender of the students.
Item V2 also falls below this threshold (\(p = 0.016\); FDR-corrected \(p = 0.047\)), although the evidence is much weaker.
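The p_fdr column can be reproduced with base R. The sketch below applies the Benjamini-Hochberg adjustment to the uncorrected \(p\)-values from the printed table; the vector names are illustrative, and whether scDIFtest itself calls p.adjust internally is an assumption.

```r
# Uncorrected p-values for items V1 to V9, copied from the printed table
p_value <- c(V1 = 8.129672e-01, V2 = 1.563869e-02, V3 = 8.870764e-02,
             V4 = 7.797793e-08, V5 = 1.951599e-01, V6 = 8.122379e-01,
             V7 = 2.568085e-07, V8 = 9.269569e-01, V9 = 7.435941e-01)

# Benjamini-Hochberg (False Discovery Rate) adjustment
p_fdr <- p.adjust(p_value, method = "BH")
p_fdr

# Items flagged at the .05 level, before and after the correction
names(p_value)[p_value < 0.05]
names(p_fdr)[p_fdr < 0.05]
```

Comparing the two sets of flagged items shows directly how much the multiplicity correction changes the conclusions for a given covariate.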
When one or more items are selected using the
item_selection argument of the print method,
the underlying sctest objects (or M-fluctuation tests) are
printed.
print(DIF_gender, item_selection = c("V4", "V7"))
#>
#> DIF-test for V4
#> Person covariate: SPISA$gender
#> Test statistic type: Lagrange Multiplier Test for Unordered Groups
#>
#> M-fluctuation test
#>
#> data: resp
#> f(efp) = 32.734, p-value = 7.798e-08
#>
#>
#> DIF-test for V7
#> Person covariate: SPISA$gender
#> Test statistic type: Lagrange Multiplier Test for Unordered Groups
#>
#> M-fluctuation test
#>
#> data: resp
#> f(efp) = 30.35, p-value = 2.568e-07
Note that here the uncorrected \(p\)-values are given.
age
The results for the DIF-detection analysis with age as
the covariate are:
DIF_age <- scDIFtest(fit_multiGroup, DIF_covariate = SPISA$age)
summary_age <- summary(DIF_age)
summary_age
#>    item_type n_est_pars      stat     p_value      p_fdr
#> V1       2PL          2 1.0593683 0.378589381 0.56788407
#> V2       2PL          2 0.7508064 0.859981567 0.96747926
#> V3       2PL          2 1.3579483 0.097577612 0.21954963
#> V4       2PL          2 1.6092813 0.022394842 0.06718453
#> V5       2PL          2 1.0936506 0.332065265 0.56788407
#> V6       2PL          2 1.6830414 0.013809032 0.06214064
#> V7       2PL          2 0.5720341 0.989800709 0.98980071
#> V8       2PL          2 0.7729091 0.830897077 0.96747926
#> V9       2PL          2 1.9126405 0.002656469 0.02390822
In this case, the Double Maximum Test for continuous numeric
orderings (dm; Merkle and Zeileis, 2013; Merkle et al., 2014) is
used. Based on the uncorrected \(p\)-values, the results indicate that DIF
along age is detected in three items: V4 (\(p = 0.022\)), V6
(\(p = 0.014\)), and V9 (\(p = 0.003\)). After the FDR correction, only V9
remains below \(.05\) (\(p = 0.024\)).
Note that the score-based framework has the power to detect DIF along
numeric covariates, without assuming some functional form of the
DIF.
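To illustrate the idea behind such a test, the following toy sketch computes a double maximum statistic for a single parameter (a mean) along a simulated numeric covariate. It is not the scDIFtest or strucchange implementation: the data are simulated, all object names are hypothetical, and the model is deliberately reduced to one parameter so that the cumulative score process fits in a few lines of base R.

```r
# Toy illustration only: simulated data, one parameter (the mean of y)
set.seed(1)
n   <- 500
age <- runif(n, 18, 40)

# The mean of y jumps for persons older than 30: a step, not a linear trend
y <- rnorm(n, mean = ifelse(age > 30, 1, 0))

# Casewise score contributions for the mean, evaluated at the estimate
scores <- y - mean(y)

# Order persons along the covariate and cumulate the scaled scores
ord <- order(age)
B   <- cumsum(scores[ord]) / (sd(y) * sqrt(n))

# For a single parameter the double maximum is the largest absolute
# deviation of the cumulative score process from zero
dm <- max(abs(B))
dm

# Asymptotic 5% critical value for the supremum of |Brownian bridge|
dm > 1.358
```

No functional form of the relation between age and the parameter is specified anywhere in the sketch: the statistic reacts to any systematic pattern in the ordered scores, which is the property referred to in the preceding paragraph.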
spon
The results for the DIF-detection analysis with spon as
the covariate are:
DIF_spon <- scDIFtest(fit_multiGroup, DIF_covariate = SPISA$spon)
DIF_spon
#>
#> Score Based DIF-tests for 9 items
#> Person covariate: SPISA$spon
#> Test statistic type: Maximum Lagrange Multiplier Test for Ordered
#> Groups
#>
#>    item_type n_est_pars     stat    p_value     p_fdr
#> V1       2PL          2 1.868827 0.78060172 0.8781769
#> V2       2PL          2 6.342637 0.13596628 0.4475654
#> V3       2PL          2 2.390408 0.66504124 0.8550530
#> V4       2PL          2 3.597939 0.43189988 0.6478498
#> V5       2PL          2 7.536508 0.08199012 0.4475654
#> V6       2PL          2 4.847344 0.26234688 0.5325869
#> V7       2PL          2 1.305092 0.89392903 0.8939290
#> V8       2PL          2 6.174746 0.14918847 0.4475654
#> V9       2PL          2 4.553667 0.29588159 0.5325869
In this case, the Maximum Lagrange Multiplier Test for ordered groups
(maxLMo; Merkle and Zeileis, 2013; Merkle et al., 2014) is
used. Since all tests result in large \(p\)-values, we conclude that no DIF was
detected along the spon covariate.
scDIFtest is a user-friendly and efficient wrapper
around the sctest function of the strucchange
package. scDIFtest can be used to detect item-wise DIF
along both categorical and continuous DIF covariates. Note,
however, that for now the functionality is compatible only with IRT
models fit using the mirt package.