Title: | Conditional Permutation Importance |
---|---|
Description: | An add-on to the 'party' package, with a faster implementation of the partial-conditional permutation importance for random forests. The standard permutation importance is implemented exactly the same as in the 'party' package. The conditional permutation importance can be computed faster, with an option to be backward compatible to the 'party' implementation. The package is compatible with random forests fit using the 'party' and the 'randomForest' package. The methods are described in Strobl et al. (2007) <doi:10.1186/1471-2105-8-25> and Debeer and Strobl (2020) <doi:10.1186/s12859-020-03622-2>. |
Authors: | Dries Debeer [aut, cre], Torsten Hothorn [aut], Carolin Strobl [aut] |
Maintainer: | Dries Debeer <[email protected]> |
License: | GPL-2 | GPL-3 |
Version: | 1.0-2 |
Built: | 2025-03-05 04:02:49 UTC |
Source: | https://github.com/ddebeer/permimp |
An add-on to the 'party' package, with a faster implementation of the partial-conditional permutation importance for random forests. The standard permutation importance is implemented exactly the same as in the 'party' package. The conditional permutation importance can be computed faster, with an option to be backward compatible to the 'party' implementation. The package is compatible with random forests fit using the 'party' and the 'randomForest' package. The methods are described in Strobl et al. (2007) <doi:10.1186/1471-2105-8-25> and Debeer and Strobl (2020) <doi:10.1186/s12859-020-03622-2>.
Index of help topics:
VarImp VarImp Objects VarImp-methods Methods for VarImp Objects permimp Random Forest Permutation Importance for random forests permimp-package Conditional Permutation Importance ranks Reversed Rankings selFreq Predictor Selection Frequency in Random Forests
Maintainer: Dries Debeer [email protected]
Authors:
Carolin Strobl
Torsten Hothorn
Standard and partial/conditional permutation importance for
random forest-objects fit using the party or randomForest
packages, following the permutation principle of the 'mean decrease in
accuracy' importance in randomForest . The partial/conditional permutation
importance is implemented differently, selecting the predictions to condition
on in each tree using Pearson Chi-squared tests applied to the
by-split point-categorized predictors. In general the new implementation has
similar results as the original varimp
function. With
asParty = TRUE
, the partial/conditional permutation importance is
fully backward-compatible but faster than the original varimp
function in party.
permimp(object, ...) ## S3 method for class 'randomForest' permimp(object, nperm = 1, OOB = TRUE, scaled = FALSE, conditional = FALSE, threshold = .95, whichxnames = NULL, thresholdDiagnostics = FALSE, progressBar = TRUE, do_check = TRUE, ...) ## S3 method for class 'RandomForest' permimp(object, nperm = 1, OOB = TRUE, scaled = FALSE, conditional = FALSE, threshold = .95, whichxnames = NULL, thresholdDiagnostics = FALSE, progressBar = TRUE, pre1.0_0 = conditional, AUC = FALSE, asParty = FALSE, mincriterion = 0, ...)
permimp(object, ...) ## S3 method for class 'randomForest' permimp(object, nperm = 1, OOB = TRUE, scaled = FALSE, conditional = FALSE, threshold = .95, whichxnames = NULL, thresholdDiagnostics = FALSE, progressBar = TRUE, do_check = TRUE, ...) ## S3 method for class 'RandomForest' permimp(object, nperm = 1, OOB = TRUE, scaled = FALSE, conditional = FALSE, threshold = .95, whichxnames = NULL, thresholdDiagnostics = FALSE, progressBar = TRUE, pre1.0_0 = conditional, AUC = FALSE, asParty = FALSE, mincriterion = 0, ...)
object |
an object as returned by |
mincriterion |
the value of the test statistic or 1 - p-value that
must be exceeded in order to include a split in the
computation of the importance. The default
|
conditional |
a logical that determines whether unconditional or conditional permutation is performed. |
threshold |
the threshold value for (1 - p-value) of the association
between the predictor of interest and another predictor, which
must be exceeded in order to include the other predictor in
the conditioning scheme for the predictor of interest (only
relevant if |
nperm |
the number of permutations performed. |
OOB |
a logical that determines whether the importance is computed from the out-of-bag sample or the learning sample (not suggested). |
pre1.0_0 |
Prior to party version 1.0-0, the actual data values were permuted according to the original permutation importance suggested by Breiman (2001). Now the assignments to child nodes of splits in the variable of interest are permuted as described by Hapfelmeier et al. (2012), which allows for missing values in the predictors and is more efficient with respect to memory consumption and computing time. This method does not apply to the conditional permutation importance, nor to random forests that were not fit using the party package. |
scaled |
a logical that determines whether the differences in prediction accuracy should be scaled by the total (null-model) error. |
AUC |
a logical that determines whether the Area Under the Curve (AUC) instead of the accuracy is used to compute the permutation importance (cf. Janitza et al., 2012). The AUC-based permutation importance is more robust towards class imbalance, but it is only applicable to binary classification. |
asParty |
a logical that determines whether or not exactly the same
values as the original |
whichxnames |
a character vector containing the predictor variable names for which the permutation importance should be computed. Only use when aware of the implications, see section 'Details'. |
thresholdDiagnostics |
a logical that specifies whether diagnostics with respect to the threshold-value should be prompted as warnings. |
progressBar |
a logical that determines whether a progress bar should be displayed. |
do_check |
a logical that determines whether a check requiring user input should be included. |
... |
additional arguments to be passed to the Methods |
Function permimp
is highly comparable to varimp
in party,
but the partial/conditional variable importance has a different, more efficient
implementation. Compared to the original varimp
in party,
permimp
applies a different strategy to select the predictors to condition
on (ADD REFERENCE TO PAPER).
With asParty = TRUE
, permimp returns exactly the same values as
varimp
in party, but the computation is done more efficiently.
If conditional = TRUE
, the importance of each variable is computed by
permuting within a grid defined by the predictors that are associated
(with 1 - p-value greater than threshold
) to the variable of interest.
The threshold
can be interpreted as a parameter that moves the permutation
importance across a dimension from fully conditional (threshold = 0
) to
completely unconditional (threshold = 1
), see Debeer and Strobl (2020).
Using the wichxnames
argument, the computation of the permutation importance
can be limited to a smaller number of specified predictors. Note, however, that when
conditional = TRUE
, the (other) predictors to condition on are also
limited to this selection of predictors. Only use when fully aware of the
implications.
For further details, please refer to the documentation of varimp
.
An object of class VarImp
, with the mean decrease in accuracy
as its $values
.
Leo Breiman (2001). Random Forests. Machine Learning, 45(1), 5–32.
Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm, and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing, https://link.springer.com/article/10.1007/s11222-012-9349-1
Torsten Hothorn, Kurt Hornik, and Achim Zeileis (2006b). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15 (3), 651-674. Preprint available from https://www.zeileis.org/papers/Hothorn+Hornik+Zeileis-2006.pdf
Silke Janitza, Carolin Strobl and Anne-Laure Boulesteix (2013). An AUC-based Permutation Variable Importance Measure for Random Forests. BMC Bioinformatics.2013, 14 119. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-119
Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9, 307. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307
Debeer Dries and Carolin Strobl (2020). Conditional Permutation Importance Revisited. BMC Bioinformatics, 21, 307. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03622-2
### for RandomForest-objects, by party::cforest() set.seed(290875) readingSkills.cf <- party::cforest(score ~ ., data = party::readingSkills, control = party::cforest_unbiased(mtry = 2, ntree = 25)) ### conditional importance, may take a while... # party implementation: set.seed(290875) party::varimp(readingSkills.cf, conditional = TRUE) # faster implementation but same results set.seed(290875) permimp(readingSkills.cf, conditional = TRUE, asParty = TRUE) # different implementation with similar results set.seed(290875) permimp(readingSkills.cf, conditional = TRUE, asParty = FALSE) ### standard (unconditional) importance is unchanged set.seed(290875) party::varimp(readingSkills.cf) set.seed(290875) permimp(readingSkills.cf) ### set.seed(290875) readingSkills.rf <- randomForest::randomForest(score ~ ., data = party::readingSkills, mtry = 2, ntree = 25, importance = TRUE, keep.forest = TRUE, keep.inbag = TRUE) ### (unconditional) Permutation Importance set.seed(290875) permimp(readingSkills.rf, do_check = FALSE) # very close to readingSkills.rf$importance[,1] ### Conditional Permutation Importance set.seed(290875) permimp(readingSkills.rf, conditional = TRUE, threshold = .8, do_check = FALSE)
### for RandomForest-objects, by party::cforest() set.seed(290875) readingSkills.cf <- party::cforest(score ~ ., data = party::readingSkills, control = party::cforest_unbiased(mtry = 2, ntree = 25)) ### conditional importance, may take a while... # party implementation: set.seed(290875) party::varimp(readingSkills.cf, conditional = TRUE) # faster implementation but same results set.seed(290875) permimp(readingSkills.cf, conditional = TRUE, asParty = TRUE) # different implementation with similar results set.seed(290875) permimp(readingSkills.cf, conditional = TRUE, asParty = FALSE) ### standard (unconditional) importance is unchanged set.seed(290875) party::varimp(readingSkills.cf) set.seed(290875) permimp(readingSkills.cf) ### set.seed(290875) readingSkills.rf <- randomForest::randomForest(score ~ ., data = party::readingSkills, mtry = 2, ntree = 25, importance = TRUE, keep.forest = TRUE, keep.inbag = TRUE) ### (unconditional) Permutation Importance set.seed(290875) permimp(readingSkills.rf, do_check = FALSE) # very close to readingSkills.rf$importance[,1] ### Conditional Permutation Importance set.seed(290875) permimp(readingSkills.rf, conditional = TRUE, threshold = .8, do_check = FALSE)
Method for giving the reversed rankings of the numerical values of a vector or VarImp object.
ranks(x, note = TRUE, ...) ## Default S3 method: ranks(x, note = TRUE, ...) ## S3 method for class 'VarImp' ranks(x, note = TRUE, ...)
ranks(x, note = TRUE, ...) ## Default S3 method: ranks(x, note = TRUE, ...) ## S3 method for class 'VarImp' ranks(x, note = TRUE, ...)
x |
an object to be reverse ranked. |
note |
a logical specifying whether the (reversed) rankings should be printed instead of the importance values. |
... |
additional arguments to be passed to |
The ranks
function is nothing more than
(length(x) - rank(x, ...) + 1L)
. But it also works for objects of class
VarImp
.
A vector containing the reversed rankings.
## High Jump data HighJumps <- c(Anna = 1.45, Betty = 1.53, Cara = 1.37, Debby = 1.61, Emma = 1.29, Hanna = 1.44, Juno = 1.71) HighJumps ## ranking of high jump data ranks(HighJumps)
## High Jump data HighJumps <- c(Anna = 1.45, Betty = 1.53, Cara = 1.37, Debby = 1.61, Emma = 1.29, Hanna = 1.44, Juno = 1.71) HighJumps ## ranking of high jump data ranks(HighJumps)
counts how many times each predictor variable was selected for splitting
in a random forest. Only implemented for cforest
form the party
package.
selFreq(object, whichxnames = NULL)
selFreq(object, whichxnames = NULL)
object |
an object as returned by |
whichxnames |
a character vector containing the predictor variable names that for which the permutation importance should be computed. See section 'Details'. |
Function selFreq
counts how many times each predictor variable was
selected for splitting in a random forest. In the current implementation this
selFreq
can only be applied to random forests as returned by
cforest
.
An object of class VarImp
, with as $values
the mean of
the sum of the selection frequencies across the trees.
set.seed(290875) readingSkills.cf <- party::cforest(score ~ ., data = party::readingSkills, control = party::cforest_unbiased(mtry = 2, ntree = 100)) ## Selection Frequency selFreq(readingSkills.cf)
set.seed(290875) readingSkills.cf <- party::cforest(score ~ ., data = party::readingSkills, control = party::cforest_unbiased(mtry = 2, ntree = 100)) ## Selection Frequency selFreq(readingSkills.cf)
A class for random forest variable importance measures VarImp
objects.
as.VarImp(object, ...) ## S3 method for class 'data.frame' ## S3 method for class 'data.frame' as.VarImp(object, FUN = mean, type = c("Permutation", "Conditional Permutation", "Selection Frequency", "See Info"), info = NULL, ...) ## S3 method for class 'matrix' ## S3 method for class 'matrix' as.VarImp(object, FUN = mean, type = c("Permutation", "Conditional Permutation", "Selection Frequency", "See Info"), info = NULL, ...) ## S3 method for class 'numeric' ## S3 method for class 'numeric' as.VarImp(object, perTree = NULL, type = c("Permutation", "Conditional Permutation", "Selection Frequency", "See Info"), info = NULL, ...) is.VarImp(VarImp)
as.VarImp(object, ...) ## S3 method for class 'data.frame' ## S3 method for class 'data.frame' as.VarImp(object, FUN = mean, type = c("Permutation", "Conditional Permutation", "Selection Frequency", "See Info"), info = NULL, ...) ## S3 method for class 'matrix' ## S3 method for class 'matrix' as.VarImp(object, FUN = mean, type = c("Permutation", "Conditional Permutation", "Selection Frequency", "See Info"), info = NULL, ...) ## S3 method for class 'numeric' ## S3 method for class 'numeric' as.VarImp(object, perTree = NULL, type = c("Permutation", "Conditional Permutation", "Selection Frequency", "See Info"), info = NULL, ...) is.VarImp(VarImp)
object |
an R object. |
perTree |
a matrix or data frame of size |
type |
a character indicating the type of variable importance measure. |
info |
a list with additional information about the variable importance measure. |
FUN |
a function to compute the variable importance. See section 'Details'. |
VarImp |
an object of the class |
... |
additional arguments. |
as.VarImp
creates an object of class 'VarImp'. When object
is a
matrix
or a data.frame
, the final values
are computed by applying FUN
to its columns. is.VarImp
returns a logical indicating whether the evaluated object is of class
'VarImp'.
## Matrix of fake importance measures per Tree set.seed(290875) ntree <- 500 p <- 15 fakeVIM <- matrix(rnorm(ntree * p), nrow = ntree, ncol = p, dimnames = list(paste0("pred", seq_len(ntree)), paste0("pred", seq_len(p)))) is.VarImp(fakeVIM) ## make a 'VarImp' object fakeVarImp <- as.VarImp(fakeVIM, type = "See Info", info = list("The Vims are based on fake data.", "The mean was used to aggregate across the trees")) is.VarImp(fakeVarImp)
## Matrix of fake importance measures per Tree set.seed(290875) ntree <- 500 p <- 15 fakeVIM <- matrix(rnorm(ntree * p), nrow = ntree, ncol = p, dimnames = list(paste0("pred", seq_len(ntree)), paste0("pred", seq_len(p)))) is.VarImp(fakeVIM) ## make a 'VarImp' object fakeVarImp <- as.VarImp(fakeVIM, type = "See Info", info = list("The Vims are based on fake data.", "The mean was used to aggregate across the trees")) is.VarImp(fakeVarImp)
Methods for computing on VarImp
objects..
## S3 method for class 'VarImp' plot(x, nVar = length(x$values), type = c("bar", "box", "dot", "rank"), sort = TRUE, interval = c( "no", "quantile", "sd"), intervalProbs = c(.25, .75), intervalColor = NULL, horizontal = FALSE, col = NULL, pch = NULL, main = NULL, margin = NULL, ...) ## S3 method for class 'VarImp' print(x, ranks = FALSE, ...) ## S3 method for class 'VarImp' subset(x, subset, ...)
## S3 method for class 'VarImp' plot(x, nVar = length(x$values), type = c("bar", "box", "dot", "rank"), sort = TRUE, interval = c( "no", "quantile", "sd"), intervalProbs = c(.25, .75), intervalColor = NULL, horizontal = FALSE, col = NULL, pch = NULL, main = NULL, margin = NULL, ...) ## S3 method for class 'VarImp' print(x, ranks = FALSE, ...) ## S3 method for class 'VarImp' subset(x, subset, ...)
x |
an object of the class |
nVar |
an integer specifying the number of predictor variables that should be
included in the plot. The |
type |
a character string that indicates the type of plot. Must be one of the following: "bar", "box", "dot" or "rank" (see Details). |
sort |
a logical that specifies whether the predictors should be ranked according to the importance measures. |
interval |
a character string that indicates if, and which type of intervals
should be added to the plot. Must be one of the following: |
intervalProbs |
a numerical vector of the form c(bottom, top), specifying
the two quantiles that should be used for the interval.
Only meaningful when |
intervalColor |
a color code or name, see |
horizontal |
a logical that specifies whether the plot should be horizontal
(= importance values on the x-axis. The default is |
col |
a color code or name, see |
pch |
Either a single character or an integer code specifying the plotting
'character', see |
main |
an overall title for the plot: see |
margin |
a numerical vector of the form |
ranks |
a logical specifying whether the (reversed) rankings should be printed instead of the importance values. |
subset |
a character, integer or logical vector, specifying the subset of predictor variables. |
... |
additional arguments. |
plot
gives visualization of the variable importance values. print
prints the importance values, or their (reversed) rankings if ranks = TRUE
. ranks
returns the reversed rankings of the variable importance values. The subset
method for VarImp objects returns a VarImp object for only a subset of the original predictors in the random forest.
In plot
, the type = "bar"
results in a barplot, type = "dot"
in a point-plot, type = "rank"
in a point-plot with the importance rankings as the plotting 'characters', see ranks
. In each of these three options an interval (based on either two quantiles or on the standard deviation of the perTree
values) can be added to the plot. type = "box"
results in boxplots, and is only meaningful when perTree
values are available.
## Fit a random forest (using cforest) set.seed(290875) readingSkills.cf <- party::cforest(score ~ ., data = party::readingSkills, control = party::cforest_unbiased(mtry = 2, ntree = 50)) ## compute permutation variable importance: set.seed(290875) permVIM <- permimp(readingSkills.cf) ## print the variable importance values permVIM print(permVIM, ranks = TRUE) ranks(permVIM) ## Visualize the variable importance values plot(permVIM, type = "bar", margin = c(6,3,3,1)) plot(permVIM, nVar = 2, type = "box", horizontal = TRUE) ## note the rankings plot(subset(permVIM, c("age", "nativeSpeaker")), intervalColor = "pink") plot(subset(permVIM, c("shoeSize", "nativeSpeaker")), intervalColor = "pink")
## Fit a random forest (using cforest) set.seed(290875) readingSkills.cf <- party::cforest(score ~ ., data = party::readingSkills, control = party::cforest_unbiased(mtry = 2, ntree = 50)) ## compute permutation variable importance: set.seed(290875) permVIM <- permimp(readingSkills.cf) ## print the variable importance values permVIM print(permVIM, ranks = TRUE) ranks(permVIM) ## Visualize the variable importance values plot(permVIM, type = "bar", margin = c(6,3,3,1)) plot(permVIM, nVar = 2, type = "box", horizontal = TRUE) ## note the rankings plot(subset(permVIM, c("age", "nativeSpeaker")), intervalColor = "pink") plot(subset(permVIM, c("shoeSize", "nativeSpeaker")), intervalColor = "pink")