library(tidyverse)
library(ranger)
library(treeshap)
7 Estimations
This chapter uses Shapley values to evaluate the importance of variables in explaining the predictions made by the classifiers trained in Chapter 5.
Before estimating the Shapley values (Hart (1989); Scott M. Lundberg and Lee (2017)), we need to get the results from the estimations performed in the previous chapter.
We need to load the train and test datasets obtained in Chapter 5:
load("../data/out/df_train.rda")
load("../data/out/df_test.rda")
We also load the grid-search results for the random forest and extreme gradient boosting (XGBoost) models:
load("../data/out/estim/v3/grid_search_rf.rda")
load("../data/out/estim/v3/grid_search_xgb.rda")
Let us keep track of the levels of qualitative variables.
corresp_factors <- df_train |>
  select_if(is.factor) |>
  map(levels)
The formula used to train the classifiers:
ind_remove <- which(
  colnames(df_train) %in% c("status", "id", "PERSONNE_statut")
)
formula <- as.formula(
  paste("status ~", paste(colnames(df_train[-ind_remove]), collapse = "+"))
)
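To make the formula construction concrete, here is a minimal sketch on a toy data frame. The object names (toy_train, toy_remove, toy_formula) and the columns are hypothetical and only stand in for df_train; the logic mirrors the code above.

```r
# Toy data frame standing in for df_train (hypothetical columns).
toy_train <- data.frame(
  id = 1:3,
  status = factor(c("no", "yes", "no")),
  age = c(34, 51, 27),
  sex = factor(c("F", "M", "F"))
)

# Positions of the columns that must not appear on the right-hand side.
toy_remove <- which(colnames(toy_train) %in% c("status", "id"))

# Paste the remaining column names into a "status ~ x1+x2+..." formula.
toy_formula <- as.formula(
  paste("status ~", paste(colnames(toy_train)[-toy_remove], collapse = "+"))
)

# Variables involved in the formula: the response first, then the predictors.
all.vars(toy_formula)
```

Only the response and the retained predictors end up in the formula; the identifier column is excluded.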
7.1 Estimations
To explain the predictions made with the tree-based methods, we will rely on TreeSHAP (Scott M. Lundberg et al. (2020)).
7.1.1 Random Forest
For this algorithm to work with random forests, we need to transform the categorical variables into numerical variables:
df_train_num <-
  df_train |>
  mutate(status = as.numeric(status) - 1) |>
  mutate(across(where(is.factor), as.numeric))
df_test_num <-
  df_test |>
  mutate(status = as.numeric(status) - 1) |>
  mutate(across(where(is.factor), as.numeric))
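The following sketch illustrates, on a toy data frame, what this conversion does and why the factor levels are worth storing beforehand. It uses base R rather than dplyr so that it runs on its own; the names toy, toy_levels and toy_num are hypothetical.

```r
# Toy data with a two-level response and one categorical predictor.
toy <- data.frame(
  status = factor(c("no", "yes", "yes"), levels = c("no", "yes")),
  region = factor(c("north", "south", "east"))
)

# Store the level labels before they are lost in the conversion.
toy_levels <- lapply(Filter(is.factor, toy), levels)

toy_num <- toy
# Two-level response: the integer codes 1/2 become 0/1.
toy_num$status <- as.numeric(toy_num$status) - 1
# Remaining factors: each value is replaced by its integer level code.
is_fac <- vapply(toy_num, is.factor, logical(1))
toy_num[is_fac] <- lapply(toy_num[is_fac], as.numeric)

# The stored levels allow the codes to be mapped back to labels.
toy_levels$region[toy_num$region]
```

The integer codes follow the order of the factor levels, so the mapping back to labels is only possible if the levels were saved before the conversion.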
We need to estimate the final random forest. Let us train it on the train set with the hyperparameters selected by the grid search:
final_model_rf <- ranger(
  formula,
  data = df_train_num,
  mtry = grid_search_rf$bestTune$mtry,
  splitrule = "gini",
  min.node.size = grid_search_rf$bestTune$min.node.size,
  classification = TRUE
)
The data on which treeshap will rely:
reference_data_rf <- df_train_num
We need to convert the model into the standard representation expected by treeshap:
model_unified_rf <- ranger.unify(final_model_rf, reference_data_rf)
Then, we can estimate the SHAP values. Note that in this chapter the following code is not run; we load the results instead. We estimate the SHAP values on observations from both the train sample and the test sample, but the synthetic data are built from the train set only.
df_all_rf <- bind_rows(df_train_num, df_test_num)
treeshap_rf <- treeshap(model_unified_rf, df_all_rf)
We save the estimated SHAP values and the estimated random forest:
dir.create("../data/out/treeSHAP/v3/", recursive = TRUE)
save(final_model_rf, treeshap_rf, file = "../data/out/treeSHAP/v3/treeshap_rf.rda")
7.1.2 Extreme Gradient Boosting
The best model obtained:
final_model_xgb <- grid_search_xgb$finalModel
The reference data on which treeshap will rely (train data only):
reference_data_xgb <- df_train
Formatting the model for treeshap:
unified_xgb <- unify(final_model_xgb, data = df_train)
We will estimate the SHAP values on data from both the train set and the test set:
df_all_xgb <- bind_rows(df_train, df_test)
Let us calculate the SHAP values. Again, the following code is not run in this chapter. Results obtained prior to the compilation of the ebook are loaded.
treeshap_xgb <- treeshap(
  unified_xgb,
  model.matrix(formula, df_all_xgb),
  verbose = TRUE,
  interactions = TRUE
)
save(treeshap_xgb, file = "../data/out/treeSHAP/v3/treeshap_xgb.rda")
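Once SHAP values are available, a common way to summarise variable importance is the mean absolute SHAP value per feature. The sketch below shows the computation on a hypothetical data frame of SHAP values (toy_shaps, with made-up numbers). It assumes one row per observation and one column per feature, which is, to our understanding, how treeshap stores SHAP values in the shaps element of its result; this is an illustration, not the computation used in this book.

```r
# Hypothetical SHAP values: one row per observation, one column per feature.
toy_shaps <- data.frame(
  age    = c( 0.10, -0.30, 0.20),
  income = c(-0.05,  0.05, 0.00),
  region = c( 0.40, -0.40, 0.10)
)

# Mean absolute SHAP value per feature, sorted from most to least important.
importance <- sort(colMeans(abs(toy_shaps)), decreasing = TRUE)
importance
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling out across observations.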