7  Estimations

Objectives

This chapter uses Shapley values to evaluate the importance of variables in explaining the predictions made by the classifiers trained in Chapter 5.

Before estimating the Shapley values (Hart (1989); Scott M. Lundberg and Lee (2017)), we need to retrieve the results of the estimations performed in the previous chapter.

library(tidyverse) # data wrangling
library(ranger)    # random forests
library(treeshap)  # SHAP values for tree-based models

We need to load the train and test datasets obtained in Chapter 5:

load("../data/out/df_train.rda")
load("../data/out/df_test.rda")

We also load the grid-search results for the random forest and for extreme gradient boosting (XGB):

load("../data/out/estim/v3/grid_search_rf.rda")
load("../data/out/estim/v3/grid_search_xgb.rda")

Let us keep track of the levels of qualitative variables.

# Named list giving, for each factor column, the vector of its levels
corresp_factors <- df_train |>
  select(where(is.factor)) |>
  map(levels)
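Since the factor variables will later be recoded as integers, this lookup lets us map a numeric code back to its original label. A minimal sketch (the helper and the variable name in the comment are hypothetical, for illustration only):

# Recover the original label of a factor from its numeric code:
# as.numeric() maps the first level to 1, the second to 2, and so on.
decode_level <- function(variable, code) {
  corresp_factors[[variable]][code]
}

# Hypothetical usage, assuming a factor named "PERSONNE_sexe":
# decode_level("PERSONNE_sexe", 2)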

The formula used to train the classifiers:

# Columns excluded from the predictors: the response, the individual
# identifier, and PERSONNE_statut
ind_remove <- which(
  colnames(df_train) %in% c("status", "id", "PERSONNE_statut")
)
formula <- as.formula(
  paste("status ~", paste(colnames(df_train)[-ind_remove], collapse = "+"))
)
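As a toy illustration of this construction (the column names age and income are hypothetical):

# Hypothetical column names, for illustration only
toy_cols <- c("status", "id", "PERSONNE_statut", "age", "income")
ind_remove_toy <- which(toy_cols %in% c("status", "id", "PERSONNE_statut"))
as.formula(paste("status ~", paste(toy_cols[-ind_remove_toy], collapse = "+")))
# status ~ age + income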

7.1 Estimations

To explain the predictions made with the tree-based methods, we will rely on TreeSHAP (Scott M. Lundberg et al. (2020)).

7.1.1 Random Forest

For TreeSHAP to work with the random forest estimated with {ranger}, we need to recode the categorical variables as numerical variables:

df_train_num <-
  df_train |>
  # Recode the response as 0/1
  mutate(status = as.numeric(status) - 1) |>
  # Recode every remaining factor by its level index (1, 2, ...)
  mutate(across(where(is.factor), as.numeric))

df_test_num <-
  df_test |>
  mutate(status = as.numeric(status) - 1) |>
  mutate(across(where(is.factor), as.numeric))
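A quick sanity check (not part of the original pipeline) that the response is now coded as 0/1 and matches the original labels:

# Cross-tabulate the original factor against its 0/1 recoding
table(df_train$status, df_train_num$status)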

We then estimate the final random forest on the train set, using the hyperparameters selected by the grid search in the previous chapter:

final_model_rf <- ranger(
  formula,
  data = df_train_num,
  # Hyperparameters selected by the grid search
  mtry = grid_search_rf$bestTune$mtry,
  splitrule = "gini",
  min.node.size = grid_search_rf$bestTune$min.node.size,
  classification = TRUE
)
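As a quick check of the final forest, ranger reports the out-of-bag error (for classification, the misclassification rate):

# Out-of-bag misclassification rate of the final forest
final_model_rf$prediction.error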

The reference data on which treeshap will rely:

reference_data_rf <- df_train_num

We need to convert the model into the standard representation expected by treeshap:

model_unified_rf <- ranger.unify(final_model_rf, reference_data_rf)

Then, we can estimate the SHAP values. Note that the following code is not run in this chapter; we load results computed beforehand. We estimate the SHAP values on observations from both the train sample and the test sample, but the synthetic data are built from the train set only.

df_all_rf <- bind_rows(df_train_num, df_test_num)
treeshap_rf <- treeshap(model_unified_rf, df_all_rf)

We save the estimated SHAP values and the estimated random forest:

dir.create("../data/out/treeSHAP/v3/", recursive = TRUE)
save(final_model_rf, treeshap_rf, file = "../data/out/treeSHAP/v3/treeshap_rf.rda")
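Once the SHAP values are computed (or loaded), a common first summary is the mean absolute SHAP value of each variable, sketched here from the shaps element of the treeshap object:

# Mean absolute SHAP value per variable, in decreasing order
treeshap_rf$shaps |>
  summarise(across(everything(), ~ mean(abs(.x)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "mean_abs_shap") |>
  arrange(desc(mean_abs_shap))

# {treeshap} also provides a ready-made plot of this summary:
# plot_feature_importance(treeshap_rf)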

7.1.2 Extreme Gradient Boosting

The best model obtained from the grid search:

final_model_xgb <- grid_search_xgb$finalModel

The reference data on which treeshap will rely (train data only). To be consistent with the data later passed to treeshap(), we encode it with the model matrix used for the xgboost model:

reference_data_xgb <- model.matrix(formula, df_train)

Formatting the model for treeshap:

unified_xgb <- unify(final_model_xgb, data = reference_data_xgb)

We will estimate the SHAP values on data from both the train set and the test set:

df_all_xgb <- bind_rows(df_train, df_test)

Let us calculate the SHAP values. Again, the following code is not run in this chapter; results obtained prior to the compilation of the ebook are loaded instead.

treeshap_xgb <- treeshap(
  unified_xgb,
  model.matrix(formula, df_all_xgb),
  verbose = TRUE,
  # Also compute the pairwise SHAP interaction values
  interactions = TRUE
)
save(treeshap_xgb, file = "../data/out/treeSHAP/v3/treeshap_xgb.rda")
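When the ebook is compiled, the saved object can be reloaded and explored. With interactions = TRUE, the treeshap object also stores an array of pairwise SHAP interaction values (features x features x observations). A minimal sketch:

load("../data/out/treeSHAP/v3/treeshap_xgb.rda")

# SHAP values: one row per observation, one column per feature
dim(treeshap_xgb$shaps)
# Pairwise SHAP interaction values: features x features x observations
dim(treeshap_xgb$interactions)

# Decomposition of the prediction for the first observation:
# plot_contribution(treeshap_xgb, obs = 1)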