17  Estimations

Objectives

This chapter gives the code to perform the classification of individuals’ mental health given their characteristics and consumption patterns. This is a robustness check that replicates the analysis performed in Chapter 5, using the self-assessed health (SAH) instead of the declarative question on depression to define the imaginary healthy. The MHI-5 score is used here to define the imaginary healthy.

Let us load the data obtained after cleaning (see the chapter where the data is cleaned, Chapter 15).

Let us first load {tidyverse}:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The cleaned data:

load("../data/df_clean_sah.rda")
dim(df_clean)
[1] 5380   94

17.1 Samples

We only focus on people who state they did not experience a depression episode during the previous year. We consider two types of individuals: those whose MHI-5 score is considered low (less than or equal to 60, i.e., the first quartile of the distribution of MHI-5 scores among all respondents), and those whose score is considered high (greater than 60).

Let us create our target variable based on this. First, we need to load some packages.

library(caret)
Loading required package: lattice

Attaching package: 'caret'
The following object is masked from 'package:purrr':

    lift
library(MLmetrics)

Attaching package: 'MLmetrics'
The following objects are masked from 'package:caret':

    MAE, RMSE
The following object is masked from 'package:base':

    Recall
library(parallel) 
library(doParallel)
Loading required package: foreach

Attaching package: 'foreach'
The following objects are masked from 'package:purrr':

    accumulate, when
Loading required package: iterators

We split the dataset into two subsets. One for training the model (df_train), with 80% observations and a second for validating it (df_test) with the remaining 20% observations.

set.seed(123)
ind_build <- createDataPartition(df_clean$status, p = .8, list = FALSE)

df_train <- df_clean[ind_build,] |> as.data.frame()
df_test <- df_clean[-ind_build,] |> as.data.frame()

Here are the two levels of the target variable:

  • Level 1: MHI5 > q1 (healthy) (Not_D_and_sup_Q1).
  • Level 2 (positive class): MHI5 < q1 (imaginary healthy) (Not_D_and_inf_Q1)
table(df_train$status) |> prop.table()

Not_D_and_inf_Q1 Not_D_and_sup_Q1 
        0.295935         0.704065 
table(df_test$status) |> prop.table()

Not_D_and_inf_Q1 Not_D_and_sup_Q1 
        0.295814         0.704186 

Let us save the train and test samples.

save(df_train, file ="../data/out/df_train_sah.rda")
save(df_test, file ="../data/out/df_test_sah.rda")

17.3 Training

17.3.1 Random Forest

This section presents the code used to classify the respondents accordingly with their label, using a random forest (Breiman (2001)).

We will need the following libraries:

library(randomForest)
Warning: package 'randomForest' was built under R version 4.4.1
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':

    combine
The following object is masked from 'package:ggplot2':

    margin
library(ranger)

Attaching package: 'ranger'
The following object is masked from 'package:randomForest':

    importance

We are relying on the {ranger} package to classify individuals as imaginary healthy (Not_D_and_inf_Q1) or not (Not_D_and_sup_Q1). We use the ranger() function, that performs the classification using a random forest algorithm. Some parameters are needed to estimate the model:

  • the number of trees to grow (num.trees)
  • the number of variables to possibly split at each node of the trees (mtry)
  • the splitting rule (splitrule)
  • the minimal node size (min.node.size).

The choice of these hyper-parameters can greatly affect the quality of fit. Hence, we loop over a grid of possible combinations of these hyperparameters. More specifically:

  • we set the number of trees to 500 (default value in ranger())
  • we let the number of variables to possibly split at each node vary between 3 to the square root of the number of features (3 to 8)
  • we select the Gini index as the splitting rule
  • we let the minimum number of observations in terminal nodes vary in the following set: {50, 75, 100, 150}.

The grid is created as follows:

tune_grid_rf <- expand.grid(
  mtry = seq(3, sqrt(ncol(df_train[, -ind_remove]))),
  splitrule = "gini",
  min.node.size = c(50, 75, 100, 150)
)

We iterate over this grid. At each iteration, we estimate the model and compute the sensitivity.

The following code is not run, we load the results obtained when running this code.

no_cores <- detectCores() - 1
cl <- makePSOCKcluster(no_cores)
registerDoParallel(cl)

grid_search_rf <- train(
  form = formula,
  data = df_train,
  method = "ranger",
  trControl = control,
  tuneGrid = tune_grid_rf,
  metric = "Sens",
  # Arguments passed to ranger()
  num.trees = 500,
  importance = "permutation",
  respect.unordered.factors = "partition",
  classification = T,
  # slows the process greatly
  # keep.inbag = T,
  class.weights = 1 / prop.table(table(df_train$status))
)
stopCluster(cl)

From the code above, it can be noted that:

  • the variable importance mode we select is "permutation"
  • we set the respect.unordered.factors argument to "partition", so that all possible bi partitions are considered for splitting
  • the argument class.weights is set to the inverse proportion of our target variable in the training sample, to account for the imbalance of the target variable in the dataset.

We save the results:

dir.create("../data/out/estim/", recursive = TRUE)
save(
  grid_search_rf, tune_grid_rf, 
  file = "../data/out/estim/v3/grid_search_rf_sah.rda"
)

17.3.2 Extreme Gradient Boosting (XGBoost)

Let us now build a classifier on the same data using extreme gradient boosting (J. H. Friedman (2001)). We need to load the {xgboost} package.

library(xgboost)

Attaching package: 'xgboost'
The following object is masked from 'package:dplyr':

    slice

The grid to perform the grid search:

tune_grid_xgb <- expand.grid(
  nrounds = 500, # no. boosting iterations
  max_depth = 3:6, # max tree depth
  colsample_bytree = (1:10)/10, # pct columns used
  eta = 0.01, # learning rate
  gamma = c(0, 5, 10), # complexity parameter
  min_child_weight = c(50, 100, 150), # Minimum sum of instance weight in a child
  subsample = c(.7, .8, .9, 1) # Subsample ratio of the training instances
)

And the estimation (not run in this notebook):

cl <- makePSOCKcluster(no_cores)
registerDoParallel(cl)

set.seed(123)
grid_search_xgb <- train(
  formula,
  data = df_train,
  trControl = control,
  tuneGrid = tune_grid_xgb,
  method = "xgbTree",
  scale_pos_weight = sum(df_train$status != "Not_D_and_inf_Q1") /
    sum(df_train$status == "Not_D_and_inf_Q1"),
  metric = "Sens"
)
stopCluster(cl)
registerDoSEQ()

We save the results:

save(
  grid_search_xgb, tune_grid_xgb, 
  file = "../data/out/estim/v3/grid_search_xgb_sah.rda"
)

17.3.3 Support Vector Machine model (SVM)

Let us now build an SVM (Ben-Hur et al. (2002)) with the {e1071} package.

library(e1071)
cl <- makePSOCKcluster(no_cores)
registerDoParallel(cl)
svm_model <- train(
  formula,              # Formula
  data = df_train,      # Data
  method = "svmRadial", # SVM with Radial kernel
  trControl = control,  # Training control
  preProcess = c("center", "scale"), # Preprocessing
  tuneLength = 100      # Number of hyperparameter values to try  
)
stopCluster(cl)
registerDoSEQ()

Let us save the results:

save(svm_model, file = "../data/out/estim/v3/svm_model_sah.rda")

17.3.4 Penalized Logistic Regression Model (glmnet)

Let us now estimate a penalized logistic regression model (J. Friedman, Hastie, and Tibshirani (2010)). Again, the following code is not run in this chapter.

cl <- makePSOCKcluster(no_cores)
registerDoParallel(cl)

glmnet_model <- train(
  formula,             # Formula
  data = df_train,     # Data
  method = "glmnet",   # Penalized logistic regression
  trControl = control, # Training control
  preProcess = c("center", "scale"), # Preprocessing
  tuneLength = 100     # Number of hyperparameter values to try
)

stopCluster(cl)
registerDoSEQ()

Let us save the results:

save(glmnet_model, file = "../data/out/estim/v3/glmnet_model_sah.rda")