This chapter gives the code used to classify individuals' mental health based on their characteristics and consumption patterns. It is a robustness check that replicates the analysis performed in Chapter 5, using self-assessed health (SAH) instead of the declarative question on depression; the MHI-5 score is still used to define the imaginary healthy.
Let us load the data obtained after cleaning (see Chapter 15, where the data are cleaned).
Let us first load {tidyverse}:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The cleaned data:
load("../data/df_clean_sah.rda")dim(df_clean)
[1] 5380 94
17.1 Samples
We only focus on people who state they did not experience a depression episode during the previous year. We consider two types of individuals: those whose MHI-5 score is considered low (less than or equal to 60, i.e., the first quartile of the distribution of MHI-5 scores among all respondents), and those whose score is considered high (greater than 60).
Let us create our target variable based on this (a sketch is given below, after the required packages are loaded).
library(caret)
Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
library(MLmetrics)
Attaching package: 'MLmetrics'
The following objects are masked from 'package:caret':
MAE, RMSE
The following object is masked from 'package:base':
Recall
library(parallel)
library(doParallel)
Loading required package: foreach
Attaching package: 'foreach'
The following objects are masked from 'package:purrr':
accumulate, when
Loading required package: iterators
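With the packages loaded, we can construct the target variable. The exact code is not shown in this excerpt; the following is a minimal sketch, assuming the MHI-5 score and the depression question are stored in columns named mhi5 and depression (the actual column names may differ):
# Hypothetical sketch: keep respondents with no declared depression and
# define the target from the MHI-5 score (column names are assumptions).
df_clean <- df_clean |>
  filter(depression == "No") |>
  mutate(
    status = factor(
      if_else(mhi5 <= 60, "Not_D_and_inf_Q1", "Not_D_and_sup_Q1"),
      levels = c("Not_D_and_inf_Q1", "Not_D_and_sup_Q1")
    )
  )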
We split the dataset into two subsets: one for training the model (df_train), with 80% of the observations, and a second one for validating it (df_test), with the remaining 20%.
set.seed(123)
ind_build <- createDataPartition(df_clean$status, p = .8, list = FALSE)
df_train <- df_clean[ind_build, ] |> as.data.frame()
df_test <- df_clean[-ind_build, ] |> as.data.frame()
We will train different classifiers. For models that require hyperparameters, we will use the {caret} package to perform a grid search for the optimal values, using repeated 5-fold cross-validation (10 repetitions). The metric we optimize is the sensitivity (true positive rate), returned by the twoClassSummary() function from {caret}.
control <- trainControl(
  method = "repeatedcv",
  repeats = 10,
  number = 5,  # number of folds
  summaryFunction = twoClassSummary,
  search = "grid",
  classProbs = TRUE,
  verboseIter = TRUE,
  allowParallel = TRUE,
  savePredictions = TRUE
)
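Since allowParallel is set to TRUE, training can be distributed over several cores. The number of cores (no_cores, used later when registering the parallel backend) is not defined in this excerpt; a common choice, given here as an assumption, is to keep one core free:
# Assumption: use all available cores but one for the parallel backend.
no_cores <- parallel::detectCores() - 1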
Some columns need to be removed from the estimation:
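The exact list of excluded columns is not shown in this excerpt. As an illustration only, the following sketch drops hypothetical identifier and weighting columns (the names id and sampling_weight are assumptions) and builds the model formula used later:
# Hypothetical sketch: drop columns that should not enter the model
# (the names `id` and `sampling_weight` are assumptions).
cols_to_remove <- c("id", "sampling_weight")
predictors <- setdiff(colnames(df_train), c("status", cols_to_remove))
formula <- as.formula(paste("status ~", paste(predictors, collapse = " + ")))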
This section presents the code used to classify the respondents according to their label, using a random forest (Breiman (2001)).
We will need the following libraries:
library(randomForest)
Warning: package 'randomForest' was built under R version 4.4.1
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':
combine
The following object is masked from 'package:ggplot2':
margin
library(ranger)
Attaching package: 'ranger'
The following object is masked from 'package:randomForest':
importance
We rely on the {ranger} package to classify individuals as imaginary healthy (Not_D_and_inf_Q1) or not (Not_D_and_sup_Q1). We use the ranger() function, which performs the classification using a random forest algorithm. Some parameters are needed to estimate the model:
the number of trees to grow (num.trees)
the number of variables to possibly split at each node of the trees (mtry)
the splitting rule (splitrule)
the minimal node size (min.node.size).
The choice of these hyperparameters can greatly affect the quality of fit. Hence, we loop over a grid of possible combinations of these hyperparameters (a sketch of this loop is given after the list). More specifically:
we set the number of trees to 500 (default value in ranger())
we let the number of variables to possibly split at each node vary between 3 and the square root of the number of features (3 to 8)
we select the Gini index as the splitting rule
we let the minimum number of observations in terminal nodes vary in the following set: {50, 75, 100, 150}.
the variable importance mode we select is "permutation"
we set the respect.unordered.factors argument to "partition", so that all possible bipartitions of factor levels are considered for splitting
the argument class.weights is set to the inverse proportion of our target variable in the training sample, to account for the imbalance of the target variable in the dataset.
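The grid search itself is not run in this chapter. A minimal sketch of such a loop, under the settings listed above (the object names grid and res_grid are ours), could look as follows:
# Sketch of the grid search over the ranger() hyperparameters.
grid <- expand.grid(
  mtry = 3:8,
  min.node.size = c(50, 75, 100, 150)
)
# Inverse class proportions, to account for the imbalance of the target.
class_weights <- 1 / prop.table(table(df_train$status))

res_grid <- map(1:nrow(grid), function(i) {
  fit <- ranger(
    formula,
    data = df_train,
    num.trees = 500,
    mtry = grid$mtry[i],
    splitrule = "gini",
    min.node.size = grid$min.node.size[i],
    importance = "permutation",
    respect.unordered.factors = "partition",
    class.weights = as.numeric(class_weights)
  )
  tibble(
    mtry = grid$mtry[i],
    min.node.size = grid$min.node.size[i],
    oob_error = fit$prediction.error  # out-of-bag misclassification rate
  )
}) |>
  list_rbind()
The combination with the lowest out-of-bag error (or the highest sensitivity on held-out folds, as for the other classifiers) can then be retained.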
17.3.4 Penalized Logistic Regression Model (glmnet)
Let us now estimate a penalized logistic regression model (J. Friedman, Hastie, and Tibshirani (2010)). Again, the following code is not run in this chapter.
cl <- makePSOCKcluster(no_cores)
registerDoParallel(cl)
glmnet_model <- train(
  formula,                            # formula
  data = df_train,                    # data
  method = "glmnet",                  # penalized logistic regression
  trControl = control,                # training control
  preProcess = c("center", "scale"),  # preprocessing
  tuneLength = 100                    # number of hyperparameter values to try
)
stopCluster(cl)
registerDoSEQ()
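Once estimated, the model could be evaluated on the validation sample, for example as follows (a sketch, not run here; the positive class is assumed to be the imaginary healthy):
# Sketch (not run): confusion matrix of the fitted model on the test sample.
pred_test <- predict(glmnet_model, newdata = df_test)
confusionMatrix(
  data = pred_test,
  reference = df_test$status,
  positive = "Not_D_and_inf_Q1"  # assumed positive class (imaginary healthy)
)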
Friedman, Jerome H. 2001. "Greedy function approximation: A gradient boosting machine." The Annals of Statistics 29 (5): 1189–1232. https://doi.org/10.1214/aos/1013203451.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. "Regularization Paths for Generalized Linear Models via Coordinate Descent." Journal of Statistical Software 33 (1). https://doi.org/10.18637/jss.v033.i01.