Note: the following variables were excluded for the following reasons:
PERSONNE_etatleg gives the marital status; we keep PERSONNE_couple which states whether the person lives as a couple
MENAGE_revdetail is the income in numerical values; we prefer keeping MENAGE_revucinsee (Net Income per Cons. Unit)
MENAGE_rap_zau: the variable MENAGE_tu already provides a very similar information
SOINS_remamb, SOINS_tmamb, SOINS_dpaamb, SOINS_pf_frambSOINS_pf_framb : these are the sum of other reimbursements; they would therefore cause colinearity issues.
Some additional information should be provided regarding healthcare variables. These variables are composed of :
expenditures (dep, dépenses)
health insurance reimbursements (remb, remboursements par l’assurance maladie)
co-payment (tm, ticket modérateur)
fee overruns (dpa, dépassement d’honoraires, non remboursés)
Let us keep track on the number of individuals that were removed:
nb_obs_df <- nb_obs_df |>bind_rows(tibble(step ="pb_depress_not_reported", n =nrow(df_tmp)) )nb_obs_df
# A tibble: 4 × 2
step n
<chr> <dbl>
1 raw 19940
2 age_15 18561
3 missing_mhi5_score 12373
4 pb_depress_not_reported 11872
Let us focus on the MHI-3 score here and discard the MHI-3. Let us create a variable (status) that classifies individuals depending on their MHI-3 score and their answer regarding depression:
# A tibble: 0 × 3
# ℹ 3 variables: status <chr>, PERSONNE_pb_depress <chr>, inf_q1_mhi_3 <fct>
We are interested in people who self report having not experienced depression during the past year. Other people will be discarded. The proportion of imaginary healthy patients among the individuals that are about to be discarded:
df_tmp |>filter(! status %in%c("Not_D_and_inf_Q1", "Not_D_and_sup_Q1")) |>group_by(inf_q1_mhi_3, status) |>count() |>ungroup() |>mutate(prop =round(100* n /sum(n), digits =2))
# A tibble: 2 × 4
inf_q1_mhi_3 status n prop
<fct> <chr> <int> <dbl>
1 <=Q1 D_and_inf_Q1 542 80.5
2 >Q1 D_and_sup_Q1 131 19.5
Let us keep track on the number of individuals that were removed:
nb_obs_df <- nb_obs_df |>bind_rows(tibble(step ="not_D", n =nrow(df_tmp)) )nb_obs_df
# A tibble: 5 × 2
step n
<chr> <dbl>
1 raw 19940
2 age_15 18561
3 missing_mhi5_score 12373
4 pb_depress_not_reported 11872
5 not_D 11199
9.1 Looking at missing data
We need to check where are the missing data. Let us have a look at the “Working conditions” questions (columns which begin with QST_c), and count the number and corresponding proportion of missing values:
Table 9.1: Missing values in the Working Conditions questions
variable
no_obs
nb_not_surveyed
prop
QST_ct_depech
11199
5714
51.02
QST_ct_liberte
11199
5714
51.02
QST_ct_apprend
11199
5714
51.02
QST_ct_aidecol
11199
5714
51.02
QST_ct_travnuit
11199
5714
51.02
QST_ct_repet
11199
5714
51.02
QST_ct_lourd
11199
5714
51.02
QST_ct_posture
11199
5714
51.02
QST_ct_produit
11199
5714
51.02
As we can see from the previous result, around 50% of respondents did not give an answer to the questions related to work. It does not only concern unemployed people (1,289 individuals are classified as “Inactive having never worked”):
table(df_tmp$PERSONNE_rap_pcs8) |>sort()
Farmer Craftsman, trader
383 567
Unskilled worker Commercial employee
892 1315
Executive and intellectual profession Administrative employee
1443 1498
Inactive having never worked Skilled worker
1595 1621
Intermediate occupation
1875
Let us have a look at the distribution of PERSONNE_rap_pcs8 for individuals for which QST_ct_depech is missing:
# A tibble: 5,714 × 9
QST_ct_depech QST_ct_liberte QST_ct_apprend QST_ct_aidecol QST_ct_travnuit
<fct> <fct> <fct> <fct> <fct>
1 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
2 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
3 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
4 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
5 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
6 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
7 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
8 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
9 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
10 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
# ℹ 5,704 more rows
# ℹ 4 more variables: QST_ct_repet <fct>, QST_ct_lourd <fct>,
# QST_ct_posture <fct>, QST_ct_produit <fct>
We do the same for variables which give information relative to social interactions and which give information on the parents’ background (i.e., variables that begin with QES). First, let us have a look at the proportion of NA in those:
Ordinal variables need to be altered prior to the estimation (because of SHAP). We change them to numerical variables. To keep a track of the corresponding numerical values associated with each label, we create a table.
Then, we create the table correspondance_ordinal that contains the numerical values associated with the different categories.
Then, we remove observations that contain missing values. Let us keep a track on the number of observation removed at each time.
nb_obs <-nrow(df_tmp)eye_on_proportions <-vector(mode="list", length =ncol(df_tmp))for(p in1:ncol(df_tmp)){ variable_to_check <-colnames(df_tmp)[p] eye_on_proportions[[p]] <- df_tmp |>filter(is.na(!!sym(variable_to_check))) |>group_by(status) |>count() |>ungroup() |>mutate(prop =round(100* n /sum(n), digits =2))# Filtering out the observations with missing values df_tmp <- df_tmp |>filter(!is.na(!!sym(variable_to_check)))if(nrow(df_tmp) < nb_obs){ nb_obs_df_tmp <-tibble(step = variable_to_check, n =nrow(df_tmp)) nb_obs_df <- nb_obs_df |>bind_rows(nb_obs_df_tmp) nb_obs <-nrow(df_tmp) }}names(eye_on_proportions) <-colnames(df_tmp)
Let us have a look at the nb_obs_df table. The first 4 rows give:
the number of observations with the raw sample
how many observations were left after removing people younger than 15 years old
homw many observations were left when removing individuals who did not report their status with regard to depression
how many individuals reported feeling not depressed over the last 12 months.
The remaining rows indicate how many individuals (column n) are left in the sample once individuals with missing data for the variable reported in column step are removed from the sample. The number given in column n_drop indicates the loss due to lack of data for the variable of interest.
Recall we coded missing values for categorical variables as Not reported and missing values because the individuals were not included in the sub-survey as Not surveyed. Let us look at how many observation fall into each of those categories:
Let us remove the 12 observations where the couple status is not reported (too few observations for this category). The proportion of imaginary healthy patients among the individuals that are about to be discarded:
df_tmp |>filter(PERSONNE_couple =="No answer") |>left_join(df |>select(id, inf_q1_mhi_3), by ="id") |>group_by(inf_q1_mhi_3, status) |>count() |>ungroup() |>mutate(prop =round(100* n /sum(n), digits =2))
# A tibble: 1 × 4
inf_q1_mhi_3 status n prop
<fct> <fct> <int> <dbl>
1 >Q1 Not_D_and_sup_Q1 12 100