This chapter presents the codes used to clean data. It especially explains the different steps we took to deal with missing values. We use the self-assessed health status instead of the self-declared status with respect to depression here to define the target variable.
Note: the following variables were excluded for the following reasons:
PERSONNE_etatleg gives the marital status; we keep PERSONNE_couple which states whether the person lives as a couple
MENAGE_revdetail is the income in numerical values; we prefer keeping MENAGE_revucinsee (Net Income per Cons. Unit)
MENAGE_rap_zau: the variable MENAGE_tu already provides a very similar information
SOINS_remamb, SOINS_tmamb, SOINS_dpaamb, SOINS_pf_frambSOINS_pf_framb : these are the sum of other reimbursements; they would therefore cause colinearity issues.
Some additional information should be provided regarding healthcare variables. These variables are composed of :
expenditures (dep, dépenses)
health insurance reimbursements (remb, remboursements par l’assurance maladie)
co-payment (tm, ticket modérateur)
fee overruns (dpa, dépassement d’honoraires, non remboursés)
Let us keep track on the number of individuals that were removed:
nb_obs_df <- nb_obs_df |>bind_rows(tibble(step ="health_status_not_reported", n =nrow(df_tmp)) )nb_obs_df
# A tibble: 4 × 2
step n
<chr> <dbl>
1 raw 19940
2 age_15 18561
3 missing_mhi5_score 12373
4 health_status_not_reported 12304
Let us focus on the MHI-5 score here and discard the MHI-3. Let us create a variable (status) that classifies individuals depending on their MHI-5 score and their answer regarding depression:
df_tmp <- df_tmp |>select(-inf_q1_mhi_3) |>mutate(status =case_when( PERSONNE_etat_sante =="Very Bad or Bad"& inf_q1_mhi_5 =="<=Q1"~"D_and_inf_Q1", PERSONNE_etat_sante =="Very Bad or Bad"& inf_q1_mhi_5 ==">Q1"~"D_and_sup_Q1", PERSONNE_etat_sante =="Very Good, Good or Fairly Good"& inf_q1_mhi_5 =="<=Q1"~"Not_D_and_inf_Q1", PERSONNE_etat_sante =="Very Good, Good or Fairly Good"& inf_q1_mhi_5 ==">Q1"~"Not_D_and_sup_Q1",TRUE~"problem" ) )
We can look at the distribution of this newly created variable:
# A tibble: 0 × 3
# ℹ 3 variables: status <chr>, PERSONNE_etat_sante <chr>, inf_q1_mhi_5 <fct>
We are interested in people who self report having a good SAH status. Other people will be discarded. The proportion of imaginary healthy patients among the individuals that are about to be discarded:
df_tmp |>filter(! status %in%c("Not_D_and_inf_Q1", "Not_D_and_sup_Q1")) |>group_by(inf_q1_mhi_5, status) |>count() |>ungroup() |>mutate(prop =round(100* n /sum(n), digits =2))
# A tibble: 2 × 4
inf_q1_mhi_5 status n prop
<fct> <chr> <int> <dbl>
1 <=Q1 D_and_inf_Q1 633 74.9
2 >Q1 D_and_sup_Q1 212 25.1
Let us discard other people. We define the status variable as a binary factor:
Let us keep track on the number of individuals that were removed:
nb_obs_df <- nb_obs_df |>bind_rows(tibble(step ="not_D", n =nrow(df_tmp)) )nb_obs_df
# A tibble: 5 × 2
step n
<chr> <dbl>
1 raw 19940
2 age_15 18561
3 missing_mhi5_score 12373
4 health_status_not_reported 12304
5 not_D 11459
15.1 Looking at missing data
We need to check where are the missing data. Let us have a look at the “Working conditions” questions (columns which begin with QST_c), and count the number and corresponding proportion of missing values:
Table 15.1: Missing values in the Working Conditions questions
variable
no_obs
nb_not_surveyed
prop
QST_ct_depech
11459
5638
49.2
QST_ct_liberte
11459
5638
49.2
QST_ct_apprend
11459
5638
49.2
QST_ct_aidecol
11459
5638
49.2
QST_ct_travnuit
11459
5638
49.2
QST_ct_repet
11459
5638
49.2
QST_ct_lourd
11459
5638
49.2
QST_ct_posture
11459
5638
49.2
QST_ct_produit
11459
5638
49.2
As we can see from the previous result, around 50% of respondents did not give an answer to the questions related to work. It does not only concern unemployed people:
table(df_tmp$PERSONNE_rap_pcs8) |>sort()
Farmer Craftsman, trader
376 558
Unskilled worker Commercial employee
872 1360
Executive and intellectual profession Administrative employee
1464 1549
Skilled worker Inactive having never worked
1633 1658
Intermediate occupation
1979
Let us have a look at the distribution of PERSONNE_rap_pcs8 for individuals for which QST_ct_depech is missing:
# A tibble: 5,638 × 9
QST_ct_depech QST_ct_liberte QST_ct_apprend QST_ct_aidecol QST_ct_travnuit
<fct> <fct> <fct> <fct> <fct>
1 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
2 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
3 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
4 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
5 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
6 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
7 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
8 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
9 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
10 Not surveyed Not surveyed Not surveyed Not surveyed Not surveyed
# ℹ 5,628 more rows
# ℹ 4 more variables: QST_ct_repet <fct>, QST_ct_lourd <fct>,
# QST_ct_posture <fct>, QST_ct_produit <fct>
We do the same for variables which give information relative to social interactions and which give information on the parents’ background (i.e., variables that begin with QES). First, let us have a look at the proportion of NA in those:
Ordinal variables need to be altered prior to the estimation (because of SHAP). We change them to numerical variables. To keep a track of the corresponding numerical values associated with each label, we create a table.
Then, we create the table correspondance_ordinal that contains the numerical values associated with the different categories.
Then, we remove observations that contain missing values. Let us keep a track on the number of observation removed at each time.
nb_obs <-nrow(df_tmp)eye_on_proportions <-vector(mode="list", length =ncol(df_tmp))for(p in1:ncol(df_tmp)){ variable_to_check <-colnames(df_tmp)[p] eye_on_proportions[[p]] <- df_tmp |>filter(is.na(!!sym(variable_to_check))) |>group_by(status) |>count() |>ungroup() |>mutate(prop =round(100* n /sum(n), digits =2))# Filtering out the observations with missing values df_tmp <- df_tmp |>filter(!is.na(!!sym(variable_to_check)))if(nrow(df_tmp) < nb_obs){ nb_obs_df_tmp <-tibble(step = variable_to_check, n =nrow(df_tmp)) nb_obs_df <- nb_obs_df |>bind_rows(nb_obs_df_tmp) nb_obs <-nrow(df_tmp) }}names(eye_on_proportions) <-colnames(df_tmp)
Let us have a look at the nb_obs_df table. The first 4 rows give:
the number of observations with the raw sample
how many observations were left after removing people younger than 15 years old
how many observations were left when removing individuals who did not report their health status
how many individuals reported feeling good (SAH).
The remaining rows indicate how many individuals (column n) are left in the sample once individuals with missing data for the variable reported in column step are removed from the sample. The number given in column n_drop indicates the loss due to lack of data for the variable of interest.
Recall we coded missing values for categorical variables as Not reported and missing values because the individuals were not included in the sub-survey as Not surveyed. Let us look at how many observation fall into each of those categories:
Let us remove the 12 observations where the couple status is not reported (too few observations for this category). The proportion of imaginary healthy patients among the individuals that are about to be discarded:
df_tmp |>filter(PERSONNE_couple =="No answer") |>left_join(df |>select(id, inf_q1_mhi_5), by ="id") |>group_by(inf_q1_mhi_5, status) |>count() |>ungroup() |>mutate(prop =round(100* n /sum(n), digits =2))
# A tibble: 2 × 4
inf_q1_mhi_5 status n prop
<fct> <fct> <int> <dbl>
1 <=Q1 Not_D_and_inf_Q1 1 8.33
2 >Q1 Not_D_and_sup_Q1 11 91.7