1 Load Data

Objectives

This notebook explains how the data are loaded.

We merge together different datasets:

ESPS survey data
SNIIRAM data (Health consumption from Social Security data)
sunlight data (see Chapter 3).

The data dictionary for the ESPS survey is available on the IRDES website.

1.1 Loading and Merging Datasets

Let us first load {tidyverse}:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Then we can import the survey data from the CSV files:

File name	Source	Description
`personne.csv`	ESPS	Individual characteristics
`menage.csv`	ESPS	Household characteristics
`mutuelle.csv`	ESPS	Mutual insurance characteristics
`opinion1.csv`	ESPS	Opinion on some research questions (during the first contact)
`qst.csv`	ESPS	Working conditions
`qes.csv`	ESPS	Economic and social situation
`deppers2012.dta`	SNIIRAM	Data on health consumption
`map_regions_2014.rda`	IGN	France Map
`ensol_regions.rda`	Météo France	Sunlight data

library(tidyverse)
df_personne <- read_csv("../data/data_esps/personne.csv")
df_menage <- read_csv("../data/data_esps/menage.csv")
mutuelle <- read_csv("../data/data_esps/mutuelle.csv")
df_opinion_1 <- read_csv("../data/data_esps/opinion1.csv")
df_qst <- read_csv("../data/data_esps/qst.csv")
df_qes <- read_csv("../data/data_esps/qes.csv")
data_sniiram <- foreign::read.dta("../data/sniiram/deppers2012.dta") |> 
  as_tibble() |> 
  mutate(nind = as.numeric(nind))

We can also load some sunlight data (see the sunlight notebook for more details on those):

load("../data/meteo/ensol_regions.rda")

Let us add a prefix ("SOIN") to the names of the variables from the SNIIRAM:

data_sniiram <- 
  data_sniiram |> 
  rename_with(~str_c("SOINS_", .)) |> 
  rename(ben_n4 = SOINS_ben_n4, nind = SOINS_nind)

Some people in the table mutuelle (mutual insurance) have two rows (we create a group index here to avoid displaying the real ID):

mutuelle$group_ID <- group_by(mutuelle, ben_n4, nind) |> group_indices()
mutuelle |> 
  select(-ben_n4, -nind) |> 
  group_by(group_ID) |> 
  mutate(count = n()) |> 
  filter(count > 1) |> 
  arrange(desc(count), group_ID)

# A tibble: 1,647 × 10
# Groups:   group_ID [809]
    ...1 nmut  typcc  idcc  assu i_entrepr entrepr nindassu_gv group_ID count
   <dbl> <chr> <dbl> <dbl> <dbl>     <dbl>   <dbl> <chr>          <int> <int>
 1   830 01        0 99999     1         1       1 01               813     4
 2   831 02        1 55924     2         1       1 02               813     4
 3   835 03        1 60038     1         1       2 01               813     4
 4   837 04        1 60038     1         1       2 01               813     4
 5  7975 01        1 10002     1         1       1 02              7673     4
 6  7977 02        2 70506     2         1       1 01              7673     4
 7  7980 03        2 70109     2         1       1 01              7673     4
 8  7983 04        1 60038     2         1       2 01              7673     4
 9   626 01        1 31003     2         1       1 01               616     3
10   627 02        4 90001     1         1       1 02               616     3
# ℹ 1,637 more rows

The number of observations in mutuelles:

nrow(mutuelle)

[1] 20779

The number of unique individuals in mutuelles:

mutuelle |> 
  group_by(group_ID) |> 
  slice(1) |> 
  ungroup() |> 
  nrow()

[1] 19941

Let us randomly pick an observation among the two available for the 838 persons for which two rows are present in mutuelle

set.seed(17463)
df_mutuelle_sans_redondance <- 
  mutuelle |> 
  group_by(group_ID) |> 
  sample_n(size = 1) |> 
  ungroup() |> 
  select(-group_ID)

The number of unique individuals left is:

nrow(mutuelle)

[1] 20779

Let us add a prefix ("MUTUELLE") to the names of the variables from df_mutuelle_sans_redondance:

df_mutuelle_sans_redondance <- 
  df_mutuelle_sans_redondance |> 
  rename_with(~str_c("MUTUELLE_", .)) |> 
  rename(ben_n4 = MUTUELLE_ben_n4, nind = MUTUELLE_nind)

Let us focus on individual characteristics. Those are recorded in the table df_personne.

nrow(df_personne)

[1] 23047

As can be seen, this table contains 23047 individuals. We can also observe that we are able to find a match for all these people in the household characteristics (df_menage):

df_personne |> 
  # Add household characteristics
  anti_join(df_menage, by = "ben_n4") |> 
  nrow()

[1] 0

But we are not able to find a match for all individuals in some parts of the ESPS database. That is normal, as some questions were asked to the respondents only if they have specific attributes. Let us check how many people from df_personne cannot be found in the different parts of the ESPS survey. First, let us consider the df_opinion_1 dataset. The number of individuals who did not respond to questions relative to this part (see variable dictionary)

df_personne |> 
  anti_join(df_opinion_1, by = c("ben_n4", "nind")) |> 
  nrow()

[1] 14634

Then, we can consider the df_qst dataset (see online dictionary). This table gives information relative to working conditions. Only people aged 16 to 65 years old can be included in this dataset. The number of people from df_personne that are not found in df_qst is:

df_personne |> 
  anti_join(df_qst, by = c("ben_n4", "nind")) |> 
  nrow()

[1] 16848

The economic and social situation of indidivuals is filled in df_qes. The number of individuals for whom no match is found is:

df_personne |> 
  anti_join(df_qes, by = c("ben_n4", "nind")) |> 
  nrow()

[1] 9936

Lastly, we can look at the number of people for whom we do not have a match with social security data (SNIIRAM):

df_personne |> 
  anti_join(data_sniiram, by = c("ben_n4", "nind")) |> 
  nrow()

[1] 11026

Then, the tables can be merged into a single one. To that end, we should first create an ID for each individual.

df_opinion_1 <- 
  df_opinion_1 |> 
  mutate(id_individual = str_c(ben_n4, nind, sep = "_"))

df_qst <- 
  df_qst |> 
  mutate(id_individual = str_c(ben_n4, nind, sep = "_"))

df_qes <- 
  df_qes |> 
  mutate(id_individual = str_c(ben_n4, nind, sep = "_"))

data_sniiram <- 
  data_sniiram |> 
  mutate(id_individual = str_c(ben_n4, nind, sep = "_"))

df_mutuelle_sans_redondance <- 
  df_mutuelle_sans_redondance |> 
  mutate(nind = as.numeric(nind)) |> 
  mutate(id_individual = str_c(ben_n4, nind, sep = "_"))

When we merge the different tables together, we would like to keep track on whether each individual was included in the joined table. To that end,

# Household characteristics
df <- 
  df_personne |> 
  left_join(df_menage) |> 
  mutate(id_individual = str_c(ben_n4, nind, sep = "_"))

Joining with `by = join_by(ben_n4)`

# Opinion first wave
df <- 
  df |> 
  left_join(df_opinion_1, by = c("ben_n4", "nind", "id_individual")) |> 
  mutate(is_in_df_opinion_1 = id_individual %in% df_opinion_1$id_individual)

# Working conditions
df <- 
  df |> 
  left_join(df_qst, by = c("ben_n4", "nind", "id_individual")) |> 
  mutate(is_in_df_qst = id_individual %in% df_qst$id_individual)

# Economic and social situation
df <- 
  df |> 
  left_join(df_qes, by = c("ben_n4", "nind", "id_individual")) |> 
  mutate(is_in_df_qes = id_individual %in% df_qes$id_individual)

# Social Insurance
df <- 
  df |> 
  left_join(data_sniiram, by = c("ben_n4", "nind", "id_individual")) |> 
  mutate(is_in_data_sniiram = id_individual %in% data_sniiram$id_individual)

# Mutual Insurance
df <- 
  df |> 
  left_join(
    df_mutuelle_sans_redondance,
    by = c("ben_n4", "nind", "id_individual")
  ) |> 
  mutate(
    is_in_df_mutuelle = id_individual %in% 
      df_mutuelle_sans_redondance$id_individual
  )

Let us check (again) how many individuals were found in the different datasets:

var_g <- c(
  "is_in_df_opinion_1", "is_in_df_qst", "is_in_df_qes", "is_in_data_sniiram", 
  "is_in_df_mutuelle"
)

df |> 
  select(!!!syms(var_g)) |> 
  summarise(across(.cols = everything(), .fns = ~sum(.))) |> 
  kableExtra::kable(
    caption = paste0(
      "Number of individuals present in the different parts ",
      "of the survey"
    )
  ) |>
  kableExtra::kable_classic(full_width = F, html_font = "Cambria")

Number of individuals present in the different parts of the survey
is_in_df_opinion_1	is_in_df_qst	is_in_df_qes	is_in_data_sniiram	is_in_df_mutuelle
8413	6199	13111	12021	19941

# Corresponding proportions
df |> 
  select(!!!syms(var_g)) |> 
  summarise(
    across(.cols = everything(), .fns = ~round(100*sum(.) / n(), digits = 2))
  ) |> 
  kableExtra::kable(
    caption = paste0(
      "Proportion of individuals present in the different parts ",
      "of the survey (in percent)"
    )
  )|>
  kableExtra::kable_classic(full_width = F, html_font = "Cambria")

Proportion of individuals present in the different parts of the survey (in percent)
is_in_df_opinion_1	is_in_df_qst	is_in_df_qes	is_in_data_sniiram	is_in_df_mutuelle
36.5	26.9	56.89	52.16	86.52

Here are the dimensions of this table:

dim(df)

[1] 23047   821

If we want to work with regional aspects at some point, we can create a table which provides the correspondence between the survey data and INSEE data.

The map of French regions can be loaded:

load("../data/map_regions_2014.rda")

regions_fr <- 
  map_regions_2014 |> 
  sf::st_set_geometry(NULL) |> 
  select(code_insee, nom)

The correspondence between the codes can be put on a table:

corresp_regions <- tribble(
  ~MENAGE_region, ~code_insee,
  1, 11,
  2, 21,
  3, 22,
  4, 23,
  5, 24,
  6, 25,
  7, 26,
  8, 31,
  9, 41,
  10, 42,
  11, 43,
  12, 52,
  13, 53,
  14, 54,
  15, 72,
  16, 73,
  17, 74,
  18, 82,
  19, 83,
  20, 91,
  21, 93,
  22, 94
) |> 
  left_join(
    regions_fr |> mutate(code_insee = as.numeric(code_insee)),
    by = "code_insee"
  )

We can eventually add the sunlight data:

df <- 
  df |> 
  left_join(
    corresp_regions |> rename(MENAGE_region_nom = nom),
    by = "MENAGE_region"
  ) |> 
  left_join(
    ensol_regions |> mutate(code_insee = as.numeric(code_insee)),
    by = "code_insee"
  )

Let us keep track on the number of observation each time we filter out observations.

nb_obs_df <- tibble(step = "raw", n = nrow(df))

We first remove individuals younger than 15 years old:

df <- df |> 
  filter(PERSONNE_age >= 15)

nb_obs_df <- 
  nb_obs_df |> 
  bind_rows(
    tibble(step = "age_15", n = nrow(df))
  )
nb_obs_df

# A tibble: 2 × 2
  step       n
  <chr>  <int>
1 raw    23047
2 age_15 18561

1.2 Correction of the MHI-5 score

One of the key variables of the dataset which is of interest for us is the MHI-5 score (see, e.g. Hoeymans et al. (2004), Rumpf et al. (2001), Thorsen et al. (2013)). For somewhat reason, the score was incorrectly calculated. Traditionally, the MHI-5 score is calculated from the answers to five questions, each question containing 6 ordered answer items (1 = all of the time, 2 =most of the time, 3 = a good bit of time, 4 = some of the time, 5 = a little of the time, 6 = none of the time). The MHI-5 score formula writes: \[\text{MHI-5} = 100 \times \frac{\sum_{q=1}^{5} s_q - 5}{25},\] where \(s_q\) is the score of the \(q^{\text{th}}\) question.

However, in the French version of the survey, the answer items are only 5 (because the translation of some items does not allow the same richness of nuance as English). To obtain a normalized score between 0 and 100, the formula shoud write: \[\text{MHI-5} = 100 \times \frac{\sum_{q=1}^{5} s_q - 5}{\boldsymbol{20}}.\]

We therefore need to correct the MHI-5 score from the data, simply by multiplying it by 1.25:

df <- 
  df |> 
  mutate(PERSONNE_score_t_corrected = PERSONNE_score_t * 1.25)

Section 1.4 provides some codes to compute the MHI-5 score manually from the answers given to the questions that allow it to be calculated.

1.3 Data formatting

Qualitative variables have numerical levels in the original dataset. Consulting the data dictionary makes it possible to assign more comprehensible labels.

All the following variables are qualitative. We will adopt the following rules for variables from sub-parts of the survey:

if the individual was included in the sub-part of the survey but did not respond, we will set the answer to “Not reported”
if the individual was not surveyed (was not part of the sub-part of the survey), we will set the answer to “Not surveyed”

df <- 
  df |> 
  mutate(
    across(
      c(
        "PERSONNE_pb_depress",
        "PERSONNE_etasante",
        "PERSONNE_statut"
      ),
      as.character
    )
  ) |> 
  mutate(
    PERSONNE_pb_depress = replace_na(PERSONNE_pb_depress, "Not reported"),
    PERSONNE_pb_depress = factor(
      PERSONNE_pb_depress, 
      levels = c(2, 1, "Not reported"),
      labels = c("No depression", "Depression", "Not reported")
    ),
    PERSONNE_etasante = replace_na(PERSONNE_etasante, "Not reported"),
    PERSONNE_etasante = factor(
      PERSONNE_etasante,
      levels = c(1:5, "Not reported"),
      labels = c(
        "Very Good",
        "Good",
        "Fairly Good",
        "Bad",
        "Very Bad",
        "Not reported"
      )
    )
  ) |> 
  mutate(
    PERSONNE_etat_sante = case_when(
      PERSONNE_etasante %in% c(
        "Very Good", "Good", "Fairly Good"
      ) ~ "Very Good, Good or Fairly Good",
      PERSONNE_etasante %in% c("Bad", "Very Bad") ~ "Very Bad or Bad",
      PERSONNE_etasante == "Not reported" ~ "Not reported",
      TRUE ~ "other"
    )
  ) |> 
  mutate(
    across(
      c(
        PERSONNE_pb_asthm,
        PERSONNE_pb_bronchit,
        PERSONNE_pb_infarctus,
        PERSONNE_pb_coronair,
        PERSONNE_pb_hypertens,
        PERSONNE_pb_avc,
        PERSONNE_pb_arthros,
        PERSONNE_pb_lombalgi,
        PERSONNE_pb_cervical,
        PERSONNE_pb_diabet,
        PERSONNE_pb_allergi,
        PERSONNE_pb_cirrhos,
        PERSONNE_pb_urinair,
        PERSONNE_pb_non
      ),
      ~factor(
        replace_na(as.character(.x), "Not reported"),
        levels = c(2, 1, "Not reported"),
        labels = c("No", "Yes", "Not reported")
      )
    )
  ) |> 
  mutate(
    PERSONNE_sexe = factor(
      PERSONNE_sexe, levels = c(2, 1), labels = c("Female", "Male")
    ),
    PERSONNE_couple = factor(
      PERSONNE_couple, levels = c(1,2), labels = c("Yes", "No")
    ),
    PERSONNE_etatleg = factor(
      PERSONNE_etatleg, 
      levels = c(1,2,3,4,5,7,8),
      labels = c(
        "Married, Civil Union", "Divorced, separated", "Widowed",
        "Lives in a marriage or concubinage", "Single",
        "Does not know", "Refuses to answer"
      )
    ),
    PERSONNE_statut = ifelse(
      PERSONNE_statut %in% c(1,2,3,4), 
      yes = "Public employee", 
      no = PERSONNE_statut
    ),
    PERSONNE_statut = ifelse(
      PERSONNE_statut %in% c(6, 7, 8, 9, 10), 
      yes = "Other", 
      no = PERSONNE_statut
    ),
    PERSONNE_statut = replace_na(PERSONNE_statut, "Not reported"),
    PERSONNE_statut = factor(
      PERSONNE_statut, levels = c("Public employee", 5, "Other", "Not reported"),
      labels = c("Public employee",
                 "Private employee",
                 "Other",
                 "Not reported")
    ),
    PERSONNE_ss = factor(
      PERSONNE_ss, levels = c(1, 2),
      labels = c(
        "Yes (own)",
        "Yes (third party)")
    ),
    # "Yes, because he/she works, or receives unemployment benefits, 
    # is a student, retired, widower of a pensioner, disabled, beneficiary of 
    # the basic CMU...",
    # "Yes, as the beneficiary of a person living in your household")),
    PERSONNE_regime = ifelse(
      PERSONNE_regime %in% c(2,3,4,5,6,7), 
      yes = "Public service", 
      no = PERSONNE_regime
    ),
    PERSONNE_regime = ifelse(
      PERSONNE_regime %in% c(12, 13, 14), 
      yes = "Other (Student, abroad, other)", 
      no = PERSONNE_regime
    ),
    PERSONNE_regime = factor(
      PERSONNE_regime, 
      levels = c(
        1, "Public service", 8, 9, 10, 11, "Other (Student, abroad, other)", 15
      ),
      labels = c(
        "The general scheme (Cnamts)",
        "Public service",
        "The local Alsace-Moselle scheme",
        "The basic Universal Health Coverage",
        "The agricultural scheme",
        "The self-employed scheme",
        "Other (Student, abroad, other)",
        "Does not know"
      )
    ),
    PERSONNE_rap_pcs8 = factor(
      PERSONNE_rap_pcs8,
      levels = 1:9,
      labels = c(
        "Farmer",
        "Craftsman, trader",
        "Executive and intellectual profession",
        "Intermediate occupation",
        "Administrative employee",
        "Commercial employee",
        "Skilled worker",
        "Unskilled worker",
        "Inactive having never worked"
      )
    ),
    PERSONNE_ald = factor(
      PERSONNE_ald, 
      levels = 1:3,
      labels = c("Yes", "No", "Does not know")
    )
  ) |> 
  mutate(
    MENAGE_rap_typmen = factor(
      MENAGE_rap_typmen,
      levels = c(1, 2, 5, 6, 7),
      labels = c(
        "Single person",
        "Single-parent family",
        "Couple without children",
        "Couple with children",
        "Other"
      )
    ),
    MENAGE_region = as.character(MENAGE_region),
    MENAGE_region = factor(
      MENAGE_region,
      levels = c(1:22),
      labels = c(
        "Région parisienne",
        "Champagne-Ardenne",
        "Picardie",
        "Haute-Normandie",
        "Centre",
        "Basse-Normandie",
        "Bourgogne",
        "Nord-Pas-de-Calais",
        "Lorraine",
        "Alsace",
        "Franche-Comté",
        "Pays de la Loire",
        "Bretagne",
        "Poitou-Charentes",
        "Aquitaine",
        "Midi-Pyrénées",
        "Limousin",
        "Rhône-Alpes",
        "Auvergne",
        "Languedoc-Roussillon",
        "Provence-Alpes-Côte d'Azur",
        "Corse"
      )
    ),
    MENAGE_revenu = factor(
      MENAGE_revenu, 
      levels = c(0:12, 99),
      labels = c(
        "Does not know",
        "< 700",
        "[700, 1000)",
        "[1000, 1200)",
        "[1200, 1500)",
        "[1500, 1800)",
        "[1800, 2200)",
        "[2200, 2500)",
        "[2500, 3000)",
        "[3000, 3500)",
        "[3500, 4500)",
        ">= 4500",
        "Refuse to declare",
        "Abandonment before 2nd contact "
      )
    ),
  ) |> 
  mutate(
    MUTUELLE_assu = ifelse(
      MUTUELLE_assu %in% c(2,3), 
      yes = "Beneficiary", 
      no = MUTUELLE_assu
    ),
    MUTUELLE_assu = ifelse(
      is_in_df_mutuelle & is.na(MUTUELLE_assu), 
      yes = "Not reported", 
      no = MUTUELLE_assu
    ),
    MUTUELLE_assu = ifelse(
      !is_in_df_mutuelle, 
      yes = "Not surveyed", 
      no = MUTUELLE_assu
    ),
    MUTUELLE_assu = factor(
      MUTUELLE_assu,
      levels = c(0, 1, "Beneficiary", "Not reported", "Not surveyed"),
      labels = c(
        "Does not know",
        "Self insured",
        "Beneficiary",
        "Not reported",
        "Not surveyed"
      )
    ),
    #
    MUTUELLE_typcc = ifelse(
      is_in_df_mutuelle & is.na(MUTUELLE_typcc), 
      yes = "Not reported", 
      no = MUTUELLE_typcc
    ),
    MUTUELLE_typcc = ifelse(
      !is_in_df_mutuelle, 
      yes = "Not surveyed", 
      no = MUTUELLE_typcc),
    MUTUELLE_typcc = factor(
      MUTUELLE_typcc, 
      levels = c(0, 1, 2, 3, 4, "Not reported", "Not surveyed"),
      labels = c(
        "Indefinite",
        "Mutual insurance",
        "Pension fund",
        "Private insurance",
        "Insurance broker",
        "Not reported",
        "Not surveyed"
      )
    ),
  ) |> 
  # First wave answers to research questions
  mutate(
    OPINION1_renonc_cons = ifelse(
      is_in_df_opinion_1 & is.na(OPINION1_renonc_cons),
      yes = "Not reported", 
      no = OPINION1_renonc_cons
    ),
    OPINION1_renonc_cons = ifelse(
      !is_in_df_opinion_1,
      yes = "Not surveyed", 
      no = OPINION1_renonc_cons
    ),
    OPINION1_renonc_cons = factor(
      OPINION1_renonc_cons,
      levels = c(1,2, "Not reported", "Not surveyed"),
      labels = c("Yes", "No", "Not reported", "Not surveyed")
    ),
    #
    OPINION1_renonc_dent = ifelse(
      is_in_df_opinion_1 & is.na(OPINION1_renonc_dent),
      yes = "Not reported", 
      no = OPINION1_renonc_dent
    ),
    OPINION1_renonc_dent = ifelse(
      !is_in_df_opinion_1,
      yes = "Not surveyed", 
      no = OPINION1_renonc_dent
    ),
    OPINION1_renonc_dent = factor(
      OPINION1_renonc_dent, 
      levels = c(1,2, "Not reported", "Not surveyed"),
      labels = c("Yes", "No", "Not reported", "Not surveyed")
    ),
    #
    OPINION1_renonc_fin = ifelse(
      is_in_df_opinion_1 & is.na(OPINION1_renonc_fin),
      yes = "Not reported", 
      no = OPINION1_renonc_fin
    ),
    OPINION1_renonc_fin = ifelse(
      !is_in_df_opinion_1,
      yes = "Not surveyed", 
      no = OPINION1_renonc_fin
    ),
    OPINION1_renonc_fin = factor(
      OPINION1_renonc_fin, 
      levels = c(1,2, "Not reported", "Not surveyed"),
      labels = c("Yes", "No", "Not reported", "Not surveyed")
    ),
    #
    OPINION1_renonc_loin = ifelse(
      is_in_df_opinion_1 & is.na(OPINION1_renonc_loin),
      yes = "Not reported",
      no = OPINION1_renonc_loin
    ),
    OPINION1_renonc_loin = ifelse(
      !is_in_df_opinion_1,
      yes = "Not surveyed", 
      no = OPINION1_renonc_loin
    ),
    OPINION1_renonc_loin = factor(
      OPINION1_renonc_loin, 
      levels = c(1,2, "Not reported", "Not surveyed"),
      labels = c("Yes", "No", "Not reported", "Not surveyed")
    ),
    #
    OPINION1_renonc_long = ifelse(
      is_in_df_opinion_1 & is.na(OPINION1_renonc_long),
      yes = "Not reported", 
      no = OPINION1_renonc_long
    ),
    OPINION1_renonc_long = ifelse(!is_in_df_opinion_1,
                                  yes = "Not surveyed", 
                                  no = OPINION1_renonc_long
    ),
    OPINION1_renonc_long = factor(
      OPINION1_renonc_long,
      levels = c(1,2, "Not reported", "Not surveyed"),
      labels = c("Yes", "No", "Not reported", "Not surveyed")
    ),
  ) |> 
  # Working conditions
  mutate(
    QST_ct_depech = ifelse(
      is_in_df_qst & is.na(QST_ct_depech),
      yes = "Not reported", 
      no = QST_ct_depech
    ),
    QST_ct_depech = ifelse(
      !is_in_df_qst,
      yes = "Not surveyed", 
      no = QST_ct_depech
    ),
    QST_ct_depech = factor(
      QST_ct_depech, 
      levels = c(1:4, "Not reported", "Not surveyed"),
      labels = c(
        "Always", "Often", "Sometimes", "Never", "Not reported", "Not surveyed"
      )
    ),
    #
    QST_ct_liberte = ifelse(
      is_in_df_qst & is.na(QST_ct_liberte),
      yes = "Not reported", 
      no = QST_ct_liberte
    ),
    QST_ct_liberte = ifelse(
      !is_in_df_qst,
      yes = "Not surveyed", 
      no = QST_ct_liberte
    ),
    QST_ct_liberte = factor(
      QST_ct_liberte, 
      levels = c(1:4, "Not reported", "Not surveyed"),
      labels = c(
        "Always", "Often", "Sometimes", "Never", "Not reported", "Not surveyed"
      )
    ),
    #
    QST_ct_apprend = ifelse(
      is_in_df_qst & is.na(QST_ct_apprend),
      yes = "Not reported", 
      no = QST_ct_apprend
    ),
    QST_ct_apprend = ifelse(
      !is_in_df_qst,
      yes = "Not surveyed",
      no = QST_ct_apprend
    ),
    QST_ct_apprend = factor(
      QST_ct_apprend, 
      levels = c(1:4, "Not reported", "Not surveyed"),
      labels = c(
        "Always", "Often", "Sometimes", "Never", "Not reported", "Not surveyed"
      )
    ),
    #
    QST_ct_aidecol = ifelse(
      is_in_df_qst & is.na(QST_ct_aidecol),
      yes = "Not reported", 
      no = QST_ct_aidecol
    ),
    QST_ct_aidecol = ifelse(!is_in_df_qst,
                            yes = "Not surveyed", 
                            no = QST_ct_aidecol
    ),
    QST_ct_aidecol = factor(
      QST_ct_aidecol, 
      levels = c(1:5, "Not reported", "Not surveyed"),
      labels = c(
        "Always", "Often", "Sometimes", "Never", "Not concered", 
        "Not reported", "Not surveyed")
    ),
    #
    QST_ct_travnuit = ifelse(
      is_in_df_qst & is.na(QST_ct_travnuit),
      yes = "Not reported", 
      no = QST_ct_travnuit
    ),
    QST_ct_travnuit = ifelse(
      !is_in_df_qst,
      yes = "Not surveyed", 
      no = QST_ct_travnuit
    ),
    QST_ct_travnuit = factor(
      QST_ct_travnuit, 
      levels = c(1:4, "Not reported", "Not surveyed"),
      labels = c(
        "Always", "Often", "Sometimes", "Never", 
        "Not reported", "Not surveyed"
      )
    ),
    #
    QST_ct_repet = ifelse(
      is_in_df_qst & is.na(QST_ct_repet),
      yes = "Not reported", 
      no = QST_ct_repet
    ),
    QST_ct_repet = ifelse(
      !is_in_df_qst,
      yes = "Not surveyed", 
      no = QST_ct_repet
    ),
    QST_ct_repet = factor(
      QST_ct_repet, 
      levels = c(1:4, "Not reported", "Not surveyed"),
      labels = c(
        "Always", "Often", "Sometimes", "Never", 
        "Not reported", "Not surveyed"
      )
    ),
    #
    QST_ct_lourd = ifelse(
      is_in_df_qst & is.na(QST_ct_lourd),
      yes = "Not reported", 
      no = QST_ct_lourd
    ),
    QST_ct_lourd = ifelse(
      !is_in_df_qst,
      yes = "Not surveyed", 
      no = QST_ct_lourd
    ),
    QST_ct_lourd = factor(
      QST_ct_lourd, 
      levels = c(1:4, "Not reported", "Not surveyed"),
      labels = c(
        "Always", "Often", "Sometimes", "Never", 
        "Not reported", "Not surveyed"
      )
    ),
    #
    QST_ct_posture = ifelse(
      is_in_df_qst & is.na(QST_ct_posture),
      yes = "Not reported", 
      no = QST_ct_posture
    ),
    QST_ct_posture = ifelse(
      !is_in_df_qst,
      yes = "Not surveyed", 
      no = QST_ct_posture
    ),
    QST_ct_posture = factor(
      QST_ct_posture, 
      levels = c(1:4, "Not reported", "Not surveyed"),
      labels = c(
        "Always", "Often", "Sometimes", "Never", 
        "Not reported", "Not surveyed"
      )
    ),
    #
    QST_ct_produit = ifelse(
      is_in_df_qst & is.na(QST_ct_produit),
      yes = "Not reported", 
      no = QST_ct_produit
    ),
    QST_ct_produit = ifelse(
      !is_in_df_qst,
      yes = "Not surveyed", 
      no = QST_ct_produit
    ),
    QST_ct_produit = factor(
      QST_ct_produit, 
      levels = c(1:4, "Not reported", "Not surveyed"),
      labels = c(
        "Always", "Often", "Sometimes", "Never", 
        "Not reported", "Not surveyed"
      )
    )
  ) |> 
  # Economic and social situation
  mutate(
    QES_association = ifelse(
      is_in_df_qes & is.na(QES_association),
      yes = "Not reported", 
      no = QES_association
    ),
    QES_association = ifelse(
      !is_in_df_qes,
      yes = "Not surveyed", 
      no = QES_association
    ),
    QES_association = factor(
      QES_association, 
      levels = c(1:2, "Not reported", "Not surveyed"),
      labels = c("Yes", "No", "Not reported", "Not surveyed")
    ),
    #
    QES_tpsami = ifelse(
      is_in_df_qes & is.na(QES_tpsami),
      yes = "Not reported", 
      no = QES_tpsami
    ),
    QES_tpsami = ifelse(
      !is_in_df_qes,
      yes = "Not surveyed", 
      no = QES_tpsami
    ),
    QES_tpsami = factor(
      QES_tpsami, 
      levels = c(1:5, "Not reported", "Not surveyed"),
      labels = c(
        "Every day or almost every day",
        "At least once a week",
        "At least once a month",
        "Less than once a month",
        "Never",
        "Not reported", "Not surveyed"
      )
    ),
    #
    QES_tpsasso = ifelse(
      is_in_df_qes & is.na(QES_tpsasso),
      yes = "Not reported", 
      no = QES_tpsasso
    ),
    QES_tpsasso = ifelse(
      !is_in_df_qes,
      yes = "Not surveyed", 
      no = QES_tpsasso
    ),
    QES_tpsasso = factor(
      QES_tpsasso, 
      levels = c(1:5, "Not reported", "Not surveyed"),
      labels = c(
        "Every day or almost every day",
        "At least once a week",
        "At least once a month",
        "Less than once a month",
        "Never",
        "Not reported", "Not surveyed"
      )
    ),
    #
    QES_tpscolleg = ifelse(
      is_in_df_qes & is.na(QES_tpscolleg),
      yes = "Not reported", 
      no = QES_tpscolleg
    ),
    QES_tpscolleg = ifelse(
      !is_in_df_qes,
      yes = "Not surveyed", 
      no = QES_tpscolleg
    ),
    QES_tpscolleg = factor(
      QES_tpscolleg, 
      levels = c(1:5, "Not reported", "Not surveyed"),
      labels = c(
        "Every day or almost every day",
        "At least once a week",
        "At least once a month",
        "Less than once a month",
        "Never",
        "Not reported", "Not surveyed"
      )
    ),
    #
    QES_tpsfamil = ifelse(
      is_in_df_qes & is.na(QES_tpsfamil),
      yes = "Not reported", 
      no = QES_tpsfamil
    ),
    QES_tpsfamil = ifelse(
      !is_in_df_qes,
      yes = "Not surveyed", 
      no = QES_tpsfamil
    ),
    QES_tpsfamil = factor(
      QES_tpsfamil,
      levels = c(1:5, "Not reported", "Not surveyed"),
      labels = c(
        "Every day or almost every day",
        "At least once a week",
        "At least once a month",
        "Less than once a month",
        "Never",
        "Not reported", "Not surveyed"
      )
    ),
    #
    QES_mere_etude = ifelse(
      is_in_df_qes & is.na(QES_mere_etude),
      yes = "Not reported", 
      no = QES_mere_etude
    ),
    QES_mere_etude = ifelse(
      !is_in_df_qes,
      yes = "Not surveyed", 
      no = QES_mere_etude
    ),
    QES_mere_etude = factor(
      QES_mere_etude,
      levels = c(1:7, "Not reported", "Not surveyed"),
      labels = c(
        "Never Been to School",
        "Nursery School, Primary school, Certificate of studies",
        "1st cycle (~Middle School)",
        "2nd cycle (~High School",
        "Higher Education",
        "Other",
        "Does not know",
        "Not reported", "Not surveyed"
      )
    ),
    QES_pere_etude = ifelse(
      is_in_df_qes & is.na(QES_pere_etude),
      yes = "Not reported", 
      no = QES_pere_etude
    ),
    QES_pere_etude = ifelse(
      !is_in_df_qes,
      yes = "Not surveyed", 
      no = QES_pere_etude
    ),
    QES_pere_etude = factor(
      QES_pere_etude,
      levels = c(1:7, "Not reported", "Not surveyed"),
      labels = c(
        "Never Been to School",
        "Nursery School, Primary school, Certificate of studies",
        "1st cycle (~Middle School)",
        "2nd cycle (~High School",
        "Higher Education",
        "Other",
        "Does not know",
        "Not reported", "Not surveyed"
      )
    )
  ) |> 
  # SNIIRAM
  mutate(
    SOINS_ald_am = ifelse(
      is_in_data_sniiram & is.na(SOINS_ald_am), 
      yes = "Not reported", 
      no = SOINS_ald_am)
    ,
    SOINS_ald_am = ifelse(
      !is_in_data_sniiram, 
      yes = "Not surveyed", 
      no = SOINS_ald_am
    ),
    SOINS_ald_am = factor(
      SOINS_ald_am, 
      levels = c(1,0, "Not reported", "Not surveyed"),
      labels = c("Yes", "No", "Not reported", "Not surveyed")
    ),
  )

The same can be done for ordered qualitative variables.

df <- 
  df |> 
  mutate(
    MENAGE_tu = ifelse(
      MENAGE_tu %in% c(0, 1, 2), yes = "Small Municipality", no = MENAGE_tu
    ),
    MENAGE_tu = ifelse(
      MENAGE_tu %in% c(3, 4), yes = "Medium Municipality", no = MENAGE_tu
    ),
    MENAGE_tu = ifelse(
      MENAGE_tu %in% c(5, 6, 7), yes = "Large Municipality", no = MENAGE_tu
    ),
    MENAGE_tu = ifelse(
      MENAGE_tu %in% c(8), yes = "Paris metropolitan area", no = MENAGE_tu
    ),
    MENAGE_tu = ifelse(
      is_in_df_mutuelle & is.na(MENAGE_tu), yes = "Not reported", no = MENAGE_tu
    ),
    MENAGE_tu = ifelse(
      !is_in_df_mutuelle, yes = "Not surveyed", no = MENAGE_tu
    ),
    MENAGE_tu = factor(
      MENAGE_tu,
      levels = c(
        "Small Municipality",
        "Medium Municipality",
        "Large Municipality",
        "Paris metropolitan area",
        "Not reported",
        "Not surveyed"), 
      ordered = TRUE
    ),
    #
    MENAGE_rap_zau = ifelse(
      MENAGE_rap_zau %in% c(1,2),
      yes = "Major urban cluster (and its crown)",
      no = MENAGE_rap_zau
    ),
    MENAGE_rap_zau = ifelse(
      MENAGE_rap_zau %in% c(3,4),
      yes = "Medium/small urban cluster (and its crown)",
      no = MENAGE_rap_zau
    ),
    MENAGE_rap_zau = ifelse(
      is_in_df_mutuelle & is.na(MENAGE_rap_zau), 
      yes = "Not reported", 
      no = MENAGE_rap_zau
    ),
    MENAGE_rap_zau = ifelse(
      !is_in_df_mutuelle, yes = "Not surveyed", no = MENAGE_rap_zau),
    MENAGE_rap_zau = factor(
      MENAGE_rap_zau,
      levels = c(
        "Major urban cluster (and its crown)",
        "Medium/small urban cluster (and its crown)",
        5,
        "Not reported",
        "Not surveyed"
      ),
      labels = c(
        "Major urban cluster (and its crown)",
        "Medium/small urban cluster (and its crown)",
        "Spaces outside the area of influence of cities",
        "Not reported",
        "Not surveyed"
      ), ordered = TRUE
    )
  )

Let us make the an assumption regarding the variable PERSONNE_couple. This variable has missing values.

for all individuals younger than 15 (this question was not asked to people \(\leq\) 15). We assume here that people 15 for which the value of the variable PERSONNE_couple is missing are actually not living as a couple.
for individuals older than 15 and living alone, we consider, if they have missing value for PERSONNE_couple that tey are not living as a couple.
for other individuals: we assign then the value Not reported.

The number of individuals in the first case (15 years old and missing value for PERSONNE_couple:

df |> filter(PERSONNE_age == 15) |>
  group_by(PERSONNE_couple) |> 
  count() |> 
  filter(is.na(PERSONNE_couple))

# A tibble: 1 × 2
# Groups:   PERSONNE_couple [1]
  PERSONNE_couple     n
  <fct>           <int>
1 <NA>              292

The number of individuals in the second case (>15 years old, living alone, and missing value for PERSONNE_couple:

df |> 
  filter(PERSONNE_age > 15 & MENAGE_nbpers == 1) |>
  group_by(PERSONNE_couple) |> 
  count() |> 
  filter(is.na(PERSONNE_couple))

# A tibble: 1 × 2
# Groups:   PERSONNE_couple [1]
  PERSONNE_couple     n
  <fct>           <int>
1 <NA>             1502

The number of individuals in the third case (remaining people with missing value for PERSONNE_couple:

df |> 
  filter((PERSONNE_age > 15 & MENAGE_nbpers != 1)) |>
  group_by(PERSONNE_couple) |> 
  count() |> 
  filter(is.na(PERSONNE_couple))

# A tibble: 1 × 2
# Groups:   PERSONNE_couple [1]
  PERSONNE_couple     n
  <fct>           <int>
1 <NA>               53

Let us make the changes:

df <- 
  df |> 
  # Recode NA as "Not reported"
  mutate(
    PERSONNE_couple = ifelse(
      is.na(PERSONNE_couple),
      yes = "Not reported",
      no = as.character(PERSONNE_couple)
    )
  ) |> 
  # If the "couple" variable is "NA" and either lives alone or is 15yo: not in a couple
  mutate(
    PERSONNE_couple = ifelse(
      test = PERSONNE_couple == "Not reported" & 
        (MENAGE_nbpers == 1 | PERSONNE_age == 15),
      yes = "No",
      no = PERSONNE_couple)
  )

1.4 Computing manually the MHI Scores

As we noticed the MHI-Score is miscalculated in the ESPS data, we decided to compute it from scratch.

Some individuals did not answer all the questions required to construct the MHI-5 and MHI-3 scores (for more details on MHI-3, please refer to Yamazaki, Fukuhara, and Green (2005)). As described in the literature, to construct the MHI-5 score, missing values are imputed provided that individuals answered at least 3 of the 5 questions. The imputed value is the average of the responses given.

df <- 
  df |> 
  mutate(PERSONNE_score_t_corrected = PERSONNE_score_t * 1.25) |> 
  # Some variables (PERSONNE_calme and PERSONNE_heureux) need to be adjusted
  mutate(
    mhi_score_nerveux = PERSONNE_nerveu,
    mhi_score_calme = 6 - PERSONNE_calme,
    mhi_score_triste = PERSONNE_triste,
    mhi_score_heureux = 6 - PERSONNE_heureux,
    mhi_score_decourage = PERSONNE_decourag
  ) |> 
  # Counting the number of missing values for each respondant
  mutate(
    nb_na_mhi = (!is.na(mhi_score_nerveux)) + 
      (!is.na(mhi_score_calme)) + 
      (!is.na(mhi_score_triste)) + 
      (!is.na(mhi_score_heureux)) + 
      (!is.na(mhi_score_decourage))
  ) |> 
  mutate(
    nb_na_mhi_3 = (!is.na(mhi_score_triste)) + 
      (!is.na(mhi_score_heureux)) + 
      (!is.na(mhi_score_decourage))
  ) %>%
  # Computing the average score for the items to which individuals responded
  mutate(
    mean_scores = rowMeans(
      select(., mhi_score_nerveux:mhi_score_decourage), 
      na.rm = T
    )
  ) |> 
  # Assigning the average of the items to the missing elements
  mutate(
    mhi_score_nerveux = ifelse(
      is.na(mhi_score_nerveux), 
      yes = mean_scores, 
      no = mhi_score_nerveux
    ),
    mhi_score_calme = ifelse(
      is.na(mhi_score_calme), 
      yes = mean_scores, 
      no = mhi_score_calme
    ),
    mhi_score_triste = ifelse(
      is.na(mhi_score_triste), 
      yes = mean_scores, 
      no = mhi_score_triste
    ),
    mhi_score_heureux = ifelse(
      is.na(mhi_score_heureux), 
      yes = mean_scores, 
      no = mhi_score_heureux
    ),
    mhi_score_decourage = ifelse(
      is.na(mhi_score_decourage), 
      yes = mean_scores, 
      no = mhi_score_decourage
    )
  ) |> 
  mutate(
    mhi_5_raw = mhi_score_nerveux + mhi_score_calme + mhi_score_triste + 
      mhi_score_heureux + mhi_score_decourage
  ) |> 
  mutate(
    score_t_corrected = (mhi_5_raw - min(mhi_5_raw, na.rm=T)) * (100) /
      (max(mhi_5_raw, na.rm = TRUE) - min(mhi_5_raw, na.rm = TRUE))
  ) |> 
  # If the individual has not answered more than 3 items, the MHI-5 score is
  # not calculable
  # In such a case, the MHI-5 score is set to 0
  mutate(
    score_t_corrected = ifelse(nb_na_mhi < 3, yes = NA, no = score_t_corrected)
  ) |> 
  # MHI-3 score
  mutate(
    mhi_3_raw = mhi_score_triste + mhi_score_heureux + mhi_score_decourage,
    score_mhi_3 = 100*(mhi_3_raw - min(mhi_3_raw, na.rm=TRUE)) / 
      (max(mhi_3_raw, na.rm = TRUE) - min(mhi_3_raw, na.rm = TRUE))
  ) |> 
  mutate(score_mhi_3 = ifelse(nb_na_mhi < 3, yes = NA, no = score_mhi_3)) |> 
  select(-mhi_5_raw, -mhi_3_raw)

Let us create two variables which provide the first quantile of the distribution of each score.

quant <- .25
q1_mhi_5 <- quantile(df$PERSONNE_score_t_corrected, probs = quant, na.rm=TRUE)
q1_mhi_5

25% 
 60

q1_mhi_3 <- quantile(df$score_mhi_3, probs = quant, na.rm=TRUE)
q1_mhi_3

     25% 
58.33333

We can create a variable named PERSONNE_pb_depress_2 which will recode people who did not respond to the question regarding depression as No depression (this is not used later in the paper).

df <- 
  df |> 
  mutate(
    PERSONNE_pb_depress_2 = as.character(PERSONNE_pb_depress)) |> 
  mutate(
    PERSONNE_pb_depress_2 = ifelse(
      PERSONNE_pb_depress_2 == "Not reported",
      yes = "No depression", 
      no = PERSONNE_pb_depress_2)
  )

Let us create a variable which states if each individual’s MHI score is below the first quartile of the distribution.

df <- 
  df |> 
  mutate(
    inf_q1_mhi_5 = PERSONNE_score_t_corrected <= q1_mhi_5,
    inf_q1_mhi_5 = factor(
      inf_q1_mhi_5,
      levels = c(TRUE, FALSE), labels = c("<=Q1", ">Q1")
    ),
    inf_q1_mhi_3 = score_mhi_3 <= q1_mhi_3,
    inf_q1_mhi_3 = factor(
      inf_q1_mhi_3,
      levels = c(TRUE, FALSE), labels = c("<=Q1", ">Q1"))
  )

Lastly, let us add an ID to each individual (the row number in the table).

df <- df |> mutate(id = row_number())

The results can be saved:

save(df, file = "../data/df_merged.rda")