As in Chapter 5, we merge the different datasets to produce the data table used in the local projections, this time using annual data instead of monthly or quarterly data.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cli)
12.1 Load Intermediate Files
The (annual) weather data (Chapter 1) can be loaded:
Weather <- weather_annual_regions_df |>
  # Add ENSO data
  left_join(
    ONI_temp |>
      mutate(year = as.numeric(Year)) |>
      group_by(year) |>
      summarise(ONI = mean(ONI)),
    by = c("year" = "year")
  ) |>
  group_by(IDDPTO) |>
  mutate(
    temp_min_dev_ENSO = temp_min - mean(temp_min),
    temp_max_dev_ENSO = temp_max - mean(temp_max),
    temp_mean_dev_ENSO = temp_mean - mean(temp_mean),
    precip_sum_dev_ENSO = precip_sum - mean(precip_sum)
  ) |>
  ungroup() |>
  labelled::set_variable_labels(
    temp_min_dev_ENSO = "Deviation of Min. Temperature from ENSO Normals",
    temp_max_dev_ENSO = "Deviation of Max. Temperature from ENSO Normals",
    temp_mean_dev_ENSO = "Deviation of Mean Temperature from ENSO Normals",
    precip_sum_dev_ENSO = "Deviation of Total Rainfall from ENSO Normals"
  )
We compute the percentage deviation of production from the regional average over the period of interest, although those values are not used in the subsequent estimations. We do, however, use the demeaned values computed in this section.
This section outlines a two-step procedure for expressing agricultural production data at the annual regional level for a specific crop as a percentage deviation from the regional crop-specific average over the period of interest. The procedure involves handling missing values.
Step 1: Handling Missing Values
In the first step, we address missing values by linear interpolation. This approach helps us estimate the missing values by considering the neighboring data points.
Step 1.1: Imputing missing values with linear interpolation.
Missing values are replaced by linear interpolation. However, runs of more than three consecutive missing values (the code uses `maxgap = 3`) are not interpolated. Instead, the series for the specific crop in the given region is split at the locations of the remaining missing values. The split with the highest number of consecutive non-missing values is retained, while the other splits are discarded.
Step 1.2: Dropping Series with Remaining Missing Values
After imputing missing values by linear interpolation, we check whether any missing values remain in the dataset. If a series still contains missing values, we exclude it from further analysis. This ensures that the subsequent normalization and detrending are performed only on complete data.
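The gap-limited interpolation of Step 1.1 can be sketched in base R. This is only an illustration under stated assumptions: `interpolate_short_gaps()` is a hypothetical helper (it is not part of the chapter's code, which relies on `imputeTS::na_interpolation()`), and in this sketch leading or trailing missing values are left as `NA`, which may differ from how the package treats them.

```r
# Hypothetical helper (not part of the original code): linearly interpolate
# missing values, but leave runs of NAs longer than `maxgap` untouched.
interpolate_short_gaps <- function(y, maxgap = 3) {
  idx <- seq_along(y)
  observed <- !is.na(y)
  # Linear interpolation between observed points; with rule = 1, positions
  # outside the observed range stay NA (an assumption of this sketch)
  filled <- approx(x = idx[observed], y = y[observed], xout = idx, rule = 1)$y
  # Locate runs of NAs and restore NA where a run is longer than `maxgap`
  runs <- rle(is.na(y))
  too_long <- rep(runs$values & runs$lengths > maxgap, times = runs$lengths)
  filled[too_long] <- NA
  filled
}

# One isolated NA (interpolated) and a run of four NAs (left missing)
y_example <- c(10, NA, 14, NA, NA, NA, NA, 24, 26)
y_filled <- interpolate_short_gaps(y_example)
y_filled
```

The single gap is filled with the midpoint of its neighbors, while the run of four missing values exceeds `maxgap` and is left for the splitting step described above.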
Step 2: Normalized Agricultural Production
For each region \(i\) and crop \(c\), we calculate the average annual production over the entire period (2001 to 2015): \[\overline{y}_{c,i} = \frac{1}{n_{T_c}} \sum_{t=1}^{T_c} y_{c,i,t}^{\text{raw}}
\] Then, we express agricultural production relative to the average: \[y_{c,i,t} = \begin{cases}
\frac{y_{c,i,t}^{\text{raw}}}{\overline{y}_{c,i}}, & \overline{y}_{c,i} > 0\\
0, & \overline{y}_{c,i} = 0
\end{cases}\] Values of \(y_{c,i,t}>1\) mean that the production of crop \(c\) in region \(i\) in year \(t\) is higher than the average annual production for that crop and region over the period 2001 to 2015. For example, a value of 1.5 means that the production is 50% higher than average.
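A small numerical illustration of this normalization may help. The helper `normalize_by_mean()` and the production figures are hypothetical and chosen only so that the average is a round number.

```r
# Hypothetical helper: production relative to its own average,
# set to 0 when the average is 0 (mirroring the case distinction above)
normalize_by_mean <- function(y_raw) {
  y_bar <- mean(y_raw)
  if (y_bar == 0) {
    rep(0, length(y_raw))
  } else {
    y_raw / y_bar
  }
}

# Made-up production values (in tons) for one crop in one region
y_raw <- c(80, 100, 120, 150, 50)
y_norm <- normalize_by_mean(y_raw)
y_norm
```

Here the average is 100, so the fourth observation maps to 1.5, that is, 50% above average.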
Step 2 (alternative version): Deviation from regional average, in percent (this step is not used in the new version of the analysis, as it led to discarding too many observations)
Once we have addressed the missing values, we proceed to the second step, which consists of computing the deviation of production from the regional average. First, we compute the average production of each crop \(c\) in each region \(i\): \[\overline{y}_{c,i} = \frac{1}{n_{T_c}} \sum_{t=1}^{T_c} y_{c,i,t}^{\text{raw}}\] Then, we compute the percentage deviation from this average at each date (i.e., year) \(t\): \[y_{c,i,t} = \frac{y_{c,i,t}^{\text{raw}} - \overline{y}_{c,i}}{\overline{y}_{c,i}}\]
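The percentage deviation and the ratio to the average are closely linked. A one-line rearrangement, using only the symbols defined above and assuming \(\overline{y}_{c,i} \neq 0\), gives: \[\frac{y_{c,i,t}^{\text{raw}} - \overline{y}_{c,i}}{\overline{y}_{c,i}} = \frac{y_{c,i,t}^{\text{raw}}}{\overline{y}_{c,i}} - 1\] In terms of the variables returned by the function defined later in this section, y_dev_pct therefore equals y_new_normalized minus 1 whenever the average is nonzero.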
Let us implement this process in R. First, we need to define two functions to handle the missing values:
The get_index_longest_non_na() function retrieves the indices of the longest consecutive sequence without missing values from a given input vector. It helps us identify the positions of elements in that sequence.
The keep_values_longest_non_na() function uses the obtained indices to create a logical vector. Each element of this vector indicates whether the corresponding element in the input vector belongs to the longest consecutive sequence of non-missing values. This allows us to filter the data and retain only the values from the longest consecutive sequence without missing values.
Together, these two functions help us handle missing data in the production series and ensure that we work with the most complete sequence for each region and crop.
The first function:
#' Returns the index of the longest sequence of non NA values in a vector
#'
#' @param y vector of numerical values
#' @export
get_index_longest_non_na <- function(y) {
  split_indices <- which(is.na(y))
  nb_obs <- length(y)
  if (length(split_indices) == 0) {
    res <- seq_len(nb_obs)
  } else {
    idx_beg <- c(1, split_indices)
    if (idx_beg[length(idx_beg)] != nb_obs) {
      idx_beg <- c(idx_beg, nb_obs)
    }
    lengths <- diff(idx_beg)
    ind_max <- which.max(lengths)
    index_beginning <- idx_beg[ind_max]
    if (index_beginning != 1 | is.na(y[index_beginning])) {
      index_beginning <- index_beginning + 1
    }
    index_end <- idx_beg[ind_max] + lengths[ind_max]
    if (is.na(y[index_end])) {
      index_end <- index_end - 1
    }
    res <- seq(index_beginning, index_end)
  }
  res
}
The second one:
#' Returns a logical vector that identifies the longest sequence of non NA
#' values within the input vector
#'
#' @param y numeric vector
keep_values_longest_non_na <- function(y) {
  ids_to_keep <- get_index_longest_non_na(y)
  ids <- seq(1, length(y))
  ids %in% ids_to_keep
}
Note
Those two functions are defined in weatherperu/R/utils.R.
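To make the idea of "keeping the longest run of non-missing values" concrete, here is an alternative sketch based on base R's `rle()`. It is not the implementation from `weatherperu/R/utils.R`: `longest_non_na_mask()` is a hypothetical name, and its behavior in edge cases (ties, series made only of missing values) is an assumption of this sketch and may differ from the functions above.

```r
# Hypothetical alternative (not the package's code): logical mask that is
# TRUE for the elements of the longest run of consecutive non-missing values
longest_non_na_mask <- function(y) {
  runs <- rle(!is.na(y))
  # Length of each run, counting only runs of non-missing values
  non_na_lengths <- ifelse(runs$values, runs$lengths, 0L)
  if (all(non_na_lengths == 0)) {
    return(rep(FALSE, length(y)))
  }
  # which.max() returns the first longest run in case of ties
  best_run <- which.max(non_na_lengths)
  # Label each element with the index of the run it belongs to
  run_id <- rep(seq_along(runs$lengths), times = runs$lengths)
  run_id == best_run
}

y_example <- c(1, 2, NA, 4, 5, 6, NA, 8)
mask <- longest_non_na_mask(y_example)
y_example[mask]
```

In this example the runs of non-missing values have lengths 2, 3, and 1, so only the middle run (4, 5, 6) is kept.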
We define a function, pct_prod_production(), that takes the data frame of observations as input, as well as a crop name and a region ID. It returns a tibble with the following variables:
product_eng: the English name of the crop
region_id: the ID of the region
year: year
y_new_normalized (our variable of interest in Chapter 14): the production normalized by (i.e., divided by) its average for the crop of interest in the region of interest; this is what the text calls the demeaned production
y_new: the production (in tons) where missing values were imputed, if possible
y_dev_pct: the production expressed as the percentage deviation from the average (for the crop of interest, in the region of interest)
y: the residuals of an OLS regression of y_new_normalized on a quadratic trend (t and t^2, without intercept)
t: trend.
#' Computes the percentage deviation of production from annual regional average
#'
#' @param df data
#' @param crop_name name of the crop
#' @param region_id id of the region
#'
#' @returns data frame with the product, the region id, the date, the production
#'  with imputed missing values (`y_new`), the production demeaned (`y_new_normalized`),
#'  the percentage deviation from mean production (`y_dev_pct`), the residuals
#'  of `y_new_normalized` after removing a quadratic trend estimated by OLS
#'  (`y`), and a trend (`t`)
#' @export
#' @importFrom dplyr filter arrange mutate select row_number group_by ungroup
#' @importFrom dplyr case_when n
#' @importFrom tidyr nest unnest
#' @importFrom purrr map
#' @importFrom imputeTS na_interpolation
#' @importFrom stats lm predict residuals
pct_prod_production <- function(df, crop_name, region_id) {
  # The current data
  df_current <- df |>
    filter(
      product_eng == !!crop_name,
      region_id == !!region_id
    ) |>
    arrange(year)

  ## Dealing with missing values ----
  # Negative production values are treated as missing
  df_current <- df_current |>
    mutate(
      y_new = ifelse(Value_prod < 0, NA, Value_prod)
    )

  if (any(is.na(df_current$y_new))) {
    # Replacing NAs by interpolation
    # Runs of more than `maxgap` (three) contiguous NAs are not replaced
    df_current <- df_current |>
      mutate(
        y_new = imputeTS::na_interpolation(y_new, maxgap = 3)
      )

    # Removing obs at the beginning/end if they are still missing
    df_current <- df_current |>
      mutate(
        row_to_keep = !(is.na(y_new) & row_number() %in% c(1:2, (n() - 1):(n())))
      ) |>
      filter(row_to_keep) |>
      select(-row_to_keep)

    # Keeping the longest series of continuous non-NA values
    df_current <- df_current |>
      mutate(
        row_to_keep = keep_values_longest_non_na(y_new)
      ) |>
      filter(row_to_keep) |>
      select(-row_to_keep)
  }

  # Lengths of runs of contiguous zeros (computed but not used further here)
  rle_y_new <- rle(df_current$y_new)
  check_contiguous_zeros <- rle_y_new$lengths[rle_y_new$values == 0]

  ## Percent deviation from regional average over the period
  resul <- df_current |>
    ungroup() |>
    mutate(
      y_new_normalized = case_when(
        mean(y_new) == 0 ~ 0,
        TRUE ~ y_new / mean(y_new)
      ),
      y_dev_pct = case_when(
        mean(y_new) == 0 ~ 0,
        TRUE ~ (y_new - mean(y_new)) / mean(y_new)
      )
    ) |>
    arrange(year) |>
    mutate(t = row_number()) |>
    nest(.by = c(product_eng, region_id)) |>
    mutate(
      # Quadratic trend without intercept, fitted on the normalized production
      ols_fit = map(data, ~ lm(y_new_normalized ~ -1 + t + I(t^2), data = .x)),
      resid = map(ols_fit, residuals),
      fitted = map(ols_fit, predict)
    ) |>
    unnest(cols = c(data, resid, fitted)) |>
    mutate(
      y = resid
    ) |>
    select(
      product_eng, region_id, year, y_new, y_dev_pct, y_new_normalized, y, t
    ) |>
    ungroup() |>
    arrange(year)

  resul
}
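The detrending step inside the function can be looked at in isolation. The sketch below uses made-up normalized production values (`y_norm_example` is hypothetical data, not taken from the dataset) and fits the same kind of quadratic trend without intercept with `lm()`.

```r
# Made-up normalized production values (average equal to 1), for illustration
y_norm_example <- c(0.82, 0.95, 1.10, 0.90, 1.05, 1.20, 0.98, 1.00)
t <- seq_along(y_norm_example)

# Quadratic trend without intercept, as in the formula used in the function
fit <- lm(y_norm_example ~ -1 + t + I(t^2))

# The detrended series is the vector of OLS residuals
y_detrended <- unname(residuals(fit))
round(y_detrended, 3)
```

By construction, OLS residuals are orthogonal to the regressors `t` and `t^2`. Because the regression has no intercept, however, they are not guaranteed to average to zero.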
We can apply this function to all crops of interest, in each region. Let us define a table that contains all the possible values for the combination of crops and regions: