1414summary: |
1515 Building on our previous two posts (on our COVID-19 symptom surveys through
1616 Facebook and Google),
17- this post offers a deeper dive into empirical analysis, examining whether the
18- % CLI-in-community indicators from our two surveys can be used to improve
17+ this post offers a deeper dive into empirical analysis, examining whether the
18+ % CLI-in-community indicators from our two surveys can be used to improve
1919 the accuracy of short-term forecasts of county-level COVID-19 case rates.
2020acknowledgements: |
2121 Delphi's forecasting effort involves many people from our
22- modeling team, from forecaster design, to implementation, to evaluation. The
22+ modeling team, from forecaster design, to implementation, to evaluation. The
2323 broader insights on forecasting shared in this post certainly cannot be
24- attributed to Ryan's work alone, and are a reflection of the work carried out
24+ attributed to Ryan's work alone, and are a reflection of the work carried out
2525 by all these team members.
2626related:
2727 - 2020-09-18-google-survey
@@ -120,35 +120,32 @@ <h2>Problem Setup</h2>
120120We evaluate the following four models:</p>
121121<p><span class="math display">\[
122122\begin{aligned}
123- &\text{Cases:} \\
124- & h(Y_{\ell,t+d})
125- \approx \alpha + \sum_{j=0}^2 \beta_j h(Y_{\ell,t-7j}) \\
126- &\text{Cases + Facebook:} \\
127- & h(Y_{\ell,t+d})
128- \approx \alpha + \sum_{j=0}^2 \beta_j h(Y_{\ell,t-7j}) +
123+ h(Y_{\ell,t+d})
124+ &\approx \alpha + \sum_{j=0}^2 \beta_j h(Y_{\ell,t-7j}) \\
125+ h(Y_{\ell,t+d})
126+ &\approx \alpha + \sum_{j=0}^2 \beta_j h(Y_{\ell,t-7j}) +
129127\sum_{j=0}^2 \gamma_j h(F_{\ell,t-7j}) \\
130- &\text{Cases + Google:} \\
131- & h(Y_{\ell,t+d})
132- \approx \alpha + \sum_{j=0}^2 \beta_j h(Y_{\ell,t-7j}) +
128+ h(Y_{\ell,t+d})
129+ &\approx \alpha + \sum_{j=0}^2 \beta_j h(Y_{\ell,t-7j}) +
133130\sum_{j=0}^2 \gamma_j h(G_{\ell,t-7j}) \\
134- &\text{Cases + Facebook + Google:} \\
135- & h(Y_{\ell,t+d})
136- \approx \alpha + \sum_{j=0}^2 \beta_j h(Y_{\ell,t-7j}) +
131+ h(Y_{\ell,t+d})
132+ &\approx \alpha + \sum_{j=0}^2 \beta_j h(Y_{\ell,t-7j}) +
137133\sum_{j=0}^2 \gamma_j h(F_{\ell,t-7j}) +
138134\sum_{j=0}^2 \tau_j h(G_{\ell,t-7j}).
139135\end{aligned}
140136\]</span></p>
141137<p>Here <span class="math inline">\(d=7\)</span> or <span class="math inline">\(d=14\)</span>, depending on the target value
142138(number of days we predict ahead),
143139and <span class="math inline">\(h\)</span> is a transformation to be specified later.</p>
144- <p>Informally, the first model bases its predictions of future case rates
145- on the following three features:
140+ <p>Informally, the first model, which we’ll call the “Cases” model,
141+ bases its predictions of future case rates on the following three features:
146142current COVID-19 case rates, and those 1 and 2 weeks back.
147- The second model additionally incorporates the current Facebook signal,
148- and the Facebook signal from 1 and 2 weeks back.
149- The third model is exactly same but substitutes the Google signal
150- instead of the Facebook one.
151- Finally, the fourth model uses both Facebook and Google signals.
143+ The second model, “Cases + Facebook”, additionally incorporates the
144+ current Facebook signal, and the Facebook signal from 1 and 2 weeks back.
145+ The third model, “Cases + Google”, is exactly the same but substitutes the
146+ Google signal instead of the Facebook one.
147+ Finally, the fourth model, “Cases + Facebook + Google”,
148+ uses both Facebook and Google signals.
152149For each model, in order to make a forecast at time <span class="math inline">\(t_0\)</span>
153150(to predict case rates at time <span class="math inline">\(t_0+d\)</span>),
154151we fit a linear model using least absolute deviations (LAD) regression,
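LAD regression is simply quantile regression at the median, so each of the four fits above can be reproduced with the quantgen package that appears later in this commit (tau = 0.5 with lambda = 0 gives an unpenalized LAD fit). The sketch below is purely illustrative: the toy matrices stand in for the lagged features and transformed targets built in the forecasting code, and are not the post's actual data.
<pre><code># Illustrative sketch (not the post's actual pipeline): fitting one of the
# LAD regression models above with quantgen. Columns of x_tr stand in for the
# lagged features h(Y_{l,t}), h(Y_{l,t-7}), h(Y_{l,t-14}); y_tr stands in for
# the transformed case rates d days ahead. All data here are simulated.
library(quantgen)

set.seed(1)
x_tr = matrix(rnorm(300), 100, 3)   # training features (toy)
y_tr = rnorm(100)                   # training responses (toy)
x_te = matrix(rnorm(30), 10, 3)     # test features (toy)

# tau = 0.5 (median) with lambda = 0 gives plain least absolute deviations
obj = quantile_lasso(x_tr, y_tr, tau = 0.5, lambda = 0)
y_hat = as.numeric(predict(obj, newx = x_te))</code></pre>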
@@ -217,17 +214,17 @@ <h2>Forecasting Code</h2>
217214 as.Date(max(time_value)),
218215 by = "day")) %>% ungroup()
219216 df = full_join(df, df_all, by = c("geo_value", "time_value"))
220-
217+
221218 # Group by geo value, sort rows by increasing time
222- df = df %>% group_by(geo_value) %>% arrange(time_value)
223-
219+ df = df %>% group_by(geo_value) %>% arrange(time_value)
220+
224221 # Loop over shifts, and add lag value or lead value
225222 for (shift in shifts) {
226223 fun = ifelse(shift < 0, lag, lead)
227224 varname = sprintf("value%+d", shift)
228225 df = mutate(df, !!varname := fun(value, n = abs(shift)))
229226 }
230-
227+
231228 # Ungroup and return
232229 return(ungroup(df))
233230}
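Once `append_shifts` is defined, its effect is easiest to see on a tiny toy series; the county code and values below are made up and only illustrate the lag/lead columns it adds.
<pre><code># Toy illustration of append_shifts (hypothetical data, not from the post).
# With shifts = c(-7, 7) it adds a "value-7" column (value one week earlier)
# and a "value+7" column (value one week later), within each geo_value.
library(dplyr)

toy = tibble(geo_value = "42003",
             time_value = seq(as.Date("2020-06-01"), by = "day", length.out = 21),
             value = 1:21)
toy_shifted = append_shifts(toy, shifts = c(-7, 7))
head(select(toy_shifted, time_value, value, `value-7`, `value+7`))</code></pre>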
@@ -261,40 +258,40 @@ <h2>Forecasting Code</h2>
261258case_num = 200
262259geo_values = covidcast_signal("jhu-csse", "confirmed_cumulative_num",
263260 "2020-05-14", "2020-05-14") %>%
264- filter(value >= case_num) %>% pull(geo_value)
261+ filter(value >= case_num) %>% pull(geo_value)
265262
266263# Fetch county-level Google and Facebook % CLI-in-community signals, and JHU
267264# confirmed case incidence proportion
268265start_day = "2020-04-11"
269266end_day = "2020-09-01"
270267g = covidcast_signal("google-survey", "smoothed_cli") %>%
271- filter(geo_value %in% geo_values) %>%
272- select(geo_value, time_value, value)
273- f = covidcast_signal("fb-survey", "smoothed_hh_cmnty_cli",
268+ filter(geo_value %in% geo_values) %>%
269+ select(geo_value, time_value, value)
270+ f = covidcast_signal("fb-survey", "smoothed_hh_cmnty_cli",
274271 start_day, end_day) %>%
275- filter(geo_value %in% geo_values) %>%
276- select(geo_value, time_value, value)
272+ filter(geo_value %in% geo_values) %>%
273+ select(geo_value, time_value, value)
277274c = covidcast_signal("jhu-csse", "confirmed_7dav_incidence_prop",
278275 start_day, end_day) %>%
279- filter(geo_value %in% geo_values) %>%
276+ filter(geo_value %in% geo_values) %>%
280277 select(geo_value, time_value, value)
281278
282- # Find "complete" counties, present in all three data signals at all times
279+ # Find "complete" counties, present in all three data signals at all times
283280geo_values_complete = intersect(intersect(g$geo_value, f$geo_value),
284281 c$geo_value)
285282
286- # Filter to complete counties, transform the signals, append 1-2 week lags to
283+ # Filter to complete counties, transform the signals, append 1-2 week lags to
287284# all three, and also 1-2 week leads to case rates
288- lags = 1:2 * -7
285+ lags = 1:2 * -7
289286leads = 1:2 * 7
290- g = g %>% filter(geo_value %in% geo_values_complete) %>%
291- mutate(value = trans(value * rescale_g)) %>%
292- append_shifts(shifts = lags)
293- f = f %>% filter(geo_value %in% geo_values_complete) %>%
294- mutate(value = trans(value * rescale_f)) %>%
295- append_shifts(shifts = lags)
287+ g = g %>% filter(geo_value %in% geo_values_complete) %>%
288+ mutate(value = trans(value * rescale_g)) %>%
289+ append_shifts(shifts = lags)
290+ f = f %>% filter(geo_value %in% geo_values_complete) %>%
291+ mutate(value = trans(value * rescale_f)) %>%
292+ append_shifts(shifts = lags)
296293c = c %>% filter(geo_value %in% geo_values_complete) %>%
297- mutate(value = trans(value * rescale_c)) %>%
294+ mutate(value = trans(value * rescale_c)) %>%
298295 append_shifts(shifts = c(lags, leads))
299296
300297# Rename columns
@@ -310,55 +307,55 @@ <h2>Forecasting Code</h2>
310307
311308# Use quantgen for LAD regression (this package supports quantile regression and
312309# more; you can find it on GitHub here: https://github.com/ryantibs/quantgen)
313- library(quantgen)
310+ library(quantgen)
314311
315312res_list = vector("list", length = length(leads))
316313
317314# Loop over lead, forecast dates, build models and record errors (warning: this
318315# computation takes a while)
319- for (i in 1:length(leads)) {
316+ for (i in 1:length(leads)) {
320317 lead = leads[i]; if (verbose) cat("***", lead, "***\n")
321-
318+
322319 # Create a data frame to store our forecast results. Code below populates its
323- # rows in a way that breaks from typical dplyr operations, done for efficiency
324- res_list[[i]] = z %>%
325- filter(between(time_value, as.Date(start_day) - min(lags) + lead,
320+ # rows in a way that breaks from typical dplyr operations, done for efficiency
321+ res_list[[i]] = z %>%
322+ filter(between(time_value, as.Date(start_day) - min(lags) + lead,
326323 as.Date(end_day) - lead)) %>%
327324 select(geo_value, time_value) %>%
328- mutate(err0 = as.double(NA), err1 = as.double(NA), err2 = as.double(NA),
329- err3 = as.double(NA), err4 = as.double(NA), lead = lead)
325+ mutate(err0 = as.double(NA), err1 = as.double(NA), err2 = as.double(NA),
326+ err3 = as.double(NA), err4 = as.double(NA), lead = lead)
330327 valid_dates = unique(res_list[[i]]$time_value)
331-
328+
332329 for (k in 1:length(valid_dates)) {
333330 date = valid_dates[k]; if (verbose) cat(format(date), "... ")
334-
331+
335332 # Filter down to training set and test set
336333 z_tr = z %>% filter(between(time_value, date - lead - n, date - lead))
337334 z_te = z %>% filter(time_value == date)
338335 inds = which(res_list[[i]]$time_value == date)
339-
336+
340337 # Create training and test responses
341338 y_tr = z_tr %>% pull(paste0("case+", lead))
342339 y_te = z_te %>% pull(paste0("case+", lead))
343-
340+
344341 # Strawman model
345342 if (verbose) cat("0")
346343 y_hat = z_te %>% pull(case)
347344 res_list[[i]][inds,]$err0 = abs(inv_trans(y_hat) - inv_trans(y_te))
348-
345+
349346 # Cases only model
350347 if (verbose) cat("1")
351348 x_tr_case = z_tr %>% select(starts_with("case") & !contains("+"))
352349 x_te_case = z_te %>% select(starts_with("case") & !contains("+"))
353- x_tr = x_tr_case; x_te = x_te_case # For symmetry wrt what follows
350+ x_tr = x_tr_case; x_te = x_te_case # For symmetry wrt what follows
354351 ok = complete.cases(x_tr, y_tr)
355352 if (sum(ok) > 0) {
356353 obj = quantile_lasso(as.matrix(x_tr[ok,]), y_tr[ok], tau = 0.5,
357354 lambda = 0, lp_solver = lp_solver)
358355 y_hat = as.numeric(predict(obj, newx = as.matrix(x_te)))
359356 res_list[[i]][inds,]$err1 = abs(inv_trans(y_hat) - inv_trans(y_te))
360357 }
361-
358+
362359 # Cases and Facebook model
363360 if (verbose) cat("2")
364361 x_tr_fb = z_tr %>% select(starts_with("fb"))
@@ -386,7 +383,7 @@ <h2>Forecasting Code</h2>
386383 y_hat = as.numeric(predict(obj, newx = as.matrix(x_te)))
387384 res_list[[i]][inds,]$err3 = abs(inv_trans(y_hat) - inv_trans(y_te))
388385 }
389-
386+
390387 # Cases, Facebook, and Google model
391388 if (verbose) cat("4\n")
392389 x_tr = cbind(x_tr_case, x_tr_fb, x_tr_goog)
@@ -401,7 +398,7 @@ <h2>Forecasting Code</h2>
401398 }
402399}
403400
404- # Bind results over different leads into one big data frame, and save
401+ # Bind results over different leads into one big data frame, and save
405402res = do.call(rbind, res_list)
406403save(list = ls(), file = "demo.rda")</code></pre>
407404</div>
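As a pointer for readers who want to explore the saved results: one simple way to compare the models is to scale each model's absolute error by the strawman's error and look at medians by forecast horizon. The column names below come from the loop above; the scaling choice is an illustration, not necessarily how the post's own figures were produced.
<pre><code># Illustrative summary of the saved results (assumes the res data frame and
# demo.rda produced by the code above). err1-err4 are the four models'
# absolute errors; err0 is the strawman's. Ratios below 1 mean a model beat
# the strawman for that county-date.
library(dplyr)

load("demo.rda")
res %>%
  filter(err0 > 0) %>%            # avoid dividing by zero strawman errors
  group_by(lead) %>%
  summarize(cases         = median(err1 / err0, na.rm = TRUE),
            cases_fb      = median(err2 / err0, na.rm = TRUE),
            cases_goog    = median(err3 / err0, na.rm = TRUE),
            cases_fb_goog = median(err4 / err0, na.rm = TRUE))</code></pre>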
@@ -1036,8 +1033,8 @@ <h2>Wrap-Up</h2>
10361033test</a>
10371034(for paired data, as we have here) is more popular,
10381035because it tends to be more powerful than the sign test.
1039- Applied here, it does indeed give smaller p-values pretty much across the board.
1040- However, it assumes symmetry of the distribution in question
1036+ Applied here, it does indeed give smaller p-values pretty much across the
1037+ board. However, it assumes symmetry of the distribution in question
10411038(in our case, the difference in scaled errors),
10421039whereas the sign test does not, and thus we show results from the latter.<a href="#fnref2" class="footnote-back">↩︎</a></p></li>
10431040<li id="fn3"><p>Delphi’s “production” forecasters are still based on relatively simple