Exercise solutions: Section 7.10

Author

Rob J Hyndman and George Athanasopoulos

fpp3 7.10, Ex 4

The data set souvenirs concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

  1. Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
souvenirs |> autoplot(Sales)

Features of the data:

  • Seasonal data – similar scaled pattern repeats every year
  • A spike every March (except for year 1987) is the influence of the surfing festival
  • The size of the pattern increases proportionally to the level of sales
  1. Explain why it is necessary to take logarithms of these data before fitting a model.

The last feature above suggests taking logs to make the pattern (and variance) more stable

# Taking logarithms of the data
souvenirs |> autoplot(log(Sales))

After taking logs, the trend looks more linear and the seasonal variation is roughly constant.

  1. Fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.
fit <- souvenirs |>
  mutate(festival = month(Month) == 3 & year(Month) != 1987) |>
  model(reg = TSLM(log(Sales) ~ trend() + season() + festival))
souvenirs |>
  autoplot(Sales, col = "gray") +
  geom_line(data = augment(fit), aes(y = .fitted), col = "blue")

  1. Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
fit |> gg_tsresiduals()

The residuals are serially correlated. This is both visible from the time plot but also from the ACF. The residuals reveal nonlinearity in the trend.

augment(fit) |>
  ggplot(aes(x = .fitted, y = .innov)) +
  geom_point() +
  scale_x_log10()

The plot of residuals against fitted values looks fine - no notable patterns emerge. We take logarithms of fitted values because we took logs in the model.

  1. Do boxplots of the residuals for each month. Does this reveal any problems with the model?
augment(fit) |>
  mutate(month = month(Month, label = TRUE)) |>
  ggplot(aes(x = month, y = .innov)) +
  geom_boxplot()

The boxplots show differences in variation across the months revealing some potential heteroscedasticity.

  1. What do the values of the coefficients tell you about each variable?
tidy(fit) |> mutate(pceffect = (exp(estimate) - 1) * 100)
# A tibble: 14 × 7
   .model term           estimate std.error statistic  p.value  pceffect
   <chr>  <chr>             <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
 1 reg    (Intercept)      7.62    0.0742      103.   4.67e-78 203688.  
 2 reg    trend()          0.0220  0.000827     26.6  2.32e-38      2.23
 3 reg    season()year2    0.251   0.0957        2.63 1.06e- 2     28.6 
 4 reg    season()year3    0.266   0.193         1.38 1.73e- 1     30.5 
 5 reg    season()year4    0.384   0.0957        4.01 1.48e- 4     46.8 
 6 reg    season()year5    0.409   0.0957        4.28 5.88e- 5     50.6 
 7 reg    season()year6    0.449   0.0958        4.69 1.33e- 5     56.6 
 8 reg    season()year7    0.610   0.0958        6.37 1.71e- 8     84.1 
 9 reg    season()year8    0.588   0.0959        6.13 4.53e- 8     80.0 
10 reg    season()year9    0.669   0.0959        6.98 1.36e- 9     95.3 
11 reg    season()year10   0.747   0.0960        7.79 4.48e-11    111.  
12 reg    season()year11   1.21    0.0960       12.6  1.29e-19    234.  
13 reg    season()year12   1.96    0.0961       20.4  3.39e-31    612.  
14 reg    festivalTRUE     0.502   0.196         2.55 1.29e- 2     65.1 
  • (Intercept) is not interpretable.
  • trend coefficient shows that with every month sales increases on average by 2.2%.
  • season2 coefficient shows that February sales are greater than January on average by 28.6%, after allowing for the trend.
  • season12 coefficient shows that December sales are greater than January on average by 611.5%, after allowing for the trend.
  • festivalTRUE coefficient shows that for months that include the surfing festival, sales increases on average by 65.1% compared to months without the festival, after allowing for the trend and seasonality.
  1. What does the Ljung-Box test tell you about your model?
augment(fit) |>
  features(.innov, ljung_box, lag = 24)
# A tibble: 1 × 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 reg       112.  2.15e-13

The serial correlation in the residuals is significant.

  1. Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
future_souvenirs <- new_data(souvenirs, n = 36) |>
  mutate(festival = month(Month) == 3)
fit |>
  forecast(new_data = future_souvenirs) |>
  autoplot(souvenirs)

  1. How could you improve these predictions by modifying the model?
  • The model can be improved by taking into account nonlinearity of the trend.

fpp3 7.10, Ex 5

The us_gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “thousand barrels per day”. Consider only the data to the end of 2004. a. Fit a harmonic regression with trend to the data. Experiment with changing the number Fourier terms. Plot the observed gasoline and fitted values and comment on what you see.

gas <- us_gasoline |> filter(year(Week) <= 2004)
gas |> autoplot(Barrels)

fit <- gas |>
  model(
    fourier1 = TSLM(Barrels ~ trend() + fourier(K = 1)),
    fourier2 = TSLM(Barrels ~ trend() + fourier(K = 2)),
    fourier3 = TSLM(Barrels ~ trend() + fourier(K = 3)),
    fourier4 = TSLM(Barrels ~ trend() + fourier(K = 4)),
    fourier5 = TSLM(Barrels ~ trend() + fourier(K = 5)),
    fourier6 = TSLM(Barrels ~ trend() + fourier(K = 6)),
    fourier7 = TSLM(Barrels ~ trend() + fourier(K = 7)),
    fourier8 = TSLM(Barrels ~ trend() + fourier(K = 8)),
    fourier9 = TSLM(Barrels ~ trend() + fourier(K = 9)),
    fourier10 = TSLM(Barrels ~ trend() + fourier(K = 10)),
    fourier11 = TSLM(Barrels ~ trend() + fourier(K = 11)),
    fourier12 = TSLM(Barrels ~ trend() + fourier(K = 12)),
    fourier13 = TSLM(Barrels ~ trend() + fourier(K = 13)),
    fourier14 = TSLM(Barrels ~ trend() + fourier(K = 14)),
    fourier15 = TSLM(Barrels ~ trend() + fourier(K = 15))
  )
glance(fit) |>
  arrange(AICc) |>
  select(.model, AICc, CV)
# A tibble: 15 × 3
   .model      AICc     CV
   <chr>      <dbl>  <dbl>
 1 fourier7  -1887. 0.0740
 2 fourier11 -1885. 0.0742
 3 fourier8  -1885. 0.0743
 4 fourier12 -1884. 0.0742
 5 fourier6  -1884. 0.0744
 6 fourier9  -1882. 0.0745
 7 fourier10 -1881. 0.0746
 8 fourier13 -1881. 0.0745
 9 fourier14 -1879. 0.0748
10 fourier15 -1876. 0.0750
11 fourier5  -1862. 0.0766
12 fourier4  -1856. 0.0773
13 fourier3  -1853. 0.0777
14 fourier2  -1814. 0.0819
15 fourier1  -1809. 0.0825
# Best model has order 7
gas |>
  autoplot(Barrels, col = "gray") +
  geom_line(
    data = augment(fit) |> filter(.model == "fourier7"),
    aes(y = .fitted), col = "blue"
  )

This seems to have captured the annual seasonality quite well.

  1. Select the appropriate number of Fourier terms to include by minimising the AICc or CV value.

As above.

  1. Check the residuals of the final model using the gg_tsresiduals() function. Use a Ljung-Box test to check for residual autocorrelation.
fit |>
  select(fourier7) |>
  gg_tsresiduals()

There is some small remaining autocorrelation at lag 1.

augment(fit) |>
  filter(.model == "fourier7") |>
  features(.innov, ljung_box, lag = 52)
# A tibble: 1 × 3
  .model   lb_stat lb_pvalue
  <chr>      <dbl>     <dbl>
1 fourier7    56.9     0.299

The Ljung-Box test is not significant. Even if the Ljung-Box test had given a significant result here, the correlations are very small, so it would not make much difference to the forecasts and prediction intervals.

  1. Generate forecasts for the next year of data. Plot these along with the actual data for 2005. Comment on the forecasts.
fit |>
  select("fourier7") |>
  forecast(h = "1 year") |>
  autoplot(filter_index(us_gasoline, "2000" ~ "2006"))

The forecasts fit pretty well for the first six months, but not so well after that.

fpp3 7.10, Ex 6

The annual population of Afghanistan is available in the global_economy data set.

  1. Plot the data and comment on its features. Can you observe the effect of the Soviet-Afghan war?
global_economy |>
  filter(Country == "Afghanistan") |>
  autoplot(Population / 1e6) +
  labs(y = "Population (millions)") +
  geom_ribbon(aes(xmin = 1979.98, xmax = 1989.13), fill = "pink", alpha = 0.4) +
  annotate("text", x = 1984.5, y = 10, label = "Soviet-Afghan war", col = "red", size = 3)

The population increases slowly from 1960 to 1980, then decreases during the Soviet-Afghan war (24 Dec 1979 – 15 Feb 1989), and has increased relatively rapidly since then. The last 30 years has shown an almost linear increase in population.

  1. Fit a linear trend model and compare this to a piecewise linear trend model with knots at 1980 and 1989.
fit <- global_economy |>
  filter(Country == "Afghanistan") |>
  model(
    linear = TSLM(Population ~ trend()),
    piecewise = TSLM(Population ~ trend(knots = c(1980, 1989)))
  )
augment(fit) |>
  autoplot(.fitted) +
  geom_line(aes(y = Population), colour = "black")

The fitted values show that the piecewise linear model has tracked the data very closely, while the linear model is inaccurate.

  1. Generate forecasts from these two models for the five years after the end of the data, and comment on the results.
fc <- fit |> forecast(h = "5 years")
autoplot(fc) +
  autolayer(filter(global_economy |> filter(Country == "Afghanistan")), Population)

The linear model is clearly incorrect with prediction intervals too wide, and the point forecasts too low.

The piecewise linear model looks good, but the prediction intervals are probably too narrow. This model assumes that the trend since the last knot will continue unchanged, which is unrealistic.