# The Family of Autoregressive Moving Average Models
>
> "*Essentially, all models are wrong, but some are useful*", George Box
>
```{r, echo = FALSE, message = FALSE, warning = FALSE}
#detach("package:mgcv", TRUE)
library(simts)
library(ggplot2)
library(robcor)
library(tsdl)
```
In this chapter we introduce a class of time series models that is considerably flexible and among the most commonly used to describe stationary time series. This class is represented by the Seasonal AutoRegressive Integrated Moving Average (SARIMA) models which, among others, combine and include the autoregressive and moving average models seen in the previous chapter. To introduce this class of models, we start by describing a sub-class called AutoRegressive Moving Average (ARMA) models which represent the backbone on which the SARIMA class is built. The importance of ARMA models resides in their flexibility as well as their capacity to describe (or closely approximate) almost all the features of a stationary time series. The autoregressive parts of these models describe how consecutive observations in time influence each other, while the moving average parts capture possible unobserved shocks, thereby allowing these models to describe phenomena observed in fields ranging from biology to finance.
With this premise, the first part of this chapter introduces and explains the class of ARMA models in the following manner. First of all we discuss the class of linear processes, to which ARMA models belong, and we then proceed to a detailed description of autoregressive models in which we review their definition, explain their properties, introduce the main estimation methods for their parameters and highlight the diagnostic tools which can help understand whether the estimated models appear appropriate and sufficient to describe the observed time series well. Once this is done, we use most of the results given for the autoregressive models to further describe and discuss moving average models, for which we underline the property of invertibility, and finally the ARMA models. Indeed, the properties and estimation methods for the latter class are directly inherited from the discussions on the autoregressive and moving average models.
The second part of this chapter introduces the general class of SARIMA models, passing through the class of ARIMA models. These models allow the ARMA modeling framework to be applied also to time series with particular non-stationary components such as, for example, linear and/or seasonal trends. Extending ARMA modeling to these cases makes the SARIMA class extremely flexible and able to describe a wide range of phenomena.
## Linear Processes
In order to discuss the classes of models mentioned above, we first present the class of linear processes which underlie many of the most common time series models.
```{definition, label="lp", name="Linear Process"}
A time series, $(X_t)$, is defined to be a linear process if it can be expressed
as a linear combination of white noise as follows:
\[{X_t} = \mu + \sum\limits_{j = - \infty }^\infty {{\psi _j}{W_{t - j}}} \]
where $W_t \sim \mathcal{N}(0, \sigma^2)$ and $\sum\limits_{j = - \infty }^\infty {\left| {{\psi _j}} \right|} < \infty$.
```
<br>
Note that the latter assumption is required to ensure that the series converges. Furthermore, the set of coefficients \[{( {\psi _j}) _{j = - \infty , \cdots ,\infty }}\] can be viewed as a linear filter. These coefficients need not be equal nor symmetric, as later examples will show. Generally, the properties of a linear process related to mean and variance are given by:
\[\begin{aligned}
\mu_{X} &= \mu \\
\gamma_{X}(h) &= \sigma _W^2\sum\limits_{j = - \infty }^\infty {{\psi _j}{\psi _{h + j}}}
\end{aligned}\]
The latter is derived from
\[\begin{aligned}
\gamma \left( h \right) &= Cov\left( {{x_t},{x_{t + h}}} \right) \\
&= Cov\left( {\mu + \sum\limits_{j = - \infty }^\infty {{\psi _j}{w_{t - j}}} ,\mu + \sum\limits_{k = - \infty }^\infty {{\psi _k}{w_{t + h - k}}} } \right) \\
&= Cov\left( {\sum\limits_{j = - \infty }^\infty {{\psi _j}{w_{t - j}}} ,\sum\limits_{k = - \infty }^\infty {{\psi _k}{w_{t + h - k}}} } \right) \\
&= \sum\limits_{j = - \infty }^\infty {{\psi _j}{\psi _{j + h}}Cov\left( {{w_{t - j}},{w_{t - j}}} \right)} \\
&= \sigma _w^2\sum\limits_{j = - \infty }^\infty {{\psi _j}{\psi _{j + h}}} \\
\end{aligned} \]
Within the above derivation, the key is to realize that
$Cov\left( {{w_{t - j}},{w_{t + h - k}}} \right) = 0$ whenever $t - j \ne t + h - k$, so that only the terms with $k = j + h$ remain.
Lastly, another convenient way to formalize the definition of a linear process is through the use of the **backshift operator** (or lag operator) which is itself defined as follows:
\[B\,X_t = X_{t-1}.\]
The properties of the backshift operator allow us to create composite functions of the type
$$B^2 \, X_t = B (B \, X_t) = B \, X_{t-1} = X_{t-2}$$
which can be generalized as follows
$$B^k \, X_t = X_{t-k}.$$
Moreover, we can apply the inverse operator to it (i.e. $B^{-1} \, B = 1$) thereby allowing us to have, for example:
$$X_t = B^{-1} \, B X_t = B^{-1} X_{t-1}$$
```{example, label="backdiff", name="d-order Differences"}
We can re-express $X_t - X_{t-1}$ as
$$\delta X_t = (1 - B) X_t$$
or a second order difference as
$$\delta^2 X_t = (1 - B)^2 X_t$$
thereby generalizing to a d-order difference as follows:
$$\delta^d X_t = (1 - B)^d X_t.$$
```
<br>
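To make the effect of the backshift operator concrete, the sketch below (using base `R`'s `diff()` function on a random walk generated purely for illustration) checks numerically that the second order difference $(1 - B)^2 X_t$ corresponds to $X_t - 2X_{t-1} + X_{t-2}$:
```{r, eval = FALSE}
# Illustration: the d-order difference (1 - B)^d X_t corresponds to diff() in R
set.seed(1)
Xt = cumsum(rnorm(100))           # a random walk, used only for illustration

d1 = diff(Xt)                     # (1 - B) X_t = X_t - X_{t-1}
d2 = diff(Xt, differences = 2)    # (1 - B)^2 X_t

# Check the second order difference "by hand": X_t - 2 X_{t-1} + X_{t-2}
all.equal(d2, Xt[3:100] - 2*Xt[2:99] + Xt[1:98])
```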
Having defined the backshift operator, we can now provide an alternative definition of a linear process as follows:
\[{X_t} = \mu + \psi \left( B \right){W_t}\]
where $\psi ( B )$ is a polynomial function in $B$ whose coefficients are given by the linear filters $(\psi_j)$ (we'll describe these polynomials further on).
```{example, label="lpwn", name="Linear Process of White Noise"}
The white noise process $(X_t)$, defined in \@ref(wn),
can be expressed as a linear process as follows:
\[\psi _j = \begin{cases}
1 , &\mbox{ if } j = 0\\
0 , &\mbox{ if } |j| \ge 1
\end{cases}.\]
and $\mu = 0$.
Therefore, $X_t = W_t$, where $W_t \sim WN(0, \sigma^2_W)$
```
<br>
```{example, label="lpma1", name="Linear Process of Moving Average Order 1"}
Similarly, consider $(X_t)$ to be a MA(1) process,
given by \@ref(ma1). The process can be expressed linearly through the following filters:
\[\psi _j = \begin{cases}
1, &\mbox{ if } j = 0\\
\theta , &\mbox{ if } j = 1 \\
0, &\mbox{ otherwise}
\end{cases}.\]
and $\mu = 0$.
Thus, we have: $X_t = W_t + \theta W_{t-1}$
```
<br>
```{example, label="lpsma", name="Linear Process and Symmetric Moving Average"}
Consider a symmetric moving average given by:
\[{X_t} = \frac{1}{{2q + 1}}\sum\limits_{j = - q}^q {{W_{t + j}}} \]
Thus, $(X_t)$ is defined for $q + 1 \le t \le n-q$. The above process
would be a linear process since:
\[\psi _j = \begin{cases}
\frac{1}{{2q + 1}} , &\mbox{ if } -q \le j \le q\\
0 , &\mbox{ if } |j| > q
\end{cases}.\]
and $\mu = 0$.
In practice, if $q = 1$, we would have:
\[{X_t} = \frac{1}{3}\left( {{W_{t - 1}} + {W_t} + {W_{t + 1}}} \right)\]
```
<br>
```{example, label="lpar1", name="Autoregressive Process of Order 1"}
If $\left\{X_t\right\}$ follows an AR(1) model defined in \@ref(ar1), the linear filters are a function of the time lag:
\[\psi _j = \begin{cases}
\phi^j , &\mbox{ if } j \ge 0\\
0 , &\mbox{ if } j < 0
\end{cases}.\]
and $\mu = 0$. We would require the condition that $\left| \phi \right| < 1$ in order to respect the condition on the filters (i.e. $\sum\limits_{j = - \infty }^\infty {\left| {{\psi _j}} \right|} < \infty$).
```
## Autoregressive Models - AR(p) {#ardefinition}
The class of autoregressive models is based on the idea that previous values in the time series are needed to explain current values in the series. For this class of models, we assume that the $p$ previous observations are needed for this purpose and we therefore denote this class as AR($p$). In the previous chapter, the model we introduced was an AR(1) in which only the immediately previous observation is needed to explain the following one and therefore represents a particular model which is part of the more general class of AR($p$) models.
```{definition, label="arp", name="Autoregressive Models of Order p"}
The AR(p) models can be formally represented as follows
$${X_t} = {\phi_1}{X_{t - 1}} + ... + {\phi_p}{X_{t - p}} + {W_t},$$
where $\phi_i \neq 0$ (for $i = 1, ..., p$) and $W_t$ is a (Gaussian) white noise process with variance $\sigma^2$.
```
As earlier in this book, we will assume that the expectation of the process $({X_t})$, as well as that of the following ones in this chapter, is zero. The reason for this simplification is that if $\mathbb{E} [ X_t ] = \mu$, we can define an AR process *around* $\mu$ as follows:
$$X_t - \mu = \sum_{i = 1}^p \phi_i \left(X_{t-i} - \mu \right) + W_t,$$
which is equivalent to
$$X_t = \mu^{\star} + \sum_{i = 1}^p \phi_i X_{t-i} + W_t,$$
where $\mu^{\star} = \mu (1 - \sum_{i = 1}^p \phi_i)$. Therefore, to simplify the notation we will generally consider only zero mean processes, since adding means (as well as other deterministic trends) is easy.
A useful way of representing AR($p$) processes is through the backshift operator introduced in the previous section and is as follows
\[\begin{aligned}
{X_t} &= {\phi_1}{X_{t - 1}} + ... + {\phi_p}{X_{t - p}} + {W_t} \\
&= {\phi_1}B{X_t} + ... + {\phi_p}B^p{X_t} + {W_t} \\
&= ({\phi_1}B + ... + {\phi_p}B^p){X_t} + {W_t} \\
\end{aligned},\]
which finally yields
$$(1 - {\phi _1}B - ... - {\phi_p}B^p){X_t} = {W_t},$$
which, in abbreviated form, can be expressed as
$$\phi(B){X_t} = W_t.$$
We will see that $\phi(B)$ is important to establish the stationarity of these processes and is called the *autoregressive* operator. Moreover, this quantity is closely related to another important property of AR(p) processes called *causality*. Before formally defining this new property we consider the following example which provides an intuitive illustration of its importance.
**Example:** Consider a classical AR(1) model with $|\phi| > 1$. Such a model could be expressed as
$$X_t = \phi^{-1} X_{t+1} - \phi^{-1} W_{t+1} = \phi^{-k} X_{t+k} - \sum_{i = 1}^{k} \phi^{-i} W_{t+i}.$$
Since $|\phi| > 1$, letting $k \to \infty$ we obtain
$$X_t = - \sum_{j = 1}^{\infty} \phi^{-j} W_{t+j},$$
which is a linear process and therefore is stationary. Unfortunately, such a model is useless in practice because we would need future observations to predict the present. These processes are called non-causal.
### Properties of AR(p) models
In this section we describe the main property of AR(p) models, which has already been mentioned in the previous paragraphs: the property of causality, which we now introduce in a more formal manner.
**Definition:** An AR(p) model is *causal* if the time series $(X_t)_{-\infty}^{\infty}$ can be written as a one-sided linear process:
\begin{equation}
X_t = \sum_{j = 0}^{\infty} \psi_j W_{t-j} = \frac{1}{\phi(B)} W_t = \psi(B) W_t,
(\#eq:causal)
\end{equation}
where $\psi(B) = \sum_{j = 0}^{\infty} \psi_j B^j$, with $\sum_{j=0}^{\infty}|\psi_j| < \infty$ and setting $\psi_0 = 1$.
As discussed earlier, this condition implies that only the past values of the time series explain its future values, and not vice versa. Moreover, given that the linear filters are delivered by
$$\frac{1}{\phi(B)},$$
such a representation exists only when $\phi(z) \neq 0$ for all $|z| \leq 1$ (thereby implying causality). In other words, an AR(p) model is causal when the roots of $\phi(z) = 0$ all lie outside the unit circle.
<!-- However, it might be difficult and not obvious to show the causality of an AR(p) process by using the above definitions directly, thus the following properties are useful in practice. -->
<!-- **Causality** -->
<!-- If an AR(p) model is causal, then the coefficients of the one-sided linear process given in \@ref(eq:causal) can be obtained by solving -->
<!-- \begin{equation*} -->
<!-- \psi(z) = \frac{1}{\sum_{j=0}^{\infty} \phi_j z^j} = \frac{1}{\phi(z)}, \mbox{ } |z| \leq 1. -->
<!-- \end{equation*} -->
<!-- It can be seen how there is no solution to the above equation if $\phi(z) = 0$ and therefore an AR(p) is causal if and only if $\phi(z) \neq 0$. A condition for this to be respected is for the roots of $\phi(z) = 0$ to lie outside the unit circle. -->
```{example, label="AR2asLP", name="Transform an AR(2) into a Linear Process"}
Consider an AR(2) process $$X_t = 1.3 X_{t-1} - 0.4 X_{t-2} + W_t,$$ which we would like to transform into a linear process. This can be done using the following approach:
- Step 1:
The autoregressive operator of this model can be expressed as
$$
\phi(B) = 1-1.3B+0.4B^2 = (1-0.5B)(1-0.8B),
$$
and has roots 2 and 1.25, both $>1$. Thus, we should be able to convert it into a linear process.
- Step 2: We know that if the autoregressive operator of an AR(p) process has all its roots outside the unit circle, then we can write $X_t = \frac{1}{\phi(B)} W_t$. By applying the partial fractions trick, we can invert the autoregressive operator $\phi(B)$ as follows:
\[ \begin{aligned}
\phi^{-1}(B) &= \frac{1}{(1-0.5B)(1-0.8B)} = \frac{c_1}{(1-0.5B)} + \frac{c_2}{(1-0.8B)} \\
&= \frac{c_2(1-0.5B) + c_1(1-0.8B)}{(1-0.5B)(1-0.8B)} = \frac{(c_1 + c_2)-(0.8c_1+0.5c_2)B}{(1-0.5B)(1-0.8B)}.
\end{aligned} \]
To solve for $c_1$ and $c_2$:
\[ \begin{cases}
c_1 + c_2 &=1\\
0.8c_1+0.5c_2 &=0
\end{cases} \to
\begin{cases}
c_1 &= -5/3\\
c_2 &= 8/3.
\end{cases} \]
So we obtain
$$
\phi^{-1}(B) = \frac{-5}{3(1-0.5B)} + \frac{8}{3(1-0.8B)}.
$$
- Step 3: Using the Geometric series, i.e. $a\sum_{j=0}^{\infty} r^j = \frac{a}{1-r}$ if $|r| <1$, we have
\[ \begin{cases}
\frac{-5}{3(1-0.5B)} = -\frac{5}{3} \sum_{j=0}^\infty 0.5^j B^j, &\mbox{ if } |B| < 2 \\
\frac{8}{3(1-0.8B)} = \frac{8}{3} \sum_{j=0}^\infty 0.8^j B^j, &\mbox{ if } |B| < 1.25.
\end{cases} \]
So we can express $\phi^{-1}(B)$ as
$$
\phi^{-1}(B) = \sum_{j=0}^\infty \Big[ -\frac{5}{3} (0.5)^j + \frac{8}{3} (0.8)^j \Big] B^j, \;\;\; \text{if } |B|<1.25.
$$
- Step 4: Finally, we obtain
\[ \begin{aligned}
X_t &= \phi(B)^{-1} W_t = \sum_{j=0}^\infty \Big[ -\frac{5}{3} (0.5)^j + \frac{8}{3} (0.8)^j \Big] B^j W_t \\
&= \sum_{j=0}^\infty \Big[ -\frac{5}{3} (0.5)^j + \frac{8}{3} (0.8)^j \Big] W_{t-j},
\end{aligned} \]
which verifies that the AR(2) is causal, and therefore is stationary.
```
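As a quick numerical check of the above derivation (a sketch based on base `R`'s `ARMAtoMA()` function, which returns the weights $\psi_1, \psi_2, \ldots$ of the MA($\infty$) representation with $\psi_0 = 1$ left implicit), we can compare the analytic filters $\psi_j = -\frac{5}{3}(0.5)^j + \frac{8}{3}(0.8)^j$ with the ones computed numerically:
```{r, eval = FALSE}
# Analytic linear filters of the AR(2): X_t = 1.3 X_{t-1} - 0.4 X_{t-2} + W_t
j = 1:10
psi_analytic = -5/3*0.5^j + 8/3*0.8^j

# MA(infinity) weights psi_1, ..., psi_10 computed by R
psi_R = ARMAtoMA(ar = c(1.3, -0.4), lag.max = 10)

# The two sets of weights coincide
all.equal(psi_analytic, psi_R)
```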
```{example, label="AR2causalcond", name="Causal Conditions for an AR(2) Process"}
We already know that an AR(1) is causal with the simple condition $|\phi_1|<1$. It seems natural to believe that an AR(2) should be causal (and therefore stationary) with the condition that $|\phi_i| <1, \; i=1,2$. However, this is actually not the case as we illustrate below.
We can express an AR(2) process as
$$
X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + W_t = \phi_1 BX_t + \phi_2 B^2 X_t + W_t,
$$
thereby delivering the following autoregressive operator:
$$
\phi(B) = 1-\phi_1 B - \phi_2 B^2 = \Big( 1-\frac{B}{\lambda_1} \Big) \Big( 1-\frac{B}{\lambda_2} \Big)
$$
where $\lambda_1$ and $\lambda_2$ are the roots of $\phi(B)$ such that
\[ \begin{aligned}
\phi_1 &= \frac{1}{\lambda_1} + \frac{1}{\lambda_2}, \\
\phi_2 &= - \frac{1}{\lambda_1} \frac{1}{\lambda_2}.
\end{aligned} \]
That is,
\[\begin{aligned}
\lambda_1 &= \frac{\phi_1 + \sqrt{\phi_1^2 + 4\phi_2}}{-2\phi_2}, \\
\lambda_2 &= \frac{\phi_1 - \sqrt{\phi_1^2 + 4\phi_2}}{-2\phi_2}.
\end{aligned} \]
In order to ensure the causality of the model, we need the roots of $\phi(B)$, i.e. $\lambda_1$ and $\lambda_2$, to lie outside the unit circle.
\[ \begin{cases}
|\lambda_1| &> 1 \\
|\lambda_2| &> 1,
\end{cases} \]
if and only if
\[ \begin{cases}
\phi_1 + \phi_2 &< 1 \\
\phi_2 - \phi_1 &< 1 \\
|\phi_2| &<1.
\end{cases} \]
We can show the *if* part of the statement as follows:
\[ \begin{aligned}
& \phi_1 + \phi_2 = \frac{1}{\lambda_1} + \frac{1}{\lambda_2} - \frac{1}{\lambda_1 \lambda_2} = \frac{1}{\lambda_1} \Big(1-\frac{1}{\lambda_2} \Big) + \frac{1}{\lambda_2} < 1 - \frac{1}{\lambda_2} + \frac{1}{\lambda_2} = 1 \;\; \text{since } 1-\frac{1}{\lambda_2} > 0, \\
& \phi_2 - \phi_1 = -\frac{1}{\lambda_1 \lambda_2} - \frac{1}{\lambda_1} - \frac{1}{\lambda_2} = -\frac{1}{\lambda_1} \Big( \frac{1}{\lambda_2} +1 \Big) - \frac{1}{\lambda_2} < \frac{1}{\lambda_2}+1-\frac{1}{\lambda_2} = 1 \;\; \text{since } \frac{1}{\lambda_2}+1 > 0, \\
& |\phi_2| = \frac{1}{|\lambda_1| |\lambda_2|} < 1.
\end{aligned} \]
We can also show the *only if* part of the statement as follows:
Since $\lambda_1 = \frac{\phi_1 + \sqrt{\phi_1^2 + 4\phi_2}}{-2\phi_2}$ and $\phi_2 - 1 < \phi_1 < 1- \phi_2$, we have
$$
\lambda_1^2 = \frac{(\phi_1 + \sqrt{\phi_1^2 + 4\phi_2})^2}{4\phi_2^2} < \frac{\Big( (1-\phi_2)+ \sqrt{(1-\phi_2)^2 + 4\phi_2} \Big)^2}{4\phi_2^2} = \frac{4}{4\phi_2^2} \leq 1.
$$
Since $\lambda_2 = \frac{\phi_1 - \sqrt{\phi_1^2 + 4\phi_2}}{-2\phi_2}$ and $\phi_2 - 1 < \phi_1 < 1- \phi_2$, we have
$$
\lambda_2^2 = \frac{(\phi_1 - \sqrt{\phi_1^2 + 4\phi_2})^2}{4\phi_2^2} < \frac{\Big( (\phi_2-1)+ \sqrt{(\phi_2-1)^2 + 4\phi_2} \Big)^2}{4\phi_2^2} = \frac{4\phi_2^2}{4\phi_2^2} = 1.
$$
Finally, the causal region for the parameters of an AR(2) is displayed in the figure below.
```
```{r correxample2, cache = TRUE, echo = FALSE, fig.cap="Causal Region for Parameters of an AR(2) Process", fig.align = 'center'}
knitr::include_graphics("images/causal_AR2.png")
```
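In practice, the root condition can easily be checked numerically with base `R`'s `polyroot()` function. The sketch below uses a small helper (called `is_causal_ar2()` purely for illustration) that returns `TRUE` when all roots of $\phi(z) = 1 - \phi_1 z - \phi_2 z^2$ lie outside the unit circle:
```{r, eval = FALSE}
# Check causality of an AR(2) via the roots of phi(z) = 1 - phi1*z - phi2*z^2
is_causal_ar2 = function(phi1, phi2) {
  roots = polyroot(c(1, -phi1, -phi2))  # coefficients in increasing powers of z
  all(Mod(roots) > 1)                   # causal if all roots lie outside the unit circle
}

is_causal_ar2(1.3, -0.4)  # the AR(2) from the earlier example: TRUE
is_causal_ar2(0.5, 0.6)   # violates phi_1 + phi_2 < 1: FALSE
```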
### Estimation of AR(p) models
Given the above defined properties of AR(p) models, we will now discuss how these models can be estimated, more specifically how the $p+1$ parameters can be obtained from an observed time series. Indeed, a reliable estimation of these models is necessary in order to interpret and describe different natural phenomena and/or forecast possible future values of the time series.
A first approach builds upon the definition of AR($p$) models and the fact that, when causal, they are linear processes. Recall that
\begin{equation}
X_t = \sum_{j = 1}^{p} \phi_j X_{t-j} + W_t,
\end{equation}
which delivers the following autocovariance function
\begin{equation}
\gamma(h) = \text{cov}(X_{t+h}, X_t) = \text{cov}\left(\sum_{j = 1}^{p} \phi_j X_{t+h-j} + W_{t+h}, X_t\right) = \sum_{j = 1}^{p} \phi_j \gamma(h-j), \mbox{ } h \geq 1,
\end{equation}
since $\text{cov}(W_{t+h}, X_t) = 0$ for $h \geq 1$.
Rearranging the above expressions we obtain the following general equations
\begin{equation}
\gamma(h) - \sum_{j = 1}^{p} \phi_j \gamma(h-j) = 0, \mbox{ } h \geq 1
\end{equation}
and, recalling that $\gamma(h) = \gamma(-h)$,
\begin{equation}
\gamma(0) - \sum_{j = 1}^{p} \phi_j \gamma(j) = \sigma_w^2.
\end{equation}
We can now define the Yule-Walker equations.
**Definition:** The Yule-Walker equations are given by
\begin{equation}
\gamma(h) = \phi_1 \gamma(h-1) + ... + \phi_p \gamma(h-p), \mbox{ } h = 1,...,p
\end{equation}
and
\begin{equation}
\sigma_w^2 = \gamma(0) - \phi_1 \gamma(1) - ... - \phi_p \gamma(p).
\end{equation}
which in matrix notation can be defined as follows
\begin{equation}
\Gamma_p \mathbf{\phi} = \mathbf{\gamma}_p \,\, \text{and} \,\, \sigma_w^2 = \gamma(0) - \mathbf{\phi}'\mathbf{\gamma}_p
\end{equation}
where $\Gamma_p$ is the $p\times p$ matrix containing the autocovariances $\gamma(k-j)$, where $j,k = 1, ...,p$, while $\mathbf{\phi} = (\phi_1,...,\phi_p)'$ and $\mathbf{\gamma}_p = (\gamma(1),...,\gamma(p))'$ are $p\times 1$ vectors.
Considering the Yule-Walker equations, it is possible to use a method of moments approach and simply replace the theoretical quantities given in the previous definition with their empirical (estimated) counterparts that we saw in the previous chapter. This gives us the following Yule-Walker estimators
\begin{equation}
\hat{\mathbf{\phi}} = \hat{\Gamma}_p^{-1}\hat{\mathbf{\gamma}}_p \,\, \text{and} \,\, \hat{\sigma}_w^2 = \hat{\gamma}(0) - \hat{\mathbf{\gamma}}_p'\hat{\Gamma}_p^{-1}\hat{\mathbf{\gamma}}_p .
\end{equation}
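To illustrate the method of moments nature of these estimators, the sketch below (using base `R` functions and a simulated AR(2) chosen purely for illustration) solves the Yule-Walker system "by hand" from the sample autocovariances and compares the result with the built-in `ar.yw()` routine:
```{r, eval = FALSE}
# Yule-Walker estimation of an AR(2) done "by hand"
set.seed(123)
Xt = arima.sim(model = list(ar = c(0.5, 0.25)), n = 1000)

# Sample autocovariances gamma_hat(0), gamma_hat(1), gamma_hat(2)
gamma_hat = acf(Xt, lag.max = 2, type = "covariance", plot = FALSE)$acf[, , 1]

Gamma_p = toeplitz(gamma_hat[1:2])   # p x p matrix with entries gamma_hat(k - j)
gamma_p = gamma_hat[2:3]             # (gamma_hat(1), gamma_hat(2))'

phi_hat = solve(Gamma_p, gamma_p)                   # Yule-Walker estimate of phi
sigma2_hat = gamma_hat[1] - sum(phi_hat*gamma_p)    # estimate of sigma_w^2
c(phi_hat, sigma2_hat)

# The built-in routine solves the same system (the innovation variance may
# differ slightly due to finite-sample corrections)
ar.yw(Xt, order.max = 2, aic = FALSE)
```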
These estimators have the following asymptotic properties.
**Consistency and Asymptotic Normality of Yule-Walker estimators:**
The Yule-Walker estimators for a causal AR($p$) model have the following asymptotic properties:
\begin{equation*}
\sqrt{T}(\hat{\mathbf{\phi}}- \mathbf{\phi}) \xrightarrow{\mathcal{D}} \mathcal{N}(\mathbf{0},\sigma_w^2\Gamma_p^{-1}) \,\, \text{and} \,\, \hat{\sigma}_w^2 \xrightarrow{\mathcal{P}} \sigma_w^2 .
\end{equation*}
Therefore the Yule-Walker estimators have an asymptotically normal distribution and the estimator of the innovation variance is consistent. Moreover, these estimators are asymptotically optimal for AR($p$) models, meaning that they are efficient. However, there is another method which allows us to achieve this efficiency (also for the general ARMA models that will be tackled further on): the Maximum Likelihood Estimation (MLE) method. Considering an AR(1) model as an example, and assuming without loss of generality that its expectation is zero, we have the following representation of the AR(1) model
\begin{equation*}
X_t = \phi X_{t-1} + W_t,
\end{equation*}
where $|\phi|<1$ and $W_t \overset{iid}{\sim} \mathcal{N}(0,\sigma_w^2)$. Supposing we have observations $(x_t)_{t=1,...,T}$ generated from this model, the likelihood function for this setting is given by
\begin{equation*}
L(\phi,\sigma_w^2) = f(x_1,...,x_T ; \phi,\sigma_w^2),
\end{equation*}
which, for an AR(1) model, can be rewritten as follows
\begin{equation*}
L(\phi,\sigma_w^2) = f(x_1)f(x_2|x_1)\cdot \cdot \cdot f(x_T|x_{T-1}).
\end{equation*}
If we define $\Omega_t^p$ as the information contained in the previous $p$ observations (before time $t$), the above expression can be generalized for an AR(p) model as follows
\begin{equation*}
L(\phi,\sigma_w^2) = f(x_1,...,x_p)f(x_{p+1}|\Omega_{p+1}^p)\cdot \cdot \cdot f(x_T|\Omega_{T}^p)
\end{equation*}
where $f(x_1,...,x_p)$ is the joint probability distribution of the first $p$ observations. Going back to the AR(1) setting, based on our assumption on $(W_t)$ we know that $x_t|x_{t-1} \sim \mathcal{N}(\phi x_{t-1},\sigma_w^2)$ and therefore we have that
\begin{equation*}
f(x_t|x_{t-1}) = f_w(x_t - \phi x_{t-1})
\end{equation*}
where $f_w(\cdot)$ is the density of $W_t$. This allows us to rewrite the likelihood function as follows
\begin{equation*}
L(\phi,\sigma_w^2) = f(x_1)\prod_{t=2}^T f_w(x_t - \phi x_{t-1}),
\end{equation*}
where $f(x_1)$ can be found through the causal representation
\begin{equation*}
x_1 = \sum_{j=0}^{\infty} \phi^j w_{1-j},
\end{equation*}
which implies that $x_1$ follows a normal distribution with zero expectation and a variance given by $\frac{\sigma_w^2}{(1-\phi^2)}$. Based on this, the likelihood function of an AR(1) finally becomes
\begin{equation*}
L(\phi,\sigma_w^2) = (2\pi \sigma_w^2)^{-\frac{T}{2}} (1 - \phi^2)^{\frac{1}{2}} \exp \left(-\frac{S(\phi)}{2 \sigma_w^2}\right),
\end{equation*}
with $S(\phi) = (1-\phi^2) x_1^2 + \sum_{t=2}^T (x_t -\phi x_{t-1})^2$. The maximization of the logarithm of this likelihood (or, equivalently, the minimization of the negative log-likelihood) is usually carried out numerically. However, if we condition on the initial values, the AR(p) models are linear and, for example, we can then define the conditional likelihood of an AR(1) as
\begin{equation*}
L(\phi,\sigma_w^2|x_1) = (2\pi \sigma_w^2)^{-\frac{T-1}{2}} \exp \left(-\frac{S_c(\phi)}{2 \sigma_w^2}\right),
\end{equation*}
where
\begin{equation*}
S_c(\phi) = \sum_{t=2}^T (x_t -\phi x_{t-1})^2 .
\end{equation*}
The latter is called the conditional sum of squares and $\phi$ can be estimated as a straightforward linear regression problem. Once an estimate $\hat{\phi}$ is obtained, this can be used to obtain the conditional maximum likelihood estimate of $\sigma_w^2$
\begin{equation*}
\hat{\sigma}_w^2 = \frac{S_c(\hat{\phi})}{(T-1)} .
\end{equation*}
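Since the conditional sum of squares is minimized by an ordinary least squares fit, a minimal sketch of the conditional MLE of an AR(1) (using base `R` and simulated data chosen purely for illustration) is the following:
```{r, eval = FALSE}
# Conditional MLE of a zero-mean AR(1): regress x_t on x_{t-1} (no intercept)
set.seed(10)
x = as.numeric(arima.sim(model = list(ar = 0.7), n = 500))
n = length(x)

fit = lm(x[2:n] ~ x[1:(n - 1)] - 1)          # minimizes S_c(phi)
phi_hat = coef(fit)
sigma2_hat = sum(residuals(fit)^2)/(n - 1)   # conditional MLE of sigma_w^2
c(phi_hat, sigma2_hat)

# Compare with the exact Gaussian MLE
arima(x, order = c(1, 0, 0), include.mean = FALSE, method = "ML")
```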
The estimation methods presented so far are standard for these kinds of models. Nevertheless, if the data suffers from some form of contamination, these methods can become highly biased. For this reason, robust estimators are available to limit this problem when there are indeed outliers in the observed time series. One of these methods relies on the estimator proposed by Kunsch (1984), who underlines that the MLE score function of an AR(p) is given by
\begin{equation*}
\kappa(\mathbf{\theta}|x_j,...x_{j+p}) = \frac{\partial}{\partial \mathbf{\theta}} (x_{j+p} - \sum_{k=1}^p \phi_k x_{j+p-k})^2,
\end{equation*}
where $\theta$ is the parameter vector containing, in the case of an AR(1)
model, the two parameters $\phi$ and $\sigma_w^2$
(i.e. $\theta = [\phi \,\, \sigma_w^2]$). This delivers the estimating equation
\begin{equation*}
\sum_{j=1}^{n-p} \kappa (\hat{\mathbf{\theta}}|x_j,...x_{j+p}) = 0 .
\end{equation*}
The score function $\kappa(\cdot)$ is clearly not bounded, in the sense that if
we arbitrarily move a value of $(x_t)$ to infinity then the score function also
goes to infinity thereby delivering a biased estimation procedure. To avoid that
outlying observations bias the estimation excessively, a bounded score function
can be used to deliver an M-estimator given by
\begin{equation*}
\sum_{j=1}^{n-p} \psi (\hat{\mathbf{\theta}}|x_j,...x_{j+p}) = 0,
\end{equation*}
where $\psi(\cdot)$ is a bounded function. When conditioning on the
first $p$ observations, this problem can be brought back to a linear regression
problem which can be applied in a robust manner using the robust regression
tools available in `R` such as `rlm` or `lmrob`. However, another
available tool in `R` which does not require a strict specification of the distribution function (also for general ARMA models) is the `method = "rgmwm"` option within the `estimate()` function (in the `simts` package). This function makes use of a quantity called the wavelet variance (denoted as $\boldsymbol{\nu}$) which is estimated robustly and then used to retrieve the parameters $\theta$ of the time series model. The robust estimate is obtained by solving the following minimization problem
\begin{equation*}
\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta} \in \boldsymbol{\Theta}}{\text{argmin}} (\hat{\boldsymbol{\nu}} - \boldsymbol{\nu}(\boldsymbol{\theta}))^T\boldsymbol{\Omega}(\hat{\boldsymbol{\nu}} - \boldsymbol{\nu}({\boldsymbol{\theta}})),
\end{equation*}
where $\hat{\boldsymbol{\nu}}$ is the robustly estimated wavelet variance, $\boldsymbol{\nu}({\boldsymbol{\theta}})$ is
the theoretical wavelet variance (implied by the model we want to estimate) and $\boldsymbol{\Omega}$ is a positive definite weighting matrix. Below we show some simulation studies where we present the results of the above estimation procedures in the absence and in the presence of contamination in the data. As a reminder, so far we have mainly discussed three estimators for the parameters of AR($p$) models (i.e. the Yule-Walker, maximum likelihood, and RGMWM estimators).
Using the `simts` package, these three estimators can be computed as follows:
```{r, eval=FALSE}
mod = estimate(AR(p), Xt, method = select_method, demean = TRUE)
```
In the above sample code `Xt` denotes the time series (a vector of length $T$), `p` is the order of the AR($p$) and `demean = TRUE` indicates that the mean of the process should be estimated (if this is not the case, then use `demean = FALSE`). The `select_method` input can be (among others) `"mle"` for the maximum likelihood estimator, `"yule-walker"` for the Yule-Walker estimator, or `"rgmwm"` for the RGMWM. For example, if you would like to estimate a zero mean AR(3) with the MLE you can use the code:
```{r, eval = FALSE}
mod = estimate(AR(3), Xt, method = "mle", demean = FALSE)
```
On the other hand, if one wishes to estimate the parameters of this model through the RGMWM one can use the following syntax:
```{r, eval = FALSE}
mod = estimate(AR(3), Xt, method = "rgmwm", demean = FALSE)
```
Removing the mean is not strictly necessary for the `rgmwm` (or `gmwm`) function since it does not estimate it and can consistently estimate the parameters of the time series model anyway. We now have the necessary `R` functions to deliver the above mentioned estimators and we can proceed to the simulation study. In particular, we simulate three different processes $X_t, Y_t, Z_t$, the first of which is an uncontaminated process defined as \[X_t = 0.5 X_{t-1} + 0.25 X_{t-2} + W_t,\] with $W_t \overset{iid}{\sim} N(0, 1)$. The other two processes are contaminated versions of the first that can often be observed in practice. The first type of contamination can be seen in $(Y_t)$ and is delivered by replacing a portion of the original process with a process defined as \[U_t = 0.90 U_{t-1} - 0.40 U_{t-2} + V_t,\] where $V_t \overset{iid}{\sim} N(0, 9)$. The second form of contamination can be seen in $(Z_t)$ and consists of so-called point-wise contamination where randomly selected points of $(X_t)$ are replaced with $N_t \overset{iid}{\sim} N(0, 9)$.
The code below performs the simulation study where it can be seen how the contaminated processes $(Y_t)$ and $(Z_t)$ are generated. Once this is done, for each simulation the code estimates the parameters of the AR(2) model using the three different estimation methods.
```{r simuAR2study, eval = TRUE, cache = TRUE}
# Load simts
library(simts)

# Number of Monte Carlo replications
B = 250

# Sample size
n = 500

# Proportion of contamination
eps = 0.05

# Number of contaminated observations
cont = round(eps*n)

# Simulation storage
res.Xt.MLE = res.Xt.YW = res.Xt.RGMWM = matrix(NA, B, 3)
res.Yt.MLE = res.Yt.YW = res.Yt.RGMWM = matrix(NA, B, 3)
res.Zt.MLE = res.Zt.YW = res.Zt.RGMWM = matrix(NA, B, 3)

# Begin Monte Carlo simulation
for (i in seq_len(B)){
  # Set seed for reproducibility
  set.seed(1982 + i)

  # Generate processes
  Xt = gen_gts(n, AR(phi = c(0.5, 0.25), sigma2 = 1))
  Yt = Zt = Xt

  # Generate Ut contamination process that replaces a portion of the original signal
  index_start = sample(1:(n-cont-1), 1)
  index_end = index_start + cont - 1
  Yt[index_start:index_end] = gen_gts(cont, AR(phi = c(0.9, -0.4), sigma2 = 9))

  # Generate Nt contamination that injects noise at random
  Zt[sample(n, cont, replace = FALSE)] = gen_gts(cont, WN(sigma2 = 9))

  # Fit Yule-Walker estimators on the three time series
  mod.Xt.YW = estimate(AR(2), Xt, method = "yule-walker", demean = FALSE)
  mod.Yt.YW = estimate(AR(2), Yt, method = "yule-walker", demean = FALSE)
  mod.Zt.YW = estimate(AR(2), Zt, method = "yule-walker", demean = FALSE)

  # Store results
  res.Xt.YW[i, ] = c(mod.Xt.YW$mod$coef, mod.Xt.YW$mod$sigma2)
  res.Yt.YW[i, ] = c(mod.Yt.YW$mod$coef, mod.Yt.YW$mod$sigma2)
  res.Zt.YW[i, ] = c(mod.Zt.YW$mod$coef, mod.Zt.YW$mod$sigma2)

  # Fit MLE on the three time series
  mod.Xt.MLE = estimate(AR(2), Xt, method = "mle", demean = FALSE)
  mod.Yt.MLE = estimate(AR(2), Yt, method = "mle", demean = FALSE)
  mod.Zt.MLE = estimate(AR(2), Zt, method = "mle", demean = FALSE)

  # Store results
  res.Xt.MLE[i, ] = c(mod.Xt.MLE$mod$coef, mod.Xt.MLE$mod$sigma2)
  res.Yt.MLE[i, ] = c(mod.Yt.MLE$mod$coef, mod.Yt.MLE$mod$sigma2)
  res.Zt.MLE[i, ] = c(mod.Zt.MLE$mod$coef, mod.Zt.MLE$mod$sigma2)

  # Fit RGMWM on the three time series
  res.Xt.RGMWM[i, ] = estimate(AR(2), Xt, method = "rgmwm", demean = FALSE)$mod$estimate
  res.Yt.RGMWM[i, ] = estimate(AR(2), Yt, method = "rgmwm", demean = FALSE)$mod$estimate
  res.Zt.RGMWM[i, ] = estimate(AR(2), Zt, method = "rgmwm", demean = FALSE)$mod$estimate
}
```
Having performed the estimation, we should now have `r B` estimates for each AR(2) parameter and each estimation method. The code below takes the results of the simulation and displays them as boxplots along with the true values of the parameters. The estimation methods are denoted as follows:
- **YW**: Yule-Walker estimator
- **MLE**: Maximum Likelihood Estimator
- **RGMWM**: the robust version of the GMWM estimator
```{r, echo=FALSE, fig.asp = 0.75, fig.width = 9, cache = TRUE, fig.align='center', fig.cap="Boxplots of the empirical distributions of the Yule-Walker (YW), MLE and RGMWM estimators for the parameters of the AR(2) model when using the Xt process (first row of boxplots), the Yt process (second row of boxplots) and the Zt process (third row of boxplots)."}
# Define colors
hues = seq(15, 375, length = 4)
cols = hcl(h = hues, l = 65, c = 100, alpha = 0.8)[1:3]

# True values
theta0 = c(0.5, 0.25, 1)

# Define window params
par(mfrow = c(3, 3), oma = c(2.5,2.5,0,0) + 1.5, mar = c(0,0,2.5,1) + 1.5)

# Make boxplots - Xt
main = c(expression(paste("Simulated process ", X[t], " - ", phi[1])),
         expression(paste("Simulated process ", X[t], " - ", phi[2])),
         expression(paste("Simulated process ", X[t], " - ", sigma^2)))
for (i in 1:3){
  boxplot(res.Xt.YW[,i], res.Xt.MLE[,i], res.Xt.RGMWM[,i],
          names = c("YW", "MLE", "RGMWM"), col = cols, main = main[i],
          outline = FALSE)
  abline(h = theta0[i], col = 2, lwd = 2)
}

# Make boxplots - Yt
main = c(expression(paste("Simulated process ", Y[t], " - ", phi[1])),
         expression(paste("Simulated process ", Y[t], " - ", phi[2])),
         expression(paste("Simulated process ", Y[t], " - ", sigma^2)))
for (i in 1:3){
  boxplot(res.Yt.YW[,i], res.Yt.MLE[,i], res.Yt.RGMWM[,i],
          names = c("YW", "MLE", "RGMWM"), col = cols, main = main[i],
          outline = FALSE)
  abline(h = theta0[i], col = 2, lwd = 2)
}

# Make boxplots - Zt
main = c(expression(paste("Simulated process ", Z[t], " - ", phi[1])),
         expression(paste("Simulated process ", Z[t], " - ", phi[2])),
         expression(paste("Simulated process ", Z[t], " - ", sigma^2)))
for (i in 1:3){
  boxplot(res.Zt.YW[,i], res.Zt.MLE[,i], res.Zt.RGMWM[,i],
          names = c("YW", "MLE", "RGMWM"), col = cols, main = main[i],
          outline = FALSE)
  abline(h = theta0[i], col = 2, lwd = 2)
}
```
It can be seen how all methods appear to properly estimate the true parameter values on average when they are applied to the simulated time series from the uncontaminated process $(X_t)$. However, the MLE appears to be slightly more efficient (less variable) compared to the other methods and, in addition, the robust method (RGMWM) appears to be less efficient than the other two estimators. The latter is a known result since robust estimators usually pay a price in terms of efficiency (as an insurance against bias).
On the other hand, when checking the performance of the same methods when applied to the two contaminated processes $(Y_t)$ and $(Z_t)$ it can be seen that the standard estimators appear to be (highly) biased for most of the estimated parameters (with one exception) while the robust estimator remains close (on average) to the true parameter values that we are aiming to estimate. Therefore, when there's a suspicion that there could be some (small) contamination in the observed time series, it may be more appropriate to use a robust estimator.
To conclude this section on estimation, we now compare the above studied estimators in different applied settings where we can highlight how to assess which estimator is more appropriate according to the type of setting. For this purpose, let us start with an example we already examined in the previous chapter when discussing standard and robust estimators of the ACF, more specifically the data on monthly precipitation. As mentioned before when discussing this example, the importance of modelling precipitation data lies in the fact that it is usually used to subsequently model the entire water cycle. Common models for this purpose are either the white noise (WN) model or the AR(1) model. Let us compare the standard and robust ACF again to understand which of these two models seems more appropriate for the data at hand.
```{r, fig.asp = 0.5, fig.width = 9, fig.align='center', fig.cap="Standard (left) and robust (right) estimates of the ACF function on the monthly precipitation data (hydro)"}
compare_acf(hydro)
```
As we had underlined in the previous chapter, the standard ACF estimates would suggest that there is no correlation among lags and, consequently, that the WN model would be the most appropriate. However, the robust ACF estimates depict an entirely different picture where it can be seen that there appears to be a significant autocorrelation over different lags which decays exponentially. Although there appears to be some seasonality in the plot, we will assume that the correct model for this data is an AR(1) since that is what hydrology theory suggests. Let us therefore estimate the parameters of this model by using a standard estimator (MLE) and a robust estimator (RGMWM). The estimates for the MLE are the following:
```{r}
mle_hydro = estimate(AR(1), as.vector(hydro), method = "mle", demean = TRUE)
# MLE Estimates
c(mle_hydro$mod$coef[1], mle_hydro$mod$sigma2)
```
From these estimates it would appear that the autocorrelation between lagged variables (i.e. lags of order 1) is extremely low and that (as suggested by the standard ACF plot) a WN model may be more appropriate. Considering the robust ACF however, it is possible that the MLE estimates may not be reliable in this setting. Hence, let us use the RGMWM to estimate the same parameters.
```{r}
rgmwm_hydro = estimate(AR(1), hydro, method = "rgmwm")$mod$estimate
# RGMWM Estimates
t(rgmwm_hydro)
```
In this case, we see how the autocorrelation between lagged values is much higher (0.4 compared to 0.06) indicating that there is a stronger dependence in the data than what is suggested by standard estimators. Moreover, the innovation variance is smaller compared to that of the MLE. This is also a known phenomenon when there's contamination in the data since it leads to less dependence and more variability being detected by non-robust estimators. This estimate of the variance also has a considerable impact on forecast precision (as we'll see in the next section).
A final applied example that highlights the (potential) difference between estimators according to the type of setting is given by the "Recruitment" data set (in the `astsa` library). This data refers to the number of new fish (recruitment) in a fish population of the Pacific Ocean and is often linked to the currents and temperatures passing through the ocean. As for the previous data set, let us take a look at the data itself and then analyse the standard and robust estimates of the ACF.
```{r, echo = FALSE, message=FALSE}
library(astsa)
```
```{r, fig.asp = 0.5, fig.width = 8, fig.align='center', fig.cap="Plot of the time series on fish recruitment monthly data in the Pacific Ocean from 1950 to 1987"}
# Format data
fish = gts(rec, start = 1950, freq = 12, unit_time = 'month', name_ts = 'Recruitment')
# Plot data
plot(fish)
```
```{r, fig.asp = 0.5, fig.width = 9, fig.align='center', fig.cap="Standard (left) and robust (right) estimates of the ACF function on the monthly fish recruitment data (rec)"}
compare_acf(fish)
```
```{r, echo=FALSE}
par(mfrow = c(1,1))
```
We can see that there appears to be considerable dependence between the lagged variables which decays (in a similar way to the ACF of an AR($p$)). Also in this case we see some seasonality in the data, but we won't consider this for the purpose of this example. Given that there doesn't appear to be any significant contamination in the data, let us consider the Yule-Walker and MLE estimators. The MLE strongly depends on the assumed parametric distribution of the time series (usually Gaussian) and, if this assumption is not respected, the resulting estimates could be unreliable. Hence, a first difference of the time series can often give an idea of its marginal distribution.
```{r, fig.asp = 0.5, fig.width = 8, fig.align='center', fig.cap="Plot of the first difference of the time series on fish recruitment monthly data in the Pacific Ocean from 1950 to 1987"}
# Take first differencing of the recruitment data
diff_fish = gts(diff(rec), start = 1950, freq = 12, unit_time = 'month', name_ts = 'Recruitment')
# Plot first differencing of the recruitment data
plot(diff_fish)
```
From the plot we can see that observations appear to be collected around a constant value and fewer appear to be further from this value (as would be the case for a normal distribution). However, several of these "more extreme" observations appear to be quite frequent, suggesting that the underlying distribution may have heavier tails compared to the normal distribution. Let us compare the standard and robust ACF estimates of this new time series.
```{r, fig.asp = 0.5, fig.width = 9, fig.align='center', fig.cap="Standard (left) and robust (right) estimates of the ACF function on the first difference of the monthly fish recruitment data (rec)"}
compare_acf(diff_fish)
```
In this case we see that the patterns appear to be the same but the values between the standard and robust estimates are slightly (to moderately) different over different lags. This would suggest that there could be some contamination in the data or, in any case, that the normal assumption may not hold exactly. With this in mind, let us estimate an AR(2) model for this data using the Yule-Walker and MLE.
```{r}
# Yule-Walker estimation of the Recruitment data
yw_fish = estimate(AR(2), rec, method = "yule-walker", demean = TRUE)
# MLE of the Recruitment data
mle_fish = estimate(AR(2), rec, method = "mle", demean = TRUE)
# Compare estimates
# Yule-Walker Estimation
c(yw_fish$mod$coef, yw_fish$mod$sigma2)
# MLE Estimation
c(mle_fish$mod$coef, mle_fish$mod$sigma2)
```
It can be seen that, in this setting, the two estimators deliver very similar results (at least in terms of the $\phi_1$ and $\phi_2$ coefficients). Indeed, there doesn't appear to be a strong need for robust estimation and the choice of a standard estimator is justified by the desire to obtain efficient estimates. The only slight difference between the two estimations is in the innovation variance parameter $\sigma^2$ and this could possibly be due to the normality assumption that the MLE upholds in this case. If there is therefore a doubt as to whether the Gaussian assumption holds for this data, then it is probably more convenient to use the Yule-Walker estimates.
Until now we have focussed on estimation based on an assumed model. However, how do we choose a model? How can we make inference on the models and their parameters? To perform all these tasks we will need to compute residuals (as, for example, in the linear regression framework). In order to obtain residuals, we need to be able to predict (forecast) values of the time series and, consequently, the next section focuses on forecasting time series.
### Forecasting AR(p) Models
One of the most interesting aspects of time series analysis is to predict the future unobserved values based on the values that have been observed up to now. However, this is not possible if the underlying (parametric) model is unknown, thus in this section we assume, for the purpose of illustration, that the time series $(X_t)$ is **stationary** and its model is known. In particular, we denote forecasts by $X^{t}_{t+j}$, where $t$ represents the time point from which we would like to make a forecast assuming we have an *observed* time series (e.g. $\mathbf{X} = (X_{1}, X_{2}, \cdots , X_{t-1}, X_t)$) and $j$ represents the $j^{th}$-ahead future value we wish to predict. So, $X^{t}_{t+1}$ represents a one-step-ahead prediction of $X_{t+1}$ given data $(X_{1}, X_{2}, \cdots, X_{t-1}, X_{t})$.
Let us now focus on defining a prediction operator and, for this purpose, let us define the Mean Squared Prediction Error (MSPE) as follows:
$$\mathbb{E}[(X_{t+j} - X^{t}_{t+j})^2] .$$
Intuitively, the MSPE measures the squared distance (i.e. always positive) between the actual future values and the corresponding predictions. Ideally, we would want this measure to be equal to zero (meaning that we don't make any prediction errors) but, if this is not possible, we would like to define a predictor that has the smallest MSPE among all possible predictors. The MSPE is not necessarily the best measure of prediction accuracy since, for example, it can be greatly affected by outliers and doesn't take into account the relative importance of positive or negative prediction errors (e.g. for risks in finance or insurance). Nevertheless, it is a good overall measure of forecast accuracy and the next theorem states what the best predictor is for this measure.
```{theorem, label="condexp", name="Minimum Mean Squared Error Predictor"}
Let us define
\[X_{t + j}^t \equiv \mathbb{E}\left[ {{X_{t + j}}|{X_t}, \cdots ,{X_1}} \right], j > 0\]
Then \[\mathbb{E}\left[ {{{\left( {{X_{t + j}} - m\left( {{X_1}, \cdots ,{X_t}} \right)} \right)}^2}} \right] \ge \mathbb{E}\left[ {{{\left( {{X_{t + j}} - X_{t + j}^t} \right)}^2}} \right]\] for any function $m(.)$.
```
<br>
The proof of this theorem can be found in Appendix \@ref(appendixc). Although this theorem defines the best possible predictor, there can be many functional forms for this operator. We first restrict our attention to the set of linear predictors defined as
$$X_{t+j}^t = \sum_{i=1}^t \alpha_i X_i,$$
where $\alpha_i \in \mathbb{R}$. It can be noticed, for example, that the $\alpha_i$'s are not necessarily the same for different values of $t$ and $j$ (i.e. they depend on the time point from which you want to predict and on how far into the future you predict). Another aspect to notice is that, if the model underlying the observed time series can be expressed in the form of a linear operator (e.g. a linear process), then we can derive a linear predictor from this framework.
<!-- Considering the above, let us now give the following theorem which provides further insight into linear prediction. -->
<!-- ```{theorem, label="projtheo", name="Projection Theorem"} -->
<!-- Let $\mathcal{M} \subset \mathcal{L}_2$ be a closed linear subspace of a Hibert space. -->
<!-- For every $y \in \mathcal{L}_2$, there exists a unique element $\hat{y} \in \mathcal{M}$ that -->
<!-- minimizes $||y - z||^2$ over $z \in \mathcal{M}$. This element is uniquely -->
<!-- determined by the requirements -->
<!-- 1. $\hat{y} \in \mathcal{M}$ and -->
<!-- 2. $(y - \hat{y}) \perp \mathcal{M}$. -->
<!-- ``` -->
<!-- Based on this theorem, it is possible to define the Best Linear Predictor (BLP) for stationary processes. -->
<!-- ```{definition, label="BLP"} -->
<!-- The best linear predictor $X_{t+j}^t = \sum_{i=1}^t \alpha_i X_i$, for $j \geq 1$ is found by solving -->
<!-- \[ \mathbb{E} [(X_{t+h} - \hat{X}_{t+h})X_i ] = 0, \mbox{ for } i = 1, \dots, t.\] -->
<!-- ``` -->
<!-- If we denote $\mathbb{E}(X_{i}, X_{j})$ as $\gamma(|i - j|)$, these prediction equations can alternatively be represented in the following form: -->
<!-- \begin{equation} -->
<!-- \begin{aligned} -->
<!-- \begin{pmatrix} -->
<!-- \gamma(0) & \gamma(1) & \cdots & \gamma(T-1) \\ -->
<!-- \gamma(1) & \gamma(0) & \cdots & \gamma(T-2) \\ -->
<!-- \vdots & \vdots & \ddots & \vdots \\ -->
<!-- \gamma(T-1) & \gamma(T-2) & \cdots &\gamma(0) -->
<!-- \end{pmatrix}_{T \times T} -->
<!-- \begin{pmatrix} -->
<!-- \alpha_1 \\ -->
<!-- \vdots \\ -->
<!-- \alpha_T -->
<!-- \end{pmatrix}_{T \times 1} -->
<!-- &= -->
<!-- \begin{pmatrix} -->
<!-- \gamma(1) \\ -->
<!-- \vdots \\ -->
<!-- \gamma(T) -->
<!-- \end{pmatrix}_{T \times 1} , -->
<!-- \end{aligned} -->
<!-- \end{equation} -->
<!-- which can be written in a compact matrix form as $\Gamma_T \mathbf{\alpha}_T = \mathbf{\gamma}_T$. Assuming that $\Gamma_T$ is non-singular, then the values of $\mathbf{\alpha}_T$ are -->
<!-- given by: -->
<!-- $$\mathbf{\alpha}_T = \Gamma^{-1}_T\mathbf{\gamma}_T.$$ -->
<!-- which delivers the linear predictor -->
<!-- $$X_{t+j}^t = \alpha_T^T \mathbf{X}^* .$$ -->
<!-- where $\mathbf{X}^* = (X_{T}, X_{T-1}, \cdots , X_{2}, X_1)$. We can now examine these predictors for the AR($p$) models that we've studied this far (starting from the AR(1)). -->
Let us consider some examples to understand how these linear predictors can be delivered based on the AR(p) models studied this far. For this purpose, let us start with an AR(1) process.
```{example, label="ar1forecast", name="Forecasting with an AR(1) model"}
The AR(1) model is defined as follows:
$${X_t} = \phi {X_{t - 1}} + {W_t},$$
where $W_t \sim WN(0, \sigma^2)$.
From here, the conditional expected mean and variance are given by
\[\begin{aligned}
\mathbb{E}\left[ X_{t + j} | \Omega_t \right] &= \mathbb{E}[\phi X_{t+j-1} + W_{t+j}\,| \, \Omega_t ] \\
&= ... = {\phi ^j}{X_t} \\
\operatorname{var}\left[ {{X_{t + j}}} | \Omega_t \right] &= \left( {1 + {\phi ^2} + {\phi ^4} + \cdots + {\phi ^{2\left( {j - 1} \right)}}} \right){\sigma ^2} \\
\end{aligned} \]
Within this derivation, it is important to remember that
\[\begin{aligned}
\mathop {\lim }\limits_{j \to \infty } \mathbb{E}\left[ {{X_{t + j}}} | \Omega_t \right] &= \mathbb{E}\left[ {{X_t}} \right] = 0 \\
\mathop {\lim }\limits_{j \to \infty } \operatorname{var}\left[ {{X_{t + j}}} | \Omega_t \right] &= \operatorname{var} \left( {{X_t}} \right) = \frac{{{\sigma ^2}}}{{1 - {\phi ^2}}}
\end{aligned} \]
From these last results we see that the forecast for an AR(1) process is "mean-reverting" since in general we have that the AR(1) can be written with respect to the mean $\mu$ as follows
$$(X_t - \mu) = \phi (X_{t-1} - \mu) + W_t,$$
which gives
$$X_t = \mu + \phi (X_{t-1} - \mu) + W_t,$$
and consequently
\[\begin{aligned}
\mathbb{E}[X_{t+j} | \Omega_t] &= \mu + \mathbb{E}[(X_{t+j} - \mu) | \Omega_t] \\
&= \mu + \phi^j(X_t - \mu),
\end{aligned} \]
which tends to $\mu$ as $j \to \infty$
```
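A minimal sketch of these mean-reverting forecasts (using base `R`'s `arima()` and `predict()` functions on a simulated AR(1) with parameters chosen purely for illustration) is the following:
```{r, eval = FALSE}
# j-step ahead forecasts of an AR(1): E[X_{t+j} | Omega_t] = mu + phi^j (X_t - mu)
phi = 0.8
mu = 2
sigma2 = 1

set.seed(5)
x = mu + as.numeric(arima.sim(model = list(ar = phi), n = 200, sd = sqrt(sigma2)))

j = 1:20
pred = mu + phi^j*(x[200] - mu)              # point forecasts (known parameters)
pred_var = sigma2*cumsum(phi^(2*(j - 1)))    # their conditional variances

# Forecasts from a fitted AR(1): close to (but not identical with) the values
# above since the parameters are estimated
fit = arima(x, order = c(1, 0, 0))
predict(fit, n.ahead = 20)$pred
```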
Let us now consider a more complicated model, i.e. an AR(2) model.
```{example, label="ar2forecast", name="Forecasting with an AR(2) model"}
Consider an AR(2) process defined as follows:
$${X_t} = {\phi _1}{X_{t - 1}} + {\phi _2}{X_{t - 2}} + {W_t},$$
where $W_t \sim WN(0, \sigma^2)$.
Based on this process we are able to find the predictor for each j-step ahead prediction using the following approach:
\[\begin{aligned}
\mathbb{E}\left[ {{X_{t + 1}}} | \Omega_t \right] &= {\phi _1}{X_t} + {\phi _2}{X_{t - 1}} \\
\mathbb{E}\left[ {{X_{t + 2}}} | \Omega_t \right] &= \mathbb{E}\left[ {{\phi _1}{X_{t + 1}} + {\phi _2}{X_t}} | \Omega_t \right] = {\phi _1} \mathbb{E}\left[ {{X_{t + 1}}} | \Omega_t \right] + {\phi _2}{X_t} \\
&= {\phi _1}\left( {{\phi _1}{X_t} + {\phi _2}{X_{t - 1}}} \right) + {\phi _2}{X_t} \\
&= \left( {\phi _1^2 + {\phi _2}} \right){X_t} + {\phi _1}{\phi _2}{X_{t - 1}}. \\
\end{aligned} \]
```
Having seen how to build forecasts for AR(1) and AR(2) models, let us now generalise this approach to an AR(p) using matrix notation.
```{example, label="arpforecast", name="Forecasting with an AR(p) model"}
Consider an AR(p) process defined as:
$${X_t} = {\phi _1}{X_{t - 1}} + {\phi _2}{X_{t - 2}} + \cdots + {\phi _p}{X_{t - p}} + {W_t},$$
where $W_t \sim WN(0, \sigma^2_W)$.
The process can be rearranged into matrix form as follows:
\[\begin{aligned}
\underbrace {\left[ {\begin{array}{*{20}{c}}
{{X_t}} \\
\vdots \\
\vdots \\
{{X_{t - p + 1}}}
\end{array}} \right]}_{{Y_t}} &= \underbrace {\left[ {\begin{array}{*{20}{c}}
{{\phi _1}}& \cdots &{}&{{\phi _p}} \\
{}&{}&{}&0 \\
{}&{{I_{p - 1}}}&{}& \vdots \\
{}&{}&{}&0
\end{array}} \right]}_A\underbrace {\left[ {\begin{array}{*{20}{c}}
{{X_{t - 1}}} \\
\vdots \\
\vdots \\
{{X_{t - p}}}
\end{array}} \right]}_{{Y_{t - 1}}} + \underbrace {\left[ {\begin{array}{*{20}{c}}
1 \\
0 \\
\vdots \\
0
\end{array}} \right]}_C{W_t} \\
{Y_t} &= A{Y_{t - 1}} + C{W_t}
\end{aligned}\]
From here, the conditional expectation and variance can be computed as follows:
\[\begin{aligned}
{E}\left[ {{Y_{t + j}}} | \Omega_t \right] &= {E}\left[ {A{Y_{t + j - 1}} + C{W_{t + j}}} | \Omega_t\right] = {E}\left[ {A{Y_{t + j - 1}}} | \Omega_t \right] + \underbrace {{E}\left[ {C{W_{t + j}}} | \Omega_t \right]}_{ = 0} \\
&= {E}\left[ {A\left( {A{Y_{t + j - 2}} + C{W_{t + j - 1}}} \right)} | \Omega_t \right] = {E}\left[ {{A^2}{Y_{t + j - 2}}} | \Omega_t\right] = \ldots = {A^j}{Y_t} \\
{\operatorname{var}}\left[ {{Y_{t + j}}} | \Omega_t\right] &= {\operatorname{var}}\left[ {A{Y_{t + j - 1}} + C{W_{t + j}}} | \Omega_t \right] \\
&= {\sigma ^2}C{C^T} + {\operatorname{var}}\left[ {A{Y_{t + j - 1}}} | \Omega_t \right] = {\sigma ^2}C{C^T} + A{\operatorname{var}}\left[ {{Y_{t + j - 1}}} | \Omega_t \right]{A^T} \\
&= {\sigma ^2}C{C^T} + {\sigma ^2}AC{C^T}{A^T} + {A^2}{\operatorname{var}}\left[ {{Y_{t + j - 2}}} | \Omega_t \right]{\left( {{A^2}} \right)^T} \\
&= {\sigma ^2}\sum\limits_{k = 0}^{j - 1} {{A^k}C{C^T}{{\left( {{A^k}} \right)}^T}} \\
\end{aligned} \]
Considering the recursive pattern coming from the expressions of the conditional expectation and variance, the predictions can be obtained via the following recursive formulation:
\[\begin{aligned}
{E}\left[ {{Y_{t + j}}} | \Omega_t \right] &= A{E}\left[ {{Y_{t + j - 1}}} | \Omega_t\right] \\
{\operatorname{var}}\left[ {{Y_{t + j}}} | \Omega_t \right] &= {\sigma ^2}C{C^T} + A{\operatorname{var}}\left[ {{Y_{t + j - 1}}} | \Omega_t \right]{A^T} \\
\end{aligned} \]
```
With this recursive expression we can now compute the conditional expectation and variance of different AR(p) models. Let us therefore revisit the previous AR(2) example using this recursive formula.
```{example, label="rear2forecast", name="Forecasting with an AR(2) in Matrix Form"}
We can rewrite our previous example of the predictions for an AR(2) process as follows
\[\begin{aligned} \underbrace {\left[ {\begin{array}{*{20}{c}}
{{X_t}} \\
{{X_{t - 1}}}
\end{array}} \right]}_{{Y_t}} &= \underbrace {\left[ {\begin{array}{*{20}{c}}
{{\phi _1}}&{{\phi _2}} \\
1&0
\end{array}} \right]}_A\underbrace {\left[ {\begin{array}{*{20}{c}}
{{X_{t - 1}}} \\
{{X_{t - 2}}}
\end{array}} \right]}_{{Y_{t - 1}}} + \underbrace {\left[ {\begin{array}{*{20}{c}}
1 \\
0
\end{array}} \right]}_C{W_t} \\
{Y_t} &= A{Y_{t - 1}} + C{W_t}
\end{aligned}\]
Then, we are able to calculate the prediction as
\[\begin{aligned}
{E}\left[ {{Y_{t + 2}}} | \Omega_t \right] &= {A^2}{Y_t} = \left[ {\begin{array}{*{20}{c}}
{{\phi _1}}&{{\phi _2}} \\
1&0
\end{array}} \right]\left[ {\begin{array}{*{20}{c}}
{{\phi _1}}&{{\phi _2}} \\
1&0
\end{array}} \right]\left[ {\begin{array}{*{20}{c}}
{{X_t}} \\
{{X_{t - 1}}}
\end{array}} \right] \hfill \\
&= \left[ {\begin{array}{*{20}{c}}
{\phi _1^2 + {\phi _2}}&{{\phi _1}{\phi _2}} \\
{{\phi _1}}&{{\phi _2}}
\end{array}} \right]\left[ {\begin{array}{*{20}{c}}
{{X_t}} \\
{{X_{t - 1}}}
\end{array}} \right] = \left[ {\begin{array}{*{20}{c}}
{\left( {\phi _1^2 + {\phi _2}} \right){X_t} + {\phi _1}{\phi _2}{X_{t - 1}}} \\
{{\phi _1}{X_t} + {\phi _2}{X_{t - 1}}}
\end{array}} \right] \hfill \\
\end{aligned} \]
```
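A minimal sketch of this recursion (in base `R`, with the companion matrix $A$ of an AR(2) and last observed values chosen purely for illustration) is given below:
```{r, eval = FALSE}
# Recursive j-step ahead forecasts for an AR(2) in companion (matrix) form
phi = c(0.75, 0.2)
sigma2 = 1
A = rbind(phi, c(1, 0))   # companion matrix
C = c(1, 0)

Y = c(1.5, 0.7)           # (X_t, X_{t-1}): hypothetical last two observations
V = matrix(0, 2, 2)       # var[Y_t | Omega_t] = 0

pred = pred_var = numeric(10)
for (j in 1:10) {
  Y = A %*% Y                                    # E[Y_{t+j} | Omega_t] = A E[Y_{t+j-1} | Omega_t]
  V = sigma2*tcrossprod(C) + A %*% V %*% t(A)    # var[Y_{t+j} | Omega_t]
  pred[j] = Y[1]                                 # forecast of X_{t+j}
  pred_var[j] = V[1, 1]                          # its conditional variance
}
cbind(pred, pred_var)
```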
The above examples have given insight as to how to compute predictors for the AR(p) models that we have studied thus far. Let us now observe the consequences of using such an approach to predict future values from these models. For this purpose, using the recursive formulation seen in the examples above, we perform a simulation study where, for a fixed set of parameters, we simulate 5000 future trajectories of an observed time series, each predicting 50 values ahead into the future. The known parameters for the AR(2) process we use for the simulation study are $\phi _1 = 0.75$, $\phi _2 = 0.2$ and $\sigma^2 = 1$. The figure below shows the distribution of these predictions starting from the last observation $T = 200$.
```{r predplot, cache = TRUE, echo = FALSE, fig.cap="Simulated AR(2) predictions with the pink line representing the median prediction, while the red and green lines correspond, respectively, to the 5%-95% and 25%-75% empirical quantiles of the simulated predictions."}
gg_color_hue <- function(n, alpha = 1) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100, alpha = alpha)[1:n]
}
color = gg_color_hue(5, alpha = 1)
color2 = gg_color_hue(3, alpha = 0.01)
set.seed(2)
Xt = gen_gts(n = 200, AR(phi = c(0.75, 0.2), sigma2 = 1))
plot(NA, xlim = c(0, 250), ylim = c(-10, 15), ylab = "Prediction", xlab = "Time")
grid()
lines(1:200, Xt, col = color[4])
B = 5000
mat = matrix(0, B, 52)
mat[, 1] = rep(Xt[199], B)
mat[, 2] = rep(Xt[200], B)
for (i in 1:B){
  for (j in 3:52){
    mat[i, j] = 0.75*mat[i, (j-1)] + 0.2*mat[i, (j-2)] + rnorm(1)
  }
  lines(200:250, mat[i, 2:52], col = color2[3])
}
quant = matrix(NA, 51, 5)
quant[1, ] = rep(Xt[200], 1)
for (i in 2:51){
  quant[i, ] = quantile(mat[, i+1], c(0.95, 0.75, 0.5, 0.25, 0.05))
}
lines(200:250, quant[, 1], col = color[1], lty = 1)
lines(200:250, quant[, 5], col = color[1], lty = 1)
lines(200:250, quant[, 2], col = color[3], lty = 1)
lines(200:250, quant[, 4], col = color[3], lty = 1)
lines(200:250, quant[, 3], col = color[5], lty = 1)
```
It can be observed that, as hinted by the expressions for the variance of the predictions (in the examples above), the variability of the predictions increases as we try to predict further into the future.
Having defined the basic concepts for forecasting time series, we can now move to another topic of considerable importance for time series analysis. Indeed, the discussion on prediction is not only important to derive forecasts for future values of the phenomena one may be interested in, but also to understand how well a model explains (and predicts) an observed time series. In fact, predictions allow us to compute residuals within the time series setting and, based on these residuals, we can obtain different inference and diagnostic tools.
Given the knowledge on predictions seen above, residuals can be computed as
$$r_{t+j} = X_{t+j} - \hat{X}_{t+j},$$
where $\hat{X}_{t+j}$ represents an estimator of the conditional expectation $E[X_{t+j} | \Omega_{t}]$. The latter quantity will depend on the model underlying the time series and, assuming we know the true model, we could use the true parameters to obtain a value for this expectation. However, in an applied setting it is improbable that we know the true parameters and therefore $\hat{X}_{t+j}$ can be obtained by estimating the parameters (as seen in the previous sections) and then plugging them into the expression of $E[X_{t+j} | \Omega_{t}]$.
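As a small sketch of this idea (using base `R`'s `arima()` function on the Recruitment data seen earlier, with an AR(2) chosen purely for illustration), one-step ahead residuals can be extracted from a fitted model as follows:
```{r, eval = FALSE}
# One-step ahead residuals r_t = X_t - X_t^{t-1} from a fitted AR(2)
fit = arima(rec, order = c(2, 0, 0), method = "ML")  # rec: Recruitment data (astsa)
res = residuals(fit)                                 # X_t minus its one-step prediction

# A first look at the residuals before applying formal diagnostic tools
plot(res)
acf(res, plot = FALSE)
```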
## Diagnostic Tools for Time Series