ARIMA model example

Autoregressive integrated moving average (ARIMA) models combine the abilities of AUTOREGRESSOR and MOVING_AVERAGE models by making predictions based on both preceding time series values and errors of previous predictions. ARIMA models also provide the option to apply a differencing operation to the input data, which can turn a non-stationary time series into a stationary time series. At model training time, you specify the differencing order and the number of preceding values and previous prediction errors that the model uses to calculate predictions.

You can use the following functions to train and make predictions with ARIMA models:

ARIMA: Creates and trains an ARIMA model
PREDICT_ARIMA: Applies a trained ARIMA model to an input relation or makes predictions using the in-sample data

These functions require time series data with consistent timesteps. To normalize a time series with inconsistent timesteps, see Gap filling and interpolation (GFI).
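As a quick illustration of what "consistent timesteps" means, the following Python sketch checks whether every consecutive pair of timestamps is the same distance apart. It runs outside the database and is not part of the ARIMA functions; the helper name has_consistent_timesteps and the series with a missing day are hypothetical, chosen only for this example.

```python
from datetime import datetime

def has_consistent_timesteps(timestamps):
    """Return True if every consecutive pair of timestamps is the same distance apart."""
    gaps = {later - earlier for earlier, later in zip(timestamps, timestamps[1:])}
    return len(gaps) <= 1

# First five timestamps of the daily temperature data used in this example
daily = [datetime(1981, 1, day) for day in range(1, 6)]
print(has_consistent_timesteps(daily))  # True

# Hypothetical series with January 3 missing
gappy = [t for t in daily if t.day != 3]
print(has_consistent_timesteps(gappy))  # False
```

A series like the second one would need gap filling before it could be used for training.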

The following example trains three ARIMA models, two that use differencing and one that does not, and then makes predictions using the models.

Load the training data

Before you begin the example, load the Machine Learning sample data.

This example uses the following data:

daily-min-temperatures: provided in the machine learning sample data, this dataset contains data on the daily minimum temperature in Melbourne, Australia from 1981 through 1990. After you load the sample datasets, this data is available in the temp_data table.
db_size: a table that tracks the size of a database over consecutive months.

=> SELECT * FROM temp_data;
        time         | Temperature
---------------------+-------------
 1981-01-01 00:00:00 |        20.7
 1981-01-02 00:00:00 |        17.9
 1981-01-03 00:00:00 |        18.8
 1981-01-04 00:00:00 |        14.6
 1981-01-05 00:00:00 |        15.8
 ...
 1990-12-27 00:00:00 |          14
 1990-12-28 00:00:00 |        13.6
 1990-12-29 00:00:00 |        13.5
 1990-12-30 00:00:00 |        15.7
 1990-12-31 00:00:00 |          13
(3650 rows)

=> SELECT COUNT(*) FROM temp_data;
 COUNT
-------
  3650
(1 row)

=> SELECT * FROM db_size;
 month | GB
-------+-----
     1 |   5
     2 |  10
     3 |  20
     4 |  35
     5 |  55
     6 |  80
     7 | 110
     8 | 145
     9 | 185
    10 | 230
(10 rows)

Train the ARIMA models

After you load the daily-min-temperatures data, you can use the ARIMA function to create and train an ARIMA model. For this example, the model is trained with lags of p=3 and q=3, taking the value and prediction error of three previous time steps into account for each prediction. Because the input time series is stationary, you don't need to apply differencing to the data:

=> SELECT ARIMA('arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=3, d=0, q=3);
                             ARIMA
--------------------------------------------------------------
Finished in 20 iterations.
3650 elements accepted, 0 elements rejected.

(1 row)

You can view a summary of the model with the GET_MODEL_SUMMARY function:

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='arima_temp');
               GET_MODEL_SUMMARY
-----------------------------------------------

============
coefficients
============
parameter| value
---------+--------
  phi_1  | 0.64189
  phi_2  | 0.46667
  phi_3  |-0.11777
 theta_1 |-0.05109
 theta_2 |-0.58699
 theta_3 |-0.15882


==============
regularization
==============
none

===============
timeseries_name
===============
temperature

==============
timestamp_name
==============
time

==============
missing_method
==============
linear_interpolation

===========
call_string
===========
ARIMA('public.arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=3, d=0, q=3, missing='linear_interpolation', init_method='Zero', epsilon=1e-06, max_iterations=100);

===============
Additional Info
===============
       Name       | Value
------------------+--------
        p         |   3
        q         |   3
        d         |   0
       mean       |11.17775
      lambda      | 1.00000
mean_squared_error| 5.80490
rejected_row_count|   0
accepted_row_count|  3650

(1 row)
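By the usual convention, the phi coefficients in the summary weight the preceding time series values and the theta coefficients weight the previous prediction errors. The following Python sketch shows how such coefficients could combine into a one-step prediction using the common textbook ARMA form. It is an assumption that this form and sign convention match what the ARIMA function computes internally, so treat it as an illustration of the idea rather than a reimplementation. The function name arma_one_step and the zero-valued prediction errors are hypothetical.

```python
def arma_one_step(recent_values, recent_errors, phi, theta, mean):
    """Textbook ARMA one-step forecast.

    recent_values and recent_errors are ordered most recent first, so
    recent_values[0] pairs with phi[0] (phi_1), and so on.
    """
    ar_part = sum(p * (x - mean) for p, x in zip(phi, recent_values))
    ma_part = sum(t * e for t, e in zip(theta, recent_errors))
    return mean + ar_part + ma_part

# Coefficients and mean copied from the arima_temp summary above
phi = [0.64189, 0.46667, -0.11777]
theta = [-0.05109, -0.58699, -0.15882]
mean = 11.17775

# Last three temperatures in temp_data, most recent first.
# The prediction errors are hypothetical placeholders (assumed zero).
forecast = arma_one_step([13.0, 15.7, 13.5], [0.0, 0.0, 0.0], phi, theta, mean)
print(round(forecast, 4))
```

Because the real prediction errors are not shown in this example, the printed value is not expected to match the PREDICT_ARIMA output.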

The db_size table shows a clear upward trend in database size over time: each month, the database grows by five gigabytes more than it grew in the previous month. This trend indicates that the time series is non-stationary.

To account for this in the ARIMA model, you must difference the data by setting a non-zero d parameter value. For comparison, two ARIMA models are trained on this data, the first with a d value of one and the second with a d value of two:

=> SELECT ARIMA('arima_d1', 'db_size', 'GB', 'month' USING PARAMETERS p=2, d=1, q=2);
                                 ARIMA
------------------------------------------------------------------------
 Finished in 9 iterations.
 10 elements accepted, 0 elements rejected.

(1 row)

=> SELECT ARIMA('arima_d2', 'db_size', 'GB', 'month' USING PARAMETERS p=2, d=2, q=2);
                                 ARIMA
------------------------------------------------------------------------
 Finished in 0 iterations.
 10 elements accepted, 0 elements rejected.

(1 row)
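To see what the differencing order does to the db_size data, you can difference the series by hand. The following Python sketch runs outside the database and uses the ten GB values shown earlier; the helper name difference is hypothetical. It only illustrates the differencing operation and does not reproduce how the ARIMA function applies it.

```python
def difference(series, order=1):
    """Apply first-order differencing `order` times."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# GB column of the db_size table
db_size_gb = [5, 10, 20, 35, 55, 80, 110, 145, 185, 230]

print(difference(db_size_gb, order=1))  # [5, 10, 15, 20, 25, 30, 35, 40, 45]
print(difference(db_size_gb, order=2))  # [5, 5, 5, 5, 5, 5, 5, 5]
```

After one round of differencing the values still trend upward, while after two rounds the series is constant.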

Make predictions

After you train the ARIMA models, you can call the PREDICT_ARIMA function to predict future time series values. This function supports making predictions using the in-sample data that the models were trained on or applying the model to an input relation.

Using in-sample data

The following PREDICT_ARIMA call makes temperature predictions using the in-sample data that the arima_temp model was trained on. The model begins prediction at the end of the temp_data table and returns predicted values for ten timesteps:

=> SELECT PREDICT_ARIMA(USING PARAMETERS model_name='arima_temp', start=0, npredictions=10) OVER();
 index |   prediction
-------+------------------
     1 | 12.9745063293842
     2 | 13.4389080858551
     3 | 13.3955791360528
     4 | 13.3551146487462
     5 | 13.3149336514747
     6 | 13.2750516811057
     7 | 13.2354710353376
     8 | 13.1961939790513
     9 | 13.1572226788109
    10 | 13.1185592045127
(10 rows)

For both prediction methods, if you want the function to return the standard error of each prediction, you can set output_standard_errors to true:

=> SELECT PREDICT_ARIMA(USING PARAMETERS model_name='arima_temp', start=0, npredictions=10, output_standard_errors=true) OVER();
 index |    prediction    |     std_err
-------+------------------+------------------
     1 | 12.9745063293842 | 1.00621890780865
     2 | 13.4389080858551 | 1.45340836833232
     3 | 13.3955791360528 | 1.61041524562932
     4 | 13.3551146487462 | 1.76368421116143
     5 | 13.3149336514747 | 1.91223938476627
     6 | 13.2750516811057 | 2.05618464609977
     7 | 13.2354710353376 | 2.19561771498385
     8 | 13.1961939790513 | 2.33063553781651
     9 | 13.1572226788109 | 2.46133422924445
    10 | 13.1185592045127 | 2.58780904243988
(10 rows)
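One common use of the standard errors is to build an approximate interval around each prediction. The following Python sketch assumes normally distributed prediction errors and uses the conventional multiplier of 1.96 for a roughly 95% interval. Both are assumptions made for illustration and are not stated by the function's output, and the helper name approx_interval is hypothetical.

```python
def approx_interval(prediction, std_err, z=1.96):
    """Return (lower, upper) bounds of prediction +/- z standard errors."""
    return prediction - z * std_err, prediction + z * std_err

# First and tenth rows of the output above
rows = [
    (1, 12.9745063293842, 1.00621890780865),
    (10, 13.1185592045127, 2.58780904243988),
]

for index, prediction, std_err in rows:
    lower, upper = approx_interval(prediction, std_err)
    print(f"{index}: {lower:.2f} to {upper:.2f}")
```

The interval widens from about 11.00 to 14.95 for the first prediction to about 8.05 to 18.19 for the tenth, reflecting how the standard error grows the further ahead the model predicts.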

To make predictions with the two models trained on the db_size table, you only need to change the specified model_name in the above calls:

=> SELECT PREDICT_ARIMA(USING PARAMETERS model_name='arima_d1', start=0, npredictions=10) OVER();
 index |    prediction
-------+------------------
     1 | 279.882778508943
     2 | 334.398317856829
     3 | 393.204492820962
     4 | 455.909453114272
     5 | 522.076165355683
     6 | 591.227478668175
     7 | 662.851655189833
     8 | 736.408301395412
     9 | 811.334631481162
    10 | 887.051990217688
(10 rows)

=> SELECT PREDICT_ARIMA(USING PARAMETERS model_name='arima_d2', start=0, npredictions=10) OVER();
 index | prediction
-------+------------
     1 | 280
     2 | 335
     3 | 395
     4 | 460
     5 | 530
     6 | 605
     7 | 685
     8 | 770
     9 | 860
    10 | 955
(10 rows)

Comparing the outputs from the two models, you can see that the model trained with a d value of two correctly captures the trend in the data. Each month the rate of database growth increases by five gigabytes.
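You can check the arima_d2 predictions against the trend by hand. The following Python sketch extends the db_size series under the assumption that the second difference stays constant at its last observed value. This is a plain arithmetic extrapolation run outside the database, not the model's own computation, and the helper name extend_constant_second_difference is hypothetical.

```python
def extend_constant_second_difference(series, steps):
    """Extend a series by assuming its last second difference stays constant."""
    values = list(series)
    second_diff = (values[-1] - values[-2]) - (values[-2] - values[-3])
    for _ in range(steps):
        next_step = (values[-1] - values[-2]) + second_diff
        values.append(values[-1] + next_step)
    return values[len(series):]

# GB column of the db_size table
db_size_gb = [5, 10, 20, 35, 55, 80, 110, 145, 185, 230]

print(extend_constant_second_difference(db_size_gb, 10))
# [280, 335, 395, 460, 530, 605, 685, 770, 860, 955]
```

The extrapolated values match the arima_d2 predictions, while the arima_d1 predictions fall increasingly below them.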

Applying to an input relation

You can also apply the model to an input relation. The following example makes predictions by applying the arima_temp model to the temp_data training set:

=> SELECT PREDICT_ARIMA(temperature USING PARAMETERS model_name='arima_temp', start=3651, npredictions=10, output_standard_errors=true) OVER(ORDER BY time) FROM temp_data;
 index |    prediction    |     std_err
-------+------------------+------------------
     1 | 12.9745063293842 | 1.00621890780865
     2 | 13.4389080858551 | 1.45340836833232
     3 | 13.3955791360528 | 1.61041524562932
     4 | 13.3551146487462 | 1.76368421116143
     5 | 13.3149336514747 | 1.91223938476627
     6 | 13.2750516811057 | 2.05618464609977
     7 | 13.2354710353376 | 2.19561771498385
     8 | 13.1961939790513 | 2.33063553781651
     9 | 13.1572226788109 | 2.46133422924445
    10 | 13.1185592045127 | 2.58780904243988
(10 rows)

Because the same data and relative start index were provided to both prediction methods, the arima_temp model predictions for each method are identical.

When applying a model to an input relation, you can set add_mean to false so that the function returns the predicted difference from the mean instead of the sum of the model mean and the predicted difference:

=> SELECT PREDICT_ARIMA(temperature USING PARAMETERS model_name='arima_temp', start=3680, npredictions=10, add_mean=false) OVER(ORDER BY time) FROM temp_data;
 index |  prediction
-------+------------------
     1 | 1.2026877112171
     2 | 1.17114068517961
     3 | 1.13992534953432
     4 | 1.10904183333367
     5 | 1.0784901998692
     6 | 1.04827044781798
     7 | 1.01838251238116
     8 | 0.98882626641461
     9 | 0.959601521551628
    10 | 0.93070802931751
(10 rows)
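To convert predictions returned with add_mean=false back to absolute temperatures, you add the model mean to each value. The following Python sketch does this for the first three rows of the output above, using the mean reported in the arima_temp model summary. The summary displays the mean rounded to five decimal places, so the results are approximate.

```python
MODEL_MEAN = 11.17775  # "mean" row of the arima_temp model summary

# First three add_mean=false predictions from the output above
offsets = [1.2026877112171, 1.17114068517961, 1.13992534953432]

# Adding the mean back gives the absolute temperature predictions
absolute = [MODEL_MEAN + offset for offset in offsets]
print([round(value, 4) for value in absolute])
# [12.3804, 12.3489, 12.3177]
```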