This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Time series forecasting

Time series models are trained on stationary time series (that is, time series where the mean doesn't change over time) of stochastic processes with consistent time steps.

Time series models are trained on stationary time series (that is, time series where the mean doesn't change over time) of stochastic processes with consistent time steps. These algorithms forecast future values by taking into account the influence of values at some number of preceding timesteps (lags).

Examples of applicable datasets include those for temperature, stock prices, earthquakes, product sales, etc.

To normalize datasets with inconsistent timesteps, see Gap filling and interpolation (GFI).

1 - Autoregression algorithms

Autoregessive models use past time series values to predict future values. These models assume that a future value in a time series is dependent on its past values, and attempt to capture this relationship in a set of coefficient values.

Vertica supports both autoregression (AR) and vector autoregression (VAR) models:

  • AR is a univariate autoregressive time series algorithm that predicts a variable's future values based on its preceding values. The user specifies the number of lagged timesteps taken into account during computation, and the model then predicts future values as a linear combination of the values at each lag.
  • VAR is a multivariate autoregressive time series algorithm that captures the relationship between multiple time series variables over time. Unlike AR, which only considers a single variable, VAR models incorporate feedback between different variables in the model, enabling the model to analyze how variables interact across lagged time steps. For example, with two variables—atmospheric pressure and rain accumulation—a VAR model could determine whether a drop in pressure tends to result in rain at a future date.

The AUTOREGRESSOR function automatically executes the algorithm that fits your input data:

  • One value column: the function executes autoregression and returns a trained AR model.
  • Multiple value columns: the function executes vector autoregression and returns a trained VAR model.

1.1 - Autoregressive model example

Autoregressive models predict future values of a time series based on the preceding values. More specifically, the user-specified lag determines how many previous timesteps it takes into account during computation, and predicted values are linear combinations of the values at each lag.

Use the following functions when training and predicting with autoregressive models. Note that these functions require datasets with consistent timesteps.

To normalize datasets with inconsistent timesteps, see Gap filling and interpolation (GFI).

Example

  1. Load the datasets from the Machine-Learning-Examples repository.

    This example uses the daily-min-temperatures dataset, which contains data on the daily minimum temperature in Melbourne, Australia from 1981 through 1990:

    => SELECT * FROM temp_data;
            time         | Temperature
    ---------------------+-------------
     1981-01-01 00:00:00 |        20.7
     1981-01-02 00:00:00 |        17.9
     1981-01-03 00:00:00 |        18.8
     1981-01-04 00:00:00 |        14.6
     1981-01-05 00:00:00 |        15.8
    ...
     1990-12-27 00:00:00 |          14
     1990-12-28 00:00:00 |        13.6
     1990-12-29 00:00:00 |        13.5
     1990-12-30 00:00:00 |        15.7
     1990-12-31 00:00:00 |          13
    (3650 rows)
    
  2. Use AUTOREGRESSOR to create the autoregressive model AR_temperature from the temp_data dataset. In this case, the model is trained with a lag of p=3, taking the previous 3 entries into account for each estimation:

    => SELECT AUTOREGRESSOR('AR_temperature', 'temp_data', 'Temperature', 'time' USING PARAMETERS p=3);
                        AUTOREGRESSOR
    ---------------------------------------------------------
     Finished. 3650 elements accepted, 0 elements rejected.
    (1 row)
    

    You can view a summary of the model with GET_MODEL_SUMMARY:

    => SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='AR_temperature');
    
     GET_MODEL_SUMMARY
    -------------------
    
    ============
    coefficients
    ============
    parameter| value
    ---------+--------
      alpha  | 1.88817
    phi_(t-1)| 0.70004
    phi_(t-2)|-0.05940
    phi_(t-3)| 0.19018
    
    
    ==================
    mean_squared_error
    ==================
    not evaluated
    
    ===========
    call_string
    ===========
    autoregressor('public.AR_temperature', 'temp_data', 'temperature', 'time'
    USING PARAMETERS p=3, missing=linear_interpolation, regularization='none', lambda=1, compute_mse=false);
    
    ===============
    Additional Info
    ===============
           Name       |Value
    ------------------+-----
        lag_order     |  3
    rejected_row_count|  0
    accepted_row_count|3650
    (1 row)
    
  3. Use PREDICT_AUTOREGRESSOR to predict future temperatures. The following query starts the prediction at the end of the dataset and returns 10 predictions.

    => SELECT PREDICT_AUTOREGRESSOR(Temperature USING PARAMETERS model_name='AR_temperature', npredictions=10) OVER(ORDER BY time) FROM temp_data;
    
     index |    prediction
    -------+------------------
          1 | 12.6235419917807
          2 | 12.9387860506032
          3 | 12.6683380680058
          4 | 12.3886937385419
          5 | 12.2689506237424
          6 | 12.1503023330142
          7 | 12.0211734746741
          8 | 11.9150531529328
          9 | 11.825870404008
         10 | 11.7451846722395
    (10 rows)
    

1.2 - VAR model example

The following example trains a vector autoregression (VAR) model on an electricity usage dataset, and then makes predictions with that model.

Download and load the data

You can download the data from Mendeley Data. The dataset consists of metrics related to daily electricity load forecasting in Panama, including columns such as the following:

  • datetime: hourly datetime index corresponding to Panama time-zone (UTC-05:00)
  • nat_demand: national electricity load in Panama, measured in megawatt hours
  • T2M_toc: temperature at 2 meters in Tocumen, Panama City
  • QV2M_toc: relative humidity at 2 meters in Tocumen, Panama City
  • TQL_toc: liquid precipitation in Tocumen, Panama City
  • W2M_toc: wind speed at 2 meters in Tocumen, Panama City

The dataset also includes temperature, humidity, precipitation, and wind speed columns for two other cities in Panama: Santiago City and David City.

Unlike the AR algorithm, which can only model a single variable over time, the VAR algorithm captures the relationship between time series variables across lagged time steps. For example, when trained on the above dataset, the VAR algorithm could model whether a rise in temperature tends to lead to an increase in electricity demand hours later.

After you have downloaded the data locally, you can load the data into Vertica with the following statements:

=> CREATE TABLE electric_data(ts timestamp, nat_demand float, T2M_toc float, QV2M_toc float,
    TQL_toc float, W2M_toc float, T2M_san float, QV2M_san float, TQL_san float, W2M_san float, T2M_dav 
    float, QV2M_dav float, TQL_dav float, W2M_dav float, Holiday_ID int, holiday bool, school bool);

=> COPY electric_data (ts,nat_demand,T2M_toc,QV2M_toc,TQL_toc,W2M_toc,T2M_san,QV2M_san,TQL_san,W2M_san,
    T2M_dav,QV2M_dav,TQL_dav,W2M_dav,Holiday_ID,holiday,school) FROM LOCAL 'path-to-data' 
    DELIMITER ',';

The dataset includes some categorical columns, which need to be dropped before training a VAR model:

=> ALTER TABLE electric_data DROP column Holiday_ID;
=> ALTER TABLE electric_data DROP column holiday;
=> ALTER TABLE electric_data DROP column school;

Query the electric_data table to view a sample of the data that will be used to train the model:

=> SELECT * FROM electric_data LIMIT 3;
         ts          | nat_demand |     T2M_toc      |  QV2M_toc   |   TQL_toc    |     W2M_toc      |     T2M_san      |  QV2M_san   |   TQL_san    |     W2M_san      |     T2M_dav      |  QV2M_dav   |   TQL_dav   |     W2M_dav
---------------------+------------+------------------+-------------+--------------+------------------+------------------+-------------+--------------+------------------+------------------+-------------+-------------+------------------
 2015-01-03 01:00:00 |    970.345 | 25.8652587890625 | 0.018576382 |  0.016174316 | 21.8505458178752 | 23.4824462890625 | 0.017271755 | 0.0018553734 | 10.3289487293842 | 22.6621337890625 | 0.016562222 |  0.09609985 | 5.36414795209389
 2015-01-03 02:00:00 |   912.1755 | 25.8992553710938 | 0.018653292 |  0.016418457 | 22.1669442769882 | 23.3992553710938 | 0.017264742 | 0.0013270378 | 10.6815171370742 | 22.5789428710938 | 0.016509432 | 0.087646484 | 5.57247134626166
 2015-01-03 03:00:00 |   900.2688 | 25.9372802734375 |  0.01876786 | 0.0154800415 | 22.4549108762635 | 23.3435302734375 | 0.017211463 | 0.0014281273 | 10.8749237899741 | 22.5310302734375 | 0.016479041 |  0.07873535 | 5.87118374509993
(3 rows)

Train the model

To train a VAR model, use the AUTOREGRESSOR function. In this example, the model is trained with a lag of 3, meaning 3 previous timesteps are taken into account during computation:

=> SELECT AUTOREGRESSOR ( 'var_electric', 'electric_data', 'nat_demand, T2M_toc, QV2M_toc, TQL_toc, W2M_toc, T2M_san, QV2M_san,
    TQL_san, W2M_san, T2M_dav, QV2M_dav, TQL_dav, W2M_dav', 'ts' USING PARAMETERS P=3, compute_mse=True );
WARNING 0:  Only the Yule Walker method is currently supported for Vector Autoregression, setting method to Yule Walker
                      AUTOREGRESSOR
---------------------------------------------------------
 Finished. 48048 elements accepted, 0 elements rejected.
(1 row)

Use the GET_MODEL_SUMMARY function to view a summary of the model, including its coefficient and parameter values:

=> SELECT GET_MODEL_SUMMARY( USING PARAMETERS model_name='var_electric');
                                     GET_MODEL_SUMMARY
-------------------------------------------------------------------------------------------------------

=========
phi_(t-1)
=========
predictor |nat_demand|t2m_toc |  qv2m_toc  | tql_toc |w2m_toc |t2m_san |  qv2m_san  | tql_san  |w2m_san | t2m_dav  |  qv2m_dav  | tql_dav |w2m_dav
----------+----------+--------+------------+---------+--------+--------+------------+----------+--------+----------+------------+---------+--------
nat_demand|  1.14622 |-1.66306|-26024.21189|853.81013| 2.88135|59.48676|-28244.33013|-194.95364| 1.26052|-135.79998|-66425.52272|229.38130| 0.58089
 t2m_toc  |  0.00048 | 0.53187| -112.46610 | 1.14647 |-0.06549| 0.13092| -200.78913 | -1.58964 | 0.21530| -0.46069 | -494.82893 |-1.22152 | 0.13995
 qv2m_toc |  0.00000 | 0.00057|   0.89702  |-0.00218 |-0.00001|-0.00017|   0.06365  |  0.00151 |-0.00003| -0.00012 |  -0.06966  |-0.00027 |-0.00002
 tql_toc  | -0.00002 | 0.01902|   9.77736  | 1.39416 |-0.00235|-0.01847|   0.84052  | -0.27010 | 0.00264|  0.00547 |  15.98251  |-0.19235 |-0.00214
 w2m_toc  | -0.00392 | 2.33136|  99.79514  |14.08998 | 1.44729|-1.80924|-1756.46929 |  1.11483 | 0.12860|  0.18757 | 386.03986  |-3.28553 | 1.08902
 t2m_san  |  0.00008 |-0.28536| 474.30004  | 0.02968 |-0.05527| 1.29437| -475.49532 | -6.68332 | 0.24776| -1.09044 | -871.84132 | 0.10465 | 0.40562
 qv2m_san | -0.00000 | 0.00021|  -0.39224  | 0.00080 | 0.00002|-0.00010|   1.13421  |  0.00252 |-0.00006|  0.00001 |  -0.30554  |-0.00206 | 0.00004
 tql_san  | -0.00003 |-0.00361|  17.26050  | 0.13776 |-0.00267| 0.01761|  -9.63725  |  1.03554 |-0.00053| -0.01907 |  -8.70276  |-0.56572 | 0.00523
 w2m_san  | -0.00007 |-0.92628|   9.34644  |22.01868 | 0.15592|-0.38963| 219.53687  |  6.32666 | 0.98779|  0.50404 | 607.06291  |-2.93982 | 1.01091
 t2m_dav  | -0.00009 |-0.19734| 447.10894  |-2.09032 | 0.00302| 0.27105| -266.05516 | -3.03434 | 0.28049| -0.03718 | -750.51074 |-3.00557 | 0.01414
 qv2m_dav | -0.00000 | 0.00003|  -0.25311  | 0.00255 |-0.00002| 0.00012|   0.36524  | -0.00043 | 0.00001| -0.00001 |   0.55553  |-0.00040 | 0.00010
 tql_dav  | -0.00000 | 0.00638|  36.02787  | 0.40214 |-0.00116| 0.00352|   2.09579  |  0.14142 | 0.00192| -0.01039 | -35.63238  | 0.91257 |-0.00834
 w2m_dav  |  0.00316 |-0.48625| -250.62285 | 6.92672 | 0.13897|-0.30942|  21.40057  | -2.77030 |-0.05098|  0.49791 |  86.43985  |-5.61450 | 1.36653


=========
phi_(t-2)
=========
predictor |nat_demand| t2m_toc | qv2m_toc  |  tql_toc  |w2m_toc | t2m_san | qv2m_san  | tql_san |w2m_san |t2m_dav | qv2m_dav  | tql_dav  | w2m_dav
----------+----------+---------+-----------+-----------+--------+---------+-----------+---------+--------+--------+-----------+----------+---------
nat_demand| -0.50332 |-46.08404|18727.70487|-1054.48563|-1.64150|-36.33188|-4175.69233|498.68770|24.05641|97.39713|15062.02349|-418.70514|-27.63335
 t2m_toc  | -0.00240 |-0.01253 | 304.73283 | -5.13242  |-0.03848|-0.08204 | -27.48349 | 1.86556 | 0.06103| 0.31194| 46.52296  |  0.25737 |-0.30230
 qv2m_toc |  0.00000 |-0.00023 | -0.16013  |  0.00443  | 0.00003| 0.00003 | -0.02433  |-0.00226 | 0.00001| 0.00006|  0.10786  |  0.00055 | 0.00008
 tql_toc  | -0.00003 |-0.01472 |  3.90260  | -0.60709  | 0.00318| 0.01456 | -15.78706 | 0.23611 |-0.00201|-0.00475| -11.18542 |  0.09355 | 0.00227
 w2m_toc  |  0.00026 |-1.93177 | 864.02774 | -33.46154 |-0.64149| 1.99313 | 924.12187 |-2.23520 |-0.12906|-0.46720|-763.47613 |  8.44744 |-1.42164
 t2m_san  | -0.00449 |-0.12273 | 317.82860 | -9.44207  |-0.11911|-0.32597 | 11.71343  | 6.41420 | 0.17817| 0.78725| 147.48679 | -1.93663 |-0.73511
 qv2m_san |  0.00000 | 0.00001 |  0.29127  |  0.00430  | 0.00001| 0.00018 | -0.25311  |-0.00256 | 0.00000|-0.00030|  0.16254  |  0.00213 | 0.00007
 tql_san  | -0.00001 | 0.01255 |  8.58348  |  0.17678  | 0.00236|-0.01895 |  2.89453  |-0.13436 | 0.00492| 0.01080| -5.78848  |  0.39776 |-0.00792
 w2m_san  | -0.00402 |-0.23875 | 152.07427 | -34.35207 |-0.14634| 0.54952 |-458.82955 |-2.37583 |-0.04254| 0.02246|-523.69779 |  7.92699 |-1.13203
 t2m_dav  | -0.00340 |-0.17349 | 218.48562 | -3.75842  |-0.14256|-0.16157 | 104.28631 | 2.42379 | 0.09304| 0.59601| 47.14495  |  0.00436 |-0.36515
 qv2m_dav |  0.00000 | 0.00011 |  0.26173  |  0.00065  | 0.00004| 0.00006 | -0.36303  | 0.00036 |-0.00003|-0.00030|  0.00561  |  0.00151 |-0.00008
 tql_dav  |  0.00002 |-0.01052 | -24.63826 | -0.08533  | 0.00290| 0.00260 |  0.91509  |-0.14088 |-0.00035| 0.00148| 11.70604  | -0.09407 | 0.01194
 w2m_dav  | -0.00222 | 0.06791 | 74.24493  | -7.00070  |-0.16760| 0.41837 | -47.66437 | 2.82942 | 0.03768|-0.35919| -78.61307 |  3.12482 |-0.60352


=========
phi_(t-3)
=========
predictor |nat_demand|t2m_toc | qv2m_toc  | tql_toc |w2m_toc | t2m_san | qv2m_san  | tql_san  | w2m_san |t2m_dav | qv2m_dav  | tql_dav |w2m_dav
----------+----------+--------+-----------+---------+--------+---------+-----------+----------+---------+--------+-----------+---------+--------
nat_demand|  0.28126 |53.13756|3338.49498 |-26.56706| 1.60491|-57.36669|-8104.16577|-172.24549|-38.55564|74.30481|96465.63629|245.06242|44.47450
 t2m_toc  |  0.00221 | 0.46994|-348.87103 | 2.62141 | 0.16466|-0.25496 |-137.21750 |  2.19341 |-0.40865 | 0.30965|1079.93357 | 0.14082 | 0.25677
 qv2m_toc | -0.00000 |-0.00039|  0.32237  |-0.00208 |-0.00004| 0.00015 | -0.01736  |  0.00052 | 0.00004 | 0.00014| -0.12526  |-0.00012 |-0.00006
 tql_toc  |  0.00006 |-0.00387| -17.12826 | 0.11237 |-0.00034| 0.00313 | 13.96030  |  0.08266 |-0.00155 |-0.00080| -0.47584  | 0.14030 | 0.00030
 w2m_toc  |  0.00498 |-0.70555|-955.01026 |17.98079 | 0.18909|-0.64634 | 496.07411 |  2.36480 |-0.07010 | 1.03103| 779.48099 |-5.90297 | 0.44579
 t2m_san  |  0.00510 | 0.47889|-1166.20907| 7.15240 | 0.27574|-0.48793 |-215.11669 |  5.31181 |-0.66397 | 0.72268|1871.66380 | 0.51522 | 0.50609
 qv2m_san | -0.00000 |-0.00029|  0.28296  |-0.00588 |-0.00005|-0.00014 | -0.04735  | -0.00013 | 0.00010 | 0.00040|  0.17917  |-0.00013 |-0.00015
 tql_san  |  0.00004 |-0.00467| -27.72270 |-0.40887 | 0.00016| 0.00234 |  4.26912  | -0.00376 |-0.00318 | 0.00015| 22.54523  | 0.24970 | 0.00081
 w2m_san  |  0.00493 | 1.14186|-373.37429 |13.42031 | 0.02690|-0.47403 | 49.48392  | -3.10954 |-0.11381 |-0.05802| 153.91529 |-4.97422 | 0.21238
 t2m_dav  |  0.00365 | 0.55030|-1007.89174| 3.91560 | 0.22359|-0.44299 |-400.17878 |  4.14859 |-0.58507 | 0.58492|1659.35503 | 2.60303 | 0.54396
 qv2m_dav | -0.00000 |-0.00014|  0.19838  |-0.00385 |-0.00004|-0.00020 | -0.08521  | -0.00031 | 0.00008 | 0.00030|  0.37780  |-0.00163 |-0.00007
 tql_dav  | -0.00002 | 0.00898| -10.89014 |-0.36915 |-0.00241|-0.00164 | -8.04515  | -0.03469 | 0.00140 |-0.00446| 34.64281  | 0.11446 |-0.00757
 w2m_dav  | -0.00120 | 0.51977| 149.50193 |-0.12824 | 0.02324| 0.07751 | 109.64435 | -0.52736 | 0.08222 |-0.46739| -20.74587 | 2.62600 | 0.09365


====
mean
====
predictor | value
----------+--------
nat_demand| 0.00000
 t2m_toc  | 0.00000
 qv2m_toc | 0.00000
 tql_toc  | 0.00000
 w2m_toc  | 0.00000
 t2m_san  | 0.00000
 qv2m_san | 0.00000
 tql_san  | 0.00000
 w2m_san  | 0.00000
 t2m_dav  | 0.00000
 qv2m_dav | 0.00000
 tql_dav  | 0.00000
 w2m_dav  | 0.00000


==================
mean_squared_error
==================
predictor |   value
----------+-----------
nat_demand|17962.23433
 t2m_toc  |  1.69488
 qv2m_toc |  0.00000
 tql_toc  |  0.00055
 w2m_toc  |  2.46161
 t2m_san  |  5.50800
 qv2m_san |  0.00000
 tql_san  |  0.00129
 w2m_san  |  2.09820
 t2m_dav  |  4.17091
 qv2m_dav |  0.00000
 tql_dav  |  0.00131
 w2m_dav  |  0.37255


=================
predictor_columns
=================
"nat_demand", "t2m_toc", "qv2m_toc", "tql_toc", "w2m_toc", "t2m_san", "qv2m_san", "tql_san", "w2m_san", "t2m_dav", "qv2m_dav", "tql_dav", "w2m_dav"

================
timestamp_column
================
ts

==============
missing_method
==============
error

===========
call_string
===========
autoregressor('public.var_electric', 'electric_data', '"nat_demand", "t2m_toc", "qv2m_toc", "tql_toc", "w2m_toc", "t2m_san", "qv2m_san", "tql_san", "w2m_san", "t2m_dav", "qv2m_dav", "tql_dav", "w2m_dav"', 'ts'
USING PARAMETERS p=3, method=yule-walker, missing=error, regularization='none', lambda=1, compute_mse=true, subtract_mean=false);

===============
Additional Info
===============
       Name       | Value
------------------+--------
    lag_order     |   3
  num_predictors  |   13
      lambda      | 1.00000
rejected_row_count|   0
accepted_row_count| 48048

(1 row)

Make predictions

After you have trained the model, use PREDICT_AUTOREGRESSOR to make predictions. The following query makes predictions for 10 timesteps after the end of the training data:

=> SELECT PREDICT_AUTOREGRESSOR (nat_demand, T2M_toc, QV2M_toc, TQL_toc, W2M_toc, T2M_san, QV2M_san, TQL_san, W2M_san, T2M_dav,
    QV2M_dav, TQL_dav, W2M_dav USING PARAMETERS model_name='var_electric', npredictions=10) OVER (ORDER BY ts) FROM electric_data;
 index |    nat_demand    |     T2M_toc      |      QV2M_toc      |      TQL_toc       |     W2M_toc      |     T2M_san      |      QV2M_san      |      TQL_san      |     W2M_san      |     T2M_dav      |      QV2M_dav      |      TQL_dav      |     W2M_dav
-------+------------------+------------------+--------------------+--------------------+------------------+------------------+--------------------+-------------------+------------------+------------------+--------------------+-------------------+------------------
     1 |  1078.0855626373 | 27.5432174013135 | 0.0202896580655671 | 0.0735420728737344 | 10.8175311126823 |  26.430434929925 | 0.0192831578421391 | 0.107198653008438 | 3.05244789585641 | 24.5096742655262 | 0.0184037299403769 |  0.16295453121027 | 3.08228477169708
     2 | 1123.01343816948 | 27.5799917618547 | 0.0204207744201445 | 0.0720447905881737 | 10.8724214941076 | 26.3610153442989 | 0.0194263633273137 | 0.108930877007977 | 2.72589694499722 | 24.4670623561271 | 0.0186472805344351 | 0.165398914107496 | 2.87751855475047
     3 | 1131.90496161147 | 27.3065074421367 | 0.0206625082516192 | 0.0697170726826932 | 10.5264893921207 |   25.81201637743 |  0.019608966941237 |  0.10637712791638 | 2.17340369566314 | 24.1521703335357 | 0.0188528868910987 | 0.167392378142989 | 2.80663029425841
     4 | 1138.96441161386 | 27.4576230482214 | 0.0207001599239755 | 0.0777394805028406 | 10.3601575817394 | 26.1392475032107 | 0.0195632331195498 | 0.104149788020336 | 2.46022124286432 | 24.4888899706856 | 0.0187304304955302 | 0.164373252722466 | 2.78678931032488
     5 | 1171.39047791301 | 28.0057288751278 | 0.0205956267885475 | 0.0848090062223719 | 10.6253279384262 | 27.0670669914329 | 0.0195635438719142 | 0.114456870352482 | 2.76078540220627 | 25.2647929547485 | 0.0187651697256172 |  0.17343826852935 | 2.69927291097792
     6 | 1207.73967000806 | 28.3228814221316 | 0.0206018765585195 | 0.0822472149970854 | 10.9208806093031 | 27.4723192020112 | 0.0197095736612743 | 0.120234389446089 | 2.66435208358109 | 25.5897884504046 | 0.0190138227508656 | 0.181730688934231 | 2.48952353476086
     7 | 1201.36034218262 | 28.0729800850783 | 0.0207697611016373 | 0.0771476148672282 | 10.8746443523915 | 26.9706455927136 | 0.0198268925550646 | 0.108597236269397 |  2.4209888271894 | 25.1762432351852 | 0.0191249010707135 | 0.172747303182877 | 2.23183544428884
     8 | 1224.23208010817 | 28.3089311565846 | 0.0208114201116026 | 0.0850389146168386 | 11.0236068974249 | 27.4555219109112 | 0.0198191134457088 | 0.106317040996309 | 2.53443574238199 | 25.6496524683801 | 0.0189950796868336 | 0.164772798858117 | 2.03398232662887
     9 | 1276.63054426938 | 28.9264909381223 | 0.0207288153284714 | 0.0949502386997371 | 11.3825529048681 | 28.5602280897709 | 0.0198383353555985 | 0.119437241768372 | 2.81035565170478 | 26.6092121876548 | 0.0189965295736137 | 0.173848915902565 | 1.97943631786186
    10 | 1279.80379750225 | 28.9655412855392 | 0.0207630823553549 | 0.0899990557577538 | 11.4992364863754 | 28.5096132340911 | 0.0199982128259766 | 0.122916656987609 |  2.8045000981305 | 26.4950144592136 | 0.0192642881348006 | 0.183789154418929 | 1.97414217031828
(10 rows)

See also

2 - ARIMA model example

Autoregressive integrated moving average (ARIMA) models combine the abilities of AUTOREGRESSOR and MOVING_AVERAGE models by making future predictions based on both preceding time series values and errors of previous predictions.

Autoregressive integrated moving average (ARIMA) models combine the abilities of AUTOREGRESSOR and MOVING_AVERAGE models by making predictions based on both preceding time series values and errors of previous predictions. ARIMA models also provide the option to apply a differencing operation to the input data, which can turn a non-stationary time series into a stationary time series. At model training time, you specify the differencing order and the number of preceding values and previous prediction errors that the model uses to calculate predictions.

You can use the following functions to train and make predictions with ARIMA models:

  • ARIMA: Creates and trains an ARIMA model

  • PREDICT_ARIMA: Applies a trained ARIMA model to an input relation or makes predictions using the in-sample data

These functions require time series data with consistent timesteps. To normalize a time series with inconsistent timesteps, see Gap filling and interpolation (GFI).

The following example trains three ARIMA models, two that use differencing and one that does not, and then makes predictions using the models.

Load the training data

Before you begin the example, load the Machine Learning sample data.

This example uses the following data:

  • daily-min-temperatures: provided in the machine learning sample data, this dataset contains data on the daily minimum temperature in Melbourne, Australia from 1981 through 1990. After you load the sample datasets, this data is available in the temp_data table.
  • db_size: a table that tracks the size of a database over consecutive months.
=> SELECT * FROM temp_data;
        time         | Temperature
---------------------+-------------
 1981-01-01 00:00:00 |        20.7
 1981-01-02 00:00:00 |        17.9
 1981-01-03 00:00:00 |        18.8
 1981-01-04 00:00:00 |        14.6
 1981-01-05 00:00:00 |        15.8
 ...
 1990-12-27 00:00:00 |          14
 1990-12-28 00:00:00 |        13.6
 1990-12-29 00:00:00 |        13.5
 1990-12-30 00:00:00 |        15.7
 1990-12-31 00:00:00 |          13
(3650 rows)

=> SELECT COUNT(*) FROM temp_data;
 COUNT
-------
3650
(1 row)

=> SELECT * FROM db_size;
 month | GB
-------+-----
     1 |   5
     2 |  10
     3 |  20
     4 |  35
     5 |  55
     6 |  80
     7 | 110
     8 | 145
     9 | 185
    10 | 230
(10 rows)

Train the ARIMA models

After you load the daily-min-temperatures data, you can use the ARIMA function to create and train an ARIMA model. For this example, the model is trained with lags of p=3 and q=3, taking the value and prediction error of three previous time steps into account for each prediction. Because the input time series is stationary, you don't need to apply differencing to the data:

=> SELECT ARIMA('arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=3, d=0, q=3);
                             ARIMA
--------------------------------------------------------------
Finished in 20 iterations.
3650 elements accepted, 0 elements rejected.

(1 row)

You can view a summary of the model with the GET_MODEL_SUMMARY function:

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='arima_temp');
               GET_MODEL_SUMMARY
-----------------------------------------------

============
coefficients
============
parameter| value
---------+--------
  phi_1  | 0.64189
  phi_2  | 0.46667
  phi_3  |-0.11777
 theta_1 |-0.05109
 theta_2 |-0.58699
 theta_3 |-0.15882


==============
regularization
==============
none

===============
timeseries_name
===============
temperature

==============
timestamp_name
==============
time

==============
missing_method
==============
linear_interpolation

===========
call_string
===========
ARIMA('public.arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=3, d=0, q=3, missing='linear_interpolation', init_method='Zero', epsilon=1e-06, max_iterations=100);

===============
Additional Info
===============
       Name       | Value
------------------+--------
        p         |   3
        q         |   3
        d         |   0
       mean       |11.17775
      lambda      | 1.00000
mean_squared_error| 5.80490
rejected_row_count|   0
accepted_row_count|  3650

(1 row)

Examining the db_size table, it is clear that there is an upward trend to the database size over time. Each month the database size increases five more gigabytes than the increase in the previous month. This trend indicates the time series is non-stationary.

To account for this in the ARIMA model, you must difference the data by setting a non-zero d parameter value. For comparison, two ARIMA models are trained on this data, the first with a d value of one and the second with a d value of two:

=> SELECT ARIMA('arima_d1', 'db_size', 'GB', 'month' USING PARAMETERS p=2, d=1, q=2);
                                 ARIMA
------------------------------------------------------------------------
 Finished in 9 iterations.
 10 elements accepted, 0 elements rejected.

(1 row)

=> SELECT ARIMA('arima_d2', 'db_size', 'GB', 'month' USING PARAMETERS p=2, d=2, q=2);
                                 ARIMA
------------------------------------------------------------------------
 Finished in 0 iterations.
 10 elements accepted, 0 elements rejected.

(1 row)

Make predictions

After you train the ARIMA models, you can call the PREDICT_ARIMA function to predict future time series values. This function supports making predictions using the in-sample data that the models were trained on or applying the model to an input relation.

Using in-sample data

The following PREIDCT_ARIMA call makes temperature predictions using the in-sample data that the arima_temp model was trained on. The model begins prediction at the end of the temp_data table and returns predicted values for ten timesteps:

=> SELECT PREDICT_ARIMA(USING PARAMETERS model_name='arima_temp', start=0, npredictions=10) OVER();
 index |   prediction
-------+------------------
     1 | 12.9745063293842
     2 | 13.4389080858551
     3 | 13.3955791360528
     4 | 13.3551146487462
     5 | 13.3149336514747
     6 | 13.2750516811057
     7 | 13.2354710353376
     8 | 13.1961939790513
     9 | 13.1572226788109
    10 | 13.1185592045127
(10 rows)

For both prediction methods, if you want the function to return the standard error of each prediction, you can set output_standard_errors to true:

=> SELECT PREDICT_ARIMA(USING PARAMETERS model_name='arima_temp', start=0, npredictions=10, output_standard_errors=true) OVER();
 index |    prediction    |     std_err
-------+------------------+------------------
     1 | 12.9745063293842 | 1.00621890780865
     2 | 13.4389080858551 | 1.45340836833232
     3 | 13.3955791360528 | 1.61041524562932
     4 | 13.3551146487462 | 1.76368421116143
     5 | 13.3149336514747 | 1.91223938476627
     6 | 13.2750516811057 | 2.05618464609977
     7 | 13.2354710353376 | 2.19561771498385
     8 | 13.1961939790513 | 2.33063553781651
     9 | 13.1572226788109 | 2.46133422924445
    10 | 13.1185592045127 | 2.58780904243988
(10 rows)

To make predictions with the two models trained on the db_size table, you only need to change the specified model_name in the above calls:

=> SELECT PREDICT_ARIMA(USING PARAMETERS model_name='arima_d1', start=0, npredictions=10) OVER();
 index |    prediction
-------+------------------
     1 | 279.882778508943
     2 | 334.398317856829
     3 | 393.204492820962
     4 | 455.909453114272
     5 | 522.076165355683
     6 | 591.227478668175
     7 | 662.851655189833
     8 | 736.408301395412
     9 | 811.334631481162
    10 | 887.051990217688
(10 rows)

=> SELECT PREDICT_ARIMA(USING PARAMETERS model_name='arima_d2', start=0, npredictions=10) OVER();
 index | prediction
-------+------------
     1 | 280
     2 | 335
     3 | 395
     4 | 460
     5 | 530
     6 | 605
     7 | 685
     8 | 770
     9 | 860
    10 | 955
(10 rows)

Comparing the outputs from the two models, you can see that the model trained with a d value of two correctly captures the trend in the data. Each month the rate of database growth increases by five gigabytes.

Applying to an input relation

You can also apply the model to an input relation. The following example makes predictions by applying the arima_temp model to the temp_data training set:

=> SELECT PREDICT_ARIMA(temperature USING PARAMETERS model_name='arima_temp', start=3651, npredictions=10, output_standard_errors=true) OVER(ORDER BY time) FROM temp_data;
 index |    prediction    |     std_err
-------+------------------+------------------
     1 | 12.9745063293842 | 1.00621890780865
     2 | 13.4389080858551 | 1.45340836833232
     3 | 13.3955791360528 | 1.61041524562932
     4 | 13.3551146487462 | 1.76368421116143
     5 | 13.3149336514747 | 1.91223938476627
     6 | 13.2750516811057 | 2.05618464609977
     7 | 13.2354710353376 | 2.19561771498385
     8 | 13.1961939790513 | 2.33063553781651
     9 | 13.1572226788109 | 2.46133422924445
    10 | 13.1185592045127 | 2.58780904243988
(10 rows)

Because the same data and relative start index were provided to both prediction methods, the arima_temp model predictions for each method are identical.

When applying a model to an input relation, you can set add_mean to false so that the function returns the predicted difference from the mean instead of the sum of the model mean and the predicted difference:

=> SELECT PREDICT_ARIMA(temperature USING PARAMETERS model_name='arima_temp', start=3680, npredictions=10, add_mean=false) OVER(ORDER BY time) FROM temp_data;
 index |  prediction
-------+------------------
     1 | 1.2026877112171
     2 | 1.17114068517961
     3 | 1.13992534953432
     4 | 1.10904183333367
     5 | 1.0784901998692
     6 | 1.04827044781798
     7 | 1.01838251238116
     8 | 0.98882626641461
     9 | 0.959601521551628
    10 | 0.93070802931751
(10 rows)

See also

3 - Moving-average model example

Moving average models use the errors of previous predictions to make future predictions. More specifically, the user-specified lag determines how many previous predictions and errors it takes into account during computation.

Use the following functions when training and predicting with moving-average models. Note that these functions require datasets with consistent timesteps.

To normalize datasets with inconsistent timesteps, see Gap filling and interpolation (GFI).

Example

  1. Load the datasets from the Machine-Learning-Examples repository.

    This example uses the daily-min-temperatures dataset, which contains data on the daily minimum temperature in Melbourne, Australia from 1981 through 1990:

    => SELECT * FROM temp_data;
            time         | Temperature
    ---------------------+-------------
     1981-01-01 00:00:00 |        20.7
     1981-01-02 00:00:00 |        17.9
     1981-01-03 00:00:00 |        18.8
     1981-01-04 00:00:00 |        14.6
     1981-01-05 00:00:00 |        15.8
    ...
     1990-12-27 00:00:00 |          14
     1990-12-28 00:00:00 |        13.6
     1990-12-29 00:00:00 |        13.5
     1990-12-30 00:00:00 |        15.7
     1990-12-31 00:00:00 |          13
    (3650 rows)
    
  2. Use MOVING_AVERAGE to create the moving-average model MA_temperature from the temp_data dataset. In this case, the model is trained with a lag of p=3, taking the error of 3 previous predictions into account for each estimation:

    => SELECT MOVING_AVERAGE('MA_temperature', 'temp_data', 'temperature', 'time' USING PARAMETERS q=3, missing='linear_interpolation', regularization='none', lambda=1);
                        MOVING_AVERAGE
    ---------------------------------------------------------
     Finished. 3650 elements accepted, 0 elements rejected.
    (1 row)
    

    You can view a summary of the model with GET_MODEL_SUMMARY:

    => SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='MA_temperature');
    
     GET_MODEL_SUMMARY
    -------------------
    
    ============
    coefficients
    ============
    parameter| value
    ---------+--------
    phi_(t-0)|-0.90051
    phi_(t-1)|-0.10621
    phi_(t-2)| 0.07173
    
    
    ===============
    timeseries_name
    ===============
    temperature
    
    ==============
    timestamp_name
    ==============
    time
    
    ===========
    call_string
    ===========
    moving_average('public.MA_temperature', 'temp_data', 'temperature', 'time'
    USING PARAMETERS q=3, missing=linear_interpolation, regularization='none', lambda=1);
    
    ===============
    Additional Info
    ===============
           Name       | Value
    ------------------+--------
           mean       |11.17780
        lag_order     |   3
          lambda      | 1.00000
    rejected_row_count|   0
    accepted_row_count|  3650
    
    (1 row)
    
  3. Use PREDICT_MOVING_AVERAGE to predict future temperatures. The following query starts the prediction at the end of the dataset and returns 10 predictions.

    => SELECT PREDICT_MOVING_AVERAGE(Temperature USING PARAMETERS model_name='MA_temperature', npredictions=10) OVER(ORDER BY time) FROM temp_data;
    
     index |    prediction
    -------+------------------
         1 | 13.1324365636272
         2 | 12.8071086272833
         3 | 12.7218966671721
         4 | 12.6011086656032
         5 | 12.506624729879
         6 | 12.4148247026733
         7 | 12.3307873804812
         8 | 12.2521385975133
         9 | 12.1789741993396
        10 | 12.1107640076638
    (10 rows)