ARIMA

Creates and trains an autoregressive integrated moving average (ARIMA) model from a time series with consistent timesteps.

Creates and trains an autoregressive integrated moving average (ARIMA) model from a time series with consistent timesteps. ARIMA models combine the abilities of AUTOREGRESSOR and MOVING_AVERAGE models by making future predictions based on both preceding time series values and errors of previous predictions. After the model is trained, you can make predictions with the PREDICT_ARIMA function.

In Vertica, ARIMA is implemented using a Kalman Filter state-space approach, similar to Gardner, G., et al. This approach updates the state-space model with each element in the training data in order to calculate a loss score over the training data. A BFGS optimizer is then used to adjust the coefficients, and the state-space estimation is rerun until convergence. Because of this repeated estimation process, ARIMA consumes large amounts of memory when called with high values of p and q.

Given that the input data must be sorted by timestamp, this algorithm is single-threaded.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Immutable

Syntax

ARIMA( 'model-name', 'input-relation', 'timeseries-column', 'timestamp-column'
    USING PARAMETERS param=value[,...] )

Arguments

model-name: Model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation: Name of the table or view containing timeseries-column and timestamp-column.
timeseries-column: Name of a NUMERIC column in input-relation that contains the dependent variable or outcome.
timestamp-column: Name of an INTEGER, FLOAT, or TIMESTAMP column in input-relation that represents the timestamp variable. The timestep between consecutive entries should be consistent throughout the timestamp-column.

Tip
If your timestamp-column has varying timesteps, consider standardizing the step size with the TIME_SLICE function.

Parameters

p

Integer in the range [0, 1000], the number of lags to include in the autoregressive component of the computation. If q is unspecified or set to zero, p must be set to a nonzero value. In some cases, using a large p value can result in a memory overload error.

Default: 0

d

Difference order of the model.

Caution

Currently, this value must be zero. If you attempt to train an ARIMA model with a nonzero d value, the function errors.

Default: 0

q

Integer in the range [0, 1000], the number of lags to include in the moving average component of the computation. If p is unspecified or set to zero, q must be set to a nonzero value. In some cases, using a large q value can result in a memory overload error.

Default: 0

missing

Method for handling missing values, one of the following strings:

'drop': Missing values are ignored.
'raise': Missing values raise an error.
'zero': Missing values are set to zero.
'linear_interpolation': Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. In cases where the first or last values in a dataset are missing, the function errors.

Default: 'linear_interpolation'

init_method

Initialization method, one of the following strings:

'Zero': Coefficients are initialized to zero.
'Hannan-Rissanen' or 'HR': Coefficients are initialized using the Hannan-Rissanen algorithm.

Default: 'Zero'

epsilon

Float in the range (0.0, 1.0), controls the convergence criteria of the optimization algorithm.

Default: 1e-6

max_iterations

Integer in the range [1, 1000000), the maximum number of training iterations. If you set this value too low, the algorithm might not converge.

Default: 100

Model attributes

coefficients

Coefficients of the model:

phi: parameters for the autoregressive component of the computation. The number of returned phi values is equal to the value of p.
theta: parameters for the moving average component of the computation. The number of returned theta values is equal to the value of q.

p, q, d

ARIMA component values:

p: number of lags included in the autoregressive component of the computation
d: difference order of the model
q: number of lags included in the moving average component of the computation

mean

The model mean, average of the accepted sample values from timeseries-column

regularization

Type of regularization used when training the model

lambda

Regularization parameter. Higher values indicates stronger regularization.

mean_squared_error

Mean squared error of the model on the training set

rejected_row_count

Number of samples rejected during training

accepted_row_count

Number of samples accepted for training from the data set

timeseries_name

Name of the timeseries-column used to train the model

timestamp_name

Name of the timestamp-column used to train the model

missing_method

Method used for handling missing values

call_string

SQL statement used to train the model

Examples

The function requires that at least one of the p and q parameters be a positive, nonzero integer. The following example trains a model where both of these parameters are set to two:

=> SELECT ARIMA('arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=2, q=2);
               ARIMA
-------------------------------------
Finished in 24 iterations.
3650 elements accepted, 0 elements rejected.
(1 row)

To see a summary of the model, including all model coefficients and parameter values, call GET_MODEL_SUMMARY:

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='arima_temp');
                    GET_MODEL_SUMMARY
------------------------------------------------------------

============
coefficients
============
parameter| value
---------+--------
  phi_1  | 1.23639
  phi_2  |-0.24201
 theta_1 |-0.64535
 theta_2 |-0.23046


==============
regularization
==============
none

===============
timeseries_name
===============
temperature

==============
timestamp_name
==============
time

==============
missing_method
==============
linear_interpolation

===========
call_string
===========
ARIMA('public.arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=2, d=0, q=2, missing='linear_interpolation', init_method='Zero', epsilon=1e-06, max_iterations=100);

===============
Additional Info
===============
       Name       | Value
------------------+--------
        p         |   2
        q         |   2
        d         |   0
       mean       |11.17775
      lambda      | 1.00000
mean_squared_error| 5.80628
rejected_row_count|   0
accepted_row_count|  3650

(1 row)

For an in-depth example that trains and makes predictions with an ARIMA model, see ARIMA model example.

ARIMA

Behavior type

Syntax

Arguments

Tip

Parameters

Caution

Model attributes

Examples

See also