ARIMA

Creates and trains an autoregressive integrated moving average (ARIMA) model from a time series with consistent timesteps.

Creates and trains an autoregressive integrated moving average (ARIMA) model from a time series with consistent timesteps. ARIMA models combine the abilities of AUTOREGRESSOR and MOVING_AVERAGE models by making future predictions based on both preceding time series values and errors of previous predictions. ARIMA models also provide the option to apply a differencing operation to the input data, which can turn a non-stationary time series into a stationary time series. After the model is trained, you can make predictions with the PREDICT_ARIMA function.

In Vertica, ARIMA is implemented using a Kalman Filter state-space approach, similar to Gardner, G., et al. This approach updates the state-space model with each element in the training data in order to calculate a loss score over the training data. A BFGS optimizer is then used to adjust the coefficients, and the state-space estimation is rerun until convergence. Because of this repeated estimation process, ARIMA consumes large amounts of memory when called with high values of p and q.

Given that the input data must be sorted by timestamp, this algorithm is single-threaded.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Immutable

Syntax

ARIMA( 'model-name', 'input-relation', 'timeseries-column', 'timestamp-column'
    USING PARAMETERS param=value[,...] )

Arguments

model-name: Model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation: Name of the table or view containing timeseries-column and timestamp-column.
timeseries-column: Name of a NUMERIC column in input-relation that contains the dependent variable or outcome.
timestamp-column: Name of an INTEGER, FLOAT, or TIMESTAMP column in input-relation that represents the timestamp variable. The timestep between consecutive entries should be consistent throughout the timestamp-column.

Tip
If your timestamp-column has varying timesteps, consider standardizing the step size with the TIME_SLICE function.

Parameters

p

Integer in the range [0, 1000], the number of lags to include in the autoregressive component of the computation. If q is unspecified or set to zero, p must be set to a nonzero value. In some cases, using a large p value can result in a memory overload error.

Note

The AUTOREGRESSOR and ARIMA models use different training techniques that produce distinct models when trained with matching parameter values on the same data. For example, if you train an autoregressor model using the same data and p value as an ARIMA model trained with d and q parameters set to zero, those two models will not be identical.

Default: 0

d

Integer in the range [0, 10], the difference order of the model.

If the timeseries-column is a non-stationary time series, whose statistical properties change over time, you can specify a non-zero d value to difference the input data. This operation can remove or reduce trends in the time series data.

Differencing computes the differences between consecutive time series values and then trains the model on these values. The difference order d, where 0 implies no differencing, determines how many times to repeat the differencing operation. For example, second-order differencing takes the results of the first-order operation and differences these values again to obtain the second-order values. For an example that trains an ARIMA model that uses differencing, see ARIMA model example.

Default: 0

q

Integer in the range [0, 1000], the number of lags to include in the moving average component of the computation. If p is unspecified or set to zero, q must be set to a nonzero value. In some cases, using a large q value can result in a memory overload error.

Note

The MOVING_AVERAGE and ARIMA models use different training techniques that produce distinct models when trained with matching parameter values on the same data. For example, if you train a moving-average model using the same data and q value as an ARIMA model trained with p and d parameters set to zero, those two models will not be identical.

Default: 0

missing

Method for handling missing values, one of the following strings:

'drop': Missing values are ignored.
'raise': Missing values raise an error.
'zero': Missing values are set to zero.
'linear_interpolation': Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. In cases where the first or last values in a dataset are missing, the function errors.

Default: 'linear_interpolation'

init_method

Initialization method, one of the following strings:

'Zero': Coefficients are initialized to zero.
'Hannan-Rissanen' or 'HR': Coefficients are initialized using the Hannan-Rissanen algorithm.

Default: 'Zero'

epsilon

Float in the range (0.0, 1.0), controls the convergence criteria of the optimization algorithm.

Default: 1e-6

max_iterations

Integer in the range [1, 1000000), the maximum number of training iterations. If you set this value too low, the algorithm might not converge.

Default: 100

Model attributes

coefficients

Coefficients of the model:

phi: parameters for the autoregressive component of the computation. The number of returned phi values is equal to the value of p.
theta: parameters for the moving average component of the computation. The number of returned theta values is equal to the value of q.

p, q, d

ARIMA component values:

p: number of lags included in the autoregressive component of the computation
d: difference order of the model
q: number of lags included in the moving average component of the computation

mean

The model mean, average of the accepted sample values from timeseries-column

regularization

Type of regularization used when training the model

lambda

Regularization parameter. Higher values indicates stronger regularization.

mean_squared_error

Mean squared error of the model on the training set

rejected_row_count

Number of samples rejected during training

accepted_row_count

Number of samples accepted for training from the data set

timeseries_name

Name of the timeseries-column used to train the model

timestamp_name

Name of the timestamp-column used to train the model

missing_method

Method used for handling missing values

call_string

SQL statement used to train the model

Examples

The function requires that at least one of the p and q parameters be a positive, nonzero integer. The following example trains a model where both of these parameters are set to two:

=> SELECT ARIMA('arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=2, q=2);
               ARIMA
-------------------------------------
Finished in 24 iterations.
3650 elements accepted, 0 elements rejected.
(1 row)

To see a summary of the model, including all model coefficients and parameter values, call GET_MODEL_SUMMARY:

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='arima_temp');
                    GET_MODEL_SUMMARY
------------------------------------------------------------

============
coefficients
============
parameter| value
---------+--------
  phi_1  | 1.23639
  phi_2  |-0.24201
 theta_1 |-0.64535
 theta_2 |-0.23046


==============
regularization
==============
none

===============
timeseries_name
===============
temperature

==============
timestamp_name
==============
time

==============
missing_method
==============
linear_interpolation

===========
call_string
===========
ARIMA('public.arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=2, d=0, q=2, missing='linear_interpolation', init_method='Zero', epsilon=1e-06, max_iterations=100);

===============
Additional Info
===============
       Name       | Value
------------------+--------
        p         |   2
        q         |   2
        d         |   0
       mean       |11.17775
      lambda      | 1.00000
mean_squared_error| 5.80628
rejected_row_count|   0
accepted_row_count|  3650

(1 row)

For an in-depth example that trains and makes predictions with ARIMA models, see ARIMA model example.

ARIMA

Behavior type

Syntax

Arguments

Tip

Parameters

Note

Note

Model attributes

Examples

See also