ARIMA

Creates and trains an autoregressive integrated moving average (ARIMA) model from a time series with consistent timesteps.

Creates and trains an autoregressive integrated moving average (ARIMA) model from a time series with consistent timesteps. ARIMA models combine the abilities of AUTOREGRESSOR and MOVING_AVERAGE models by making future predictions based on both preceding time series values and errors of previous predictions. ARIMA models also provide the option to apply a differencing operation to the input data, which can turn a non-stationary time series into a stationary time series. After the model is trained, you can make predictions with the PREDICT_ARIMA function.

In Vertica, ARIMA is implemented using a Kalman Filter state-space approach, similar to Gardner, G., et al. This approach updates the state-space model with each element in the training data in order to calculate a loss score over the training data. A BFGS optimizer is then used to adjust the coefficients, and the state-space estimation is rerun until convergence. Because of this repeated estimation process, ARIMA consumes large amounts of memory when called with high values of p and q.

Given that the input data must be sorted by timestamp, this algorithm is single-threaded.

This is a meta-function. You must call meta-functions in a top-level SELECT statement.

Behavior type

Immutable

Syntax

ARIMA( 'model-name', 'input-relation', 'timeseries-column', 'timestamp-column'
    USING PARAMETERS param=value[,...] )

Arguments

model-name
Model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
Name of the table or view containing timeseries-column and timestamp-column.
timeseries-column
Name of a NUMERIC column in input-relation that contains the dependent variable or outcome.
timestamp-column
Name of an INTEGER, FLOAT, or TIMESTAMP column in input-relation that represents the timestamp variable. The timestep between consecutive entries should be consistent throughout the timestamp-column.

Parameters

p
Integer in the range [0, 1000], the number of lags to include in the autoregressive component of the computation. If q is unspecified or set to zero, p must be set to a nonzero value. In some cases, using a large p value can result in a memory overload error.

Default: 0

d
Integer in the range [0, 10], the difference order of the model.

If the timeseries-column is a non-stationary time series, whose statistical properties change over time, you can specify a non-zero d value to difference the input data. This operation can remove or reduce trends in the time series data.

Differencing computes the differences between consecutive time series values and then trains the model on these values. The difference order d, where 0 implies no differencing, determines how many times to repeat the differencing operation. For example, second-order differencing takes the results of the first-order operation and differences these values again to obtain the second-order values. For an example that trains an ARIMA model that uses differencing, see ARIMA model example.

Default: 0

q
Integer in the range [0, 1000], the number of lags to include in the moving average component of the computation. If p is unspecified or set to zero, q must be set to a nonzero value. In some cases, using a large q value can result in a memory overload error.

Default: 0

missing
Method for handling missing values, one of the following strings:
  • 'drop': Missing values are ignored.

  • 'raise': Missing values raise an error.

  • 'zero': Missing values are set to zero.

  • 'linear_interpolation': Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. In cases where the first or last values in a dataset are missing, the function errors.

Default: 'linear_interpolation'

init_method
Initialization method, one of the following strings:
  • 'Zero': Coefficients are initialized to zero.

  • 'Hannan-Rissanen' or 'HR': Coefficients are initialized using the Hannan-Rissanen algorithm.

Default: 'Zero'

epsilon
Float in the range (0.0, 1.0), controls the convergence criteria of the optimization algorithm.

Default: 1e-6

max_iterations
Integer in the range [1, 1000000), the maximum number of training iterations. If you set this value too low, the algorithm might not converge.

Default: 100

Model attributes

coefficients
Coefficients of the model:
  • phi: parameters for the autoregressive component of the computation. The number of returned phi values is equal to the value of p.

  • theta: parameters for the moving average component of the computation. The number of returned theta values is equal to the value of q.

p, q, d
ARIMA component values:
  • p: number of lags included in the autoregressive component of the computation

  • d: difference order of the model

  • q: number of lags included in the moving average component of the computation

mean
The model mean, average of the accepted sample values from timeseries-column
regularization
Type of regularization used when training the model
lambda
Regularization parameter. Higher values indicates stronger regularization.
mean_squared_error
Mean squared error of the model on the training set
rejected_row_count
Number of samples rejected during training
accepted_row_count
Number of samples accepted for training from the data set
timeseries_name
Name of the timeseries-column used to train the model
timestamp_name
Name of the timestamp-column used to train the model
missing_method
Method used for handling missing values
call_string
SQL statement used to train the model

Examples

The function requires that at least one of the p and q parameters be a positive, nonzero integer. The following example trains a model where both of these parameters are set to two:

=> SELECT ARIMA('arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=2, q=2);
               ARIMA
-------------------------------------
Finished in 24 iterations.
3650 elements accepted, 0 elements rejected.
(1 row)

To see a summary of the model, including all model coefficients and parameter values, call GET_MODEL_SUMMARY:

=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='arima_temp');
                    GET_MODEL_SUMMARY
------------------------------------------------------------

============
coefficients
============
parameter| value
---------+--------
  phi_1  | 1.23639
  phi_2  |-0.24201
 theta_1 |-0.64535
 theta_2 |-0.23046


==============
regularization
==============
none

===============
timeseries_name
===============
temperature

==============
timestamp_name
==============
time

==============
missing_method
==============
linear_interpolation

===========
call_string
===========
ARIMA('public.arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=2, d=0, q=2, missing='linear_interpolation', init_method='Zero', epsilon=1e-06, max_iterations=100);

===============
Additional Info
===============
       Name       | Value
------------------+--------
        p         |   2
        q         |   2
        d         |   0
       mean       |11.17775
      lambda      | 1.00000
mean_squared_error| 5.80628
rejected_row_count|   0
accepted_row_count|  3650

(1 row)

For an in-depth example that trains and makes predictions with ARIMA models, see ARIMA model example.

See also