ARIMA
Creates and trains an autoregressive integrated moving average (ARIMA) model from a time series with consistent timesteps. ARIMA models combine the abilities of AUTOREGRESSOR and MOVING_AVERAGE models by making future predictions based on both preceding time series values and errors of previous predictions. After the model is trained, you can make predictions with the PREDICT_ARIMA function.
In Vertica, ARIMA is implemented using a Kalman Filter state-space approach, similar to Gardner, G., et al. This approach updates the state-space model with each element in the training data in order to calculate a loss score over the training data. A BFGS optimizer is then used to adjust the coefficients, and the state-space estimation is rerun until convergence. Because of this repeated estimation process, ARIMA consumes large amounts of memory when called with high values of p and q.
Given that the input data must be sorted by timestamp, this algorithm is single-threaded.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
ImmutableSyntax
ARIMA( 'model-name', 'input-relation', 'timeseries-column', 'timestamp-column'
    USING PARAMETERS param=value[,...] )
Arguments
- model-name
- Model to create, where model-nameconforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
- input-relation
- Name of the table or view containing timeseries-columnandtimestamp-column.
- timeseries-column
- Name of a NUMERIC column in input-relationthat contains the dependent variable or outcome.
- timestamp-column
- Name of an INTEGER, FLOAT, or TIMESTAMP column in input-relationthat represents the timestamp variable. The timestep between consecutive entries should be consistent throughout thetimestamp-column.TipIf yourtimestamp-columnhas varying timesteps, consider standardizing the step size with the TIME_SLICE function.
Parameters
- p
- Integer in the range [0, 1000], the number of lags to include in the autoregressive component of the computation. If qis unspecified or set to zero,pmust be set to a nonzero value. In some cases, using a largepvalue can result in a memory overload error.Default: 0 
- d
- Difference order of the model.
CautionCurrently, this value must be zero. If you attempt to train an ARIMA model with a nonzerodvalue, the function errors.Default: 0 
- q
- Integer in the range [0, 1000], the number of lags to include in the moving average component of the computation. If pis unspecified or set to zero,qmust be set to a nonzero value. In some cases, using a largeqvalue can result in a memory overload error.Default: 0 
- missing
- Method for handling missing values, one of the following strings:
- 
'drop': Missing values are ignored. 
- 
'raise': Missing values raise an error. 
- 
'zero': Missing values are set to zero. 
- 
'linear_interpolation': Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. In cases where the first or last values in a dataset are missing, the function errors. 
 Default: 'linear_interpolation' 
- 
- init_method
- Initialization method, one of the following strings:
- 
'Zero': Coefficients are initialized to zero. 
- 
'Hannan-Rissanen' or 'HR': Coefficients are initialized using the Hannan-Rissanen algorithm. 
 Default: 'Zero' 
- 
- epsilon
- Float in the range (0.0, 1.0), controls the convergence criteria of the optimization algorithm.
Default: 1e-6 
- max_iterations
- Integer in the range [1, 1000000), the maximum number of training iterations. If you set this value too low, the algorithm might not converge.
Default: 100 
Model attributes
- coefficients
- Coefficients of the model:
- 
phi: parameters for the autoregressive component of the computation. The number of returnedphivalues is equal to the value ofp.
- 
theta: parameters for the moving average component of the computation. The number of returnedthetavalues is equal to the value ofq.
 
- 
- p, q, d
- ARIMA component values:
- 
p: number of lags included in the autoregressive component of the computation
- 
d: difference order of the model
- 
q: number of lags included in the moving average component of the computation
 
- 
- mean
- The model mean, average of the accepted sample values from timeseries-column
- regularization
- Type of regularization used when training the model
- lambda
- Regularization parameter. Higher values indicates stronger regularization.
- mean_squared_error
- Mean squared error of the model on the training set
- rejected_row_count
- Number of samples rejected during training
- accepted_row_count
- Number of samples accepted for training from the data set
- timeseries_name
- Name of the timeseries-columnused to train the model
- timestamp_name
- Name of the timestamp-columnused to train the model
- missing_method
- Method used for handling missing values
- call_string
- SQL statement used to train the model
Examples
The function requires that at least one of the p and q parameters be a positive, nonzero integer. The following example trains a model where both of these parameters are set to two:
=> SELECT ARIMA('arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=2, q=2);
               ARIMA
-------------------------------------
Finished in 24 iterations.
3650 elements accepted, 0 elements rejected.
(1 row)
To see a summary of the model, including all model coefficients and parameter values, call GET_MODEL_SUMMARY:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='arima_temp');
                    GET_MODEL_SUMMARY
------------------------------------------------------------
============
coefficients
============
parameter| value
---------+--------
  phi_1  | 1.23639
  phi_2  |-0.24201
 theta_1 |-0.64535
 theta_2 |-0.23046
==============
regularization
==============
none
===============
timeseries_name
===============
temperature
==============
timestamp_name
==============
time
==============
missing_method
==============
linear_interpolation
===========
call_string
===========
ARIMA('public.arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=2, d=0, q=2, missing='linear_interpolation', init_method='Zero', epsilon=1e-06, max_iterations=100);
===============
Additional Info
===============
       Name       | Value
------------------+--------
        p         |   2
        q         |   2
        d         |   0
       mean       |11.17775
      lambda      | 1.00000
mean_squared_error| 5.80628
rejected_row_count|   0
accepted_row_count|  3650
(1 row)
For an in-depth example that trains and makes predictions with an ARIMA model, see ARIMA model example.