ARIMA
Creates and trains an autoregressive integrated moving average (ARIMA) model from a time series with consistent timesteps. ARIMA models combine the abilities of AUTOREGRESSOR and MOVING_AVERAGE models by making future predictions based on both preceding time series values and errors of previous predictions. After the model is trained, you can make predictions with the PREDICT_ARIMA function.
In Vertica, ARIMA is implemented using a Kalman Filter state-space approach, similar to Gardner, G., et al. This approach updates the state-space model with each element in the training data in order to calculate a loss score over the training data. A BFGS optimizer is then used to adjust the coefficients, and the state-space estimation is rerun until convergence. Because of this repeated estimation process, ARIMA consumes large amounts of memory when called with high values of p and q.
Given that the input data must be sorted by timestamp, this algorithm is single-threaded.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
ImmutableSyntax
ARIMA( 'model-name', 'input-relation', 'timeseries-column', 'timestamp-column'
USING PARAMETERS param=value[,...] )
Arguments
model-name- Model to create, where
model-nameconforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema. input-relation- Name of the table or view containing
timeseries-columnandtimestamp-column. timeseries-column- Name of a NUMERIC column in
input-relationthat contains the dependent variable or outcome. timestamp-column- Name of an INTEGER, FLOAT, or TIMESTAMP column in
input-relationthat represents the timestamp variable. The timestep between consecutive entries should be consistent throughout thetimestamp-column.Tip
If yourtimestamp-columnhas varying timesteps, consider standardizing the step size with the TIME_SLICE function.
Parameters
p- Integer in the range [0, 1000], the number of lags to include in the autoregressive component of the computation. If
qis unspecified or set to zero,pmust be set to a nonzero value. In some cases, using a largepvalue can result in a memory overload error.Default: 0
d- Difference order of the model.
Caution
Currently, this value must be zero. If you attempt to train an ARIMA model with a nonzerodvalue, the function errors.Default: 0
q- Integer in the range [0, 1000], the number of lags to include in the moving average component of the computation. If
pis unspecified or set to zero,qmust be set to a nonzero value. In some cases, using a largeqvalue can result in a memory overload error.Default: 0
missing- Method for handling missing values, one of the following strings:
-
'drop': Missing values are ignored.
-
'raise': Missing values raise an error.
-
'zero': Missing values are set to zero.
-
'linear_interpolation': Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. In cases where the first or last values in a dataset are missing, the function errors.
Default: 'linear_interpolation'
-
init_method- Initialization method, one of the following strings:
-
'Zero': Coefficients are initialized to zero.
-
'Hannan-Rissanen' or 'HR': Coefficients are initialized using the Hannan-Rissanen algorithm.
Default: 'Zero'
-
epsilon- Float in the range (0.0, 1.0), controls the convergence criteria of the optimization algorithm.
Default: 1e-6
max_iterations- Integer in the range [1, 1000000), the maximum number of training iterations. If you set this value too low, the algorithm might not converge.
Default: 100
Model attributes
coefficients- Coefficients of the model:
-
phi: parameters for the autoregressive component of the computation. The number of returnedphivalues is equal to the value ofp. -
theta: parameters for the moving average component of the computation. The number of returnedthetavalues is equal to the value ofq.
-
p, q, d- ARIMA component values:
-
p: number of lags included in the autoregressive component of the computation -
d: difference order of the model -
q: number of lags included in the moving average component of the computation
-
mean- The model mean, average of the accepted sample values from
timeseries-column regularization- Type of regularization used when training the model
lambda- Regularization parameter. Higher values indicates stronger regularization.
mean_squared_error- Mean squared error of the model on the training set
rejected_row_count- Number of samples rejected during training
accepted_row_count- Number of samples accepted for training from the data set
timeseries_name- Name of the
timeseries-columnused to train the model timestamp_name- Name of the
timestamp-columnused to train the model missing_method- Method used for handling missing values
call_string- SQL statement used to train the model
Examples
The function requires that at least one of the p and q parameters be a positive, nonzero integer. The following example trains a model where both of these parameters are set to two:
=> SELECT ARIMA('arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=2, q=2);
ARIMA
-------------------------------------
Finished in 24 iterations.
3650 elements accepted, 0 elements rejected.
(1 row)
To see a summary of the model, including all model coefficients and parameter values, call GET_MODEL_SUMMARY:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='arima_temp');
GET_MODEL_SUMMARY
------------------------------------------------------------
============
coefficients
============
parameter| value
---------+--------
phi_1 | 1.23639
phi_2 |-0.24201
theta_1 |-0.64535
theta_2 |-0.23046
==============
regularization
==============
none
===============
timeseries_name
===============
temperature
==============
timestamp_name
==============
time
==============
missing_method
==============
linear_interpolation
===========
call_string
===========
ARIMA('public.arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=2, d=0, q=2, missing='linear_interpolation', init_method='Zero', epsilon=1e-06, max_iterations=100);
===============
Additional Info
===============
Name | Value
------------------+--------
p | 2
q | 2
d | 0
mean |11.17775
lambda | 1.00000
mean_squared_error| 5.80628
rejected_row_count| 0
accepted_row_count| 3650
(1 row)
For an in-depth example that trains and makes predictions with an ARIMA model, see ARIMA model example.