ARIMA
Creates and trains an autoregressive integrated moving average (ARIMA) model from a time series with consistent timesteps. ARIMA models combine the abilities of AUTOREGRESSOR and MOVING_AVERAGE models by making future predictions based on both preceding time series values and errors of previous predictions. ARIMA models also provide the option to apply a differencing operation to the input data, which can turn a non-stationary time series into a stationary time series. After the model is trained, you can make predictions with the PREDICT_ARIMA function.
In Vertica, ARIMA is implemented using a Kalman Filter state-space approach, similar to Gardner, G., et al. This approach updates the state-space model with each element in the training data in order to calculate a loss score over the training data. A BFGS optimizer is then used to adjust the coefficients, and the state-space estimation is rerun until convergence. Because of this repeated estimation process, ARIMA consumes large amounts of memory when called with high values of p
and q
.
Given that the input data must be sorted by timestamp, this algorithm is single-threaded.
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
ImmutableSyntax
ARIMA( 'model-name', 'input-relation', 'timeseries-column', 'timestamp-column'
USING PARAMETERS param=value[,...] )
Arguments
model-name
- Model to create, where
model-name
conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema. input-relation
- Name of the table or view containing
timeseries-column
andtimestamp-column
. timeseries-column
- Name of a NUMERIC column in
input-relation
that contains the dependent variable or outcome. timestamp-column
- Name of an INTEGER, FLOAT, or TIMESTAMP column in
input-relation
that represents the timestamp variable. The timestep between consecutive entries should be consistent throughout thetimestamp-column
.Tip
If yourtimestamp-column
has varying timesteps, consider standardizing the step size with the TIME_SLICE function.
Parameters
p
- Integer in the range [0, 1000], the number of lags to include in the autoregressive component of the computation. If
q
is unspecified or set to zero,p
must be set to a nonzero value. In some cases, using a largep
value can result in a memory overload error.Note
The AUTOREGRESSOR and ARIMA models use different training techniques that produce distinct models when trained with matching parameter values on the same data. For example, if you train an autoregressor model using the same data andp
value as an ARIMA model trained withd
andq
parameters set to zero, those two models will not be identical.Default: 0
d
- Integer in the range [0, 10], the difference order of the model.
If the
timeseries-column
is a non-stationary time series, whose statistical properties change over time, you can specify a non-zerod
value to difference the input data. This operation can remove or reduce trends in the time series data.Differencing computes the differences between consecutive time series values and then trains the model on these values. The difference order
d
, where 0 implies no differencing, determines how many times to repeat the differencing operation. For example, second-order differencing takes the results of the first-order operation and differences these values again to obtain the second-order values. For an example that trains an ARIMA model that uses differencing, see ARIMA model example.Default: 0
q
- Integer in the range [0, 1000], the number of lags to include in the moving average component of the computation. If
p
is unspecified or set to zero,q
must be set to a nonzero value. In some cases, using a largeq
value can result in a memory overload error.Note
The MOVING_AVERAGE and ARIMA models use different training techniques that produce distinct models when trained with matching parameter values on the same data. For example, if you train a moving-average model using the same data andq
value as an ARIMA model trained withp
andd
parameters set to zero, those two models will not be identical.Default: 0
missing
- Method for handling missing values, one of the following strings:
-
'drop': Missing values are ignored.
-
'raise': Missing values raise an error.
-
'zero': Missing values are set to zero.
-
'linear_interpolation': Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. In cases where the first or last values in a dataset are missing, the function errors.
Default: 'linear_interpolation'
-
init_method
- Initialization method, one of the following strings:
-
'Zero': Coefficients are initialized to zero.
-
'Hannan-Rissanen' or 'HR': Coefficients are initialized using the Hannan-Rissanen algorithm.
Default: 'Zero'
-
epsilon
- Float in the range (0.0, 1.0), controls the convergence criteria of the optimization algorithm.
Default: 1e-6
max_iterations
- Integer in the range [1, 1000000), the maximum number of training iterations. If you set this value too low, the algorithm might not converge.
Default: 100
Model attributes
coefficients
- Coefficients of the model:
-
phi
: parameters for the autoregressive component of the computation. The number of returnedphi
values is equal to the value ofp
. -
theta
: parameters for the moving average component of the computation. The number of returnedtheta
values is equal to the value ofq
.
-
p, q, d
- ARIMA component values:
-
p
: number of lags included in the autoregressive component of the computation -
d
: difference order of the model -
q
: number of lags included in the moving average component of the computation
-
mean
- The model mean, average of the accepted sample values from
timeseries-column
regularization
- Type of regularization used when training the model
lambda
- Regularization parameter. Higher values indicates stronger regularization.
mean_squared_error
- Mean squared error of the model on the training set
rejected_row_count
- Number of samples rejected during training
accepted_row_count
- Number of samples accepted for training from the data set
timeseries_name
- Name of the
timeseries-column
used to train the model timestamp_name
- Name of the
timestamp-column
used to train the model missing_method
- Method used for handling missing values
call_string
- SQL statement used to train the model
Examples
The function requires that at least one of the p
and q
parameters be a positive, nonzero integer. The following example trains a model where both of these parameters are set to two:
=> SELECT ARIMA('arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=2, q=2);
ARIMA
-------------------------------------
Finished in 24 iterations.
3650 elements accepted, 0 elements rejected.
(1 row)
To see a summary of the model, including all model coefficients and parameter values, call GET_MODEL_SUMMARY:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='arima_temp');
GET_MODEL_SUMMARY
------------------------------------------------------------
============
coefficients
============
parameter| value
---------+--------
phi_1 | 1.23639
phi_2 |-0.24201
theta_1 |-0.64535
theta_2 |-0.23046
==============
regularization
==============
none
===============
timeseries_name
===============
temperature
==============
timestamp_name
==============
time
==============
missing_method
==============
linear_interpolation
===========
call_string
===========
ARIMA('public.arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=2, d=0, q=2, missing='linear_interpolation', init_method='Zero', epsilon=1e-06, max_iterations=100);
===============
Additional Info
===============
Name | Value
------------------+--------
p | 2
q | 2
d | 0
mean |11.17775
lambda | 1.00000
mean_squared_error| 5.80628
rejected_row_count| 0
accepted_row_count| 3650
(1 row)
For an in-depth example that trains and makes predictions with ARIMA models, see ARIMA model example.