ARIMA
Creates and trains an autoregressive integrated moving average (ARIMA) model from a time series with consistent timesteps. ARIMA models combine the abilities of AUTOREGRESSOR and MOVING_AVERAGE models by making future predictions based on both preceding time series values and errors of previous predictions. ARIMA models also provide the option to apply a differencing operation to the input data, which can turn a nonstationary time series into a stationary time series. After the model is trained, you can make predictions with the PREDICT_ARIMA function.
In Vertica, ARIMA is implemented using a Kalman Filter state-space approach, similar to Gardner, G., et al. This approach updates the state-space model with each element in the training data in order to calculate a loss score over the training data. A BFGS optimizer then adjusts the coefficients, and the state-space estimation is rerun until convergence. Because of this repeated estimation process, ARIMA consumes large amounts of memory when called with high values of p and q.
Given that the input data must be sorted by timestamp, this algorithm is single-threaded.
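To make the idea of predicting from "preceding time series values and errors of previous predictions" concrete, the following is a minimal Python sketch of a textbook ARMA(p, q) one-step forecast. It is not Vertica's implementation, which uses the Kalman Filter state-space approach described above; the function name arma_one_step, its arguments, and the sign and centering conventions are assumptions made for illustration.

```python
def arma_one_step(history, residuals, phi, theta, mean=0.0):
    """Textbook ARMA(p, q) one-step-ahead forecast (illustrative only).

    history   : past observations, oldest first
    residuals : past one-step prediction errors, oldest first
    phi       : autoregressive coefficients [phi_1, ..., phi_p]
    theta     : moving-average coefficients [theta_1, ..., theta_q]
    mean      : series mean that the model is centered on (assumed convention)
    """
    if len(history) < len(phi) or len(residuals) < len(theta):
        raise ValueError("need at least p observations and q residuals")
    # AR part: weighted sum of the p most recent centered values.
    ar = sum(c * (history[-i] - mean) for i, c in enumerate(phi, start=1))
    # MA part: weighted sum of the q most recent prediction errors.
    ma = sum(c * residuals[-j] for j, c in enumerate(theta, start=1))
    return mean + ar + ma


# Hypothetical values chosen only to exercise the function.
forecast = arma_one_step(
    history=[10.0, 12.0, 11.0],
    residuals=[0.5, -1.0],
    phi=[0.5, 0.25],
    theta=[0.4],
    mean=10.0,
)
```

With p=2 and q=1, the forecast uses the two most recent observations and the single most recent prediction error.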
This is a meta-function. You must call meta-functions in a top-level SELECT statement.
Behavior type
Immutable
Syntax
ARIMA( 'model-name', 'input-relation', 'timeseries-column', 'timestamp-column'
    USING PARAMETERS param=value[,...] )
Arguments
model-name
 Model to create, where model-name conforms to conventions described in Identifiers. It must also be unique among all names of sequences, tables, projections, views, and models within the same schema.
input-relation
 Name of the table or view containing timeseries-column and timestamp-column.
timeseries-column
 Name of a NUMERIC column in input-relation that contains the dependent variable or outcome.
timestamp-column
 Name of an INTEGER, FLOAT, or TIMESTAMP column in input-relation that represents the timestamp variable. The timestep between consecutive entries should be consistent throughout the timestamp-column.
 Tip: If your timestamp-column has varying timesteps, consider standardizing the step size with the TIME_SLICE function.
Parameters
p
 Integer in the range [0, 1000], the number of lags to include in the autoregressive component of the computation. If q is unspecified or set to zero, p must be set to a nonzero value. In some cases, using a large p value can result in a memory overload error.
 Note: The AUTOREGRESSOR and ARIMA models use different training techniques that produce distinct models when trained with matching parameter values on the same data. For example, if you train an autoregressor model using the same data and p value as an ARIMA model trained with d and q parameters set to zero, those two models will not be identical.
 Default: 0
d
 Integer in the range [0, 10], the difference order of the model.
 If the timeseries-column is a non-stationary time series, whose statistical properties change over time, you can specify a nonzero d value to difference the input data. This operation can remove or reduce trends in the time series data.
 Differencing computes the differences between consecutive time series values and then trains the model on these values. The difference order d, where 0 implies no differencing, determines how many times to repeat the differencing operation. For example, second-order differencing takes the results of the first-order operation and differences these values again to obtain the second-order values. For an example that trains an ARIMA model that uses differencing, see ARIMA model example.
 Default: 0
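The repeated differencing operation can be sketched in a few lines of Python. This is an illustration of the general technique on an in-memory list, not the code Vertica runs; the function name difference and the sample series are hypothetical.

```python
def difference(values, d=1):
    """Apply d rounds of first-order differencing to a list of numbers."""
    if d < 0:
        raise ValueError("d must be non-negative")
    result = list(values)
    for _ in range(d):
        # Each pass replaces the series with the differences between
        # consecutive values, so the series gets one element shorter.
        result = [b - a for a, b in zip(result, result[1:])]
    return result


series = [1, 4, 9, 16, 25]             # hypothetical series with a quadratic trend
first_order = difference(series, 1)    # [3, 5, 7, 9]: a linear trend remains
second_order = difference(series, 2)   # [2, 2, 2]: the trend is removed
```

In this sketch, d=0 returns the series unchanged, and each additional order shortens the series by one value.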
q
 Integer in the range [0, 1000], the number of lags to include in the moving average component of the computation. If p is unspecified or set to zero, q must be set to a nonzero value. In some cases, using a large q value can result in a memory overload error.
 Note: The MOVING_AVERAGE and ARIMA models use different training techniques that produce distinct models when trained with matching parameter values on the same data. For example, if you train a moving-average model using the same data and q value as an ARIMA model trained with p and d parameters set to zero, those two models will not be identical.
 Default: 0
missing
 Method for handling missing values, one of the following strings:

'drop': Missing values are ignored.

'raise': Missing values raise an error.

'zero': Missing values are set to zero.

'linear_interpolation': Missing values are replaced by a linearly interpolated value based on the nearest valid entries before and after the missing value. If the first or last value in the dataset is missing, the function returns an error.
Default: 'linear_interpolation'
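As a rough illustration of the 'linear_interpolation' behavior, the following Python sketch fills gaps in an evenly spaced series and raises an error when the first or last value is missing. It is an assumption-based sketch rather than Vertica's implementation: the function name fill_linear is hypothetical, missing values are represented as None, and the timesteps are assumed to be consistent.

```python
def fill_linear(values):
    """Replace None entries by interpolating between the nearest valid neighbors.

    Assumes evenly spaced timesteps, so positions in the list stand in
    for timestamps.
    """
    if not values:
        return []
    if values[0] is None or values[-1] is None:
        # Nothing to interpolate from on one side of the gap.
        raise ValueError("first and last values must be present")
    filled = list(values)
    last_valid = 0
    for i in range(1, len(values)):
        if values[i] is None:
            continue
        gap = i - last_valid
        step = (values[i] - values[last_valid]) / gap
        # Fill every missing position between the two valid entries.
        for k in range(1, gap):
            filled[last_valid + k] = values[last_valid] + step * k
        last_valid = i
    return filled


# Hypothetical series with two gaps.
filled = fill_linear([1.0, None, None, 4.0, None, 6.0])
```

Here the two gaps are filled from the valid entries on either side, giving an evenly increasing series from 1.0 to 6.0.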

init_method
 Initialization method, one of the following strings:

'Zero': Coefficients are initialized to zero.

'Hannan-Rissanen' or 'HR': Coefficients are initialized using the Hannan-Rissanen algorithm.
Default: 'Zero'

epsilon
 Float in the range (0.0, 1.0), controls the convergence criteria of the optimization algorithm.
Default: 1e-6
max_iterations
 Integer in the range [1, 1000000), the maximum number of training iterations. If you set this value too low, the algorithm might not converge.
Default: 100
Model attributes
coefficients
 Coefficients of the model:

phi: parameters for the autoregressive component of the computation. The number of returned phi values is equal to the value of p.

theta: parameters for the moving average component of the computation. The number of returned theta values is equal to the value of q.

p, q, d
 ARIMA component values:

p: number of lags included in the autoregressive component of the computation

d: difference order of the model

q: number of lags included in the moving average component of the computation

mean
 The model mean, average of the accepted sample values from timeseries-column
regularization
 Type of regularization used when training the model
lambda
 Regularization parameter. Higher values indicate stronger regularization.
mean_squared_error
 Mean squared error of the model on the training set
rejected_row_count
 Number of samples rejected during training
accepted_row_count
 Number of samples accepted for training from the data set
timeseries_name
 Name of the timeseries-column used to train the model
timestamp_name
 Name of the timestamp-column used to train the model
missing_method
 Method used for handling missing values
call_string
 SQL statement used to train the model
Examples
The function requires that at least one of the p and q parameters be a positive integer. The following example trains a model where both of these parameters are set to two:
=> SELECT ARIMA('arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=2, q=2);
                    ARIMA
----------------------------------------------
Finished in 24 iterations.
3650 elements accepted, 0 elements rejected.
(1 row)
To see a summary of the model, including all model coefficients and parameter values, call GET_MODEL_SUMMARY:
=> SELECT GET_MODEL_SUMMARY(USING PARAMETERS model_name='arima_temp');
GET_MODEL_SUMMARY
-----------------------------------------------

============
coefficients
============
parameter| value
---------+--------
phi_1    | 1.23639
phi_2    | 0.24201
theta_1  | 0.64535
theta_2  | 0.23046

==============
regularization
==============
none

===============
timeseries_name
===============
temperature

==============
timestamp_name
==============
time

==============
missing_method
==============
linear_interpolation

===========
call_string
===========
ARIMA('public.arima_temp', 'temp_data', 'temperature', 'time' USING PARAMETERS p=2, d=0, q=2, missing='linear_interpolation', init_method='Zero', epsilon=1e-06, max_iterations=100);

===============
Additional Info
===============
       Name       | Value
------------------+--------
p                 | 2
q                 | 2
d                 | 0
mean              | 11.17775
lambda            | 1.00000
mean_squared_error| 5.80628
rejected_row_count| 0
accepted_row_count| 3650

(1 row)
For an in-depth example that trains and makes predictions with ARIMA models, see ARIMA model example.