Failure recovery

Hardware or software issues can force nodes in your cluster to fail. In this case, the node or nodes leave the database. You must recover these failed nodes before they can rejoin the cluster and resume normal operation.

Node failure's impact on the database

Having failed nodes in your database affects how your database operates. If you have an Enterprise Mode database with K-safety 0, the loss of any node causes the database to shut down. Eon Mode databases usually do not have a K-safety of 0 (see Data integrity and high availability in an Eon Mode database).

In a database in either mode with K-safety of 1 or greater, your database continues to run normally after losing a node. However, its performance is affected:

  • In Enterprise Mode, another node fills in for a down node, using its copy of the down node's data. This node must perform up to twice the amount of work it usually does. Operations such as queries will take longer because the rest of the cluster waits for the node to finish.

  • In Eon Mode, another node fills in for the down node. Nodes in Eon Mode databases do not maintain buddy projections as nodes in Enterprise Mode databases do. The node filling in for the down node retrieves the down node's data from communal storage to process queries, and does not store that data in its depot. Retrieving all of this data from communal storage, on top of the extra work the node performs, slows query processing. The performance impact of a down node is usually limited to the subcluster that contains it.

Because of these performance impacts, you should recover the failed nodes as soon as possible.

If too many database nodes fail, your database loses the ability to maintain K-safety or quorum. In an Eon Mode database, loss of primary nodes can also result in loss of primary shard coverage. In any of these cases, your database stops normal operations to prevent data corruption. How it responds to the loss of K-safety or quorum depends on its mode:

  • In Enterprise Mode, the database shuts down because it does not have access to all of its data.

  • In Eon Mode, the database continues running in read-only mode. Operations that change the global catalog such as inserting data or altering table schemas fail. However, queries can run on any subcluster that still has shard coverage. See Read-Only Mode.

To return your database to normal operation, you must restore the failed nodes and recover the database.

Recovery scenarios

Vertica begins the database recovery process when you restart failed nodes or the database. The mode of recovery for a K-safe database depends on the type of failure:

  • One or more nodes failed, but the database maintained quorum and K-safety and continued running.

  • An administrator shut down the database cleanly after the loss of nodes.

  • An Eon Mode database lost enough primary nodes to go into read-only mode.

  • The database shut down uncleanly.

In the first three cases, nodes automatically rejoin the database after you resolve their failure; in the fourth case (unclean shutdown), you must manually intervene to recover the database. The following sections discuss these cases in greater detail.

If a recovery causes a table or schema to exceed its disk quota, the recovery proceeds anyway. You must then either reduce disk usage or increase the quota before you can perform other operations that consume disk space. For more information, see Disk quotas.

Recovery of failed nodes

One or more nodes in your database have failed. However, the database maintained quorum and K-safety so it continued running without interruption.

Recover the down nodes by restarting the Vertica process on them, using either the Administration Tools or the Management Console. See Restarting Vertica on a host for instructions.

While restarted nodes recover their data from other nodes, their status is set to RECOVERING. Except for a short period at the end, the recovery phase has no effect on database transaction processing. After recovery is complete, the restarted nodes' status changes to UP.
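
While the nodes recover, you can check their progress from any UP node by querying the NODES system table. The node names and states below are illustrative:

=> SELECT node_name, node_state FROM NODES ORDER BY node_name;
    node_name     | node_state
------------------+------------
 v_vmart_node0001 | UP
 v_vmart_node0002 | RECOVERING
 v_vmart_node0003 | UP
(3 rows)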

Recovery after clean shutdown

An administrator shut down the database cleanly after the loss of nodes. To recover:

  1. Resolve any hardware or system problems that caused the nodes' hosts to go down.

  2. Restart the database. See Starting the database.

On restart, all nodes whose status was UP before the shutdown return to a status of UP. If the database contained one or more failed nodes on shutdown and they are now available, they begin the recovery process as described in the previous section.

Recovery of a read-only Eon Mode database

A database in Eon Mode has lost enough primary nodes to cause it to go into read-only mode. To return the database to normal operation, restart the failed nodes. See Recover from Read-Only Mode.

Recovery after unclean shutdown

In an unclean shutdown, Vertica was not able to complete a normal shutdown process. Reasons for unclean shutdown include:

  • A critical node in an Enterprise Mode database failed, leaving part of the database's data unavailable. The database immediately shuts down to prevent potential data corruption.

  • A site-wide event such as a power failure caused all nodes to reboot.

  • Vertica processes on the nodes exited due to a software or hardware failure.

Unclean shutdown can put the database in an inconsistent state—for example, Vertica might have been in the middle of writing data to disk at the time of failure, and this process was left incomplete. When you restart the database, Vertica determines that normal startup is not possible and uses the Last Good Epoch to determine when data was last consistent on all nodes. Vertica prompts you to accept recovery with the suggested epoch. If you accept, the database recovers and all data changes after the Last Good Epoch are lost. If you do not accept, the database does not start.

Instead of accepting the recommended epoch, you can recover from a backup. You can also choose an epoch that precedes the Last Good Epoch, through the Administration Tools Advanced Menu option Roll Back Database to Last Good Epoch. This is useful in special situations, for example when a failure occurs during a batch of loads and it is easier to restart the entire batch, even though some of the work must be repeated. In most cases, you should accept the recommended epoch.

Epochs and node recovery

The checkpoint epochs (CPEs) for both the source and target projections are updated as ROS containers are moved. The start and end epochs of all storage containers, such as ROS containers, are modified to the commit epoch. When this occurs, the epochs of all columns without an actual data file rewrite advance the CPE to the commit epoch of MOVE_PARTITIONS_TO_TABLE. If any nodes are down during the partition move operation, they detect that there is storage to recover. On rejoining the cluster, the restarted nodes recover from other nodes with the correct epoch.
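
For reference, a partition move that triggers this epoch handling looks like the following sketch; the table names and partition key range here are hypothetical:

=> SELECT MOVE_PARTITIONS_TO_TABLE('store_sales', '2023-01', '2023-06', 'store_sales_archive');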

See Epochs for additional information about how Vertica uses epochs.

Manual recovery notes

  • You can manually recover a database where up to K nodes are offline—for example, they were physically removed for repair or were not reachable at the time of recovery. When the missing nodes are restored, they recover and rejoin the cluster as described in Recovery Scenarios.

  • You can manually recover a database if the nodes to be restarted can supply all partition segments, even if more than K nodes remain down at startup. In this case, all data is available from the remaining cluster nodes, so the database can successfully start.

  • The default setting for the HistoryRetentionTime configuration parameter is 0, so Vertica only keeps historical data when nodes are down. This setting prevents use of the Administration Tools Roll Back Database to Last Good Epoch option because the AHM remains close to the current epoch and a rollback is not permitted to an epoch that precedes the AHM. If you rely on the Roll Back option to remove recently loaded data, consider setting a day-wide window for removing loaded data (you can verify the setting with the query shown after this list). For example:

    => ALTER DATABASE DEFAULT SET HistoryRetentionTime = 86400;
    

    For more information, see Epoch management parameters.

  • When a node is down and manual recovery is required, it can take a full minute or longer for Vertica processes to time out while the system tries to form a cluster. Wait approximately one minute until the system returns the manual recovery prompt. Do not press CTRL-C during database startup.
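
To confirm that the HistoryRetentionTime setting mentioned above took effect on all nodes, query the CONFIGURATION_PARAMETERS system table; the value shown is illustrative:

=> SELECT node_name, parameter_name, current_value FROM CONFIGURATION_PARAMETERS
     WHERE parameter_name = 'HistoryRetentionTime';
 node_name |    parameter_name    | current_value
-----------+----------------------+---------------
 ALL       | HistoryRetentionTime | 86400
(1 row)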

1 - Restarting Vertica on a host

When one node in a running database cluster fails, or if any files from the catalog or data directories are lost from any one of the nodes, you can check the status of failed nodes using either the Administration Tools or the Management Console.

Restarting Vertica on a host using the administration tools

  1. Run Administration tools.

  2. From the Main Menu, select Restart Vertica on Host and click OK.

  3. Select the database host you want to recover and click OK.

  4. Verify recovery state by selecting View Database Cluster State from the Main Menu.

After the database is fully recovered, you can check the status at any time by selecting View Database Cluster State from the Administration Tools Main Menu.

Restarting Vertica on a host using the Management Console

  1. Connect to a cluster node (or the host on which MC is installed).

  2. Open a browser and connect to MC as an MC administrator.

  3. On the MC Home page, double-click the running database under the Recent Databases section.

  4. Within the Overview page, look at the node status under the Database sub-section and see if all nodes are up. The status will indicate how many nodes are up, critical, down, recovering, or other.

  5. If a node is down, click Manage at the bottom of the page and inspect the graph. A failed node will appear in red.

  6. Click the failed node to select it and in the Node List, click the Start node button.

2 - Restarting the database

If you lose the Vertica process on more than one node (for example, due to power loss), or if the servers are shut down without properly shutting down the Vertica database first, the database cluster indicates that it did not shut down gracefully the next time you start it.

The database automatically detects when the cluster was last in a consistent state and then shuts down, at which point an administrator can restart it.

From the Main Menu in the Administration tools:

  1. Verify that the database has been stopped by clicking Stop Database.

    A message displays: No databases owned by <dbadmin> are running

  2. Start the database by selecting Start Database from the Main Menu.

  3. Select the database you want to restart and click OK.

    If you are starting the database after an unclean shutdown, messages display that indicate the startup failed. Press RETURN to continue with the recovery process.

    An epoch represents committed changes to the data stored in a database between two specific points in time. When starting the database, Vertica searches for the last good epoch.

  4. Upon determining the last good epoch, you are prompted to verify that you want to start the database from the good epoch date. Select Yes to continue with the recovery.

    Vertica continues to initialize and recover all data prior to the last good epoch.

    If recovery takes more than a minute, you are prompted to answer <Yes> or <No> to "Do you want to continue waiting?"

    When the status of all nodes has changed to RECOVERING or UP, selecting <No> lets you exit this screen and monitor progress through the Administration Tools Main Menu. Selecting <Yes> continues to display the database recovery window.
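
While the database recovers, you can also monitor progress from a SQL session on any node that is already UP. A minimal sketch, assuming your Vertica version provides the V_MONITOR.RECOVERY_STATUS system table:

=> SELECT * FROM V_MONITOR.RECOVERY_STATUS;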

3 - Recovering the cluster from a backup

To recover a cluster from a backup, refer to the following topics:

4 - Phases of a recovery

The phases of a Vertica recovery are the same regardless of whether you are recovering by table or node. In the case of a recovery by table, tables become individually available as they complete the final phase. In the case of a recovery by node, the database objects only become available after the entire node completes recovery.

When you perform a recovery in Vertica, each recovered table goes through the following phases:

  1. Historical: Vertica copies any historical data it may have missed while in a state of DOWN or INITIALIZING. Lock type: none.

  2. Historical Dirty: Vertica recovers any DML transactions that committed after the node or table began recovery. Lock type: none.

  3. Current Replay Delete: Vertica replays any delete transactions that took place during the recovery. Lock type: T-lock.

  4. Aggregate Projections: Vertica recovers any aggregate projections. Lock type: T-lock.

After a table completes the last phase, Vertica considers it fully recovered. At this point, the table can participate in DDL and DML operations.

5 - Epochs

An epoch represents a cutoff point of historical data within the database. The timestamps of all commits within a given epoch are equal to or less than the epoch's timestamp. Understanding epochs is useful when you need to perform the following operations:

  • Database recovery: Vertica uses epochs to determine the last time data was consistent across all nodes in a database cluster.

  • Execute historical queries: A SELECT statement that includes an AT epoch clause only returns data that was committed on or before the specified epoch.

  • Purge deleted data: Deleted data is not removed from physical storage until it is purged from the database. You can purge deleted data from the database only if it precedes the ancient history marker (AHM) epoch.

Vertica has one open epoch and any number of closed epochs, depending on your system configuration. New and updated data is written into the open epoch, and each closed epoch represents a previous commit to your database. When data is committed with a DML operation (INSERT, UPDATE, MERGE, COPY, or DELETE), Vertica writes the data, closes the open epoch, and opens a new epoch. Each row committed to the database is associated with the epoch in which it was written.

The EPOCHS system table contains information about each available closed epoch. The epoch_close_time column stores the date and time of the commit. The epoch_number column stores the corresponding epoch number:

=> SELECT * FROM EPOCHS;
       epoch_close_time        | epoch_number
-------------------------------+--------------
 2020-07-27 14:29:49.687106-04 |           91
 2020-07-28 12:51:53.291795-04 |           92
(2 rows)

Epoch milestones

As an epoch progresses through its life cycle, it reaches milestones that Vertica uses to perform a variety of operations and maintain the state of the database.

Vertica defines each milestone as follows:

  • Current epoch (CE): The current, open epoch that you are presently writing data to.

  • Latest epoch (LE): The most recently closed epoch.

  • Checkpoint epoch: Enterprise Mode only. A node-level epoch that is the latest epoch in which data is consistent across all projections on that node.

  • Last good epoch (LGE): The minimum checkpoint epoch in which data is consistent across all nodes.

  • Ancient history mark (AHM): The oldest epoch that contains data that is accessible by historical queries.

See Epoch life cycle for detailed information about each stage.

5.1 - Epoch life cycle

The epoch life cycle consists of a sequence of milestones that enable you to perform a variety of operations and manage the state of your database.

Vertica provides epoch management parameters and functions so that you can retrieve and adjust epoch values. Additionally, see Configuring epochs for recommendations on how to set epochs for specific use cases.

Current epoch (CE)

The open epoch that contains all uncommitted changes that you are presently writing to the database. The current epoch is stored in the SYSTEM system table:

=> SELECT CURRENT_EPOCH FROM SYSTEM;
 CURRENT_EPOCH
---------------
         71
(1 row)

The following example demonstrates how the current epoch advances when you commit data:

  1. Query the SYSTEM system table to return the current epoch:

    => SELECT CURRENT_EPOCH FROM SYSTEM;
     CURRENT_EPOCH
    ---------------
             71
    (1 row)
    

    The current epoch is open, which means it is the epoch that you are presently writing data to.

  2. Insert a row into the orders table:

    => INSERT INTO orders VALUES ('123456789', 323426, 'custacct@example.com');
     OUTPUT
    --------
          1
    (1 row)
    

    Each row of data has an implicit epoch column that stores that row's commit epoch. The row that you just inserted into the table was not committed, so the epoch column is blank:

    => SELECT epoch, orderkey, custkey, email_addrs FROM orders;
     epoch | orderkey  | custkey |     email_addrs
    -------+-----------+---------+----------------------
           | 123456789 |  323426 | custacct@example.com
    (1 row)
    
  3. Commit the data, then query the table again. The committed data is associated with epoch 71, the current epoch that was previously returned from the SYSTEM system table:

    => COMMIT;
    COMMIT
    => SELECT epoch, orderkey, custkey, email_addrs FROM orders;
     epoch | orderkey  | custkey |     email_addrs
    -------+-----------+---------+----------------------
        71 | 123456789 |  323426 | custacct@example.com
    (1 row)
    
  4. Query the SYSTEM system table again to return the current epoch. The current epoch has advanced by 1:

    => SELECT CURRENT_EPOCH FROM SYSTEM;
     CURRENT_EPOCH
    ---------------
             72
    (1 row)
    

Latest epoch (LE)

The most recently closed epoch. The current epoch becomes the latest epoch after a commit operation.

The LE is the most recent epoch stored in the EPOCHS system table:

=> SELECT * FROM EPOCHS;
       epoch_close_time        | epoch_number
-------------------------------+--------------
 2020-07-27 14:29:49.687106-04 |           91
 2020-07-28 12:51:53.291795-04 |           92
(2 rows)
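
To return only the latest closed epoch, you can take the maximum epoch_number from the same table:

=> SELECT MAX(epoch_number) AS latest_epoch FROM EPOCHS;
 latest_epoch
--------------
           92
(1 row)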

Checkpoint epoch (CPE)

Valid in Enterprise Mode only. Each node has a checkpoint epoch, which is the most recent epoch in which the data on that node is consistent across all projections. When the database runs optimally, the checkpoint epoch is equal to the LE, which is always one epoch older than the current epoch.

The checkpoint epoch is used during node failure and recovery. When a single node fails, that node attempts to rebuild data beyond its checkpoint epoch from other nodes. If the failed node cannot recover data using any of those epochs, then the failed node recovers data using the checkpoint epoch.

Use PROJECTION_CHECKPOINT_EPOCHS to query information about the checkpoint epochs. The following query returns information about the checkpoint epoch on nodes that store the orders projection:

=> SELECT checkpoint_epoch, node_name, projection_name, is_up_to_date, would_recover, is_behind_ahm
      FROM PROJECTION_CHECKPOINT_EPOCHS WHERE projection_name ILIKE 'orders_b%';
 checkpoint_epoch |    node_name     | projection_name | is_up_to_date | would_recover | is_behind_ahm
------------------+------------------+-----------------+---------------+---------------+---------------
               92 | v_vmart_node0001 | orders_b1       | t             | f             | f
               92 | v_vmart_node0001 | orders_b0       | t             | f             | f
               92 | v_vmart_node0003 | orders_b1       | t             | f             | f
               92 | v_vmart_node0003 | orders_b0       | t             | f             | f
               92 | v_vmart_node0002 | orders_b0       | t             | f             | f
               92 | v_vmart_node0002 | orders_b1       | t             | f             | f
(6 rows)

This query confirms that the database epochs are advancing correctly. The would_recover column displays an f when the last good epoch (LGE) is equal to the CPE because Vertica gives precedence to the LGE for recovery when possible. The is_behind_ahm column shows whether the checkpoint epoch is behind the AHM. Any data in an epoch that precedes the ancient history mark (AHM) is unrecoverable in case of a database or node failure.

Last good epoch (LGE)

The minimum checkpoint epoch in which data is consistent across all nodes in the cluster. Each node has an LGE, and Vertica evaluates the LGE for each node to determine the cluster LGE. The cluster's LGE is stored in the SYSTEM system table:

=> SELECT LAST_GOOD_EPOCH FROM SYSTEM;
 LAST_GOOD_EPOCH
-----------------
              70
(1 row)

You can retrieve the LGE for each node by querying the expected recovery epoch:

=> SELECT GET_EXPECTED_RECOVERY_EPOCH();
INFO 4544:  Recovery Epoch Computation:
Node Dependencies:
011 - cnt: 21
101 - cnt: 21
110 - cnt: 21
111 - cnt: 9

001 - name: v_vmart_node0001
010 - name: v_vmart_node0002
100 - name: v_vmart_node0003
Nodes certainly in the cluster:
        Node 0(v_vmart_node0001), epoch 70
        Node 1(v_vmart_node0002), epoch 70
Filling more nodes to satisfy node dependencies:
Data dependencies fulfilled, remaining nodes LGEs don't matter:
        Node 2(v_vmart_node0003), epoch 70
--
 GET_EXPECTED_RECOVERY_EPOCH
-----------------------------
                          70
(1 row)

Because the LGE is a snapshot of all of the most recent data on the disk, it is used to recover from database failure. Administration Tools uses the LGE to manually reset the database. If you are recovering from database failure after an unclean shutdown, Vertica prompts you to accept recovery using the LGE during restart.

Ancient history mark (AHM)

The oldest epoch that contains data that is accessible by historical queries. The AHM is stored in the SYSTEM system table:

=> SELECT AHM_EPOCH FROM SYSTEM;
 AHM_EPOCH
-----------
        70
(1 row)

Epochs that precede the AHM are unavailable for historical queries. The following example returns the AHM, and then returns an error when executing a historical query that precedes the AHM:

=> SELECT GET_AHM_EPOCH();
 GET_AHM_EPOCH
---------------
            93
(1 row)

=> AT EPOCH 92 SELECT * FROM orders;
ERROR 3183:  Epoch number out of range
HINT:  Epochs prior to [93] do not exist. Epochs [94] and later have not yet closed

The AHM advances according to your HistoryRetentionTime, HistoryRetentionEpochs, and AdvanceAHMInterval parameter settings. By default, the AHM advances every 180 seconds until it is equal to the LGE. This helps reduce the number of epochs saved to the epoch map, which reduces the catalog size. The AHM cannot advance beyond the LGE.

The AHM serves as the cutoff epoch for purging data from physical disk. As the AHM advances, the Tuple Mover mergeout process purges any deleted data that belongs to an epoch that precedes the AHM. See Purging deleted data for details about automated or manual purges.

5.2 - Managing epochs

Epochs are stored in the epoch map, a catalog object that contains a list of closed epochs beginning at ancient history mark (AHM) epoch and ending at the latest epoch (LE). As the epoch map increases in size, the catalog uses more memory. Additionally, the AHM is used to determine what data is purged from disk. It is important to monitor database epochs to verify that they are advancing correctly to optimize database performance.

Monitoring epochs

When Vertica is running properly using the default Vertica settings, the ancient history mark, last good epoch (LGE), and checkpoint epoch (CPE, Enterprise Mode only) are equal to the latest epoch, which is 1 less than the current epoch. This controls the size of the epoch map and catalog by ensuring that disk space is not used to store data that is eligible for purging. The SYSTEM system table stores the current epoch, last good epoch, and ancient history mark:

=> SELECT CURRENT_EPOCH, LAST_GOOD_EPOCH, AHM_EPOCH FROM SYSTEM;
 CURRENT_EPOCH | LAST_GOOD_EPOCH | AHM_EPOCH
---------------+-----------------+-----------
            88 |              87 |        87
(1 row)

Vertica provides GET_AHM_EPOCH, GET_AHM_TIME, GET_CURRENT_EPOCH, and GET_LAST_GOOD_EPOCH to retrieve these epochs individually.
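
For example, to retrieve each value individually:

=> SELECT GET_CURRENT_EPOCH();
=> SELECT GET_LAST_GOOD_EPOCH();
=> SELECT GET_AHM_EPOCH();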

In Enterprise Mode, you can query the checkpoint epoch using the PROJECTION_CHECKPOINT_EPOCHS table to return the checkpoint epoch for each node in your cluster. The following query returns the CPE for any node that stores the orders projection:

=> SELECT checkpoint_epoch, node_name, projection_name
     FROM PROJECTION_CHECKPOINT_EPOCHS WHERE projection_name ILIKE 'orders_b%';
 checkpoint_epoch |    node_name     | projection_name
------------------+------------------+-----------------
               87 | v_vmart_node0001 | orders_b1
               87 | v_vmart_node0001 | orders_b0
               87 | v_vmart_node0003 | orders_b1
               87 | v_vmart_node0003 | orders_b0
               87 | v_vmart_node0002 | orders_b0
               87 | v_vmart_node0002 | orders_b1
(6 rows)

Troubleshooting the ancient history mark

A properly functioning AHM is critical in determining how well your database utilizes disk space and executes queries. When you commit a DELETE or UPDATE (a combination of DELETE and INSERT) operation, the data is not deleted from disk immediately. Instead, Vertica marks the data for deletion so that you can retrieve it with historical queries. Deleted data takes up space on disk and impacts query performance because Vertica must read the deleted data during non-historical queries.

Epochs advance as you commit data, and any data that is marked for deletion is automatically purged by the Tuple Mover mergeout process when its epoch advances past the AHM. You can create an automated purge policy or manually purge any deleted data that was committed in an epoch that precedes the AHM. See Setting a purge policy for additional information.

By default, the AHM advances every 180 seconds until it is equal to the LGE. Monitor the SYSTEM system table to ensure that the AHM is advancing properly:

=> SELECT CURRENT_EPOCH, LAST_GOOD_EPOCH, AHM_EPOCH FROM SYSTEM;
 CURRENT_EPOCH | LAST_GOOD_EPOCH | AHM_EPOCH
---------------+-----------------+-----------
            94 |              93 |        86
(1 row)

If you notice that the AHM is not advancing correctly, it might be due to one or more of the following:

  • Your database contains unrefreshed projections. This occurs when you create a projection for a table that already contains data. See Refreshing projections for details on how to refresh projections.

  • A node is DOWN. When a node is DOWN, the AHM cannot advance. See Restarting Vertica on a host for information on how to resolve this issue.

  • The AHMBackupManagement epoch parameter is set to 1, which prevents the AHM from advancing beyond the most recent full backup. Confirm the current setting:

    => SELECT node_name, parameter_name, current_value FROM CONFIGURATION_PARAMETERS WHERE parameter_name='AHMBackupManagement';
     node_name |   parameter_name    | current_value
    -----------+---------------------+---------------
     ALL       | AHMBackupManagement | 0
    (1 row)
    

5.3 - Configuring epochs

Epoch configuration impacts how your database recovers from failure, handles historical data, and purges data from disk. Vertica provides epoch management parameters for system-wide epoch configuration. Epoch management functions enable you to make ad hoc adjustments to epoch values.

Historical query and recovery precision

When you execute a historical query, Vertica returns an epoch within the amount of time specified by the EpochMapInterval configuration parameter. For example, when you execute a historical query using the AT TIME clause, Vertica returns an epoch within that interval. By default, EpochMapInterval is set to 180 seconds. You must set EpochMapInterval to a value greater than or equal to the AdvanceAHMInterval parameter:

=> SELECT node_name, parameter_name, current_value FROM CONFIGURATION_PARAMETERS
     WHERE parameter_name='EpochMapInterval' OR parameter_name='AdvanceAHMInterval';
 node_name |   parameter_name   | current_value
-----------+--------------------+---------------
 ALL       | EpochMapInterval   | 180
 ALL       | AdvanceAHMInterval | 180
(2 rows)
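
For example, a historical query that uses the AT TIME clause resolves to an epoch within that interval; the timestamp below is illustrative:

=> AT TIME '2020-07-28 12:30:00' SELECT * FROM orders;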

During failure recovery, Vertica uses the EpochMapInterval setting to determine which epoch is reported as the last good epoch (LGE).

History retention and purge workflows

Vertica recommends that you configure your epoch parameters to create a purge policy that determines when deleted data is purged from disk. If you use historical queries often, you need to find a balance between saving deleted historical data and purging it from disk. An aggressive purge policy reclaims disk space and improves query performance, but also limits your recovery options and narrows the window of data available for historical queries.

There are two strategies to creating a purge policy:

  • Set HistoryRetentionTime to specify how long deleted data is saved (in seconds) as an historical reference.

  • Set HistoryRetentionEpochs to specify the number of historical epochs to save.

See Setting a purge policy for details about configuring each workflow.

Setting HistoryRetentionTime is the preferred method for creating a purge policy. By default, Vertica sets this value to 0, so the AHM is 1 less than the current epoch when the database is running properly. You cannot execute historical queries on epochs that precede the AHM, so you might want to adjust this setting to save more data between the present time and the AHM. Another reason to adjust this parameter is if you use the Roll Back Database to Last Good Epoch option for manual roll backs. For example, the following command sets HistoryRetentionTime to 1 day (in seconds) to provide a wider range of epoch roll back options:

=> ALTER DATABASE vmart SET HistoryRetentionTime = 86400;

Vertica checks the status of your retention settings using the AdvanceAHMInterval setting and advances the AHM as necessary. After the AHM advances, any deleted data in an epoch that precedes the AHM is purged automatically by the Tuple Mover mergeout process.

If you want to disable any purge policy and preserve all historical data, set both HistoryRetentionTime and HistoryRetentionEpochs to -1:

=> ALTER DATABASE vmart SET HistoryRetentionTime = -1;
=> ALTER DATABASE vmart SET HistoryRetentionEpochs = -1;

If you do not set a purge policy, you can use epoch management functions to adjust the AHM to manually purge deleted data as needed. Manual purges are useful if you need to update or delete data uploaded by mistake. See Manually purging data for details.
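
A minimal sketch of such a manual purge, assuming an orders table and that you first want to advance the AHM to the greatest allowable epoch:

=> SELECT MAKE_AHM_NOW();
=> SELECT PURGE_TABLE('public.orders');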

6 - Best practices for disaster recovery

To protect your database from site failures caused by catastrophic disasters, maintain an off-site replica of your database to provide a standby. In case of disaster, you can switch database users over to the standby database. The amount of data loss between a disaster and failover to the off-site replica depends on how frequently you save a full database backup.

The solution to employ for disaster recovery depends upon two factors that you must determine for your application:

  • Recovery point objective (RPO): How much data loss can your organization tolerate upon a disaster recovery?

  • Recovery time objective (RTO): How quickly do you need to recover the database following a disaster?

Depending on your RPO and RTO, Vertica recommends choosing from the following solutions:

  1. Dual-load: During each load process for the database, simultaneously load a second database. You can achieve this easily with off-the-shelf ETL software.

  2. Periodic Incremental Backups: Use the procedure described in Copying the database to another cluster to periodically copy the data to the target database. Remember that the script copies only files that have changed.

  3. Replication solutions provided by Storage Vendors: Although some users have had success with SAN storage, the number of vendors and possible configurations prevent Vertica from providing support for SANs.

The following summarizes the RPO, RTO, and the pros and cons of each approach:

Dual Load

  • RPO: Up-to-the-minute data

  • RTO: Available at all times

  • Pros: The standby database can have a different configuration, and you can use the standby database for queries.

  • Cons: Possibly incurs additional ETL licenses; requires application logic to handle errors.

Periodic Incremental Backups

  • RPO: Up to the last backup

  • RTO: Available except when a backup is in progress

  • Pros: Built-in scripts; high performance due to compressed file transfers.

  • Cons: Requires an identical standby system.

Storage Replication

  • RPO: Recover to the minute

  • RTO: Available at all times

  • Pros: Transparent to the database.

  • Cons: More expensive; media corruptions are also replicated.

7 - Recovery by table

Vertica supports node recovery on a per-table basis. Unlike node-based recovery, recovering by table makes tables available as they recover, before the node itself is completely restored. You can prioritize your most important tables so they become available as soon as possible. Recovered tables support all DDL and DML operations.

To enhance recovery speed, Vertica recovers multiple tables in parallel. The maximum number of tables recoverable at one time is set by the MAXCONCURRENCY parameter in the RECOVERY resource pool.
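
For example, to allow up to five tables to recover in parallel, you could adjust the built-in RECOVERY pool; the value shown is illustrative:

=> ALTER RESOURCE POOL recovery MAXCONCURRENCY 5;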

After a node has fully recovered, it enables full Vertica functionality.

7.1 - Prioritizing table recovery

You can specify the order in which Vertica recovers tables. This feature ensures that your most critical tables become available as soon as possible. To specify the recovery order of your tables, assign an integer priority value. Tables with higher priority values recover first. For example, a table with a priority of 1000 is recovered before a table with a value of 500. Table priority values can be as large as the maximum value of a 64-bit integer.

If you do not assign a priority, or if multiple tables have the same priority, Vertica restores tables by OID order. Assign a priority with a query such as this:

=> SELECT set_table_recover_priority('avro_basic', '1000');
      set_table_recover_priority
---------------------------------------
 Table recovery priority has been set.
(1 row)

View assigned priorities with a query such as: SELECT table_name, recover_priority FROM v_catalog.tables. The next example shows prioritized tables from the VMart sample database, with the tables that have the highest recovery priority listed first (DESC). The shipping_dimension table has the highest priority and is recovered first.

=> SELECT table_name AS Name, recover_priority from v_catalog.tables WHERE recover_priority > 1
   ORDER BY recover_priority DESC;
        Name         | recover_priority
---------------------+------------------
 shipping_dimension  |            60000
 warehouse_dimension |            50000
 employee_dimension  |            40000
 vendor_dimension    |            30000
 date_dimension      |            20000
 promotion_dimension |            10000
 iris2               |             9999
 product_dimension   |               10
 customer_dimension  |               10
(9 rows)

7.2 - Viewing table recovery status

View general information about a recovery by querying the V_MONITOR.TABLE_RECOVERY_STATUS table. You can also view detailed information about the status of each table being restored by querying the V_MONITOR.TABLE_RECOVERIES table.
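
For example, the following queries return the overall recovery status and per-table recovery details (output omitted):

=> SELECT * FROM V_MONITOR.TABLE_RECOVERY_STATUS;
=> SELECT * FROM V_MONITOR.TABLE_RECOVERIES;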