Adjusting Spread Daemon timeouts for virtual environments

relies on daemons to pass messages between database nodes.

Vertica relies on Spread daemons to pass messages between database nodes. Occasionally, nodes fail to respond to messages within the specified Spread timeout. These failures might be caused by spikes in network latency or brief pauses in the node's VM—for example, scheduled Azure maintenance timeouts. In either case, Vertica assumes that the non-responsive nodes are down and starts to remove them from the database, even though they might still be running. You can address this issue by adjusting the Spread timeout as needed.

Adjusting spread timeout

By default, the Spread timeout depends on the number of configured Spread segments:

Configured Spread segments Default timeout
1 8 seconds
> 1 25 seconds

If the Spread timeout is likely to elapse before the network or database nodes can respond, increase the timeout to the maximum length of non-responsive time plus five seconds. For example, if Azure memory-preserving maintenance pauses node VMs for up to 30 seconds, set the Spread timeout to 35 seconds.

If you are unsure how long network or node disruptions are liable to last, gradually increase the Spread timeout until fewer instances of UP nodes leave the database.

To see the current setting of the Spread timeout, query system table SPREAD_STATE. For example, the following query shows that the current timeout setting (token_timeout) is set to 8000ms:

=> SELECT * FROM V_MONITOR.SPREAD_STATE;
    node_name     | token_timeout
------------------+---------------
 v_vmart_node0003 |          8000
 v_vmart_node0001 |          8000
 v_vmart_node0002 |          8000
(3 rows)

To change the Spread timeout, call the meta-function SET_SPREAD_OPTION and set the token timeout to a new value. The following example sets the timeout to 35000ms (35 seconds):

=> SELECT SET_SPREAD_OPTION( 'TokenTimeout', '35000');
NOTICE 9003:  Spread has been notified about the change
                   SET_SPREAD_OPTION
--------------------------------------------------------
 Spread option 'TokenTimeout' has been set to '35000'.

(1 row)

=> SELECT * FROM V_MONITOR.SPREAD_STATE;
    node_name     | token_timeout
------------------+---------------
 v_vmart_node0001 |         35000
 v_vmart_node0002 |         35000
 v_vmart_node0003 |         35000
(3 rows);

Azure maintenance and spread timeouts

Azure scheduled maintenance on virtual machines might pause nodes longer than the Spread timeout period. If so, Vertica is liable to view nodes that do not respond to Spread messages as down and remove them from the database.

The length of Azure maintenance tasks is usually well-defined. For example, memory-preserving updates can pause a VM for up to 30 seconds while performing maintenance on the system hosting the VM. This pause does not disrupt the node, which resumes normal operation after maintenance is complete. To prevent Vertica from removing nodes while they undergo Azure maintenance, adjust the Spread timeout as needed.

See also