Adjusting Spread Daemon timeouts for virtual environments

You may see Vertica nodes leave the database even though they are still running.

You may see Vertica nodes leave the database even though they are still running. This issue can happen on networks that are prone to spikes in latency or in virtual environments where a node's VM may be paused for a short period of time. You can adjust a setting in Vertica to help prevent this issue from occurring.

Vertica relies on spread daemons to pass messages between database nodes. When a node fails to respond to a spread message after a timeout period, Vertica assumes the node is down and starts to remove it from the database.

The default Spread timeout depends on the number of configured Spread segments:

Configured Spread segments Default timeout
1 8 seconds
> 1 25 seconds

If network delays or temporary pauses of a VM last longer than the spread timeout period, you may see UP nodes leave the database. In these cases, you can increase the spread timeout to reduce or eliminate instances where UP nodes leave the database.

Azure's memory-preserving updates and spread timeouts

In Azure, you might see running nodes leave the database due to scheduled maintenance. Azure's maintenance down time is usually well-defined. For example, Azure's memory-preserving updates can pause a VM for up to 30 seconds while performing maintenance on the system hosting the VM. This pause does not disrupt the node. It continues normal operation once Azure resumes it. See the Azure documentation's topic on Maintenance for virtual machines in Azure for more information about updates. If Azure pauses a node for longer than the spread timeout period, Vertica interprets the node's inability to respond to a spread message as the node going down, even though it will resume running normally.

Setting the spread timeout

When you know your network or nodes may be unable to respond for a specific amount of time, you can increase the spread timeout period to longer than this time. Adjust the timeout to the period of time the node may be unable to respond, plus an additional 5 seconds as a safety margin.

For example, if you know Azure's memory-preserving maintenance can pause your VMs for up to 30 seconds, set the spread timeout to 35 seconds.

If you do not know exactly how long network or node disruptions can last, you can try increasing the spread timeout gradually, until you see reduced instances of UP nodes leaving the database. Be as conservative with this setting as you can.

You can see the current setting of the spread timeout by querying system tableSPREAD_STATE:

=> SELECT * FROM V_MONITOR.SPREAD_STATE;
    node_name     | token_timeout
------------------+---------------
 v_vmart_node0003 |          8000
 v_vmart_node0001 |          8000
 v_vmart_node0002 |          8000
(3 rows)

You change the spread timeout calling the meta-function SET_SPREAD_OPTION to set the token timeout to a new value. This value is a string, and sets the timeout in milliseconds.

This example sets the timeout to 35 seconds (35000ms):

=> SELECT SET_SPREAD_OPTION( 'TokenTimeout', '35000');
NOTICE 9003:  Spread has been notified about the change
                   SET_SPREAD_OPTION
--------------------------------------------------------
 Spread option 'TokenTimeout' has been set to '35000'.

(1 row)

=> SELECT * FROM V_MONITOR.SPREAD_STATE;
    node_name     | token_timeout
------------------+---------------
 v_vmart_node0001 |         35000
 v_vmart_node0002 |         35000
 v_vmart_node0003 |         35000
(3 rows);

See also