Adjusting Spread Daemon timeouts for virtual environments
You may see Vertica nodes leave the database even though they are still running. This issue can happen on networks that are prone to spikes in latency or in virtual environments where a node's VM may be paused for a short period of time. You can adjust a setting in Vertica to help prevent this issue from occurring.
Vertica relies on spread daemons to pass messages between database nodes. When a node fails to respond to a spread message after a timeout period, Vertica assumes the node is down and starts to remove it from the database.
The default Spread timeout depends on the number of configured Spread segments:
Configured Spread segments | Default timeout |
---|---|
1 | 8 seconds |
> 1 | 25 seconds |
If network delays or temporary pauses of a VM last longer than the spread timeout period, you may see UP nodes leave the database. In these cases, you can increase the spread timeout to reduce or eliminate instances where UP nodes leave the database.
Azure's memory-preserving updates and spread timeouts
In Azure, you might see running nodes leave the database due to scheduled maintenance. Azure's maintenance down time is usually well-defined. For example, Azure's memory-preserving updates can pause a VM for up to 30 seconds while performing maintenance on the system hosting the VM. This pause does not disrupt the node. It continues normal operation once Azure resumes it. See the Azure documentation's topic on Maintenance for virtual machines in Azure for more information about updates. If Azure pauses a node for longer than the spread timeout period, Vertica interprets the node's inability to respond to a spread message as the node going down, even though it will resume running normally.
Note
If you deploy your Vertica cluster using the Azure Marketplace, the spread timeout defaults to 35 seconds. If you manually create your cluster in Azure, the spread timeout defaults to 8 or 25 seconds, as described earlier.Setting the spread timeout
When you know your network or nodes may be unable to respond for a specific amount of time, you can increase the spread timeout period to longer than this time. Adjust the timeout to the period of time the node may be unable to respond, plus an additional 5 seconds as a safety margin.
For example, if you know Azure's memory-preserving maintenance can pause your VMs for up to 30 seconds, set the spread timeout to 35 seconds.
If you do not know exactly how long network or node disruptions can last, you can try increasing the spread timeout gradually, until you see reduced instances of UP nodes leaving the database. Be as conservative with this setting as you can.
Important
Vertica cannot react to a node going down or being shut down improperly before the timeout period has elapsed. Changing spread’s timeout to a value too high can result in longer query restarts if a node goes down.You can see the current setting of the spread timeout by querying system tableSPREAD_STATE:
=> SELECT * FROM V_MONITOR.SPREAD_STATE;
node_name | token_timeout
------------------+---------------
v_vmart_node0003 | 8000
v_vmart_node0001 | 8000
v_vmart_node0002 | 8000
(3 rows)
You change the spread timeout calling the meta-function SET_SPREAD_OPTION to set the token timeout to a new value. This value is a string, and sets the timeout in milliseconds.
Important
Changing spread settings withSET_SPREAD_OPTION
has minor impact on your cluster as it pauses while the new settings are propagated across the entire cluster.
This example sets the timeout to 35 seconds (35000ms):
=> SELECT SET_SPREAD_OPTION( 'TokenTimeout', '35000');
NOTICE 9003: Spread has been notified about the change
SET_SPREAD_OPTION
--------------------------------------------------------
Spread option 'TokenTimeout' has been set to '35000'.
(1 row)
=> SELECT * FROM V_MONITOR.SPREAD_STATE;
node_name | token_timeout
------------------+---------------
v_vmart_node0001 | 35000
v_vmart_node0002 | 35000
v_vmart_node0003 | 35000
(3 rows);