Adjusting Spread Daemon timeouts for virtual environments
Vertica relies on Spread daemons to pass messages between database nodes. Occasionally, nodes fail to respond to messages within the specified Spread timeout. These failures might be caused by spikes in network latency or brief pauses in the node's VM—for example, scheduled Azure maintenance timeouts. In either case, Vertica assumes that the non-responsive nodes are down and starts to remove them from the database, even though they might still be running. You can address this issue by adjusting the Spread timeout as needed.
Adjusting the Spread timeout
By default, the Spread timeout depends on the number of configured Spread segments:
Configured Spread segments | Default timeout |
---|---|
1 | 8 seconds |
> 1 | 25 seconds |
Important
If you deploy your Vertica cluster with Azure Marketplace, the default Spread timeout is set to 35 seconds. If you manually create your cluster in Azure, the default Spread timeout is set to 8 or 25 seconds.
If the Spread timeout is likely to elapse before the network or database nodes can respond, increase the timeout to the maximum length of non-responsive time plus five seconds. For example, if Azure memory-preserving maintenance pauses node VMs for up to 30 seconds, set the Spread timeout to 35 seconds.
If you are unsure how long network or node disruptions are liable to last, gradually increase the Spread timeout until fewer instances of UP nodes leave the database.
Important
Vertica cannot react to a node going down or being shut down improperly before the timeout period elapses. Changing Spread’s timeout to a value too high can result in longer query restarts if a node goes down.
To see the current setting of the Spread timeout, query system table SPREAD_STATE. For example, the following query shows that the current timeout setting (token_timeout) is set to 8000ms:
=> SELECT * FROM V_MONITOR.SPREAD_STATE;
node_name | token_timeout
------------------+---------------
v_vmart_node0003 | 8000
v_vmart_node0001 | 8000
v_vmart_node0002 | 8000
(3 rows)
To change the Spread timeout, call the meta-function SET_SPREAD_OPTION and set the token timeout to a new value. The following example sets the timeout to 35000ms (35 seconds):
=> SELECT SET_SPREAD_OPTION( 'TokenTimeout', '35000');
NOTICE 9003: Spread has been notified about the change
SET_SPREAD_OPTION
--------------------------------------------------------
Spread option 'TokenTimeout' has been set to '35000'.
(1 row)
=> SELECT * FROM V_MONITOR.SPREAD_STATE;
node_name | token_timeout
------------------+---------------
v_vmart_node0001 | 35000
v_vmart_node0002 | 35000
v_vmart_node0003 | 35000
(3 rows)
Note
Changing Spread settings with SET_SPREAD_OPTION has a minor impact on your cluster, which pauses while the new settings are propagated across the cluster. Because of this delay, changes to the Spread timeout are not immediately visible in system table SPREAD_STATE.
Azure maintenance and Spread timeouts
Azure scheduled maintenance on virtual machines might pause nodes longer than the Spread timeout period. If so, Vertica is liable to view nodes that do not respond to Spread messages as down and remove them from the database.
The length of Azure maintenance tasks is usually well-defined. For example, memory-preserving updates can pause a VM for up to 30 seconds while performing maintenance on the system hosting the VM. This pause does not disrupt the node, which resumes normal operation after maintenance is complete. To prevent Vertica from removing nodes while they undergo Azure maintenance, adjust the Spread timeout as needed.
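The sizing rule described earlier (longest expected pause plus five seconds) can be sketched as a small helper. This is an illustrative Python sketch, not part of Vertica: the function names `recommended_token_timeout_ms` and `set_spread_option_statement` are hypothetical. It only builds the SET_SPREAD_OPTION statement as text, which you would then run yourself in vsql.

```python
def recommended_token_timeout_ms(max_pause_seconds: float,
                                 margin_seconds: float = 5.0) -> int:
    """Return the longest expected pause plus a safety margin, in milliseconds.

    The five-second default margin follows the guidance in this topic;
    the function itself is a hypothetical helper, not a Vertica API.
    """
    if max_pause_seconds < 0 or margin_seconds < 0:
        raise ValueError("durations must be non-negative")
    return round((max_pause_seconds + margin_seconds) * 1000)


def set_spread_option_statement(timeout_ms: int) -> str:
    """Build the SET_SPREAD_OPTION call for the given timeout (text only)."""
    return f"SELECT SET_SPREAD_OPTION('TokenTimeout', '{int(timeout_ms)}');"


if __name__ == "__main__":
    # Azure memory-preserving maintenance can pause a VM for up to 30 seconds.
    timeout_ms = recommended_token_timeout_ms(30)
    print(set_spread_option_statement(timeout_ms))
```

For a 30-second maintenance pause, the helper yields 35000ms, which matches the value used in the SET_SPREAD_OPTION example in this topic.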