Health Watchdog

Health Watchdog is an automated mechanism that detects bad health conditions on the server and blocks queries while those conditions last.

A high concurrent load on the database can put the server into a bad health state. Health Watchdog is designed to mitigate the bad health state by doing the following:

  • Detecting the bad health state.

  • Blocking DDL/DML transactions so that they do not add to the bad state.

  • Once the bad health state has been mitigated, allowing all blocked transactions to proceed.
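
The detect, block, and release cycle described above can be pictured as a gate in front of DDL/DML transactions. The following Python sketch is a toy model of that idea only. It is not the server's implementation: the class and method names (HealthGate, submit, set_health) are hypothetical, and releasing blocked transactions in arrival order is an assumption, since the text does not specify an order.

```python
from collections import deque


class HealthGate:
    """Toy model of the detect / block / release cycle (hypothetical names)."""

    def __init__(self):
        self.healthy = True
        self.blocked = deque()  # transactions held while health is bad
        self.executed = []      # transactions that were allowed to run

    def submit(self, txn):
        # Blocking step: while health is bad, hold the transaction
        # instead of letting it add to the load.
        if self.healthy:
            self.executed.append(txn)
        else:
            self.blocked.append(txn)

    def set_health(self, healthy):
        # Detection step: a detector reports the current health state.
        self.healthy = healthy
        # Release step: once health is restored, let every blocked
        # transaction proceed (arrival order is an assumption).
        if healthy:
            while self.blocked:
                self.executed.append(self.blocked.popleft())


gate = HealthGate()
gate.submit("INSERT 1")       # runs immediately, server is healthy
gate.set_health(False)        # bad health state detected
gate.submit("INSERT 2")       # held
gate.submit("CREATE TABLE t") # held
gate.set_health(True)         # bad state mitigated, held work proceeds
print(gate.executed)          # ['INSERT 1', 'INSERT 2', 'CREATE TABLE t']
```

The sketch leaves out the timeout that applies to blocked transactions; it only shows that blocked work is deferred rather than discarded.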

Health Watchdog uses the following metrics and settings to check the server status and enact the mitigation:

  • Truncation Version Lag - tracks the catalog sync service and detects bad health conditions in the server when the current commit version is far ahead of the database truncation version. The default threshold is 500. It can be tuned using TruncationVersionLag.

  • GCLX Queue Bloat - tracks the GCLX queue size and stops GCLX requests when the server is overloaded with them. The default threshold is 100. It can be tuned using GCLXBlockParameter.

  • Mergeout Queue Bloat - tracks the TM queue size and stops DML transactions if the TM pool threads cannot keep up with the number of TM requests. The default threshold is 100. It can be tuned using MergeoutBlockParameter.

  • Watchdog Timeout Interval - the amount of time a transaction is blocked before it is timed out. The default is 5 minutes. It can be tuned using WatchdogTimeoutInterval.
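
To make the role of these thresholds concrete, here is a minimal Python sketch that compares sampled values against the defaults listed above. It is an illustration, not server code. The field and function names are hypothetical, the "strictly greater than" comparison is an assumption (the text does not say whether a value equal to the threshold counts as bad), and expressing the timeout in seconds is likewise an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class WatchdogSettings:
    # Default values come from the list above; the field names are illustrative.
    truncation_version_lag: int = 500
    gclx_block_parameter: int = 100
    mergeout_block_parameter: int = 100
    watchdog_timeout_seconds: int = 5 * 60  # 5 minutes


def bad_health_reasons(settings, version_lag, gclx_queue, tm_queue):
    """Return the metrics whose sampled value exceeds its threshold.

    Assumption: a metric is "bad" only when strictly greater than its threshold.
    """
    reasons = []
    if version_lag > settings.truncation_version_lag:
        reasons.append("truncation_version_lag")
    if gclx_queue > settings.gclx_block_parameter:
        reasons.append("gclx_queue_bloat")
    if tm_queue > settings.mergeout_block_parameter:
        reasons.append("mergeout_queue_bloat")
    return reasons


def has_timed_out(settings, blocked_seconds):
    # Assumption: a transaction times out once it has been blocked
    # for at least the full interval.
    return blocked_seconds >= settings.watchdog_timeout_seconds


settings = WatchdogSettings()
print(bad_health_reasons(settings, version_lag=620, gclx_queue=40, tm_queue=100))
# ['truncation_version_lag']
print(has_timed_out(settings, blocked_seconds=299))  # False
print(has_timed_out(settings, blocked_seconds=300))  # True
```

Lowering a threshold in this model makes blocking start earlier; raising it tolerates more lag or a longer queue before transactions are held.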

You can check the status of the server using check_cluster_health and the health_watchdog_blocked_transactions system table.