Prometheus integration

Vertica on Kubernetes integrates with Prometheus to scrape time series metrics about the VerticaDB operator and the Vertica server process. These metrics build a detailed model of your application over time. They provide performance and troubleshooting insights, and they support internal and external communication and service discovery in microservice and containerized architectures.

Prometheus requires that you set up targets, which are the metrics that you want to monitor. Each target is exposed on an endpoint, and Prometheus periodically scrapes that endpoint to collect target data. Vertica exports metrics and provides access methods for both the VerticaDB operator and the server process.

Server metrics

Vertica exports server metrics on port 8443 at the following endpoint:

https://host-address:8443/api-version/metrics

Only the superuser can authenticate to the HTTPS service, and the service accepts only mutual TLS (mTLS) authentication. The setup for both Vertica on Kubernetes and non-containerized Vertica environments is identical. For details, see HTTPS service.
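As a rough sketch of the mTLS requirement, the following shell helpers build the endpoint URL and scrape it with curl. The function names, the certificate file arguments, and the v1 value in the usage comment are illustrative assumptions rather than values from this page. Substitute the API version and certificate paths for your environment.

```shell
# Print the server metrics URL for a host and API version (port 8443 per this page).
# $1 = host address, $2 = API version.
vertica_server_metrics_url() {
  printf 'https://%s:8443/%s/metrics\n' "$1" "$2"
}

# Scrape the endpoint with a client key and certificate (mTLS) and verify the
# server against a CA bundle. All three file paths are supplied by the caller.
# Usage (hypothetical values):
#   scrape_server_metrics db.example.com v1 client.key client.crt ca.crt
scrape_server_metrics() {
  curl --silent --show-error --fail \
    --key "$3" --cert "$4" --cacert "$5" \
    "$(vertica_server_metrics_url "$1" "$2")"
}
```

The --fail option makes curl return a nonzero exit status on an HTTP error, which is convenient when the scrape runs from a script.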

Vertica on Kubernetes lets you set a custom port for the HTTPS service with the subclusters[i].verticaHTTPNodePort custom resource parameter. This parameter applies only when the subcluster uses the NodePort service type.

For request and response examples, see the /metrics endpoint description. For a list of available metrics, see Prometheus metrics.

Grafana dashboards

You can visualize Vertica server time series metrics with Grafana dashboards. Vertica dashboards that use a Prometheus data source are available at Grafana Dashboards.

You can also download the source for each dashboard from the vertica/grafana-dashboards repository.

Operator metrics

The VerticaDB operator supports the Operator SDK framework, which requires that an authorization proxy impose role-based-access control (RBAC) to access operator metrics over HTTPS. To increase flexibility, Vertica provides the following options to access the Prometheus /metrics endpoint:

  • HTTPS access: Meet operator SDK requirements and use a sidecar container as an RBAC proxy to authorize connections.

  • HTTP access: Expose the /metrics endpoint to external connections without RBAC. Any client with network access can read from /metrics.

  • Disable Prometheus entirely.

Vertica provides Helm chart parameters and YAML manifests to configure each option.

Prerequisites

HTTPS with RBAC

The operator SDK framework requires that operators use an authorization proxy for metrics access. Because the operator sends metrics to localhost only, Vertica meets these requirements with a sidecar container with localhost access that enforces RBAC.

RBAC rules are cluster-scoped, and the sidecar authorizes connections from clients associated with a service account that has the correct ClusterRole and ClusterRoleBindings. Vertica provides example manifests for these objects.

For additional details about ClusterRoles and ClusterRoleBindings, see the Kubernetes documentation.

Create RBAC rules

The following steps create the ClusterRole and ClusterRoleBindings objects that grant access to the /metrics endpoint to a non-Kubernetes resource such as Prometheus. Because RBAC rules are cluster-scoped, you must create or add to an existing ClusterRoleBinding:

  1. Create a ClusterRoleBinding that binds the role for the RBAC sidecar proxy with a service account:

    • Create a ClusterRoleBinding:

      $ kubectl create clusterrolebinding verticadb-operator-proxy-rolebinding \
          --clusterrole=verticadb-operator-proxy-role \
          --serviceaccount=namespace:serviceaccount
      
    • Add a service account to an existing ClusterRoleBinding:

      $ kubectl patch clusterrolebinding verticadb-operator-proxy-rolebinding \
          --type='json' \
          -p='[{"op": "add", "path": "/subjects/-", "value": {"kind": "ServiceAccount", "name": "serviceaccount","namespace": "namespace" } }]'
      
  2. Create a ClusterRoleBinding that binds the role for the non-Kubernetes object to the RBAC sidecar proxy service account:

    • Create a ClusterRoleBinding:

      $ kubectl create clusterrolebinding verticadb-operator-metrics-reader \
          --clusterrole=verticadb-operator-metrics-reader \
          --serviceaccount=namespace:serviceaccount \
          --group=system:authenticated
      
    • Bind the service account to an existing ClusterRoleBinding:

      $ kubectl patch clusterrolebinding verticadb-operator-metrics-reader \
          --type='json' \
          -p='[{"op": "add", "path": "/subjects/-", "value": {"kind": "ServiceAccount", "name": "serviceaccount","namespace": "namespace"} },{"op":"add","path":"/subjects/-","value":{"kind": "Group", "name": "system:authenticated"} }]'
      
      $ kubectl patch clusterrolebinding verticadb-operator-metrics-reader \
          --type='json' \
          -p='[{"op": "add", "path": "/subjects/-", "value": {"kind": "ServiceAccount", "name": "serviceaccount","namespace": "namespace" } }]'
      

When you install the Helm chart, the ClusterRole and ClusterRoleBindings are created automatically. By default, the prometheus.expose parameter is set to EnableWithProxy, which creates the service object and exposes the operator's /metrics endpoint.
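If you create the bindings manually instead, the single-line patch in step 2 is easy to mistype. As a sketch, the helper below (a hypothetical function name) assembles a two-operation JSON Patch for a given service account and namespace, so that each operation is a complete object before it reaches kubectl.

```shell
# Print a JSON Patch that appends a ServiceAccount subject and the
# system:authenticated Group subject to a ClusterRoleBinding.
# $1 = service account name, $2 = namespace.
metrics_reader_patch() {
  printf '[{"op":"add","path":"/subjects/-","value":{"kind":"ServiceAccount","name":"%s","namespace":"%s"}},' "$1" "$2"
  printf '{"op":"add","path":"/subjects/-","value":{"kind":"Group","name":"system:authenticated"}}]\n'
}

# Usage (requires cluster access):
#   kubectl patch clusterrolebinding verticadb-operator-metrics-reader \
#     --type='json' -p="$(metrics_reader_patch serviceaccount namespace)"
```

Printing the patch first also lets you inspect it before applying it to the cluster.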

For details about creating a sidecar container, see VerticaDB custom resource definition.

Service object

Vertica provides a service object verticadb-operator-metrics-service to access the Prometheus /metrics endpoint. The VerticaDB operator does not manage this service object. By default, the service object uses the ClusterIP service type to support RBAC.

Connect to the /metrics endpoint at port 8443 with the following path:

https://verticadb-operator-metrics-service.namespace.svc.cluster.local:8443/metrics

Bearer token authentication

Kubernetes authenticates requests to the API server with service account credentials. Each pod is associated with a service account and has the following credentials stored in the filesystem of each container in the pod:

  • Token at /var/run/secrets/kubernetes.io/serviceaccount/token

  • Certificate authority (CA) bundle at /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

Use these credentials to authenticate to the /metrics endpoint through the service object. You must use the credentials for the service account that you used to create the ClusterRoleBindings.

For example, the following cURL request accesses the /metrics endpoint. Include the --insecure option only if you do not want to verify the serving certificate:

$ curl --insecure --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://verticadb-operator-metrics-service.vertica:8443/metrics

For additional details about service account credentials, see the Kubernetes documentation.
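A minimal sketch of the same request as reusable shell helpers, intended to run inside a pod whose service account is bound as described above. The function names are illustrative, the credential paths are the ones listed on this page, and the vertica namespace in the usage comment is only an example.

```shell
SA_DIR=/var/run/secrets/kubernetes.io/serviceaccount

# Print the Authorization header for the token stored in a file.
# $1 = token file (defaults to the mounted service account token).
bearer_header() {
  printf 'Authorization: Bearer %s\n' "$(cat "${1:-$SA_DIR/token}")"
}

# Request /metrics through the service object with the bearer token.
# $1 = namespace where the operator is installed; any further arguments are
# passed to curl unchanged. If the proxy serves a self-signed certificate,
# verification against the mounted CA bundle is expected to fail, which is
# the case where --insecure applies.
# Usage: operator_metrics_https vertica --insecure
operator_metrics_https() {
  ns="$1"; shift
  curl --silent --show-error --fail \
    --cacert "$SA_DIR/ca.crt" \
    -H "$(bearer_header)" "$@" \
    "https://verticadb-operator-metrics-service.$ns.svc:8443/metrics"
}
```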

TLS client certificate authentication

Some environments might prevent you from authenticating to the /metrics endpoint with the service account token. For example, you might run Prometheus outside of Kubernetes. To allow external client connections to the /metrics endpoint, you have to supply the RBAC proxy sidecar with TLS certificates.

You must create a Secret that contains the certificates, and then use the prometheus.tlsSecret Helm chart parameter to pass the Secret to the RBAC proxy sidecar when you install the Helm chart. The following steps create the Secret and install the Helm chart:

  1. Create a Secret that contains the certificates:

    $ kubectl create secret generic metrics-tls --from-file=tls.key=/path/to/tls.key --from-file=tls.crt=/path/to/tls.crt --from-file=ca.crt=/path/to/ca.crt
    
  2. Install the Helm chart with prometheus.tlsSecret set to the Secret that you just created:

    $ helm install operator-name --namespace namespace --create-namespace vertica-charts/verticadb-operator \
      --set prometheus.tlsSecret=metrics-tls
    

    The prometheus.tlsSecret parameter forces the RBAC proxy to use the TLS certificates stored in the Secret. Otherwise, the RBAC proxy sidecar generates its own self-signed certificate.
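Because the Secret must carry all three files, it can help to check for them before running step 1. The following is a small sketch. The function name is hypothetical, and the file names (tls.key, tls.crt, ca.crt) are taken from the command in step 1.

```shell
# Return 0 only if a directory holds every file the metrics-tls Secret needs.
# Missing or empty files are reported on stderr. $1 = directory to check.
check_metrics_tls_files() {
  missing=0
  for f in tls.key tls.crt ca.crt; do
    if [ ! -s "$1/$f" ]; then
      echo "missing or empty: $1/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (requires cluster access):
#   check_metrics_tls_files /path/to && kubectl create secret generic metrics-tls \
#     --from-file=tls.key=/path/to/tls.key \
#     --from-file=tls.crt=/path/to/tls.crt \
#     --from-file=ca.crt=/path/to/ca.crt
```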

After you install the Helm chart, you can authenticate to the /metrics endpoint with the certificates in the Secret. For example:

$ curl --key tls.key --cert tls.crt --cacert ca.crt https://verticadb-operator-metrics-service.vertica.svc:8443/metrics

HTTP access

You might have an environment that does not require privileged access to Prometheus metrics. For example, you might run Prometheus outside of Kubernetes.

To allow external access to the /metrics endpoint with HTTP, set prometheus.expose to EnableWithoutAuth. For example:

$ helm install operator-name --namespace namespace --create-namespace vertica-charts/verticadb-operator \
    --set prometheus.expose=EnableWithoutAuth

Service object

Vertica provides a service object verticadb-operator-metrics-service to access the Prometheus /metrics endpoint. The VerticaDB operator does not manage this service object. By default, the service object uses the ClusterIP service type, so you must change the service type to allow external client access. The service object's fully qualified domain name (FQDN) is as follows:

verticadb-operator-metrics-service.namespace.svc.cluster.local

Connect to the /metrics endpoint at port 8443 with the following path:

http://verticadb-operator-metrics-service.namespace.svc.cluster.local:8443/metrics
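As a sketch, the helper below prints the service URL for either access mode, which can be handy when one script targets both HTTPS and HTTP deployments. The function name is an assumption. The host name pattern and port come from this page.

```shell
# Print the operator /metrics URL. $1 = scheme (http or https), $2 = namespace.
operator_metrics_url() {
  case "$1" in
    http|https) ;;
    *) echo "scheme must be http or https" >&2; return 1 ;;
  esac
  printf '%s://verticadb-operator-metrics-service.%s.svc.cluster.local:8443/metrics\n' "$1" "$2"
}

# Usage from inside the cluster with prometheus.expose=EnableWithoutAuth:
#   curl --silent --fail "$(operator_metrics_url http namespace)"
```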

Prometheus operator integration (optional)

Vertica on Kubernetes integrates with the Prometheus operator, which provides custom resources (CRs) that simplify targeting metrics. Vertica supports the ServiceMonitor CR that discovers the VerticaDB operator automatically, and authenticates requests with a bearer token.

The ServiceMonitor CR is available as a release artifact in our GitHub repository. See Helm chart parameters for details about the prometheus.createServiceMonitor parameter.

Disabling Prometheus

To disable Prometheus, set the prometheus.expose Helm chart parameter to Disable:

$ helm install operator-name --namespace namespace --create-namespace vertica-charts/verticadb-operator \
    --set prometheus.expose=Disable

For details about Helm install commands, see Installing the VerticaDB operator.

Metrics

The following table describes the available VerticaDB operator metrics:

Name | Type | Description
controller_runtime_active_workers | gauge | Number of currently used workers per controller.
controller_runtime_max_concurrent_reconciles | gauge | Maximum number of concurrent reconciles per controller.
controller_runtime_reconcile_errors_total | counter | Total number of reconciliation errors per controller.
controller_runtime_reconcile_time_seconds | histogram | Length of time per reconciliation per controller.
controller_runtime_reconcile_total | counter | Total number of reconciliations per controller.
controller_runtime_webhook_latency_seconds | histogram | Histogram of the latency of processing admission requests.
controller_runtime_webhook_requests_in_flight | gauge | Current number of admission requests being served.
controller_runtime_webhook_requests_total | counter | Total number of admission requests by HTTP status code.
go_gc_duration_seconds | summary | A summary of the pause duration of garbage collection cycles.
go_goroutines | gauge | Number of goroutines that currently exist.
go_info | gauge | Information about the Go environment.
go_memstats_alloc_bytes | gauge | Number of bytes allocated and still in use.
go_memstats_alloc_bytes_total | counter | Total number of bytes allocated, even if freed.
go_memstats_buck_hash_sys_bytes | gauge | Number of bytes used by the profiling bucket hash table.
go_memstats_frees_total | counter | Total number of frees.
go_memstats_gc_sys_bytes | gauge | Number of bytes used for garbage collection system metadata.
go_memstats_heap_alloc_bytes | gauge | Number of heap bytes allocated and still in use.
go_memstats_heap_idle_bytes | gauge | Number of heap bytes waiting to be used.
go_memstats_heap_inuse_bytes | gauge | Number of heap bytes that are in use.
go_memstats_heap_objects | gauge | Number of allocated objects.
go_memstats_heap_released_bytes | gauge | Number of heap bytes released to OS.
go_memstats_heap_sys_bytes | gauge | Number of heap bytes obtained from system.
go_memstats_last_gc_time_seconds | gauge | Number of seconds since 1970 of last garbage collection.
go_memstats_lookups_total | counter | Total number of pointer lookups.
go_memstats_mallocs_total | counter | Total number of mallocs.
go_memstats_mcache_inuse_bytes | gauge | Number of bytes in use by mcache structures.
go_memstats_mcache_sys_bytes | gauge | Number of bytes used for mcache structures obtained from system.
go_memstats_mspan_inuse_bytes | gauge | Number of bytes in use by mspan structures.
go_memstats_mspan_sys_bytes | gauge | Number of bytes used for mspan structures obtained from system.
go_memstats_next_gc_bytes | gauge | Number of heap bytes when next garbage collection will take place.
go_memstats_other_sys_bytes | gauge | Number of bytes used for other system allocations.
go_memstats_stack_inuse_bytes | gauge | Number of bytes in use by the stack allocator.
go_memstats_stack_sys_bytes | gauge | Number of bytes obtained from system for stack allocator.
go_memstats_sys_bytes | gauge | Number of bytes obtained from system.
go_threads | gauge | Number of OS threads created.
process_cpu_seconds_total | counter | Total user and system CPU time spent in seconds.
process_max_fds | gauge | Maximum number of open file descriptors.
process_open_fds | gauge | Number of open file descriptors.
process_resident_memory_bytes | gauge | Resident memory size in bytes.
process_start_time_seconds | gauge | Start time of the process since unix epoch in seconds.
process_virtual_memory_bytes | gauge | Virtual memory size in bytes.
process_virtual_memory_max_bytes | gauge | Maximum amount of virtual memory available in bytes.
vertica_cluster_restart_attempted_total | counter | The number of times we attempted a full cluster restart.
vertica_cluster_restart_failed_total | counter | The number of times we failed when attempting a full cluster restart.
vertica_cluster_restart_seconds | histogram | The number of seconds it took to do a full cluster restart.
vertica_nodes_restart_attempted_total | counter | The number of times we attempted to restart down nodes.
vertica_nodes_restart_failed_total | counter | The number of times we failed when trying to restart down nodes.
vertica_nodes_restart_seconds | histogram | The number of seconds it took to restart down nodes.
vertica_running_nodes_count | gauge | The number of nodes that have a running pod associated with it.
vertica_subclusters_count | gauge | The number of subclusters that exist.
vertica_total_nodes_count | gauge | The number of nodes that currently exist.
vertica_up_nodes_count | gauge | The number of nodes that have vertica running and can accept connections.
vertica_upgrade_total | counter | The number of times the operator performed an upgrade caused by an image change.
workqueue_adds_total | counter | Total number of adds handled by workqueue.
workqueue_depth | gauge | Current depth of workqueue.
workqueue_longest_running_processor_seconds | gauge | How many seconds has the longest running processor for workqueue been running.
workqueue_queue_duration_seconds | histogram | How long in seconds an item stays in workqueue before being requested.
workqueue_retries_total | counter | Total number of retries handled by workqueue.
workqueue_unfinished_work_seconds | gauge | How many seconds of work has been done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.
workqueue_work_duration_seconds | histogram | How long in seconds processing an item from workqueue takes.
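
To show how these metrics might be consumed outside of Prometheus, the following sketch sums the samples of one metric from the text exposition format that the /metrics endpoint returns. The function name is hypothetical, and the sketch assumes that samples carry no trailing timestamp, so the last field on each line is the value.

```shell
# Sum every sample of one metric from Prometheus text exposition on stdin.
# $1 = metric name. Exits nonzero if the metric does not appear.
metric_sum() {
  awk -v name="$1" '
    # Skip HELP and TYPE comment lines.
    /^#/ { next }
    NF >= 2 {
      # Strip an optional {label="value"} set to recover the bare metric name.
      metric = $1
      sub(/[{].*/, "", metric)
      if (metric == name) { total += $NF; found = 1 }
    }
    END { if (!found) exit 1; print total }
  '
}

# Usage with a saved scrape (file name is an example):
#   metric_sum vertica_up_nodes_count < metrics.txt
```

Comparing the result for vertica_up_nodes_count with the result for vertica_total_nodes_count gives a quick check of how many nodes are accepting connections.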