Troubleshooting your Kubernetes cluster
These tips can help you avoid issues related to your Vertica on Kubernetes deployment and troubleshoot any problems that occur.
Download the kubectl command line tool to debug your Kubernetes resources.
Helm install failure
When you install the VerticaDB operator and admission controller Helm chart, the helm install
command might return the following error:
$ helm install vdb-op vertica-charts/verticadb-operator
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: [unable to recognize "": no matches for kind "Certificate" in version "cert-manager.io/v1", unable to recognize "": no matches for kind "Issuer" in version "cert-manager.io/v1"]
The error indicates that you have not met the TLS prerequisite for the admission controller webhook. To resolve this issue, install cert-manager or configure custom certificates. The following steps install cert-manager.
-
Install the cert-manager YAML manifest:
$ kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.5.3/cert-manager.yaml
-
Verify the cert-manager installation.
If you try to install the Helm chart immediately after you install cert-manager, you might receive the following error:
$ helm install vdb-op vertica-charts/verticadb-operator Error: failed to create resource: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.232.154:443: connect: connection refused
You receive this error because cert-manager needs time to create its pods and register the webhook with the cluster. Wait a few minutes, and then verify the cert-manager installation with the following command:
$ kubectl get pods --namespace cert-manager NAME READY STATUS RESTARTS AGE cert-manager-7dd5854bb4-skks7 1/1 Running 5 12d cert-manager-cainjector-64c949654c-9nm2z 1/1 Running 5 12d cert-manager-webhook-6bdffc7c9d-b7r2p 1/1 Running 5 12d
For additional details about cert-manager install verification, see the cert-manager documentation.
-
After you verify the cert-manager installation, you must uninstall the Helm chart and then reinstall:
$ helm uninstall vdb-op $ helm install vdb-op vertica-charts/verticadb-operator
For additional information, see Installing the Vertica DB operator.
Custom certificate helm install error
If you use custom certificates when you install the operator with the Helm chart, the helm install
or kubectl apply
command might return an error similar to the following:
$ kubectl apply -f ../operatorcrd.yaml
Error from server (InternalError): error when creating "../operatorcrd.yaml": Internal error occurred: failed calling webhook "mverticadb.kb.io": Post "https://verticadb-operator-webhook-service.namespace.svc:443/mutate-vertica-com-v1beta1-verticadb?timeout=10s": x509: certificate is valid for ip-10-0-21-169.ec2.internal, test-bastion, not verticadb-operator-webhook-service.default.svc
You receive this error when the TLS key's Domain Name System (DNS) or Subject Alternate Name (SAN) is incorrect. To correct this error, define the DNS and SAN in a configuration file in the following format:
commonName = verticadb-operator-webhook-service.namespace.svc
...
[alt_names]
DNS.1 = verticadb-operator-webhook-service.namespace.svc
DNS.2 = verticadb-operator-webhook-service.namespace.svc.cluster.local
For additional details, see Installing the Vertica DB operator.
Verify updates to a custom resource
Because the operator takes time to perform tasks, updates to the custom resource are not effective immediately. Use the kubectl command line tool to verify that changes are applied.
You can use the kubectl wait command to wait for a specified condition. For example, the operator uses the ImageChangeInProgress condition to provide an upgrade status. After you begin the image version upgrade, wait until the operator acknowledges the upgrade and sets this condition to True:
$ kubectl wait --for=condition=ImageChangeInProgress=True vdb/cluster-name –-timeout=180s
After the upgrade begins, you can wait until the operator leaves upgrade mode and sets this condition to False:
$ kubectl wait --for=condition=ImageChangeInProgress=False vdb/cluster-name –-timeout=800s
For more information about kubectl wait, see the kubectl reference documentation.
Pods are running but the database is not ready
When you check the pods in your cluster, the pods are running but the database is not ready:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
vertica-crd-sc1-0 0/1 Running 0 12m
vertica-crd-sc1-1 0/1 Running 1 12m
vertica-crd-sc1-2 0/1 Running 0 12m
verticadb-operator-controller-manager-5d9cdc9b8-kw9nv 2/2 Running 0 24m
To find the root cause of the issue, use kubectl logs
to check the operator manager. The following example shows that the communal storage bucket does not exist:
$ kubectl logs -l app.kubernetes.io/name=verticadb-operator -c manager -f
2021-08-04T20:03:00.289Z INFO controllers.VerticaDB ExecInPod entry {"verticadb": "default/vertica-crd", "pod": {"namespace": "default", "name": "vertica-crd-sc1-0"}, "command": "bash -c ls -l /opt/vertica/config/admintools.conf && grep '^node\\|^v_\\|^host' /opt/vertica/config/admintools.conf "}
2021-08-04T20:03:00.369Z INFO controllers.VerticaDB ExecInPod stream {"verticadb": "default/vertica-crd", "pod": {"namespace": "default", "name": "vertica-crd-sc1-0"}, "err": null, "stdout": "-rw-rw-r-- 1 dbadmin verticadba 1243 Aug 4 20:00 /opt/vertica/config/admintools.conf\nhosts = 10.244.1.5,10.244.2.4,10.244.4.6\nnode0001 = 10.244.1.5,/data,/data\nnode0002 = 10.244.2.4,/data,/data\nnode0003 = 10.244.4.6,/data,/data\n", "stderr": ""}
2021-08-04T20:03:00.369Z INFO controllers.VerticaDB ExecInPod entry {"verticadb": "default/vertica-crd", "pod": {"namespace": "default", "name": "vertica-crd-sc1-0"}, "command": "/opt/vertica/bin/admintools -t create_db --skip-fs-checks --hosts=10.244.1.5,10.244.2.4,10.244.4.6 --communal-storage-location=s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c --communal-storage-params=/home/dbadmin/auth_parms.conf --sql=/home/dbadmin/post-db-create.sql --shard-count=12 --depot-path=/depot --database verticadb --force-cleanup-on-failure --noprompt --password ******* "}
2021-08-04T20:03:00.369Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"VerticaDB","namespace":"default","name":"vertica-crd","uid":"26100df1-93e5-4e64-b665-533e14abb67c","apiVersion":"vertica.com/v1beta1","resourceVersion":"11591"}, "reason": "CreateDBStart", "message": "Calling 'admintools -t create_db'"}
2021-08-04T20:03:17.051Z INFO controllers.VerticaDB ExecInPod stream {"verticadb": "default/vertica-crd", "pod": {"namespace": "default", "name": "vertica-crd-sc1-0"}, "err": "command terminated with exit code 1", "stdout": "Default depot size in use\nDistributing changes to cluster.\n\tCreating database verticadb\nBootstrap on host 10.244.1.5 return code 1 stdout '' stderr 'Logged exception in writeBufferToFile: RecvFiles failed in closing file [s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/verticadb_rw_access_test.txt]: The specified bucket does not exist. Writing test data to file s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/verticadb_rw_access_test.txt failed.\\nTesting rw access to communal location s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/ failed\\n'\n\nError: Bootstrap on host 10.244.1.5 return code 1 stdout '' stderr 'Logged exception in writeBufferToFile: RecvFiles failed in closing file [s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/verticadb_rw_access_test.txt]: The specified bucket does not exist. Writing test data to file s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/verticadb_rw_access_test.txt failed.\\nTesting rw access to communal location s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/ failed\\n'\n\n", "stderr": ""}
2021-08-04T20:03:17.051Z INFO controllers.VerticaDB aborting reconcile of VerticaDB {"verticadb": "default/vertica-crd", "result": {"Requeue":true,"RequeueAfter":0}, "err": null}
2021-08-04T20:03:17.051Z DEBUG controller-runtime.manager.events Warning {"object": {"kind":"VerticaDB","namespace":"default","name":"vertica-crd","uid":"26100df1-93e5-4e64-b665-533e14abb67c","apiVersion":"vertica.com/v1beta1","resourceVersion":"11591"}, "reason": "S3BucketDoesNotExist", "message": "The bucket in the S3 path 's3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c' does not exist"}
Create an S3 bucket for the cluster:
$ S3_BUCKET=newbucket
$ S3_CLUSTER_IP=$(kubectl get svc | grep minio | head -1 | awk '{print $3}')
$ export AWS_ACCESS_KEY_ID=minio
$ export AWS_SECRET_ACCESS_KEY=minio123
$ aws s3 mb s3://$S3_BUCKET --endpoint-url http://$S3_CLUSTER_IP
make_bucket: newbucket
Use kubectl get pods
to verify that the cluster uses the new S3 bucket and the database is ready:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
minio-ss-0-0 1/1 Running 0 18m
minio-ss-0-1 1/1 Running 0 18m
minio-ss-0-2 1/1 Running 0 18m
minio-ss-0-3 1/1 Running 0 18m
vertica-crd-sc1-0 1/1 Running 0 20m
vertica-crd-sc1-1 1/1 Running 0 20m
vertica-crd-sc1-2 1/1 Running 0 20m
verticadb-operator-controller-manager-5d9cdc9b8-kw9nv 2/2 Running 0 63m
Database is not available
After you create a custom resource instance, the database is not available. The kubectl get
custom-resource
command does not display information:
$ kubectl get vdb
NAME AGE SUBCLUSTERS INSTALLED DBADDED UP
vertica-crd 4s
Use kubectl describe
custom-resource
to check the events for the pods to identify any issues:
$ kubectl describe vdb
Name: vertica-crd
Namespace: default
Labels: <none>
Annotations: <none>
API Version: vertica.com/v1beta1
Kind: VerticaDB
Metadata:
...
Superuser Password Secret: su-passwd
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning SuperuserPasswordSecretNotFound 5s (x12 over 15s) verticadb-operator Secret for superuser password 'su-passwd' was not found
In this circumstance, the custom resource uses a Secret named su-passwd
to store the Superuser Password Secret
, but there is no such Secret available. Create a Secret named su-passwd
to store the Secret:
$ kubectl create secret generic su-passwd --from-literal=password=sup3rs3cr3t
secret/su-passwd created
Use kubectl get
custom-resource
to verify the issue is resolved:
$ kubectl get vdb
NAME AGE SUBCLUSTERS INSTALLED DBADDED UP
vertica-crd 89s 1 0 0 0
Image pull failure
You receive an ImagePullBackOff error when you deploy a Vertica cluster with Helm charts, but you do not pre-pull the Vertica image from the local registry server:
$ kubectl describe pod pod-name-0
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
...
Warning Failed 2m32s kubelet Failed to pull image "k8s-rhel7-01:5000/vertica-k8s:default-1": rpc error: code = Unknown desc = context canceled
Warning Failed 2m32s kubelet Error: ErrImagePull
Normal BackOff 2m32s kubelet Back-off pulling image "k8s-rhel7-01:5000/vertica-k8s:default-1"
Warning Failed 2m32s kubelet Error: ImagePullBackOff
Normal Pulling 2m18s (x2 over 4m22s) kubelet Pulling image "k8s-rhel7-01:5000/vertica-k8s:default-1"
This occurs because the Vertica image size is too big to pull from the registry while deploying the Vertica cluster. Execute the following command on a Kubernetes host:
$ docker image list | grep vertica-k8s
k8s-rhel7-01:5000/vertica-k8s default-1 2d6f5d3d90d6 9 days ago 1.55GB
The solve this issue, complete one of the following:
-
Pull the Vertica images on each node before creating the Vertica StatefulSet:
$ NODES=`kubectl get nodes | grep -v NAME | awk '{print $1}'` $ for node in $NODES; do ssh $node docker pull $DOCKER_REGISTRY:5000/vertica-k8s:$K8S_TAG; done
-
Use the reduced-size vertica/vertica-k8s:latest image for the Vertica server.
Pending pods due to insufficient CPU
If your host nodes do not have enough resources to fulfill the resource request from a pod, the pod stays in pending status.
Note
As a best practice, do not request the maximum amount of resources available on a host node to leave resources for other processes on the host node.In the following example, the pod requests 40 CPUs on the host node, and the pod stays in Pending:
$ kubectl describe pod cluster-vertica-defaultsubcluster-0
...
Status: Pending
...
Containers:
server:
Image: docker.io/library/vertica-k8s:default-1
Ports: 5433/TCP, 5434/TCP, 22/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Command:
/opt/vertica/bin/docker-entrypoint.sh
restart-vertica-node
Limits:
memory: 200Gi
Requests:
cpu: 40
memory: 200Gi
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3h20m default-scheduler 0/5 nodes are available: 5 Insufficient cpu.
To confirm the resources available on the host node. The following command confirms that the host node has only 40 allocatable CPUs:
$ kubectl describe node host-node-1
...
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Sat, 20 Mar 2021 22:39:10 -0400 Sat, 20 Mar 2021 13:07:02 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sat, 20 Mar 2021 22:39:10 -0400 Sat, 20 Mar 2021 13:07:02 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sat, 20 Mar 2021 22:39:10 -0400 Sat, 20 Mar 2021 13:07:02 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sat, 20 Mar 2021 22:39:10 -0400 Sat, 20 Mar 2021 13:07:12 -0400 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.19.0.5
Hostname: eng-g9-191
Capacity:
cpu: 40
ephemeral-storage: 285509064Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263839236Ki
pods: 110
Allocatable:
cpu: 40
ephemeral-storage: 285509064Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263839236Ki
pods: 110
...
Non-terminated Pods: (3 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default cluster-vertica-defaultsubcluster-0 38 (95%) 0 (0%) 200Gi (79%) 200Gi (79%) 51m
kube-system kube-flannel-ds-8brv9 100m (0%) 100m (0%) 50Mi (0%) 50Mi (0%) 9h
kube-system kube-proxy-lgjhp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 9h
...
To correct this issue, reduce the resource.requests
in the subcluster to values lower than the maximum allocatable CPUs. The following example uses a YAML-formatted file named patch.yaml
to lower the resource requests for the pod:
$ cat patch.yaml
spec:
subclusters:
- name: defaultsubcluster
resources:
requests:
memory: 238Gi
cpu: "38"
limits:
memory: 238Gi
$ kubectl patch vdb cluster-vertica –-type=merge --patch “$(cat patch.yaml)”
verticadb.vertica.com/cluster-vertica patched
Adding and testing the vlogger sidecar
Vertica provides the vlogger image that sends logs from vertica.log
to standard output on the host node for log aggregation.
To add the sidecar to the CR, add an element to the spec.sidecars
definition:
spec:
...
sidecars:
- name: vlogger
image: vertica/vertica-logger:1.0.0
To test the sidecar, run the following command and verify that it returns logs:
$ kubectl logs pod-name -c vlogger
2021-12-08 14:39:08.538 DistCall Dispatch:0x7f3599ffd700-c000000000997e [Txn
2021-12-08 14:40:48.923 INFO New log
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> Log /data/verticadb/v_verticadb_node0002_catalog/vertica.log opened; #1
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> Processing command line: /opt/vertica/bin/vertica -D /data/verticadb/v_verticadb_node0002_catalog -C verticadb -n v_verticadb_node0002 -h 10.20.30.40 -p 5433 -P 4803 -Y ipv4
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> Starting up Vertica Analytic Database v11.0.2-20211201
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO>
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> vertica(v11.0.2) built by @re-docker5 from master@a44ffabdf3f05e8d104426506b088192f741c485 on 'Wed Dec 1 06:10:34 2021' $BuildId$
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> CPU architecture: x86_64
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> 64-bit Optimized Build
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> Compiler Version: 7.3.1 20180303 (Red Hat 7.3.1-5)
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> LD_LIBRARY_PATH=/opt/vertica/lib
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> LD_PRELOAD=
2021-12-08 14:40:48.925 Main Thread:0x7fbbe2cf6280 <LOG> @v_verticadb_node0002: 00000/5081: Total swap memory used: 0
2021-12-08 14:40:48.925 Main Thread:0x7fbbe2cf6280 <LOG> @v_verticadb_node0002: 00000/4435: Process size resident set: 28651520
2021-12-08 14:40:48.925 Main Thread:0x7fbbe2cf6280 <LOG> @v_verticadb_node0002: 00000/5075: Total Memory free + cache: 59455180800
2021-12-08 14:40:48.925 Main Thread:0x7fbbe2cf6280 [Txn] <INFO> Looking for catalog at: /data/verticadb/v_verticadb_node0002_catalog/Catalog
...
Cannot find CPU metrics with VerticaAutoscaler
You might notice that your VerticaAutoScaler is not scaling correctly according to CPU utilization:
$ kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
autoscaler-name VerticaAutoscaler/autoscaler-name <unknown>/50% 3 12 0 19h
$ kubectl describe hpa
Warning: autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+, unavailable in v1.26+; use autoscaling/v2 HorizontalPodAutoscaler
Name: autoscaler-name
Namespace: namespace
Labels: <none>
Annotations: <none>
CreationTimestamp: Thu, 12 May 2022 10:25:02 -0400
Reference: VerticaAutoscaler/autoscaler-name
Metrics: ( current / target )
resource cpu on pods (as a percentage of request): <unknown> / 50%
Min replicas: 3
Max replicas: 12
VerticaAutoscaler pods: 3 current / 0 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetResourceMetric 7s horizontal-pod-autoscaler failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
Warning FailedComputeMetricsReplicas 7s horizontal-pod-autoscaler invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
You receive this error because the metrics server is not installed:
$ kubectl top nodes
error: Metrics API not available
To install the metrics server:
-
Download the components.yaml file:
$ kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
-
Optionally, disable TLS:
$ if ! grep kubelet-insecure-tls components.yaml; then sed -i 's/- args:/- args:\n - --kubelet-insecure-tls/' components.yaml;
-
Apply the YAML file:
$ kubectl apply -f components.yaml
-
Verify that the metrics server is running:
$ kubectl get svc metrics-server -n namespace NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE metrics-server ClusterIP 10.105.239.175 <none> 443/TCP 19h
CPU request error with VerticaAutoscaler
You might receive an error that states:
failed to get cpu utilization: missing request for cpu
You get this error because you must set resource limits on all containers, including sidecar containers. To correct this error:
-
Verify the error:
$ kubectl get hpa NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE autoscaler-name VerticaAutoscaler/autoscaler-name <unknown>/50% 3 12 0 19h $ kubectl describe hpa Warning: autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+, unavailable in v1.26+; use autoscaling/v2 HorizontalPodAutoscaler Name: autoscaler-name Namespace: namespace Labels: <none> Annotations: <none> CreationTimestamp: Thu, 12 May 2022 15:58:31 -0400 Reference: VerticaAutoscaler/autoscaler-name Metrics: ( current / target ) resource cpu on pods (as a percentage of request): <unknown> / 50% Min replicas: 3 Max replicas: 12 VerticaAutoscaler pods: 3 current / 0 desired Conditions: Type Status Reason Message ---- ------ ------ ------- AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get cpu utilization: missing request for cpu Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedGetResourceMetric 4s (x5 over 64s) horizontal-pod-autoscaler failed to get cpu utilization: missing request for cpu Warning FailedComputeMetricsReplicas 4s (x5 over 64s) horizontal-pod-autoscaler invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: missing request for cpu
-
Add resource limits to the CR:
$ cat /tmp/vdb.yaml apiVersion: vertica.com/v1beta1 kind: VerticaDB metadata: name: vertica-vdb spec: sidecars: - name: vlogger image: vertica/vertica-logger:latest resources: requests: memory: "100Mi" cpu: "100m" limits: memory: "100Mi" cpu: "100m" communal: credentialSecret: communal-creds endpoint: https://endpoint path: s3://bucket-location dbName: verticadb image: vertica/vertica-k8s:latest subclusters: - isPrimary: true name: sc1 resources: requests: memory: "4Gi" cpu: 2 limits: memory: "4Gi" cpu: 2 serviceType: ClusterIP serviceName: sc1 size: 3 upgradePolicy: Auto
-
Apply the update:
$ kubectl apply -f /tmp/vdb.yaml verticadb.vertica.com/vertica-vdb created
When you set a new CPU resource limit, Kubernetes reschedules each pod in the StatefulSet in a rolling update until all pods have the updated CPU resource limit.