Troubleshooting your Kubernetes cluster

These tips can help you avoid issues related to your Vertica on Kubernetes deployment and troubleshoot any problems that occur.

Download the kubectl command line tool to debug your Kubernetes resources.

General cluster and database

Inspect objects to diagnose issues

When you deploy a custom resource (CR), you might encounter a variety of issues. To pinpoint an issue, use the following commands to inspect the objects that the CR creates:

kubectl get returns basic information about deployed objects:

$ kubectl get pods -n namespace
$ kubectl get statefulset -n namespace
$ kubectl get pvc -n namespace
$ kubectl get event

kubectl describe returns detailed information about deployed objects:

$ kubectl describe pod pod-name -n namespace
$ kubectl describe statefulset name -n namespace
$ kubectl describe custom-resource-name -n namespace
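
When you inspect the same namespace repeatedly, the kubectl get commands above could be wrapped in a small shell function. The following is a minimal sketch: the function name inspect_cr_objects is hypothetical and not part of kubectl or the operator, and the sketch assumes that kubectl is already configured for your cluster:

```shell
# Hypothetical helper: print basic information about the objects that a
# custom resource creates in one namespace.
# Usage: inspect_cr_objects namespace
inspect_cr_objects() (
  ns="$1"
  for kind in pods statefulset pvc events; do
    echo "== $kind in namespace $ns =="
    # Keep going if one resource type cannot be listed.
    kubectl get "$kind" -n "$ns" || echo "could not list $kind" >&2
  done
)
```

For example, inspect_cr_objects my-namespace prints one labeled listing per resource type, which you can redirect to a file and attach to a support case.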

Verify updates to a custom resource

Because the operator takes time to perform tasks, updates to the custom resource are not effective immediately. Use the kubectl command line tool to verify that changes are applied.

You can use the kubectl wait command to wait for a specified condition. For example, the operator uses the ImageChangeInProgress condition to provide an upgrade status. After you begin the image version upgrade, wait until the operator acknowledges the upgrade and sets this condition to True:

$ kubectl wait --for=condition=ImageChangeInProgress=True vdb/cluster-name --timeout=180s

After the upgrade begins, you can wait until the operator leaves upgrade mode and sets this condition to False:

$ kubectl wait --for=condition=ImageChangeInProgress=False vdb/cluster-name --timeout=800s

For more information about kubectl wait, see the kubectl reference documentation.
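
The two waits can also be chained so that a script blocks until the upgrade has both started and finished. The following is a minimal sketch that reuses the condition name and timeouts from the examples above. The function name wait_for_image_change is hypothetical:

```shell
# Hypothetical helper: block until an image upgrade starts and then finishes.
# Usage: wait_for_image_change cluster-name
wait_for_image_change() (
  vdb="$1"
  # Wait for the operator to acknowledge the upgrade.
  kubectl wait --for=condition=ImageChangeInProgress=True "vdb/$vdb" --timeout=180s || {
    echo "upgrade did not start within 180s" >&2
    exit 1
  }
  # Wait for the operator to leave upgrade mode.
  kubectl wait --for=condition=ImageChangeInProgress=False "vdb/$vdb" --timeout=800s || {
    echo "upgrade did not finish within 800s" >&2
    exit 1
  }
  echo "upgrade of $vdb complete"
)
```

The function returns a nonzero status if either wait times out, so a calling script can stop before it runs steps that depend on the new image.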

Pods are running but the database is not ready

When you check the pods in your cluster, the pods are running but the database is not ready:

$ kubectl get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
vertica-crd-sc1-0                                       0/1     Running   0          12m
vertica-crd-sc1-1                                       0/1     Running   1          12m
vertica-crd-sc1-2                                       0/1     Running   0          12m
verticadb-operator-controller-manager-5d9cdc9b8-kw9nv   2/2     Running   0          24m

To find the root cause of the issue, use kubectl logs to check the operator manager. The following example shows that the communal storage bucket does not exist:

$ kubectl logs -l app.kubernetes.io/name=verticadb-operator -c manager -f
2021-08-04T20:03:00.289Z        INFO    controllers.VerticaDB   ExecInPod entry {"verticadb": "default/vertica-crd", "pod": {"namespace": "default", "name": "vertica-crd-sc1-0"}, "command": "bash -c ls -l /opt/vertica/config/admintools.conf && grep '^node\\|^v_\\|^host' /opt/vertica/config/admintools.conf "}
2021-08-04T20:03:00.369Z        INFO    controllers.VerticaDB   ExecInPod stream        {"verticadb": "default/vertica-crd", "pod": {"namespace": "default", "name": "vertica-crd-sc1-0"}, "err": null, "stdout": "-rw-rw-r-- 1 dbadmin verticadba 1243 Aug  4 20:00 /opt/vertica/config/admintools.conf\nhosts = 10.244.1.5,10.244.2.4,10.244.4.6\nnode0001 = 10.244.1.5,/data,/data\nnode0002 = 10.244.2.4,/data,/data\nnode0003 = 10.244.4.6,/data,/data\n", "stderr": ""}
2021-08-04T20:03:00.369Z        INFO    controllers.VerticaDB   ExecInPod entry {"verticadb": "default/vertica-crd", "pod": {"namespace": "default", "name": "vertica-crd-sc1-0"}, "command": "/opt/vertica/bin/admintools -t create_db --skip-fs-checks --hosts=10.244.1.5,10.244.2.4,10.244.4.6 --communal-storage-location=s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c --communal-storage-params=/home/dbadmin/auth_parms.conf --sql=/home/dbadmin/post-db-create.sql --shard-count=12 --depot-path=/depot --database verticadb --force-cleanup-on-failure --noprompt --password ******* "}
2021-08-04T20:03:00.369Z        DEBUG   controller-runtime.manager.events       Normal  {"object": {"kind":"VerticaDB","namespace":"default","name":"vertica-crd","uid":"26100df1-93e5-4e64-b665-533e14abb67c","apiVersion":"vertica.com/v1beta1","resourceVersion":"11591"}, "reason": "CreateDBStart", "message": "Calling 'admintools -t create_db'"}
2021-08-04T20:03:17.051Z        INFO    controllers.VerticaDB   ExecInPod stream        {"verticadb": "default/vertica-crd", "pod": {"namespace": "default", "name": "vertica-crd-sc1-0"}, "err": "command terminated with exit code 1", "stdout": "Default depot size in use\nDistributing changes to cluster.\n\tCreating database verticadb\nBootstrap on host 10.244.1.5 return code 1 stdout '' stderr 'Logged exception in writeBufferToFile: RecvFiles failed in closing file [s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/verticadb_rw_access_test.txt]: The specified bucket does not exist. Writing test data to file s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/verticadb_rw_access_test.txt failed.\\nTesting rw access to communal location s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/ failed\\n'\n\nError: Bootstrap on host 10.244.1.5 return code 1 stdout '' stderr 'Logged exception in writeBufferToFile: RecvFiles failed in closing file [s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/verticadb_rw_access_test.txt]: The specified bucket does not exist. Writing test data to file s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/verticadb_rw_access_test.txt failed.\\nTesting rw access to communal location s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/ failed\\n'\n\n", "stderr": ""}
2021-08-04T20:03:17.051Z        INFO    controllers.VerticaDB   aborting reconcile of VerticaDB {"verticadb": "default/vertica-crd", "result": {"Requeue":true,"RequeueAfter":0}, "err": null}
2021-08-04T20:03:17.051Z        DEBUG   controller-runtime.manager.events       Warning {"object": {"kind":"VerticaDB","namespace":"default","name":"vertica-crd","uid":"26100df1-93e5-4e64-b665-533e14abb67c","apiVersion":"vertica.com/v1beta1","resourceVersion":"11591"}, "reason": "S3BucketDoesNotExist", "message": "The bucket in the S3 path 's3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c' does not exist"}
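
Operator logs are verbose, and the root cause usually appears in a Warning event such as the last line above. As a rough sketch, a filter like the following could reduce the log to the reason and message of each Warning event. The function name operator_warnings is hypothetical, and the patterns assume that log lines use the format shown in this example:

```shell
# Hypothetical filter: read operator manager logs on standard input and print
# the reason and message of each Warning event.
operator_warnings() {
  grep -E 'controller-runtime\.manager\.events[[:space:]]+Warning' |
    sed -E 's/.*"reason": "([^"]*)", "message": "(.*)"\}[[:space:]]*$/\1: \2/'
}
```

You might pipe the log command into the filter, for example: kubectl logs -l app.kubernetes.io/name=verticadb-operator -c manager | operator_warnings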

Create an S3 bucket for the cluster:

$ S3_BUCKET=newbucket
$ S3_CLUSTER_IP=$(kubectl get svc | grep minio | head -1 | awk '{print $3}')
$ export AWS_ACCESS_KEY_ID=minio
$ export AWS_SECRET_ACCESS_KEY=minio123
$ aws s3 mb s3://$S3_BUCKET --endpoint-url http://$S3_CLUSTER_IP
make_bucket: newbucket

Use kubectl get pods to verify that the cluster uses the new S3 bucket and the database is ready:

$ kubectl get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
minio-ss-0-0                                            1/1     Running   0          18m
minio-ss-0-1                                            1/1     Running   0          18m
minio-ss-0-2                                            1/1     Running   0          18m
minio-ss-0-3                                            1/1     Running   0          18m
vertica-crd-sc1-0                                       1/1     Running   0          20m
vertica-crd-sc1-1                                       1/1     Running   0          20m
vertica-crd-sc1-2                                       1/1     Running   0          20m
verticadb-operator-controller-manager-5d9cdc9b8-kw9nv   2/2     Running   0          63m
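
In a larger cluster, it can be easier to list only the pods that are running but not ready than to read the READY column by eye. The following is a minimal sketch of such a filter. The function name not_ready_pods is hypothetical, and the sketch assumes the default kubectl get pods column layout shown above:

```shell
# Hypothetical filter: read `kubectl get pods` output on standard input and
# print the name of each Running pod whose READY count is below its total.
not_ready_pods() {
  awk 'NR > 1 && $3 == "Running" {
         split($2, ready, "/")          # READY column, for example 0/1
         if (ready[1] != ready[2]) print $1
       }'
}
```

Running kubectl get pods | not_ready_pods prints nothing after the database is ready.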

Database is not available

After you create a custom resource instance, the database is not available. The kubectl get custom-resource command does not display information:

$ kubectl get vdb
NAME          AGE   SUBCLUSTERS   INSTALLED   DBADDED   UP
vertica-crd   4s

Use kubectl describe custom-resource to check the events for the custom resource and identify any issues:

$ kubectl describe vdb
Name:         vertica-crd
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  vertica.com/v1beta1
Kind:         VerticaDB
Metadata:
  ...
  Superuser Password Secret:  su-passwd
Events:
  Type     Reason                           Age                From                Message
  ----     ------                           ----               ----                -------
  Warning  SuperuserPasswordSecretNotFound  5s (x12 over 15s)  verticadb-operator  Secret for superuser password 'su-passwd' was not found

In this circumstance, the custom resource expects a Secret named su-passwd to store the superuser password, but no such Secret exists. Create a Secret named su-passwd that stores the password:

$ kubectl create secret generic su-passwd --from-literal=password=sup3rs3cr3t
secret/su-passwd created

Use kubectl get custom-resource to verify the issue is resolved:

$ kubectl get vdb
NAME          AGE   SUBCLUSTERS   INSTALLED   DBADDED   UP
vertica-crd   89s   1             0           0         0

Image pull failure

If you deploy a Vertica cluster with Helm charts without first pulling the Vertica image from the local registry server, you might receive an ImagePullBackOff error:

$ kubectl describe pod pod-name-0
...
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  ...
  Warning  Failed            2m32s                  kubelet            Failed to pull image "k8s-rhel7-01:5000/vertica-k8s:default-1": rpc error: code = Unknown desc = context canceled
  Warning  Failed            2m32s                  kubelet            Error: ErrImagePull
  Normal   BackOff           2m32s                  kubelet            Back-off pulling image "k8s-rhel7-01:5000/vertica-k8s:default-1"
  Warning  Failed            2m32s                  kubelet            Error: ImagePullBackOff
  Normal   Pulling           2m18s (x2 over 4m22s)  kubelet            Pulling image "k8s-rhel7-01:5000/vertica-k8s:default-1"

This error occurs because the Vertica image is too large to pull from the registry while the Vertica cluster deploys. To check the image size, execute the following command on a Kubernetes host:

$ docker image list | grep vertica-k8s
k8s-rhel7-01:5000/vertica-k8s default-1 2d6f5d3d90d6 9 days ago 1.55GB

To solve this issue, complete one of the following:

  • Pull the Vertica images on each node before creating the Vertica StatefulSet:

    $ NODES=`kubectl get nodes | grep -v NAME | awk '{print $1}'`
    $ for node in $NODES; do ssh $node docker pull $DOCKER_REGISTRY:5000/vertica-k8s:$K8S_TAG; done
    
  • Use the reduced-size vertica/vertica-k8s:latest image for the Vertica server.

Pending pods due to insufficient CPU

If your host nodes do not have enough resources to fulfill the resource request from a pod, the pod stays in pending status.

In the following example, the pod requests 40 CPUs on the host node, and the pod stays in Pending:

$ kubectl describe pod cluster-vertica-defaultsubcluster-0
...
Status:         Pending
...
Containers:
  server:
    Image:       docker.io/library/vertica-k8s:default-1
    Ports:       5433/TCP, 5434/TCP, 22/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Command:
      /opt/vertica/bin/docker-entrypoint.sh
      restart-vertica-node
    Limits:
      memory:  200Gi
    Requests:
      cpu: 40
      memory:  200Gi
...
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  3h20m  default-scheduler  0/5 nodes are available: 5 Insufficient cpu.

Confirm the resources available on the host node. The following command shows that the host node has only 40 allocatable CPUs, and that other pods already request part of that capacity:

$ kubectl describe node host-node-1
...
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sat, 20 Mar 2021 22:39:10 -0400   Sat, 20 Mar 2021 13:07:02 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sat, 20 Mar 2021 22:39:10 -0400   Sat, 20 Mar 2021 13:07:02 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sat, 20 Mar 2021 22:39:10 -0400   Sat, 20 Mar 2021 13:07:02 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sat, 20 Mar 2021 22:39:10 -0400   Sat, 20 Mar 2021 13:07:12 -0400   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.19.0.5
  Hostname:    eng-g9-191
Capacity:
  cpu:                40
  ephemeral-storage:  285509064Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263839236Ki
  pods:               110
Allocatable:
  cpu:                40
  ephemeral-storage:  285509064Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263839236Ki
  pods:               110
...
Non-terminated Pods:          (3 in total)
  Namespace                   Name                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                   ------------  ----------  ---------------  -------------  ---
  default                     cluster-vertica-defaultsubcluster-0    38 (95%)      0 (0%)      200Gi (79%)      200Gi (79%)    51m
  kube-system                 kube-flannel-ds-8brv9                  100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      9h
  kube-system                 kube-proxy-lgjhp                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         9h
...

To correct this issue, reduce resources.requests in the subcluster to values lower than the maximum allocatable CPUs. The following example uses a YAML-formatted file named patch.yaml to lower the CPU request for the pod:

$ cat patch.yaml
spec:
  subclusters:
    - name: defaultsubcluster
      resources:
        requests:
          memory: 238Gi
          cpu: "38"
        limits:
          memory: 238Gi
$ kubectl patch vdb cluster-vertica --type=merge --patch "$(cat patch.yaml)"
verticadb.vertica.com/cluster-vertica patched
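
The arithmetic behind this fix is worth making explicit: a request fits only if it plus the CPU that other pods on the node already request does not exceed the allocatable CPUs. In the node output above, kube-flannel already requests 100m, so a 40-CPU request cannot fit on a 40-CPU node, but a 38-CPU request can. The following is a minimal sketch of that check. Both function names are hypothetical, and the sketch handles only plain and millicore (m) CPU quantities:

```shell
# Convert a Kubernetes CPU quantity such as 38, 0.5, or 100m to millicores.
cpu_to_millicores() {
  awk -v q="$1" 'BEGIN {
    if (q ~ /m$/) { sub(/m$/, "", q); printf "%.0f\n", q }
    else          { printf "%.0f\n", q * 1000 }
  }'
}

# Hypothetical check: succeed if a new request fits on a node.
# Usage: cpu_request_fits request allocatable already-requested
cpu_request_fits() (
  request="$(cpu_to_millicores "$1")"
  allocatable="$(cpu_to_millicores "$2")"
  used="$(cpu_to_millicores "$3")"
  [ $((request + used)) -le "$allocatable" ]
)
```

With the values from this example, cpu_request_fits 40 40 100m fails and cpu_request_fits 38 40 100m succeeds.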

Pending pod after node removed

When you remove a host node from your Kubernetes cluster, a Vertica pod might stay in pending status if the pod uses a PersistentVolume (PV) that has a node affinity rule that prevents the pod from running on another node.

To resolve this issue, you must verify that the pods are pending because of an affinity rule, and then use the vdb-gen tool to revive the entire cluster.

First, determine if the pod is pending because of a node affinity rule. This requires details about the pending pod, the PersistentVolumeClaim (PVC) associated with the pod, and the PersistentVolume (PV) associated with the PVC:

  1. Use kubectl describe to return details about the pending pod:

    $ kubectl describe pod pod-name
    ...
    Events:
      Type     Reason            Age                From               Message
      ----     ------            ----               ----               -------
      Warning  FailedScheduling  28s (x2 over 48s)  default-scheduler  0/2 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 1 node(s) had volume node affinity conflict, 1 node(s) were unschedulable. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
    

    The Message column verifies that the pod was not scheduled due to a volume node affinity conflict.

  2. Get the name of the PVC associated with the pod:

    $ kubectl get pod -o jsonpath='{.spec.volumes[0].persistentVolumeClaim.claimName}{"\n"}' pod-name
    local-data-pod-name
    
  3. Use the PVC to get the PV. PVs are associated with nodes:

    $ kubectl get pvc -o jsonpath='{.spec.volumeName}{"\n"}' local-data-pod-name
    pvc-1926ae96-574d-4433-99b4-ec9ab0e5e497
    
  4. Use the PV to get the name of the node that has the affinity rule:

    $ kubectl get pv -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}{"\n"}' pvc-1926ae96-574d-4433-99b4-ec9ab0e5e497
    ip-10-20-30-40.ec2.internal
    
  5. Verify that the node with the affinity rule is the node that was removed from the Kubernetes cluster.
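
Steps 2 through 4 can be chained into a single lookup. The following is a minimal sketch that reuses the jsonpath expressions from those steps. The function name pv_node_for_pod is hypothetical, and like the steps above, the sketch assumes that the first volume in the pod spec is the one backed by the PVC:

```shell
# Hypothetical helper: print the node named in the node affinity rule of the
# PV behind a pod's first volume.
# Usage: pv_node_for_pod pod-name
pv_node_for_pod() (
  pod="$1"
  pvc="$(kubectl get pod "$pod" -o jsonpath='{.spec.volumes[0].persistentVolumeClaim.claimName}')"
  [ -n "$pvc" ] || { echo "no PVC found for pod $pod" >&2; exit 1; }
  pv="$(kubectl get pvc "$pvc" -o jsonpath='{.spec.volumeName}')"
  [ -n "$pv" ] || { echo "PVC $pvc is not bound to a PV" >&2; exit 1; }
  kubectl get pv "$pv" -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}{"\n"}'
)
```

You can then compare the printed node name with the output of kubectl get nodes. If the name is missing from that list, the affinity rule points at the removed node.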

Next, you must revive the entire cluster to get all pods running again. When you revive the cluster, you create new PVCs that restore the association between each pod and a PV to satisfy the node affinity rule.

While you have nodes running in the cluster, you can use the vdb-gen tool to generate a manifest and revive the database:

  1. Download the vdb-gen tool from the vertica-kubernetes GitHub repository:

    $ wget https://github.com/vertica/vertica-kubernetes/releases/latest/download/vdb-gen
    
  2. Copy the tool into a pod that has a running Vertica process:

    $ kubectl cp vdb-gen pod-name:/tmp/vdb-gen
    
  3. The vdb-gen tool requires the database name, so retrieve it with the following command:

    $ kubectl get vdb -o jsonpath='{.spec.dbName}{"\n"}' v
    database-name
    
  4. Run the vdb-gen tool with the database name. The following command runs the tool and pipes the output to a file named revive.yaml:

    $ kubectl exec -i pod-name -- bash -c "chmod +x /tmp/vdb-gen && /tmp/vdb-gen --ignore-cluster-lease --name v localhost database-name | tee /tmp/revive.yaml"
    
  5. Copy revive.yaml to your local machine so that you can use it after you remove the cluster:

    $ kubectl cp pod-name:/tmp/revive.yaml revive.yaml
    
  6. Save the current VerticaDB Custom Resource (CR). For example, the following command saves a CR named vertdb to a file named orig.yaml:

    $ kubectl get vdb vertdb -o yaml > orig.yaml
    
  7. Update revive.yaml with any parts of orig.yaml that vdb-gen did not capture, such as custom resource limits.

  8. Delete the existing Vertica cluster:

    $ kubectl delete vdb vertdb
    verticadb.vertica.com "vertdb" deleted
    
  9. Delete all PVCs that are associated with the deleted cluster.

    1. Retrieve the PVC names. In the following output, each PVC name uses the local-data-dbname-subcluster-podindex format:

      $ kubectl get pvc
      NAME                     STATUS   VOLUME                                     CAPACITY ACCESS MODES   STORAGECLASS   AGE
      local-data-vertdb-sc-0   Bound    pvc-e9834c18-bf60-4a4b-a686-ba8f7b601230   1Gi      RWO            local-path     34m
      local-data-vertdb-sc-1   Bound    pvc-1926ae96-574d-4433-99b4-ec9ab0e5e497   1Gi      RWO            local-path     34m
      local-data-vertdb-sc-2   Bound    pvc-4541f7c9-3afc-47f0-8d04-67fac370ee88   1Gi      RWO            local-path     34m
      
    2. Delete the PVCs:

      $ kubectl delete pvc local-data-vertdb-sc-0 local-data-vertdb-sc-1 local-data-vertdb-sc-2
      persistentvolumeclaim "local-data-vertdb-sc-0" deleted
      persistentvolumeclaim "local-data-vertdb-sc-1" deleted
      persistentvolumeclaim "local-data-vertdb-sc-2" deleted
      
  10. Revive the database with revive.yaml:

    $ kubectl apply -f revive.yaml
    verticadb.vertica.com/vertdb created
    

After the revive completes, all Vertica pods are running, and PVCs are recreated on new nodes. Wait for the operator to start the database.
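
When the namespace holds PVCs for more than one cluster, selecting the right PVC names to delete by hand is error-prone. The following is a minimal sketch of a filter for that step. The function name pvcs_for_cluster is hypothetical, and the sketch assumes that PVC names begin with local-data- followed by the cluster name, as in the example output above:

```shell
# Hypothetical filter: read `kubectl get pvc` output on standard input and
# print the names of PVCs that belong to one cluster.
# Note: a prefix match also selects clusters whose names start with the
# given name plus a hyphen, so review the list before you delete anything.
# Usage: kubectl get pvc | pvcs_for_cluster vertdb
pvcs_for_cluster() {
  awk -v prefix="local-data-$1-" 'NR > 1 && index($1, prefix) == 1 { print $1 }'
}
```

Review the printed names, and then pass them to kubectl delete pvc as shown in the deletion step.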

Helm charts

Helm install failure

When you install the VerticaDB operator and admission controller Helm chart, the helm install command might return the following error:

$ helm install vdb-op vertica-charts/verticadb-operator
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: [unable to recognize "": no matches for kind "Certificate" in version "cert-manager.io/v1", unable to recognize "": no matches for kind "Issuer" in version "cert-manager.io/v1"]

The error indicates that you have not met the TLS prerequisite for the admission controller webhook. To resolve this issue, install cert-manager or configure custom certificates. The following steps install cert-manager.

  1. Install the cert-manager YAML manifest:

    $ kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.5.3/cert-manager.yaml
    
  2. Verify the cert-manager installation.

    If you try to install the Helm chart immediately after you install cert-manager, you might receive the following error:

    $ helm install vdb-op vertica-charts/verticadb-operator
    Error: failed to create resource: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.232.154:443: connect: connection refused
    

    You receive this error because cert-manager needs time to create its pods and register the webhook with the cluster. Wait a few minutes, and then verify the cert-manager installation with the following command:

    $ kubectl get pods --namespace cert-manager
    NAME                                       READY   STATUS    RESTARTS   AGE
    cert-manager-7dd5854bb4-skks7              1/1     Running   5          12d
    cert-manager-cainjector-64c949654c-9nm2z   1/1     Running   5          12d
    cert-manager-webhook-6bdffc7c9d-b7r2p      1/1     Running   5          12d
    

    For additional details about cert-manager install verification, see the cert-manager documentation.

  3. After you verify the cert-manager installation, you must uninstall the Helm chart and then reinstall:

    $ helm uninstall vdb-op
    $ helm install vdb-op vertica-charts/verticadb-operator
    

For additional information, see Installing the Vertica DB operator.
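
Instead of waiting a fixed number of minutes for cert-manager, a script could block until the cert-manager pods report Ready. The following is a minimal sketch that uses kubectl wait. The function name wait_for_cert_manager and the 300s default timeout are assumptions, and pod readiness alone might not guarantee that the webhook is fully registered:

```shell
# Hypothetical helper: wait until every cert-manager pod reports Ready.
# Usage: wait_for_cert_manager [timeout]
wait_for_cert_manager() (
  timeout="${1:-300s}"
  kubectl wait --for=condition=Ready pods --all \
    --namespace cert-manager --timeout="$timeout"
)
```

If the helm install command still returns the webhook connection error after the wait succeeds, retry after a short delay.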

Custom certificate helm install error

If you use custom certificates when you install the operator with the Helm chart, the helm install or kubectl apply command might return an error similar to the following:

$ kubectl apply -f ../operatorcrd.yaml
Error from server (InternalError): error when creating "../operatorcrd.yaml": Internal error occurred: failed calling webhook "mverticadb.kb.io": Post "https://verticadb-operator-webhook-service.namespace.svc:443/mutate-vertica-com-v1beta1-verticadb?timeout=10s": x509: certificate is valid for ip-10-0-21-169.ec2.internal, test-bastion, not verticadb-operator-webhook-service.default.svc

You receive this error when the Domain Name System (DNS) name or Subject Alternative Name (SAN) in the TLS certificate is incorrect. To correct this error, define the DNS and SAN in a configuration file in the following format:

commonName = verticadb-operator-webhook-service.namespace.svc
...
[alt_names]
DNS.1 = verticadb-operator-webhook-service.namespace.svc
DNS.2 = verticadb-operator-webhook-service.namespace.svc.cluster.local

For additional details, see Installing the Vertica DB operator.
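
Because the required names depend on the namespace where the operator runs, it can help to generate the alt_names entries rather than type them. The following is a minimal sketch that prints the entries in the format shown above. The function name webhook_alt_names is hypothetical:

```shell
# Hypothetical helper: print the [alt_names] entries for the webhook service
# in a given namespace.
# Usage: webhook_alt_names namespace
webhook_alt_names() (
  svc="verticadb-operator-webhook-service.$1.svc"
  printf '[alt_names]\n'
  printf 'DNS.1 = %s\n' "$svc"
  printf 'DNS.2 = %s.cluster.local\n' "$svc"
)
```

For the error in this example, webhook_alt_names default prints entries for verticadb-operator-webhook-service.default.svc, which is the name that the certificate was missing.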

Metrics gathering

Adding and testing the vlogger sidecar

Vertica provides the vlogger image that sends logs from vertica.log to standard output on the host node for log aggregation.

To add the sidecar to the CR, add an element to the spec.sidecars definition:

spec:
  ...
  sidecars:
    - name: vlogger
      image: vertica/vertica-logger:1.0.0

To test the sidecar, run the following command and verify that it returns logs:

$ kubectl logs pod-name -c vlogger

2021-12-08 14:39:08.538 DistCall Dispatch:0x7f3599ffd700-c000000000997e [Txn
2021-12-08 14:40:48.923 INFO New log
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> Log /data/verticadb/v_verticadb_node0002_catalog/vertica.log opened; #1
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> Processing command line: /opt/vertica/bin/vertica -D /data/verticadb/v_verticadb_node0002_catalog -C verticadb -n v_verticadb_node0002 -h 10.20.30.40 -p 5433 -P 4803 -Y ipv4
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> Starting up Vertica Analytic Database v11.0.2-20211201
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO>
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> vertica(v11.0.2) built by @re-docker5 from master@a44ffabdf3f05e8d104426506b088192f741c485 on 'Wed Dec  1 06:10:34 2021' $BuildId$
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> CPU architecture: x86_64
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> 64-bit Optimized Build
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> Compiler Version: 7.3.1 20180303 (Red Hat 7.3.1-5)
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> LD_LIBRARY_PATH=/opt/vertica/lib
2021-12-08 14:40:48.923 Main Thread:0x7fbbe2cf6280 [Init] <INFO> LD_PRELOAD=
2021-12-08 14:40:48.925 Main Thread:0x7fbbe2cf6280 <LOG> @v_verticadb_node0002: 00000/5081: Total swap memory used: 0
2021-12-08 14:40:48.925 Main Thread:0x7fbbe2cf6280 <LOG> @v_verticadb_node0002: 00000/4435: Process size resident set: 28651520
2021-12-08 14:40:48.925 Main Thread:0x7fbbe2cf6280 <LOG> @v_verticadb_node0002: 00000/5075: Total Memory free + cache: 59455180800
2021-12-08 14:40:48.925 Main Thread:0x7fbbe2cf6280 [Txn] <INFO> Looking for catalog at: /data/verticadb/v_verticadb_node0002_catalog/Catalog
...

Core file for Vertica server container process

In some circumstances, you might need to examine a core file that contains information about the Vertica server container process:

  1. For the custom resource securityContext value, set the privileged property to true:

    apiVersion: vertica.com/v1beta1
    kind: VerticaDB
    ...
    spec:
      ...
      securityContext:
        privileged: true
    
  2. On the host machine, verify that /proc/sys/kernel/core_pattern is set to core:

    $ cat /proc/sys/kernel/core_pattern
    core
    

    The /proc/sys/kernel/core_pattern file is not namespaced, so setting this value affects all containers running on that host.

When Vertica generates a core, the machine writes a message to vertica.log that indicates where you can locate the core file.

Security

Custom PodSecurityPolicy errors

Vertica on Kubernetes requires the following Linux capabilities that enable SSH communications between the pods:

  • SYS_CHROOT

  • AUDIT_WRITE

In some circumstances, these capabilities might conflict with custom security policy restrictions and cause errors. For example:

$ kubectl describe statefulset subcluster-name
...
Events:
  Type     Reason        Age                     From                    Message
  ----     ------        ----                    ----                    -------
  Warning  FailedCreate  29m (x73 over 15h)      statefulset-controller  create Pod subcluster-name-0 in StatefulSet subcluster-name failed error: pods "subcluster-name-0" is forbidden: PodSecurityPolicy: unable to admit pod: [spec.containers[0].securityContext.capabilities.add: Invalid value: "AUDIT_WRITE": capability may not be added spec.containers[0].securityContext.capabilities.add: Invalid value: "SYS_CHROOT": capability may not be added]

When a similar error is returned, you must update your PodSecurityPolicy. For details, see the Kubernetes documentation.

VerticaAutoscaler

Cannot find CPU metrics with VerticaAutoscaler

You might notice that your VerticaAutoscaler is not scaling correctly according to CPU utilization:

$ kubectl get hpa
NAME                REFERENCE                           TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
autoscaler-name     VerticaAutoscaler/autoscaler-name   <unknown>/50%   3         12        0          19h

$ kubectl describe hpa
Warning: autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+, unavailable in v1.26+; use autoscaling/v2 HorizontalPodAutoscaler
Name: autoscaler-name
Namespace: namespace
Labels: <none>
Annotations: <none>
CreationTimestamp: Thu, 12 May 2022 10:25:02 -0400
Reference: VerticaAutoscaler/autoscaler-name
Metrics: ( current / target )
resource cpu on pods (as a percentage of request): <unknown> / 50%
Min replicas: 3
Max replicas: 12
VerticaAutoscaler pods: 3 current / 0 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetResourceMetric 7s horizontal-pod-autoscaler failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
Warning FailedComputeMetricsReplicas 7s horizontal-pod-autoscaler invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)

You receive this error because the metrics server is not installed:

$ kubectl top nodes
error: Metrics API not available

To install the metrics server:

  1. Download the components.yaml file:

    $ wget https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
    
  2. Optionally, disable kubelet TLS certificate verification:

    $ if ! grep kubelet-insecure-tls components.yaml; then
        sed -i 's/- args:/- args:\n        - --kubelet-insecure-tls/' components.yaml;
      fi
    
  3. Apply the YAML file:

    $ kubectl apply -f components.yaml
    
  4. Verify that the metrics server is running:

    $ kubectl get svc metrics-server -n namespace
    NAME             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
    metrics-server   ClusterIP   10.105.239.175   <none>        443/TCP   19h
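
After the metrics server starts, you can confirm that the autoscaler receives metrics by checking for an unknown current value in the TARGETS column. The following is a minimal sketch of that check. The function name hpa_without_metrics is hypothetical, and the sketch assumes the kubectl get hpa output format shown earlier in this section:

```shell
# Hypothetical filter: read `kubectl get hpa` output on standard input and
# print each autoscaler whose TARGETS column has an <unknown> current value.
hpa_without_metrics() {
  awk 'NR > 1 && /<unknown>\// { print $1 }'
}
```

Running kubectl get hpa | hpa_without_metrics prints nothing when every autoscaler can read its metrics.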
    

CPU request error with VerticaAutoscaler

You might receive an error that states:

failed to get cpu utilization: missing request for cpu

You receive this error because the autoscaler computes CPU utilization as a percentage of the CPU request, so every container, including sidecar containers, must set CPU resource requests. To correct this error:

  1. Verify the error:

    $ kubectl get hpa
    NAME                REFERENCE                           TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
    autoscaler-name     VerticaAutoscaler/autoscaler-name   <unknown>/50%   3         12        0          19h
    
    $ kubectl describe hpa
    Warning: autoscaling/v2beta2 HorizontalPodAutoscaler is deprecated in v1.23+, unavailable in v1.26+; use autoscaling/v2 HorizontalPodAutoscaler
    Name: autoscaler-name
    Namespace: namespace
    Labels: <none>
    Annotations: <none>
    CreationTimestamp: Thu, 12 May 2022 15:58:31 -0400
    Reference: VerticaAutoscaler/autoscaler-name
    Metrics: ( current / target )
    resource cpu on pods (as a percentage of request): <unknown> / 50%
    Min replicas: 3
    Max replicas: 12
    VerticaAutoscaler pods: 3 current / 0 desired
    Conditions:
    Type Status Reason Message
    ---- ------ ------ -------
    AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
    ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get cpu utilization: missing request for cpu
    Events:
    Type Reason Age From Message
    ---- ------ ---- ---- -------
    Warning FailedGetResourceMetric 4s (x5 over 64s) horizontal-pod-autoscaler failed to get cpu utilization: missing request for cpu
    Warning FailedComputeMetricsReplicas 4s (x5 over 64s) horizontal-pod-autoscaler invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: missing request for cpu
    
  2. Add resource requests and limits to the CR:

    $ cat /tmp/vdb.yaml
    apiVersion: vertica.com/v1beta1
    kind: VerticaDB
    metadata:
      name: vertica-vdb
    spec:
      sidecars:
        - name: vlogger
          image: vertica/vertica-logger:latest
          resources:
            requests:
              memory: "100Mi"
              cpu: "100m"
            limits:
              memory: "100Mi"
              cpu: "100m"
      communal:
        credentialSecret: communal-creds
        endpoint: https://endpoint
        path: s3://bucket-location
      dbName: verticadb
      image: vertica/vertica-k8s:latest
      subclusters:
      - isPrimary: true
        name: sc1
        resources:
          requests:
            memory: "4Gi"
            cpu: 2
          limits:
            memory: "4Gi"
            cpu: 2
        serviceType: ClusterIP
        serviceName: sc1
        size: 3
      upgradePolicy: Auto
    
  3. Apply the update:

    $ kubectl apply -f /tmp/vdb.yaml
    verticadb.vertica.com/vertica-vdb created
    

When you set a new CPU resource limit, Kubernetes reschedules each pod in the StatefulSet in a rolling update until all pods have the updated CPU resource limit.
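
To find which container triggers the missing request error, you could list the CPU request of each container in a pod and print the containers that have none. The following is a minimal sketch. The function name containers_without_cpu_request is hypothetical, and the sketch relies on kubectl printing an empty value when a jsonpath field is missing:

```shell
# Hypothetical helper: print the name of each container in a pod that has no
# CPU request.
# Usage: containers_without_cpu_request pod-name
containers_without_cpu_request() {
  kubectl get pod "$1" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests.cpu}{"\n"}{end}' |
    awk -F '\t' '$2 == "" { print $1 }'
}
```

If the function prints a sidecar name such as vlogger, add a resources section for that sidecar as shown in the CR above.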