General cluster and database
Inspect objects to diagnose issues
When you deploy a custom resource (CR), you might encounter a variety of issues. To pinpoint an issue, use the following commands to inspect the objects that the CR creates:
kubectl get
returns basic information about deployed objects:
$ kubectl get pods -n namespace
$ kubectl get statefulset -n namespace
$ kubectl get pvc -n namespace
$ kubectl get event
kubectl describe
returns detailed information about deployed objects:
$ kubectl describe pod pod-name -n namespace
$ kubectl describe statefulset name -n namespace
$ kubectl describe custom-resource-name -n namespace
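The inspection commands above can be gathered into one helper so that the same snapshot is captured each time you investigate an issue. The following is a minimal sketch, not part of the operator tooling: the function name collect_diagnostics is hypothetical, and it assumes that kubectl is already configured for the target cluster.

```shell
# Hypothetical helper: print a snapshot of the objects in one namespace.
# Usage: collect_diagnostics <namespace>
collect_diagnostics() (
  ns="${1:?usage: collect_diagnostics <namespace>}"
  for kind in pods statefulset pvc events; do
    echo "=== kubectl get $kind -n $ns ==="
    # Keep going if one query fails so the remaining objects are still reported.
    kubectl get "$kind" -n "$ns" || echo "(failed to get $kind)"
  done
)
```

For example, collect_diagnostics namespace > snapshot.txt saves the output to a file that you can compare before and after a change.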
Verify updates to a custom resource
Because the operator takes time to perform tasks, updates to the custom resource do not take effect immediately. Use the kubectl command line tool to verify that changes are applied.
You can use the kubectl wait command to wait for a specified condition. For example, the operator uses the UpgradeInProgress condition to provide an upgrade status. After you begin the image version upgrade, wait until the operator acknowledges the upgrade and sets this condition to True:
$ kubectl wait --for=condition=UpgradeInProgress=True vdb/cluster-name --timeout=180s
After the upgrade begins, you can wait until the operator leaves upgrade mode and sets this condition to False:
$ kubectl wait --for=condition=UpgradeInProgress=False vdb/cluster-name --timeout=800s
For more information about kubectl wait, see the kubectl reference documentation.
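If you script an upgrade, the two waits can be chained so that the script stops early when the operator never acknowledges the upgrade. This is a sketch under assumptions: the function name wait_for_upgrade and its default timeouts (copied from the examples above) are illustrative, not an operator feature.

```shell
# Hypothetical helper: block until an image upgrade starts and then finishes.
# Usage: wait_for_upgrade <vdb-name> [start-timeout] [finish-timeout]
wait_for_upgrade() (
  vdb="${1:?usage: wait_for_upgrade <vdb-name> [start-timeout] [finish-timeout]}"
  start_timeout="${2:-180s}"
  finish_timeout="${3:-800s}"
  # Wait for the operator to acknowledge the upgrade; give up if it never does.
  kubectl wait --for=condition=UpgradeInProgress=True "vdb/$vdb" --timeout="$start_timeout" || exit 1
  # Wait for the operator to leave upgrade mode.
  kubectl wait --for=condition=UpgradeInProgress=False "vdb/$vdb" --timeout="$finish_timeout"
)
```

The function returns a nonzero status if either wait times out, so a calling script can test the result before it continues.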
Pods are running but the database is not ready
When you check the pods in your cluster, the pods are running but the database is not ready:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
vertica-crd-sc1-0 0/1 Running 0 12m
vertica-crd-sc1-1 0/1 Running 1 12m
vertica-crd-sc1-2 0/1 Running 0 12m
verticadb-operator-controller-manager-5d9cdc9b8-kw9nv 2/2 Running 0 24m
To find the root cause of the issue, use kubectl logs to check the operator manager. The following example shows that the communal storage bucket does not exist:
$ kubectl logs -l app.kubernetes.io/name=verticadb-operator -c manager -f
2021-08-04T20:03:00.289Z INFO controllers.VerticaDB ExecInPod entry {"verticadb": "default/vertica-crd", "pod": {"namespace": "default", "name": "vertica-crd-sc1-0"}, "command": "bash -c ls -l /opt/vertica/config/admintools.conf && grep '^node\\|^v_\\|^host' /opt/vertica/config/admintools.conf "}
2021-08-04T20:03:00.369Z INFO controllers.VerticaDB ExecInPod stream {"verticadb": "default/vertica-crd", "pod": {"namespace": "default", "name": "vertica-crd-sc1-0"}, "err": null, "stdout": "-rw-rw-r-- 1 dbadmin verticadba 1243 Aug 4 20:00 /opt/vertica/config/admintools.conf\nhosts = 10.244.1.5,10.244.2.4,10.244.4.6\nnode0001 = 10.244.1.5,/data,/data\nnode0002 = 10.244.2.4,/data,/data\nnode0003 = 10.244.4.6,/data,/data\n", "stderr": ""}
2021-08-04T20:03:00.369Z INFO controllers.VerticaDB ExecInPod entry {"verticadb": "default/vertica-crd", "pod": {"namespace": "default", "name": "vertica-crd-sc1-0"}, "command": "/opt/vertica/bin/admintools -t create_db --skip-fs-checks --hosts=10.244.1.5,10.244.2.4,10.244.4.6 --communal-storage-location=s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c --communal-storage-params=/home/dbadmin/auth_parms.conf --sql=/home/dbadmin/post-db-create.sql --shard-count=12 --depot-path=/depot --database verticadb --force-cleanup-on-failure --noprompt --password ******* "}
2021-08-04T20:03:00.369Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"VerticaDB","namespace":"default","name":"vertica-crd","uid":"26100df1-93e5-4e64-b665-533e14abb67c","apiVersion":"vertica.com/v1","resourceVersion":"11591"}, "reason": "CreateDBStart", "message": "Calling 'admintools -t create_db'"}
2021-08-04T20:03:17.051Z INFO controllers.VerticaDB ExecInPod stream {"verticadb": "default/vertica-crd", "pod": {"namespace": "default", "name": "vertica-crd-sc1-0"}, "err": "command terminated with exit code 1", "stdout": "Default depot size in use\nDistributing changes to cluster.\n\tCreating database verticadb\nBootstrap on host 10.244.1.5 return code 1 stdout '' stderr 'Logged exception in writeBufferToFile: RecvFiles failed in closing file [s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/verticadb_rw_access_test.txt]: The specified bucket does not exist. Writing test data to file s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/verticadb_rw_access_test.txt failed.\\nTesting rw access to communal location s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/ failed\\n'\n\nError: Bootstrap on host 10.244.1.5 return code 1 stdout '' stderr 'Logged exception in writeBufferToFile: RecvFiles failed in closing file [s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/verticadb_rw_access_test.txt]: The specified bucket does not exist. Writing test data to file s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/verticadb_rw_access_test.txt failed.\\nTesting rw access to communal location s3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c/ failed\\n'\n\n", "stderr": ""}
2021-08-04T20:03:17.051Z INFO controllers.VerticaDB aborting reconcile of VerticaDB {"verticadb": "default/vertica-crd", "result": {"Requeue":true,"RequeueAfter":0}, "err": null}
2021-08-04T20:03:17.051Z DEBUG controller-runtime.manager.events Warning {"object": {"kind":"VerticaDB","namespace":"default","name":"vertica-crd","uid":"26100df1-93e5-4e64-b665-533e14abb67c","apiVersion":"vertica.com/v1","resourceVersion":"11591"}, "reason": "S3BucketDoesNotExist", "message": "The bucket in the S3 path 's3://newbucket/db/26100df1-93e5-4e64-b665-533e14abb67c' does not exist"}
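The operator log is verbose, and the line that names the problem is the Warning event at the end. The following filter is a rough sketch that prints only the reason and message of Warning events. It assumes the log layout shown in this example, and the function name operator_warnings is hypothetical. Because it matches text rather than parsing JSON, treat its output as a hint and read the full log when in doubt.

```shell
# Hypothetical filter: read operator manager logs on stdin and print
# "<reason>: <message>" for each Warning event.
operator_warnings() {
  grep -E 'manager\.events[[:space:]]+Warning[[:space:]]' |
    sed -n 's/.*"reason": "\([^"]*\)", "message": "\(.*\)"}.*/\1: \2/p'
}

# Usage (without -f, so that the command exits after it prints the existing logs):
#   kubectl logs -l app.kubernetes.io/name=verticadb-operator -c manager | operator_warnings
```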
Create an S3 bucket for the cluster:
$ S3_BUCKET=newbucket
$ S3_CLUSTER_IP=$(kubectl get svc | grep minio | head -1 | awk '{print $3}')
$ export AWS_ACCESS_KEY_ID=minio
$ export AWS_SECRET_ACCESS_KEY=minio123
$ aws s3 mb s3://$S3_BUCKET --endpoint-url http://$S3_CLUSTER_IP
make_bucket: newbucket
Use kubectl get pods to verify that the cluster uses the new S3 bucket and the database is ready:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
minio-ss-0-0 1/1 Running 0 18m
minio-ss-0-1 1/1 Running 0 18m
minio-ss-0-2 1/1 Running 0 18m
minio-ss-0-3 1/1 Running 0 18m
vertica-crd-sc1-0 1/1 Running 0 20m
vertica-crd-sc1-1 1/1 Running 0 20m
vertica-crd-sc1-2 1/1 Running 0 20m
verticadb-operator-controller-manager-5d9cdc9b8-kw9nv 2/2 Running 0 63m
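When a namespace contains many pods, you can filter the kubectl get pods table for pods whose READY column shows fewer ready containers than expected. This is a small sketch: the function name not_ready_pods is hypothetical, and it assumes the default table output with a header row.

```shell
# Hypothetical filter: read `kubectl get pods` table output on stdin and
# print the names of pods whose READY count (for example 0/1) is incomplete.
not_ready_pods() {
  # Skip the header row, then split the READY column at the slash.
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Usage:
#   kubectl get pods | not_ready_pods
```

An empty result means that every listed pod reports all of its containers as ready.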
Database is not available
After you create a custom resource instance, the database is not available. The kubectl get custom-resource command does not display information:
$ kubectl get vdb
NAME AGE SUBCLUSTERS INSTALLED DBADDED UP
vertica-crd 4s
Use kubectl describe custom-resource to check the events for the pods to identify any issues:
$ kubectl describe vdb
Name: vertica-crd
Namespace: default
Labels: <none>
Annotations: <none>
API Version: vertica.com/v1
Kind: VerticaDB
Metadata:
...
Superuser Password Secret: su-passwd
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning SuperuserPasswordSecretNotFound 5s (x12 over 15s) verticadb-operator Secret for superuser password 'su-passwd' was not found
In this case, the custom resource expects a Secret named su-passwd to hold the superuser password (the Superuser Password Secret field), but no such Secret exists. Create a Secret named su-passwd to store the password:
$ kubectl create secret generic su-passwd --from-literal=password=sup3rs3cr3t
secret/su-passwd created
Note
For detailed steps about creating the Secret manifest and applying it to a namespace, see the Kubernetes documentation.
For details about Vertica and secret credentials, see Secrets management.
Use kubectl get custom-resource to verify the issue is resolved:
$ kubectl get vdb
NAME AGE SUBCLUSTERS INSTALLED DBADDED UP
vertica-crd 89s 1 0 0 0
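To avoid this failure in a deployment script, you can check for the Secret before you create the custom resource. The following sketch creates the Secret only when it is missing. The function name ensure_secret is hypothetical, and it assumes the same password key used in the example above. Passing a password as a command argument exposes it in shell history, so this is suitable for test environments only.

```shell
# Hypothetical helper: create a generic Secret with a "password" key
# unless a Secret with that name already exists in the current namespace.
# Usage: ensure_secret <secret-name> <password>
ensure_secret() (
  name="${1:?usage: ensure_secret <secret-name> <password>}"
  password="${2:?usage: ensure_secret <secret-name> <password>}"
  if kubectl get secret "$name" >/dev/null 2>&1; then
    echo "secret/$name already exists"
  else
    kubectl create secret generic "$name" --from-literal=password="$password"
  fi
)
```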
Image pull failure
You receive an ImagePullBackOff error when you deploy a Vertica cluster with Helm charts without first pulling the Vertica image from the local registry server:
$ kubectl describe pod pod-name-0
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
...
Warning Failed 2m32s kubelet Failed to pull image "k8s-rhel7-01:5000/vertica-k8s:default-1": rpc error: code = Unknown desc = context canceled
Warning Failed 2m32s kubelet Error: ErrImagePull
Normal BackOff 2m32s kubelet Back-off pulling image "k8s-rhel7-01:5000/vertica-k8s:default-1"
Warning Failed 2m32s kubelet Error: ImagePullBackOff
Normal Pulling 2m18s (x2 over 4m22s) kubelet Pulling image "k8s-rhel7-01:5000/vertica-k8s:default-1"
This occurs because the Vertica image is too large to pull from the registry while the Vertica cluster is deploying. To check the image size, execute the following command on a Kubernetes host:
$ docker image list | grep vertica-k8s
k8s-rhel7-01:5000/vertica-k8s default-1 2d6f5d3d90d6 9 days ago 1.55GB
To solve this issue, complete one of the following:
-
Pull the Vertica images on each node before creating the Vertica StatefulSet:
$ NODES=`kubectl get nodes | grep -v NAME | awk '{print $1}'`
$ for node in $NODES; do ssh $node docker pull $DOCKER_REGISTRY:5000/vertica-k8s:$K8S_TAG; done
-
Use the reduced-size vertica/vertica-k8s:latest image for the Vertica server.
Pending pods due to insufficient CPU
If your host nodes do not have enough resources to fulfill the resource request from a pod, the pod stays in pending status.
Note
As a best practice, do not request the maximum amount of resources available on a host node, so that resources remain for other processes on the host node.
In the following example, the pod requests 40 CPUs on the host node, and the pod stays in Pending:
$ kubectl describe pod cluster-vertica-defaultsubcluster-0
...
Status: Pending
...
Containers:
server:
Image: docker.io/library/vertica-k8s:default-1
Ports: 5433/TCP, 5434/TCP, 22/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Command:
/opt/vertica/bin/docker-entrypoint.sh
restart-vertica-node
Limits:
memory: 200Gi
Requests:
cpu: 40
memory: 200Gi
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3h20m default-scheduler 0/5 nodes are available: 5 Insufficient cpu.
Confirm the resources available on the host node. The following command shows that the host node has only 40 allocatable CPUs:
$ kubectl describe node host-node-1
...
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Sat, 20 Mar 2021 22:39:10 -0400 Sat, 20 Mar 2021 13:07:02 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sat, 20 Mar 2021 22:39:10 -0400 Sat, 20 Mar 2021 13:07:02 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sat, 20 Mar 2021 22:39:10 -0400 Sat, 20 Mar 2021 13:07:02 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sat, 20 Mar 2021 22:39:10 -0400 Sat, 20 Mar 2021 13:07:12 -0400 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.19.0.5
Hostname: eng-g9-191
Capacity:
cpu: 40
ephemeral-storage: 285509064Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263839236Ki
pods: 110
Allocatable:
cpu: 40
ephemeral-storage: 285509064Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263839236Ki
pods: 110
...
Non-terminated Pods: (3 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default cluster-vertica-defaultsubcluster-0 38 (95%) 0 (0%) 200Gi (79%) 200Gi (79%) 51m
kube-system kube-flannel-ds-8brv9 100m (0%) 100m (0%) 50Mi (0%) 50Mi (0%) 9h
kube-system kube-proxy-lgjhp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 9h
...
To correct this issue, reduce the resources.requests in the subcluster to values lower than the maximum allocatable CPUs. The following example uses a YAML-formatted file named patch.yaml to lower the resource requests for the pod:
$ cat patch.yaml
spec:
  subclusters:
    - name: defaultsubcluster
      resources:
        requests:
          memory: 238Gi
          cpu: "38"
        limits:
          memory: 238Gi
$ kubectl patch vdb cluster-vertica --type=merge --patch "$(cat patch.yaml)"
verticadb.vertica.com/cluster-vertica patched
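The scheduler places a pod only if its CPU request fits within the node's allocatable CPU after the requests of the pods already on that node are subtracted. The following sketch shows that arithmetic with the values from the kubectl describe node output. Both function names are hypothetical, and the conversion handles only plain CPU counts (such as 40 or 0.5) and millicore values (such as 100m).

```shell
# Convert a Kubernetes CPU quantity ("40", "0.5", "100m") to millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v v="$1" 'BEGIN { printf "%.0f\n", v * 1000 }' ;;
  esac
}

# Print the CPU left on a node after subtracting each request.
# Usage: cpu_headroom <allocatable> [request...]
cpu_headroom() (
  total="$(to_millicores "${1:?usage: cpu_headroom <allocatable> [request...]}")"
  shift
  for req in "$@"; do
    total=$(( total - $(to_millicores "$req") ))
  done
  echo "${total}m"
)

# Node with 40 allocatable CPUs and pods that request 38, 100m, and 0:
cpu_headroom 40 38 100m 0   # prints 1900m
```

A negative result means that the requests exceed what the node can allocate, which is the condition that leaves a pod in Pending.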
Pending pod after node removed
When you remove a host node from your Kubernetes cluster, a Vertica pod might stay in pending status if the pod uses a PersistentVolume (PV) that has a node affinity rule that prevents the pod from running on another node.
To resolve this issue, you must verify that the pods are pending because of an affinity rule, and then use the vdb-gen tool to revive the entire cluster.
First, determine if the pod is pending because of a node affinity rule. This requires details about the pending pod, the PersistentVolumeClaim (PVC) associated with the pod, and the PersistentVolume (PV) associated with the PVC:
-
Use kubectl describe to return details about the pending pod:
$ kubectl describe pod pod-name
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 28s (x2 over 48s) default-scheduler 0/2 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 1 node(s) had volume node affinity conflict, 1 node(s) were unschedulable. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
The Message column verifies that the pod was not scheduled due to a volume node affinity conflict.
-
Get the name of the PVC associated with the pod:
$ kubectl get pod -o jsonpath='{.spec.volumes[0].persistentVolumeClaim.claimName}{"\n"}' pod-name
local-data-pod-name
-
Use the PVC to get the PV. PVs are associated with nodes:
$ kubectl get pvc -o jsonpath='{.spec.volumeName}{"\n"}' local-data-pod-name
pvc-1926ae96-574d-4433-99b4-ec9ab0e5e497
-
Use the PV to get the name of the node that has the affinity rule:
$ kubectl get pv -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}{"\n"}' pvc-1926ae96-574d-4433-99b4-ec9ab0e5e497
ip-10-20-30-40.ec2.internal
-
Verify that the node with the affinity rule is the node that was removed from the Kubernetes cluster.
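The lookups in the preceding steps can be chained so that one call follows a pod to its PVC, its PV, and the node named in the affinity rule. This is a sketch that reuses the same jsonpath expressions. The function name pv_node_for_pod is hypothetical, and it assumes, as the steps above do, that the data volume is the first volume in the pod spec and that the affinity rule has a single node value.

```shell
# Hypothetical helper: follow a pod to its PVC, PV, and affinity node.
# Usage: pv_node_for_pod <pod-name>
pv_node_for_pod() (
  pod="${1:?usage: pv_node_for_pod <pod-name>}"
  # Each lookup feeds the next; stop at the first one that fails.
  pvc="$(kubectl get pod "$pod" -o jsonpath='{.spec.volumes[0].persistentVolumeClaim.claimName}')" || exit 1
  pv="$(kubectl get pvc "$pvc" -o jsonpath='{.spec.volumeName}')" || exit 1
  node="$(kubectl get pv "$pv" -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}')" || exit 1
  echo "pod=$pod pvc=$pvc pv=$pv node=$node"
)
```

You can then run kubectl get node with the reported node name. If the node was removed from the Kubernetes cluster, that command reports that the node is not found.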
Next, you must revive the entire cluster to get all pods running again. When you revive the cluster, you create new PVCs that restore the association between each pod and a PV to satisfy the node affinity rule.
While you have nodes running in the cluster, you can use the vdb-gen tool to generate a manifest and revive the database:
-
Download the vdb-gen tool from the vertica-kubernetes GitHub repository:
$ wget https://github.com/vertica/vertica-kubernetes/releases/latest/download/vdb-gen
-
Copy the tool into a pod that has a running Vertica process:
$ kubectl cp vdb-gen pod-name:/tmp/vdb-gen
-
The vdb-gen tool requires the database name, so retrieve it with the following command:
$ kubectl get vdb -o jsonpath='{.spec.dbName}{"\n"}' v
database-name
-
Run the vdb-gen tool with the database name. The following command runs the tool and pipes the output to a file named revive.yaml:
$ kubectl exec -i pod-name -- bash -c "chmod +x /tmp/vdb-gen && /tmp/vdb-gen --ignore-cluster-lease --name v localhost database-name | tee /tmp/revive.yaml"
-
Copy revive.yaml to your local machine so that you can use it after you remove the cluster:
$ kubectl cp pod-name:/tmp/revive.yaml revive.yaml
-
Save the current VerticaDB Custom Resource (CR). For example, the following command saves a CR named vertdb to a file named orig.yaml:
$ kubectl get vdb vertdb -o yaml > orig.yaml
-
Update revive.yaml with parts of orig.yaml that vdb-gen did not capture, for example, custom resource limits.
-
Delete the existing Vertica cluster:
$ kubectl delete vdb vertdb
verticadb.vertica.com "vertdb" deleted
-
Confirm that all PVCs that are associated with the deleted cluster were removed:
-
Retrieve the PVC names. A PVC name uses the dbname-subcluster-podindex format:
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
local-data-vertdb-sc-0 Bound pvc-e9834c18-bf60-4a4b-a686-ba8f7b601230 1Gi RWO local-path 34m
local-data-vertdb-sc-1 Bound pvc-1926ae96-574d-4433-99b4-ec9ab0e5e497 1Gi RWO local-path 34m
local-data-vertdb-sc-2 Bound pvc-4541f7c9-3afc-47f0-8d04-67fac370ee88 1Gi RWO local-path 34m
-
Delete the PVCs:
$ kubectl delete pvc local-data-vertdb-sc-0 local-data-vertdb-sc-1 local-data-vertdb-sc-2
persistentvolumeclaim "local-data-vertdb-sc-0" deleted
persistentvolumeclaim "local-data-vertdb-sc-1" deleted
persistentvolumeclaim "local-data-vertdb-sc-2" deleted
-
Revive the database with revive.yaml:
$ kubectl apply -f revive.yaml
verticadb.vertica.com/vertdb created
After the revive completes, all Vertica pods are running, and PVCs are recreated on new nodes. Wait for the operator to start the database.
Deploying to Istio
Vertica does not officially support Istio because the Istio sidecar port requirement conflicts with the port that Vertica requires for internal node communication. However, you can deploy Vertica on Kubernetes to Istio with changes to the Istio InboundInterceptionMode setting. Vertica provides access to this setting with annotations on the VerticaDB CR.
REDIRECT mode
REDIRECT mode is the default InboundInterceptionMode setting, and it requires that you disable network address translation (NAT) on port 5434, the port that the pods use for internal communication. Disable NAT on this port with the excludeInboundPorts annotation:
apiVersion: vertica.com/v1
kind: VerticaDB
metadata:
  name: vdb
spec:
  annotations:
    traffic.sidecar.istio.io/excludeInboundPorts: "5434"