Vertica stores data in its native format, ROS, in storage locations. You can place storage locations on the local Linux file system or in HDFS. If you place storage locations on HDFS, you must perform additional configuration in HDFS to be able to manage them. These HDFS requirements are in addition to the Vertica requirements for managing storage locations and backup/restore.
If you use HDFS storage locations, the HDFS data must be available when you start Vertica. Your HDFS cluster must be operational, and the ROS files must be present. If data files have been moved or corrupted, or your HDFS cluster is unresponsive, Vertica cannot start.
1 - Hadoop configuration for backup and restore
If your Vertica cluster uses storage locations on HDFS, and you want to be able to back up and restore those storage locations using vbr, you must enable snapshotting in HDFS.
The Vertica backup script uses HDFS's snapshotting feature to create a backup of HDFS storage locations. A directory must allow snapshotting before HDFS can take a snapshot. Only a Hadoop superuser can enable snapshotting on a directory. Vertica can enable snapshotting automatically if the database administrator is also a Hadoop superuser.
If HDFS is unsecured, the following instructions apply to the database administrator account, usually dbadmin. If HDFS uses Kerberos security, the following instructions apply to the principal stored in the Vertica keytab file, usually vertica. The instructions below use the term "database account" to refer to this user.
We recommend that you make the database administrator or principal a Hadoop superuser. If you are not able to do so, you must enable snapshotting on the directory before configuring it for use by Vertica.
The steps you need to take to make the Vertica database administrator account a superuser depend on the distribution of Hadoop you are using. Consult your Hadoop distribution's documentation for details.
Manually enabling snapshotting for a directory
If you cannot grant superuser status to the database account, you can instead enable snapshotting of each directory manually. Use the following command:
$ hdfs dfsadmin -allowSnapshot path
Issue this command for each directory on each node. Remember to do this each time you add a new node to your HDFS cluster.
Nested snapshottable directories are not allowed, so you cannot enable snapshotting for a parent directory to automatically enable it for child directories. You must enable it for each individual directory.
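For example, if each node's storage location is a separate directory under a common parent (the paths here are illustrative), enable snapshotting on every directory separately:
$ hdfs dfsadmin -allowSnapshot /user/dbadmin/v_vmart_node0001
$ hdfs dfsadmin -allowSnapshot /user/dbadmin/v_vmart_node0002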
Additional requirements for Kerberos
If HDFS uses Kerberos, then in addition to granting the keytab principal access, you must give Vertica access to certain Hadoop configuration files. See Configuring Kerberos.
Testing the database account's ability to make HDFS directories snapshottable
After making the database account a Hadoop superuser, verify that the account can set directories snapshottable:
1. Log into the Hadoop cluster as the database account (dbadmin by default).
2. Determine a location in HDFS where the database administrator can create a directory. The /tmp directory is usually available. Create a test HDFS directory using the command:
$ hdfs dfs -mkdir /path/testdir
3. Make the test directory snapshottable using the command:
$ hdfs dfsadmin -allowSnapshot /path/testdir
The following example demonstrates creating an HDFS directory and making it snapshottable:
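The /tmp/testdir path here is illustrative; substitute any directory the database account can create:
$ hdfs dfs -mkdir /tmp/testdir
$ hdfs dfsadmin -allowSnapshot /tmp/testdir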
2 - Removing HDFS storage locations
The steps to remove an HDFS storage location are similar to those for removing standard storage locations:
1. Remove any existing data from the HDFS storage location by using SET_OBJECT_STORAGE_POLICY to change each object's storage location. Alternatively, you can use CLEAR_OBJECT_STORAGE_POLICY. Because the Tuple Mover runs infrequently, set the enforce-storage-move parameter to true to make the change immediately.
2. Retire the location on each host that has the storage location defined, using RETIRE_LOCATION. Set enforce-storage-move to true.
3. Drop the location on each host using DROP_LOCATION (see the sketch after this list).
4. Optionally, remove the snapshots and files from the HDFS directory for the storage location.
5. Perform a full database backup.
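The following is a minimal sketch of the first three steps, run through vsql from the shell. The table name, HDFS path, and node name are hypothetical, and the optional arguments these functions accept vary by Vertica version, so check the SQL reference for your release:
$ vsql -c "SELECT CLEAR_OBJECT_STORAGE_POLICY('mytable');"
$ vsql -c "SELECT RETIRE_LOCATION('hdfs:///user/dbadmin/v_vmart', 'v_vmart_node0001', true);"
$ vsql -c "SELECT DROP_LOCATION('hdfs:///user/dbadmin/v_vmart', 'v_vmart_node0001');"
Repeat the RETIRE_LOCATION and DROP_LOCATION calls for each node that defines the storage location.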
For more information about changing storage policies, changing usage, retiring locations, and dropping locations, see Managing storage locations.
Important
If you have backed up the data in the HDFS storage location you are removing, you must perform a full database backup after you remove the location. Otherwise, if you later restore the database from a backup made before you removed the location, the location's data is restored.
Removing storage location files from HDFS
Dropping an HDFS storage location does not automatically clean the HDFS directory that stored the location's files. Any snapshots of the data files created when backing up the location are also not deleted. These files consume disk space on HDFS and also prevent the directory from being reused as an HDFS storage location. Vertica cannot create a storage location in a directory that contains existing files or subdirectories.
To delete the files, log into the Hadoop cluster and remove them from HDFS, or use another HDFS file management tool.
Removing backup snapshots
HDFS returns an error if you attempt to remove a directory that has snapshots:
$ hdfs dfs -rm -r -f -skipTrash /user/dbadmin/v_vmart_node0001
rm: The directory /user/dbadmin/v_vmart_node0001 cannot be deleted since
/user/dbadmin/v_vmart_node0001 is snapshottable and already has snapshots
The Vertica backup script creates snapshots of HDFS storage locations as part of the backup process. If you made backups of your HDFS storage location, you must delete the snapshots before removing the directories.
HDFS stores snapshots in a subdirectory named .snapshot. You can list the snapshots in the directory using the standard HDFS ls command:
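For example, to list the snapshots of the node directory from the preceding error message:
$ hdfs dfs -ls /user/dbadmin/v_vmart_node0001/.snapshot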
You must delete each snapshot from the directory for each host in the cluster. After you have deleted the snapshots, you can delete the directories in the storage location.
Important
Each snapshot's name is based on a timestamp, down to the millisecond. Because nodes create their snapshots independently, without synchronizing snapshot creation, their snapshot names differ. You must list each node's snapshot directory to learn the names of the snapshots it contains.
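You can delete a snapshot with HDFS's deleteSnapshot command, which takes the snapshottable directory and the snapshot name. The snapshot name below is hypothetical; use the names reported by listing each node's .snapshot directory:
$ hdfs dfs -deleteSnapshot /user/dbadmin/v_vmart_node0001 s20220101-120000.001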
You can remove the directories that held the storage location's data by either of the following methods:
Use an HDFS file manager to delete directories. See your Hadoop distribution's documentation to determine if it provides a file manager.
Log into the Hadoop NameNode using the database administrator's account and use HDFS's rm -r command to delete the directories. See Apache's File System Shell Guide for more information.
The following example uses the HDFS rm -r command from the Linux command line to delete the directories left behind in the HDFS storage location directory /user/dbadmin. It uses the -skipTrash flag to force the immediate deletion of the files:
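A sketch of the cleanup, assuming the per-node directories shown in the earlier error message (your directory names will differ):
$ hdfs dfs -rm -r -skipTrash /user/dbadmin/v_vmart_node0001
$ hdfs dfs -rm -r -skipTrash /user/dbadmin/v_vmart_node0002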