Using HDFS storage locations

Vertica stores data in its native format, ROS, in storage locations. You can place storage locations on the local Linux file system or in HDFS. If you place storage locations on HDFS, you must perform additional configuration in HDFS to be able to manage them. These HDFS requirements are in addition to the Vertica requirements for managing storage locations and backup/restore.

If you use HDFS storage locations, the HDFS data must be available when you start Vertica: your HDFS cluster must be operational, and the ROS files must be present. If data files have been moved or corrupted, or if your HDFS cluster is unresponsive, Vertica cannot start.
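
Before you start Vertica, you can verify that HDFS is reachable and healthy with a generic status check such as the following (a general HDFS check, not a Vertica-specific command):

$ hdfs dfsadmin -report | head -n 5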

1 - Hadoop configuration for backup and restore

If your Vertica cluster uses storage locations on HDFS, and you want to be able to back up and restore those storage locations using vbr, you must enable snapshotting in HDFS.

The Vertica backup script uses HDFS's snapshotting feature to create a backup of HDFS storage locations. A directory must allow snapshotting before HDFS can take a snapshot. Only a Hadoop superuser can enable snapshotting on a directory. Vertica can enable snapshotting automatically if the database administrator is also a Hadoop superuser.

If HDFS is unsecured, the following instructions apply to the database administrator account, usually dbadmin. If HDFS uses Kerberos security, the following instructions apply to the principal stored in the Vertica keytab file, usually vertica. The instructions below use the term "database account" to refer to this user.
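
If HDFS uses Kerberos, you can confirm which principal the Vertica keytab file contains by listing its entries (the keytab path shown here is hypothetical):

$ klist -kt /etc/vertica/vertica.keytab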

We recommend that you make the database administrator or principal a Hadoop superuser. If you are not able to do so, you must enable snapshotting on the directory before configuring it for use by Vertica.

The steps you need to take to make the Vertica database administrator account a superuser depend on the distribution of Hadoop you are using. Consult your Hadoop distribution's documentation for details.
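
As an illustration only (the exact procedure varies by distribution), many Hadoop deployments treat members of the group named by the dfs.permissions.superusergroup property as HDFS superusers, so adding the database account to that group on the NameNode is one common approach:

$ hdfs getconf -confKey dfs.permissions.superusergroup
supergroup
$ sudo usermod -aG supergroup dbadmin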

Manually enabling snapshotting for a directory

If you cannot grant superuser status to the database account, you can instead enable snapshotting of each directory manually. Use the following command:

$ hdfs dfsadmin -allowSnapshot path

Issue this command for each directory on each node. Remember to do this each time you add a new node to your Vertica cluster.
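
For example, assuming hypothetical per-node storage directories under /user/dbadmin, you could enable snapshotting on all of them in a single loop:

$ for dir in /user/dbadmin/v_vmart_node000{1,2,3}; do
>   hdfs dfsadmin -allowSnapshot "$dir"
> done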

Nested snapshottable directories are not allowed, so you cannot enable snapshotting for a parent directory to automatically enable it for child directories. You must enable it for each individual directory.

Additional requirements for Kerberos

If HDFS uses Kerberos, then in addition to granting the keytab principal access, you must give Vertica access to certain Hadoop configuration files. See Configuring Kerberos.

Testing the database account's ability to make HDFS directories snapshottable

After making the database account a Hadoop superuser, verify that the account can make directories snapshottable:

  1. Log into the Hadoop cluster as the database account (dbadmin by default).

  2. Determine a location in HDFS where the database administrator can create a directory. The /tmp directory is usually available. Create a test HDFS directory using the command:

    $ hdfs dfs -mkdir /path/testdir
    
  3. Make the test directory snapshottable using the command:

    $ hdfs dfsadmin -allowSnapshot /path/testdir
    

The following example demonstrates creating an HDFS directory and making it snapshottable:

$ hdfs dfs -mkdir /tmp/snaptest
$ hdfs dfsadmin -allowSnapshot /tmp/snaptest
Allowing snaphot on /tmp/snaptest succeeded
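
When you finish testing, you can disallow snapshotting on the test directory and then remove it. This cleanup sketch succeeds because the directory has no snapshots:

$ hdfs dfsadmin -disallowSnapshot /tmp/snaptest
$ hdfs dfs -rm -r /tmp/snaptest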

2 - Removing HDFS storage locations

The steps to remove an HDFS storage location are similar to standard storage locations:

  1. Remove any existing data from the HDFS storage location by using SET_OBJECT_STORAGE_POLICY to change each object's storage location. Alternatively, use CLEAR_OBJECT_STORAGE_POLICY. Because the Tuple Mover runs infrequently, set the enforce-storage-move parameter to true to move the data immediately (see the sketch after this list).

  2. On each host that defines the storage location, retire the location using RETIRE_LOCATION, again setting enforce-storage-move to true.

  3. Drop the location on each node using DROP_LOCATION.

  4. Optionally remove the snapshots and files from the HDFS directory for the storage location.

  5. Perform a full database backup.

For more information about changing storage policies, changing usage, retiring locations, and dropping locations, see Managing storage locations.
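
The following sketch illustrates steps 1 through 3 for a single node, run through vsql from the Linux command line. The table name, location label, path, and node name are all hypothetical:

$ vsql -c "SELECT SET_OBJECT_STORAGE_POLICY('sales', 'local_data', true);"
$ vsql -c "SELECT RETIRE_LOCATION('webhdfs://hadoopNS/user/dbadmin/v_vmart_node0001', 'v_vmart_node0001', true);"
$ vsql -c "SELECT DROP_LOCATION('webhdfs://hadoopNS/user/dbadmin/v_vmart_node0001', 'v_vmart_node0001');"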

Removing storage location files from HDFS

Dropping an HDFS storage location does not automatically clean the HDFS directory that stored the location's files. Any snapshots of the data files created when backing up the location are also not deleted. These files consume disk space on HDFS and also prevent the directory from being reused as an HDFS storage location. Vertica cannot create a storage location in a directory that contains existing files or subdirectories.

To delete the files, log into the Hadoop cluster and remove them with HDFS shell commands, or use another HDFS file management tool.

Removing backup snapshots

HDFS returns an error if you attempt to remove a directory that has snapshots:

$ hdfs dfs -rm -r -f -skipTrash /user/dbadmin/v_vmart_node0001
rm: The directory /user/dbadmin/v_vmart_node0001 cannot be deleted since
/user/dbadmin/v_vmart_node0001 is snapshottable and already has snapshots

The Vertica backup script creates snapshots of HDFS storage locations as part of the backup process. If you made backups of your HDFS storage location, you must delete the snapshots before removing the directories.

HDFS stores snapshots in a subdirectory named .snapshot. You can list the snapshots in the directory using the standard HDFS ls command:

$ hdfs dfs -ls /user/dbadmin/v_vmart_node0001/.snapshot
Found 1 items
drwxrwx---   - dbadmin supergroup          0 2014-09-02 10:13 /user/dbadmin/v_vmart_node0001/.snapshot/s20140902-101358.629

To remove snapshots, use the command:

$ hdfs dfs -deleteSnapshot directory snapshotname

The following example demonstrates the command to delete the snapshot shown in the previous example:

$ hdfs dfs -deleteSnapshot /user/dbadmin/v_vmart_node0001 s20140902-101358.629

You must delete each snapshot from the directory for each host in the cluster. After you have deleted the snapshots, you can delete the directories in the storage location.
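
As a sketch (the directory names are hypothetical), you can enumerate and delete every snapshot under each node's directory in one pass:

$ for dir in /user/dbadmin/v_vmart_node000{1,2,3}; do
>   for snap in $(hdfs dfs -ls "$dir/.snapshot" | awk '{print $NF}' | grep "/.snapshot/"); do
>     hdfs dfs -deleteSnapshot "$dir" "$(basename "$snap")"
>   done
> done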

See Apache's HDFS Snapshot documentation for more information about managing and removing snapshots.

Removing the storage location directories

You can remove the directories that held the storage location's data by either of the following methods:

  • Use an HDFS file manager to delete directories. See your Hadoop distribution's documentation to determine if it provides a file manager.

  • Log into the Hadoop NameNode using the database administrator's account and use HDFS's rm -r command to delete the directories. See Apache's File System Shell Guide for more information.

The following example uses the HDFS rm -r command from the Linux command line to delete the directories left behind in the HDFS storage location directory /user/dbadmin. It uses the -skipTrash flag to force the immediate deletion of the files:

$ hdfs dfs -ls /user/dbadmin
Found 3 items
drwxrwx---   - dbadmin supergroup          0 2014-08-29 15:11 /user/dbadmin/v_vmart_node0001
drwxrwx---   - dbadmin supergroup          0 2014-08-29 15:11 /user/dbadmin/v_vmart_node0002
drwxrwx---   - dbadmin supergroup          0 2014-08-29 15:11 /user/dbadmin/v_vmart_node0003

$ hdfs dfs -rm -r -skipTrash /user/dbadmin/*
Deleted /user/dbadmin/v_vmart_node0001
Deleted /user/dbadmin/v_vmart_node0002
Deleted /user/dbadmin/v_vmart_node0003