Requirements for HDFS storage locations

Caution

If you use HDFS storage locations, the HDFS data must be available when you start Vertica. Your HDFS cluster must be operational, and the ROS files must be present. If you moved data files, or they are corrupted, or your HDFS cluster is not responsive, Vertica cannot start.

To store Vertica's data on HDFS, verify that:

Your Hadoop cluster has WebHDFS enabled.
All of the nodes in your Vertica cluster can connect to all of the nodes in your Hadoop cluster. Any firewall between the two clusters must allow connections on the ports used by HDFS.
If your HDFS cluster is unsecured, you have a Hadoop user whose username matches the name of the Vertica database superuser (usually named dbadmin). This Hadoop user must have read and write access to the HDFS directory where you want Vertica to store its data.
If your HDFS cluster uses Kerberos authentication:
- You have a Kerberos principal for Vertica, and it has read and write access to the HDFS directory that will be used for the storage location. See Kerberos below for instructions.
- The Kerberos KDC is running.
Your HDFS cluster has enough storage available for Vertica data. See Space Requirements below for details.
The data you store in an HDFS-backed storage location does not expand your database's size beyond any data allowance in your Vertica license. Vertica counts data stored in an HDFS-backed storage location as part of any data allowance set by your license. See Managing licenses in the Administrator's Guide for more information.

Backup/Restore has additional requirements.

Space requirements

If your Vertica database is K-safe, HDFS-based storage locations contain two copies of the data you store in them. One copy is the primary projection, and the other is the buddy projection. If you have enabled HDFS's data-redundancy feature, Hadoop stores both projections multiple times. This duplication might seem excessive. However, it is similar to how a RAID level 1 or higher stores redundant copies of both the primary and buddy projections. The redundant copies also help the performance of HDFS by enabling multiple nodes to process a request for a file.

Verify that your HDFS installation has sufficient space available for redundant storage of both the primary and buddy projections of your K-safe data. You can adjust the number of duplicates stored by HDFS by setting the HadoopFSReplication configuration parameter. See Troubleshooting HDFS Storage Locations for details.

Kerberos

To use a storage location in HDFS with Kerberos, take the following additional steps:

Create a Kerberos principal for each Vertica node as explained in Using Kerberos with Vertica.
Give all node principals read and write permission to the HDFS directory you will use as a storage location.

If you plan to use vbr to back up and restore the location, see additional requirements in Requirements for backing up and restoring HDFS storage locations.

Adding HDFS storage locations to new nodes

If you add nodes to your Vertica cluster, they do not automatically have access to existing HDFS storage locations. You must manually create the storage location for the new node using the CREATE LOCATION statement. Do not use the ALL NODES option in this statement. Instead, use the NODE option with the name of the new node to tell Vertica that just that node needs to add the shared location.

Caution

You must manually create the storage location. Otherwise, the new node uses the default storage policy (usually, storage on the local Linux file system) to store data that the other nodes store in HDFS. As a result, the node can run out of disk space.

Consider an HDFS storage location that was created on a three-node cluster with the following statements:

=> CREATE LOCATION 'hdfs://hadoopNS/vertica/colddata' ALL NODES SHARED
    USAGE 'data' LABEL 'coldstorage';

=> SELECT SET_OBJECT_STORAGE_POLICY('SchemaName','coldstorage');

The following example shows how to add the storage location to a new cluster node:

=> CREATE LOCATION 'hdfs://hadoopNS/vertica/colddata' NODE 'v_vmart_node0004'
   SHARED USAGE 'data' LABEL 'coldstorage';

Any active standby nodes in your cluster when you create an HDFS storage location automatically create their own instances of the location. When the standby node takes over for a down node, it uses its own instance of the location to store data for objects using the HDFS storage policy. Treat standby nodes added after you create the storage location as any other new node. You must manually define the HDFS storage location.