Requirements for HDFS storage locations

To store data on HDFS, verify that:

  • Your Hadoop cluster has WebHDFS enabled.

  • All of the nodes in your database cluster can connect to all of the nodes in your Hadoop cluster. Any firewall between the two clusters must allow connections on the ports used by HDFS. (A connectivity check is sketched after this list.)

  • If your HDFS cluster is unsecured, you have a Hadoop user whose username matches the name of the database superuser (usually dbadmin). This Hadoop user must have read and write access to the HDFS directory where you want the database to store its data.

  • If your HDFS cluster uses Kerberos authentication:

    • You have a Kerberos principal for the database, and it has read and write access to the HDFS directory that will be used for the storage location. See Kerberos below for instructions.

    • The Kerberos KDC is running.

  • Your HDFS cluster has enough storage available for database data. See Space Requirements below for details.

  • Data you store in an HDFS-backed storage location counts toward any data allowance set by your OpenText™ Analytics Database license, so it must not push your database beyond that allowance. See Managing licenses in the Administrator's Guide for more information. (A compliance check is sketched after this list.)
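
You can check the connectivity and license requirements from a database session. The following is a minimal sketch; HDFS_CLUSTER_CONFIG_CHECK and GET_COMPLIANCE_STATUS are meta-functions in recent database versions, and their availability and output format vary by release:

=> -- Test connectivity and configuration for the HDFS clusters the database uses
=> SELECT HDFS_CLUSTER_CONFIG_CHECK();

=> -- Report whether current data usage is within the license's data allowance
=> SELECT GET_COMPLIANCE_STATUS();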

Backup/Restore has additional requirements.

Space requirements

If your database is K-safe, HDFS-based storage locations contain two copies of the data you store in them: the primary projection and its buddy projection. If you have enabled HDFS's data-redundancy feature, Hadoop stores both projections multiple times. This duplication might seem excessive, but it is similar to how a RAID array at level 1 or higher stores redundant copies of your data. The redundant copies also improve HDFS performance by allowing multiple nodes to serve requests for a file.
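
If you are not sure of your cluster's K-safety, you can check it before estimating space requirements. This query is a sketch; the column names reflect recent versions of the SYSTEM table:

=> -- Compare the K-safety the database was designed for with its current value
=> SELECT designed_fault_tolerance, current_fault_tolerance FROM system;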

Verify that your HDFS installation has sufficient space available for redundant storage of both the primary and buddy projections of your K-safe data. You can adjust the number of duplicates stored by HDFS by setting the HadoopFSReplication configuration parameter. See Troubleshooting HDFS storage locations for details.
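
For example, because the database already stores a primary and a buddy copy of K-safe data, you might reduce HDFS replication for database files to a single copy and rely on K-safety for redundancy. This sketch assumes the ALTER DATABASE syntax of recent versions; older releases set the parameter with the SET_CONFIG_PARAMETER function instead:

=> -- Store one HDFS copy of each file the database writes, instead of the
=> -- HDFS cluster's default replication factor (commonly 3)
=> ALTER DATABASE DEFAULT SET HadoopFSReplication = 1;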

Kerberos

To use a storage location in HDFS with Kerberos, take the following additional steps:

  1. Create a Kerberos principal for each database node as explained in Using Kerberos with Vertica.

  2. Give all node principals read and write permission to the HDFS directory you will use as a storage location. A validation sketch follows these steps.
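
After completing these steps, you can have the database validate its Kerberos setup. KERBEROS_CONFIG_CHECK is a meta-function in recent versions; treat this as a sketch, since its availability and output vary by release:

=> -- Test that the database can obtain Kerberos tickets with its keytab and principal
=> SELECT KERBEROS_CONFIG_CHECK();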

If you plan to use vbr to back up and restore the location, see additional requirements in Requirements for backing up and restoring HDFS storage locations.

Adding HDFS storage locations to new nodes

If you add nodes to your database cluster, they do not automatically have access to existing HDFS storage locations. You must manually create the storage location for each new node using the CREATE LOCATION statement. Do not use the ALL NODES option in this statement. Instead, use the NODE option with the name of the new node to add the shared location on just that node.

Consider an HDFS storage location that was created on a three-node cluster with the following statements:

=> CREATE LOCATION 'hdfs://hadoopNS/vertica/colddata' ALL NODES SHARED
    USAGE 'data' LABEL 'coldstorage';

=> SELECT SET_OBJECT_STORAGE_POLICY('SchemaName','coldstorage');

The following example shows how to add the storage location to a new cluster node:

=> CREATE LOCATION 'hdfs://hadoopNS/vertica/colddata' NODE 'v_vmart_node0004'
   SHARED USAGE 'data' LABEL 'coldstorage';
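
To confirm that the new node now has the location, you can query the STORAGE_LOCATIONS system table. This is a sketch; the sharing_type column appears in recent versions:

=> SELECT node_name, location_path, sharing_type, location_label
   FROM storage_locations
   WHERE location_label = 'coldstorage';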

Any standby nodes that are active in your cluster when you create an HDFS storage location automatically create their own instances of the location. When a standby node takes over for a down node, it uses its own instance of the location to store data for objects that use the HDFS storage policy. Treat standby nodes added after you create the storage location like any other new node: you must manually define the HDFS storage location for them.