HDFS file system
HDFS is the Hadoop Distributed File System. You can use the webhdfs
and swebhdfs
schemes to access data through the WebHDFS service. Vertica also supports the hdfs
scheme, which by default uses the deprecated LibHDFS++ package. To treat hdfs
URIs as if they were webhdfs
URIs, set the HDFSUseWebHDFS configuration parameter to 1 (enabled).
If you specify a webhdfs
URI but the Hadoop HTTP policy (dfs.http.policy
) is set to HTTPS_ONLY, Vertica automatically uses swebhdfs
instead.
If you use LibHDFS++, the WebHDFS service must still be available because Vertica falls back to WebHDFS for operations not supported by LibHDFS++.
Deprecated
Support for LibHDFS++ is deprecated. In the future, HDFSUseWebHDFS will be enabled in all cases andhdfs
URIs will be equivalent to webhdfs
URIs.
URI format
URIs in the webhdfs
, swebhdfs
, and hdfs
schemes all have two formats, depending on whether you specify a name service or the host and port of a name node:
[[s]web]hdfs://[
nameservice
]/
path
[[s]web]hdfs://
namenode-host:port
/
path
Characters may be URL-encoded (%NN where NN is a two-digit hexadecimal number) but are not required to be, except that the '%' character must be encoded.
To use the default name service specified in the HDFS configuration files, omit nameservice
. Use this shorthand only for reading external data, not for creating a storage location.
Always specify a name service or host explicitly when using Vertica with more than one HDFS cluster. The name service or host name must be globally unique. Using [web]hdfs:///
could produce unexpected results because Vertica uses the first value of fs.defaultFS
that it finds.
Authentication
Vertica can use Kerberos authentication with Cloudera or Hortonworks HDFS clusters. See Accessing kerberized HDFS data.
For loading and exporting data, Vertica can access HDFS clusters protected by mTLS through the swebhdfs
scheme. You must create a certificate and key and set the WebhdfsClientCertConf configuration parameter.
You can use CREATE KEY and CREATE CERTIFICATE to create temporary, session-scoped values if you specify the TEMPORARY keyword. Temporary keys and certificates are stored in memory, not on disk.
The WebhdfsClientCertConf configuration parameter holds client credentials for one or more HDFS clusters. The value is a JSON string listing name services or authorities and their corresponding keys. You can set the configuration parameter at the session or database level. Setting the parameter at the database level has the following additional requirements:
-
The UseServerIdentityOverUserIdentity configuration parameter must be set to 1 (true).
-
The user must be dbadmin or must have access to the user storage location on HDFS.
The following example shows how to use mTLS. The key and certificate values themselves are not shown, just the beginning and end markers:
=> CREATE TEMPORARY KEY client_key TYPE 'RSA'
AS '-----BEGIN PRIVATE KEY-----...-----END PRIVATE KEY-----';
-> CREATE TEMPORARY CERTIFICATE client_cert
AS '-----BEGIN CERTIFICATE-----...-----END CERTIFICATE-----' key client_key;
=> ALTER SESSION SET WebhdfsClientCertConf =
'[{"authority": "my.hdfs.namenode1:50088", "certName": "client_cert"}]';
=> COPY people FROM 'swebhdfs://my.hdfs.namenode1:50088/path/to/file/1.txt';
Rows Loaded
-------------
1
(1 row)
To configure access to more than one HDFS cluster, define the keys and certificates and then include one object per cluster in the value of WebhdfsClientCertConf:
=> ALTER SESSION SET WebhdfsClientCertConf =
'[{"authority" : "my.authority.com:50070", "certName" : "myCert"},
{"nameservice" : "prod", "certName" : "prodCert"}]';
Configuration parameters
The following database configuration parameters apply to the HDFS file system. You can set parameters at different levels with the appropriate ALTER statement, such as ALTER SESSION...SET PARAMETER. Query the CONFIGURATION_PARAMETERS system table to determine what levels (node, session, user, database) are valid for a given parameter. For information about all parameters related to Hadoop, see Apache Hadoop parameters.
- EnableHDFSBlockInfoCache
- Boolean, whether to distribute block location metadata collected during planning on the initiator to all database nodes for execution, reducing name node contention. Disabled by default.
- HadoopConfDir
- Directory path containing the XML configuration files copied from Hadoop. The same path must be valid on every Vertica node. The files are accessed by the Linux user under which the Vertica server process runs.
- HadoopImpersonationConfig
- Session parameter specifying the delegation token or Hadoop user for HDFS access. See HadoopImpersonationConfig format for information about the value of this parameter and Proxy users and delegation tokens for more general context.
- HDFSUseWebHDFS
- Boolean. If true, URIs in the
hdfs
scheme are treated as if they were in thewebhdfs
scheme. If false, Vertica uses LibHDFS++ where possible, though some operations can still use WebHDFS if not supported by LibHDFS++. - WebhdfsClientCertConf
- mTLS configurations for accessing one or more WebHDFS servers, a JSON string. Each object must specify either a
nameservice
orauthority
field and acertName
field. See Authentication.
Configuration files
The path specified in HadoopConfDir must include a directory containing the files listed in the following table. Vertica reads these files at database start time. If you do not set a value, Vertica looks for the files in /etc/hadoop/conf.
If a property is not defined, Vertica uses the defaults shown in the table. If no default is specified for a property, the configuration files must specify a value.
File | Properties | Default |
---|---|---|
core-site.xml | fs.defaultFS | none |
(for doAs users:) hadoop.proxyuser.username .users |
none | |
(for doAs users:) hadoop.proxyuser.username .hosts |
none | |
hdfs-site.xml | dfs.client.failover.max.attempts | 15 |
dfs.client.failover.sleep.base.millis | 500 | |
dfs.client.failover.sleep.max.millis | 15000 | |
(For HA NN:) dfs.nameservices | none | |
(WebHDFS:) dfs.namenode.http-address or dfs.namenode.https-address | none | |
(WebHDFS:) dfs.datanode.http.address or dfs.datanode.https.address | none | |
(WebHDFS:) dfs.http.policy | HTTP_ONLY |
If using High Availability (HA) Name Nodes, the individual name nodes must also be defined in hdfs-site.xml.
Note
If you are using Eon Mode with communal storage on HDFS, then if you set dfs.encrypt.data.transfer you must use theswebhdfs
scheme for communal storage.
To verify that Vertica can find configuration files in HadoopConfDir, use the VERIFY_HADOOP_CONF_DIR function.
To test access through the hdfs
scheme, use the HDFS_CLUSTER_CONFIG_CHECK function.
For more information about testing your configuration, see Verifying HDFS configuration.
To reread the configuration files, use the CLEAR_HDFS_CACHES function.
Name nodes and name services
You can access HDFS data using the default name node by not specifying a name node or name service:
=> COPY users FROM 'webhdfs:///data/users.csv';
Vertica uses the fs.defaultFS
Hadoop configuration parameter to find the name node. (It then uses that name node to locate the data.) You can instead specify a host and port explicitly using the following format:
webhdfs://nn-host:nn-port/
The specified host is the name node, not an individual data node. If you are using High Availability (HA) Name Nodes you should not use an explicit host because high availability is provided through name services instead.
If the HDFS cluster uses High Availability Name Nodes or defines name services, use the name service instead of the host and port, in the format webhdfs://nameservice/
. The name service you specify must be defined in hdfs-site.xml
.
The following example shows how you can use a name service, hadoopNS:
=> CREATE EXTERNAL TABLE users (id INT, name VARCHAR(20))
AS COPY FROM 'webhdfs://hadoopNS/data/users.csv';
If you are using Vertica to access data from more than one HDFS cluster, always use explicit name services or hosts in the URL. Using the ///
shorthand could produce unexpected results because Vertica uses the first value of fs.defaultFS
that it finds. To access multiple HDFS clusters, you must use host and service names that are globally unique. See Configuring HDFS access for more information.