CREATE HCATALOG SCHEMA
Define a schema for data stored in a Hive data warehouse using the HCatalog Connector. For more information, see Using the HCatalog Connector.
Most of the optional parameters are read out of Hadoop configuration files if available. If you copied the Hadoop configuration files as described in Configuring Vertica for HCatalog, you can omit most parameters. By default this statement uses the values specified in those configuration files. If the configuration files are complete, the following is a valid statement:
=> CREATE HCATALOG SCHEMA hcat;
If a value is not specified in the configuration files and a default is shown in the parameter list, then that default value is used.
Some parameters apply only if you are using HiveServer2 (the default). Others apply only if you are using WebHCat, a legacy Hadoop service. When using HiveServer2, use HIVESERVER2_HOSTNAME to specify the server host. When using WebHCat, use WEBSERVICE_HOSTNAME to specify the server host.
If you need to use WebHCat you must also set the HCatalogConnectorUseHiveServer2 configuration parameter to 0. See Apache Hadoop parameters.
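For example, a superuser can make this change at the database level with a statement such as the following (DEFAULT refers to the current database; substitute your database name if you prefer):
=> ALTER DATABASE DEFAULT SET PARAMETER HCatalogConnectorUseHiveServer2 = 0;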
After creating the schema, you can change many (but not all) parameters using ALTER HCATALOG SCHEMA.
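For example, a statement along the following lines points an existing schema at a different HiveServer2 host (the hostname shown is illustrative):
=> ALTER HCATALOG SCHEMA hcat SET HIVESERVER2_HOSTNAME='hs2.example.com';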
Syntax
CREATE HCATALOG SCHEMA [IF NOT EXISTS] schemaName
[AUTHORIZATION user-id]
[WITH [param=value [,...] ] ]
Arguments
Argument | Description |
---|---|
[IF NOT EXISTS] | If given, the statement exits without an error when the schema named in schemaName already exists. |
schemaName | The name of the schema to create in the Vertica catalog. The tables in the Hive database will be available through this schema. |
AUTHORIZATION user-id | The name of a Vertica account to own the schema being created. This parameter is ignored if Kerberos authentication is being used; in that case the current vsql user is used. |
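For example, the following statement creates the schema only if it does not already exist and assigns ownership to a hypothetical Vertica account named etl_user:
=> CREATE HCATALOG SCHEMA IF NOT EXISTS hcat AUTHORIZATION etl_user;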
Parameters
Parameter | Description |
---|---|
HOSTNAME | The hostname, IP address, or URI of the database server that stores the Hive data warehouse's metastore information. If you specify this parameter and do not also specify PORT, this value must be a URI in the format used by the hive.metastore.uris property. If the Hive metastore supports High Availability, you can specify a comma-separated list of URIs for this value. If this value is not specified, hive-site.xml must be available. |
PORT | The port number on which the metastore database is running. If you specify this parameter, you must also specify HOSTNAME and it must be a name or IP address (not a URI). |
HIVESERVER2_HOSTNAME | The hostname or IP address of the HiveServer2 service. This parameter is optional if hive-site.xml sets hive.server2.thrift.bind.host to a valid host or enables dynamic service discovery (hive.server2.support.dynamic.service.discovery). This parameter is ignored if you are using WebHCat. |
WEBSERVICE_HOSTNAME | The hostname or IP address of the WebHCat service, if using WebHCat instead of HiveServer2. If this value is not specified, webhcat-site.xml must be available. |
WEBSERVICE_PORT | The port number on which the WebHCat service is running, if using WebHCat instead of HiveServer2. If this value is not specified, webhcat-site.xml must be available. |
WEBHDFS_ADDRESS | The host and port ("host:port") for the WebHDFS service. This parameter is used only for reading ORC and Parquet files. If this value is not set, hdfs-site.xml must be available to read these file types through the HCatalog Connector. |
HCATALOG_SCHEMA | The name of the Hive schema or database that the Vertica schema is being mapped to. The default is schemaName. |
CUSTOM_PARTITIONS | Whether the Hive schema uses custom partition locations ('YES' or 'NO'). If the schema uses custom partition locations, then Vertica queries Hive to get those locations when executing queries. These additional Hive queries can be expensive, so use this parameter only if you need to. The default is 'NO' (disabled). For more information, see Using Partitioned Data. |
HCATALOG_USER | The username of the HCatalog user to use when making calls to the HiveServer2 or WebHCat server. The default is the current database user. |
HCATALOG_CONNECTION_TIMEOUT | The number of seconds the HCatalog Connector waits for a successful connection to the HiveServer2 or WebHCat server. A value of 0 means wait indefinitely. |
HCATALOG_SLOW_TRANSFER_LIMIT | The lowest data transfer rate (in bytes per second) from the HiveServer2 or WebHCat server that the HCatalog Connector accepts. See HCATALOG_SLOW_TRANSFER_TIME for details. |
HCATALOG_SLOW_TRANSFER_TIME | The number of seconds the HCatalog Connector waits before enforcing the data transfer rate lower limit. After this time has passed, the HCatalog Connector tests whether the data transfer rate is at least as fast as the value set in HCATALOG_SLOW_TRANSFER_LIMIT. If it is not, then the HCatalog Connector breaks the connection and terminates the query. |
SSL_CONFIG | The path of the Hadoop ssl-client.xml configuration file. This parameter is required if you are using HiveServer2 and it uses SSL wire encryption. This parameter is ignored if you are using WebHCat. |
The default values for HCATALOG_CONNECTION_TIMEOUT, HCATALOG_SLOW_TRANSFER_LIMIT, and HCATALOG_SLOW_TRANSFER_TIME are set by the database configuration parameters HCatConnectionTimeout, HCatSlowTransferLimit, and HCatSlowTransferTime. See Apache Hadoop parameters for more information.
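For example, the following statement overrides the connection timeout and slow-transfer defaults for a single schema; the hostname and the numeric values shown are illustrative:
=> CREATE HCATALOG SCHEMA hcat WITH HIVESERVER2_HOSTNAME='hs.example.com'
   HCATALOG_CONNECTION_TIMEOUT=30
   HCATALOG_SLOW_TRANSFER_LIMIT=10240
   HCATALOG_SLOW_TRANSFER_TIME=60;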
Configuration files
The HCatalog Connector uses the following values from the Hadoop configuration files if you do not override them when creating the schema.
File | Properties |
---|---|
hive-site.xml | hive.server2.thrift.bind.host (used for HIVESERVER2_HOSTNAME), hive.server2.thrift.port, hive.server2.transport.mode, hive.server2.authentication, hive.server2.authentication.kerberos.principal, hive.server2.support.dynamic.service.discovery, hive.zookeeper.quorum (used as HIVESERVER2_HOSTNAME if dynamic service discovery is enabled), hive.zookeeper.client.port, hive.server2.zookeeper.namespace, hive.metastore.uris (used for HOSTNAME and PORT) |
ssl-client.xml | ssl.client.truststore.location, ssl.client.truststore.password |
Privileges
The user must be a superuser or be granted all permissions on the database to use this statement.
The user also requires access to Hive data in one of the following ways:
- Have USAGE permissions on hcatalog_schema, if Hive does not use an authorization service (Sentry or Ranger) to manage access (see the example after this list).
- Have permission through an authorization service, if Hive uses one to manage access. In this case you must either set EnableHCatImpersonation to 0, to access data as the Vertica principal, or grant users access to the HDFS data. For Sentry, you can use ACL synchronization to manage HDFS access.
- Be the dbadmin user, with or without an authorization service.
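For example, if Hive does not use an authorization service, a superuser might grant a hypothetical user access to the mapped schema as follows:
=> GRANT USAGE ON SCHEMA hcat TO etl_user;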
Examples
The following example shows how to use CREATE HCATALOG SCHEMA to define a new schema for tables stored in a Hive database and then query the system tables that contain information about those tables:
=> CREATE HCATALOG SCHEMA hcat WITH HOSTNAME='hcathost' PORT=9083
HCATALOG_SCHEMA='default' HIVESERVER2_HOSTNAME='hs.example.com'
SSL_CONFIG='/etc/hadoop/conf/ssl-client.xml' HCATALOG_USER='admin';
CREATE SCHEMA
=> \x
Expanded display is on.
=> SELECT * FROM v_catalog.hcatalog_schemata;
-[ RECORD 1 ]----------------+-------------------------------------------
schema_id | 45035996273748224
schema_name | hcat
schema_owner_id | 45035996273704962
schema_owner | admin
create_time | 2017-12-05 14:43:03.353404-05
hostname | hcathost
port | -1
hiveserver2_hostname | hs.example.com
webservice_hostname |
webservice_port | 50111
webhdfs_address | hs.example.com:50070
hcatalog_schema_name | default
ssl_config | /etc/hadoop/conf/ssl-client.xml
hcatalog_user_name | admin
hcatalog_connection_timeout | -1
hcatalog_slow_transfer_limit | -1
hcatalog_slow_transfer_time | -1
custom_partitions | f
=> SELECT * FROM v_catalog.hcatalog_table_list;
-[ RECORD 1 ]------+------------------
table_schema_id | 45035996273748224
table_schema | hcat
hcatalog_schema | default
table_name | nation
hcatalog_user_name | admin
-[ RECORD 2 ]------+------------------
table_schema_id | 45035996273748224
table_schema | hcat
hcatalog_schema | default
table_name | raw
hcatalog_user_name | admin
-[ RECORD 3 ]------+------------------
table_schema_id | 45035996273748224
table_schema | hcat
hcatalog_schema | default
table_name | raw_rcfile
hcatalog_user_name | admin
-[ RECORD 4 ]------+------------------
table_schema_id | 45035996273748224
table_schema | hcat
hcatalog_schema | default
table_name | raw_sequence
hcatalog_user_name | admin
The following example shows how to specify more than one metastore host.
=> CREATE HCATALOG SCHEMA hcat
WITH HOSTNAME='thrift://node1.example.com:9083,thrift://node2.example.com:9083';
The following example shows how to include custom partition locations:
=> CREATE HCATALOG SCHEMA hcat WITH HCATALOG_SCHEMA='default'
HIVESERVER2_HOSTNAME='hs.example.com'
CUSTOM_PARTITIONS='yes';