Defining a schema using the HCatalog Connector

After you set up the HCatalog Connector, you can use it to define a schema in your Vertica database to access the tables in a Hive database. You define the schema using the CREATE HCATALOG SCHEMA statement.

When creating the schema, you must supply the name of the schema to define in Vertica. All other parameters are optional; if you omit a value, Vertica uses a default. Vertica reads some of these defaults from the HDFS configuration files; see Configuration parameters.

To create the schema, you must have read access to all Hive data. Verify that the user creating the schema has been granted access, either directly or through an authorization service such as Sentry or Ranger. The dbadmin user has no automatic special privileges.

After you create the schema, you can change many parameters using the ALTER HCATALOG SCHEMA statement.
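For example, assuming a schema named hcat like the one created in the example that follows, you could point it at a different HiveServer2 host. This is a sketch; the hostname is a placeholder:

=> ALTER HCATALOG SCHEMA hcat SET HIVESERVER2_HOSTNAME='hs2.example.com';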

After you define the schema, you can query the data in the Hive data warehouse in the same way you query a native Vertica table. The following example demonstrates creating an HCatalog schema and then querying several system tables to examine the contents of the new schema. See Viewing Hive schema and table metadata for more information about these tables.

=> CREATE HCATALOG SCHEMA hcat WITH HOSTNAME='hcathost' PORT=9083
   HCATALOG_SCHEMA='default' HIVESERVER2_HOSTNAME='hs.example.com'
   SSL_CONFIG='/etc/hadoop/conf/ssl-client.xml' HCATALOG_USER='admin';
CREATE SCHEMA
=> \x
Expanded display is on.

=> SELECT * FROM v_catalog.hcatalog_schemata;
-[ RECORD 1 ]----------------+-------------------------------------------
schema_id                    | 45035996273748224
schema_name                  | hcat
schema_owner_id              | 45035996273704962
schema_owner                 | admin
create_time                  | 2017-12-05 14:43:03.353404-05
hostname                     | hcathost
port                         | -1
hiveserver2_hostname         | hs.example.com
webservice_hostname          |
webservice_port              | 50111
webhdfs_address              | hs.example.com:50070
hcatalog_schema_name         | default
ssl_config                   | /etc/hadoop/conf/ssl-client.xml
hcatalog_user_name           | admin
hcatalog_connection_timeout  | -1
hcatalog_slow_transfer_limit | -1
hcatalog_slow_transfer_time  | -1
custom_partitions            | f

=> SELECT * FROM v_catalog.hcatalog_table_list;
-[ RECORD 1 ]------+------------------
table_schema_id    | 45035996273748224
table_schema       | hcat
hcatalog_schema    | default
table_name         | nation
hcatalog_user_name | admin
-[ RECORD 2 ]------+------------------
table_schema_id    | 45035996273748224
table_schema       | hcat
hcatalog_schema    | default
table_name         | raw
hcatalog_user_name | admin
-[ RECORD 3 ]------+------------------
table_schema_id    | 45035996273748224
table_schema       | hcat
hcatalog_schema    | default
table_name         | raw_rcfile
hcatalog_user_name | admin
-[ RECORD 4 ]------+------------------
table_schema_id    | 45035996273748224
table_schema       | hcat
hcatalog_schema    | default
table_name         | raw_sequence
hcatalog_user_name | admin
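After the schema is defined, the Hive tables listed above can be queried like any Vertica table. A minimal sketch, using the nation table from the listing:

=> SELECT COUNT(*) FROM hcat.nation;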

Configuration parameters

The HCatalog Connector uses the following values from the Hadoop configuration files if you do not override them when creating the schema.

hive-site.xml:
    hive.server2.thrift.bind.host (used for HIVESERVER2_HOSTNAME)
    hive.server2.thrift.port
    hive.server2.transport.mode
    hive.server2.authentication
    hive.server2.authentication.kerberos.principal
    hive.server2.support.dynamic.service.discovery
    hive.zookeeper.quorum (used as HIVESERVER2_HOSTNAME if dynamic service discovery is enabled)
    hive.zookeeper.client.port
    hive.server2.zookeeper.namespace
    hive.metastore.uris (used for HOSTNAME and PORT)

ssl-client.xml:
    ssl.client.truststore.location
    ssl.client.truststore.password

Using partitioned data

Hive supports partitioning data, as in the following example:

hive> create table users (name varchar(64), address string, city varchar(64))
    partitioned by (state varchar(64)) stored as orc;

Vertica takes advantage of partitioning information for formats that provide it in metadata (ORC and Parquet). Queries can skip irrelevant partitions entirely (partition pruning), and Vertica does not need to materialize partition columns. For more about use of partitions, see Improving query performance and Using partition columns.
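For example, a Vertica query that filters on the partition column lets Vertica skip partitions that cannot match. This sketch assumes the users table above is exposed through an HCatalog schema named hcat:

=> SELECT name, city FROM hcat.users WHERE state = 'MA';

Only the state=MA partition needs to be read; other partitions are pruned.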

By default Hive stores partition information under the path for the table definition, which might be local to Hive. However, a Hive user can choose to store partition information elsewhere, such as in a shared location like S3, as in the following example:

hive> alter table users add partition (state='MA')
    location 's3a://DataLake/partitions/users/state=MA';

To find such partitions, Vertica must query Hive during query execution for the location of a table's partition information.

Because these additional Hive queries can be expensive, by default Vertica looks for partition information only in Hive's default location. If your Hive table uses a custom partition location, set the CUSTOM_PARTITIONS parameter:

=> CREATE HCATALOG SCHEMA hcat WITH HOSTNAME='hcathost' PORT=9083
    HCATALOG_SCHEMA='default' HIVESERVER2_HOSTNAME='hs.example.com'
    SSL_CONFIG='/etc/hadoop/conf/ssl-client.xml' HCATALOG_USER='admin'
    CUSTOM_PARTITIONS='yes';

Vertica can access partition locations that use the S3 and S3a schemes; it does not support the S3n scheme.

The HIVE_CUSTOM_PARTITIONS_ACCESSED system table records all custom partition locations that have been used in queries.
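To review which custom partition locations your queries have read, you can query this system table directly (the schema qualifier is omitted here as an assumption):

=> SELECT * FROM hive_custom_partitions_accessed;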

Using the HCatalog Connector with WebHCat

By default, the HCatalog Connector uses HiveServer2 to access Hive data. If you are instead using WebHCat, set the HCatalogConnectorUseHiveServer2 configuration parameter to 0 before creating the schema, as in the following example:

=> ALTER DATABASE DEFAULT SET PARAMETER HCatalogConnectorUseHiveServer2 = 0;
=> CREATE HCATALOG SCHEMA hcat WITH WEBSERVICE_HOSTNAME='webhcat.example.com';

If you have previously used WebHCat, you can switch to using HiveServer2 by setting the configuration parameter to 1 and using ALTER HCATALOG SCHEMA to set HIVESERVER2_HOSTNAME. You do not need to remove the WebHCat values; the HCatalog Connector uses the value of HCatalogConnectorUseHiveServer2 to determine which parameters to use.
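For example, switching an existing schema named hcat back to HiveServer2 might look like this (the hostname is a placeholder):

=> ALTER DATABASE DEFAULT SET PARAMETER HCatalogConnectorUseHiveServer2 = 1;
=> ALTER HCATALOG SCHEMA hcat SET HIVESERVER2_HOSTNAME='hs.example.com';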