Using the HCatalog Connector
The Vertica HCatalog Connector lets you access data stored in Apache's Hive data warehouse software the same way you access it within a native Vertica table.
If your files are in the Optimized Row Columnar (ORC) or Parquet format and do not use complex types, the HCatalog Connector creates an external table and uses the ORC or Parquet reader instead of using the Java SerDe. See ORC (parser) and PARQUET (parser) for more information about these readers.
The HCatalog Connector performs predicate pushdown to improve query performance. Instead of reading all data across the network to evaluate a query, the HCatalog Connector moves the evaluation of predicates closer to the data. Predicate pushdown applies to Hive partition pruning, ORC stripe pruning, and Parquet row-group pruning. The HCatalog Connector supports predicate pushdown for the following predicates: >, >=, =, <>, <=, <.
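As a hedged illustration (the hcat.sales table and its columns are hypothetical), a query such as the following lets the connector prune Hive partitions and ORC stripes or Parquet row groups instead of scanning the whole table:
=> SELECT COUNT(*) FROM hcat.sales
   WHERE region = 'EMEA'      -- pushed down for partition pruning
     AND amount >= 1000;      -- pushed down for stripe/row-group pruning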
1 - Overview
There are several Hadoop components that you need to understand to use the HCatalog connector:
- Apache Hive lets you query data stored in a Hadoop Distributed File System (HDFS) the same way you query data stored in a relational database. Behind the scenes, Hive uses a set of serializer and deserializer (SerDe) classes to extract data from files stored in HDFS and break it into columns and rows. Each SerDe handles data files in a specific format. For example, one SerDe extracts data from comma-separated data files while another interprets data stored in JSON format.
- Apache HCatalog is a component of the Hadoop ecosystem that makes Hive's metadata available to other Hadoop components (such as Pig).
- HiveServer2 makes HCatalog and Hive data available via JDBC. Through it, a client can make requests to retrieve data stored in Hive, as well as information about the Hive schema. HiveServer2 can use authorization services (Sentry or Ranger). HiveServer2 can use Hive LLAP (Live Long And Process).
The Vertica HCatalog Connector lets you transparently access data that is available through HiveServer2. You use the connector to define a schema in Vertica that corresponds to a Hive database or schema. When you query data within this schema, the HCatalog Connector transparently extracts and formats the data from Hadoop into tabular data. Vertica supports authorization services and Hive LLAP.
Note
You can use the WebHCat service instead of HiveServer2, but performance is usually better with HiveServer2. Support for WebHCat is deprecated. To use WebHCat, set the HCatalogConnectorUseHiveServer2 configuration parameter to 0. See Apache Hadoop parameters. WebHCat does not support authorization services.
HCatalog connection features
The HCatalog Connector lets you query data stored in Hive using the Vertica native SQL syntax. Some of its main features are:
- The HCatalog Connector always reflects the current state of data stored in Hive.
- The HCatalog Connector uses the parallel nature of both Vertica and Hadoop to process Hive data. The result is that querying data through the HCatalog Connector is often faster than querying the data directly through Hive.
- Because Vertica performs the extraction and parsing of data, the HCatalog Connector does not significantly increase the load on your Hadoop cluster.
- The data you query through the HCatalog Connector can be used as if it were native Vertica data. For example, you can execute a query that joins data from a table in an HCatalog schema with a native table.
HCatalog Connector considerations
There are a few things to keep in mind when using the HCatalog Connector:
- Hive's data is stored in flat files in a distributed file system, requiring it to be read and deserialized each time it is queried. This deserialization causes Hive performance to be much slower than that of Vertica. The HCatalog Connector has to perform the same process as Hive to read the data. Therefore, querying data stored in Hive using the HCatalog Connector is much slower than querying a native Vertica table. If you need to perform extensive analysis on data stored in Hive, you should consider loading it into Vertica. Vertica optimization often makes querying data through the HCatalog Connector faster than directly querying it through Hive.
- If Hive uses Kerberos security, the HCatalog Connector uses the querying user's credentials in queries by default. If Hive uses Sentry or Ranger to enforce security, then you must either disable this behavior in Vertica by setting EnableHCatImpersonation to 0 or grant users access to the underlying data in HDFS. (Sentry supports ACL synchronization to automatically grant access.) Alternatively, you can specify delegation tokens for data and metadata access. See Configuring security.
- Hive supports complex data types such as lists, maps, and structs that Vertica does not support. Columns containing these data types are converted to a JSON representation of the data type and stored as a VARCHAR. See Data type conversions from Hive to Vertica.
Note
The HCatalog Connector is read-only. It cannot insert data into Hive.
2 - How the HCatalog Connector works
When planning a query that accesses data from a Hive table, the Vertica HCatalog Connector on the initiator node contacts HiveServer2 (or WebHCat) in your Hadoop cluster to determine if the table exists. If it does, the connector retrieves the table's metadata from the metastore database so the query planning can continue. When the query executes, all nodes in the Vertica cluster directly retrieve the data necessary for completing the query from HDFS. They then use the Hive SerDe classes to extract the data so the query can execute. When accessing data in ORC or Parquet format, the HCatalog Connector uses Vertica's internal readers for these formats instead of the Hive SerDe classes.
This approach takes advantage of the parallel nature of both Vertica and Hadoop. In addition, by performing the retrieval and extraction of data directly, the HCatalog Connector reduces the impact of the query on the Hadoop cluster.
For files in the Optimized Row Columnar (ORC) or Parquet format that do not use complex types, the HCatalog Connector creates an external table and uses the ORC or Parquet reader instead of using the Java SerDe. You can direct these readers to access custom Hive partition locations if Hive used them when writing the data. By default these extra checks are turned off to improve performance.
3 - HCatalog Connector requirements
Before you can use the HCatalog Connector, both your Vertica and Hadoop installations must meet the following requirements.
Vertica requirements
All of the nodes in your cluster must have a Java Virtual Machine (JVM) installed. You must use the same Java version that the Hadoop cluster uses. See Installing the Java Runtime on Your Vertica Cluster.
You must also add certain libraries distributed with Hadoop and Hive to your Vertica installation directory. See Configuring Vertica for HCatalog.
Hadoop requirements
Your Hadoop cluster must meet several requirements to operate correctly with the Vertica Connector for HCatalog:
- It must have Hive, HiveServer2, and HCatalog installed and running. See Apache's HCatalog page for more information.
- The HiveServer2 server and all of the HDFS nodes that store HCatalog data must be directly accessible from all of the hosts in your Vertica database. Verify that any firewall separating the Hadoop cluster and the Vertica cluster will pass HiveServer2, metastore database, and HDFS traffic.
- The data that you want to query must be in an internal or external Hive table.
- If a table you want to query uses a non-standard SerDe, you must install the SerDe's classes on your Vertica cluster before you can query the data. See Using nonstandard SerDes.
4 - Installing the Java runtime on your Vertica cluster
The HCatalog Connector requires a 64-bit Java Virtual Machine (JVM). The JVM must support Java 6 or later, and must be the same version as the one installed on your Hadoop nodes.
Note
If your Vertica cluster is configured to execute User Defined Extensions (UDxs) written in Java, it already has a correctly-configured JVM installed. See Developing user-defined extensions (UDxs) for more information.
Installing Java on your Vertica cluster is a two-step process:
- Install a Java runtime on all of the hosts in your cluster.
- Set the JavaBinaryForUDx configuration parameter to tell Vertica the location of the Java executable.
Installing a Java runtime
For Java-based features, Vertica requires a 64-bit Java 6 (Java version 1.6) or later Java runtime. Vertica supports runtimes from either Oracle or OpenJDK. You can choose to install either the Java Runtime Environment (JRE) or Java Development Kit (JDK), since the JDK also includes the JRE.
Many Linux distributions include a package for the OpenJDK runtime. See your Linux distribution's documentation for information about installing and configuring OpenJDK.
To install the Oracle Java runtime, see the Java Standard Edition (SE) Download Page. You usually run the installation package as root in order to install it. See the download page for instructions.
Once you have installed a JVM on each host, ensure that the java command is in the search path and calls the correct JVM by running the command:
$ java -version
This command should print something similar to:
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
Note
Any previously installed Java VM on your hosts may interfere with a newly installed Java runtime. See your Linux distribution's documentation for instructions on configuring which JVM is the default. Unless absolutely required, you should uninstall any incompatible version of Java before installing the Java 6 or Java 7 runtime.
Setting the JavaBinaryForUDx configuration parameter
The JavaBinaryForUDx configuration parameter tells Vertica where to look for the JRE to execute Java UDxs. After you have installed the JRE on all of the nodes in your cluster, set this parameter to the absolute path of the Java executable. You can use the symbolic link that some Java installers create (for example, /usr/bin/java). If the Java executable is in your shell search path, you can get the path of the Java executable by running the following command from the Linux command line shell:
$ which java
/usr/bin/java
If the java command is not in the shell search path, use the path to the Java executable in the directory where you installed the JRE. Suppose you installed the JRE in /usr/java/default (which is where the installation package supplied by Oracle installs the Java 1.6 JRE). In this case the Java executable is /usr/java/default/bin/java.
You set the configuration parameter by executing the following statement as a database superuser:
=> ALTER DATABASE DEFAULT SET PARAMETER JavaBinaryForUDx = '/usr/bin/java';
See ALTER DATABASE for more information on setting configuration parameters.
To view the current setting of the configuration parameter, query the CONFIGURATION_PARAMETERS system table:
=> \x
Expanded display is on.
=> SELECT * FROM CONFIGURATION_PARAMETERS WHERE parameter_name = 'JavaBinaryForUDx';
-[ RECORD 1 ]-----------------+----------------------------------------------------------
node_name | ALL
parameter_name | JavaBinaryForUDx
current_value | /usr/bin/java
default_value |
change_under_support_guidance | f
change_requires_restart | f
description | Path to the java binary for executing UDx written in Java
Once you have set the configuration parameter, Vertica can find the Java executable on each node in your cluster.
Note
Since the location of the Java executable is set by a single configuration parameter for the entire cluster, you must ensure that the Java executable is installed in the same path on all of the hosts in the cluster.
5 - Configuring Vertica for HCatalog
Before you can use the HCatalog Connector, you must add certain Hadoop and Hive libraries to your Vertica installation. You must also copy the Hadoop configuration files that specify various connection properties. Vertica uses the values in those configuration files to make its own connections to Hadoop.
You need only make these changes on one node in your cluster. After you do this you can install the HCatalog connector.
Copy Hadoop libraries and configuration files
Vertica provides a tool, hcatUtil, to collect the required files from Hadoop. This tool copies selected libraries and XML configuration files from your Hadoop cluster to your Vertica cluster. This tool might also need access to additional libraries:
- If you plan to use Hive to query files that use Snappy compression, you need access to the Snappy native libraries, libhadoop*.so and libsnappy*.so.
- If you plan to use Hive to query files that use LZO compression, you need access to the hadoop-lzo-*.jar and libgplcompression.so* libraries. In core-site.xml you must also edit the io.compression.codecs property to include com.hadoop.compression.lzo.LzopCodec.
- If you plan to use a JSON SerDe with a Hive table, you need access to its library. This is the same library that you used to configure Hive; for example:
hive> add jar /home/release/json-serde-1.3-jar-with-dependencies.jar;
hive> create external table nationjson (id int,name string,rank int,text string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/release/vt/nationjson';
- If you are using any other libraries that are not standard across all supported Hadoop versions, you need access to those libraries.
If any of these cases applies to you, do one of the following:
- Include the path(s) in the path you specify as the value of --hcatLibPath, or
- Copy the file(s) to a directory already on that path.
If Vertica is not co-located on a Hadoop node, you should do the following:
- Copy /opt/vertica/packages/hcat/tools/hcatUtil to a Hadoop node and run it there, specifying a temporary output directory. Your Hadoop, HIVE, and HCatalog lib paths might be different. In newer versions of Hadoop the HCatalog directory is usually a subdirectory under the HIVE directory, and Cloudera creates a new directory for each revision of the configuration files. Use the values from your environment in the following command:
hcatUtil --copyJars
--hadoopHiveHome="$HADOOP_HOME/lib;$HIVE_HOME/lib;/hcatalog/dist/share"
--hadoopHiveConfPath="$HADOOP_CONF_DIR;$HIVE_CONF_DIR;$WEBHCAT_CONF_DIR"
--hcatLibPath="/tmp/hadoop-files"
If you are using Hive LLAP, specify the hive2 directories.
- Verify that all necessary files were copied:
hcatUtil --verifyJars --hcatLibPath=/tmp/hadoop-files
- Copy that output directory (/tmp/hadoop-files, in this example) to /opt/vertica/packages/hcat/lib on the Vertica node you will connect to when installing the HCatalog connector. If you are updating a Vertica cluster to use a new Hadoop cluster (or a new version of Hadoop), first remove all JAR files in /opt/vertica/packages/hcat/lib except vertica-hcatalogudl.jar.
- Verify that all necessary files were copied:
hcatUtil --verifyJars --hcatLibPath=/opt/vertica/packages/hcat
If Vertica is co-located on some or all Hadoop nodes, you can do this in one step on a shared node. Your Hadoop, HIVE, and HCatalog lib paths might be different; use the values from your environment in the following command:
hcatUtil --copyJars
--hadoopHiveHome="$HADOOP_HOME/lib;$HIVE_HOME/lib;/hcatalog/dist/share"
--hadoopHiveConfPath="$HADOOP_CONF_DIR;$HIVE_CONF_DIR;$WEBHCAT_CONF_DIR"
--hcatLibPath="/opt/vertica/packages/hcat/lib"
The hcatUtil script has the following arguments:
-c, --copyJars
  Copy the required JAR files from hadoopHiveHome and configuration files from hadoopHiveConfPath.
-v, --verifyJars
  Verify that the required files are present in hcatLibPath. Check the output of hcatUtil for error and warning messages.
--hadoopHiveHome="value1;value2;..."
  Paths to the Hadoop, Hive, and HCatalog home directories. Separate directories by semicolons (;). Enclose paths in double quotes.
  Note: Always place $HADOOP_HOME on the path before $HIVE_HOME. In some Hadoop distributions, these two directories contain different versions of the same library.
--hadoopHiveConfPath="value1;value2;..."
  Paths of the Hadoop and Hive configuration files. Separate directories by semicolons (;). Enclose paths in double quotes.
  In previous releases of Vertica this parameter was optional under some conditions. It is now required.
--hcatLibPath="value"
  Output path for the libraries and configuration files. On a Vertica node, use /opt/vertica/packages/hcat/lib. If you have previously run hcatUtil with a different version of Hadoop, first remove the old JAR files from the output directory (all except vertica-hcatalogudl.jar).
After you have copied the files and verified them, install the HCatalog connector.
Install the HCatalog Connector
On the same node where you copied the files from hcatUtil, install the HCatalog connector by running the install.sql script. This script resides in the ddl/ folder under your HCatalog connector installation path. This script creates the library and VHCatSource and VHCatParser.
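For example, assuming the default installation path of /opt/vertica/packages/hcat (adjust the path and user to match your installation), you might run the script with vsql:
$ vsql -U dbadmin -f /opt/vertica/packages/hcat/ddl/install.sql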
Note
The data that was copied using hcatUtil is now stored in the database. If you change any of those values in Hadoop, you need to rerun hcatUtil and install.sql. The following statement returns the names of the libraries and configuration files currently being used:
=> SELECT dependencies FROM user_libraries WHERE lib_name='VHCatalogLib';
Now you can create HCatalog schema parameters, which point to your existing Hadoop services, as described in Defining a schema using the HCatalog Connector.
Upgrading to a new version of Vertica
After upgrading to a new version of Vertica, perform the following steps:
- Uninstall the HCatalog Connector using the uninstall.sql script. This script resides in the ddl/ folder under your HCatalog connector installation path.
- Delete the contents of the hcatLibPath directory except for vertica-hcatalogudl.jar.
- Rerun hcatUtil.
- Reinstall the HCatalog Connector using the install.sql script.
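A minimal sketch of these steps, assuming the default installation path under /opt/vertica/packages/hcat and that you rerun hcatUtil with the same arguments you used originally:
$ vsql -f /opt/vertica/packages/hcat/ddl/uninstall.sql
$ find /opt/vertica/packages/hcat/lib -type f ! -name vertica-hcatalogudl.jar -delete
$ /opt/vertica/packages/hcat/tools/hcatUtil --copyJars ...   # rerun with your original arguments
$ vsql -f /opt/vertica/packages/hcat/ddl/install.sql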
For more information about upgrading Vertica, see Upgrade Vertica.
When reading Hadoop columnar file formats (ORC or Parquet), the HCatalog Connector attempts to use the built-in readers. When doing so, it uses the hdfs scheme by default. In order to use the hdfs scheme, you must perform the configuration described in Configuring HDFS access.
To have the HCatalog Connector use the webhdfs scheme instead, use ALTER DATABASE to set HDFSUseWebHDFS to 1. This setting applies to all HDFS access, not just the HCatalog Connector.
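For example (following the same pattern as the other configuration parameters in this documentation):
=> ALTER DATABASE DEFAULT SET PARAMETER HDFSUseWebHDFS = 1;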
6 - Configuring security
You can use any of the security options described in Accessing kerberized HDFS data to access Hive data. This topic describes additional steps needed specifically for using the HCatalog Connector.
If you use Kerberos from Vertica, the HCatalog Connector can use an authorization service (Sentry or Ranger). If you use delegation tokens, you must manage authorization yourself.
Kerberos
You can use Kerberos from Vertica as described in Using Kerberos with Vertica.
How you configure the HCatalog Connector depends on how Hive manages authorization.
- If Hive uses Sentry to manage authorization, and if Sentry uses ACL synchronization, then the HCatalog Connector must access HDFS as the current user. Verify that the EnableHCatImpersonation configuration parameter is set to 1 (the default). ACL synchronization automatically provides authorized users with read access to the underlying HDFS files.
- If Hive uses Sentry without ACL synchronization, then the HCatalog Connector must access HDFS data as the Vertica principal. (The user still authenticates and accesses metadata normally.) Set the EnableHCatImpersonation configuration parameter to 0. The Vertica principal must have read access to the underlying HDFS files.
- If Hive uses Ranger to manage authorization, and the Vertica users have read access to the underlying HDFS files, then you can use user impersonation. Verify that the EnableHCatImpersonation configuration parameter is set to 1 (the default). You can, instead, disable user impersonation and give the Vertica principal read access to the HDFS files.
- If Hive uses either Sentry or Ranger, the HCatalog Connector must use HiveServer2 (the default). WebHCat does not support authorization services.
- If Hive does not use an authorization service, or if you are connecting to Hive using WebHCat instead of HiveServer2, then the HCatalog Connector accesses Hive as the current user. Verify that EnableHCatImpersonation is set to 1. All users must have read access to the underlying HDFS files.
In addition, in your Hadoop configuration files (core-site.xml in most distributions), make sure that you enable all Hadoop components to impersonate the Vertica user. The easiest way to do so is to set the proxyuser property using wildcards for all users on all hosts and in all groups. Consult your Hadoop documentation for instructions. Make sure you set this property before running hcatUtil (see Configuring Vertica for HCatalog).
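The following core-site.xml fragment is a hedged sketch of this setting; it assumes your Vertica database principal is named vertica, so adjust the property names to match your environment:
<property>
  <name>hadoop.proxyuser.vertica.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.vertica.groups</name>
  <value>*</value>
</property>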
Delegation tokens
You can use delegation tokens for a session as described in Bring your own delegation token. When using the HCatalog Connector you specify two delegation tokens, one for the data and one for the metadata. The metadata token is tied to a Hive schema. See HadoopImpersonationConfig format for information about how to specify these two delegation tokens.
Verifying security configuration
To verify that the HCatalog Connector can access Hive data, use the HCATALOGCONNECTOR_CONFIG_CHECK function.
For more information about testing your configuration, see Verifying HDFS configuration.
7 - Defining a schema using the HCatalog Connector
After you set up the HCatalog Connector, you can use it to define a schema in your Vertica database to access the tables in a Hive database. You define the schema using the CREATE HCATALOG SCHEMA statement.
When creating the schema, you must supply the name of the schema to define in Vertica. Other parameters are optional. If you do not supply a value, Vertica uses default values. Vertica reads some default values from the HDFS configuration files; see Configuration Parameters.
To create the schema, you must have read access to all Hive data. Verify that the user creating the schema has been granted access, either directly or through an authorization service such as Sentry or Ranger. The dbadmin user has no automatic special privileges.
After you create the schema, you can change many parameters using the ALTER HCATALOG SCHEMA statement.
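For example, the following statement is a hedged sketch of changing the HiveServer2 host for an existing schema (the host name is hypothetical; see ALTER HCATALOG SCHEMA for the full syntax):
=> ALTER HCATALOG SCHEMA hcat SET HIVESERVER2_HOSTNAME='hs2.example.com';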
After you define the schema, you can query the data in the Hive data warehouse in the same way you query a native Vertica table. The following example demonstrates creating an HCatalog schema and then querying several system tables to examine the contents of the new schema. See Viewing Hive schema and table metadata for more information about these tables.
=> CREATE HCATALOG SCHEMA hcat WITH HOSTNAME='hcathost' PORT=9083
HCATALOG_SCHEMA='default' HIVESERVER2_HOSTNAME='hs.example.com'
SSL_CONFIG='/etc/hadoop/conf/ssl-client.xml' HCATALOG_USER='admin';
CREATE SCHEMA
=> \x
Expanded display is on.
=> SELECT * FROM v_catalog.hcatalog_schemata;
-[ RECORD 1 ]----------------+-------------------------------------------
schema_id | 45035996273748224
schema_name | hcat
schema_owner_id | 45035996273704962
schema_owner | admin
create_time | 2017-12-05 14:43:03.353404-05
hostname | hcathost
port | -1
hiveserver2_hostname | hs.example.com
webservice_hostname |
webservice_port | 50111
webhdfs_address | hs.example.com:50070
hcatalog_schema_name | default
ssl_config | /etc/hadoop/conf/ssl-client.xml
hcatalog_user_name | admin
hcatalog_connection_timeout | -1
hcatalog_slow_transfer_limit | -1
hcatalog_slow_transfer_time | -1
custom_partitions | f
=> SELECT * FROM v_catalog.hcatalog_table_list;
-[ RECORD 1 ]------+------------------
table_schema_id | 45035996273748224
table_schema | hcat
hcatalog_schema | default
table_name | nation
hcatalog_user_name | admin
-[ RECORD 2 ]------+------------------
table_schema_id | 45035996273748224
table_schema | hcat
hcatalog_schema | default
table_name | raw
hcatalog_user_name | admin
-[ RECORD 3 ]------+------------------
table_schema_id | 45035996273748224
table_schema | hcat
hcatalog_schema | default
table_name | raw_rcfile
hcatalog_user_name | admin
-[ RECORD 4 ]------+------------------
table_schema_id | 45035996273748224
table_schema | hcat
hcatalog_schema | default
table_name | raw_sequence
hcatalog_user_name | admin
Configuration parameters
The HCatalog Connector uses the following values from the Hadoop configuration files if you do not override them when creating the schema.
hive-site.xml:
- hive.server2.thrift.bind.host (used for HIVESERVER2_HOSTNAME)
- hive.server2.thrift.port
- hive.server2.transport.mode
- hive.server2.authentication
- hive.server2.authentication.kerberos.principal
- hive.server2.support.dynamic.service.discovery
- hive.zookeeper.quorum (used as HIVESERVER2_HOSTNAME if dynamic service discovery is enabled)
- hive.zookeeper.client.port
- hive.server2.zookeeper.namespace
- hive.metastore.uris (used for HOSTNAME and PORT)

ssl-client.xml:
- ssl.client.truststore.location
- ssl.client.truststore.password
Using partitioned data
Hive supports partitioning data, as in the following example:
hive> create table users (name varchar(64), address string, city varchar(64))
partitioned by (state varchar(64)) stored as orc;
Vertica takes advantage of partitioning information for formats that provide it in metadata (ORC and Parquet). Queries can skip irrelevant partitions entirely (partition pruning), and Vertica does not need to materialize partition columns. For more about use of partitions, see Improving query performance and Using partition columns.
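For example, with the partitioned users table created above, a query that restricts the partition column lets Vertica skip every other partition (a hedged sketch; the schema name hcat is assumed from the earlier examples):
=> SELECT name, city FROM hcat.users WHERE state = 'MA';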
By default Hive stores partition information under the path for the table definition, which might be local to Hive. However, a Hive user can choose to store partition information elsewhere, such as in a shared location like S3, as in the following example:
hive> alter table users add partition (state='MA')
location 's3a://DataLake/partitions/users/state=MA';
During query execution, therefore, Vertica must query Hive for the location of a table's partition information.
Because the additional Hive queries can be expensive, Vertica defaults to looking for partition information only in Hive's default location. If your Hive table specified a custom location, use the CUSTOM_PARTITIONS parameter:
=> CREATE HCATALOG SCHEMA hcat WITH HOSTNAME='hcathost' PORT=9083
HCATALOG_SCHEMA='default' HIVESERVER2_HOSTNAME='hs.example.com'
SSL_CONFIG='/etc/hadoop/conf/ssl-client.xml' HCATALOG_USER='admin'
CUSTOM_PARTITIONS='yes';
Vertica can access partition locations in the S3 and S3a schemes, and does not support the S3n scheme.
The HIVE_CUSTOM_PARTITIONS_ACCESSED system table records all custom partition locations that have been used in queries.
Using the HCatalog Connector with WebHCat
By default the HCatalog Connector uses HiveServer2 to access Hive data. If you are instead using WebHCat, set the HCatalogConnectorUseHiveServer2 configuration parameter to 0 before creating the schema as in the following example.
=> ALTER DATABASE DEFAULT SET PARAMETER HCatalogConnectorUseHiveServer2 = 0;
=> CREATE HCATALOG SCHEMA hcat WITH WEBSERVICE_HOSTNAME='webhcat.example.com';
If you have previously used WebHCat, you can switch to using HiveServer2 by setting the configuration parameter to 1 and using ALTER HCATALOG SCHEMA to set HIVESERVER2_HOSTNAME. You do not need to remove the WebHCat values; the HCatalog Connector uses the value of HCatalogConnectorUseHiveServer2 to determine which parameters to use.
8 - Querying Hive tables using HCatalog Connector
Once you have defined the HCatalog schema, you can query data from the Hive database by using the schema name in your query.
=> SELECT * from hcat.messages limit 10;
messageid | userid | time | message
-----------+------------+---------------------+----------------------------------
1 | nPfQ1ayhi | 2013-10-29 00:10:43 | hymenaeos cursus lorem Suspendis
2 | N7svORIoZ | 2013-10-29 00:21:27 | Fusce ad sem vehicula morbi
3 | 4VvzN3d | 2013-10-29 00:32:11 | porta Vivamus condimentum
4 | heojkmTmc | 2013-10-29 00:42:55 | lectus quis imperdiet
5 | coROws3OF | 2013-10-29 00:53:39 | sit eleifend tempus a aliquam mauri
6 | oDRP1i | 2013-10-29 01:04:23 | risus facilisis sollicitudin sceler
7 | AU7a9Kp | 2013-10-29 01:15:07 | turpis vehicula tortor
8 | ZJWg185DkZ | 2013-10-29 01:25:51 | sapien adipiscing eget Aliquam tor
9 | E7ipAsYC3 | 2013-10-29 01:36:35 | varius Cum iaculis metus
10 | kStCv | 2013-10-29 01:47:19 | aliquam libero nascetur Cum mal
(10 rows)
Since the tables you access through the HCatalog Connector act like Vertica tables, you can perform operations that use both Hive data and native Vertica data, such as a join:
=> SELECT u.FirstName, u.LastName, d.time, d.Message from UserData u
-> JOIN hcat.messages d ON u.UserID = d.UserID LIMIT 10;
FirstName | LastName | time | Message
----------+----------+---------------------+-----------------------------------
Whitney | Kerr | 2013-10-29 00:10:43 | hymenaeos cursus lorem Suspendis
Troy | Oneal | 2013-10-29 00:32:11 | porta Vivamus condimentum
Renee | Coleman | 2013-10-29 00:42:55 | lectus quis imperdiet
Fay | Moss | 2013-10-29 00:53:39 | sit eleifend tempus a aliquam mauri
Dominique | Cabrera | 2013-10-29 01:15:07 | turpis vehicula tortor
Mohammad | Eaton | 2013-10-29 00:21:27 | Fusce ad sem vehicula morbi
Cade | Barr | 2013-10-29 01:25:51 | sapien adipiscing eget Aliquam tor
Oprah | Mcmillan | 2013-10-29 01:36:35 | varius Cum iaculis metus
Astra | Sherman | 2013-10-29 01:58:03 | dignissim odio Pellentesque primis
Chelsea | Malone | 2013-10-29 02:08:47 | pede tempor dignissim Sed luctus
(10 rows)
9 - Viewing Hive schema and table metadata
When using Hive, you access metadata about schemas and tables by executing statements written in HiveQL (Hive's version of SQL) such as SHOW TABLES. When using the HCatalog Connector, you can get metadata about the tables in the Hive database through several Vertica system tables.
There are four system tables that contain metadata about the tables accessible through the HCatalog Connector:
- HCATALOG_SCHEMATA lists all of the schemas that have been defined using the HCatalog Connector.
- HCATALOG_TABLE_LIST contains an overview of all of the tables available from all schemas defined using the HCatalog Connector. This table only shows the tables that the user querying the table can access. The information in this table is retrieved using a single call to HiveServer2 for each schema defined using the HCatalog Connector, which means there is a little overhead when querying this table.
- HCATALOG_TABLES contains more in-depth information than HCATALOG_TABLE_LIST.
- HCATALOG_COLUMNS lists metadata about all of the columns in all of the tables available through the HCatalog Connector. As with HCATALOG_TABLES, querying this table results in one call to HiveServer2 per table, and therefore can take a while to complete.
The following example demonstrates querying the system tables containing metadata for the tables available through the HCatalog Connector.
=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost'
-> HCATALOG_SCHEMA='default' HCATALOG_DB='default' HCATALOG_USER='hcatuser';
CREATE SCHEMA
=> SELECT * FROM HCATALOG_SCHEMATA;
-[ RECORD 1 ]--------+-----------------------------
schema_id | 45035996273864536
schema_name | hcat
schema_owner_id | 45035996273704962
schema_owner | dbadmin
create_time | 2013-11-05 10:19:54.70965-05
hostname | hcathost
port | 9083
webservice_hostname | hcathost
webservice_port | 50111
hcatalog_schema_name | default
hcatalog_user_name | hcatuser
metastore_db_name | hivemetastoredb
=> SELECT * FROM HCATALOG_TABLE_LIST;
-[ RECORD 1 ]------+------------------
table_schema_id | 45035996273864536
table_schema | hcat
hcatalog_schema | default
table_name | hcatalogtypes
hcatalog_user_name | hcatuser
-[ RECORD 2 ]------+------------------
table_schema_id | 45035996273864536
table_schema | hcat
hcatalog_schema | default
table_name | tweets
hcatalog_user_name | hcatuser
-[ RECORD 3 ]------+------------------
table_schema_id | 45035996273864536
table_schema | hcat
hcatalog_schema | default
table_name | messages
hcatalog_user_name | hcatuser
-[ RECORD 4 ]------+------------------
table_schema_id | 45035996273864536
table_schema | hcat
hcatalog_schema | default
table_name | msgjson
hcatalog_user_name | hcatuser
=> -- Get detailed description of a specific table
=> SELECT * FROM HCATALOG_TABLES WHERE table_name = 'msgjson';
-[ RECORD 1 ]---------+-----------------------------------------------------------
table_schema_id | 45035996273864536
table_schema | hcat
hcatalog_schema | default
table_name | msgjson
hcatalog_user_name | hcatuser
min_file_size_bytes |
total_number_files | 10
location | hdfs://hive.example.com:8020/user/exampleuser/msgjson
last_update_time |
output_format | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
last_access_time |
max_file_size_bytes |
is_partitioned | f
partition_expression |
table_owner |
input_format | org.apache.hadoop.mapred.TextInputFormat
total_file_size_bytes | 453534
hcatalog_group |
permission |
=> -- Get list of columns in a specific table
=> SELECT * FROM HCATALOG_COLUMNS WHERE table_name = 'hcatalogtypes'
-> ORDER BY ordinal_position;
-[ RECORD 1 ]------------+-----------------
table_schema | hcat
hcatalog_schema | default
table_name | hcatalogtypes
is_partition_column | f
column_name | intcol
hcatalog_data_type | int
data_type | int
data_type_id | 6
data_type_length | 8
character_maximum_length |
numeric_precision |
numeric_scale |
datetime_precision |
interval_precision |
ordinal_position | 1
-[ RECORD 2 ]------------+-----------------
table_schema | hcat
hcatalog_schema | default
table_name | hcatalogtypes
is_partition_column | f
column_name | floatcol
hcatalog_data_type | float
data_type | float
data_type_id | 7
data_type_length | 8
character_maximum_length |
numeric_precision |
numeric_scale |
datetime_precision |
interval_precision |
ordinal_position | 2
-[ RECORD 3 ]------------+-----------------
table_schema | hcat
hcatalog_schema | default
table_name | hcatalogtypes
is_partition_column | f
column_name | doublecol
hcatalog_data_type | double
data_type | float
data_type_id | 7
data_type_length | 8
character_maximum_length |
numeric_precision |
numeric_scale |
datetime_precision |
interval_precision |
ordinal_position | 3
-[ RECORD 4 ]------------+-----------------
table_schema | hcat
hcatalog_schema | default
table_name | hcatalogtypes
is_partition_column | f
column_name | charcol
hcatalog_data_type | string
data_type | varchar(65000)
data_type_id | 9
data_type_length | 65000
character_maximum_length | 65000
numeric_precision |
numeric_scale |
datetime_precision |
interval_precision |
ordinal_position | 4
-[ RECORD 5 ]------------+-----------------
table_schema | hcat
hcatalog_schema | default
table_name | hcatalogtypes
is_partition_column | f
column_name | varcharcol
hcatalog_data_type | string
data_type | varchar(65000)
data_type_id | 9
data_type_length | 65000
character_maximum_length | 65000
numeric_precision |
numeric_scale |
datetime_precision |
interval_precision |
ordinal_position | 5
-[ RECORD 6 ]------------+-----------------
table_schema | hcat
hcatalog_schema | default
table_name | hcatalogtypes
is_partition_column | f
column_name | boolcol
hcatalog_data_type | boolean
data_type | boolean
data_type_id | 5
data_type_length | 1
character_maximum_length |
numeric_precision |
numeric_scale |
datetime_precision |
interval_precision |
ordinal_position | 6
-[ RECORD 7 ]------------+-----------------
table_schema | hcat
hcatalog_schema | default
table_name | hcatalogtypes
is_partition_column | f
column_name | timestampcol
hcatalog_data_type | string
data_type | varchar(65000)
data_type_id | 9
data_type_length | 65000
character_maximum_length | 65000
numeric_precision |
numeric_scale |
datetime_precision |
interval_precision |
ordinal_position | 7
-[ RECORD 8 ]------------+-----------------
table_schema | hcat
hcatalog_schema | default
table_name | hcatalogtypes
is_partition_column | f
column_name | varbincol
hcatalog_data_type | binary
data_type | varbinary(65000)
data_type_id | 17
data_type_length | 65000
character_maximum_length | 65000
numeric_precision |
numeric_scale |
datetime_precision |
interval_precision |
ordinal_position | 8
-[ RECORD 9 ]------------+-----------------
table_schema | hcat
hcatalog_schema | default
table_name | hcatalogtypes
is_partition_column | f
column_name | bincol
hcatalog_data_type | binary
data_type | varbinary(65000)
data_type_id | 17
data_type_length | 65000
character_maximum_length | 65000
numeric_precision |
numeric_scale |
datetime_precision |
interval_precision |
ordinal_position | 9
10 - Synchronizing an HCatalog schema or table with a local schema or table
Querying data from an HCatalog schema can be slow due to Hive performance issues. This slow performance can be especially annoying when you want to examine the structure of the tables in the Hive database. Getting this information from Hive requires you to query the HCatalog schema's metadata using the HCatalog Connector.
To avoid this performance problem you can use the SYNC_WITH_HCATALOG_SCHEMA function to create a snapshot of the HCatalog schema's metadata within a Vertica schema. You supply this function with the name of a pre-existing Vertica schema, typically the one created through CREATE HCATALOG SCHEMA, and a Hive schema available through the HCatalog Connector. You must have permission both in Vertica to write the data and in Hive and HDFS to read it.
Note
To synchronize a schema, you must have read permission for the underlying files in HDFS. If Hive uses Sentry to manage authorization, then you can use ACL synchronization to manage HDFS access. Otherwise, the user of this function must have read access in HDFS.
The function creates a set of external tables within the Vertica schema that you can then use to examine the structure of the tables in the Hive database. Because the metadata in the Vertica schema is local, query planning is much faster. You can also use standard Vertica statements and system-table queries to examine the structure of Hive tables in the HCatalog schema.
Caution
The SYNC_WITH_HCATALOG_SCHEMA function overwrites tables in the Vertica schema whose names match a table in the HCatalog schema. Do not use the Vertica schema to store other data.
When SYNC_WITH_HCATALOG_SCHEMA creates tables in Vertica, it matches Hive's STRING and BINARY types to Vertica's VARCHAR(65000) and VARBINARY(65000) types. You might want to change these lengths, using ALTER TABLE SET DATA TYPE, in two cases:
- If the value in Hive is larger than 65000 bytes, increase the size and use LONG VARCHAR or LONG VARBINARY to avoid data truncation. If a Hive string uses multi-byte encodings, you must increase the size in Vertica to avoid data truncation. This step is needed because Hive counts string length in characters while Vertica counts it in bytes. (See the example after this list.)
- If the value in Hive is much smaller than 65000 bytes, reduce the size to conserve memory in Vertica.
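A hedged sketch of the first case, assuming a synchronized table hcat.messages with a message column that needs more than 65000 bytes (the table, column, and length shown are illustrative):
=> ALTER TABLE hcat.messages ALTER COLUMN message SET DATA TYPE LONG VARCHAR(1000000);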
The Vertica schema is just a snapshot of the HCatalog schema's metadata. Vertica does not synchronize later changes to the HCatalog schema with the local schema after you call SYNC_WITH_HCATALOG_SCHEMA. You can call the function again to re-synchronize the local schema to the HCatalog schema. If you altered column data types, you will need to repeat those changes because the function creates new external tables.
By default, SYNC_WITH_HCATALOG_SCHEMA does not drop tables that appear in the local schema that do not appear in the HCatalog schema. Thus, after the function call the local schema does not reflect tables that have been dropped in the Hive database since the previous call. You can change this behavior by supplying the optional third Boolean argument that tells the function to drop any table in the local schema that does not correspond to a table in the HCatalog schema.
Instead of synchronizing the entire schema, you can synchronize individual tables by using SYNC_WITH_HCATALOG_SCHEMA_TABLE. If the table already exists in Vertica the function overwrites it. If the table is not found in the HCatalog schema, this function returns an error. In all other respects this function behaves in the same way as SYNC_WITH_HCATALOG_SCHEMA.
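For example, the following call is a hedged sketch of synchronizing a single table (it assumes the hcat schema and a Hive table named messages, as in the examples below):
=> SELECT sync_with_hcatalog_schema_table('hcat', 'hcat', 'messages');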
If you change the settings of any HCatalog Connector configuration parameters (Apache Hadoop parameters), you must call this function again.
Examples
The following example demonstrates calling SYNC_WITH_HCATALOG_SCHEMA to synchronize the HCatalog schema in Vertica with the metadata in Hive. Because it synchronizes the HCatalog schema directly, instead of synchronizing another schema with the HCatalog schema, both arguments are the same.
=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost' HCATALOG_SCHEMA='default'
HCATALOG_USER='hcatuser';
CREATE SCHEMA
=> SELECT sync_with_hcatalog_schema('hcat', 'hcat');
sync_with_hcatalog_schema
----------------------------------------
Schema hcat synchronized with hcat
tables in hcat = 56
tables altered in hcat = 0
tables created in hcat = 56
stale tables in hcat = 0
table changes erred in hcat = 0
(1 row)
=> -- Use vsql's \d command to describe a table in the synced schema
=> \d hcat.messages
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
-----------+----------+---------+----------------+-------+---------+----------+-------------+-------------
hcat | messages | id | int | 8 | | f | f |
hcat | messages | userid | varchar(65000) | 65000 | | f | f |
hcat | messages | "time" | varchar(65000) | 65000 | | f | f |
hcat | messages | message | varchar(65000) | 65000 | | f | f |
(4 rows)
This example shows synchronizing with a schema created using CREATE HCATALOG SCHEMA. Synchronizing with a schema created using CREATE SCHEMA is also supported.
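A hedged sketch of the second case, keeping the snapshot in a separate local schema (the schema name hcat_local is hypothetical):
=> CREATE SCHEMA hcat_local;
CREATE SCHEMA
=> SELECT sync_with_hcatalog_schema('hcat_local', 'hcat');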
You can query tables in the local schema that you synchronized with an HCatalog schema. However, querying tables in a synchronized schema isn't much faster than directly querying the HCatalog schema, because SYNC_WITH_HCATALOG_SCHEMA only duplicates the HCatalog schema's metadata. The data in the table is still retrieved using the HCatalog Connector.
11 - Data type conversions from Hive to Vertica
The data types recognized by Hive differ from the data types recognized by Vertica. The following table lists how the HCatalog Connector converts Hive data types into data types compatible with Vertica.
Hive Data Type                 | Vertica Data Type
TINYINT (1-byte)               | TINYINT (8-bytes)
SMALLINT (2-bytes)             | SMALLINT (8-bytes)
INT (4-bytes)                  | INT (8-bytes)
BIGINT (8-bytes)               | BIGINT (8-bytes)
BOOLEAN                        | BOOLEAN
FLOAT (4-bytes)                | FLOAT (8-bytes)
DECIMAL (precision, scale)     | DECIMAL (precision, scale)
DOUBLE (8-bytes)               | DOUBLE PRECISION (8-bytes)
CHAR (length in characters)    | CHAR (length in bytes)
VARCHAR (length in characters) | VARCHAR (length in bytes) if length <= 65000; LONG VARCHAR (length in bytes) if length > 65000
STRING (2 GB max)              | VARCHAR (65000)
BINARY (2 GB max)              | VARBINARY (65000)
DATE                           | DATE
TIMESTAMP                      | TIMESTAMP
LIST/ARRAY                     | VARCHAR (65000) containing a JSON-format representation of the list
MAP                            | VARCHAR (65000) containing a JSON-format representation of the map
STRUCT                         | VARCHAR (65000) containing a JSON-format representation of the struct
Data-width handling differences between Hive and Vertica
The HCatalog Connector relies on Hive SerDe classes to extract data from files on HDFS. Therefore, the data read from these files is subject to Hive's data width restrictions. For example, suppose the SerDe parses a value for an INT column into a value that is greater than 2^31-1 (the maximum value for a signed 32-bit integer). In this case, the value is rejected even though it would fit into a Vertica 64-bit INTEGER column, because it cannot fit into Hive's 32-bit INT.
Hive measures CHAR and VARCHAR length in characters and Vertica measures them in bytes. Therefore, if multi-byte encodings are being used (like Unicode), text might be truncated in Vertica.
Once the value has been parsed and converted to a Vertica data type, it is treated as native data. This treatment can result in some confusion when comparing the results of an identical query run in Hive and in Vertica. For example, if your query adds two INT values and the result is larger than 2^31-1, the value overflows its 32-bit INT data type, causing Hive to return an error. When running the same query with the same data in Vertica using the HCatalog Connector, the values are stored as 64-bit integers, so the addition succeeds and returns a value.
12 - Using nonstandard SerDes
Hive stores its data in unstructured flat files located in the Hadoop Distributed File System (HDFS). When you execute a Hive query, it uses a set of serializer and deserializer (SerDe) classes to extract data from these flat files and organize it into a relational database table. For Hive to be able to extract data from a file, it must have a SerDe that can parse the data the file contains. When you create a table in Hive, you can select the SerDe to be used for the table's data.
Hive has a set of standard SerDes that handle data in several formats such as delimited data and data extracted using regular expressions. You can also use third-party or custom-defined SerDes that allow Hive to process data stored in other file formats. For example, some commonly-used third-party SerDes handle data stored in JSON format.
The HCatalog Connector directly fetches file segments from HDFS and uses Hive's SerDes classes to extract data from them. The Connector includes all Hive's standard SerDes classes, so it can process data stored in any file that Hive natively supports. If you want to query data from a Hive table that uses a custom SerDe, you must first install the SerDe classes on the Vertica cluster.
Determining which SerDe you need
If you have access to the Hive command line, you can determine which SerDe a table uses by using Hive's SHOW CREATE TABLE statement. This statement shows the HiveQL statement needed to recreate the table. For example:
hive> SHOW CREATE TABLE msgjson;
OK
CREATE EXTERNAL TABLE msgjson(
messageid int COMMENT 'from deserializer',
userid string COMMENT 'from deserializer',
time string COMMENT 'from deserializer',
message string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://hivehost.example.com:8020/user/exampleuser/msgjson'
TBLPROPERTIES (
'transient_lastDdlTime'='1384194521')
Time taken: 0.167 seconds
In the example, ROW FORMAT SERDE indicates that a special SerDe is used to parse the data files. The next row shows that the class for the SerDe is named org.apache.hadoop.hive.contrib.serde2.JsonSerde. You must provide the HCatalog Connector with a copy of this SerDe class so that it can read the data from this table.
You can also find out which SerDe class you need by querying the table that uses the custom SerDe. The query will fail with an error message that contains the class name of the SerDe needed to parse the data in the table. In the following example, the error message names the missing SerDe class, com.cloudera.hive.serde.JSONSerDe.
=> SELECT * FROM hcat.jsontable;
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined
Object [VHCatSource], error code: 0
com.vertica.sdk.UdfException: Error message is [
org.apache.hcatalog.common.HCatException : 2004 : HCatOutputFormat not
initialized, setOutput has to be called. Cause : java.io.IOException:
java.lang.RuntimeException:
MetaException(message:org.apache.hadoop.hive.serde2.SerDeException
SerDe com.cloudera.hive.serde.JSONSerDe does not exist) ] HINT If error
message is not descriptive or local, may be we cannot read metadata from hive
metastore service thrift://hcathost:9083 or HDFS namenode (check
UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information)
at com.vertica.hcatalogudl.HCatalogSplitsNoOpSourceFactory
.plan(HCatalogSplitsNoOpSourceFactory.java:98)
at com.vertica.udxfence.UDxExecContext.planUDSource(UDxExecContext.java:898)
. . .
Installing the SerDe on the Vertica cluster
You usually have two options for getting the SerDe class file the HCatalog Connector needs:
- Find the installation files for the SerDe, then copy those over to your Vertica cluster. For example, there are several third-party JSON SerDes available from sites like Google Code and GitHub. You may find the one that matches the file installed on your Hive cluster. If so, then download the package and copy it to your Vertica cluster.
- Directly copy the JAR files from a Hive server onto your Vertica cluster. The location for the SerDe JAR files depends on your Hive installation. On some systems, they may be located in /usr/lib/hive/lib.
Wherever you get the files, copy them into the /opt/vertica/packages/hcat/lib directory on every node in your Vertica cluster.
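A minimal sketch of distributing a SerDe JAR, assuming hypothetical node names and the JSON SerDe file from the earlier example:
$ for host in node01 node02 node03; do
>   scp json-serde-1.3-jar-with-dependencies.jar dbadmin@$host:/opt/vertica/packages/hcat/lib/
> done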
Important
If you add a new host to your Vertica cluster, remember to copy every custom SerDe JAR file to it.
13 - Troubleshooting HCatalog Connector problems
You may encounter the following issues when using the HCatalog Connector.
Connection errors
The HCatalog Connector can encounter errors both when you define a schema and when you query it. The types of errors you get depend on which CREATE HCATALOG SCHEMA parameters are incorrect. Suppose you have incorrect parameters for the metastore database, but correct parameters for HiveServer2. In this case, HCatalog-related system table queries succeed, while queries on the HCatalog schema fail. The following example demonstrates creating an HCatalog schema with the correct default HiveServer2 information. However, the port number for the metastore database is incorrect.
=> CREATE HCATALOG SCHEMA hcat2 WITH hostname='hcathost'
-> HCATALOG_SCHEMA='default' HCATALOG_USER='hive' PORT=1234;
CREATE SCHEMA
=> SELECT * FROM HCATALOG_TABLE_LIST;
-[ RECORD 1 ]------+---------------------
table_schema_id | 45035996273864536
table_schema | hcat2
hcatalog_schema | default
table_name | test
hcatalog_user_name | hive
=> SELECT * FROM hcat2.test;
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined
Object [VHCatSource], error code: 0
com.vertica.sdk.UdfException: Error message is [
org.apache.hcatalog.common.HCatException : 2004 : HCatOutputFormat not
initialized, setOutput has to be called. Cause : java.io.IOException:
MetaException(message:Could not connect to meta store using any of the URIs
provided. Most recent failure: org.apache.thrift.transport.TTransportException:
java.net.ConnectException:
Connection refused
at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(
HiveMetaStoreClient.java:277)
. . .
To resolve these issues, you must drop and recreate the schema or alter the schema to correct the parameters. If you still have issues, determine whether there are connectivity issues between your Vertica cluster and your Hadoop cluster. Such issues can include a firewall that prevents one or more Vertica hosts from contacting the HiveServer2, metastore, or HDFS hosts.
UDx failure when querying data: error 3399
You might see an error message when querying data (as opposed to metadata like schema information). This might be accompanied by a ClassNotFoundException in the log. This can happen for the following reasons:
- You are not using the same version of Java on your Hadoop and Vertica nodes. In this case you need to change one of them to match the other.
- You have not used hcatUtil to copy all Hadoop and Hive libraries and configuration files to Vertica, or you ran hcatUtil and then changed your version of Hadoop or Hive.
- You upgraded Vertica to a new version and did not rerun hcatUtil and reinstall the HCatalog Connector.
- The version of Hadoop you are using relies on a third-party library that you must copy manually.
- You are reading files with LZO compression and have not copied the libraries or set the io.compression.codecs property in core-site.xml.
- You are reading Parquet data from Hive, and columns were added to the table after some data was already present in the table. Adding columns does not update existing data, and the ParquetSerDe provided by Hive and used by the HCatalog Connector does not handle this case. This error is due to a limitation in Hive and there is no workaround.
- The query is taking too long and is timing out. If this is a frequent problem, you can increase the value of the UDxFencedBlockTimeout configuration parameter (see the example below). See General parameters.
If you did not copy the libraries or configure LZO compression, follow the instructions in Configuring Vertica for HCatalog.
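For the timeout case in the list above, a hedged sketch of raising the timeout (the value shown is illustrative; choose one appropriate for your queries):
=> ALTER DATABASE DEFAULT SET PARAMETER UDxFencedBlockTimeout = 600;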
If the Hive jars that you copied from Hadoop are out of date, you might see an error message like the following:
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource],
error code: 0 Error message is [ Found interface org.apache.hadoop.mapreduce.JobContext, but
class was expected ]
HINT hive metastore service is thrift://localhost:13433 (check UDxLogs/UDxFencedProcessesJava.log
in the catalog directory for more information)
This error usually signals a problem with the hive-hcatalog-core jar. Make sure you have an up-to-date copy of this file. Remember that if you rerun hcatUtil you also need to re-create the HCatalog schema.
You might also see a different form of this error:
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource],
error code: 0 Error message is [ javax/servlet/Filter ]
This error can be reported even if hcatUtil reports that your libraries are up to date. The javax.servlet.Filter class is in a library that some versions of Hadoop use but that is not usually part of the Hadoop installation directly. If you see an error mentioning this class, locate servlet-api-*.jar on a Hadoop node and copy it to the hcat/lib directory on all database nodes. If you cannot locate it on a Hadoop node, locate and download it from the Internet. (This case is rare.) The library version must be 2.3 or higher.
After you have copied the jar to the hcat/lib directory, reinstall the HCatalog connector as explained in Configuring Vertica for HCatalog.
Authentication error when querying data
You might have successfully created a schema using CREATE HCATALOG SCHEMA but get errors at query time such as the following:
=> SELECT * FROM hcat.clickdata;
ERROR 6776: Failed to glob [hdfs:///user/hive/warehouse/click-*.parquet]
because of error: hdfs:///user/hive/warehouse/click-12214.parquet: statAll failed;
error: AuthenticationFailed
You might see this error if Hive uses an authorization service. If the permissions on the underlying files in HDFS match those in the authorization service, then Vertica must use user impersonation when accessing that data. To enable user impersonation, set the EnableHCatImpersonation configuration parameter to 1.
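For example, the following minimal sketch enables impersonation at the database level and then confirms the new value; it assumes your Vertica version supports the ALTER DATABASE ... SET PARAMETER syntax.
-- Run HCatalog queries as the querying Vertica user instead of the database principal
=> ALTER DATABASE DEFAULT SET PARAMETER EnableHCatImpersonation = 1;
-- Confirm the change
=> SELECT parameter_name, current_value
   FROM configuration_parameters
   WHERE parameter_name = 'EnableHCatImpersonation';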
Vertica uses the database principal to access HDFS. Therefore, if EnableHCatImpersonation is 0, the Vertica database principal must have access to the data inside the hive warehouse on HDFS. If it does not, you might see the following error:
=> SELECT * FROM hcat.salesdata;
ERROR 6776: Failed to glob [vertica-hdfs:///hive/warehouse/sales/*/*]
because of error: vertica-hdfs:///hive/warehouse/sales/: listStat failed;
error: Permission denied: user=vertica, access=EXECUTE, inode="/hive/warehouse/sales":hdfs:hdfs:d---------
The URL scheme in this error message has been changed from hdfs to vertica-hdfs. This is an internal scheme and is not valid in URLs outside of Vertica. You cannot use this scheme when specifying paths in HDFS.
Differing results between Hive and Vertica queries
Sometimes, running the same query on Hive and on Vertica through the HCatalog Connector can return different results. There are a few common causes of this problem.
This discrepancy is often caused by the differences between the data types supported by Hive and Vertica. See Data type conversions from Hive to Vertica for more information about supported data types.
If Hive string values are being truncated in Vertica, this might be caused by multi-byte character encodings in Hive. Hive reports string length in characters, while Vertica records it in bytes. For a two-byte encoding, double the column size in Vertica to avoid truncation.
Discrepancies can also occur if the Hive table uses partition columns of types other than string.
If the Hive table stores partition data in custom locations instead of the default, you will see different query results if you do not specify the CUSTOM_PARTITIONS parameter when creating the Vertica schema. See Using Partitioned Data. An example of creating a schema with this parameter appears after the partition example below. Further, even when using custom partitions, there are differences between Vertica and Hive:
-
If a custom partition is unavailable at query time, Hive ignores it and returns the rest of the results. Vertica reports an error.
-
If a partition is altered in Hive to use a custom location after table creation, Hive reads data from only the new location while Vertica reads data from both the default and the new locations.
-
If a partition value and its corresponding directory name disagree, Hive uses the value in its metadata while Vertica uses the value in the directory name.
The following example illustrates these differences. A table in Hive, partitioned on the state column, starts with the following data:
hive> select * from inventory;
+--------------+-----------------+-----------------+--+
| inventory.ID | inventory.quant | inventory.state |
+--------------+-----------------+-----------------+--+
| 2 | 300 | CA |
| 4 | 400 | CA |
| 5 | 500 | DC |
| 1 | 100 | PA |
| 2 | 200 | PA |
+--------------+-----------------+-----------------+--+
The partition for one state, CA, is then moved to a directory that already contains some data. Note that the query now returns different results for the CA partition.
hive> ALTER TABLE inventory PARTITION (state='CA') SET LOCATION 'hdfs:///partitions/inventory/state=MD';
hive> select * from inventory;
+--------------+-----------------+-----------------+--+
| inventory.ID | inventory.quant | inventory.state |
+--------------+-----------------+-----------------+--+
| 20 | 30000 | CA |
| 40 | 40000 | CA |
| 5 | 500 | DC |
| 1 | 100 | PA |
| 2 | 200 | PA |
+--------------+-----------------+-----------------+--+
5 rows selected (0.399 seconds)
The CA partitions were moved to a directory named state=MD. Vertica reports the state for the first two rows, the ones in the new partition location, as MD because of the directory name. It also reports the CA values from the original location in addition to the new one:
=> SELECT * FROM hcat.inventory;
ID | quant | state
----+----------+------
20 | 30000 | MD
40 | 40000 | MD
2 | 300 | CA
4 | 400 | CA
1 | 100 | PA
2 | 200 | PA
5 | 500 | DC
(7 rows)
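To have Vertica recognize partitions in custom locations at all, include the CUSTOM_PARTITIONS parameter when you create the schema. The following is a minimal sketch; it assumes that the connection details come from the copied Hive configuration files and that your version accepts 'yes' as the parameter value (see CREATE HCATALOG SCHEMA for the exact syntax). Even with this parameter set, expect the behavioral differences listed above when a partition is moved or renamed.
-- Re-create the schema with custom partition support enabled
=> DROP SCHEMA hcat CASCADE;
=> CREATE HCATALOG SCHEMA hcat WITH HCATALOG_SCHEMA='default' CUSTOM_PARTITIONS='yes';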
HCatalog Connector installation fails on MapR
If you mount a MapR file system as an NFS mount point and then install the HCatalog Connector, it could fail with a message like the following:
ROLLBACK 2929: Couldn't create new UDx side process,
failed to get UDx side process info from zygote: Broken pipe
This might be accompanied by an error like the following in dbLog:
java.io.IOException: Couldn't get lock for /home/dbadmin/node02_catalog/UDxLogs/UDxFencedProcessesJava.log
at java.util.logging.FileHandler.openFiles(FileHandler.java:389)
at java.util.logging.FileHandler.<init>(FileHandler.java:287)
at com.vertica.udxfence.UDxLogger.setup(UDxLogger.java:78)
at com.vertica.udxfence.UDxSideProcess.go(UDxSideProcess.java:75)
...
This error occurs if you locked your NFS mount point when creating it. Locking is the default. If you use the HCatalog Connector with MapR mounted as an NFS mount point, you must create the mount point with the -o nolock option. For example:
sudo mount -o nolock -t nfs MaprCLDBserviceHostname:/mapr/ClusterName/vertica/$(hostname -f)/ vertica
You can use the HCatalog Connector with MapR without mounting the MapR file system. If you mount the MapR file system, you must do so without a lock.
Excessive query delays
Network issues or high system loads on the HiveServer2 server can cause long delays while querying a Hive database using the HCatalog Connector. While Vertica cannot resolve these issues, you can set parameters that limit how long Vertica waits before canceling a query on an HCatalog schema. You can set these parameters globally using Vertica configuration parameters. You can also set them for specific HCatalog schemas in the CREATE HCATALOG SCHEMA statement. These specific settings override the settings in the configuration parameters.
The HCatConnectionTimeout configuration parameter and the CREATE HCATALOG SCHEMA statement's HCATALOG_CONNECTION_TIMEOUT parameter control how many seconds the HCatalog Connector waits for a connection to the HiveServer2 server. A value of 0 (the default setting for the configuration parameter) means to wait indefinitely. If the server does not respond by the time this timeout elapses, the HCatalog Connector breaks the connection and cancels the query. If you find that some queries on an HCatalog schema pause excessively, try setting this parameter to a timeout value, so the query does not hang indefinitely.
The HCatSlowTransferTime configuration parameter and the CREATE HCATALOG SCHEMA statement's HCATALOG_SLOW_TRANSFER_TIME parameter specify how long the HCatalog Connector waits for data after making a successful connection to the server. After the specified time has elapsed, the HCatalog Connector determines whether the data transfer rate from the server is at least the value set in the HCatSlowTransferLimit configuration parameter (or by the CREATE HCATALOG SCHEMA statement's HCATALOG_SLOW_TRANSFER_LIMIT parameter). If it is not, then the HCatalog Connector terminates the connection and cancels the query.
You can set these parameters to cancel queries that run very slowly but do eventually complete. However, query delays are usually caused by a slow connection rather than a problem establishing the connection. Therefore, try adjusting the slow transfer rate settings first. If you find that the cause of the issue is connections that never complete, you can alternatively adjust the Linux TCP socket timeouts to a suitable value instead of relying solely on the HCatConnectionTimeout parameter.
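The following minimal sketch shows both approaches. The numeric values are illustrative only, the schema name hcat is an assumption, and the ALTER DATABASE ... SET PARAMETER syntax depends on your Vertica version.
-- Global defaults: cap connection waits and slow transfers for all HCatalog schemas
=> ALTER DATABASE DEFAULT SET PARAMETER HCatConnectionTimeout = 60;
=> ALTER DATABASE DEFAULT SET PARAMETER HCatSlowTransferTime = 60;
=> ALTER DATABASE DEFAULT SET PARAMETER HCatSlowTransferLimit = 1048576;
-- Per-schema overrides, set when the schema is created
=> CREATE HCATALOG SCHEMA hcat WITH HCATALOG_SCHEMA='default'
   HCATALOG_CONNECTION_TIMEOUT=60 HCATALOG_SLOW_TRANSFER_TIME=60
   HCATALOG_SLOW_TRANSFER_LIMIT=1048576;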
SerDe errors
Errors can occur if you attempt to query a Hive table that uses a nonstandard SerDe. If you have not installed the SerDe JAR files on your Vertica cluster, you receive an error similar to the one in the following example:
=> SELECT * FROM hcat.jsontable;
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined
Object [VHCatSource], error code: 0
com.vertica.sdk.UdfException: Error message is [
org.apache.hcatalog.common.HCatException : 2004 : HCatOutputFormat not
initialized, setOutput has to be called. Cause : java.io.IOException:
java.lang.RuntimeException:
MetaException(message:org.apache.hadoop.hive.serde2.SerDeException
SerDe com.cloudera.hive.serde.JSONSerDe does not exist) ] HINT If error
message is not descriptive or local, may be we cannot read metadata from hive
metastore service thrift://hcathost:9083 or HDFS namenode (check
UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information)
at com.vertica.hcatalogudl.HCatalogSplitsNoOpSourceFactory
.plan(HCatalogSplitsNoOpSourceFactory.java:98)
at com.vertica.udxfence.UDxExecContext.planUDSource(UDxExecContext.java:898)
. . .
In the error message, you can see that the root cause is a missing SerDe class (in this example, com.cloudera.hive.serde.JSONSerDe). To resolve this issue, install the SerDe class on your Vertica cluster. See Using nonstandard SerDes for more information.
This error may occur intermittently if just one or a few hosts in your cluster do not have the SerDe class.