Troubleshooting reads from HDFS
You might encounter the following issues when accessing data in HDFS.
Queries using [web]hdfs:/// show unexpected results
If you are using the /// shorthand to query external tables and see unexpected results, such as production data in your test cluster, verify that HadoopConfDir is set to the value you expect. The HadoopConfDir configuration parameter defines a path to search for the Hadoop configuration files that Vertica needs to resolve file locations. The HadoopConfDir parameter can be set at the session level, overriding the permanent value set in the database.
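One way to check which value is in effect is sketched below. It assumes you can run statements in the affected session. The final statement is optional and only applies if a session-level override turns out to be the cause.

```sql
-- Show the HadoopConfDir value currently in effect and the level it is set at.
SHOW CURRENT HadoopConfDir;

-- Check that the Hadoop configuration found through that path is valid
-- on the Vertica nodes.
SELECT VERIFY_HADOOP_CONF_DIR();

-- Optional: remove a session-level override so that the session falls back
-- to the value set in the database.
ALTER SESSION CLEAR HadoopConfDir;
```

If the reported level is the session rather than the database, the override is the most likely reason that the shorthand resolves to a different cluster than you expect.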
To debug problems with /// URLs, try replacing the URLs with ones that use an explicit nameservice or name node. If the explicit URL works, then the problem is with the resolution of the shorthand. If the explicit URL also does not work as expected, then the problem is elsewhere (such as your nameservice).
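The comparison might look like the following sketch. The table names, columns, file path, and the nameservice name testns are all hypothetical; substitute the values from your own environment.

```sql
-- Hypothetical table, path, and nameservice names throughout.

-- Shorthand form: the location is resolved using the Hadoop configuration files.
CREATE EXTERNAL TABLE sales_shorthand (id INT, amount FLOAT)
    AS COPY FROM 'hdfs:///data/sales/*.orc' ORC;

-- Explicit form: the URL names the nameservice directly.
CREATE EXTERNAL TABLE sales_explicit (id INT, amount FLOAT)
    AS COPY FROM 'hdfs://testns/data/sales/*.orc' ORC;

-- Compare the results of the two definitions.
SELECT COUNT(*) FROM sales_shorthand;
SELECT COUNT(*) FROM sales_explicit;
```

If the two queries return different results, the shorthand is probably resolving to a different cluster than the one you named explicitly.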
Queries take a long time to run when using HA
The High Availability Name Node feature in HDFS allows a name node to fail over to a standby name node. The dfs.client.failover.max.attempts configuration parameter (in hdfs-site.xml) specifies how many attempts to make when failing over. Vertica uses a default value of 4 if this parameter is not set. After reaching the maximum number of failover attempts, Vertica concludes that the HDFS cluster is unavailable and aborts the operation. Vertica uses the dfs.client.failover.sleep.base.millis and dfs.client.failover.sleep.max.millis parameters to decide how long to wait between retries. Typical ranges are 500 milliseconds to 15 seconds, with longer waits for successive retries.
Another parameter, ipc.client.connect.retry.interval, specifies the time to wait between attempts, with typical values being 10 to 20 seconds.
Cloudera and Hortonworks both provide tools to automatically generate configuration files. These tools can set the maximum number of failover attempts to a much higher number (50 or 100). If the HDFS cluster is unavailable (all name nodes are unreachable), Vertica can appear to hang for an extended period (minutes to hours) while trying to connect.
Failover attempts are logged in the QUERY_EVENTS system table. The following example shows how to query this table to find these events:
=> SELECT event_category, event_type, event_description, operator_name,
event_details, count(event_type) AS count
FROM query_events
WHERE event_type ilike 'WEBHDFS FAILOVER RETRY'
GROUP BY event_category, event_type, event_description, operator_name, event_details;
-[ RECORD 1 ]-----+---------------------------------------
event_category | EXECUTION
event_type | WEBHDFS FAILOVER RETRY
event_description | WebHDFS Namenode failover and retry.
operator_name | WebHDFS FileSystem
event_details | WebHDFS request failed on ns
count | 4
You can either wait for Vertica to complete or abort the connection, or you can set the dfs.client.failover.max.attempts parameter to a lower value.
Kerberos authentication errors
Kerberos authentication can fail even though a ticket is valid if Hadoop expires tickets frequently. It can also fail due to clock skew between Hadoop and Vertica nodes. For details, see Troubleshooting Kerberos authentication.