Configuring Hadoop for co-located clusters
If you are co-locating Vertica on any HDFS nodes, there are some additional configuration requirements.
Hadoop configuration parameters
For best performance, set the following parameters with the specified minimum values:
| Parameter | Minimum value |
|---|---|
| HDFS block size | 512MB |
| NameNode Java heap | 1GB |
| DataNode Java heap | 2GB |
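For example, a minimal hdfs-site.xml sketch for the block size (the value is the table's 512MB minimum expressed in bytes; the NameNode and DataNode heap sizes are normally set through your distribution's management tools or hadoop-env.sh, not in this file):

```xml
<!-- hdfs-site.xml: raise the HDFS block size to 512MB (536870912 bytes) -->
<property>
  <name>dfs.blocksize</name>
  <value>536870912</value>
</property>
```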
WebHDFS
Hadoop has two services that can provide web access to HDFS:
- WebHDFS
- HttpFS
For Vertica, you must use the WebHDFS service.
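WebHDFS is enabled by default in most distributions; if it has been turned off, a minimal hdfs-site.xml sketch to re-enable it:

```xml
<!-- hdfs-site.xml: ensure the WebHDFS REST interface is enabled -->
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
```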
YARN
The YARN service, available in newer releases of Hadoop, performs resource management for Hadoop clusters. When co-locating Vertica on YARN-managed Hadoop nodes, you must make some changes in YARN.
Vertica recommends reserving at least 16GB of memory for Vertica on shared nodes; reserving more improves performance. How you do this depends on your Hadoop distribution:
- If you are using Hortonworks, create a "Vertica" node label and assign it to the nodes that are running Vertica.
- If you are using Cloudera, enable and configure static service pools.
Consult the documentation for your Hadoop distribution for details. Alternatively, you can disable YARN on the shared nodes.
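If you manage YARN configuration directly, the underlying mechanism on each shared node is to cap the memory the NodeManager may allocate so that enough remains for Vertica. A minimal yarn-site.xml sketch, assuming a hypothetical 64GB node (the 48GB cap is illustrative, not a recommendation):

```xml
<!-- yarn-site.xml on a shared node: cap the memory YARN may allocate.
     49152MB (48GB) leaves 16GB of a hypothetical 64GB node for Vertica. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>49152</value>
</property>
```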
Hadoop balancer
The Hadoop Balancer can redistribute data blocks across HDFS. For many Hadoop services, this feature is useful. However, for Vertica this can reduce performance under some conditions.
If you are using HDFS storage locations, the Hadoop Balancer can move data away from the Vertica nodes that are operating on it, degrading performance. This behavior can also occur when reading ORC or Parquet files if Vertica is not running on all Hadoop nodes. (If you are using separate Vertica and Hadoop clusters, all Hadoop access is over the network, and the performance cost is less noticeable.)
To prevent the undesired movement of data blocks across the HDFS cluster, consider excluding Vertica nodes from rebalancing. See the Hadoop documentation to learn how to do this.
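For example, on a stock Apache Hadoop installation the balancer accepts a list of hosts whose blocks it should not move; a sketch, in which the hosts-file path and name are hypothetical:

```sh
# Rebalance HDFS, but do not move blocks away from the co-located Vertica
# nodes listed (one hostname per line) in the hypothetical file below.
hdfs balancer -exclude -f /etc/hadoop/conf/vertica-nodes.txt
```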
Replication factor
By default, HDFS stores three copies of each data block. Vertica is generally set up to store two copies of each data item through K-safety (K=1 means one replica, so two copies in total). Thus, lowering the HDFS replication factor to 2 saves space while still providing data protection.
To lower the number of copies HDFS stores, set HadoopFSReplication, as explained in Troubleshooting HDFS storage locations.
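For example, a minimal sketch using SET_CONFIG_PARAMETER, Vertica's general mechanism for setting configuration parameters (see that topic for details and caveats):

```sql
-- Ask HDFS for 2 replicas, rather than its default of 3, when Vertica
-- writes to an HDFS storage location.
SELECT SET_CONFIG_PARAMETER('HadoopFSReplication', 2);
```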
Disk space for non-HDFS use
You also need to reserve some disk space for non-HDFS use. To reserve disk space using Ambari, set dfs.datanode.du.reserved to a value in the hdfs-site.xml configuration file. Setting this parameter reserves space for the non-HDFS files that Vertica requires.
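A sketch of the corresponding hdfs-site.xml entry (the 10GB figure is illustrative; size it for what Vertica actually needs on your nodes):

```xml
<!-- hdfs-site.xml: reserve 10GB (10737418240 bytes) per DataNode volume
     for non-HDFS use; the figure is illustrative, not a recommendation -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value>
</property>
```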