Separate clusters
With separate clusters, a Vertica cluster and a Hadoop cluster share no nodes. You should use a high-bandwidth network connection between the two clusters.
The following figure illustrates the configuration for separate clusters:
Network
The network is a key performance component of any well-configured cluster. When Vertica stores data in HDFS, it writes and reads that data across the network.
The layout shown in the figure calls for two networks, and there are benefits to adding a third:
- Database Private Network: Vertica uses a private network for command and control and for moving data between nodes in support of its database functions. In some deployments, command-and-control traffic and data traffic are split across two networks.
- Database/Hadoop Shared Network: Each Vertica node must be able to connect to each Hadoop data node and the NameNode. Hadoop best practices generally call for a dedicated network for the Hadoop cluster. This is not a technical requirement, but a dedicated network improves Hadoop performance. Vertica and Hadoop should share the dedicated Hadoop network.
- Optional Client Network: Outside clients may access the clustered networks through a client network. This is not an absolute requirement, but a third network that supports client connections to either Vertica or Hadoop can improve performance. If the configuration does not support a client network, then client connections should use the shared network.
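The connectivity requirement on the shared network (every Vertica node can reach every Hadoop data node and the NameNode) can be spot-checked with a short script run on each Vertica node. The following is a minimal sketch in Python, not a Vertica or Hadoop tool. The function names are hypothetical, and the hostnames and ports in the commented example are assumptions: 8020 and 9866 are common Hadoop 3 defaults for the NameNode RPC port and the DataNode data-transfer port, but your cluster may use different values.

```python
import socket


def reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable_endpoints(endpoints, timeout=3.0):
    """Return the (host, port) pairs that could not be reached."""
    return [(host, port) for host, port in endpoints
            if not reachable(host, port, timeout)]


# Example (hypothetical hostnames; ports are assumed defaults):
# endpoints = [
#     ("namenode.example.com", 8020),   # NameNode RPC
#     ("datanode1.example.com", 9866),  # DataNode data transfer
#     ("datanode2.example.com", 9866),
# ]
# print(unreachable_endpoints(endpoints))
```

An empty list means every listed endpoint accepted a TCP connection from this node. The check only confirms that a connection can be opened; it says nothing about bandwidth or about which network the traffic was routed over.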
Hadoop configuration parameters
For best performance, set the following parameters with the specified minimum values:
Parameter | Minimum Value |
---|---|
HDFS block size | 512 MB |
NameNode Java Heap | 1 GB |
DataNode Java Heap | 2 GB |
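One way to compare an existing Hadoop configuration against these minimums is sketched below in Python. It is an illustration under stated assumptions, not a supported utility: it assumes the block size is set through the `dfs.blocksize` property in `hdfs-site.xml` and that heap limits appear as a `-Xmx` flag in the JVM options for each daemon. Where those options are defined varies by Hadoop version and distribution, so the script takes the option strings as input rather than locating them. All function names and the `MINIMUMS` dictionary are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Binary multipliers for the size suffixes handled by this sketch.
_SUFFIXES = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

# Minimum values from the table above, in bytes.
MINIMUMS = {
    "block_size": 512 * 1024**2,
    "namenode_heap": 1 * 1024**3,
    "datanode_heap": 2 * 1024**3,
}


def parse_size(text):
    """Convert '512m', '1g', or a plain byte count such as '134217728' to bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([kmgt]?)b?\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, suffix = match.groups()
    return int(number) * _SUFFIXES.get(suffix, 1)


def block_size_from_hdfs_site(xml_text):
    """Return dfs.blocksize in bytes, or None if the property is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == "dfs.blocksize":
            return parse_size(prop.findtext("value", default=""))
    return None


def max_heap_bytes(jvm_opts):
    """Return the -Xmx value from a JVM options string in bytes, or None."""
    match = re.search(r"-Xmx(\d+)([kmgt]?)", jvm_opts, re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1)) * _SUFFIXES.get(match.group(2).lower(), 1)


def shortfalls(observed):
    """Return the names of settings that are missing or below the minimum."""
    return [name for name, minimum in MINIMUMS.items()
            if observed.get(name) is None or observed[name] < minimum]
```

For example, passing `{"block_size": block_size_from_hdfs_site(xml_text), "namenode_heap": max_heap_bytes(nn_opts), "datanode_heap": max_heap_bytes(dn_opts)}` to `shortfalls` returns the settings that still need to be raised. A setting that is absent from the input is reported as a shortfall, because the script cannot tell which default applies.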