Configuring your Vertica cluster for Eon Mode
Running Vertica in Eon Mode decouples the cluster size from the data volume and lets you size your compute capacity independently of your storage needs. There are a number of factors to consider when deciding what sorts of instances to use and how large your Eon Mode cluster will be.
Before you begin
Vertica Eon Mode works both in the cloud and on-premises. As a Vertica administrator setting up your production cluster running in Eon Mode, you must make decisions about the virtual or physical hardware you will use to meet your needs. This topic provides guidelines and best practices for selecting server types and cluster sizes for a Vertica database running in Eon Mode. It assumes that you have a basic understanding of the Eon Mode architecture and concepts such as communal storage, depot, namespaces, and shards. If you need to refresh your understanding, see Eon Mode architecture and Namespaces and shards.
Cluster size overview
Because Eon Mode separates data storage from the computing power of your nodes, choosing a cluster size is more complex than for an Enterprise Mode database. Usually, you choose a base number of nodes that will form one or more primary subclusters. These subclusters contain nodes that are always running in your database. You usually use them for workloads such as data loading and executing DDL statements. You rarely alter the size of these subclusters dynamically. As you need additional compute resources to execute queries, you add one or more subclusters (usually secondary subclusters) of nodes to your database.
When choosing your instances and sizing your cluster for Eon Mode, consider the working data size that your database will be dealing with. This is the amount of data that most of your queries operate on. For example, suppose your database stores sales data. If most of the queries running on your database analyze the last month or two of sales to create reports on sales trends, then your working data size is the amount of data you typically load over a few months.
Choosing instances or physical hardware
Depending on the complexity of your workload and expected concurrency, choose physical or virtual hardware that has sufficient CPU and memory. For production clusters, Vertica recommends the following minimum configuration for either virtual or physical nodes in an Eon Mode database:

- 16 cores
- 128 GB RAM
- 2 TB of local storage
Note
The above specifications are only a minimum recommendation. Consider increasing them based on your workloads and the amount of data you are processing. You must have a minimum of 3 nodes in an Eon Mode database cluster.
For specific instance recommendations for cloud-based Eon Mode databases, see:
Determining local storage requirements
For both virtual and physical hardware, you must decide how much local storage your nodes need. In Eon Mode, the definitive copy of your database's data resides in the communal storage. This storage is provided by either a cloud-based object store such as AWS S3, or by an on-premises object store, such as a Pure Storage FlashBlade appliance.
Even though your database's data is stored in communal storage, your nodes still need some local storage. A node in an Eon Mode database uses local storage for three purposes:
- Depot storage: To get the fastest response time for frequently executed queries, provision a depot large enough to hold your working data set after data compression. Divide the working data size by the number of nodes in the subcluster to estimate the size of the depot you need for each node. See Choosing the initial node and shard counts below to get an idea of how many nodes you want in your initial database subcluster. If you expect to dynamically scale your cluster up, estimate the depot size based on the minimum number of nodes you anticipate having in the subcluster.

  Also consider how much data you will load at once when sizing your depot. When loading data, Vertica defaults to writing uncommitted ROS files into the depot before uploading the files to communal storage. If the free space in the depot is not sufficient, Vertica evicts files from the depot to make space for new files.

  Your data load fails if the amount of data you try to load in a single transaction is larger than the total size of all the depots in the subcluster. To load more data than there is space in the subcluster's combined depots, set UseDepotForWrites to 0. This configuration parameter tells Vertica to load the data directly into communal storage (see the example after this list).
- Data storage: The data storage location holds files that belong to temporary tables and temporary data from sort operators that spill to disk. When loading data into Vertica, the sort operator may spill to disk. Depending on the size of the load, Vertica may perform the sort in multiple merge phases. The amount of data concurrently loaded into Vertica cannot be larger than the sum of temporary storage location sizes across all nodes divided by 2.
- Catalog storage: The catalog size depends on the number of database objects per shard and the number of shard subscriptions per node.
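For example, if a single transaction must load more data than the subcluster's combined depots can hold, you can turn off depot writes before the load and turn them back on afterward. The following is a minimal sketch; it assumes you are altering the current database:

```sql
-- Load data directly into communal storage instead of staging it in the depot.
ALTER DATABASE DEFAULT SET PARAMETER UseDepotForWrites = 0;

-- Run the oversized load here, then restore the default behavior.
ALTER DATABASE DEFAULT SET PARAMETER UseDepotForWrites = 1;
```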
Vertica recommends a minimum local storage capacity of 2 TB per node, of which 60% is reserved for the depot and the other 40% is shared between the catalog and the data location. If you determine that you need a depot larger than 1.2 TB per node (which is 60% of 2 TB), add more storage than this minimum recommendation. You can calculate the local storage you need per node using this equation:

Local storage per node = (compressed working data size ÷ initial node count) ÷ 0.6

For example, suppose you have a compressed working data size of 24 TB, and you want an initial primary subcluster of 3 nodes. Using these values in the equation results in 13.33 TB of local storage per node:

(24 TB ÷ 3 nodes) ÷ 0.6 = 8 TB ÷ 0.6 ≈ 13.33 TB
Choosing the initial node and shard counts
Shards are how Vertica divides the responsibility for the data in communal storage among nodes. The number of shards that data is segmented into depends on the shard count of the namespace to which the data belongs. Each node in a subcluster subscribes to at least one shard in communal storage. During queries, data loads, and other database tasks, each node is responsible for the data in the shards it subscribes to. See Namespaces and shards for more information.
The relation between shards and nodes means that when selecting the number of shards for a namespace, you must consider the number of nodes you will have in your database.
You set the initial shard count of the default_namespace when you create your database. The shard count of non-default namespaces is set with the CREATE NAMESPACE statement. The initial node count is the number of nodes you will have in your core primary subcluster. The number of shards in a namespace should always be a multiple or divisor of the node count. This ensures that the shards are evenly divided between the nodes. For example, in a six-shard namespace, you should have subclusters that contain two, three, six, or a multiple of six nodes. If the number of shards is not a divisor or multiple of the node count, the shard subscriptions are not spread evenly across the nodes. This leads to some nodes being more heavily loaded than others.
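For example, you might create a non-default namespace with a shard count that matches the subclusters you plan to run. The following is a sketch; the namespace and table names are hypothetical:

```sql
-- Create a namespace whose data is segmented into 6 shards.
-- Subclusters of 2, 3, 6, or a multiple of 6 nodes divide these shards evenly.
CREATE NAMESPACE analytics SHARD COUNT 6;

-- Tables in a non-default namespace use three-level naming: namespace.schema.table.
CREATE TABLE analytics.public.page_views (view_time TIMESTAMP, user_id INT);
```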
Note
In a database where the number of nodes is a multiple of the number of shards in a namespace, the subcluster uses Elastic Crunch Scaling (ECS) to evenly divide shard coverage among two or more nodes. For more information, see Using elastic crunch scaling to improve query performance.

The following table shows the recommended initial node count based on the working data size:
| Cluster Type | Working Data Size | Initial Node Count |
|---|---|---|
| Small | Up to 24 TB | 3 |
| Medium | Up to 48 TB | 6 |
| Large | Up to 96 TB | 12 |
| Extra large | Up to 192 TB | 24 |
Scaling your cluster
There are two strategies you can use when adding nodes to your database. Each of these strategies lets you improve different types of database performance:
- To increase the performance of complex, long-running queries, add nodes to an existing subcluster. These additional nodes improve the overall performance of these complex queries by splitting the load across more compute nodes. You can add more nodes to a subcluster than you have shards in a namespace. In this case, nodes in the subcluster that subscribe to the same shard will split up the data in the shard when performing a query. See Elasticity for more information.
- To increase the throughput of multiple short-term queries (often called "dashboard queries"), improve your cluster's parallelism by adding additional subclusters. Subclusters work independently and in parallel on these shorter queries. See Subclusters for more information.
Complex analytic queries perform better on subclusters with more nodes, which means that 6 nodes with 6 shards perform better than 3 nodes with 6 shards. Having more nodes than shards can increase performance further, but the performance gain is not linear. For example, a 12-node subcluster in a 6-shard namespace is not as efficient as a 12-node subcluster in a 12-shard namespace. Dashboard-type queries operating on smaller data sets may not see much difference between a 3-node subcluster in a 6-shard namespace and a 6-node subcluster in a 6-shard namespace.
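After scaling, you can confirm that shard subscriptions are spread evenly across nodes. The following query is a sketch that assumes the NODE_SUBSCRIPTIONS system table; uneven counts within a subcluster suggest a shard count that is not a multiple or divisor of the node count:

```sql
-- Count the shards each node subscribes to.
SELECT node_name, COUNT(shard_name) AS shard_subscriptions
FROM node_subscriptions
GROUP BY node_name
ORDER BY node_name;
```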
Adding communal storage locations
Eon Mode databases support multiple communal storage locations. On database creation, you set the main communal storage location. You can then add communal storage locations with the CREATE LOCATION statement, and use the SET_OBJECT_STORAGE_POLICY function to assign database objects to one of the storage locations.
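For example, you might add a slower, less expensive S3 bucket for cold data and assign an infrequently queried table to it. This is a sketch; the bucket path, location label, and table name are placeholders:

```sql
-- Add an additional shared (communal) storage location.
CREATE LOCATION 's3://archive-bucket/coldstore' SHARED USAGE 'DATA' LABEL 'coldstorage';

-- Store the named table's data in the new location.
SELECT SET_OBJECT_STORAGE_POLICY('sales_archive', 'coldstorage');
```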
Multiple communal storage locations can be useful in cases such as the following:
- Your on-premises communal storage location has run out of space, and you need to add an additional storage location, such as a Pure Storage FlashBlade appliance.
- In a multi-tenant application, you want to store the data for different users in separate communal storage locations.
- Some of your data is accessed infrequently, and you want to move it to cold storage, such as a less expensive, slower S3 storage class.
Use cases
Let’s look at some use cases to learn how to size your Eon Mode cluster to meet your own particular requirements.
Use case 1: save compute by provisioning when needed, rather than for peak times
This use case highlights increasing query throughput in Eon Mode by scaling a cluster from 6 to 18 nodes with 3 subclusters of 6 nodes each. In this use case, you need to support a highly concurrent, short-query workload on a working data set of 24 TB or less. You create an initial database with 6 nodes and a default_namespace of 6 shards. You scale your database for concurrent throughput on demand by adding one or more subclusters during certain days of the week or for specific date ranges when you are expecting a peak load. You can then shut down or terminate the additional subclusters when your database experiences lower demand. With Vertica in Eon Mode, you save money by provisioning on demand rather than provisioning for peak times.
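For example, when a peak period ends, you can stop a secondary subcluster from SQL instead of leaving its nodes running. A sketch, assuming a hypothetical secondary subcluster named dashboard_peak:

```sql
-- Gracefully shut down all nodes in the secondary subcluster when demand drops.
SELECT SHUTDOWN_SUBCLUSTER('dashboard_peak');
```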
Use case 2: complex analytic workload requires more compute nodes
This use case showcases the idea that complex analytic workloads on large working data sets benefit from a high shard count and node count. You create an initial subcluster with 24 nodes and a default_namespace of 24 shards. As needed, you can add another 24 nodes to your initial subcluster. These additional nodes enable the subcluster to use elastic crunch scaling to reduce the time it takes to complete complex analytic queries.
If you use multiple namespaces, you can separate the large working data set and any smaller data sets into different namespaces. For details, see Managing namespaces.
Use case 3: workload isolation
This use case showcases the idea of having separate subclusters to isolate ETL and report workloads. You create an initial primary subcluster with 6 nodes and a default_namespace of 6 shards for servicing ETL workloads. You then add another 6-node secondary subcluster for executing query workloads. To separate the two workloads, you can configure a network load balancer or create connection load balancing policies in Vertica to direct clients to the correct subcluster based on the type of workload they need to execute.
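A minimal sketch of such a connection load balancing policy; the subcluster, group, and rule names and the client network range are placeholders:

```sql
-- Put the query subcluster's nodes behind a load balance group.
CREATE LOAD BALANCE GROUP query_group WITH SUBCLUSTER query_subcluster
    FILTER '0.0.0.0/0' POLICY 'ROUNDROBIN';

-- Route reporting clients on this network to the query subcluster.
CREATE ROUTING RULE reporting_rule ROUTE '192.168.1.0/24' TO query_group;
```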