Using elastic crunch scaling to improve query performance

You can choose to add nodes to your database to improve the performance of complex long-running analytic queries.

You can choose to add nodes to your database to improve the performance of complex long-running analytic queries. Adding nodes helps these queries run faster.

When you have more nodes in a subcluster than you have shards in your database, multiple nodes subscribe to each shard. To involve all of the nodes in the subcluster in queries, the Vertica query optimizer automatically uses a feature called Elastic Crunch Scaling (ECS). This feature splits the responsibility for processing the data in each shard among the nodes that subscribe to it. During a query, each node has less data to process and usually finishes the query faster.

For example, suppose you have a six-node subcluster in a three-shard database. In this subcluster, two nodes subscribe to each shard. When you execute a query, Vertica assigns each node roughly half of the data in the shard it subscribes to. Because all nodes in the subcluster participate in the query, the query usually finishes faster than if only half the nodes had participated.

ECS lets a subcluster that has more nodes than shards act as if the shard count in the database were higher. In a three-shard database, a six-node subcluster acts as if the database has six shards by splitting each shard in half. However, using ECS isn't as efficient as having a higher shard count. In practice, you will see slightly slower query performance on a six-node subcluster in a three shard database than you would see from a six-node subcluster in a six-shard database.

You can call RESHARD_DATABASE to change the number of shards in your database. If the new number of shards is greater than or equal to the number of nodes in the subcluster, the subcluster no longer uses ECS. This will generally lead to faster query performance. However, re-sharding produces a larger catalog size and storage containers that are initially misaligned with the new shard definitions. Until the storage containers are realigned, queries must filter out the data in the storage containers that is outside the new shard bounds. This adds a small overhead to queries. For details, see Change the number of shards in the database.

You can determine when the optimizer will use ECS in a subcluster by querying the V_CATALOG.SESSION_SUBSCRIPTIONS system table and look for nodes whose is_collaborating column is TRUE. Subclusters whose node count is less than or equal to the number of shards in the database only have participating nodes. Subclusters that have more nodes than the database's shard count assign the "extra" nodes the role of collaborators. The differences between the two types of nodes are not important for when you are executing queries. The two types just relate to how Vertica organizes the nodes to execute ECS-enabled queries.

This example shows how to get the list of nodes that are participating or collaborating in resolving queries for the current session:

=> SELECT node_name, shard_name, is_collaborating, is_participating
        FROM V_CATALOG.SESSION_SUBSCRIPTIONS
        WHERE is_participating = TRUE OR is_collaborating = TRUE
        ORDER BY shard_name, node_name;
      node_name       | shard_name  | is_collaborating | is_participating
----------------------+-------------+------------------+------------------
 v_verticadb_node0004 | replica     | f                | t
 v_verticadb_node0005 | replica     | f                | t
 v_verticadb_node0006 | replica     | t                | f
 v_verticadb_node0007 | replica     | f                | t
 v_verticadb_node0008 | replica     | t                | f
 v_verticadb_node0009 | replica     | t                | f
 v_verticadb_node0007 | segment0001 | f                | t
 v_verticadb_node0008 | segment0001 | t                | f
 v_verticadb_node0005 | segment0002 | f                | t
 v_verticadb_node0009 | segment0002 | t                | f
 v_verticadb_node0004 | segment0003 | f                | t
 v_verticadb_node0006 | segment0003 | t                | f
(12 rows)

You can see that nodes 4, 5, and 7 are participating, and nodes 6, 8, and 9 are collaborating.

You can also see that ECS is enabled by looking at an EXPLAIN plan for a query. At the top of the plan for an ECS-enabled query is the statement "this query involves non-participating nodes." These non-participating nodes are the collaborating nodes that are splitting the data in the shard with the participating nodes. The plan also lists the nodes taking part in the query.

This example shows an explain plan for an ECS-enabled query in a six-node subcluster in a three-shard database:

=> EXPLAIN SELECT sales_quantity, sales_dollar_amount, transaction_type, cc_name
     FROM online_sales.online_sales_fact
     INNER JOIN online_sales.call_center_dimension
     ON (online_sales.online_sales_fact.call_center_key
         = online_sales.call_center_dimension.call_center_key
         AND sale_date_key = 156)
     ORDER BY sales_dollar_amount DESC;

                                      QUERY PLAN
--------------------------------------------------------------------------------
 ------------------------------
 QUERY PLAN DESCRIPTION:
 The execution of this query involves non-participating nodes. Crunch scaling
 strategy preserves data segmentation
 ------------------------------

 EXPLAIN SELECT sales_quantity, sales_dollar_amount, transaction_type, cc_name
 FROM online_sales.online_sales_fact
 INNER JOIN online_sales.call_center_dimension
 ON (online_sales.online_sales_fact.call_center_key
     = online_sales.call_center_dimension.call_center_key
     AND sale_date_key = 156)
 ORDER BY sales_dollar_amount DESC;

 Access Path:
 +-SORT [Cost: 6K, Rows: 754K] (PATH ID: 1)
 |  Order: online_sales_fact.sales_dollar_amount DESC
 |  Execute on: v_verticadb_node0007, v_verticadb_node0004, v_verticadb_node0005,
 |     v_verticadb_node0006, v_verticadb_node0008, v_verticadb_node0009
 | +---> JOIN MERGEJOIN(inputs presorted) [Cost: 530, Rows: 754K (202 RLE)] (PATH ID: 2)
 | |      Join Cond: (online_sales_fact.call_center_key = call_center_dimension.call_center_key)
 | |      Materialize at Output: online_sales_fact.sales_quantity,
 | |            online_sales_fact.sales_dollar_amount, online_sales_fact.transaction_type
 | |      Execute on: v_verticadb_node0007, v_verticadb_node0004,
 | |            v_verticadb_node0005, v_verticadb_node0006, v_verticadb_node0008,
 | |            v_verticadb_node0009
 | | +-- Outer -> STORAGE ACCESS for online_sales_fact [Cost: 13, Rows: 754K (202 RLE)] (PATH ID: 3)
 | | |      Projection: online_sales.online_sales_fact_DBD_18_seg_vmart_b0
 | | |      Materialize: online_sales_fact.call_center_key
 | | |      Filter: (online_sales_fact.sale_date_key = 156)
 | | |      Execute on: v_verticadb_node0007, v_verticadb_node0004,
 | | |          v_verticadb_node0005, v_verticadb_node0006, v_verticadb_node0008,
 | | |          v_verticadb_node0009
 | | |      Runtime Filter: (SIP1(MergeJoin): online_sales_fact.call_center_key)
 | | +-- Inner -> STORAGE ACCESS for call_center_dimension [Cost: 17, Rows: 200] (PATH ID: 4)
 | | |      Projection: online_sales.call_center_dimension_DBD_16_seg_vmart_b0
 | | |      Materialize: call_center_dimension.call_center_key, call_center_dimension.cc_name
 | | |      Execute on: v_verticadb_node0007, v_verticadb_node0004,
                v_verticadb_node0005, v_verticadb_node0006, v_verticadb_node0008,
                v_verticadb_node0009
 . . .

Taking advantage of ECS

To take advantage of ECS, create a secondary subcluster where the number of nodes is a multiple of the number of shards in your database. For example, in a 12-shard database, create a subcluster that contains a multiple of 12 nodes such as 24 or 36. The number of nodes must be a multiple of the number of shards to evenly distribute the data across the nodes in the subcluster. See Subclusters for more information.

Once you have created the subcluster, have users connect to it and run their analytic queries. Vertica automatically enables ECS in the subcluster because it has more nodes than there are shards in the database.

How the optimizer assigns data responsibilities to nodes

The optimizer has two strategies to choose from when dividing the data in a shard among its subscribing nodes. One strategy is optimized for queries that use data segmentation. Queries that contain a JOIN or GROUP BY clause rely on data segmentation. The other strategy is for queries that do not need segmentation.

By default, the optimizer automatically chooses the strategy to use. For most queries, the automatically-chosen strategy results in faster query performance. For some queries, you may want to manually override the strategy using hints. In a small number of queries, ECS does not help performance. In these cases, you can disable ECS. See Manually choosing an ECS strategy for details.