This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Creating custom designs

Vertica strongly recommends that you use the physical schema design produced by , which provides , excellent query performance, and efficient use of storage space.

Vertica strongly recommends that you use the physical schema design produced by Database Designer, which provides K-safety, excellent query performance, and efficient use of storage space. If any queries run less as efficiently than you expect, consider using the Database Designer incremental design process to optimize the database design for the query.

If the projections created by Database Designer still do not meet your needs, you can write custom projections, from scratch or based on projection designs created by Database Designer.

If you are unfamiliar with writing custom projections, start by modifying an existing design generated by Database Designer.

1 - Custom design process

To create a custom design or customize an existing one:.

To create a custom design or customize an existing one:

  1. Plan the new design or modifications to an existing one. See Planning your design.

  2. Create or modify projections. See Design fundamentals and CREATE PROJECTION for more detail.

  3. Deploy projections to a test environment. See Writing and deploying custom projections.

  4. Test and modify projections as needed.

  5. After you finalize the design, deploy projections to the production environment.

2 - Planning your design

The syntax for creating a design is easy for anyone who is familiar with SQL.

The syntax for creating a design is easy for anyone who is familiar with SQL. As with any successful project, however, a successful design requires some initial planning. Before you create your first design:

  • Become familiar with standard design requirements and plan your design to include them. See Design requirements.

  • Determine how many projections you need to include in the design. See Determining the number of projections to use.

  • Determine the type of compression and encoding to use for columns. See Architecture.

  • Determine whether or not you want the database to be K-safe. Vertica recommends that all production databases have a minimum K-safety of one (K=1). Valid K-safety values are 0, 1, and 2. See Designing for K-safety.

2.1 - Design requirements

A physical schema design is a script that contains CREATE PROJECTION statements.

A physical schema design is a script that contains CREATE PROJECTION statements. These statements determine which columns are included in projections and how they are optimized.

If you use Database Designer as a starting point, it automatically creates designs that meet all fundamental design requirements. If you intend to create or modify designs manually, be aware that all designs must meet the following requirements:

  • Every design must create at least one superprojection for every table in the database that is used by the client application. These projections provide complete coverage that enables users to perform ad-hoc queries as needed. They can contain joins and they are usually configured to maximize performance through sort order, compression, and encoding.

  • Query-specific projections are optional. If you are satisfied with the performance provided through superprojections, you do not need to create additional projections. However, you can maximize performance by tuning for specific query work loads.

  • Vertica recommends that all production databases have a minimum K-safety of one (K=1) to support high availability and recovery. (K-safety can be set to 0, 1, or 2.) See High availability with projections and Designing for K-safety.

  • Vertica recommends that if you have more than 20 nodes, but small tables, do not create replicated projections. If you create replicated projections, the catalog becomes very large and performance may degrade. Instead, consider segmenting those projections.

2.2 - Determining the number of projections to use

In many cases, a design that consists of a set of superprojections (and their buddies) provides satisfactory performance through compression and encoding.

In many cases, a design that consists of a set of superprojections (and their buddies) provides satisfactory performance through compression and encoding. This is especially true if the sort orders for the projections have been used to maximize performance for one or more query predicates (WHERE clauses).

However, you might want to add additional query-specific projections to increase the performance of queries that run slowly, are used frequently, or are run as part of business-critical reporting. The number of additional projections (and their buddies) that you create should be determined by:

  • Your organization's needs

  • The amount of disk space you have available on each node in the cluster

  • The amount of time available for loading data into the database

As the number of projections that are tuned for specific queries increases, the performance of these queries improves. However, the amount of disk space used and the amount of time required to load data increases as well. Therefore, you should create and test designs to determine the optimum number of projections for your database configuration. On average, organizations that choose to implement query-specific projections achieve optimal performance through the addition of a few query-specific projections.

2.3 - Designing for K-safety

Vertica recommends that all production databases have a minimum K-safety of one (K=1).

Vertica recommends that all production databases have a minimum K-safety of one (K=1). Valid K-safety values for production databases are 1 and 2. Non-production databases do not have to be K-safe and can be set to 0.

A K-safe database must have at least three nodes, as shown in the following table:

K-safety level Number of required nodes
1 3+
2 5+

You can set K-safety to 1 or 2 only when the physical schema design meets certain redundancy requirements. See Requirements for a K-safe physical schema design.

Using Database Designer

To create designs that are K-safe, Vertica recommends that you use the Database Designer. When creating projections with Database Designer, projection definitions that meet K-safe design requirements are recommended and marked with a K-safety level. Database Designer creates a script that uses the MARK_DESIGN_KSAFE function to set the K-safety of the physical schema to 1. For example:

=> \i VMart_Schema_design_opt_1.sql
CREATE PROJECTION
CREATE PROJECTION
mark_design_ksafe
----------------------
Marked design 1-safe
(1 row)

By default, Vertica creates K-safe superprojections when database K-safety is greater than 0.

Monitoring K-safety

Monitoring tables can be accessed programmatically to enable external actions, such as alerts. You monitor the K-safety level by querying the SYSTEM table for settings in columns DESIGNED_FAULT_TOLERANCE and CURRENT_FAULT_TOLERANCE.

Loss of K-safety

When K nodes in your cluster fail, your database continues to run, although performance is affected. Further node failures could potentially cause the database to shut down if the failed node's data is not available from another functioning node in the cluster.

See also

K-safety in an Enterprise Mode database

2.3.1 - Requirements for a K-safe physical schema design

Database Designer automatically generates designs with a K-safety of 1 for clusters that contain at least three nodes.

Database Designer automatically generates designs with a K-safety of 1 for clusters that contain at least three nodes. (If your cluster has one or two nodes, it generates designs with a K-safety of 0. You can modify a design created for a three-node (or greater) cluster, and the K-safe requirements are already set.

If you create custom projections, your physical schema design must meet the following requirements to be able to successfully recover the database in the event of a failure:

You can use the MARK_DESIGN_KSAFE function to find out whether your schema design meets requirements for K-safety.

2.3.2 - Requirements for a physical schema design with no K-safety

If you use Database Designer to generate an comprehensive design that you can modify and you do not want the design to be K-safe, set K-safety level to 0 (zero).

If you use Database Designer to generate an comprehensive design that you can modify and you do not want the design to be K-safe, set K-safety level to 0 (zero).

If you want to start from scratch, do the following to establish minimal projection requirements for a functioning database with no K-safety (K=0):

  1. Define at least one superprojection for each table in the logical schema.

  2. Replicate (define an exact copy of) each dimension table superprojection on each node.

2.3.3 - Designing segmented projections for K-safety

Projections must comply with database K-safety requirements.

Projections must comply with database K-safety requirements. In general, you must create buddy projections for each segmented projection, where the number of buddy projections is K+1. Thus, if system K-safety is set to 1, each projection segment must be duplicated by one buddy; if K-safety is set to 2, each segment must be duplicated by two buddies.

Automatic creation of buddy projections

You can use CREATE PROJECTION so it automatically creates the number of buddy projections required to satisfy K-safety, by including SEGMENTED BY ... ALL NODES. If CREATE PROJECTION specifies K-safety (KSAFE=n), Vertica uses that setting; if the statement omits KSAFE, Vertica uses system K-safety.

In the following example, CREATE PROJECTION creates segmented projection ttt_p1 for table ttt. Because system K-safety is set to 1, Vertica requires a buddy projection for each segmented projection. The CREATE PROJECTION statement omits KSAFE, so Vertica uses system K-safety and creates two buddy projections: ttt_p1_b0 and ttt_p1_b1:

=> SELECT mark_design_ksafe(1);

  mark_design_ksafe
----------------------
 Marked design 1-safe
(1 row)

=> CREATE TABLE ttt (a int, b int);
WARNING 6978:  Table "ttt" will include privileges from schema "public"
CREATE TABLE

=> CREATE PROJECTION ttt_p1 as SELECT * FROM ttt SEGMENTED BY HASH(a) ALL NODES;
CREATE PROJECTION

=> SELECT projection_name from projections WHERE anchor_table_name='ttt';
 projection_name
-----------------
 ttt_p1_b0
 ttt_p1_b1
(2 rows)

Vertica automatically names buddy projections by appending the suffix _bn to the projection base name—for example ttt_p1_b0.

Manual creation of buddy projections

If you create a projection on a single node, and system K-safety is greater than 0, you must manually create the number of buddies required for K-safety. For example, you can create projection xxx_p1 for table xxx on a single node, as follows:

=> CREATE TABLE xxx (a int, b int);
WARNING 6978:  Table "xxx" will include privileges from schema "public"
CREATE TABLE

=> CREATE PROJECTION xxx_p1 AS SELECT * FROM xxx SEGMENTED BY HASH(a) NODES v_vmart_node0001;
CREATE PROJECTION

Because K-safety is set to 1, a single instance of this projection is not K-safe. Attempts to insert data into its anchor table xxx return with an error like this:

=> INSERT INTO xxx VALUES (1, 2);
ERROR 3586:  Insufficient projections to answer query
DETAIL:  No projections that satisfy K-safety found for table xxx
HINT:  Define buddy projections for table xxx

In order to comply with K-safety, you must create a buddy projection for projection xxx_p1. For example:

=> CREATE PROJECTION xxx_p1_buddy AS SELECT * FROM xxx SEGMENTED BY HASH(a) NODES v_vmart_node0002;
CREATE PROJECTION

Table xxx now complies with K-safety and accepts DML statements such as INSERT:

VMart=> INSERT INTO xxx VALUES (1, 2);
 OUTPUT
--------
      1
(1 row)

See also

For general information about segmented projections and buddies, see Segmented projections. For information about designing for K-safety, see Designing for K-safety and Designing for segmentation.

2.3.4 - Designing unsegmented projections for K-Safety

In many cases, dimension tables are relatively small, so you do not need to segment them.

In many cases, dimension tables are relatively small, so you do not need to segment them. Accordingly, you should design a K-safe database so projections for its dimension tables are replicated without segmentation on all cluster nodes. You create these projections with a CREATE PROJECTION statement that includes the keywords UNSEGMENTED ALL NODES. These keywords specify to create identical instances of the projection on all cluster nodes.

The following example shows how to create an unsegmented projection for the table store.store_dimension:


=> CREATE PROJECTION store.store_dimension_proj (storekey, name, city, state)
             AS SELECT store_key, store_name, store_city, store_state
             FROM store.store_dimension
             UNSEGMENTED ALL NODES;
CREATE PROJECTION

Vertica uses the same name to identify all instances of the unsegmented projection—in this example, store.store_dimension_proj. The keyword ALL NODES specifies to replicate the projection on all nodes:


=> \dj store.store_dimension_proj
                         List of projections
 Schema |         Name         |  Owner  |       Node       | Comment
--------+----------------------+---------+------------------+---------
 store  | store_dimension_proj | dbadmin | v_vmart_node0001 |
 store  | store_dimension_proj | dbadmin | v_vmart_node0002 |
 store  | store_dimension_proj | dbadmin | v_vmart_node0003 |
(3 rows)

For more information about projection name conventions, see Projection naming.

2.4 - Designing for segmentation

You segment projections using hash segmentation.

You segment projections using hash segmentation. Hash segmentation allows you to segment a projection based on a built-in hash function that provides even distribution of data across multiple nodes, resulting in optimal query execution. In a projection, the data to be hashed consists of one or more column values, each having a large number of unique values and an acceptable amount of skew in the value distribution. Primary key columns that meet the criteria could be an excellent choice for hash segmentation.

When segmenting projections, determine which columns to use to segment the projection. Choose one or more columns that have a large number of unique data values and acceptable skew in their data distribution. Primary key columns are an excellent choice for hash segmentation. The columns must be unique across all the tables being used in a query.

3 - Design fundamentals

Although you can write custom projections from scratch, Vertica recommends that you use Database Designer to create a design to use as a starting point.

Although you can write custom projections from scratch, Vertica recommends that you use Database Designer to create a design to use as a starting point. This ensures that you have projections that meet basic requirements.

3.1 - Writing and deploying custom projections

Before you write custom projections, review the topics in Planning Your Design carefully.

Before you write custom projections, review the topics in Planning your design carefully. Failure to follow these considerations can result in non-functional projections.

To manually modify or create a projection:

  1. Write a script with CREATE PROJECTION statements to create the desired projections.

  2. Run the script in vsql with the meta-command \i.

  3. For a K-safe database, call Vertica meta-function GET_PROJECTIONS on tables of the new projections. Check the output to verify that all projections have enough buddies to be identified as safe.

  4. If you create projections for tables that already contains data, call REFRESH or START_REFRESH to update new projections. Otherwise, these projections are not available for query processing.

  5. Call MAKE_AHM_NOW to set the Ancient History Mark (AHM) to the most recent epoch.

  6. Call DROP PROJECTION on projections that are no longer needed, and would otherwise waste disk space and reduce load speed.

  7. Call ANALYZE_STATISTICS on all database projections:

    => SELECT ANALYZE_STATISTICS ('');
    

    This function collects and aggregates data samples and storage information from all nodes on which a projection is stored, and then writes statistics into the catalog.

3.2 - Designing superprojections

Superprojections have the following requirements:.

Superprojections have the following requirements:

  • They must contain every column within the table.

  • For a K-safe design, superprojections must either be replicated on all nodes within the database cluster (for dimension tables) or paired with buddies and segmented across all nodes (for very large tables and medium large tables). See Projections and High availability with projections for an overview of projections and how they are stored. See Designing for K-safety for design specifics.

To provide maximum usability, superprojections need to minimize storage requirements while maximizing query performance. To achieve this, the sort order for columns in superprojections is based on storage requirements and commonly used queries.

3.3 - Sort order benefits

Column sort order is an important factor in minimizing storage requirements, and maximizing query performance.

Column sort order is an important factor in minimizing storage requirements, and maximizing query performance.

Minimize storage requirements

Minimizing storage saves on physical resources and increases performance by reducing disk I/O. You can minimize projection storage by prioritizing low-cardinality columns in its sort order. This reduces the number of rows Vertica stores and accesses to retrieve query results.

After identifying projection sort columns, analyze their data and choose the most effective encoding method. The Vertica optimizer gives preference to columns with run-length encoding (RLE), so be sure to use it whenever appropriate. Run-length encoding replaces sequences (runs) of identical values with a single pair that contains the value and number of occurrences. Therefore, it is especially appropriate to use it for low-cardinality columns whose run length is large.

Maximize query performance

You can facilitate query performance through column sort order as follows:

  • Where possible, sort order should prioritize columns with the lowest cardinality.

  • Do not sort projections on columns of type LONG VARBINARY and LONG VARCHAR.

See also

Choosing sort order: best practices

3.4 - Choosing sort order: best practices

When choosing sort orders for your projections, Vertica has several recommendations that can help you achieve maximum query performance, as illustrated in the following examples.

When choosing sort orders for your projections, Vertica has several recommendations that can help you achieve maximum query performance, as illustrated in the following examples.

Combine RLE and sort order

When dealing with predicates on low-cardinality columns, use a combination of RLE and sorting to minimize storage requirements and maximize query performance.

Suppose you have a students table contain the following values and encoding types:

Column # of Distinct Values Encoded With
gender 2 (M or F) RLE
pass_fail 2 (P or F) RLE
class 4 (freshman, sophomore, junior, or senior) RLE
name 10000 (too many to list) Auto

You might have queries similar to this one:

SELECT name FROM studentsWHERE gender = 'M' AND pass_fail = 'P' AND class = 'senior';

The fastest way to access the data is to work through the low-cardinality columns with the smallest number of distinct values before the high-cardinality columns. The following sort order minimizes storage and maximizes query performance for queries that have equality restrictions on gender, class, pass_fail, and name. Specify the ORDER BY clause of the projection as follows:

ORDER BY students.gender, students.pass_fail, students.class, students.name

In this example, the gender column is represented by two RLE entries, the pass_fail column is represented by four entries, and the class column is represented by 16 entries, regardless of the cardinality of the students table. Vertica efficiently finds the set of rows that satisfy all the predicates, resulting in a huge reduction of search effort for RLE encoded columns that occur early in the sort order. Consequently, if you use low-cardinality columns in local predicates, as in the previous example, put those columns early in the projection sort order, in increasing order of distinct cardinality (that is, in increasing order of the number of distinct values in each column).

If you sort this table with student.class first, you improve the performance of queries that restrict only on the student.class column, and you improve the compression of the student.class column (which contains the largest number of distinct values), but the other columns do not compress as well. Determining which projection is better depends on the specific queries in your workload, and their relative importance.

Storage savings with compression decrease as the cardinality of the column increases; however, storage savings with compression increase as the number of bytes required to store values in that column increases.

Maximize the advantages of RLE

To maximize the advantages of RLE encoding, use it only when the average run length of a column is greater than 10 when sorted. For example, suppose you have a table with the following columns, sorted in order of cardinality from low to high:

address.country, address.region, address.state, address.city, address.zipcode

The zipcode column might not have 10 sorted entries in a row with the same zip code, so there is probably no advantage to run-length encoding that column, and it could make compression worse. But there are likely to be more than 10 countries in a sorted run length, so applying RLE to the country column can improve performance.

Put lower cardinality column first for functional dependencies

In general, put columns that you use for local predicates (as in the previous example) earlier in the join order to make predicate evaluation more efficient. In addition, if a lower cardinality column is uniquely determined by a higher cardinality column (like city_id uniquely determining a state_id), it is always better to put the lower cardinality, functionally determined column earlier in the sort order than the higher cardinality column.

For example, in the following sort order, the Area_Code column is sorted before the Number column in the customer_info table:

ORDER BY = customer_info.Area_Code, customer_info.Number, customer_info.Address

In the query, put the Area_Code column first, so that only the values in the Number column that start with 978 are scanned.

=> SELECT AddressFROM customer_info WHERE Area_Code='978' AND Number='9780123457';

Sort for merge joins

When processing a join, the Vertica optimizer chooses from two algorithms:

  • Merge join—If both inputs are pre-sorted on the join column, the optimizer chooses a merge join, which is faster and uses less memory.

  • Hash join—Using the hash join algorithm, Vertica uses the smaller (inner) joined table to build an in-memory hash table on the join column. A hash join has no sort requirement, but it consumes more memory because Vertica builds a hash table with the values in the inner table. The optimizer chooses a hash join when projections are not sorted on the join columns.

If both inputs are pre-sorted, merge joins do not have to do any pre-processing, making the join perform faster. Vertica uses the term sort-merge join to refer to the case when at least one of the inputs must be sorted prior to the merge join. Vertica sorts the inner input side but only if the outer input side is already sorted on the join columns.

To give the Vertica query optimizer the option to use an efficient merge join for a particular join, create projections on both sides of the join that put the join column first in their respective projections. This is primarily important to do if both tables are so large that neither table fits into memory. If all tables that a table will be joined to can be expected to fit into memory simultaneously, the benefits of merge join over hash join are sufficiently small that it probably isn't worth creating a projection for any one join column.

Sort on columns in important queries

If you have an important query, one that you run on a regular basis, you can save time by putting the columns specified in the WHERE clause or the GROUP BY clause of that query early in the sort order.

If that query uses a high-cardinality column such as Social Security number, you may sacrifice storage by placing this column early in the sort order of a projection, but your most important query will be optimized.

Sort columns of equal cardinality by size

If you have two columns of equal cardinality, put the column that is larger first in the sort order. For example, a CHAR(20) column takes up 20 bytes, but an INTEGER column takes up 8 bytes. By putting the CHAR(20) column ahead of the INTEGER column, your projection compresses better.

Sort foreign key columns first, from low to high distinct cardinality

Suppose you have a fact table where the first four columns in the sort order make up a foreign key to another table. For best compression, choose a sort order for the fact table such that the foreign keys appear first, and in increasing order of distinct cardinality. Other factors also apply to the design of projections for fact tables, such as partitioning by a time dimension, if any.

In the following example, the table inventory stores inventory data, and product_key and warehouse_key are foreign keys to the product_dimension and warehouse_dimension tables:

=> CREATE TABLE inventory (
 date_key INTEGER NOT NULL,
 product_key INTEGER NOT NULL,
 warehouse_key INTEGER NOT NULL,
 ...
);
=> ALTER TABLE inventory
   ADD CONSTRAINT fk_inventory_warehouse FOREIGN KEY(warehouse_key)
   REFERENCES warehouse_dimension(warehouse_key);
ALTER TABLE inventory
   ADD CONSTRAINT fk_inventory_product FOREIGN KEY(product_key)
   REFERENCES product_dimension(product_key);

The inventory table should be sorted by warehouse_key and then product, since the cardinality of the warehouse_key column is probably lower that the cardinality of the product_key.

3.5 - Prioritizing column access speed

If you measure and set the performance of storage locations within your cluster, Vertica uses this information to determine where to store columns based on their rank.

If you measure and set the performance of storage locations within your cluster, Vertica uses this information to determine where to store columns based on their rank. For more information, see Setting storage performance.

How columns are ranked

Vertica stores columns included in the projection sort order on the fastest available storage locations. Columns not included in the projection sort order are stored on slower disks. Columns for each projection are ranked as follows:

  • Columns in the sort order are given the highest priority (numbers > 1000).

  • The last column in the sort order is given the rank number 1001.

  • The next-to-last column in the sort order is given the rank number 1002, and so on until the first column in the sort order is given 1000 + # of sort columns.

  • The remaining columns are given numbers from 1000–1, starting with 1000 and decrementing by one per column.

Vertica then stores columns on disk from the highest ranking to the lowest ranking. It places highest-ranking columns on the fastest disks and the lowest-ranking columns on the slowest disks.

Overriding default column ranking

You can modify which columns are stored on fast disks by manually overriding the default ranks for these columns. To accomplish this, set the ACCESSRANK keyword in the column list. Make sure to use an integer that is not already being used for another column. For example, if you want to give a column the fastest access rank, use a number that is significantly higher than 1000 + the number of sort columns. This allows you to enter more columns over time without bumping into the access rank you set.

The following example sets column store_key's access rank to 1500:

CREATE PROJECTION retail_sales_fact_p (
     store_key ENCODING RLE ACCESSRANK 1500,
     pos_transaction_number ENCODING RLE,
     sales_dollar_amount,
     cost_dollar_amount )
AS SELECT
     store_key,
     pos_transaction_number,
     sales_dollar_amount,
     cost_dollar_amount
FROM store.store_sales_fact
ORDER BY store_key
SEGMENTED BY HASH(pos_transaction_number) ALL NODES;