Projections for queries with predicates

If your query contains one or more predicates, you can modify the projections to improve the query's performance, as described in the following two examples.

Queries that use date ranges

This example shows how to encode data using RLE and change the projection sort order to improve the performance of a query that retrieves all data within a given date range.

Suppose you have a query that looks like this:

=> SELECT * FROM trades
   WHERE trade_date BETWEEN '2016-11-01' AND '2016-12-01';

To optimize this query, determine whether all of the projections can perform the SELECT operation in a timely manner. Run SELECT COUNT(*) statement for each projection, specifying the date range, and note the response time. For example:

=> SELECT COUNT(*) FROM [ projection_name ]
   WHERE trade_date BETWEEN '2016-11-01' AND '2016-12-01;

If one or more of the queries is slow, check the uniqueness of the trade_date column and determine if it needs to be in the projection’s ORDER BY clause and/or can be encoded using RLE. RLE replaces sequences of the same data values within a column by a pair that represents the value and a count. For best results, order the columns in the projection from lowest cardinality to highest cardinality, and use RLE to encode the data in low-cardinality columns.

Note

For an example of using sorting and RLE, see Choosing sort order: best practices.

If the number of unique columns is unsorted, or if the average number of repeated rows is less than 10, trade_date is too close to being unique and cannot be encoded using RLE. In this case, add a new column to minimize the search scope.

The following example adds a new column trade_year:

Determine if the new column trade_year returns a manageable result set. The following query returns the data grouped by trade_year:
```
=> SELECT DATE_TRUNC('trade_year', trade_date), COUNT(*)
   FROM trades
   GROUP BY DATE_TRUNC('trade_year',trade_date);
```
Assuming that trade_year = 2007 is near 8k, add a column for trade_year to the trades table. The SELECT statement then becomes:
```
=> SELECT * FROM trades
   WHERE trade_year = 2007
   AND trade_date BETWEEN '2016-11-01' AND '2016-12-01';
```
As a result, you have a projection that is sorted on trade_year, which can be encoded using RLE.

Queries for tables with a high-cardinality primary key

This example demonstrates how you can modify the projection to improve the performance of queries that select data from a table with a high-cardinality primary key.

Suppose you have the following query:

=> SELECT FROM [table]
   WHERE pk IN (12345, 12346, 12347,...);

Because the primary key is a high-cardinality column, Vertica has to search a large amount of data.

To optimize the schema for this query, create a new column named buckets and assign it the value of the primary key divided by 10000. In this example, buckets=(int) pk/10000. Use the buckets column to limit the search scope as follows:

=> SELECT FROM [table]
   WHERE buckets IN (1,...)
   AND pk IN (12345, 12346, 12347,...);

Creating a lower cardinality column and adding it to the query limits the search scope and improves the query performance. In addition, if you create a projection where buckets is first in the sort order, the query may run even faster.