Projections for queries with predicates
If your query contains one or more predicates, you can modify the projections to improve the query's performance, as described in the following two examples.
Queries that use date ranges
This example shows how to encode data using RLE and change the projection sort order to improve the performance of a query that retrieves all data within a given date range.
Suppose you have a query that looks like this:
=> SELECT * FROM trades
WHERE trade_date BETWEEN '2016-11-01' AND '2016-12-01';
To optimize this query, determine whether all of the projections can perform the SELECT operation in a timely manner. Run SELECT COUNT(*) statement for each projection, specifying the date range, and note the response time. For example:
=> SELECT COUNT(*) FROM [ projection_name ]
WHERE trade_date BETWEEN '2016-11-01' AND '2016-12-01;
If one or more of the queries is slow, check the uniqueness of the trade_date
column and determine if it needs to be in the projection’s ORDER BY clause and/or can be encoded using RLE. RLE replaces sequences of the same data values within a column by a pair that represents the value and a count. For best results, order the columns in the projection from lowest cardinality to highest cardinality, and use RLE to encode the data in low-cardinality columns.
Note
For an example of using sorting and RLE, see Choosing sort order: best practices.If the number of unique columns is unsorted, or if the average number of repeated rows is less than 10, trade_date
is too close to being unique and cannot be encoded using RLE. In this case, add a new column to minimize the search scope.
The following example adds a new column trade_year
:
-
Determine if the new column
trade_year
returns a manageable result set. The following query returns the data grouped bytrade_year
:=> SELECT DATE_TRUNC('trade_year', trade_date), COUNT(*) FROM trades GROUP BY DATE_TRUNC('trade_year',trade_date);
-
Assuming that
trade_year = 2007
is near 8k, add a column fortrade_year
to thetrades
table. TheSELECT
statement then becomes:=> SELECT * FROM trades WHERE trade_year = 2007 AND trade_date BETWEEN '2016-11-01' AND '2016-12-01';
As a result, you have a projection that is sorted on
trade_year
, which can be encoded using RLE.
Queries for tables with a high-cardinality primary key
This example demonstrates how you can modify the projection to improve the performance of queries that select data from a table with a high-cardinality primary key.
Suppose you have the following query:
=> SELECT FROM [table]
WHERE pk IN (12345, 12346, 12347,...);
Because the primary key is a high-cardinality column, Vertica has to search a large amount of data.
To optimize the schema for this query, create a new column named buckets
and assign it the value of the primary key divided by 10000. In this example, buckets=(int) pk/10000
. Use the buckets
column to limit the search scope as follows:
=> SELECT FROM [table]
WHERE buckets IN (1,...)
AND pk IN (12345, 12346, 12347,...);
Creating a lower cardinality column and adding it to the query limits the search scope and improves the query performance. In addition, if you create a projection where buckets
is first in the sort order, the query may run even faster.