Partitioning options for processing local data

UDTFs typically process data that is partitioned in a specific way.

UDTFs typically process data that is partitioned in a specific way. For example, a UDTF that processes a web server log file to count the number of hits referred by each partner web site needs to have its input partitioned by a referrer column. Each instance of the UDTF sees the hits referred by a particular partner site so it can count them.

In cases like this, the window partitioning clause should use a PARTITION BY clause. Each node in the cluster partitions the data it stores, sends some of these partitions off to other nodes, and then consolidates the partitions it receives from other nodes and runs an instance of the UDTF to process them.

In other cases, a UDTF might not need to partition input data in a particular way—for example, a UDTF that parses data out of an Apache log file. In this case, you can specify that each UDTF instance process only the data that is stored locally by the node on which it is running. By eliminating the overhead of partitioning data, processing can be much more efficient.

You can tell a UDTF to process only local data with one of the following window partitioning options:

  • PARTITION BEST: For thread-safe UDTFs only, optimizes performance through multi-threaded queries across multiple nodes.

  • PARTITION NODES: Optimizes performance of single-threaded queries across multiple nodes.

The query must specify a source table that is replicated across all nodes and contains a single row (similar to the DUAL table). For example, the following statements call a UDTF that parses locally-stored Apache log files:

=> CREATE TABLE rep (dummy INTEGER) UNSEGMENTED ALL NODES;
CREATE TABLE
=> INSERT INTO rep VALUES (1);
 OUTPUT
--------
      1
(1 row)
=> SELECT ParseLogFile('/data/apache/log*') OVER (PARTITION BEST) FROM rep;