Best practices for spatial joins

Use these best practices to improve overall performance and optimize your spatial queries.

Use these best practices to improve overall performance and optimize your spatial queries.

Best practices for using spatial joins in Vertica include:

  • Table segmentation to speed up index creation

  • Adequately sizing a geometry column to store point data

  • Loading Well-Known Text (WKT) directly into a Geometry column, using STV_GeometryPoint in a COPY statement

  • Using OVER (PARTITION BEST) with STV_Intersect transform queries

Best practices example

Before performing the steps in the following example, download place_output.csv.zip from the Vertica Place GitHub repository (https://github.com/vertica/Vertica-Geospatial). You need to use the data set from this repository.

  1. Create the table for the polygons. Use a GEOMETRY column width that fits your data without being excessively large. A good column-width fit improves performance. In addition, segmenting the table by HASH provides the advantages of parallel computation.

    => CREATE TABLE artworks (gid int, g GEOMETRY(700)) SEGMENTED BY HASH(gid) ALL NODES;
    
  2. Use a copy statement with ST_Buffer to create and load the polygons on which to run the intersect. By using ST_Buffer in your copy statement, you can use that function to create the polygons.

    => COPY artworks(gid, gx FILLER LONG VARCHAR, g AS ST_Buffer(ST_GeomFromText(gx),8)) FROM STDIN DELIMITER ',';
    >> 1, POINT(10 45)
    >> 2, POINT(25 45)
    >> 3, POINT(35 45)
    >> 4, POINT(35 15)
    >> 5, POINT(30 5)
    >> 6, POINT(15 5)
    >> \.
    
  3. Create a table for the location data, represented by points. You can store point data in a GEOMETRY column of 100 bytes. Avoid over-fitting your GEOMETRY column. Doing so can significantly degrade spatial intersection performance. Also, segment this table by HASH, to take advantage of parallel computation.

    => CREATE TABLE usr_data (gid identity, usr_id int, date_time timestamp, g GEOMETRY(100))
         SEGMENTED BY HASH(gid) ALL NODES;
    
  4. During the copy statement, transform the raw location data to GEOMETRY data. You must perform this transformation because your location data needs to use the GEOMETRY data type. Use the function STV_GeometryPoint to transform the x and y columns of the source table.

    => COPY usr_data (usr_id, date_time, x FILLER LONG VARCHAR,
                      y FILLER LONG VARCHAR, g AS STV_GeometryPoint(x, y))
       FROM LOCAL 'place_output.csv' DELIMITER ',' ENCLOSED BY '';
    
  5. Create the spatial index for the polygons. This index helps you speed up intersection calculations.

    => SELECT STV_Create_Index(gid, g USING PARAMETERS index='art_index', overwrite=true) OVER() FROM artworks;
    
  6. Write an analytic query that returns the number of intersections per polygon. Specify that Vertica ignore any usr_id that intersects less than 20 times with a given polygon.

    => SELECT pol_gid,
           COUNT(DISTINCT(usr_id)) AS count_user_visit
       FROM
         (SELECT pol_gid,
           usr_id,
           COUNT(usr_id) AS user_points_in
        FROM
           (SELECT STV_Intersect(usr_id, g USING PARAMETERS INDEX='art_index') OVER(PARTITION BEST) AS (usr_id,
                                                            pol_gid)
        FROM usr_data
          WHERE date_time BETWEEN '2014-07-02 09:30:20' AND '2014-07-02 17:05:00') AS c
        GROUP BY pol_gid,
        usr_id HAVING COUNT(usr_id) > 20) AS real_visits
        GROUP BY pol_gid
        ORDER BY count_user_visit DESC;
    

Optimizations in the example query

This query has the following optimizations:

  • The time predicated appears in the subquery.

  • Using the location data table avoids the need for an expensive join.

  • The query uses OVER (PARTITION BEST), to improve performance by partitioning the data.

  • The user_points_in provides an estimate of the combined time spent intersecting with the artwork by all visitors.