Best practices for spatial joins
Use these best practices to improve overall performance and optimize your spatial queries.
Best practices for using spatial joins in Vertica include:
-
Table segmentation to speed up index creation
-
Adequately sizing a geometry column to store point data
-
Loading Well-Known Text (WKT) directly into a Geometry column, using STV_GeometryPoint in a COPY statement
-
Using OVER (PARTITION BEST) with STV_Intersect transform queries
Best practices example
Note
The following example was originally published in a Vertica blog post about using spatial data in museums. To read the entire blog, see Using Location Data with Vertica PlaceBefore performing the steps in the following example, download place_output.csv.zip
from the Vertica Place GitHub repository (https://github.com/vertica/Vertica-Geospatial). You need to use the data set from this repository.
-
Create the table for the polygons. Use a GEOMETRY column width that fits your data without being excessively large. A good column-width fit improves performance. In addition, segmenting the table by HASH provides the advantages of parallel computation.
=> CREATE TABLE artworks (gid int, g GEOMETRY(700)) SEGMENTED BY HASH(gid) ALL NODES;
-
Use a copy statement with ST_Buffer to create and load the polygons on which to run the intersect. By using ST_Buffer in your copy statement, you can use that function to create the polygons.
=> COPY artworks(gid, gx FILLER LONG VARCHAR, g AS ST_Buffer(ST_GeomFromText(gx),8)) FROM STDIN DELIMITER ','; >> 1, POINT(10 45) >> 2, POINT(25 45) >> 3, POINT(35 45) >> 4, POINT(35 15) >> 5, POINT(30 5) >> 6, POINT(15 5) >> \.
-
Create a table for the location data, represented by points. You can store point data in a GEOMETRY column of 100 bytes. Avoid over-fitting your GEOMETRY column. Doing so can significantly degrade spatial intersection performance. Also, segment this table by HASH, to take advantage of parallel computation.
=> CREATE TABLE usr_data (gid identity, usr_id int, date_time timestamp, g GEOMETRY(100)) SEGMENTED BY HASH(gid) ALL NODES;
-
During the copy statement, transform the raw location data to GEOMETRY data. You must perform this transformation because your location data needs to use the GEOMETRY data type. Use the function STV_GeometryPoint to transform the x and y columns of the source table.
=> COPY usr_data (usr_id, date_time, x FILLER LONG VARCHAR, y FILLER LONG VARCHAR, g AS STV_GeometryPoint(x, y)) FROM LOCAL 'place_output.csv' DELIMITER ',' ENCLOSED BY '';
-
Create the spatial index for the polygons. This index helps you speed up intersection calculations.
=> SELECT STV_Create_Index(gid, g USING PARAMETERS index='art_index', overwrite=true) OVER() FROM artworks;
-
Write an analytic query that returns the number of intersections per polygon. Specify that Vertica ignore any
usr_id
that intersects less than 20 times with a given polygon.=> SELECT pol_gid, COUNT(DISTINCT(usr_id)) AS count_user_visit FROM (SELECT pol_gid, usr_id, COUNT(usr_id) AS user_points_in FROM (SELECT STV_Intersect(usr_id, g USING PARAMETERS INDEX='art_index') OVER(PARTITION BEST) AS (usr_id, pol_gid) FROM usr_data WHERE date_time BETWEEN '2014-07-02 09:30:20' AND '2014-07-02 17:05:00') AS c GROUP BY pol_gid, usr_id HAVING COUNT(usr_id) > 20) AS real_visits GROUP BY pol_gid ORDER BY count_user_visit DESC;
Optimizations in the example query
This query has the following optimizations:
-
The time predicated appears in the subquery.
-
Using the location data table avoids the need for an expensive join.
-
The query uses OVER (PARTITION BEST), to improve performance by partitioning the data.
-
The
user_points_in
provides an estimate of the combined time spent intersecting with the artwork by all visitors.