Spatial joins allow you to identify spatial relationships between two sets of spatial data. For example, you can use spatial joins to:
Calculate the density of mobile calls in various regions to determine the location of a new cell phone tower.
Identify homes that fall within the impact zone of a hurricane.
Calculate the number of users who live within a certain ZIP code.
Calculate the number of customers in a retail store at any given time.
1 - Best practices for spatial joins
Use these best practices to improve overall performance and optimize your spatial queries.
Use these best practices to improve overall performance and optimize your spatial queries.
Best practices for using spatial joins in Vertica include:
Table segmentation to speed up index creation
Adequately sizing a geometry column to store point data
Loading Well-Known Text (WKT) directly into a Geometry column, using STV_GeometryPoint in a COPY statement
Using OVER (PARTITION BEST) with STV_Intersect transform queries
Best practices example
Note
The following example was originally published in a Vertica blog post about using spatial data in museums. To read the entire blog, see Using Location Data with Vertica Place
Before performing the steps in the following example, download place_output.csv.zip from the Vertica Place GitHub repository (https://github.com/vertica/Vertica-Geospatial). You need to use the data set from this repository.
Create the table for the polygons. Use a GEOMETRY column width that fits your data without being excessively large. A good column-width fit improves performance. In addition, segmenting the table by HASH provides the advantages of parallel computation.
=> CREATE TABLE artworks (gid int, g GEOMETRY(700)) SEGMENTED BY HASH(gid) ALL NODES;
Use a copy statement with ST_Buffer to create and load the polygons on which to run the intersect. By using ST_Buffer in your copy statement, you can use that function to create the polygons.
=> COPY artworks(gid, gx FILLER LONG VARCHAR, g AS ST_Buffer(ST_GeomFromText(gx),8)) FROM STDIN DELIMITER ',';
>> 1, POINT(10 45)
>> 2, POINT(25 45)
>> 3, POINT(35 45)
>> 4, POINT(35 15)
>> 5, POINT(30 5)
>> 6, POINT(15 5)
>> \.
Create a table for the location data, represented by points. You can store point data in a GEOMETRY column of 100 bytes. Avoid over-fitting your GEOMETRY column. Doing so can significantly degrade spatial intersection performance. Also, segment this table by HASH, to take advantage of parallel computation.
=> CREATE TABLE usr_data (gid identity, usr_id int, date_time timestamp, g GEOMETRY(100))
SEGMENTED BY HASH(gid) ALL NODES;
During the copy statement, transform the raw location data to GEOMETRY data. You must perform this transformation because your location data needs to use the GEOMETRY data type. Use the function STV_GeometryPoint to transform the x and y columns of the source table.
=> COPY usr_data (usr_id, date_time, x FILLER LONG VARCHAR,
y FILLER LONG VARCHAR, g AS STV_GeometryPoint(x, y))
FROM LOCAL 'place_output.csv' DELIMITER ',' ENCLOSED BY '';
Create the spatial index for the polygons. This index helps you speed up intersection calculations.
=> SELECT STV_Create_Index(gid, g USING PARAMETERS index='art_index', overwrite=true) OVER() FROM artworks;
Write an analytic query that returns the number of intersections per polygon. Specify that Vertica ignore any usr_id that intersects less than 20 times with a given polygon.
=> SELECT pol_gid,
COUNT(DISTINCT(usr_id)) AS count_user_visit
FROM
(SELECT pol_gid,
usr_id,
COUNT(usr_id) AS user_points_in
FROM
(SELECT STV_Intersect(usr_id, g USING PARAMETERS INDEX='art_index') OVER(PARTITION BEST) AS (usr_id,
pol_gid)
FROM usr_data
WHERE date_time BETWEEN '2014-07-02 09:30:20' AND '2014-07-02 17:05:00') AS c
GROUP BY pol_gid,
usr_id HAVING COUNT(usr_id) > 20) AS real_visits
GROUP BY pol_gid
ORDER BY count_user_visit DESC;
Optimizations in the example query
This query has the following optimizations:
The time predicated appears in the subquery.
Using the location data table avoids the need for an expensive join.
The query uses OVER (PARTITION BEST), to improve performance by partitioning the data.
The user_points_in provides an estimate of the combined time spent intersecting with the artwork by all visitors.
2 - Ensuring polygon validity before creating or refreshing an index
When Vertica creates or updates a spatial index it does not check polygon validity.
When Vertica creates or updates a spatial index it does not check polygon validity. To prevent getting invalid results when you query your spatial index, you should check the validity of your polygons prior to creating or updating your spatial index.
The following example shows you how to check the validity of polygons.
Create a table and load spatial data.
=> CREATE TABLE polygon_validity_test (gid INT, geom GEOMETRY);
CREATE TABLE
=> COPY polygon_validity_test (gid, gx FILLER LONG VARCHAR, geom AS St_GeomFromText(gx)) FROM STDIN;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 2|POLYGON((-31 74,8 70,8 50,-36 53,-31 74))
>> 3|POLYGON((-38 50,4 13,11 45,0 65,-38 50))
>> 4|POLYGON((-12 42,-12 42,27 48,14 26,-12 42))
>> 5|POLYGON((0 0,1 1,0 0,2 1,1 1,0 0))
>> 6|POLYGON((3 3,2 2,2 1,2 3,3 3))
>> \.
Use ST_IsValid and STV_IsValidReason to find any invalid polygons.
=> SELECT gid, ST_IsValid(geom), STV_IsValidReason(geom) FROM polygon_validity_test;
gid | ST_IsValid | STV_IsValidReason
-----+------------+------------------------------------------
4 | t |
6 | f | Self-intersection at or near POINT (2 1)
2 | t |
3 | t |
5 | f | Self-intersection at or near POINT (0 0)
(5 rows)
Now that we have identifed the invalid polygons in our table, there are a couple different ways you can handle the invalid polygons when creating or refreshing a spatial index.
Filtering invalid polygons using a WHERE clause
This method is slower than filtering before creating an index because it checks the validity of each polygon at execution time.
The following example shows you how to exclude invalid polygons using a WHERE clause.
```
=> SELECT STV_Create_Index(gid, geom USING PARAMETERS index = 'valid_polygons') OVER()
FROM polygon_validity_test
WHERE ST_IsValid(geom) = 't';
```
Filtering invalid polygons before creating or refreshing an index
This method is faster than filtering using a WHERE clause because you incur the performance cost prior to building the index.
The following example shows you how to exclude invalid polygons by creating a new table excluding invalid polygons.
```
=> CREATE TABLE polygon_validity_clean AS
SELECT *
FROM polygon_validity_test
WHERE ST_IsValid(geom) = 't';
CREATE TABLE
=> SELECT STV_Create_Index(gid, geom USING PARAMETERS index = 'valid_polygons') OVER()
FROM polygon_validity_clean;
```
3 - STV_Intersect: scalar function vs. transform function
The STV_Intersect functions are similar in purpose, but you use them differently.
The STV_Intersect functions are similar in purpose, but you use them differently.
STV_Intersect Function Type
Description
Performance
Scalar
Matches a point to a polygon. If several polygons contain the point, this function returns a gid value. The result is a polygon gid or, if no polygon contains the point, the result is NULL.
Eliminates points that do not intersect with any indexed polygons, avoiding unnecessary comparisons.
Transform
Matches a point to all the polygons that contain it. When a point does not intersect with any polygon in the index, the function returns no rows.
Processes all input points regardless of whether or not they intersect with the indexed polygons.
In the following example, the STV_Intersect scalar function compares the points in the points table to the polygons in a spatial index named my_polygons. STV_Intersect returns all points and polygons that match exactly:
=> SELECT gid AS pt_gid
STV_Intersect(geom USING PARAMETERS index='my_polygons') AS pol_gid
FROM points ORDER BY pt_gid;
pt_gid | pol_gid
--------+---------
100 | 2
101 |
102 | 2
103 |
104 |
105 | 3
106 |
107 |
(8 rows)
The following example shows how to use the STV_Intersect transform function to return information about the three point-polygon pairs that match and each of the polygons they match:
=> SELECT STV_Intersect(gid, geom
USING PARAMETERS index='my_polygons')
OVER (PARTITION BEST) AS (pt_gid, pol_id)
FROM points;
pt_gid | pol_id
--------+--------
100 | 1
100 | 2
100 | 3
102 | 2
105 | 3
(3 rows)
4 - Performing spatial joins with STV_Intersect functions
Suppose you want to process a medium-to-large spatial data set and determine which points intersect with which polygons.
Suppose you want to process a medium-to-large spatial data set and determine which points intersect with which polygons. In that case, first create a spatial index using STV_Create_Index. A spatial index provides efficient access to the set of polygons.
Then, use the STV_Intersect scalar or transform function to identify which point-polygon pairs match.
4.1 - Spatial indexes and STV_Intersect
Before performing a spatial join using one of the STV_Intersect functions, you must first run STV_Create_Index to create a database object that contains information about polygons.
Before performing a spatial join using one of the STV_Intersect functions, you must first run STV_Create_Index to create a database object that contains information about polygons. This object is called a spatial index of the set of polygons. The spatial index improves the time it takes for the STV_Intersect functions to access the polygon data.
Vertica creates spatial indexes in a global space. Thus, any user with access to the STV_*_Index functions can describe, rename, or drop indexes created by any other user.
Vertica provides functions that work with spatial indexes:
STV_Create_Index—Stores information about polygons in an index to improve performance.
Vertica provides two capabilities to identify whether a set of points intersect with a set of polygons.
Vertica provides two capabilities to identify whether a set of points intersect with a set of polygons. Depending on the size of your data set, choose the approach that gives the best performance:
When comparing a set of geometries to a single geometry to see if they intersect, use the ST_Intersects function.
To determine if a set of points intersects with a set of polygons in a medium-to-large data set, first create a spatial index using STV_Create_Index. Then, use one of the STV_Intersect functions to return the set of pairs that intersect.
Note
You can only perform spatial joins on GEOMETRY data.
5.1 - Performing spatial joins with ST_Intersects
The ST_Intersects function determines if two GEOMETRY objects intersect or touch at a single point.
The ST_Intersects function determines if two GEOMETRY objects intersect or touch at a single point.
Use ST_Intersects when you want to identify if a small set of geometries in a column intersect with a given geometry.
Example
The following example uses ST_Intersects to compare a column of point geometries to a single polygon. The table that contains the points has 1 million rows.
ST_Intersects returns only the points that intersect with the polygon. Those points represent about 0.01% of the points in the table:
=> CREATE TABLE points_1m(gid IDENTITY, g GEOMETRY(100)) ORDER BY g;
=> COPY points_1m(wkt FILLER LONG VARCHAR(100), g AS ST_GeomFromText(wkt))
FROM LOCAL '/data/points.dat';
Rows Loaded
-------------
1000000
(1 row)
=> SELECT ST_AsText(g) FROM points_1m WHERE
ST_Intersects
(
g,
ST_GeomFromText('POLYGON((-71 42, -70.9 42, -70.9 42.1, -71 42.1, -71 42))')
);
st_astext
----------------------------
POINT (-70.97532 42.03538)
POINT (-70.97421 42.0376)
POINT (-70.99004 42.07538)
POINT (-70.99477 42.08454)
POINT (-70.99088 42.08177)
POINT (-70.98643 42.07593)
POINT (-70.98032 42.07982)
POINT (-70.95921 42.00982)
POINT (-70.95115 42.02177)
...
(116 rows)
Vertica recommends that you test the intersections of two columns of geometries by creating a spatial index. Use one of the STV_Intersect functions as described in STV_Intersect: scalar function vs. transform function.