Sessionization with event-based windows
Sessionization, a special case of event-based windows, is a feature often used to analyze click streams, such as identifying web browsing sessions from recorded web clicks.
In Vertica, given an input clickstream table, where each row records a Web page click made by a particular user (or IP address), the sessionization computation attempts to identify Web browsing sessions from the recorded clicks by grouping the clicks from each user based on the time-intervals between the clicks. If two clicks from the same user are made too far apart in time, as defined by a time-out threshold, the clicks are treated as though they are from two different browsing sessions.
Example Schema The examples in this topic use the following WebClicks schema to represent a simple clickstream table:
CREATE TABLE WebClicks(userId INT, timestamp TIMESTAMP);
INSERT INTO WebClicks VALUES (1, '2009-12-08 3:00:00 pm');
INSERT INTO WebClicks VALUES (1, '2009-12-08 3:00:25 pm');
INSERT INTO WebClicks VALUES (1, '2009-12-08 3:00:45 pm');
INSERT INTO WebClicks VALUES (1, '2009-12-08 3:01:45 pm');
INSERT INTO WebClicks VALUES (2, '2009-12-08 3:02:45 pm');
INSERT INTO WebClicks VALUES (2, '2009-12-08 3:02:55 pm');
INSERT INTO WebClicks VALUES (2, '2009-12-08 3:03:55 pm');
COMMIT;
The input table WebClicks
contains the following rows:
=> SELECT * FROM WebClicks;
userId | timestamp
--------+---------------------
1 | 2009-12-08 15:00:00
1 | 2009-12-08 15:00:25
1 | 2009-12-08 15:00:45
1 | 2009-12-08 15:01:45
2 | 2009-12-08 15:02:45
2 | 2009-12-08 15:02:55
2 | 2009-12-08 15:03:55
(7 rows)
In the following query, sessionization performs computation on the SELECT list columns, showing the difference between the current and previous timestamp value using LAG()
. It evaluates to true and increments the window ID when the difference is greater than 30 seconds.
=> SELECT userId, timestamp,
CONDITIONAL_TRUE_EVENT(timestamp - LAG(timestamp) > '30 seconds')
OVER(PARTITION BY userId ORDER BY timestamp) AS session FROM WebClicks;
userId | timestamp | session
--------+---------------------+---------
1 | 2009-12-08 15:00:00 | 0
1 | 2009-12-08 15:00:25 | 0
1 | 2009-12-08 15:00:45 | 0
1 | 2009-12-08 15:01:45 | 1
2 | 2009-12-08 15:02:45 | 0
2 | 2009-12-08 15:02:55 | 0
2 | 2009-12-08 15:03:55 | 1
(7 rows)
In the output, the session column contains the window ID from the CONDITIONAL_TRUE_EVENT function. The window ID evaluates to true on row 4 (timestamp 15:01:45), and the ID that follows row 4 is zero because it is the start of a new partition (for user ID 2), and that row does not evaluate to true until the last line in the output.
You might want to give users different time-out thresholds. For example, one user might have a slower network connection or be multi-tasking, while another user might have a faster connection and be focused on a single Web site, doing a single task.
To compute an adaptive time-out threshold based on the last 2 clicks, use CONDITIONAL_TRUE_EVENT with LAG to return the average time between the last 2 clicks with a grace period of 3 seconds:
=> SELECT userId, timestamp, CONDITIONAL_TRUE_EVENT(timestamp - LAG(timestamp) >
(LAG(timestamp, 1) - LAG(timestamp, 3)) / 2 + '3 seconds')
OVER(PARTITION BY userId ORDER BY timestamp) AS session
FROM WebClicks;
userId | timestamp | session
--------+---------------------+---------
2 | 2009-12-08 15:02:45 | 0
2 | 2009-12-08 15:02:55 | 0
2 | 2009-12-08 15:03:55 | 0
1 | 2009-12-08 15:00:00 | 0
1 | 2009-12-08 15:00:25 | 0
1 | 2009-12-08 15:00:45 | 0
1 | 2009-12-08 15:01:45 | 1
(7 rows)