So, you have granular event data.. and a user identifier.. and a timestamp for each event. Congratulations!
You have ultimate freedom in re-thinking the concept of a
In this post, I’ll show you the SQL code to make this possible. So you don’t have to worry about that.
But first: think. And forget about all you know about sessions in GA, and read this little story.
From Paris with Love
Imagine.. You’re in Paris, and decide to go to the famous Galeries Lafayette, the grand department store.
- You walk in
- You walk to the third floor
- You look at the Hugo Boss products
- You start crying a bit because you can’t afford that lovely sweater
- You go to the restroom to look in the mirror
- You go back and buy a pair of Hugo Boss socks to feel better
- Then you walk to the sixth floor
- You browse through the souvenirs
- You take the escalator
- And back you are, on the street
Now.. think of how this customer journey can be described in terms of sessions (or: visits)
So much visits..
The event log above contains a lot of visits. To name a few:
- the visit to Gallery Lafayette
- 2 visits to a floor (not considering the ground floor)
- the 3rd floor
- the 6th floor
- a visit to the restroom
That’s just like.. your opinion, man
The above visits can be considered objective in a way.
Enter a space – leave a space.
But this is where it can get murky. Look at the visit to the Hugo Boss store.
- you entered it once
- then you left for a toilet break
- then you came back
Two visits? Or one?
Custom Logic – prepare to SQL
The trick in SQL-session modeling is to decide two things
- what events to include
- what event ends a session
In the case of the Lafayette store visit, let’s model floor visits. And do it like this:
- include all events that happen on a store floor
- the session ends when the floor number changes
Prepare the data
We need the event, user (to group by), time, and floor_number.
We need to define “floor change”, so we derive that – in a window function, using the
select user, event, event_number, time, floor_number, lead(floor_number) over (partition by user order by time asc) as next_event_floor from paris where floor_number is not null
Event numbe is left in for your convenience, so you can look it up above.
Here’s the results. Look how close we are to sessions.
Now for the next step
Looking at the table, you can see that the second to last row marks the end of the first floor session.
If the next_event_floor is not the floor_number, then we count that row, and all the preceding rows as having the same session.
There’s a window function for that. See below (note that I select from the result table of the previous query:
select user, event, event_number, time, floor_number, 1 + countif(next_event_floor != floor_number) over (partition by user order by time asc rows between unbounded preceding and 1 preceding) as session_number from prep
And behold the output:
Sessionize it differently
Imagine you want to add an extra condition to your session
- the session ends when the floor changes
- the session also ends when there is a timeout of 5 minutes. Note that the restroom break will now break the session.
Then you can:
- calculate the timeout (in the “prep” query, using a time diff function)
- use that in your session-end logic
with prep as( select user, event, event_number, time, floor_number, lead(floor_number) over (partition by user order by time asc) as next_event_floor, timestamp_diff( lead(time) over (partition by user order by time asc), time), second) as time_to_next_event from paris where floor_number is not null ) select user, event, event_number, time, floor_number, 1 + countif(next_event_floor != floor_number or time_to_next_event > 600) over (partition by user order by time asc rows between unbounded preceding and 1 preceding) as session_number from prep
And the results are 3 sessions./
The possibilities are endless.
Now. Enough rambling, let’s fire up a Big Query Query and use your GA4 data to do some sessionisation.
Big Query and GA4 example for you to use
Let’s model sessions, as defined by only 1 criterium: a 600 second event silence means a session end.
with prep as ( select user_pseudo_id, event_date as date, timestamp_micros(event_timestamp) as event_timestamp, event_name, lead(event_name) over (partition by user_pseudo_id order by event_timestamp) as next_event, timestamp_diff( lead(timestamp_micros(event_timestamp)) over (partition by user_pseudo_id order by event_timestamp), timestamp_micros(event_timestamp), second) as time_to_next_event from `yourproject.analytics_viewid.events_202011*` -- change this to your table -- no WHERE clause, every event counts ) ,sessionize as( select user_pseudo_id, event_timestamp, event_name, 1 + countif(time_to_next_event is null or time_to_next_event > 300) over (partition by user_pseudo_id order by event_timestamp asc rows between unbounded preceding and 1 preceding) as session_number, time_to_next_event from prep ) -- summarize per session select user_pseudo_id ||'-'|| session_number as session_id, min(event_timestamp) as first_event, max(event_timestamp) as last_event, timestamp_diff(max(event_timestamp), min(event_timestamp), second) as session_duration from sessionize group by 1
Now.. note that I concatenate the user_pseudo_id and session_number to generate some session id of my choice.
I can generate loads of session ids like this, based on numerous criteria.
Stand up, and query
Now it’s your turn. Fire up BQ, and sessionize your events.
But beware.. do not use stupid session-endings, like
- payment gateway referrers or referral exclusions
- arbitrary pausing (like I did)
For your first query, you say Timeout of 5 minutes = new session. But the query states > 600, that would be more than 10 minutes, not more than 5 (300 seconds). Or am I missing something here?
You’re right, it should be 300. Thanks for reading so carefully.
Fixing it now.
Thank you for the great article
Than you both :)
This saved me a world of time. Thank you for the super clear explanation and real-life examples!