Stuifbergen.com

Analytics & Growth Hacking

  • About
  • Contact
  • Professional Services

Find me here

  • Mastodon (social)
  • Mastodon (measure)
  • No more Twitter
You are here: Home / Blog / Session modeling in BigQuery SQL (using GA4 data and a trip to Paris)

Session modeling in BigQuery SQL (using GA4 data and a trip to Paris)

13 November 2020 by jules 5 Comments

So, you have granular event data.. and a user identifier.. and a timestamp for each event. Congratulations!

You have ultimate freedom in re-thinking the concept of a session now.

In this post, I’ll show you the SQL code to make this possible. So you don’t have to worry about that.

But first: think. And forget about all you know about sessions in GA, and read this little story.

From Paris with Love

Paris

Imagine.. You’re in Paris, and decide to go to the famous Galeries Lafayette, the grand department store.

This happens:

  1. You walk in
  2. You walk to the third floor
  3. You look at the Hugo Boss products
  4. You start crying a bit because you can’t afford that lovely sweater
  5. You go to the restroom to look in the mirror
  6. You go back and buy a pair of Hugo Boss socks to feel better
  7. Then you walk to the sixth floor
  8. You browse through the souvenirs
  9. You take the escalator
  10. And back you are, on the street

Now.. think of how this customer journey can be described in terms of sessions (or: visits)

So much visits..

The event log above contains a lot of visits. To name a few:

  • the visit to Gallery Lafayette
  • 2 visits to a floor (not considering the ground floor)
    • the 3rd floor
    • the 6th floor
  • a visit to the restroom

That’s just like.. your opinion, man

The above visits can be considered objective in a way.

Enter a space – leave a space.

But this is where it can get murky. Look at the visit to the Hugo Boss store.

  • you entered it once
  • then you left for a toilet break
  • then you came back

Two visits? Or one?

You decide!

Custom Logic – prepare to SQL

The trick in SQL-session modeling is to decide two things

  • what events to include
  • what event ends a session

In the case of the Lafayette store visit, let’s model floor visits. And do it like this:

  • include all events that happen on a store floor
  • the session ends when the floor number changes

Prepare the data

We need the event, user (to group by), time, and floor_number.

We need to define “floor change”, so we derive that – in a window function, using the lead() function.

select
   user, event, event_number, time, floor_number,
   lead(floor_number) over (partition by user order by time asc) as next_event_floor
from paris
where floor_number is not null

Event numbe is left in for your convenience, so you can look it up above.

Here’s the results. Look how close we are to sessions.

Now for the next step

Sessionize it

Looking at the table, you can see that the second to last row marks the end of the first floor session.

If the next_event_floor is not the floor_number, then we count that row, and all the preceding rows as having the same session.

There’s a window function for that. See below (note that I select from the result table of the previous query:

select
   user, event, event_number, time, floor_number,
   1 + countif(next_event_floor != floor_number) over
       (partition by user order by time asc
        rows between unbounded preceding and 1 preceding) as session_number
from prep

And behold the output:

Sessionize it differently

Imagine you want to add an extra condition to your session

  • the session ends when the floor changes
  • the session also ends when there is a timeout of 5 minutes. Note that the restroom break will now break the session.

Then you can:

  • calculate the timeout (in the “prep” query, using a time diff function)
  • use that in your session-end logic

See:

with prep as(
  select
    user, event, event_number, time, floor_number,
    lead(floor_number) over (partition by user order by time asc) as next_event_floor,
    timestamp_diff( lead(time) over (partition by user order by time asc), time), second) as time_to_next_event
  from paris
  where floor_number is not null
)

select
   user, event, event_number, time, floor_number,
   1 + countif(next_event_floor != floor_number or time_to_next_event > 600) over
       (partition by user order by time asc
        rows between unbounded preceding and 1 preceding) as session_number
from prep

And the results are 3 sessions./

The possibilities are endless.

Now. Enough rambling, let’s fire up a Big Query Query and use your GA4 data to do some sessionisation.

Big Query and GA4 example for you to use

Let’s model sessions, as defined by only 1 criterium: a 600 second event silence means a session end.

Let’s go:

  
with
prep as (
select
  user_pseudo_id,
  event_date as date,
  timestamp_micros(event_timestamp) as event_timestamp,
  event_name,
  lead(event_name) over (partition by user_pseudo_id order by event_timestamp) as next_event,
  timestamp_diff( lead(timestamp_micros(event_timestamp)) over (partition by user_pseudo_id order by event_timestamp), timestamp_micros(event_timestamp), second) as time_to_next_event
from
  `yourproject.analytics_viewid.events_202011*`  -- change this to your table
-- no WHERE clause, every event counts
)
,sessionize as(
  select
    user_pseudo_id,
    event_timestamp,
    event_name,
    1 + countif(time_to_next_event is null or time_to_next_event > 300) over
        (partition by user_pseudo_id order by event_timestamp asc
         rows between unbounded preceding and 1 preceding) as session_number,
    time_to_next_event
  from prep
)

-- summarize per session
select
  user_pseudo_id ||'-'|| session_number as session_id,
  min(event_timestamp) as first_event,
  max(event_timestamp) as last_event,
  timestamp_diff(max(event_timestamp), min(event_timestamp), second) as session_duration
from sessionize
group by 1

It works!

Now.. note that I concatenate the user_pseudo_id and session_number to generate some session id of my choice.

I can generate loads of session ids like this, based on numerous criteria.

Stand up, and query

Now it’s your turn. Fire up BQ, and sessionize your events.

But beware.. do not use stupid session-endings, like

  • midnight
  • payment gateway referrers or referral exclusions
  • arbitrary pausing (like I did)

Be smart!

<3, @zjuul

Related posts:

graphTrack, Store and Analyze granular Page Performance data: a practical guide Multi Channel Funnel Conversion Rate Conversion Attribution: a Markov model (using Google Analytics and the R ChannelAttribution package) Visualize Snowplow data with Google Data Studio

Filed Under: Blog Tagged With: big query, ga4, Google Analytics, sessions, sql

Liked this post?

Buy Me a Coffee

Comments

  1. FredD says

    14 November 2020 at 09:22

    Hi,

    For your first query, you say Timeout of 5 minutes = new session. But the query states > 600, that would be more than 10 minutes, not more than 5 (300 seconds). Or am I missing something here?
    Thank you

    Reply
    • jules says

      14 November 2020 at 10:02

      You’re right, it should be 300. Thanks for reading so carefully.
      Fixing it now.

      Reply
      • FredD says

        14 November 2020 at 15:06

        Thank you for the great article

        Reply
        • MOhammad says

          9 March 2021 at 16:30

          Than you both :)

          Reply
  2. Lotte says

    27 April 2022 at 15:12

    This saved me a world of time. Thank you for the super clear explanation and real-life examples!

    Reply

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Recent Posts

  • Analyze AB tests in GA4 via Big Query
  • How to make sure your GA4 events tables do not expire
  • Prepare your cloud project for “Bulk data export” Google Search Console
  • Making Sense of the GA4 Configuration Tag
  • Using Big Query to calculate DAU / MAU

Need help?

  • Contact me

Search

© Copyright Jules Stuifbergen · Powered by Genesis ·