Stuifbergen.com

Analytics & Growth Hacking

  • About
  • Contact
  • Professional Services

Find me here

  • Mastodon (social)
  • Mastodon (measure)
  • No more Twitter
You are here: Home / TIL / Smart incremental GA4 tables in Dataform

Smart incremental GA4 tables in Dataform

5 April 2024 by jules 3 Comments

Or: race conditions, Murphy’s Law, and assumptions – a story on why my models suddenly failed, and what I did about it to fix it.

Accompanied by the obligatory AI generated image of an SQL bug. Because it’s 2024.

Since a couple of years, I’ve been using Dataform a lot for modeling tables. It’s a great (now GCP native!) tool to orchestrate your scheduled queries. Use it!

There are several techniques in Dataform to build incremental tables. A good resource to read is this blogpost by Taneli Salonen – who describes 3.

The trick I use normally use is a slightly modified version of the “Delete + checkpoint” method:

  • For incremental builds, delete the most recent 4 days from the dataset (row 5 below)
  • then I check what the most recent table suffix date is in my modeled table

I use this code:

/* we use the table_suffix column to determine incrementality */
pre_operations {
  -- delete some older data, since this may be updated later by GA4
  ${when(incremental(),
    `delete ${self()} where _partitiondate >=  date_sub(current_date(), interval 4 day)`,
    `select 'do not delete anything'`)}
  ---
  declare date_checkpoint STRING default (
    ${when(incremental(),
    `select MAX(REGEXP_EXTRACT(table_suffix,'[0-9]+')) from ${self()}`,
    `select '20240101'`)}   /* the default, when it's not incremental */
  )
}

-- actual query
SELECT
  _table_suffix AS table_suffix,
  event_name,
  -- ... way more stuff
FROM ${ref("events_*")}
WHERE REGEXP_EXTRACT(_table_suffix,'[0-9]+') > date_checkpoint

I combine that with a trick from Johan van de Werken – to query processed tables AND intraday tables in one go. See row 21.

Neat! It works!

Always!

Until it stops working.

Sometimes the intraday table and the processed table coexist

When GA4 does processing of say April 5, at some point it moves the events_intraday_20240405 table to events_20240505

When the above query runs (see code) – this means REGEXP_EXTRACT(_table_suffix,'[0-9]+') > date_checkpoint matches 2 shards, and rows will be duplicated!

And this.. dear reader, is called a race condition. Nasty stuff that happens because two processes run at the same time and assume things are fine.

Murphy’s Law comes in, and says: It can go wrong.. so it will go wrong at some point.

The fix

The fix is relatively simple, and requires a check for a duplicate table_suffix via a DENSE_RANK() window function.

With duplicate, I mean: intraday_20240405 is fine, as long as there’s no 20240405 as well.

Here’s the improved version of the actual query from the snippet above

-- actual query
SELECT
  _table_suffix AS table_suffix,
  event_name,
  -- ... way more stuff

  DENSE_RANK() OVER (
     PARTITION BY REGEXP_EXTRACT(_table_suffix,'[0-9]+')
     ORDER BY _table_suffix
  ) as tbl_suffix_nr
FROM ${ref("events_*")}
WHERE REGEXP_EXTRACT(_table_suffix,'[0-9]+') > date_checkpoint
QUALIFY tbl_suffix_nr = 1

This works, because dense_rank ranks over the date part of the suffix, and order by suffix itself (then 20240405 comes first, intraday_20240405 comes 2nd
Then QUALIFY to only include the first suffix found:

  1. either the intraday suffix (if that’s the only one)
  2. or the processed suffix (if there’s a processed one AND an intraday)

Rejoice! A bug is squashed

But more importantly: I’ve again learned to never assume when writing code, and again found out that things will go wrong in unexpected ways :-)

As always: let me know if this post was helpful, via a comment, a toot or a skeet, but not a tweet. Boycott X!

Related posts:

graphTrack, Store and Analyze granular Page Performance data: a practical guide ParisSession modeling in BigQuery SQL (using GA4 data and a trip to Paris) Make your GA4 life easier: Some powertips! Using Big Query to calculate DAU / MAU

Filed Under: TIL Tagged With: big query, dataform, ga4

Comments

  1. Zach Renwick says

    11 April 2024 at 20:18

    Nice article, and very helpful as I am currently modeling GA4 data (and trying to integrate it with GA UA!).

    One question – does your dense_rank() function in the smart incremental logic favour the full day export instead of the events_intraday table? Presumably we would want to use the full day export instead of streaming dataset as the data would be more complete.

    Reply
    • jules says

      11 April 2024 at 21:18

      Thanks Zach! Yes, in the sort, the _20240411 comes before _intraday_20240411, so it favours the processed tables

      Reply
      • Zach Renwick says

        11 April 2024 at 23:22

        Amazing, thanks!

        Reply

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Recent Posts

  • How to use a pre-hashed email address in GA4 tags in GTM
  • GA4 intraday exports and cookieless pings
  • July 2024 change in collected_traffic_source in GA4 Big Query
  • srsltid parameter, GA4 and Big Query
  • Smart incremental GA4 tables in Dataform

Need help?

  • Contact me

Search

© Copyright Jules Stuifbergen · Powered by Genesis ·