Or: race conditions, Murphy’s Law, and assumptions – a story on why my models suddenly failed, and what I did about it to fix it.
Accompanied by the obligatory AI generated image of an SQL bug. Because it’s 2024.
Since a couple of years, I’ve been using Dataform a lot for modeling tables. It’s a great (now GCP native!) tool to orchestrate your scheduled queries. Use it!
There are several techniques in Dataform to build incremental tables. A good resource to read is this blogpost by Taneli Salonen – who describes 3.
The trick I use normally use is a slightly modified version of the “Delete + checkpoint” method:
- For incremental builds, delete the most recent 4 days from the dataset (row 5 below)
- then I check what the most recent table suffix date is in my modeled table
I use this code:
/* we use the table_suffix column to determine incrementality */ pre_operations { -- delete some older data, since this may be updated later by GA4 ${when(incremental(), `delete ${self()} where _partitiondate >= date_sub(current_date(), interval 4 day)`, `select 'do not delete anything'`)} --- declare date_checkpoint STRING default ( ${when(incremental(), `select MAX(REGEXP_EXTRACT(table_suffix,'[0-9]+')) from ${self()}`, `select '20240101'`)} /* the default, when it's not incremental */ ) } -- actual query SELECT _table_suffix AS table_suffix, event_name, -- ... way more stuff FROM ${ref("events_*")} WHERE REGEXP_EXTRACT(_table_suffix,'[0-9]+') > date_checkpoint
I combine that with a trick from Johan van de Werken – to query processed tables AND intraday tables in one go. See row 21.
Neat! It works!
Always!
Until it stops working.
Sometimes the intraday table and the processed table coexist
When GA4 does processing of say April 5, at some point it moves the events_intraday_20240405
table to events_20240505
When the above query runs (see code) – this means REGEXP_EXTRACT(_table_suffix,'[0-9]+') > date_checkpoint
matches 2 shards, and rows will be duplicated!
And this.. dear reader, is called a race condition. Nasty stuff that happens because two processes run at the same time and assume things are fine.
Murphy’s Law comes in, and says: It can go wrong.. so it will go wrong at some point.
The fix
The fix is relatively simple, and requires a check for a duplicate table_suffix via a DENSE_RANK()
window function.
With duplicate, I mean: intraday_20240405
is fine, as long as there’s no 20240405
as well.
Here’s the improved version of the actual query from the snippet above
-- actual query SELECT _table_suffix AS table_suffix, event_name, -- ... way more stuff DENSE_RANK() OVER ( PARTITION BY REGEXP_EXTRACT(_table_suffix,'[0-9]+') ORDER BY _table_suffix ) as tbl_suffix_nr FROM ${ref("events_*")} WHERE REGEXP_EXTRACT(_table_suffix,'[0-9]+') > date_checkpoint QUALIFY tbl_suffix_nr = 1
This works, because dense_rank
ranks over the date part of the suffix, and order by suffix itself (then 20240405
comes first, intraday_20240405
comes 2nd
Then QUALIFY to only include the first suffix found:
- either the intraday suffix (if that’s the only one)
- or the processed suffix (if there’s a processed one AND an intraday)
Rejoice! A bug is squashed
But more importantly: I’ve again learned to never assume when writing code, and again found out that things will go wrong in unexpected ways :-)
As always: let me know if this post was helpful, via a comment, a toot or a skeet, but not a tweet. Boycott X!
Zach Renwick says
Nice article, and very helpful as I am currently modeling GA4 data (and trying to integrate it with GA UA!).
One question – does your dense_rank() function in the smart incremental logic favour the full day export instead of the events_intraday table? Presumably we would want to use the full day export instead of streaming dataset as the data would be more complete.
jules says
Thanks Zach! Yes, in the sort, the
_20240411
comes before_intraday_20240411
, so it favours the processed tablesZach Renwick says
Amazing, thanks!