I have a list of different channels that could potentially bring users to a website (organic, SEO, online marketing, etc.). I would like to find an efficient way to count daily active user that comes from the combination of these channels. Each channel has its own table and track its respective users.
The tables looks like the following,
channel A
date user_id
2020-08-01 A
2020-08-01 B
2020-08-01 C
channel B
date user_id
2020-08-01 C
2020-08-01 D
2020-08-01 G
channel C
date user_id
2020-08-01 A
2020-08-01 C
2020-08-01 F
I want to know the following combinations
- Only visit channel A
- Only visit channel A & B
- Only visit channel B & C
- Only visit channel B
- etc.
However, when there are a lot of channels (I have around 8 channels) the combination is a lot. What I’ve done roughly is as simple as this (this one includes channel A)
SELECT
a.date,
COUNT(DISTINCT IF(b.user_id IS NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a,
COUNT(DISTINCT IF(b.user_id IS NOT NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a_b,
...
FROM a LEFT JOIN b ON a.user_id = b.user_id AND a.date = b.date
LEFT JOIN c ON a.user_id = c.user_id AND a.date = c.date
GROUP BY 1
but extremely tedious when the total channels is 8 (28 variations for 2 combinations, 56 for 3, 70 for 4, and many more).
Any smart ideas to solve this? I was thinking to use FULL OUTER JOIN
but can’t seem to get the grasp out of it. Answers really appreciated.
4
Answers
I think you could use set operators to answer your questions: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators
E.g.
etc.
I am thinking
full join
and aggregation:I would approach this with
union all
and two levels of aggregation:Below is for BigQuery Standard SQL and addresses exactly above aspect of the OP’s concerns
If to apply to sample data from your question – output is
CTE
users
just simply union all tables and adds channel column to be able to distinguish from which table respective row cameCTE
visits
extracts list of all visited channels for each user-date combinationCTE
channels
just simply prepares list of channels and assigns number for later useCTE
combinations
uses JS UDF to generate all combinations of channels’ numbers and then joins them back to channels to generate channels combinationsand final SELECT statement is simply looks for those users whose list of visited channels match channels combination generated in previous step
channel_*
patternyou can use wildcard tables feature in
users
CTE and instead ofyou can use something like below – so just one line instead of as many lines as cannles you have