skip to Main Content

I have a list of different channels that could potentially bring users to a website (organic, SEO, online marketing, etc.). I would like to find an efficient way to count daily active user that comes from the combination of these channels. Each channel has its own table and track its respective users.

The tables looks like the following,

channel A
date         user_id
2020-08-01   A
2020-08-01   B
2020-08-01   C

channel B
date         user_id
2020-08-01   C
2020-08-01   D
2020-08-01   G

channel C
date         user_id
2020-08-01   A
2020-08-01   C
2020-08-01   F

I want to know the following combinations

  1. Only visit channel A
  2. Only visit channel A & B
  3. Only visit channel B & C
  4. Only visit channel B
  5. etc.

However, when there are a lot of channels (I have around 8 channels) the combination is a lot. What I’ve done roughly is as simple as this (this one includes channel A)

SELECT 
    a.date, 
    COUNT(DISTINCT IF(b.user_id IS NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a,
    COUNT(DISTINCT IF(b.user_id IS NOT NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a_b,
    ...
FROM a LEFT JOIN b ON a.user_id = b.user_id AND a.date = b.date 
LEFT JOIN c ON a.user_id = c.user_id AND a.date = c.date
GROUP BY 1

but extremely tedious when the total channels is 8 (28 variations for 2 combinations, 56 for 3, 70 for 4, and many more).

Any smart ideas to solve this? I was thinking to use FULL OUTER JOIN but can’t seem to get the grasp out of it. Answers really appreciated.

4

Answers


  1. I think you could use set operators to answer your questions: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators

    E.g.

    1. is (A except B) except C
    2. is A intersect B

    etc.

    Login or Signup to reply.
  2. I am thinking full join and aggregation:

    select date, a.channel_a, b.channel_b, c.channel_c, count(*) cnt
    from      (select 'a' channel_a, a.* from channel_a) a
    full join (select 'b' channel_b, b.* from channel_b b) b using (date, user_id)
    full join (select 'c' channel_c, c.* from channel_c c) c using (date, user_id)
    group by date, a.channel_a, b.channel_b, c.channel_c
    
    Login or Signup to reply.
  3. I would approach this with union all and two levels of aggregation:

    select date, channels, count(*) as num_users
    from (select date, user_id, string_agg(channel order by channel) as channels
          from ((select distinct date, user_id, 'a' as channel from a) union all
                (select distinct date, user_id, 'b' as channel from b) union all
                (select distinct date, user_id, 'c' as channel from c) 
               ) abc
          group by date, user_id
         ) c
    group by date, channels;
      
    
    Login or Signup to reply.
  4. However, when there are a lot of channels (I have around 8 channels) the combination is a lot

    extremely tedious when the total channels is 8 (28 variations for 2 combinations, 56 for 3, 70 for 4, and many more).

    Any smart ideas to solve this?

    Below is for BigQuery Standard SQL and addresses exactly above aspect of the OP’s concerns

    #standardSQL
    CREATE TEMP FUNCTION generate_combinations(a ARRAY<INT64>) 
    RETURNS ARRAY<STRING>
    LANGUAGE js AS '''
      var combine = function(a) {
        var fn = function(n, src, got, all) {
          if (n == 0) {
            if (got.length > 0) {
              all[all.length] = got;
            } return;
          }
          for (var j = 0; j < src.length; j++) {
            fn(n - 1, src.slice(j + 1), got.concat([src[j]]), all);
          } return;
        }
        var all = []; for (var i = 1; i < a.length; i++) {
          fn(i, a, [], all);
        }
        all.push(a);
        return all;
      } 
      return combine(a)
    ''';
    with users as (
        select distinct date, user_id, 'A' channel from channel_A union all
        select distinct date, user_id, 'B' from channel_B union all
        select distinct date, user_id, 'C' from channel_C 
    ), visits as (
      select date, user_id, 
        string_agg(channel, ' & ' order by channel) combination
      from users
      group by date, user_id
    ), channels AS (
      select channel, cast(row_number() over(order by channel) as string) channel_num
      from (select distinct channel from users)
    ), combinations as (
      select string_agg(channel, ' & ' order by channel_num) combination
      from unnest(generate_combinations(generate_array(1,(select count(1) from channels)))) AS items, 
        unnest(split(items)) AS channel_num
      join channels using(channel_num)
      group by items
    )
    select date, 
      combination as channels_visited_only, 
      count(distinct user_id) dau
    from visits
    join combinations using (combination)
    group by date, combination
    order by combination
    

    If to apply to sample data from your question – output is

    enter image description here

    Some explanations to help with using above

    • CTE users just simply union all tables and adds channel column to be able to distinguish from which table respective row came

    • CTE visits extracts list of all visited channels for each user-date combination

    • CTE channels just simply prepares list of channels and assigns number for later use

    • CTE combinations uses JS UDF to generate all combinations of channels’ numbers and then joins them back to channels to generate channels combinations

    • and final SELECT statement is simply looks for those users whose list of visited channels match channels combination generated in previous step

    Some recommendations for further streamlining above code

    • assuming your channel tables names follow channel_* pattern

    you can use wildcard tables feature in users CTE and instead of

    select distinct date, user_id, 'A' channel from channel_A union all
    select distinct date, user_id, 'B' from channel_B union all
    select distinct date, user_id, 'C' from channel_C 
    

    you can use something like below – so just one line instead of as many lines as cannles you have

    select distinct date, user_id, _TABLE_SUFFIX as channel from channel_*      
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search