skip to Main Content

I have the following table <state_table> that tracks entries per minute of an id and its state:

    minute id type
    ------ -- ----
    1      A  solid
    1      A  solid
    1      A  solid
    1      A  liquid
    1      B  solid
    1      B  solid
    1      B  liquid
    .... 1000+ rows ...
    1      ZZX  liquid
    1      ZZZ  liquid
    2      A  solid
    2      A  solid
    2      A  liquid

With the following query, I can get the top 1000 pairs based on occurrence:

With TempIds AS (
   SELECT
      state_table.minute as minute,
      state_table.id as id,
      COUNT(*)
   FROM
      state_table
GROUP BY 1,2
) SELECT
   TempId.minute,
   TempId.id,
   TempId.count
FROM
   TempIds
ORDER BY 3 DESC
LIMIT 1000
;

e.g.

    minute id count
    ------ -- ----
    2      B  1002
    3      A  990
    1      C  800
    3      B  798

How can I modify my query to get the type of the id?
For example, there are 1002 <minute=2,id=B> rows. Is there a way to get there are 402 solids and 600 liquids?

    minute id count type
    ------ -- ---- -----
    2      B  402  solid
    2      B  600  liquid
    3      A  330  solid
    3      A  660  liquid

The only way I can think of is a fairly complex nested query:

With TempTop AS (
    With TempIds AS (
       SELECT
          state_table.minute as minute,
          state_table.id as id,
          COUNT(*)
       FROM
          state_table
    GROUP BY 1,2
    ) SELECT
       TempId.minute as minute,
       TempId.id as id,
       TempId.count
    FROM
       TempIds
    ORDER BY 3 DESC
    LIMIT 1000
    )
) SELECT
    state_table.minute,
    state_table.id,
    state_table.type,
    COUNT(*)
FROM
   state_table, TempTop
WHERE
    state_table.minute = TempTop.minute
    AND state_table.id = TempTop.id
;

Is there a simpler way to make this query?
Goal:

  • For the Top 1000 most frequent pairs, get the breakdown of the type.

2

Answers


  1. Your query is

    SELECT minute, id, COUNT(*)
    FROM state_table
    GROUP BY minute, id
    ORDER BY COUNT(*) DESC
    LIMIT 1000;
    

    and in this process you lose the types, because you want the 1000 top minute/id pairs, so you cannot simply group by minute, id, type instead.

    If this is only about the types ‘solid’ and ‘liquid’, you can apply conditional aggregation to get the separate counts along:

    SELECT 
      minute, id, 
      COUNT(*) AS total,
      SUM(type = 'solid') AS solid,
      SUM(type = 'liquid') AS liquid
    FROM state_table
    GROUP BY minute, id
    ORDER BY COUNT(*) DESC
    LIMIT 1000;
    

    Summing up the boolean expressions works in MySQL, because true equals 1 and false equals 0 there.

    The problem with the above queries are ties, by the way. If there are two pairs with the same 1000th count, you pick one arbitrarily instead of showing only 999 or 1001 pairs then in order to treat both tying pairs the same. So I’d probably re-write the queries using DENSE_RANK in order to properly handle ties.

    For the more complicated case where the types are unknown at the point of writing the query, you need rows instead of columns, just as already shown in your request. In that case you really need to group by minute, id, type first. The easiest way to get the total counts is with SUM OVER then. Then rank the pairs as mentioned with DENSE_RANK and keep the top 1000.

    SELECT minute, id, type, cnt
    FROM
    (
      SELECT
        minute, id, type, cnt,
        DENSE_RANK() OVER (ORDER BY total DESC) AS rnk
      FROM
      (
        SELECT
          minute, id, type,
          COUNT(*) AS cnt,
          SUM(COUNT(*)) OVER(PARTITION BY minute, id) AS total
        FROM state_table
        GROUP BY minute, id, type
      ) counted
    ) ranked
    WHERE rnk <= 1000
    ORDER BY rnk, minute, id, type;
    

    This can get you more than 1000 minute/id pairs in case of ties. You can reduce this with RANK instead of DENSE_RANK. If these two approaches still don’t get the number you want, you may have to count minute/id pairs separately in a subquery instead:

    select minute, id, type, COUNT(*)
    from state_table
    WHERE (minute, id) IN
    (
      SELECT minute, id
      FROM state_table
      GROUP BY minute, id
      ORDER BY COUNT(*) DESC
      LIMIT 1000
    )
    GROUP BY minute, id, type
    ORDER BY 
      SUM(COUNT(*)) OVER (PARTITION BY minute, id) DESC,
      minute, id, type;
    
    Login or Signup to reply.
  2. Best way I can think of is, you still have to use 2 CTEs but not nested ones.

    You have 3 tasks to achieve,

    1. CTE will count the records.

    2. second one will use row_number() window function to assign row numbers based on decreasing order or count.

    3. last select will do a join between state table and second cte with a where clause specifying row_nunber <= 1000 and grouping based on ID , State and Minute.

      With TempIdsCount AS (
      SELECT 
      state_table.minute as minute, 
      state_table.id as id, 
      COUNT(*) rw_cnt 
      FROM 
      state_table 
      GROUP BY 
      1, 
      2
      ) TempIdsRowNum AS (
      SELECT 
      minute, 
      id, 
      Row_number() over (order by rw_cnt desc) rw_num 
      FROM 
      TempIdsCount
      ) 
      SELECT 
      st.minute, 
      st.id, 
      st.type, 
      Count(*) cnt 
      FROM 
      state_table st 
      Join TempIdsRowNum trn on st.minute = trn.minute 
      and st.id = trn.id 
      Where 
      rw_num <= 1000 
      Group by 
      st.minute, 
      st.id, 
      st.type
      
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search