skip to Main Content

I have a following table representing user activity in API endpoints.

create table crud_logs (
    id           bigint generated always as identity
        constraint pk_crud_logs
            primary key,
    object_type  varchar(255)                           not null,
    object_id    bigint                                 not null,
    action       crudtypes                              not null,
    operation_ts timestamp with time zone default now() not null,
    user_id      bigint
        constraint fk_crud_logs_user_id_users
            references users
            on delete set null
);

As part of a user statistics API, I want to get the rank for a given user_id (numeric or %-wise), ranked by the count of rows per user for a given period. For simplicity let’s say from forever.

Example:

id object_type object_id action operation_ts user_id
57 Enterprise 56 update 2023-07-21 12:51:43.904511 +00:00 34
58 Enterprise 56 update 2023-07-21 12:52:46.472828 +00:00 34
59 Enterprise 56 delete 2023-07-21 13:00:26.499430 +00:00 34
60 Enterprise 56 update 2023-07-21 12:51:43.904511 +00:00 34
61 Enterprise 56 update 2023-07-21 12:52:46.472828 +00:00 34
62 Enterprise 56 delete 2023-07-21 13:00:26.499430 +00:00 34
67 Enterprise 56 update 2023-07-21 12:51:43.904511 +00:00 34
68 Enterprise 56 update 2023-07-21 12:52:46.472828 +00:00 34
69 Enterprise 56 delete 2023-07-21 13:00:26.499430 +00:00 59
70 Enterprise 56 update 2023-07-21 12:51:43.904511 +00:00 58
71 Enterprise 56 update 2023-07-21 12:52:46.472828 +00:00 59
72 Enterprise 56 delete 2023-07-21 13:00:26.499430 +00:00 58

Here, user_id = 34 would have first place because that user has the most entries.

I have managed to write the following query:

select user_id, rank() over (order by cnt desc )
from (select user_id, count(*) cnt from crud_logs group by user_id) sq

It works for all users, whereas I need the result only for 1 specific user. For example, the expected output for user_id = 58 would be:

user_id = 58, rank = 2

2

Answers


  1. The query you’ve written is pretty close. To obtain the rank for a specific user, you could simply add a WHERE clause in the outer SQL query to filter for the required user_id.

    Here’s how you could modify your query:

    SELECT user_id, rank 
    FROM (
        SELECT user_id, RANK() OVER (ORDER BY cnt DESC) as rank 
        FROM (
            SELECT user_id, COUNT(*) cnt 
            FROM crud_logs 
            GROUP BY user_id
        ) sq
    ) result
    WHERE user_id = 58
    

    In this query, the innermost subquery groups the crud_logs table by user_id and counts the number of entries for each user. The middle subquery ranks these users by their count in descending order. Finally, the outer query filters these results to return the rank of the user with user_id = 58.

    Login or Signup to reply.
  2. Basic query

    You can wrap the window function rank() around the aggregate count(*) in the same query level:

    SELECT *
    FROM  (
       SELECT user_id, rank() OVER (ORDER BY count(*) DESC) AS rank
       FROM   crud_logs 
       GROUP  BY 1
       ) sub
    WHERE  user_id = 58;
    

    Subtle issue 1: ignore null values?

    You have a FK constraint on user_id with ON DELETE SET NULL – suggesting that users can be deleted and their log entries are kept with user_id reset to null. The "NULL" user would be ranked like any other, representing a collection of deleted users. I suspect you’d want to exclude that from the ranks. So:

    SELECT *
    FROM  (
       SELECT user_id, rank() OVER (ORDER BY count(*) DESC) AS rank
       FROM   crud_logs 
       WHERE  user_id IS NOT NULL  -- !
       GROUP  BY 1
       ) sub
    WHERE  user_id = 58;
    

    Subtle issue 2: proper time periods

    Those time periods (which we are ignoring for the sake of simplicity) can be tricky with the type timestamptz. Typically, you’d want statistics for days/weeks/months in a given time zone. Avoid sneaky corner case errors by factoring in the time zone properly.

    Example for the month July 2023 in Australia:

    SELECT *
    FROM  (
       SELECT user_id
            , rank() OVER (ORDER BY count(*) DESC) AS rank
       FROM   crud_logs 
       WHERE  user_id IS NOT NULL  -- !
       AND    operation_ts >= timestamp '2023-07-01' AT TIME ZONE 'Australia/Sydney'  -- !
       AND    operation_ts <  timestamp '2023-08-01' AT TIME ZONE 'Australia/Sydney'  -- !
       GROUP  BY 1
       ) sub
    WHERE  user_id = 58;
    

    fiddle

    See:

    If the table is big, an index on (operation_ts) should help.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search