skip to Main Content

Context

I have two tables entity_a and entity_b. The entity_b is linked to entity_a through a foreign key, and has another field date, which is a date (not a datetime, not a timestamp). The date field of entity_b indicates when entity_a happens.

Problem

I want to count of number of half-days where entity_a is happening. However to do that, I don’t take into account the time of entity_b (I don’t make the rules, and no I don’t think they make sense either), just the number of entity_b per date: on one day, if there are more than two, count 1 day, if there is 1, count 0.5 day.

In other words, each entity_b counts for 0.5, for a maximum of 1 per date.

As one more constraint: I want to return one row for each entity_a, since I’ll be querying multiple at time

Note: using Postgres

Example

In this example, entity_a_1 has 1.0 day, entity_a_2 has 1.5 days

  • entity_a_1
    • entity_b (2023-05-01)
    • entity_b (2023-05-02)
  • entity_a_2
    • entity_b (2023-05-12)
    • entity_b (2023-05-13)
    • entity_b (2023-05-13)
    • entity_b (2023-05-13)

Thing I’ve tried

first, for those who want to test at home: the schema and data of the example

CREATE TABLE entity_a (id SERIAL PRIMARY KEY);
CREATE TABLE entity_b (id SERIAL PRIMARY KEY, a_id INTEGER REFERENCES entity_a(id), date DATE);
INSERT INTO entity_a (id) VALUES (1), (2);
INSERT INTO entity_b (a_id, date) VALUES (1, '2023-05-01'), (1, '2023-05-02'), (2, '2023-05-12'), (2, '2023-05-13'), (2, '2023-05-13'), (2, '2023-05-13');

I have tried things that did the job using a subquery but with very poor performance:

SELECT
  a.id,
  SUM(b_agg.count)
FROM
  entity_a a
  JOIN (
    SELECT
      b.a_id,
      CASE
        WHEN COUNT(*) > 1 THEN 1
        ELSE 0.5
      END AS count
    FROM
      entity_b b
    GROUP BY
      b.a_id,
      b.date
  ) b_agg ON b_agg.a_id = a.id
GROUP BY
  a.id;

(if you use the data from the example, it will obviously be very fast, but consider there are millions of rows in each table)
(here’s a small bash script that will fill the tables with a million lines. It’s far from reality, yet a little closer for i in {3..1000000}; do; psql --command "INSERT INTO entity_a (id) VALUES ($i); INSERT INTO entity_b (a_id, date) VALUES ($i, '2023-01-01')"; done)

And I had the feeling window functions could help, but I don’t know much about them so I always ended up failing to use them correctly 😅

Any help is welcome, thanks in advance !

2

Answers


  1. I don’t believe window functions will help. If you remove the join because it isn’t required here, the query will be faster:

    SELECT a_id, SUM(count) as count_
    FROM (
      SELECT
          b.a_id,
          CASE
            WHEN COUNT(*) > 1 THEN 1
            ELSE 0.5
          END AS count
        FROM
          entity_b b
        GROUP BY
          b.a_id,
          b.date
    ) AS t
    group by t.a_id;
    
    Login or Signup to reply.
  2. Here’s a solution via window functions. I don’t know if you’ll find that it runs better but it’s worth checking into:

    with data as (
        select *,
            case when date = lag(date) over (partition by a_id order by date)
                then 0.0
                else case when date = lead(date) over (partition by a_id order by date)
                    then 1.0 else 0.5 end
            end as weight
        from entity_b
    )
    select a.id, sum(weight) as total
    from entity_a a inner join entity_b b on b.a_id = a.id
    group by a.id;
    

    https://dbfiddle.uk/Az4y03-0

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search