I'm trying to write a SQL statement which summarizes the count of rows by the values in a specific column.
I have several tables, all starting with s0_.
Every table has a huge number of rows with multiple different values in the column load_nr.
Example:
s0_table1:
load_nr | column2 | column3… |
---|---|---|
101 | someData | someData… |
101 | someData | someData… |
102 | someData | someData… |
103 | someData | someData… |
103 | someData | someData… |
103 | someData | someData… |
s0_table2:
load_nr | column2 | column3… |
---|---|---|
101 | someData | someData… |
102 | someData | someData… |
102 | someData | someData… |
102 | someData | someData… |
102 | someData | someData… |
103 | someData | someData… |
Desired result:
load_nr | count s0_table1 | count s0_table2 |
---|---|---|
101 | 2 | 1 |
102 | 1 | 4 |
103 | 3 | 1 |
New data with incrementing load_nr values is added every day.
Ideally, the table names would be fetched dynamically (s0_*).
Could anyone give me a clue how to build this statement? I'm a newbie and not so experienced with more complicated statements.
Best Regards
*edit: Hopefully the tables are displayed correctly xD
3 Answers
For that you need a FULL OUTER JOIN (in case the load_nr values don't match up) and an ORDER BY load_nr at the end.
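A minimal sketch of that approach, using just the two example tables from the question (the subquery aliases and output column names are illustrative):

```sql
SELECT COALESCE(t1.load_nr, t2.load_nr) AS load_nr,
       COALESCE(t1.cnt, 0) AS count_s0_table1,
       COALESCE(t2.cnt, 0) AS count_s0_table2
FROM (SELECT load_nr, COUNT(*) AS cnt
        FROM s0_table1
       GROUP BY load_nr) t1
FULL OUTER JOIN
     (SELECT load_nr, COUNT(*) AS cnt
        FROM s0_table2
       GROUP BY load_nr) t2
     ON t1.load_nr = t2.load_nr
ORDER BY load_nr;
```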
Dynamic table names are going to be a problem. You basically have no choice but to build the SQL statement dynamically. There's a good chance that summing a UNION-ed subquery is going to be the fastest way to go, but you can also try this:
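A sketch of what I mean, with a hypothetical third table s0_table3 added so you can see how the COALESCEs nest as each table joins in:

```sql
SELECT COALESCE(COALESCE(t1.load_nr, t2.load_nr), t3.load_nr) AS load_nr,
       COALESCE(t1.cnt, 0) AS count_s0_table1,
       COALESCE(t2.cnt, 0) AS count_s0_table2,
       COALESCE(t3.cnt, 0) AS count_s0_table3
FROM (SELECT load_nr, COUNT(*) AS cnt FROM s0_table1 GROUP BY load_nr) t1
FULL OUTER JOIN
     (SELECT load_nr, COUNT(*) AS cnt FROM s0_table2 GROUP BY load_nr) t2
     ON t1.load_nr = t2.load_nr
FULL OUTER JOIN
     -- s0_table3 is hypothetical; each further table repeats this pattern
     (SELECT load_nr, COUNT(*) AS cnt FROM s0_table3 GROUP BY load_nr) t3
     ON COALESCE(t1.load_nr, t2.load_nr) = t3.load_nr
ORDER BY load_nr;
```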
Of course you’d need to build the query dynamically and the nested coalesces for each subsequent join could get a bit tricky.
Another option: if you have the load numbers you need in another table, you can make it cleaner and easier to construct dynamically:
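A sketch, assuming a hypothetical table loads that holds one row per load_nr:

```sql
SELECT l.load_nr,
       (SELECT COUNT(*) FROM s0_table1 t WHERE t.load_nr = l.load_nr) AS count_s0_table1,
       (SELECT COUNT(*) FROM s0_table2 t WHERE t.load_nr = l.load_nr) AS count_s0_table2
FROM loads l  -- "loads" is a hypothetical lookup table of load numbers
ORDER BY l.load_nr;
```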
And if you need to remove the loads with no rows in any table, you can just wrap it in a subquery and add a WHERE clause:
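Continuing the sketch under the same assumption about the loads table:

```sql
SELECT *
FROM (SELECT l.load_nr,
             (SELECT COUNT(*) FROM s0_table1 t WHERE t.load_nr = l.load_nr) AS count_s0_table1,
             (SELECT COUNT(*) FROM s0_table2 t WHERE t.load_nr = l.load_nr) AS count_s0_table2
        FROM loads l) counts
WHERE count_s0_table1 > 0
   OR count_s0_table2 > 0
ORDER BY load_nr;
```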
This could be pretty slow though depending on the shape of your data.
I’m not sure if any of the following will work for you. This is something that
I haven’t ever used in a real project, but maybe it will provide you with some
inspiration. The final solution will get you 99% of the way to the best case you
wanted (more on this at the end).
Here’s the basic idea.
psql provides a command called `\crosstabview` that is able to display pivot tables, which is what you want. So, for your sample data, we could do this (notice that instead of terminating the query with a semicolon `;`, we're terminating it with `\crosstabview`):
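A minimal sketch of such a query, assuming just the two sample tables (UNION ALL so that per-table counts are kept separate):

```sql
SELECT load_nr, 's0_table1' AS table_name, COUNT(*) AS cnt
  FROM s0_table1
 GROUP BY load_nr
UNION ALL
SELECT load_nr, 's0_table2', COUNT(*)
  FROM s0_table2
 GROUP BY load_nr
ORDER BY 1, 2 \crosstabview
```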
Try the same query with a semicolon terminator to see what the query returned before psql post-processed the result.

Ok. Now this was a psql thing, but PostgreSQL provides a built-in extension called `tablefunc` which exposes a function called `crosstab` that does precisely what the psql `\crosstabview` command does. Furthermore, this function accepts as an argument a string containing the SQL text of the query you need, which means we could generate it.
For the moment we can just use the hardcoded query (I'm using `$$`-delimited strings below so that I don't have to escape single quotes):
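A sketch of the call, assuming the extension is installed (`CREATE EXTENSION tablefunc;`) and that `load_nr` is an integer:

```sql
SELECT *
FROM crosstab(
  -- 1st argument: source query, one row per (load_nr, table, count)
  $$ SELECT load_nr, 's0_table1', COUNT(*) FROM s0_table1 GROUP BY load_nr
     UNION ALL
     SELECT load_nr, 's0_table2', COUNT(*) FROM s0_table2 GROUP BY load_nr
     ORDER BY 1, 2 $$,
  -- 2nd argument: the categories, i.e. one output column per table
  $$ VALUES ('s0_table1'), ('s0_table2') $$
) AS ct(load_nr int, count_s0_table1 bigint, count_s0_table2 bigint);
```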
The above will return this, which perfectly reproduces the `\crosstabview` result:
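load_nr | count_s0_table1 | count_s0_table2 |
---|---|---|
101 | 2 | 1 |
102 | 1 | 4 |
103 | 3 | 1 |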
Ok, now on to dynamically generating the two SQL queries that `crosstab` wants. First, we can obtain a list of the table names prefixed with `s0_`:
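One way to do that, assuming the tables live in the `public` schema:

```sql
SELECT table_name
  FROM information_schema.tables
 WHERE table_schema = 'public'
   AND table_name LIKE 's0\_%'  -- backslash-escape the underscore wildcard
 ORDER BY table_name;
```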
Now, we can use these values to build individual `SELECT` statements that we'll later join via `UNION`. First the `SELECT`s:
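A sketch using `format()`, with `%I` quoting the table identifier and `%L` the string literal:

```sql
SELECT format(
         'SELECT load_nr, %L AS table_name, COUNT(*) FROM %I GROUP BY load_nr',
         table_name, table_name) AS select_sql
  FROM information_schema.tables
 WHERE table_schema = 'public'
   AND table_name LIKE 's0\_%';
```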
Now we can join the two rows into a single one:
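For example, with `string_agg()` gluing the statements together with `UNION ALL`:

```sql
SELECT string_agg(
         format('SELECT load_nr, %L AS table_name, COUNT(*) FROM %I GROUP BY load_nr',
                table_name, table_name),
         ' UNION ALL ' ORDER BY table_name)
       || ' ORDER BY 1, 2' AS source_sql
  FROM information_schema.tables
 WHERE table_schema = 'public'
   AND table_name LIKE 's0\_%';
```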
If we take the generated query and manually run it using `\crosstabview`, we'll get the same pivot as before. Nice, so this was precisely what we wanted. We can now use the SQL-generating query as the first argument to the function `crosstab`:
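A sketch, passing the generated SQL text as a scalar subquery; the 2nd argument still hardcodes the categories for now:

```sql
SELECT *
FROM crosstab(
  (SELECT string_agg(
            format('SELECT load_nr, %L, COUNT(*) FROM %I GROUP BY load_nr',
                   table_name, table_name),
            ' UNION ALL ' ORDER BY table_name) || ' ORDER BY 1, 2'
     FROM information_schema.tables
    WHERE table_schema = 'public'
      AND table_name LIKE 's0\_%'),
  $$ VALUES ('s0_table1'), ('s0_table2') $$
) AS ct(load_nr int, count_s0_table1 bigint, count_s0_table2 bigint);
```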
Executing it will get us the same result as before.
Now, the final thing is to replace the 2nd argument to `crosstab` with a proper query (a way to fetch all table names prefixed with `s0_`, which we already know how to do), and with that the solution is complete:
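Putting it all together, still a sketch under the same assumptions (`public` schema, integer `load_nr`):

```sql
SELECT *
FROM crosstab(
  -- 1st argument: the generated UNION ALL of per-table counts
  (SELECT string_agg(
            format('SELECT load_nr, %L, COUNT(*) FROM %I GROUP BY load_nr',
                   table_name, table_name),
            ' UNION ALL ' ORDER BY table_name) || ' ORDER BY 1, 2'
     FROM information_schema.tables
    WHERE table_schema = 'public'
      AND table_name LIKE 's0\_%'),
  -- 2nd argument: the categories, fetched the same way
  $$ SELECT table_name
       FROM information_schema.tables
      WHERE table_schema = 'public'
        AND table_name LIKE 's0\_%'
      ORDER BY table_name $$
) AS ct(load_nr int, count_s0_table1 bigint, count_s0_table2 bigint);
```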
Coming back to my observation that this will bring you just 99% of the way to your ideal solution: the reason is that `crosstab` requires a column list definition, this part, where you have to manually list the table headers (again):
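In the sketches above, that is this trailing clause:

```sql
AS ct(load_nr int, count_s0_table1 bigint, count_s0_table2 bigint)
```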
Unfortunately, I don’t know of a way to get rid of that last hurdle. And as I
mentioned in the beginning, this query might perform quite badly, but maybe it
is good enough for your use case.