I'm trying to write a SQL statement which summarizes the count of rows by the values in a specific column.
I have several tables, all starting with s0_.
Every table has a huge number of rows with multiple different values in the column load_nr.
Example:
s0_table1:
load_nr | column2 | column3… |
---|---|---|
101 | someData | someData… |
101 | someData | someData… |
102 | someData | someData… |
103 | someData | someData… |
103 | someData | someData… |
103 | someData | someData… |
s0_table2:
load_nr | column2 | column3… |
---|---|---|
101 | someData | someData… |
102 | someData | someData… |
102 | someData | someData… |
102 | someData | someData… |
102 | someData | someData… |
103 | someData | someData… |
Desired result:
load_nr | count s0_table1 | count s0_table2 |
---|---|---|
101 | 2 | 1 |
102 | 1 | 4 |
103 | 3 | 1 |
New data with incrementing load_nr values is added every day.
Ideally, the table names would be fetched dynamically (s0_*).
Could anyone give me a clue how to build this statement? I'm a newbie and not so experienced with more complicated statements.
Best Regards
*edit: Hopefully the tables are displayed correctly xD
3 Answers
For that you need a FULL OUTER JOIN (in case the load_nr values don't match up) and an ORDER BY load_nr at the end.
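A minimal sketch of that approach, using just the two example tables from the question (the subquery aliases and output column names are illustrative):

```sql
SELECT COALESCE(t1.load_nr, t2.load_nr) AS load_nr,
       COALESCE(t1.cnt, 0) AS count_s0_table1,
       COALESCE(t2.cnt, 0) AS count_s0_table2
FROM (SELECT load_nr, COUNT(*) AS cnt
        FROM s0_table1
       GROUP BY load_nr) t1
FULL OUTER JOIN
     (SELECT load_nr, COUNT(*) AS cnt
        FROM s0_table2
       GROUP BY load_nr) t2
     ON t1.load_nr = t2.load_nr
ORDER BY load_nr;
```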
Dynamic table names are going to be a problem. You basically have no choice but to build the SQL statement dynamically. There's a good chance that summing a UNION-ed subquery is going to be the fastest way to go, but you can also try this:
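A sketch of what I mean, with a hypothetical third table s0_table3 added so you can see how the COALESCEs nest as each table joins in:

```sql
SELECT COALESCE(COALESCE(t1.load_nr, t2.load_nr), t3.load_nr) AS load_nr,
       COALESCE(t1.cnt, 0) AS count_s0_table1,
       COALESCE(t2.cnt, 0) AS count_s0_table2,
       COALESCE(t3.cnt, 0) AS count_s0_table3
FROM (SELECT load_nr, COUNT(*) AS cnt FROM s0_table1 GROUP BY load_nr) t1
FULL OUTER JOIN
     (SELECT load_nr, COUNT(*) AS cnt FROM s0_table2 GROUP BY load_nr) t2
     ON t1.load_nr = t2.load_nr
FULL OUTER JOIN
     -- s0_table3 is hypothetical; each further table repeats this pattern
     (SELECT load_nr, COUNT(*) AS cnt FROM s0_table3 GROUP BY load_nr) t3
     ON COALESCE(t1.load_nr, t2.load_nr) = t3.load_nr
ORDER BY load_nr;
```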
Of course you’d need to build the query dynamically and the nested coalesces for each subsequent join could get a bit tricky.
Another option: if you have the load numbers you need in another table, you can make it cleaner and easier to construct dynamically:
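A sketch, assuming a hypothetical table loads that holds one row per load_nr:

```sql
SELECT l.load_nr,
       (SELECT COUNT(*) FROM s0_table1 t WHERE t.load_nr = l.load_nr) AS count_s0_table1,
       (SELECT COUNT(*) FROM s0_table2 t WHERE t.load_nr = l.load_nr) AS count_s0_table2
FROM loads l  -- "loads" is a hypothetical lookup table of load numbers
ORDER BY l.load_nr;
```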
And if you need to remove the loads with no rows in any table, you can just wrap it in a subquery and add a WHERE clause:
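Continuing the sketch under the same assumption about the loads table:

```sql
SELECT *
FROM (SELECT l.load_nr,
             (SELECT COUNT(*) FROM s0_table1 t WHERE t.load_nr = l.load_nr) AS count_s0_table1,
             (SELECT COUNT(*) FROM s0_table2 t WHERE t.load_nr = l.load_nr) AS count_s0_table2
        FROM loads l) counts
WHERE count_s0_table1 > 0
   OR count_s0_table2 > 0
ORDER BY load_nr;
```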
This could be pretty slow though depending on the shape of your data.
I’m not sure if any of the following will work for you. This is something that
I haven’t ever used in a real project, but maybe it will provide you with some
inspiration. The final solution will get you 99% of the way to the best case you
wanted (more on this at the end).
Here’s the basic idea.
psql provides a command called `\crosstabview` that is able to display pivot tables, which is what you want. So, for your sample data, we could do this (notice that instead of terminating the query with a semicolon `;`, we're terminating it with `\crosstabview`):
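A minimal sketch of such a query, assuming just the two sample tables (UNION ALL so that per-table counts are kept separate):

```sql
SELECT load_nr, 's0_table1' AS table_name, COUNT(*) AS cnt
  FROM s0_table1
 GROUP BY load_nr
UNION ALL
SELECT load_nr, 's0_table2', COUNT(*)
  FROM s0_table2
 GROUP BY load_nr
ORDER BY 1, 2 \crosstabview
```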
Try the same query with a semicolon terminator to see what the query returned before psql post-processed the result.

Ok. Now this was a psql thing, but PostgreSQL provides a built-in extension called `tablefunc` which exposes a function called `crosstab` that does precisely what the psql `\crosstabview` command does. Furthermore, this function accepts as an argument a string containing the SQL text of the query you need, which means we could generate it.
For the moment we can just use the hardcoded query (I'm using `$$`-delimited strings below so that I don't have to escape single quotes):
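A sketch of the call, assuming the extension is installed (`CREATE EXTENSION tablefunc;`) and that `load_nr` is an integer:

```sql
SELECT *
FROM crosstab(
  -- 1st argument: source query, one row per (load_nr, table, count)
  $$ SELECT load_nr, 's0_table1', COUNT(*) FROM s0_table1 GROUP BY load_nr
     UNION ALL
     SELECT load_nr, 's0_table2', COUNT(*) FROM s0_table2 GROUP BY load_nr
     ORDER BY 1, 2 $$,
  -- 2nd argument: the categories, i.e. one output column per table
  $$ VALUES ('s0_table1'), ('s0_table2') $$
) AS ct(load_nr int, count_s0_table1 bigint, count_s0_table2 bigint);
```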
The above will return this, which perfectly reproduces the `\crosstabview` result:
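load_nr | count_s0_table1 | count_s0_table2 |
---|---|---|
101 | 2 | 1 |
102 | 1 | 4 |
103 | 3 | 1 |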
Ok, now on to dynamically generating the two SQL queries that `crosstab` wants. First, we can obtain a list of the table names prefixed with `s0_`:
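One way to do that, assuming the tables live in the `public` schema:

```sql
SELECT table_name
  FROM information_schema.tables
 WHERE table_schema = 'public'
   AND table_name LIKE 's0\_%'  -- backslash-escape the underscore wildcard
 ORDER BY table_name;
```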
Now, we can use these values to build individual `SELECT` statements that we'll later join via `UNION`. First the `SELECT`s:
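A sketch using `format()`, with `%I` quoting the table identifier and `%L` the string literal:

```sql
SELECT format(
         'SELECT load_nr, %L AS table_name, COUNT(*) FROM %I GROUP BY load_nr',
         table_name, table_name) AS select_sql
  FROM information_schema.tables
 WHERE table_schema = 'public'
   AND table_name LIKE 's0\_%';
```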
Now we can join the two rows into a single one:
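For example, with `string_agg()` gluing the statements together with `UNION ALL`:

```sql
SELECT string_agg(
         format('SELECT load_nr, %L AS table_name, COUNT(*) FROM %I GROUP BY load_nr',
                table_name, table_name),
         ' UNION ALL ' ORDER BY table_name)
       || ' ORDER BY 1, 2' AS source_sql
  FROM information_schema.tables
 WHERE table_schema = 'public'
   AND table_name LIKE 's0\_%';
```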
If we take the generated query and manually run it using `\crosstabview`, we'll get the same pivot as before. Nice, so this was precisely what we wanted. We can now use the SQL-generating query as the first argument to the function `crosstab`:
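A sketch, passing the generated SQL text as a scalar subquery; the 2nd argument still hardcodes the categories for now:

```sql
SELECT *
FROM crosstab(
  (SELECT string_agg(
            format('SELECT load_nr, %L, COUNT(*) FROM %I GROUP BY load_nr',
                   table_name, table_name),
            ' UNION ALL ' ORDER BY table_name) || ' ORDER BY 1, 2'
     FROM information_schema.tables
    WHERE table_schema = 'public'
      AND table_name LIKE 's0\_%'),
  $$ VALUES ('s0_table1'), ('s0_table2') $$
) AS ct(load_nr int, count_s0_table1 bigint, count_s0_table2 bigint);
```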
Executing it will get us the same result as before.
Now, the final thing is to replace the 2nd argument to `crosstab` with a proper query (a way to fetch all table names prefixed with `s0_`, which we already know how to do), and with that the solution is complete:
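Putting it all together, still a sketch under the same assumptions (`public` schema, integer `load_nr`):

```sql
SELECT *
FROM crosstab(
  -- 1st argument: the generated UNION ALL of per-table counts
  (SELECT string_agg(
            format('SELECT load_nr, %L, COUNT(*) FROM %I GROUP BY load_nr',
                   table_name, table_name),
            ' UNION ALL ' ORDER BY table_name) || ' ORDER BY 1, 2'
     FROM information_schema.tables
    WHERE table_schema = 'public'
      AND table_name LIKE 's0\_%'),
  -- 2nd argument: the categories, fetched the same way
  $$ SELECT table_name
       FROM information_schema.tables
      WHERE table_schema = 'public'
        AND table_name LIKE 's0\_%'
      ORDER BY table_name $$
) AS ct(load_nr int, count_s0_table1 bigint, count_s0_table2 bigint);
```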
Coming back to my observation that this will bring you just 99% of the way to your ideal solution: the reason is that `crosstab` requires a column list definition, this part, where you have to manually list the table headers (again):
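In the sketches above, that is this trailing clause:

```sql
AS ct(load_nr int, count_s0_table1 bigint, count_s0_table2 bigint)
```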
Unfortunately, I don’t know of a way to get rid of that last hurdle. And as I
mentioned in the beginning, this query might perform quite badly, but maybe it
is good enough for your use case.