Compute averages for two distinct sets from one table in a single query - Postgresql

Koenvc
January 12, 2023
301 views
3 votes
3 Answers

I have a query to show the average length of all the proteins which also exists in a second table mrna_pieces:

SELECT AVG(protein_length)
FROM protein
WHERE exists
(SELECT protein_id FROM mrna_pieces WHERE mrna_brokstukken.protein_id = protein.protein_id)

I also want to show the average for all the proteins which don’t exist in the second table.

SELECT AVG(protein_length)
FROM protein
WHERE exists
(SELECT protein_id FROM mrna_pieces WHERE mrna_brokstukken.protein_id != protein.protein_id)

But I want these 2 parts in one table like this
table example
I tried:

SELECT AVG(eiwit_lengte) AS avglengthwith, AVG(eiwit_lengte) AS avglengthwithout
FROM eiwit
WHERE exists
(SELECT eiwit_id FROM mrna_brokstukken WHERE mrna_brokstukken.eiwit_id != eiwit.eiwit_id)
WHERE exists
(SELECT eiwit_id FROM mrna_brokstukken WHERE mrna_brokstukken.eiwit_id = eiwit.eiwit_id)

But that gives me the following error:

ERROR: pq: syntax error at or near "WHERE"

Tags: postgresql sql

Answers

Just select the two at the same time:

select (
  select AVG(protein_length)
  FROM protein
  WHERE exists
  (select protein_id from mrna_pieces where mrna_brokstukken.protein_id = protein.protein_id)
) as avgwith,
(
  select AVG(protein_length)
  FROM protein
  WHERE exists
  (select protein_id from mrna_pieces where mrna_brokstukken.protein_id != protein.protein_id)
)
as avgwithout

as both queries will gove back a scalar value you can simply use bot as subqueries

SELECT

    (select AVG(protein_length)
    FROM protein
    WHERE exists
    (select protein_id from mrna_pieces where mrna_brokstukken.protein_id = protein.protein_id) ) AVG_1,
    (select AVG(protein_length)
    FROM protein
    WHERE exists
    (select protein_id from mrna_pieces where mrna_brokstukken.protein_id != protein.protein_id)) AVG_2

- ErwinBrandstetter
- January 13, 2023 at 1:28 am
- 0 votes
0
It should be cheapest to compute the "exists" value once per row in a subquery, then two conditional aggregates in the outer SELECT:
```
SELECT avg(protein_length) FILTER (WHERE p_exists) AS avg_len_exists
     , avg(protein_length) FILTER (WHERE NOT p_exists) AS avg_len_not_exists
FROM  (
   SELECT protein_length
        , EXISTS (SELECT FROM mrna_pieces m WHERE m.protein_id = p.protein_id) AS p_exists
   FROM   protein p
   ) sub;
```
About the aggregate FILTER clause (which can be combined with the OVER clause of window functions):
- Aggregate columns with additional (distinct) filters
Or, shorter if two result rows are acceptable:
```
SELECT EXISTS (SELECT FROM mrna_pieces m WHERE m.protein_id = p.protein_id) AS p_exists
     , avg(protein_length) avg_len
FROM   protein p
GROUP  BY 1;
```
Should be slightly faster, yet.

Either way, if your table mrna_pieces is big, an index with protein_id as leading or only column helps performance (a lot).

Note that the second query in your question is not doing what you are asking for. EXISTS ( ... WHERE m.protein_id != p.protein_id) is true if at least one row in table mrna_pieces does not match, which is always the case if there are at least two distinct values in m.protein_id. You would want NOT EXISTS ( ... WHERE m.protein_id = p.protein_id). But use the cheaper query I suggested instead.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Compute averages for two distinct sets from one table in a single query – Postgresql

Answers