skip to Main Content

I have a query to show the average length of all the proteins which also exists in a second table mrna_pieces:

SELECT AVG(protein_length)
FROM protein
WHERE exists
(SELECT protein_id FROM mrna_pieces WHERE mrna_brokstukken.protein_id = protein.protein_id)

I also want to show the average for all the proteins which don’t exist in the second table.

SELECT AVG(protein_length)
FROM protein
WHERE exists
(SELECT protein_id FROM mrna_pieces WHERE mrna_brokstukken.protein_id != protein.protein_id)

But I want these 2 parts in one table like this
table example
I tried:

SELECT AVG(eiwit_lengte) AS avglengthwith, AVG(eiwit_lengte) AS avglengthwithout
FROM eiwit
WHERE exists
(SELECT eiwit_id FROM mrna_brokstukken WHERE mrna_brokstukken.eiwit_id != eiwit.eiwit_id)
WHERE exists
(SELECT eiwit_id FROM mrna_brokstukken WHERE mrna_brokstukken.eiwit_id = eiwit.eiwit_id)

But that gives me the following error:

ERROR: pq: syntax error at or near "WHERE"

3

Answers


  1. Just select the two at the same time:

    select (
      select AVG(protein_length)
      FROM protein
      WHERE exists
      (select protein_id from mrna_pieces where mrna_brokstukken.protein_id = protein.protein_id)
    ) as avgwith,
    (
      select AVG(protein_length)
      FROM protein
      WHERE exists
      (select protein_id from mrna_pieces where mrna_brokstukken.protein_id != protein.protein_id)
    )
    as avgwithout
    
    Login or Signup to reply.
  2. as both queries will gove back a scalar value you can simply use bot as subqueries

    SELECT
    
        (select AVG(protein_length)
        FROM protein
        WHERE exists
        (select protein_id from mrna_pieces where mrna_brokstukken.protein_id = protein.protein_id) ) AVG_1,
        (select AVG(protein_length)
        FROM protein
        WHERE exists
        (select protein_id from mrna_pieces where mrna_brokstukken.protein_id != protein.protein_id)) AVG_2
    
    Login or Signup to reply.
  3. It should be cheapest to compute the "exists" value once per row in a subquery, then two conditional aggregates in the outer SELECT:

    SELECT avg(protein_length) FILTER (WHERE p_exists) AS avg_len_exists
         , avg(protein_length) FILTER (WHERE NOT p_exists) AS avg_len_not_exists
    FROM  (
       SELECT protein_length
            , EXISTS (SELECT FROM mrna_pieces m WHERE m.protein_id = p.protein_id) AS p_exists
       FROM   protein p
       ) sub;
    

    About the aggregate FILTER clause (which can be combined with the OVER clause of window functions):

    Or, shorter if two result rows are acceptable:

    SELECT EXISTS (SELECT FROM mrna_pieces m WHERE m.protein_id = p.protein_id) AS p_exists
         , avg(protein_length) avg_len
    FROM   protein p
    GROUP  BY 1;
    

    Should be slightly faster, yet.

    Either way, if your table mrna_pieces is big, an index with protein_id as leading or only column helps performance (a lot).

    Note that the second query in your question is not doing what you are asking for. EXISTS ( ... WHERE m.protein_id != p.protein_id) is true if at least one row in table mrna_pieces does not match, which is always the case if there are at least two distinct values in m.protein_id. You would want NOT EXISTS ( ... WHERE m.protein_id = p.protein_id). But use the cheaper query I suggested instead.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search