skip to Main Content

I have a DB in postgres with two tables, ‘responses’ and ‘variables’, like this:

responses

respondent_id variable_name response
01 name susan
01 age 42
02 name bob
02 age 13

variables

variable_name type
name text
age integer

Note: The data in the ‘response’ column of the responses table is stored as text.

I’m trying to write a query that will "pivot" the data in the responses table from "long" to "wide" so that there is one row for each respondent_id, and a column for each of the variables in the ‘variable_name’ table… AND I want to convert the types of the columns returned according to the types listed in the variables table. For instance, the ‘age’ column in the pivoted result should be cast as an integer.

The problem I’m running into is trying to use a "scalar subquery" to extract the type information from the variables table which can then be fed to a CAST expression. Here is the query I’ve tried (which doesn’t work):

SELECT
  respondent_id,
  CAST(
    max(case when variable_name = 'age' then response else null end) AS
    (SELECT variables.type FROM variables WHERE variable_name = 'age')
  ) AS 'age',
  CAST(
    max(case when variable_name = 'name' then response else null end) AS
    (SELECT variables.type FROM variables WHERE variable_name = 'name')
  ) AS 'name'
FROM responses GROUP BY respondent_id

This query throws a syntax error. As best I can tell, this is because a scalar subquery cannot be used in the ‘AS’ part of CAST. I think this is true because the following query works:

SELECT
  respondent_id,
  CAST(
    max(case when variable_name = 'age' then response else null end) AS
    integer
  ) AS 'age',
  CAST(
    max(case when variable_name = 'name' then response else null end) AS
    text
  ) AS 'name'
FROM responses GROUP BY respondent_id

How can I feed information from the variables table into my query so that the columns will be cast to the correct type?

One more note: Hardcoding the types in the CAST expressions is not an option for me. What I’ve shown here is a greatly simplified example. In the actual case I’m working with, there are about 1800 rows in the variables table, and I need a parameterized query that can create the pivoted-and-correctly-cast table for any arbitrary combination of those variables.

Thanks!

2

Answers


  1. Using another language like R or Python can facilitate this. To illustrate, below is R code (using DuckDB, which is largely equivalent to PostgreSQL). You can even use it just to generate the SQL (seen as output from show_query() below).

    library(tidyverse)
    library(DBI)
    
    db <- dbConnect(duckdb::duckdb())
    
    responses <- tribble(
        ~respondent_id, ~variable_name, ~response,
        01, "name", "susan",
        01, "age",  "42",
        02, "name", "bob",
        02, "age",  "13") |>
      copy_to(db, df = _, name = "responses", overwrite = TRUE)
    
    variables <- tribble(
      ~variable_name, ~type,
      "name", "text",
      "age", "integer") |>
      copy_to(db, df = _, name = "variables", overwrite = TRUE)
    
    ints <- 
      variables |> 
      filter(type == "integer") |> 
      select(variable_name) |> 
      pull()
    
    result <-
      responses |>
      pivot_wider(id_cols = respondent_id,
                  names_from = "variable_name",
                  values_from = "response") |>
      mutate(across(any_of(ints), as.integer))
      
    result |>
      show_query()
    #> <SQL>
    #> SELECT respondent_id, CAST(age AS INTEGER) AS age, "name"
    #> FROM (
    #>   SELECT
    #>     respondent_id,
    #>     MAX(CASE WHEN (variable_name = 'age') THEN response END) AS age,
    #>     MAX(CASE WHEN (variable_name = 'name') THEN response END) AS "name"
    #>   FROM responses
    #>   GROUP BY respondent_id
    #> ) q01
    

    Created on 2024-07-30 with reprex v2.1.1

    Login or Signup to reply.
  2. Having 1800+ variables would make the resultset very ‘wide’ and sparsely populated. I would rather suggest a single aggregated JSON column (hopefully much more usable in the tier above) with the variable names as attributes.
    So, first create a simple helper function

    create or replace function jsonb_dyntype(v text, t text)
    returns jsonb language sql as 
    $$
    select 
     case t
        when 'text'     then to_jsonb(v::text)
        when 'date'     then to_jsonb(v::date)
        when 'number'   then to_jsonb(v::numeric)
        when 'integer'  then to_jsonb(v::integer)
        when 'datetime' then to_jsonb(v::timestamp)
        -- other types 
     end
    $$;
    

    and then

    select jsonb_object_agg(variable_name, jsonb_dyntype(response, type)) 
    from responses r join variables v using (variable_name)
    group by respondent_id;
    
    jsonb_object_agg
    {"age": 13, "name": "bob"}
    {"age": 42, "name": "susan"}

    DB Fiddle demo

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search