skip to Main Content

I have a dataset like this:

  • id : E.g. 111, 111, 111, 112, 112, 113, 113
  • Year: E.g. 2010, 2011, 2012, 2010, 2011,2010, 2015
  • Sex: E.g. M, M, F, F, F, M, M

In this dataset, ID = 111 had a sex change (switch from M to F – or from F to M)

With postgre sql, I try to find out:

  • A: How many ids stay as man (and which ids)
  • B: How many ids stay as woman (and which ids)
  • C: How many ids go from man to woman (and which ids)
  • D: How many ids go from woman to man (and which ids)

I try like this:

# problem A
SELECT COUNT(DISTINCT ID) FROM table WHERE ID NOT IN (SELECT ID FROM table WHERE SEX = 'M');
SELECT DISTINCT ID FROM table WHERE ID NOT IN (SELECT ID FROM table WHERE SEX = 'M');

# problem B
SELECT COUNT(DISTINCT ID) FROM table WHERE ID NOT IN (SELECT ID FROM table WHERE SEX = 'F');
SELECT DISTINCT ID FROM table WHERE ID NOT IN (SELECT ID FROM table WHERE SEX = 'F');

# all sex change
SELECT COUNT(DISTINCT ID) FROM table WHERE ID IN (SELECT ID FROM table WHERE SEX = 'M') AND ID IN (SELECT ID FROM table WHERE SEX = 'F');
SELECT DISTINCT ID FROM table WHERE ID IN (SELECT ID FROM table WHERE SEX = 'M') AND ID IN (SELECT ID FROM table WHERE SEX = 'F');

Is it correct? Or is window-lag function needed?

3

Answers


  1. You can try this, to calculate in advance some metrics:

     SELECT *
            ,MAX(CASE WHEN sex = 'M' THEN 1 ELSE 0 END) OVER (PARTITION BY ID) AS has_M
            ,MAX(CASE WHEN sex = 'F' THEN 1 ELSE 0 END) OVER (PARTITION BY ID) AS has_F
            ,DENSE_RANK() OVER (PARTITION BY id ORDER BY id, year) AS initial_sex
      FROM mytable;
    

    enter image description here

    and then solve your issues:

    SELECT SUM(CASE WHEN initial_sex = 1 AND SEX = 'M' THEN 1 ELSE 0 END) 
          ,string_agg(CASE WHEN initial_sex = 1 AND SEX = 'M' THEN CAST(id AS VARCHAR(12)) END, ', ') 
          ,SUM(CASE WHEN initial_sex = 1 AND SEX = 'F' THEN 1 ELSE 0 END) 
          ,string_agg(CASE WHEN initial_sex = 1 AND SEX = 'F' THEN CAST(id AS VARCHAR(12)) END, ', ') 
          ,SUM(CASE WHEN (initial_sex = 1 AND SEX = 'F' AND has_m = 1) OR (initial_sex = 1 AND SEX = 'M' AND has_F = 1)  THEN 1 ELSE 0 END) 
          ,string_agg(CASE WHEN (initial_sex = 1 AND SEX = 'F' AND has_m = 1) OR (initial_sex = 1 AND SEX = 'M' AND has_F = 1)  THEN CAST(id AS VARCHAR(12)) END, ', ') 
    FROM
    (
      SELECT *
            ,MAX(CASE WHEN sex = 'M' THEN 1 ELSE 0 END) OVER (PARTITION BY ID) AS has_M
            ,MAX(CASE WHEN sex = 'F' THEN 1 ELSE 0 END) OVER (PARTITION BY ID) AS has_F
            ,DENSE_RANK() OVER (PARTITION BY id ORDER BY id, year) AS initial_sex
      FROM mytable
    ) DS;
    

    enter image description here

    Here is the full working example.

    Login or Signup to reply.
  2. Assuming column SEX will only have either ‘F’ or ‘M’ as value, problem A can be solved

    problem A

    SELECT COUNT(DISTINCT ID) FROM table WHERE  SEX != 'F';
    SELECT DISTINCT ID FROM table WHERE  SEX != 'F';
    
    Login or Signup to reply.
  3. step-by-step demo: db<>fiddle

    Assuming, the change happens only once, you could use the first_value() window function:

    SELECT DISTINCT                                                                   -- 5
        id,
        CASE
            WHEN first_sex = last_sex THEN 'Stay ' || sex                             -- 3
            ELSE 'Change from ' || first_sex || ' To ' || last_sex                    -- 4    
        END sex_status
    FROM (
        SELECT 
            id,
            sex,
            first_value(sex) OVER (PARTITION BY id ORDER BY year) as first_sex,       -- 1
            first_value(sex) OVER (PARTITION BY id ORDER BY year DESC) as last_sex    -- 2
        FROM mytable
    ) s
    
    1. Fetch first sex values per id over years
    2. Fetch last sex values per id over years (Notice the different order: It gives the first value from the "bottom")
    3. Compare first and last; if they are the same, return "Stay" and sex
    4. Otherwise return "Change" with sexes. (Of course, you can do whatever you want here. Adding appropriate status identifiers or similar instead of pure text seems to make sense at this point.)
    5. DISTINCT clause to reduce the records to one per id.

    Afterwards you can do whatever statistics you want. For example counting the different status by GROUP BY sex_status:

    demo: db<>fiddle

    SELECT 
        sex_status,
        COUNT(*)
    FROM (
    
    -- query from above
    
    ) s
    GROUP BY sex_status
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search