skip to Main Content

I have two tables, table 2 has one row per variable, while table 1 has many rows per variable. I want to randomly select 1000 rows from table 2 (1000 variables) and then join those to table 1. Final table will have many rows per variable. I’m also adding conditions for the selection from both tables. This is my code but it’s not picking 1000 variables from table 2:

SELECT t1.var1, t1.var2, t2.var3, t2.var4
FROM table1 t1
RIGHT JOIN (SELECT var1, var3 FROM table2 ORDER BY RAND() LIMIT 1000) t2 
ON t1.var1 = t2.var2
WHERE t1.var2 < 90 AND t2.var4 = "asd";

What am I missing?
Also, I’ve tested this putting a limit of 10 variables, and sometimes I get zero rows…

2

Answers


  1. Your query could easily filter out all results.

    SELECT t1.var1, t1.var2, t2.var3, t2.var4
    FROM table1 t1
    RIGHT JOIN (SELECT var1, var3 FROM table2 ORDER BY RAND() LIMIT 1000) t2 
    ON t1.var1 = t2.var2
    WHERE t1.var2 < 90 AND t2.var4 = "asd";
    

    This selects the 1000 random rows before testing for the rows where t2.var4 = 'asd'. I assume not all rows match that condition. So it’s possible that none of the 1000 randomly selected rows match that condition.

    Also you are testing t1.var2 < 90 after the outer join, which means only rows of t1 that have at least one corresponding row of t2 matching that range condition will be included. This also creates the possibility that the query returns no results, because not all rows of t1 have such corresponding rows of t2.

    You should be able to fix both problems.

    SELECT t1.var1, t1.var2, t2.var3, t2.var4
    FROM table1 t1
    RIGHT JOIN (
      SELECT var1, var3 FROM table2 
      WHERE var4 = 'asd' ORDER BY RAND() LIMIT 1000
    ) t2 
    ON t1.var1 = t2.var2 AND t1.var2 < 90;
    

    This puts the condition on t2.var4 inside the subquery, so the subquery returns 1000 randomly selected rows from the subset of rows that match.

    Also putting the condition on t1.var2 into the join condition means that rows of t2 will be returned even if they they have no corresponding rows of t1 that match that condition.

    Login or Signup to reply.
  2. RIGHT JOIN is commonly avoided, because right-join queries are way harder to read than left-join queries. Your query equals:

    SELECT t1.var1, t1.var2, t2.var3, t2.var4
    FROM (SELECT var1, var3 FROM table2 ORDER BY RAND() LIMIT 1000) t2 
    LEFT OUTER JOIN table1 t1 ON t1.var1 = t2.var2
    WHERE t1.var2 < 90 AND t2.var4 = 'asd';
    

    In an outer joined row all t1 values are NULL, so WHERE t1.var2 < 90 dismisses these rows. Your query is hence

    SELECT t1.var1, t1.var2, t2.var3, t2.var4
    FROM (SELECT var1, var3 FROM table2 ORDER BY RAND() LIMIT 1000) t2 
    INNER JOIN table1 t1 ON t1.var1 = t2.var2
    WHERE t1.var2 < 90 AND t2.var4 = 'asd';
    

    Now, what does this query do? It first picks 1000 random t2 rows. Then it dismisses all that don’t match t2.var4 = 'asd'. It also joins t1 rows matching t1.var2 < 90 and dismisses all t2 rows that don’t have such a match. This leaves you with 0 to 1000 t2 rows / variables and their matches – how many different variables is left to chance.

    It seems you rather want:

    SELECT t1.var1, t1.var2, t2.var3, t2.var4
    FROM
    (
      SELECT var1, var3
      FROM table2
      WHERE var4 = 'asd'
      ORDER BY RAND()
      LIMIT 1000
    ) t2 
    LEFT JOIN table1 t1 ON t1.var1 = t2.var2 AND t1.var2 < 90;
    

    i.e., 1000 vars from table 2 whether they have t1 matches or not.

    Or

    SELECT var1, var2, var3, var4
    FROM
    (
      SELECT
        t1.var1, t1.var2, t2.var3, t2.var4,
        RANK() OVER (ORDER BY t2.var) AS rnk
      FROM table2 t2
      JOIN table1 t1 ON t1.var1 = t2.var2 AND t1.var2 < 90
      WHERE var4 = 'asd'
    ) ranked
    WHERE rnk <= 1000;
    

    i.e., 1000 vars from table 2 that have t1 matches.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search