Why I get so many more records with a condition in sql?
I try to build a sql table based on another sql table. The original commands are:
cursor.execute('DELETE FROM Dataplace.Queue')
cursor.execute(f"INSERT INTO Dataplace.Queue SELECT DISTINCT COSTCENTER, 'Pending', Null, Null FROM Dataplace.START")
Where the start table has the following structure:
Calmonth COSTCENTER NAME VALUE
202301 0002001234 A 10
202302 0002001234 A 20
202301 0002001234 B 15
202301 0001166666 A 12
202301 0001000666 A 11
...
And with the original command i get 378 entries, with the output for the Queue Table:
Key Status Container Duration
0002001234 Pending NULL NULL
0001166666 Pending NULL NULL
0001000666 Pending NULL NULL
When I run the commands:
cursor.execute('DELETE FROM Dataplace.Queue')
cursor.execute(f"INSERT INTO Dataplace.Queue SELECT DISTINCT COSTCENTER, 'Pending', Null, Null FROM Dataplace.START WHERE NOT (COSTCENTER LIKE '000200%' OR COSTCENTER LIKE '001000%')")
I get an output with 577 rows. I would expect an output less then above (378-84 see the following command). What do I miss here?
The commands:
cursor.execute('DELETE FROM Dataplace.Queue')
cursor.execute(f"INSERT INTO Dataplace.Queue SELECT DISTINCT COSTCENTER, 'Pending', Null, Null FROM Dataplace.START WHERE COSTCENTER LIKE '000200%' OR COSTCENTER LIKE '001000%'")
gives a output with 84 rows.
2
Answers
The discrepancy in the number of records you’re observing is likely due to the DISTINCT keyword and how the WHERE clause is filtering your data.
When you use DISTINCT, it ensures that all rows in the result set are unique. In your first INSERT statement, you are selecting distinct COSTCENTER values from the entire Dataplace.START table without any filtering. This gives you 378 unique COSTCENTER values.
In your second INSERT statement, you apply a filter that excludes rows where COSTCENTER starts with ‘000200’ or ‘001000’. Intuitively, you might expect this to reduce the number of rows in the result. However, the actual outcome depends on the distribution of COSTCENTER values in your table. If, after excluding these specific COSTCENTER values, there are more unique COSTCENTER values left in the data than you had initially, you can end up with more records, not fewer.
For the third INSERT statement, you’re explicitly filtering to include only COSTCENTER values that start with ‘000200’ or ‘001000’. This results in 84 rows, which indicates that there are 84 unique COSTCENTER values in your table that match this condition.
The key point to understand is that the number of distinct COSTCENTER values that do not start with ‘000200’ or ‘001000’ can be greater than the total number of distinct COSTCENTER values in your table. This can happen if there are many unique COSTCENTER values that don’t match these patterns.
To troubleshoot and understand your data better, you might want to check the distinct COSTCENTER values in your table and see how they are distributed. You can execute a query to count the number of unique COSTCENTER values that match and do not match your criteria:
This will give you a clearer picture of how your COSTCENTER values are distributed and why the result set sizes are different.
It is really hard to determine the correct query to execute without the Dataset, but I’ll try anyway:
If I understood, it’s just a matter of conditions and operators.
Doing a
will basically return everything.
Instead:
should return the correct result.
You need to retrieve the
COSTCENTER
s that don’t contain'000200%'
AND
'001000%'
.