I need to compare two tables with exactly the same fields/types and identify all records which are different in at least one field or do only exist in one of the two tables.
Both tables have an column id
with a unique value for each record. This would allow to identify corresponding records in both tables. But it does not mean, that those records are identical (any of the other columns can be different between the two records).
table0:
| id | col1 | col2 | ... |
| -- | ---- | ---- | --- |
| 22 | 1111 | 'gt' | ... |
| 23 | 5624 | 'ha' | ... |
| 24 | 7775 | 'oh' | ... |
| 26 | 2113 | 'yh' | ... |
| 28 | 9988 | 'wq' | ... |
table1:
| id | col1 | col2 | ... |
| -- | ---- | ---- | --- |
| 22 | 1111 | 'gt' | ... |
| 23 | 5624 | 'ha' | ... |
| 25 | 3333 | 'er' | ... |
| 26 | 2113 | 'ya' | ... |
| 28 | 9988 | 'wq' | ... |
What I need is the following result/output:
| id | reason |
| -- | ------ |
| 24 | only in table0 |
| 25 | only in table1 |
| 26 | not identical values |
A simple way would obviously be
SELECT *
FROM
(
SELECT *, 0 AS src /* added src to identify table 0 */
FROM table0
UNION ALL
SELECT *, 1 AS src /* added src to identify table 1 */
FROM table1
) temp
GROUP BY col1, col2, col3, ...
HAVING COUNT(*) = 1
But there are two problems with this solution:
-
I do not know the full set of columns at the time of writing this code. The table specification does change over time (the two tables are created by reading csv files, though the two tables to be compared are always identical). Is there a way to group by something like ‘*’?
-
If grouping is indeed possible in a way I described above, I would need to exclude the
src
value I have added to identify the table a resulting record comes from.
2
Answers
if you know the source and can sum a value of 1 using a window function then by definition if the result is 1 then the missing source is obvious. for example
in your case since you don’t know all the columns you would need to use dynamic sql.
NB I have no idea how this would perform with a lot of columns and a lot of data and for illustration purposes I haven’t attempted to optimise this process.
Also ‘different in at least one field’ isn’t a goer but it may be if the result of the above process skins down the result set to a reasonable size.
Your last edit makes it such a different question that it may be worth raising a new question and deleting this one that way some of the experienced contributors may revisit it.
Another approach would be to hash all the columns and compare the has values , you will still need to build the query from information_schema.columns but at least you don’t need to match every column individually. Here’s an example
NB I don’t know how this performs over a lot of columns but I guess there’s scope for chunking the hash values.