I need to join two tables: our_sample
and tls207_pers_appln
from PATSTAT.
our_sample
‘ has 4 columns: appln_id, appln_auth, appln_nr, appln_kind.
tls207_pers_appln
has 4 columns: appln_id, person_id, applt_seq_nr, invt_seq_nr.
our_sample
has 2191 rows and some (60) of them are missing in tls207_pers_appln
.
Because I want to join the tables keeping all the appln_id in our_sample
(even if they do not have the matching information from tls207_pers_appln
) I join the two tables doing a RIGHT JOIN.
However, the resulting view t2_tot_in_patent
has only 2096 appln_id.
This is in part due to the restriction I put (35 patents are dropped because I select only those HAVING MAX(invt_seq_nr) > 0
, which is fine). But this would yield 2191-35 = 2156 patents.
Instead, I get 2096 of them that is:
2191 (in our_sample) – 60 (appln_ids in our_sample that miss from tls207) – 35 (appln_ids for which invt_seq_nr = 0).
BUT the whole point of using RIGHT JOIN is that I should not loose those 60 patents. Why then?
-- compiling total count of inventors per patent: t2_tot_in_patent
DROP VIEW IF EXISTS t2_tot_in_patent;
CREATE VIEW t2_tot_in_patent AS
SELECT m.appln_id, MAX(invt_seq_nr) AS tot_in_patent
FROM patstat2022a.tls207_pers_appln AS t7
RIGHT OUTER JOIN cecilia.our_sample AS m
ON t7.appln_id = m.appln_id
GROUP BY appln_id
HAVING MAX(invt_seq_nr) > 0
2
Answers
Suggestion:
Maybe the SQL is grouping by appln_id of the other table.
Try this:
Right now your code is getting only rows where this max value is more than 0 but those that don’t exist in the other table are getting ignored because they would have this value as
NULL
.A right-join to our_sample in the second position is the same as a left-join starting with our_sample in the first position. No alias for the max( invt_seq_nr ) table which I would imply comes from the t7 table.
Now, it appears you want to have EVERYONE from the our_sample table regardless of the tls207_pers_appln. I would reverse to a left-join putting our_sample in primary position.
Remove the HAVING clause, that is going to chop off those people who never had an event. By adding the order by as I have here, I am basically floating to the top of the list all people who did NOT have any records in the t7 table.
At this point you should be able to get your entire EXPECTED count to be correct, but also see those that dont have and see if that pairs-up with the missing count encountered.
Once confirmed, you can re-apply the HAVING clause to only restrict those who HAD activity in the t7 table vs not.