I have a large jsonl file like so:
# source.jsonl
{"id": "y88979", "content": "content goes here"}
{"id": "h93794", "content": "content goes here"}
{"id": "k9489", "content": "content goes here"}
{"id": "p48947", "content": "content goes here"}
{"id": "i8408", "content": "content goes here"}
I have a banned id list like so:
#banned_list.txt
k9489
p48947
I now want to delete the lines where the "id" matches any of the ids in the banned list text file, so I am looking for this result:
{"id": "y88979", "content": "content goes here"}
{"id": "h93794", "content": "content goes here"}
{"id": "i8408", "content": "content goes here"}
Python would be too slow to iterate over this jsonl file (20 GB), and I gather jq is the best tool for this, but I'm unsure about the syntax that would let it take all the ids from a list. 🙁
2 Answers
You could read the banned ids using --rawfile, split them at newline characters, and check for each JSON line read whether its .id is contained in the list.

This, however, would split the list of banned ids over and over on each input line, so it'd be better to prepare it beforehand into a proper JSON array using another call to jq, and read the JSON strings into an array with --slurpfile.
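A sketch of both invocations, using the filenames from the question (IN needs jq 1.5+, --rawfile needs jq 1.6+; the printf lines just recreate the sample files):

```shell
# sample data from the question (stand-in for the real 20 GB file)
printf '%s\n' \
  '{"id": "y88979", "content": "content goes here"}' \
  '{"id": "h93794", "content": "content goes here"}' \
  '{"id": "k9489", "content": "content goes here"}' \
  '{"id": "p48947", "content": "content goes here"}' \
  '{"id": "i8408", "content": "content goes here"}' > source.jsonl
printf '%s\n' k9489 p48947 > banned_list.txt

# variant 1: --rawfile reads the whole list as one raw string,
# which the filter then has to split on every input line
jq -c --rawfile banned banned_list.txt \
  'select(.id | IN($banned | split("\n")[] | select(. != "")) | not)' source.jsonl

# variant 2: convert the list to JSON strings once,
# then --slurpfile reads them into the $banned array
jq -R . banned_list.txt > banned.json
jq -c --slurpfile banned banned.json \
  'select(.id | IN($banned[]) | not)' source.jsonl

# output (both variants):
# {"id":"y88979","content":"content goes here"}
# {"id":"h93794","content":"content goes here"}
# {"id":"i8408","content":"content goes here"}
```

Either way, jq processes the input one line at a time, so the 20 GB file is never loaded into memory at once.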
You could even improve on this by sorting the list of banned ids and using bsearch for a binary search.

A second approach: having your flat file with the list of banned ids (banned_list.txt), you can convert it into a single pattern string (note: the awk ensures no stray '\r' characters from Windows line endings).
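A sketch of that conversion (the exact awk program here is an assumption; any command that joins the ids with "|" and strips carriage returns will do):

```shell
# recreate the sample list, with CRLF endings to show the \r handling
printf '%s\r\n' k9489 p48947 > banned_list.txt

# join the ids with "|" into one alternation pattern;
# sub(/\r$/, "") drops a trailing carriage return, NF skips blank lines
ban=$(awk '{ sub(/\r$/, "") } NF { printf "%s%s", sep, $0; sep = "|" }' banned_list.txt)
echo "$ban"   # k9489|p48947
```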
Now you can select all objects with the test function (a regex match). Note that the $ban variable holds "k9489|p48947" (the pipe is a logical OR) and not inverts the selection.
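The final call could look like this, passing the pattern in with --arg (the setup lines just recreate the sample input and the $ban built in the previous step):

```shell
# sample input from the question; $ban as built in the previous step
printf '%s\n' \
  '{"id": "y88979", "content": "content goes here"}' \
  '{"id": "h93794", "content": "content goes here"}' \
  '{"id": "k9489", "content": "content goes here"}' \
  '{"id": "p48947", "content": "content goes here"}' \
  '{"id": "i8408", "content": "content goes here"}' > source.jsonl
ban='k9489|p48947'

# test($ban) regex-matches .id against "k9489|p48947";
# "not" inverts it, so only non-banned lines survive
jq -c --arg ban "$ban" 'select(.id | test($ban) | not)' source.jsonl
# {"id":"y88979","content":"content goes here"}
# {"id":"h93794","content":"content goes here"}
# {"id":"i8408","content":"content goes here"}
```

One caveat: test() is an unanchored regex match, so an id like "k94890" would also be dropped; use test("^(" + $ban + ")$") if you need exact matches.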