I have a large jsonl file like so:

# source.jsonl
{"id": "y88979", "content": "content goes here"}
{"id": "h93794", "content": "content goes here"}
{"id": "k9489", "content": "content goes here"}
{"id": "p48947", "content": "content goes here"}
{"id": "i8408", "content": "content goes here"}

I have a banned id list like so:

#banned_list.txt
k9489
p48947
</snip>

I now want to delete the lines whose "id" matches any of the ids in the banned list text file. So I am looking for this result:

{"id": "y88979", "content": "content goes here"}
{"id": "h93794", "content": "content goes here"}
{"id": "i8408", "content": "content goes here"}

Python would be too slow to iterate over this jsonl file (20 GB), and I see that jq is the best tool for this, but I am unsure about the syntax that would let it take all the ids from a list. 🙁

2 Answers


  1. You could read the banned ids using --rawfile, split the text at newline characters, and keep each JSON line only if its .id is not among them:

    jq -c --rawfile b banned_list.txt 'select(IN(.id; $b | (. / "\n")[]) | not)' source.jsonl
    

    This, however, would split the list of banned ids over and over on each input line, so it’d be better to prepare it beforehand into a proper JSON array using another call to jq, and --slurpfile to read the JSON strings into an array:

    jq -c --slurpfile b <(jq -R . banned_list.txt) 'select(IN(.id; $b[]) | not)' source.jsonl
    

    Output:

    {"id":"y88979","content":"content goes here"}
    {"id":"h93794","content":"content goes here"}
    {"id":"i8408","content":"content goes here"}
    

    You could improve on this further by sorting the list of banned ids and using bsearch for a binary search.
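
    A minimal sketch of that binary-search idea, assuming the banned list is sorted once into a JSON array beforehand. The temp directory, the file name banned_sorted.json, and the tiny demo data are hypothetical stand-ins for the real files; whether this is actually faster on a 20 GB input depends on the jq version and the size of the list, so it is worth measuring:

    ```shell
    # Hypothetical demo data in a temp directory, standing in for the real files.
    dir=$(mktemp -d)
    printf '%s\n' \
      '{"id": "y88979", "content": "content goes here"}' \
      '{"id": "h93794", "content": "content goes here"}' \
      '{"id": "k9489", "content": "content goes here"}' \
      '{"id": "p48947", "content": "content goes here"}' \
      '{"id": "i8408", "content": "content goes here"}' > "$dir/source.jsonl"
    printf '%s\n' p48947 k9489 > "$dir/banned_list.txt"

    # One-off step: read the raw lines, drop empty ones, store a sorted JSON array.
    jq -Rn '[inputs | select(length > 0)] | sort' "$dir/banned_list.txt" > "$dir/banned_sorted.json"

    # bsearch returns a negative number when the value is absent,
    # so "< 0" keeps only the objects whose id is not banned.
    kept=$(jq -c --slurpfile b "$dir/banned_sorted.json" \
      'select(.id as $id | $b[0] | bsearch($id) < 0)' "$dir/source.jsonl")
    printf '%s\n' "$kept"
    ```

    Note that --slurpfile wraps the file's contents in an outer array, which is why the sorted list is addressed as $b[0] here.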

  2. Given your flat file with the list of banned ids:

    $ cat banned_list.txt
    k9489
    p48947
    

    You can convert it to JSON (note: the awk call strips any '\r' characters):

    $ awk '{gsub(/\r/,"",$0);print $0}' banned_list.txt  | jq --raw-input '.' | jq '.' -s > banned.json
    
    $ cat banned.json
    [
      "k9489",
      "p48947"
    ]
    

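    Now you can select the objects with the "test" function (a regex match).
    Note that $ban|join("|") yields the pattern "k9489|p48947" (the pipe is the regex OR) and "not" inverts the selection. The pattern is wrapped in ^( ... )$ so that an id which merely contains a banned id as a substring is not dropped; this assumes the ids contain no regex metacharacters. Also note that the jq 1.6 manual marks --argfile as "do not use" in favour of --slurpfile, with which the array would be $ban[0]:

    $  jq --argfile ban  banned.json 'select(.id|test("^(" + ($ban|join("|")) + ")$")|not)' source.jsonl -c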
    
    {"id":"y88979","content":"content goes here"}
    {"id":"h93794","content":"content goes here"}
    {"id":"i8408","content":"content goes here"}
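
    A regex alternation grows with the banned list, and test matches substrings unless the pattern is anchored. As a possible alternative, here is a sketch of an exact-match variant that turns the banned.json array into a lookup object once and then streams the input with inputs. The temp directory and the demo data are hypothetical stand-ins for the real files, and the variable name $set is chosen only for illustration:

    ```shell
    # Hypothetical demo data in a temp directory, standing in for the real files.
    dir=$(mktemp -d)
    printf '%s\n' \
      '{"id": "y88979", "content": "content goes here"}' \
      '{"id": "h93794", "content": "content goes here"}' \
      '{"id": "k9489", "content": "content goes here"}' \
      '{"id": "p48947", "content": "content goes here"}' \
      '{"id": "i8408", "content": "content goes here"}' > "$dir/source.jsonl"
    printf '%s\n' '["k9489","p48947"]' > "$dir/banned.json"

    # -n stops jq from running the program once per input line, so the lookup
    # object $set is built a single time; "inputs" then streams the objects.
    kept=$(jq -nc --slurpfile ban "$dir/banned.json" '
      ($ban[0] | map({key: ., value: true}) | from_entries) as $set
      | inputs
      | select(.id as $id | $set | has($id) | not)
    ' "$dir/source.jsonl")
    printf '%s\n' "$kept"
    ```

    Because has does an exact key lookup, an id such as "k94890" would be kept even though it contains the banned id "k9489".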
    