deleting lines from jsonl file with matching key/value using jq

JohnJ
November 23, 2023

I have a large jsonl file like so:

# source.jsonl
{"id": "y88979", "content": "content goes here"}
{"id": "h93794", "content": "content goes here"}
{"id": "k9489", "content": "content goes here"}
{"id": "p48947", "content": "content goes here"}
{"id": "i8408", "content": "content goes here"}

I have a banned id list like so:

# banned_list.txt
k9489
p48947

I now want to delete the lines whose "id" matches any of the ids in the banned list text file. So I am looking for this result:

{"id": "y88979", "content": "content goes here"}
{"id": "h93794", "content": "content goes here"}
{"id": "i8408", "content": "content goes here"}

Python would be too slow to iterate over this JSONL file (20 GB). jq seems to be the best tool for this, but I am unsure about the syntax that would let it take all the ids from a list. 🙁

Tags: jq json jsonlines

Answers

- pmf
- November 22, 2023 at 6:15 pm
- 0 votes
You could read the banned ids using --rawfile, split the text at newline characters, and keep each JSON line only if its .id is not among them:
```
jq -c --rawfile b banned_list.txt 'select(IN(.id; $b | (. / "\n")[]) | not)' source.jsonl
```
This, however, would split the list of banned ids over and over on each input line, so it’d be better to prepare it beforehand into a proper JSON array using another call to jq, and --slurpfile to read the JSON strings into an array:
```
jq -c --slurpfile b <(jq -R . banned_list.txt) 'select(IN(.id; $b[]) | not)' source.jsonl
```
Output:
```
{"id":"y88979","content":"content goes here"}
{"id":"h93794","content":"content goes here"}
{"id":"i8408","content":"content goes here"}
```
You could improve on this further by sorting the list of banned ids and using bsearch for a binary search.
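A minimal sketch of the bsearch idea, assuming the sample files from the question. The scratch directory and the file name banned_sorted.json are made up for illustration. bsearch returns a non-negative index when the value is found in a sorted array and a negative number otherwise, so keeping only the rows with a negative result drops the banned ones:

```shell
# Sample data from the question, written to a scratch directory (illustrative).
tmp=$(mktemp -d)
cat > "$tmp/source.jsonl" <<'EOF'
{"id": "y88979", "content": "content goes here"}
{"id": "h93794", "content": "content goes here"}
{"id": "k9489", "content": "content goes here"}
{"id": "p48947", "content": "content goes here"}
{"id": "i8408", "content": "content goes here"}
EOF
printf 'k9489\np48947\n' > "$tmp/banned_list.txt"

# Turn the banned ids into a sorted JSON array once, up front.
jq -R . "$tmp/banned_list.txt" | jq -s 'sort' > "$tmp/banned_sorted.json"

# --slurpfile wraps the file's contents in an array, so the sorted list is $b[0].
# Bind the current row to $row because "." changes inside the pipe to bsearch.
result=$(jq -c --slurpfile b "$tmp/banned_sorted.json" \
  '. as $row | select(($b[0] | bsearch($row.id)) < 0)' "$tmp/source.jsonl")
printf '%s\n' "$result"
```

With the sample data this should print the same three lines as the output above. The array has to be sorted with jq's own sort so that its ordering agrees with what bsearch expects.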

- AlejandroBermúdez
- November 23, 2023 at 1:33 am
- 0 votes
Given your flat file with the list of banned ids:
```
$ cat banned_list.txt
k9489
p48947
```
You can convert it to JSON (note: the awk call strips any '\r' characters):
```
$ awk '{gsub(/\r/,"",$0);print $0}' banned_list.txt | jq --raw-input '.' | jq '.' -s > banned.json

$ cat banned.json
[
  "k9489",
  "p48947"
]
```
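Now you can select objects with the "test" function (a regex match). The pattern is built from the banned ids joined with "|" (the pipe is the regex OR), and "not" inverts the selection.
Note that an unanchored pattern such as "k9489|p48947" would also drop ids that merely contain a banned id, so the pattern is wrapped in ^( and )$ to match whole ids only. This assumes the ids contain no regex metacharacters. --slurpfile is used because the jq manual advises against --argfile; it wraps the file's contents in an array, hence $ban[0]:
```
$ jq -c --slurpfile ban banned.json 'select(.id | test("^(" + ($ban[0] | join("|")) + ")$") | not)' source.jsonl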

{"id":"y88979","content":"content goes here"}
{"id":"h93794","content":"content goes here"}
{"id":"i8408","content":"content goes here"}
```
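If the ids might contain regex metacharacters, an exact lookup avoids regex altogether. This is only a sketch of a swapped-in technique, not part of the approach above: it assumes every line has a string "id", and the scratch directory and the file name banned_set.json are made up for illustration. The banned ids become the keys of a JSON object, built once, and each row is kept only if its id is not a key:

```shell
# Sample data from the question, written to a scratch directory (illustrative).
tmp=$(mktemp -d)
cat > "$tmp/source.jsonl" <<'EOF'
{"id": "y88979", "content": "content goes here"}
{"id": "h93794", "content": "content goes here"}
{"id": "k9489", "content": "content goes here"}
{"id": "p48947", "content": "content goes here"}
{"id": "i8408", "content": "content goes here"}
EOF
printf 'k9489\np48947\n' > "$tmp/banned_list.txt"

# Build {"k9489": true, "p48947": true} once, up front.
jq -R . "$tmp/banned_list.txt" \
  | jq -s 'map({key: ., value: true}) | from_entries' > "$tmp/banned_set.json"

# --slurpfile wraps the object in an array, so the lookup table is $s[0].
# Keep a row only when its id is not a key of that object.
kept=$(jq -c --slurpfile s "$tmp/banned_set.json" \
  '.id as $id | select($s[0] | has($id) | not)' "$tmp/source.jsonl")
printf '%s\n' "$kept"
```

With the sample data this should print the three non-banned lines. Because the comparison is an exact key lookup, an id such as "k94890" would not be dropped just because it contains a banned id.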