I want to merge 2 JSON files into one JSON file and remove all duplicate rows based on one column (the userid column). At the moment I merge two or more JSON files manually, then I use Python code to remove all rows with a duplicated userid.
First json file:
[
{
"userid": "567897068",
"status": "UserStatus.RECENTLY",
"name": "btb appeal court",
"bot": false,
"username": "None"
},
{
"userid": "6403980168",
"status": "UserStatus.RECENTLY",
"name": "Ah",
"bot": false,
"username": "fearpic"
},
{
"userid": "7104649590",
"status": "UserStatus.RECENTLY",
"name": "Da",
"bot": false,
"username": "Abc130000"
},
{
"userid": "5813962086",
"status": "UserStatus.RECENTLY",
"name": "Sothea",
"bot": false,
"username": "SotheaSopheap169"
}
]
Second json file:
[
{
"userid": "567897068",
"status": "UserStatus.RECENTLY",
"name": "btb appeal court",
"bot": false,
"username": "None"
},
{
"userid": "111111111111",
"status": "UserStatus.RECENTLY",
"name": "Ah",
"bot": false,
"username": "fearpic"
},
{
"userid": "7104649590",
"status": "UserStatus.RECENTLY",
"name": "Da",
"bot": false,
"username": "Abc130000"
},
{
"userid": "555555555555",
"status": "UserStatus.RECENTLY",
"name": "Sothea",
"bot": false,
"username": "SotheaSopheap169"
}
]
merged file should be:
[
{
"userid": "567897068",
"status": "UserStatus.RECENTLY",
"name": "btb appeal court",
"bot": false,
"username": "None"
},
{
"userid": "6403980168",
"status": "UserStatus.RECENTLY",
"name": "Ah",
"bot": false,
"username": "fearpic"
},
{
"userid": "7104649590",
"status": "UserStatus.RECENTLY",
"name": "Da",
"bot": false,
"username": "Abc130000"
},
{
"userid": "5813962086",
"status": "UserStatus.RECENTLY",
"name": "Sothea",
"bot": false,
"username": "SotheaSopheap169"
},
{
"userid": "111111111111",
"status": "UserStatus.RECENTLY",
"name": "Ah",
"bot": false,
"username": "fearpic"
},
{
"userid": "555555555555",
"status": "UserStatus.RECENTLY",
"name": "Sothea",
"bot": false,
"username": "SotheaSopheap169"
}
]
I have used the following code to remove duplicates based on the userid column in a JSON file I merged manually:
import json

with open('source_user_all.json', 'r', encoding='utf-8') as f:
    jsons = json.load(f)

ids = set()
jsons2 = []
for item in jsons:
    if item['userid'] not in ids:
        ids.add(item['userid'])
        jsons2.append(item)

with open('source_user.json', 'w', encoding='utf-8') as nf:
    json.dump(jsons2, nf, indent=4)
The above works well.
Is there an easy way to merge multiple JSON files and remove all duplicates based on a column before writing to a single output file?
Thanks
2 Answers
You just need to build a dictionary by looping over your input files.
You could remove the conditional check (`not uid in td`) if you don't care which duplicate is removed.
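The answer's original code block did not survive, but the approach it describes can be sketched as follows. The dict name `td`, the variable `uid`, and the file names are assumptions (only `td` and `uid` are hinted at by the answer text); the sample files are created inline just so the sketch is self-contained.

```python
import json

# Create two tiny sample files so this sketch is self-contained
# (in practice you would already have your JSON files on disk).
file1 = [{"userid": "1", "name": "a"}, {"userid": "2", "name": "b"}]
file2 = [{"userid": "1", "name": "a"}, {"userid": "3", "name": "c"}]
for path, data in [("part1.json", file1), ("part2.json", file2)]:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)

# Build one dictionary keyed by userid while looping over the input
# files; the conditional keeps the first occurrence of each id.
td = {}
for path in ["part1.json", "part2.json"]:
    with open(path, "r", encoding="utf-8") as f:
        for item in json.load(f):
            uid = item["userid"]
            if uid not in td:
                td[uid] = item

merged = list(td.values())
with open("merged.json", "w", encoding="utf-8") as nf:
    json.dump(merged, nf, indent=4)
```

Since a dict can hold only one value per key, deduplication falls out of the data structure itself; no separate `set` of seen ids is needed.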
You know how to apply the logic on a `list` of `dict` (your items). Now you want to apply it to a `list` (each file) of `list` of `dict`, so just add another loop around it.
Using `dict.setdefault` you can have nicer and shorter code.
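This answer's code was also lost in extraction; a minimal sketch of the nested loop with `setdefault` might look like the following. The file contents are modelled as in-memory lists here (with real files you would `json.load` each one first), and the variable names are assumptions.

```python
import json

# Each inner list stands for the parsed contents of one JSON file.
files = [
    [{"userid": "1", "name": "a"}, {"userid": "2", "name": "b"}],
    [{"userid": "1", "name": "a"}, {"userid": "3", "name": "c"}],
]

td = {}
for items in files:      # outer loop: one iteration per file
    for item in items:   # inner loop: the original per-item logic
        # setdefault stores the item only if the userid is not
        # already a key, so the explicit `if` check disappears.
        td.setdefault(item["userid"], item)

merged = list(td.values())
```

Note that `setdefault` keeps the first occurrence of each userid, matching the behaviour of the explicit conditional check in the other answer.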