
I want to merge 2 JSON files into one and remove all duplicate rows based on a column (the second column, userid). At the moment I merge two or more JSON files manually, then use Python code to remove all rows with a duplicated userid.

First json file:

    [
        {
            "userid": "567897068",
            "status": "UserStatus.RECENTLY",
            "name": "btb appeal court",
            "bot": false,
            "username": "None"
        },
        {
            "userid": "6403980168",
            "status": "UserStatus.RECENTLY",
            "name": "Ah",
            "bot": false,
            "username": "fearpic"
        },
        {
            "userid": "7104649590",
            "status": "UserStatus.RECENTLY",
            "name": "Da",
            "bot": false,
            "username": "Abc130000"
        },
        {
            "userid": "5813962086",
            "status": "UserStatus.RECENTLY",
            "name": "Sothea",
            "bot": false,
            "username": "SotheaSopheap169"
        }
    ]

Second json file:

    [
        {
            "userid": "567897068",
            "status": "UserStatus.RECENTLY",
            "name": "btb appeal court",
            "bot": false,
            "username": "None"
        },
        {
            "userid": "111111111111",
            "status": "UserStatus.RECENTLY",
            "name": "Ah",
            "bot": false,
            "username": "fearpic"
        },
        {
            "userid": "7104649590",
            "status": "UserStatus.RECENTLY",
            "name": "Da",
            "bot": false,
            "username": "Abc130000"
        },
        {
            "userid": "555555555555",
            "status": "UserStatus.RECENTLY",
            "name": "Sothea",
            "bot": false,
            "username": "SotheaSopheap169"
        }
    ]

merged file should be:

    [
        {
            "userid": "567897068",
            "status": "UserStatus.RECENTLY",
            "name": "btb appeal court",
            "bot": false,
            "username": "None"
        },
        {
            "userid": "6403980168",
            "status": "UserStatus.RECENTLY",
            "name": "Ah",
            "bot": false,
            "username": "fearpic"
        },
        {
            "userid": "7104649590",
            "status": "UserStatus.RECENTLY",
            "name": "Da",
            "bot": false,
            "username": "Abc130000"
        },
        {
            "userid": "5813962086",
            "status": "UserStatus.RECENTLY",
            "name": "Sothea",
            "bot": false,
            "username": "SotheaSopheap169"
        },
        {
            "userid": "111111111111",
            "status": "UserStatus.RECENTLY",
            "name": "Ah",
            "bot": false,
            "username": "fearpic"
        },
        {
            "userid": "555555555555",
            "status": "UserStatus.RECENTLY",
            "name": "Sothea",
            "bot": false,
            "username": "SotheaSopheap169"
        }
    ]

I have used the following code to remove duplicates based on the userid column in a JSON file I merged manually:

    import json
    with open('source_user_all.json', 'r', encoding='utf-8') as f:
        jsons = json.load(f)

    ids = set()
    jsons2 = []
    for item in jsons:
        if item['userid'] not in ids:
            ids.add(item['userid'])
            jsons2.append(item)
            
    with open('source_user.json', 'w', encoding='utf-8') as nf:
        json.dump(jsons2, nf, indent=4)

The above works well.

Is there an easy way to merge multiple JSON files and remove all duplicates based on a column before writing a single output file?

Thanks

2 Answers


  1. You just need to build a dictionary by looping over your input files.

    Like this:

    import json
    
    files = ["json1.json", "json2.json"]
    
    td = dict()
    
    for file in files:
        with open(file) as fd:
            for d in json.load(fd):
                uid = d["userid"]
                if uid not in td:
                    td[uid] = d
    
    with open("merged.json", "w") as fd:
        json.dump(list(td.values()), fd, indent=2)
    

    You could remove the conditional check (uid not in td) if you don't care which duplicate survives: without it, a later occurrence simply overwrites the earlier one, so the last duplicate is kept instead of the first.
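
    A minimal self-contained sketch of that variant (the sample data and file names here are hypothetical, written inline only so the example runs on its own):

```python
import json

# Hypothetical sample inputs, created here so the sketch is
# self-contained; normally these files would already exist.
with open("json1.json", "w", encoding="utf-8") as f:
    json.dump([{"userid": "1", "name": "a"}, {"userid": "2", "name": "b"}], f)
with open("json2.json", "w", encoding="utf-8") as f:
    json.dump([{"userid": "2", "name": "B"}, {"userid": "3", "name": "c"}], f)

files = ["json1.json", "json2.json"]

td = {}
for file in files:
    with open(file, encoding="utf-8") as fd:
        for d in json.load(fd):
            # No membership check: a later duplicate overwrites the
            # earlier entry, so the LAST occurrence wins.
            td[d["userid"]] = d

with open("merged.json", "w", encoding="utf-8") as fd:
    json.dump(list(td.values()), fd, indent=2)
```

    Here userid "2" appears in both inputs; the merged file keeps the version from json2.json.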

  2. You know how to apply the logic to a list of dicts (your items).

    Now you want to apply it to a list (one per file) of lists of dicts, so just add another loop around it.

    Using dict.setdefault makes the code nicer and shorter:

    import json
    from pathlib import Path
    
    files = ["file_1.json", "file_2.json", "file_3.json"]
    result = {}
    for file in files:
        for item in json.loads(Path(file).read_text()):
            result.setdefault(item["userid"], item)
    
    Path("merged.json").write_text(json.dumps(list(result.values()), indent=4))
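
    If the file names aren't known in advance, a glob pattern can collect them instead of hardcoding the list. A sketch of that, assuming your files follow a naming pattern like file_*.json (the inline sample data is hypothetical, written only so the example runs on its own):

```python
import json
from pathlib import Path

# Hypothetical sample inputs so the sketch is self-contained.
Path("file_1.json").write_text(json.dumps([{"userid": "1", "name": "a"}]))
Path("file_2.json").write_text(json.dumps([{"userid": "1", "name": "A"},
                                           {"userid": "2", "name": "b"}]))

result = {}
# sorted() makes the "first occurrence wins" order deterministic,
# since glob() does not guarantee any particular file order.
for file in sorted(Path(".").glob("file_*.json")):
    for item in json.loads(file.read_text(encoding="utf-8")):
        result.setdefault(item["userid"], item)

Path("merged.json").write_text(json.dumps(list(result.values()), indent=4))
```

    setdefault only stores an item if its userid is not already present, so the first file that contributes a given userid wins.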
    