
This is part of the JSON file I got as output after running a Python script that uses the Telethon API.

[{"_": "Message", "id": 4589, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T14:51:03+00:00", "message": "Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same", "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "from_scheduled": false, "legacy": false, "edit_hide": false, "from_id": 356886523, "fwd_from": null, "via_bot_id": null, "reply_to_msg_id": null, "media": null, "reply_markup": null, "entities": [], "views": null, "edit_date": null, "post_author": null, "grouped_id": null, "restriction_reason": []}, {"_": "MessageService", "id": 4588, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T11:48:18+00:00", "action": {"_": "MessageActionChatJoinedByLink", "inviter_id": 310378430}, "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "legacy": false, "from_id": 1264437394, "reply_to_msg_id": null}

As you can see, the Python script has scraped the chats from a particular Telegram channel. All I need is to store the date and message fields of the JSON in a separate dataframe so that I can apply filters and produce the output I need. Can anyone help me with this?

2 Answers


  1. I think you should use json.loads and then json_normalize to convert the JSON to a dataframe, with max_level controlling how deeply nested dictionaries are flattened.

    from pandas import json_normalize
    import json
    d = '[{"_": "Message", "id": 4589, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T14:51:03+00:00", "message": "Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same", "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "from_scheduled": false, "legacy": false, "edit_hide": false, "from_id": 356886523, "fwd_from": null, "via_bot_id": null, "reply_to_msg_id": null, "media": null, "reply_markup": null, "entities": [], "views": null, "edit_date": null, "post_author": null, "grouped_id": null, "restriction_reason": []}, {"_": "MessageService", "id": 4588, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T11:48:18+00:00", "action": {"_": "MessageActionChatJoinedByLink", "inviter_id": 310378430}, "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "legacy": false, "from_id": 1264437394, "reply_to_msg_id": null}]'
    f = json.loads(d)
    print(json_normalize(f, max_level=2))
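
To keep only the two fields the question asks for, the normalized frame can be narrowed to those columns. This is a minimal sketch, not part of the original answer: the sample below is a trimmed copy of the question's data (the message text is shortened), and the names raw, records, flat and df are chosen for illustration.

```python
import json
from pandas import json_normalize

# Trimmed sample in the same shape as the question's data (message text shortened)
raw = '''[
  {"_": "Message", "id": 4589,
   "to_id": {"_": "PeerChannel", "channel_id": 1399858792},
   "date": "2020-09-03T14:51:03+00:00",
   "message": "Looking for product managers"},
  {"_": "MessageService", "id": 4588,
   "to_id": {"_": "PeerChannel", "channel_id": 1399858792},
   "date": "2020-09-03T11:48:18+00:00",
   "action": {"_": "MessageActionChatJoinedByLink", "inviter_id": 310378430}}
]'''

records = json.loads(raw)                    # JSON string -> list of dicts
flat = json_normalize(records, max_level=2)  # nested keys become dotted columns, e.g. "to_id.channel_id"

# Keep only the two columns of interest; records without a "message" key get NaN
df = flat[["date", "message"]]
print(df)
```

The MessageService row has no message key, so its message value is missing (NaN) rather than raising an error.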
    
  2. This assumes the object returned from the API is already a list of dicts, not a JSON string (e.g. '[{...}, {...}]').
      • If it is a string, parse it first with data = json.loads(data).
    • The 'date' and corresponding 'message' can be extracted from the list of dicts with a list comprehension.
    • Iterate through each dict in the list and use dict.get for each key. If the key doesn't exist, None is returned.
    import pandas as pd
    
    # where data is the list of dicts, unpack the desired keys and load into pandas
    df = pd.DataFrame([{'date': i.get('date'), 'message': i.get('message')} for i in data])
    
    # display(df)
                            date                                                                                                                                                            message
    0  2020-09-03T14:51:03+00:00  Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same
    1  2020-09-03T11:48:18+00:00                                                                                                                                                               None
    

    Alternatively

    • To skip entries where 'message' is missing or None, filter them out inside the list comprehension.
    df = pd.DataFrame([{'date': i['date'], 'message': i['message']} for i in data if i.get('message')])
    
                            date                                                                                                                                                            message
    0  2020-09-03T14:51:03+00:00  Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same
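
Since the question mentions applying filters afterwards, here is a rough sketch of how that could look once the dataframe exists. It assumes the dates are ISO 8601 strings as in the sample; the cutoff time and the keyword are hypothetical values picked only for illustration, and the sample is a trimmed copy of the question's data.

```python
import pandas as pd

# Trimmed sample in the same shape as the question's data (message text shortened)
data = [
    {"date": "2020-09-03T14:51:03+00:00", "message": "Looking for product managers / engineers"},
    {"date": "2020-09-03T11:48:18+00:00", "action": {"_": "MessageActionChatJoinedByLink"}},
]

df = pd.DataFrame([{"date": d.get("date"), "message": d.get("message")} for d in data])

# Parse the ISO 8601 strings into timezone-aware timestamps
df["date"] = pd.to_datetime(df["date"], utc=True)

# Hypothetical filters: messages at or after a cutoff that mention a keyword
cutoff = pd.Timestamp("2020-09-03T12:00:00", tz="UTC")
mask = (df["date"] >= cutoff) & df["message"].str.contains("engineers", case=False, na=False)
filtered = df[mask]
print(filtered)
```

Passing na=False to str.contains treats rows with no message as non-matches, so service messages drop out of the result without a separate filtering step.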
    