
I cannot read the full text from this JSON file:

{
  "messages": [
    {
      "sender_name": "test",
      "timestamp_ms": 1554347140802,
      "content": "Ch\u00c3\u00a0o Anh/Ch\u00e1\u00bb\u008b, Anh/Ch\u00e1\u00bb\u008b vui l\u00c3\u00b2ng \u00c4\u0091\u00e1\u00bb\u0083 l\u00e1\u00ba\u00a1i S\u00e1\u00bb\u0090 \u00c4\u0090I\u00e1\u00bb\u0086N THO\u00e1\u00ba\u00a0I + T\u00c3\u008cNH TR\u00e1\u00ba\u00a0NG B\u00e1\u00bb\u0086NH \u00c4\u0091\u00e1\u00bb\u0083 D\u00c6\u00af\u00e1\u00bb\u00a2C S\u00c4\u00a8 CHUY\u00c3\u008aN M\u00c3\u0094N s\u00e1\u00ba\u00afp x\u00e1\u00ba\u00bfp t\u00c6\u00b0 v\u00e1\u00ba\u00a5n v\u00e1\u00bb\u0081 s\u00e1\u00ba\u00a3n ph\u00e1\u00ba\u00a9m, b\u00e1\u00bb\u0087nh t\u00c3\u00acnh c\u00e1\u00bb\u00a5 th\u00e1\u00bb\u0083 v\u00c3\u00a0 li\u00e1\u00bb\u0087u tr\u00c3\u00acnh ph\u00c3\u00b9 h\u00e1\u00bb\u00a3p cho Anh/Ch\u00e1\u00bb\u008b nh\u00c3\u00a9.",
      "is_geoblocked_for_viewer": false
    },
    {
      "sender_name": "",
      "timestamp_ms": 1554334611125,
      "content": "T\u00c3\u00b4i mu\u00e1\u00bb\u0091n \u00c4\u0091\u00e1\u00ba\u00b7t h\u00c3\u00a0ng",
      "is_geoblocked_for_viewer": false
    },
    {
      "sender_name": "test",
      "timestamp_ms": 1554334610788,
      "content": "Ch\u00c3\u00a0o Musickhc! Ch\u00c3\u00bang t\u00c3\u00b4i c\u00c3\u00b3 th\u00e1\u00bb\u0083 gi\u00c3\u00bap g\u00c3\u00ac cho b\u00e1\u00ba\u00a1n?",
      "is_geoblocked_for_viewer": false
    },
    {
      "sender_name": "test",
      "timestamp_ms": 1554334609955,
      "content": "Customer \u00c4\u0091\u00c3\u00a3 tr\u00e1\u00ba\u00a3 l\u00e1\u00bb\u009di tin nh\u00e1\u00ba\u00afn ch\u00c3\u00a0o m\u00e1\u00bb\u00abng t\u00e1\u00bb\u00b1 \u00c4\u0091\u00e1\u00bb\u0099ng c\u00e1\u00bb\u00a7a b\u00e1\u00ba\u00a1n. \u00c4\u0090\u00e1\u00bb\u0083 thay \u00c4\u0091\u00e1\u00bb\u0095i ho\u00e1\u00ba\u00b7c g\u00e1\u00bb\u00a1 l\u00e1\u00bb\u009di ch\u00c3\u00a0o n\u00c3\u00a0y, h\u00c3\u00a3y truy c\u00e1\u00ba\u00adp ph\u00e1\u00ba\u00a7n C\u00c3\u00a0i \u00c4\u0091\u00e1\u00ba\u00b7t tin nh\u00e1\u00ba\u00afn.",
      "is_geoblocked_for_viewer": false
    }
  ]
}

I am using this code:

import json

# The with block closes the file automatically, so no explicit close() is needed.
with open('message_1.json', 'r', encoding='utf-8') as file:
    data = json.load(file)
    print('message', data)

The result is
{'messages': [{'sender_name': 'test', 'timestamp_ms': 1554347140802, 'content': 'ChÃxa0o Anh/Chá»x8b, Anh/Chá»x8b vui lòng Äx91á»x83 lại Sá»x90 Äx90Iá»x86N THOáºxa0I + TÃx8cNH TRáºxa0NG Bá»x86NH Äx91á»x83 DƯỢC SĨ CHUYÃx8aN MÃx94N sắp xếp tÆ° vấn vá»x81 sản phẩm, bá»x87nh tình cụ thá»x83 vÃxa0 liá»x87u trình phù hợp cho Anh/Chá»x8b nhé.', 'is_geoblocked_for_viewer': False}, {'sender_name': '', 'timestamp_ms': 1554334611125, 'content': 'Tôi muá»x91n Äx91ặt hÃxa0ng', 'is_geoblocked_for_viewer': False}, {'sender_name': 'test', 'timestamp_ms': 1554334610788, 'content': 'ChÃxa0o Musickhc! Chúng tôi có thá»x83 giúp gì cho bạn?', 'is_geoblocked_for_viewer': False}, {'sender_name': 'test', 'timestamp_ms': 1554334609955, 'content': 'Customer Äx91ã trả lá»x9di tin nhắn chÃxa0o mừng tá»± Äx91á»x99ng của bạn. Äx90á»x83 thay Äx91á»x95i hoặc gỡ lá»x9di chÃxa0o nÃxa0y, hãy truy cáºxadp phần CÃxa0i Äx91ặt tin nhắn.', 'is_geoblocked_for_viewer': False}]}

Can someone help me read this file as UTF-8?
Thanks

2 Answers


  1. Chosen as BEST ANSWER

    I just got it working:

    import json


    with open('message_1.json', 'r', encoding='raw_unicode_escape') as file:
        # Turn the decoded escapes back into raw bytes, then decode those bytes as UTF-8.
        messages = json.loads(file.read().encode('raw_unicode_escape').decode())
        print(messages)
    

    But I don't know if there is a better way.
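To see why the raw_unicode_escape round trip works, here is a minimal sketch that runs the same steps on an in-memory byte string instead of the file. The sample bytes are a shortened stand-in built from the second message in the question; the variable names are only for illustration.

```python
import json

# A tiny stand-in for the file's bytes: "Tôi" with ô stored as \u00c3\u00b4.
raw = b'{"content": "T\\u00c3\\u00b4i"}'

# Decoding with raw_unicode_escape turns each \u00XX escape into the
# code point U+00XX, so the text now holds two wrong characters instead of ô.
text = raw.decode("raw_unicode_escape")

# Encoding back maps every code point below 256 to a single byte,
# which rebuilds the original UTF-8 byte sequence.
utf8_bytes = text.encode("raw_unicode_escape")

# Decoding those bytes as UTF-8 gives the intended text, which json can parse.
fixed = json.loads(utf8_bytes.decode("utf-8"))
print(fixed["content"])
```

One caveat, as far as I can tell: because the escapes are expanded before the JSON is parsed, an escaped quote written as \u0022 inside a string would become a bare quote and break parsing, so this trick is best suited to files like the one above.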


  2. Unfortunately, whatever generated this JSON file has mangled it: it encoded the Unicode characters as UTF-8, then wrote each resulting byte to the file as a separate escaped code point.

    For example, à should be written as \u00e0 directly, but instead it is written as \u00c3\u00a0.
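To make the mangling concrete, here is a small sketch of how à can end up as two escapes and how reversing the steps recovers it. This is a guess at what the generating tool does, reconstructed from the file contents, not its actual code.

```python
import json

original = "à"

# Step 1: the correct character is encoded as UTF-8, giving two bytes (c3 a0).
utf8_bytes = original.encode("utf-8")

# Step 2: each byte is wrongly treated as its own code point (a Latin-1 view).
mangled = utf8_bytes.decode("latin-1")

# Step 3: JSON-escaping the mangled text yields the sequence seen in the file.
escaped = json.dumps(mangled)

# Reversing steps 2 and 1 after parsing restores the intended character.
repaired = json.loads(escaped).encode("latin-1").decode("utf-8")
print(escaped, "->", repaired)
```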

    Your JSON file is broken. You have two options:

    1. Tell whoever or whatever generated your JSON file to output correctly formed JSON. JSON \uXXXX escapes encode UTF-16 code units, not UTF-8 bytes.
    2. If you cannot control your JSON file, you’ll need to fix it after loading. One option is to walk over the parsed data and fix every broken field. For example:
    for message in data["messages"]:
        message["content"] = message["content"].encode("latin1").decode("utf8")
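The loop above only repairs the content field. As a sketch of a more general version, the hypothetical helper repair() below walks the whole parsed structure and fixes every string it finds, including sender_name. It assumes every mangled string follows the UTF-8-as-Latin-1 pattern described above, and it leaves a string untouched when the round trip fails.

```python
import json

def repair(value):
    """Recursively undo the UTF-8-as-Latin-1 mangling in parsed JSON data."""
    if isinstance(value, str):
        try:
            return value.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not mangled in the expected way; leave the string untouched.
            return value
    if isinstance(value, list):
        return [repair(item) for item in value]
    if isinstance(value, dict):
        return {repair(key): repair(item) for key, item in value.items()}
    return value  # numbers, booleans, None

# Inline sample in the same shape as the file; real code would json.load() it.
sample = json.loads(
    '{"messages": [{"sender_name": "test", '
    '"content": "T\\u00c3\\u00b4i mu\\u00e1\\u00bb\\u0091n"}]}'
)
fixed = repair(sample)
print(fixed["messages"][0]["content"])
```

The try/except is a design choice rather than a guarantee: a correctly encoded string that happens to also be valid when reinterpreted this way would still be altered, so this is only safe on files known to be mangled throughout.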
    