I’ve got some JSON that looks like this:
{"name": "John",
"description": "I'm just \"A BOY\" okay? He said \"Hello, World!\" to everyone.",
"remark": "\"This is a test\" he mentioned."}
And the \" instances are breaking json.loads().
import json
json_string = '''{"name": "John",
"description": "I'm just \"A BOY\" okay? He said \"Hello, World!\" to everyone.",
"remark": "\"This is a test\" he mentioned."}'''
data = json.loads(json_string)
print(data)
raises:
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 2 column 27 (char 43)
I feel like I’ve tried every regex under the sun to target these instances (but leave all the other double quotes, not preceded by a backslash) and replace them with an empty string (functionally just strip them). If anyone has tips I’d appreciate it.
My implementation right now is something like:
import re
# Define a regular expression pattern to match \" within a string
pattern = r'\"'
# Use re.sub to replace all occurrences of the pattern with an empty string
cleaned_string = re.sub(pattern, '', json_string)
print(cleaned_string)
But when I run this in a REPL, nothing changes.
For reference, I’d just like the output to be:
{"name": "John",
"description": "I'm just A BOY okay? He said Hello, World! to everyone.",
"remark": "This is a test he mentioned."}
Edit: for clarity, this is just an example of the nature of the input data I’m working with. It’s coming from AWS CloudWatch logs, so I don’t have an easy way to manipulate the input before pulling it into Python. For example, part of the payload is something like
"\"Girl Let's Talk\" Virtual 90s Kickback"
In context:
{"search_ads": [ {"event_id": "4838383", "ad_id": "1112", "budget_amount": 5.0, "currency": "USD", "marketplace": "Online_US", "score": 18.205433, "p_click": 0.0, "p_order": 0.0, "goal": 2, "category_id": 113, "subcategory_id": 13999, "format": null, "is_paid": false, "online_event": true, "event_start_date": "2024-06-28T00:00:00Z", "latitude": null, "longitude": null, "name": "\"Girl Let's Talk\" Virtual 90s Kickback", "vip_status": false, "is_participant": true}]}
So the \" characters are really the only problem: if I copy all that input into VS Code and just search for and delete that pattern, json.loads() works great as is.
As one commenter mentioned, I think what I’m looking for is a regex that will match and strip the pattern \" but I’ve had no luck with that so far! I’ve only been able to strip either the backslashes, which leaves me with double quotes that break json.loads() (expecting a delimiter, i.e. it thinks this is another JSON key/value pair), or all the double quotes, which of course breaks json.loads() completely.
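For illustration, here is a minimal sketch of the substitution described above. It assumes the backslashes are actually present in the text Python receives (a raw string literal stands in for the log payload here), and that no string value ends in an escaped backslash, which this simple pattern would mishandle. The variable names are only illustrative.

```python
import json
import re

# Raw string (r prefix) so each backslash stays in the text,
# as it would when the payload is read from a log rather than typed as a literal.
raw = r'''{"name": "John",
"description": "I'm just \"A BOY\" okay? He said \"Hello, World!\" to everyone.",
"remark": "\"This is a test\" he mentioned."}'''

# In a regex, \\ matches one literal backslash, so r'\\"' matches
# a backslash followed by a double quote and nothing else.
cleaned = re.sub(r'\\"', '', raw)

data = json.loads(cleaned)
print(data["description"])
print(data["remark"])
```

The unescaped quotes that delimit keys and values are left alone, because the pattern only matches a quote that directly follows a backslash.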
2 Answers
Your JSON snippet seems valid, but you need to write \\" in your code. It is probably less confusing to load it from a file, which avoids the double backslash.
Also, pattern = r'\"' is too much. Write pattern = r'"' or pattern = '\"'.

You do not need to remove \". It’s part of the data.* What you’re having a problem with is Python’s interpretation of string literals. In a regular (non-raw) string literal, the sequence \" is an escape sequence that turns into just ". This can be solved with a raw string (r prefix).
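A small sketch of the difference, using a shortened version of the question's payload (the variable names plain and raw are just for illustration):

```python
import json

# Regular literal: Python turns \" into a bare ", so the backslashes are gone
# before json.loads() ever sees the text.
plain = '''{"remark": "\"This is a test\" he mentioned."}'''

# Raw literal: the backslashes are kept, so the text is valid JSON.
raw = r'''{"remark": "\"This is a test\" he mentioned."}'''

try:
    json.loads(plain)
except json.JSONDecodeError as exc:
    print("regular literal failed:", exc)

data = json.loads(raw)
print(data["remark"])

# If the inner quotes really are unwanted, they can be dropped after parsing.
print(data["remark"].replace('"', ''))
```

Stripping the quotes after parsing, as in the last line, avoids having to touch the JSON text with a regex at all.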
However, you might prefer to put the JSON in a separate file and use json.load(), to avoid having to muck around with string literals at all.

* To be more precise, it’s part of the JSON. In a JSON string, \" represents ", which is the raw data.
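The file-based approach could look roughly like this. To keep the sketch self-contained it first writes a shortened payload to a temporary file; payload.json is a hypothetical name, and in practice the file would already exist (for example, exported from the logs).

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the real payload; the raw literal is only needed here
# because this demo creates the file itself.
payload = r'{"remark": "\"This is a test\" he mentioned."}'

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "payload.json"  # hypothetical file name
    path.write_text(payload, encoding="utf-8")

    # Reading from a file involves no Python string literal,
    # so the \" sequences reach the JSON parser untouched.
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)

print(data["remark"])
```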