I’ve found this regex that seems to work in Java but not in Python.
I found it here: Regex: Remove Commas within quotes
r"""(\G(?!^)|"body"\s*:\s*")([^",]*),"""
I am very inexperienced with building my own regexes and need help translating it into a format Python accepts. I know the \G is the problem, as it doesn’t exist in Python(?)
I want to use it to replace commas in the "body" key in my JSON file but I get:
bad escape \G at position 1
2 Answers
What you could do is first capture the data within the body value, and then parse on the comma.
The pattern would be,
It’s quite simple; it just seems complex.
I recommend reading the Wikipedia article, as it outlines the entire syntax.
Wikipedia – Regular expression.
Additionally there are several books; O’Reilly Media and Packt Publishing both have some good ones.
O’Reilly Media – Mastering Regular Expressions.
Packt Publishing – Mastering Python Regular Expressions.
Here is an example in Python.
I’m using the re.match function to apply the pattern, and the str.split function to generate a list of values.
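A minimal sketch of that two-step approach, for illustration only. The sample text and the pattern are assumptions, not the original ones; because re.match only matches at the start of the string, the assumed pattern first skips ahead to the "body" key. It does not handle escaped quotes inside the value.

```python
import re

# Hypothetical sample input (not taken from the question).
text = '{"foo": "bar", "body": "a, b, c"}'

# Assumed pattern: skip lazily to the "body" key, then capture its value.
# re.match anchors at the start of the string, hence the leading .*?
pattern = r'.*?"body"\s*:\s*"([^"]*)"'

match = re.match(pattern, text, re.DOTALL)

# Split the captured value on commas and trim the surrounding whitespace.
values = [part.strip() for part in match.group(1).split(",")] if match else []
print(values)  # ['a', 'b', 'c']
```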
Output
The proper solution
Use the JSON module to parse the JSON, modify the resulting object, then stringify again:
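A minimal sketch of this approach, assuming the "body" value is a plain string at the top level of the document. The sample input is an assumption, chosen to be consistent with the result shown below.

```python
import json

# Assumed sample input; the exact original string is not known.
raw = '{"body": "a, b", "foo": "bar", "baz": "foo"}'

# Parse the JSON text into a Python dict.
data = json.loads(raw)

# Replace the commas in the "body" value only.
data["body"] = data["body"].replace(",", ";")

# Serialize back to a JSON string.
result = json.dumps(data)
print(result)
```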
This prints
{"body": "a; b", "foo": "bar", "baz": "foo"}
as expected.
The RegEx solution
Just for the kicks of it, I also wrote a RegEx solution. This works fine because regular expressions suffice to tokenize JSON, and the changes required here can pretty much be done at a token level.
The (slim) advantage of this solution is that it preserves the "style" of the JSON (spacing, ordering of keys, number formatting used, characters escaped, etc.).
The disadvantage is that it is less readable / more difficult to maintain and much less flexible: It doesn’t know the context. Changing this to e.g. apply only to certain objects which also have certain other keys or are inside certain other objects would not be possible.
For the sake of simplicity, I have decided to not handle Unicode escape sequences of the form \uXXXX at all. The ASCII characters involved may theoretically be encoded that way, but typically they won’t be, and it’s trivial to fix the solution to handle them properly by using (?:<character>|\\u<ASCII code>) for each ASCII character within the string (the characters b, o, d, y and the comma).
Here goes the RegEx:
("body"\s*:\s*")((?:[^"\\]|\\.)*?)(")
Explanation:
("body"\s*:\s*") : Capture the "body" : " part to preserve it; allow for arbitrary spacing, just like JSON.
((?:[^"\\]|\\.)*?) : Capture the body of the string. Allow zero or more arbitrary characters except for backslashes or string quotes, which need to be escaped.
(") : Preserve the closing quote of the string. We could have hardcoded this in the lambda just as well, but this is cleaner if you want(ed) to extend this in the future.
And here’s the corresponding Python, with a small test case:
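A minimal sketch of how the pattern above could be applied with re.sub and a lambda. The names BODY_STRING and replace_commas and the sample string are illustrative assumptions. Like the pattern itself, it ignores \uXXXX escapes.

```python
import re

# The pattern from above: opening part, string body, closing quote.
BODY_STRING = re.compile(r'("body"\s*:\s*")((?:[^"\\]|\\.)*?)(")')


def replace_commas(json_text, replacement=";"):
    """Replace commas inside the value of every "body" key (hypothetical helper)."""
    return BODY_STRING.sub(
        # Keep groups 1 and 3 as they are; only rewrite the string body.
        lambda m: m.group(1) + m.group(2).replace(",", replacement) + m.group(3),
        json_text,
    )


# Small test case: an escaped quote inside "body", and commas in another key.
sample = '{"body" : "a, b \\" c, d", "foo": "x, y"}'
print(replace_commas(sample))
```

The escaped quote does not end the match early, and the commas in "foo" are left alone.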
In my examples, I’m replacing commas (,) with semicolons (;). You link a question about removing commas but mention replacing commas (with what?) in your question, so I was unsure.