
I’ve found this regex, which seems to work in Java but not in Python.
I found it here: Regex: Remove Commas within quotes

r"""(\G(?!^)|"body"\s*:\s*")([^",]*),"""

I am very inexperienced with building my own regexes and need help translating it into a format Python accepts. I know the \G is the problem, as it doesn’t exist in Python(?)

I want to use it to replace commas in the "body" key in my JSON file but I get:

bad escape \G at position 1

2 Answers


  1. What you could do is first capture the data within the body value, and then split it on the comma.

    The pattern would be,

    "body"\s*:\s*"(.+)",
    

    "… I am very inexperienced with building my own regexes …"

    It’s quite simple; it just seems complex.

    I recommend reading the Wikipedia article, as it outlines the entire syntax.
    Wikipedia – Regular expression.

    Additionally there are several books; O’Reilly Media and Packt Publishing both have some good ones.
    O’Reilly Media – Mastering Regular Expressions.
    Packt Publishing – Mastering Python Regular Expressions.

    Here is an example in Python.
    I’m using the re.match function to utilize the pattern.
    And, the str.split function to generate a list of values.

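    import re

    string = '"body": "abc, def, ghi",'
    # Capture everything between the quotes of the "body" value.
    value = re.match(r'"body"\s*:\s*"(.+)",', string)
    value = value.group(1)
    # Split the captured value on the comma separators.
    strings = value.split(', ')
    print(strings)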
    

    Output

    ['abc', 'def', 'ghi']
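
One caveat when applying this to a whole JSON file: re.match only matches at the very beginning of the string, so it finds nothing when "body" is not the first thing in the text. A minimal sketch using re.search instead, assuming a made-up multi-line input (the "id" and "foo" keys are purely illustrative):

```python
import re

# Hypothetical multi-line input; "body" is not at the start of the text.
text = '{\n  "id": 1,\n  "body": "abc, def, ghi",\n  "foo": "bar"\n}'

pattern = r'"body"\s*:\s*"(.+)",'

# re.match anchors at the beginning of the string, so it fails here.
assert re.match(pattern, text) is None

# re.search scans the whole string for the first match.
found = re.search(pattern, text)
parts = found.group(1).split(', ') if found else []
print(parts)
```

Because . does not match newlines by default, the greedy (.+) stays on the line that holds the "body" value.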
    
  2. The proper solution

    Use the JSON module to parse the JSON, modify the resulting object, then stringify again:

    import json
    
    json_str = '{"body": "a, b", "foo": "bar", "baz": "foo"}'
    json_obj = json.loads(json_str)
    def replace_commas_body(obj):
        # Recurse through lists and dicts; rewrite every string "body" value.
        if isinstance(obj, list):
            for elem in obj:
                replace_commas_body(elem)
        elif isinstance(obj, dict):
            if isinstance(obj.get("body"), str):
                obj["body"] = obj["body"].replace(",", ";")
            for elem in obj.values():
                replace_commas_body(elem)
    replace_commas_body(json_obj)
    print(json.dumps(json_obj))
    

    This prints {"body": "a; b", "foo": "bar", "baz": "foo"} as expected.

    The RegEx solution

    Just for the kicks of it, I also wrote a RegEx solution. This works fine because regular expressions suffice to tokenize JSON, and the changes required here can pretty much be done at a token level.

    The (slim) advantage of this solution is that it preserves the "style" of the JSON (spacing, ordering of keys, number formatting used, characters escaped, etc.).

    The disadvantage is that it is less readable / more difficult to maintain and much less flexible: It doesn’t know the context. Changing this to e.g. apply only to certain objects which also have certain other keys or are inside certain other objects would not be possible.
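
To illustrate the kind of context-dependent change meant here, this is a rough sketch of how the JSON-module approach could restrict the replacement to certain objects. The "type" key and its "comment" value are assumptions made up for the example, not something from the question:

```python
import json

# Hypothetical data: only objects whose "type" is "comment" should be changed.
json_str = '[{"type": "comment", "body": "a, b"}, {"type": "title", "body": "c, d"}]'

def replace_commas_in_comments(obj):
    if isinstance(obj, list):
        for elem in obj:
            replace_commas_in_comments(elem)
    elif isinstance(obj, dict):
        # The extra condition on a sibling key is what the regex cannot express.
        if obj.get("type") == "comment" and isinstance(obj.get("body"), str):
            obj["body"] = obj["body"].replace(",", ";")
        for elem in obj.values():
            replace_commas_in_comments(elem)

data = json.loads(json_str)
replace_commas_in_comments(data)
print(json.dumps(data))
```

Only the first object's "body" is modified; the second keeps its comma.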

    For the sake of simplicity, I have decided not to handle Unicode escape sequences of the form \uXXXX at all. ASCII characters may theoretically be encoded that way, but typically they won’t be, and it’s trivial to fix the solution to handle them properly by using (?:<character>|\\u<ASCII code>) for each ASCII character within the string (the characters b, o, d, y and the comma).

    Here goes the RegEx: ("body"\s*:\s*")((?:[^"\\]|\\.)*?)(")

    Explanation:

    • ("body"\s*:\s*"): Capture the "body" : " part to preserve it; allow for arbitrary spacing, just like JSON.
    • ((?:[^"\\]|\\.)*?): Capture the body of the string. Allow zero or more items, each of which is either any character other than a backslash or a string quote, or a backslash escape (a backslash followed by any one character). This is how escaped quotes inside the string are skipped over.
    • ("): Preserve the closing quote of the string. We could have hardcoded this in the lambda just as well, but this is cleaner if you want(ed) to extend this in the future.
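
As a quick check of the escape handling described in the list above, here is a small sketch that runs the pattern against a value containing escaped quotes. The input string is invented for illustration:

```python
import json
import re

pattern = re.compile(r'("body"\s*:\s*")((?:[^"\\]|\\.)*?)(")')

# Hypothetical input: the "body" value contains escaped quotes (\") and commas.
json_str = r'{"body": "he said \"hi, there\", then left", "foo": "a, b"}'

result = pattern.sub(
    lambda m: m.group(1) + m.group(2).replace(",", ";") + m.group(3),
    json_str,
)
print(result)

# The result should still be valid JSON, with only the "body" value changed.
parsed = json.loads(result)
```

The escaped quotes do not end the match early, and the comma in "foo" is left alone because that key is never matched.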

    and here’s the corresponding Python, with a small test case:

    import re
    
    json_str = '{"body": "a, b", "foo": "bar", "baz": "foo"}'
    print(re.sub(r'("body"\s*:\s*")((?:[^"\\]|\\.)*?)(")',
        lambda m: m.group(1) + m.group(2).replace(",", ";") + m.group(3),
        json_str))
    

    In my examples, I’m replacing commas (,) with semicolons (;). You link a question about removing commas but mention replacing commas (with what?) in your question so I was unsure.
