
I have an XML file, and for specific tag names only I need to remove the dots (.) from the content. I don’t know in advance where the dots are or how many there are (e.g. "12345.6", "1.23.456.7", "ABC.456.98"). For example, if I have:

<?xml version="1.0"?>
<MyData>
   <test>A.123.236</test>
   <tag1>202400.000.0.0.17731</tag1>
   <tag2>some content</tag2>
   <tag3>some.content</tag3>
   <test>dotted.content.123</test>
   <data>
        <test>dsd456.1</test>
        <tag5>some.content</tag5>
   </data>
</MyData>

I want to remove dots within the content of the "test" tag, so:

<?xml version="1.0"?>
<MyData>
   <test>A123236</test>
   <tag1>202400.000.0.0.17731</tag1>
   <tag2>some content</tag2>
   <tag3>some.content</tag3>
   <test>dottedcontent123</test>
   <data>
        <test>dsd4561</test>
        <tag5>some.content</tag5>
   </data>
</MyData>

In VS Code, I can find the content with <test>(.+?)</test>, but I don’t know what to put in the replace field.
Thanks in advance.

2 Answers


  1. Are you using the VS Code find/replace option, or a programming language?

    I can help with this Python code:

    import re
    content ="""
    <?xml version="1.0"?>
    <MyData>
       <test>A.123.236</test>
       <tag1>202400.000.0.0.17731</tag1>
       <tag2>some content</tag2>
       <tag3>some.content</tag3>
       <test>dotted.content.123</test>
       <data>
            <test>dsd456.1</test>
            <tag5>some.content</tag5>
       </data>
    </MyData>
    """
    
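    # Build a non-greedy pattern that captures the text between <test> and </test>.
    tag_open = r'<test>'
    tag_close = r'</test>'
    pattern_text = tag_open + r'(.*?)' + tag_close
    pattern = re.compile(pattern_text, re.DOTALL)
    matches = pattern.findall(content)  # list of the captured inner texts
    fix_content = content
    for match in matches:
        # Note: str.replace swaps every occurrence of the captured text in the document.
        fix_content = fix_content.replace(match, match.replace('.', ''))
    print(fix_content)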
    

    The code finds the content of every <test> tag, then replaces that content with a copy that has the dots removed.

    This works as long as the same content does not also appear inside other tags, because str.replace changes every occurrence of the matched text.

    If the content is repeated elsewhere, drop the capture group: use .*? instead of (.*?), so that each match includes the surrounding tags.

    The replacement then covers the whole match, tags included, which works as long as the tag names themselves contain no dots.

  2. In the search-and-replace panel of the open document, enable regex mode, use a pattern like the following, and leave the replace field empty:

    (?<=<test>[^<]*)\.(?=[^<]*</test>)
    

    This matches any . character between <test> and </test>, provided there is no < between the two tags.

    Details:

    • (?<=<test>[^<]*) – a positive lookbehind that matches a location immediately preceded by <test> and then zero or more characters other than <
    • \. – a literal dot
    • (?=[^<]*</test>) – a positive lookahead that matches a location immediately followed by zero or more characters other than < and then a </test> string.