
I wrote code that converts a Word document to HTML using pypandoc, because I also want the images included. The problem is that my docx file contains the characters ‘ and ’, which turn into something different in the HTML when it is sent as a mail body. I want ‘ and ’ to be replaced with ', a normal apostrophe.

Check the attached images so that the difference is clear enough.

source

expected result

I tried a few approaches, shown in the code below. The commented-out lines are attempts that also failed.

# Read the HTML file
with open(html_file, 'r') as file:
    html_data = file.read()
            
    # Replace all occurrences of ‘ and ’ with '
    # print("called")
    html_data = re.sub("‘", "'", html_data)
    html_data = re.sub("’", "'", html_data)
    # html_data = re.sub(r'’', "'", html_data)
    # html_data =  re.sub(r'‘', "'", html_data)
    # html_data = re.sub(r'“', '"', html_data)
    # html_data = re.sub(r'”', '"', html_data)
    # html_data = html_data.replace("‘", "'")
    # html_data = html_data.replace("’", "'")
    # html_data = html_data.replace('“', "'")
    # html_data = html_data.replace("”", "'")

For example, my Word document contains the phrase "i’d like to", which should be converted to "i'd like to".

2 Answers


  1. I think you need to escape the apostrophe so that it does not conflict with the string delimiters:

    s = 'i’d like to'
    m = s.replace('’', '\'')
    print(m)
    

    Output:

    i'd like to
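
As an alternative to chaining several `replace` calls, a translation table can straighten all four curly quotes in a single pass. This is only a minimal sketch: `QUOTE_TABLE` is a name chosen here for illustration, and it assumes the text really contains the literal characters ‘ ’ “ ” rather than HTML entities.

```python
# Map each curly quote (by code point) to its straight ASCII equivalent.
QUOTE_TABLE = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / curly apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

s = "i\u2019d like to"
m = s.translate(QUOTE_TABLE)
print(m)
```

Writing the characters as `\u2018`-style escapes keeps the source file pure ASCII, so the script does not depend on how the editor saved it.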
    
  2. Try this; it works:

        import re

        # Read the HTML file
        with open(html_file, 'r') as file:
            html_data = file.read()

        # Replace ‘ and ’, in both literal and HTML-entity form, with '
        html_data = re.sub("‘", "'", html_data)
        html_data = re.sub("’", "'", html_data)
        html_data = re.sub("&lsquo;", "'", html_data)
        html_data = re.sub("&rsquo;", "'", html_data)

    In HTML, ‘ is sometimes written as the entity &lsquo; and ’ as &rsquo;, so your code does not replace them.
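
The literal characters and their HTML-entity spellings can also be handled together in one pass. The sketch below is an illustration, not a definitive fix: `REPLACEMENTS` and `straighten_quotes` are hypothetical names, and the list of entity spellings (`&lsquo;`, `&#8216;`, and so on) is an assumption about what the generated HTML might contain. Other entities such as `&amp;` are deliberately left untouched.

```python
import re

# Curly quotes in literal form and in named/numeric entity form.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",
    "&lsquo;": "'", "&rsquo;": "'",
    "&#8216;": "'", "&#8217;": "'",
    "\u201c": '"', "\u201d": '"',
    "&ldquo;": '"', "&rdquo;": '"',
    "&#8220;": '"', "&#8221;": '"',
}

# One alternation matching any key; re.escape guards the '&' and '#'.
PATTERN = re.compile("|".join(re.escape(key) for key in REPLACEMENTS))

def straighten_quotes(html_data: str) -> str:
    """Replace every curly quote or quote entity with its ASCII quote."""
    return PATTERN.sub(lambda match: REPLACEMENTS[match.group(0)], html_data)

print(straighten_quotes("i&rsquo;d like to"))
print(straighten_quotes("i\u2019d like to"))
```

It would be called as `html_data = straighten_quotes(html_data)` after reading the file. If a curly quote ever appears inside an HTML attribute value, replacing it with a straight quote could end the attribute early, so checking the output is worthwhile.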
