
The programming language of choice is Python.
This is my HTML text:

<a href="https://www.example.com">Original message</a><br>
<ul id="list">
    <li class="blockbody" id="post_1">
        <div class="header">
            <div class="datetime">
                24 januari 2020, 11:34
            </div><span class="name">Jane Doe</span>
        </div>
        <div class="content">
            <blockquote class="restore">
                <div class="bbcode_container">
                    <i class="fa fa-envelope"></i> Citation:
                    <div class="bbcode_quote printable">
                        <hr>
                        <div>
                            Citation: John Doe <a href="showthread.php#post684209" rel="nofollow"><img alt="" class="inlineimg" src="image/style/Aesthetica/button/view.gif"></a>
                        </div>
                        <div class="message">
                            Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
                        </div>
                        <hr>
                    </div>
                </div><br>
                <div class="bbcode_container">
                    <i class="fa fa-envelope"></i> Citation:
                    <div class="bbcode_quote printable">
                        <hr>
                        Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?<br>
                        <br>
                        Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam.<br>
                        <br>
                        quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum..
                        <hr>
                    </div>
                </div>... <a href="https://example.com" target="_blank">https://example.html</a><br>
                <br>
                velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?
            </blockquote>
        </div>
    </li>
</ul>

I want to remove everything between the first <div class="bbcode_quote printable"> and the last <hr> tag. Both tags occur more than once, which is why I emphasize the first and the last.
I’m familiar with Python but string manipulation is not my field of expertise.
I hope I made myself clear.

2 Answers


  1. You can use the Beautiful Soup library – https://pypi.org/project/beautifulsoup4/

    Reference:


    from bs4 import BeautifulSoup
    # The HTML string provided
    html = """
    <a href="https://www.example.com">Original message</a><br>
    <ul id="list">
        <li class="blockbody" id="post_1">
            <div class="header">
                <div class="datetime">
                    24 januari 2020, 11:34
                </div><span class="name">Jane Doe</span>
            </div>
            <div class="content">
                <blockquote class="restore">
                    <div class="bbcode_container">
                        <i class="fa fa-envelope"></i> Citation:
                        <div class="bbcode_quote printable">
                            <hr>
                            <div>
                                Citation: John Doe <a href="showthread.php#post684209" rel="nofollow"><img alt="" class="inlineimg" src="image/style/Aesthetica/button/view.gif"></a>
                            </div>
                            <div class="message">
                                Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
                            </div>
                            <hr>
                        </div>
                    </div><br>
                    <div class="bbcode_container">
                        <i class="fa fa-envelope"></i> Citation:
                        <div class="bbcode_quote printable">
                            <hr>
                            Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?<br>
                            <br>
                            Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam.<br>
                            <br>
                            quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum..
                            <hr>
                        </div>
                    </div>... <a href="https://example.com" target="_blank">https://example.html</a><br>
                    <br>
                    velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?
                </blockquote>
            </div>
        </li>
    </ul>
    """
    
    soup = BeautifulSoup(html, 'html.parser')
    
    # Find the first div with class "bbcode_quote printable"
    first_quote = soup.find('div', class_='bbcode_quote printable')
    
    # Find the last <hr> in the whole document, not only inside the first div
    last_hr = soup.find_all('hr')[-1]
    
    # Ancestors of the last <hr> must stay, otherwise the <hr> would be removed with them
    keep = set(map(id, last_hr.parents))
    
    # Collect, in document order, every node after the opening tag up to the last <hr>.
    # Compare with "is": two <hr> tags with the same content compare equal with "==".
    to_remove = []
    for node in first_quote.next_elements:
        if node is last_hr:
            break
        if id(node) not in keep:
            to_remove.append(node)
    
    # Remove the collected nodes only after the traversal has finished
    for node in to_remove:
        node.extract()
    
    # The last <hr> itself is kept; uncomment the next line to remove it as well
    # last_hr.extract()
    
    # Print the modified HTML
    print(str(soup))
    
  2. Using regex, preserving the first and last tags:

    html = '''
    <a href="https://www.example.com">Original message</a><br>
    <ul id="list">
        <li class="blockbody" id="post_1">
            <div class="header">
                <div class="datetime">
                    24 januari 2020, 11:34
                </div><span class="name">Jane Doe</span>
            </div>
            <div class="content">
                <blockquote class="restore">
                    <div class="bbcode_container">
                        <i class="fa fa-envelope"></i> Citation:
                        <div class="bbcode_quote printable">
                            <hr>
                            <div>
                                Citation: John Doe <a href="showthread.php#post684209" rel="nofollow"><img alt="" class="inlineimg" src="image/style/Aesthetica/button/view.gif"></a>
                            </div>
                            <div class="message">
                                Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
                            </div>
                            <hr>
                        </div>
                    </div><br>
                    <div class="bbcode_container">
                        <i class="fa fa-envelope"></i> Citation:
                        <div class="bbcode_quote printable">
                            <hr>
                            Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?<br>
                            <br>
                            Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam.<br>
                            <br>
                            quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum..
                            <hr>
                        </div>
                    </div>... <a href="https://example.com" target="_blank">https://example.html</a><br>
                    <br>
                    velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?
                </blockquote>
            </div>
        </li>
    </ul>
    '''
    

    Then:

    import re
    
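    # End offset of the first opening tag: everything up to here is kept
    first = re.search(r'<div class="bbcode_quote printable">', html).span()[1]
    # Search the reversed string for the reversed tag to locate the last <hr>;
    # `last` is the distance from the end of html to the start of that <hr>
    last = re.search(fr'{"<hr>"[::-1]}', html[::-1]).span()[1]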
    
    new_html = html[:first] + html[len(html)-last:]
    

    Result:

    print(new_html)
    
    <a href="https://www.example.com">Original message</a><br>
    <ul id="list">
        <li class="blockbody" id="post_1">
            <div class="header">
                <div class="datetime">
                    24 januari 2020, 11:34
                </div><span class="name">Jane Doe</span>
            </div>
            <div class="content">
                <blockquote class="restore">
                    <div class="bbcode_container">
                        <i class="fa fa-envelope"></i> Citation:
                        <div class="bbcode_quote printable"><hr>
                        </div>
                    </div>... <a href="https://example.com" target="_blank">https://example.html</a><br>
                    <br>
                    velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?
                </blockquote>
            </div>
        </li>
    </ul>
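
The same slicing idea can probably be written without reversing the string. The sketch below shows two variants under stated assumptions: a single greedy regex substitution, and plain `str.find` / `str.rfind`. The helper names `strip_between_regex` and `strip_between_find`, the constants, and the short `sample` string are hypothetical and not part of the answers above. Both variants work on raw text and only match the literal `<hr>`, so markup such as `<hr/>` or a different attribute order would need a real parser.

```python
import re

# Literal markers taken from the question (assumed to appear exactly like this)
OPEN_TAG = '<div class="bbcode_quote printable">'
CLOSE_TAG = '<hr>'


def strip_between_regex(html: str) -> str:
    """Drop everything between the first OPEN_TAG and the last CLOSE_TAG."""
    # With re.DOTALL the greedy .* spans newlines and stretches from the
    # first opening tag to the last <hr>; both tags are kept via the groups.
    pattern = re.compile(
        '(' + re.escape(OPEN_TAG) + ').*(' + re.escape(CLOSE_TAG) + ')',
        flags=re.DOTALL,
    )
    return pattern.sub(r'\1\2', html, count=1)


def strip_between_find(html: str) -> str:
    """Same result using plain string searches instead of a regex."""
    start = html.find(OPEN_TAG)
    end = html.rfind(CLOSE_TAG)
    # Leave the input untouched when a marker is missing or out of order
    if start == -1 or end == -1 or end < start + len(OPEN_TAG):
        return html
    return html[:start + len(OPEN_TAG)] + html[end:]


# Small hypothetical input with two quote blocks, each holding two <hr> tags
sample = (
    '<p>before</p>'
    '<div class="bbcode_quote printable"><hr>one<hr></div>'
    '<div class="bbcode_quote printable"><hr>two<hr></div>'
    '<p>after</p>'
)

print(strip_between_regex(sample))
print(strip_between_find(sample))
```

Both functions return the input unchanged when the markers are not found, which avoids the `AttributeError` that `re.search(...).span()` raises on a failed match.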
    