In Python 3, how to apply html2text() better when the html includes markdown grammar - Artificial Intelligence

YWPKwon
December 23, 2016
82 views
0 votes
2 Answers

I am trying to crawl visible texts from a given URL.

It should work for any random url, so I cannot pre-assume html tags, elements, layouts, etc. Knowing perfect crawling seems difficult, I just hope to include most of natural language parts, and exclude most of non-natural language parts.

So far, I found that the combination of using BeautifulSoup and html2text seemed reasonably good.

e.g., Below are my skeleton codes.

url = 'https://en.wikipedia.org/wiki/Autonomous_car'
req = urllib.request.Request(url, headers={'User-Agent' : "Magic Browser"})

cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
response = opener.open(req)
html = response.read().decode('utf8', errors='ignore')
response.close()

# Get html string
soup = BeautifulSoup(html, "lxml")
htmltext = soup.encode('utf-8').decode('utf-8','ignore')
html2text.html2text(htmltext)

Then, I have resulting texts as below, which are not bad (all the html tags are gone), but they turn out markdown grammars.

# Autonomous car

From Wikipedia, the free encyclopedia

Jump to: navigation, search

For the wider application of artificial intelligence to automobiles, see [Unmanned ground vehicle](/wiki/Unmanned_ground_vehicle "Unmanned ground vehicle" ) and [Vehicular automation](/wiki/Vehicular_automation "Vehicular Automation").

[![](//upload.wikimedia.org/wikipedia/commons/thumb/6/65/Hands-free_Driving.jpg/230px-Hands-free_Driving.jpg)](/wiki/File:Hands free_Driving.jpg)

Junior, a robotic [Volkswagen Passat](/wiki/Volkswagen_Passat "Volkswagen Passat" ), at [Stanford University](/wiki/Stanford_University "Stanford University" ) in October 2009.

An **autonomous car** (**driverless car**,[1] **self-driving car**,[2] **robotic car**[3]) is a [vehicle](/wiki/Vehicular_automation "Vehicular automation" ) that is capable of sensing its environment and navigating without human input.[4]

Will there be a way of excluding markdown tags (esp. image and url links) and having a bit better sentences?

Tags: python web-scraping

Answers

Chosen as BEST ANSWER
- YWPKwon
- December 24, 2016 at 6:27 pm
- 0 votes
0
I found that html2text extracts texts from a give html with links and images in markdown grammars.

So instead of using, html2text.html2text(htmltext), you can manage some options by using
```
h = html2text.HTML2Text()
h.ignore_links = True
h.ignore_images = True
h.handle (htmltext)
```

(Edit)

For people who want to ignore all markdown grammers:

import html2text

htmlParser = html2text.HTML2Text()
htmlParser.body_width = 2 ** 31 - 1
htmlParser.handle_tag = lambda *args, **kwargs: None
htmlText = "<ul><li><quote><b>Hello World</b></quote></li></ul>"
print("Old: @{}@".format(html2text.html2text(htmlText).strip()))
print("New: @{}@".format(htmlParser.handle(htmlText).strip()))

Output:

Old: @* **Hello World**@
New: @Hello World@

Please signup or login to give your own answer.

Click here to cancel reply.

In Python 3, how to apply html2text() better when the html includes markdown grammar – Artificial Intelligence

Answers