I am writing a Python script to input a word into Collatinus, the online Latin morphological analysis software, and receive the full declension/conjugation of that word.
Here is what I have so far:
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
import pandas
from bs4 import BeautifulSoup as bs
import sys
#x = input("Word:")
edge_service = Service(executable_path=r"C:\Program Files (x86)\msedgedriver.exe")
driver = webdriver.Edge(service=edge_service)
driver.get("https://outils.biblissima.fr/en/collatinus-web/")
time.sleep(15)
driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.CONTROL + 'r')  # Ctrl+R refreshes on Windows
time.sleep(15)
driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.CONTROL + 'r')
time.sleep(15)
search = driver.find_element(By.ID, "flexion_lemme")
search.send_keys('canis')
time.sleep(7)
submit = driver.find_elements(By.TAG_NAME, "button")
correct = submit[4]
correct.click()
time.sleep(15)
html = driver.page_source
if html.find("Une erreur s'est produite") != -1:
    # The page showed an error -- refresh and retry the query once.
    driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.CONTROL + 'r')
    time.sleep(20)
    search = driver.find_element(By.ID, "flexion_lemme")
    search.send_keys('canis')
    time.sleep(7)
    submit = driver.find_elements(By.TAG_NAME, "button")
    correct = submit[4]
    correct.click()
    time.sleep(20)
    html = driver.page_source
    if html.find("Une erreur s'est produite") != -1:
        # Second failure: give up.
        driver.quit()
        raise ZeroDivisionError("Nope.")
else:
    results = driver.find_element(By.ID, "results")
    titlesh4 = results.find_elements(By.TAG_NAME, "h4")
    titlesp = results.find_elements(By.TAG_NAME, "p")
    titleshref = results.find_elements(By.XPATH, "//*[ text() = 'Formes composées' ]")
html = driver.page_source
with open('tables.html', "w", encoding="utf-8") as f:
    f.write(html)
with open("tables.html", "r", encoding="utf-8") as lh:
    soup = bs(lh, "html.parser")
prettyHTML = soup.prettify()
prettyHTML = prettyHTML.replace("ā", "a").replace("ă", "a").replace("ā̆", "a").replace("ē", "e").replace("ĕ", "e").replace("ē̆", "e").replace("ī", "i").replace("ĭ", "i").replace("ī̆", "i").replace("ō", "o").replace("ō̆", "o").replace("ŏ", "o").replace("ŭ", "u").replace("ū̆", "u").replace("ū", "u")
with open('tables.html', "w", encoding="utf-8") as f:
    f.write(prettyHTML)
It’s still in development.
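Incidentally, the long chain of .replace() calls near the end could be expressed more robustly with Unicode normalization, which also catches combining-mark variants like ā̆ that are easy to miss by hand. A sketch (not part of the original script; note it strips ALL combining marks, which is fine for plain Latin output):

```python
import unicodedata

def strip_length_marks(text: str) -> str:
    # Decompose each character ("ā" -> "a" + combining macron),
    # drop every combining mark (category "Mn"), then recompose.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

print(strip_length_marks("rēgīnā ā̆"))  # regina a
```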
The issue concerns the titles of the tables I am extracting from the page.
I want each table of data I draw from the HTML to have a title I can work with. But the titles the software gives come in differing forms: h4, p, a, etc. Also, some titles sit above others in a hierarchy; e.g. the table title 'subjonctif' will appear under another title such as 'actif'.
The only way I can see to do this is for the program to detect the title that precedes each table and decide from there.
Also, for the hierarchical ones, I would like the parent name to be included in each of the smaller titles, i.e. 'subjonctif' would become 'actif subjonctif'.
The last issue is that some titles contain two or three tables, so I would like those to be labelled, e.g., as "subjonctif #1" and "subjonctif #2".
In my opinion (please correct me if I am wrong), all these problems could be fixed easily if the program knew what came before each table.
I have not really tried anything yet, as I am not sure where to start.
If anybody could help, it would be really appreciated.
2 Answers
The first thing I am going to advise you is to give up selenium; you don't need to use it here. To successfully parse many sites, you need to stop thinking like a human and start thinking like a browser. I wrote a simple scraper that does without selenium: it just parses the necessary token for a POST request to this endpoint and gets the data. The end result is that we make the initial request, cookies are stored, and we get the token to work with. The spider looks like the following:
The token for your form can be found here:
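A minimal sketch of such a requests-based spider, assuming the token sits in a hidden input on the page and the form posts "lemme" and "token" fields; these names, and reusing the page URL as the endpoint, are guesses to be verified against the real form in the browser's network tab:

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://outils.biblissima.fr/en/collatinus-web/"

def extract_token(html: str) -> str:
    # Assumption: the token is stored in a hidden <input>; check the
    # actual field in your browser's dev tools and adjust the selector.
    soup = BeautifulSoup(html, "html.parser")
    field = soup.find("input", {"type": "hidden"})
    return field["value"] if field else ""

def flexion(lemma: str) -> str:
    # One Session keeps the cookies set by the first GET and sends them
    # automatically with the later POST.
    with requests.Session() as session:
        page = session.get(BASE)
        token = extract_token(page.text)
        # "lemme" and "token" are placeholder field names.
        resp = session.post(BASE, data={"lemme": lemma, "token": token})
        resp.raise_for_status()
        return resp.text
```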
To work you will need the following libraries: requests, beautifulsoup4, lxml. Install them using pip or another package manager. The spider will do everything itself the first time you connect: it will parse the token and set the cookies. To use it, do the following:
Now let's get to the interesting part. I have written a parsing function for you as I understand your request, but the data on your site is very chaotic. Sometimes there are no elements that can be used as headers at all; sometimes there are too many, and it is not clear which to use. Currently, the logic of the parser is:
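As a rough illustration of that header-detection logic (the sample HTML and tag names here are invented, since the real results markup may differ): walk the result nodes in document order, remember the last h4 as the parent title, take the most recent p or a as the sub-title, and number tables that share a title.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Invented stand-in for the Collatinus results markup.
html = """
<div id="results">
  <h4>actif</h4>
  <p>subjonctif</p>
  <table><tr><td>amem</td></tr></table>
  <table><tr><td>amarem</td></tr></table>
  <h4>passif</h4>
  <p>subjonctif</p>
  <table><tr><td>amer</td></tr></table>
</div>
"""

results = BeautifulSoup(html, "html.parser").find(id="results")

parent = ""        # most recent top-level heading (h4)
subtitle = ""      # most recent sub-heading (p or a)
counts = Counter() # how many tables already carry a given title
labels = []
for el in results.find_all(["h4", "p", "a", "table"]):
    if el.name == "h4":
        parent = el.get_text(strip=True)
        subtitle = ""
    elif el.name in ("p", "a"):
        subtitle = el.get_text(strip=True)
    else:  # a table: build "parent subtitle #n"
        title = " ".join(t for t in (parent, subtitle) if t)
        counts[title] += 1
        labels.append(f"{title} #{counts[title]}")

print(labels)  # ['actif subjonctif #1', 'actif subjonctif #2', 'passif subjonctif #1']
```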
Look at what you get and tell me what you expect as a name; for now it takes no parent names, only the current node's. I will then be able to advise you further.
Here’s part of the results:
I designed a template to scrape the data. Please see the comments:
Executing this template with an open-source Node.js package yields the following data: