
I am trying to use BeautifulSoup to scrape thesaurus.com in order to quickly find synonyms for certain words. However, the synonyms are in a list whose ids and classes differ per word, so the best I can do is access a grandparent that is the same for every word. Here is a simplified example:

<div data-testid="same_between_words">
    <ul class="different_between_words">
        <li>
            <a data-linkid="same_between_words_2">Word 1</a>
        </li>
        <li>
            <a data-linkid="same_between_words_2">Word 2</a>
        </li>
    </ul>
</div>

There are also similar words, which are fine to include if necessary, and antonyms, which are obviously not fine to include. In case it matters, the words all share the same data-linkid, both on one page and across different words, but the antonyms share it too, so I haven’t gotten that to work. My current code is:

from bs4 import BeautifulSoup
import requests

url = "https://www.thesaurus.com/browse/EXAMPLE WORD"
page = requests.get(url)
html = page.text

soup = BeautifulSoup(html,"html.parser")
ele = soup.find('div', attrs={'data-testid': 'word-grid-container'})
syn = ele.findChildren('ul', recursive=False)
print(syn)

which gives all of the HTML inside the data-testid element in a big old mess. Adding .text doesn’t seem to work, since the error says I’m treating a list of results like a single one (which I don’t think I am; I’m not using find_all). Not to mention, I think adding that would just give me the first synonym, which isn’t ideal.

I’d like to get a list of synonyms for a word. I’ve gotten a big single string with all the words, but I would love to have a list I can work with, since some synonyms have spaces in them (like ‘fine and dandy’ for ‘good’), so I can’t split the string on spaces.
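
For reference, the error described above is what Beautiful Soup raises when .text is used on a ResultSet: findChildren is an older alias of find_all, so it returns a list of tags even though find_all is never called by name. A minimal sketch on the simplified markup from the question (the second link text is changed to ‘fine and dandy’ for illustration, and the live page is assumed to be laid out similarly):

```python
from bs4 import BeautifulSoup

# Simplified markup from the question; the real page is only assumed to look like this.
html = """
<div data-testid="same_between_words">
    <ul class="different_between_words">
        <li><a data-linkid="same_between_words_2">Word 1</a></li>
        <li><a data-linkid="same_between_words_2">fine and dandy</a></li>
    </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", attrs={"data-testid": "same_between_words"})

# find_all (findChildren is its older alias) returns a ResultSet, i.e. a list of tags.
lists = container.find_all("ul", recursive=False)

# Walk every <ul> in the ResultSet and keep each link's text as its own list entry.
synonyms = [a.get_text(strip=True) for ul in lists for a in ul.find_all("a")]
print(synonyms)
```

Because each link's text is collected separately, multi-word entries stay intact and no splitting on spaces is needed.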

2 Answers


  1. Each word is in an a tag with a font-weight="inherit" attribute, or you can simply select all a tags inside the container.

    from bs4 import BeautifulSoup
    import requests
    
    url = "https://www.thesaurus.com/browse/smile"
    page = requests.get(url)
    html = page.text
    
    soup = BeautifulSoup(html,"html.parser")
    #words = soup.select_one('div[data-testid="word-grid-container"]').select('a[font-weight="inherit"]')
    words = soup.select_one('div[data-testid="word-grid-container"]').select('a')
    for word in words:
        print(word.get_text())
    
  2. You are close to your goal. As a general idea, try to select by things that are static, such as an id or the HTML structure; CSS selectors are convenient for this.

    Example

    from bs4 import BeautifulSoup
    import requests
    
    url = "https://www.thesaurus.com/browse/idea"
    page = requests.get(url)
    html = page.text
    
    soup = BeautifulSoup(html,"html.parser")
    print([e.get_text(strip=True) for e in soup.select('#meanings ul li>a')])
    

    Output

    ['belief', 'concept', 'conclusion', 'design', 'feeling', 'form', 'intention', 'interpretation', 'meaning', 'notion', 'objective', 'opinion', 'perception', 'plan', 'scheme', 'sense', 'solution', 'suggestion', 'theory', 'thought', 'understanding', 'view', 'aim', 'approximation', 'brainstorm', 'clue', 'conception', 'conviction', 'doctrine', 'end', 'essence', 'estimate', 'fancy', 'flash', 'guess', 'hint', 'hypothesis', 'import', 'impression', 'inkling', 'intimation', 'judgment', 'object', 'pattern', 'purpose', 'reason', 'significance', 'suspicion', 'teaching', 'viewpoint', 'believed abstraction']
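
Both answers rely on scoping the search to a container before collecting the links. Below is a minimal sketch of that idea, run on made-up markup rather than the live page. The helper name extract_words and the data-testid value antonyms-container are hypothetical, and the real page layout may differ. The None check covers the case where select_one finds no match.

```python
from bs4 import BeautifulSoup


def extract_words(html, container_selector):
    """Return the link texts inside the first element matching container_selector."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one(container_selector)
    if container is None:  # selector matched nothing, e.g. the page layout changed
        return []
    return [a.get_text(strip=True) for a in container.select("ul > li > a")]


# Hypothetical markup: the "antonyms-container" value is invented for this example.
sample = """
<div data-testid="word-grid-container">
  <ul><li><a>fine and dandy</a></li><li><a>great</a></li></ul>
</div>
<div data-testid="antonyms-container">
  <ul><li><a>bad</a></li></ul>
</div>
"""

synonyms = extract_words(sample, 'div[data-testid="word-grid-container"]')
print(synonyms)
```

Links outside the chosen container are never visited, so anything placed in a separate container (as the antonyms are assumed to be here) is left out.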
    