
I have code that parses information about competitions from the RSCF website. Yes, yes, parsing again. But don't worry, I already know what to do and how, and I've written the code. It works like clockwork for me and doesn't give any errors.

import requests
from bs4 import BeautifulSoup
import re
import os
from urllib.request import urlopen
import json
from urllib.parse import unquote

import warnings
warnings.filterwarnings("ignore")

BASE_URL = 'https://www.rscf.ru/contests'

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0'

items = []
max_page = 10
for page in range(1, max_page + 1):
    url = f'{BASE_URL}/?PAGEN_2={page}/' if page > 1 else BASE_URL
    print(url)

    rs = session.get(url, verify=False)
    rs.raise_for_status()

    soup = BeautifulSoup(rs.content, 'html.parser')
    for item in soup.select('.classification-table-row.contest-table-row'):
        number = item.select_one('.contest-num').text
        title = item.select_one('.contest-name').text
        date = item.select_one('.contest-date').text.replace("\n", "").replace("Подать заявку", "")
        documents = item.select_one('.contest-docs').text.replace("\n", " ").replace("        ", " ").replace("    ", " ")
        synopsis = item.select_one('.contest-status').text.replace("\n", " ")
        items.append({
            'Номер': number,
            'Наименование конкурса': title,
            'Приём заявок': date,
            'Статус': synopsis,
            'Документы': documents,
        })

with open('out.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, indent=4, ensure_ascii=False)

Everything works and everything is in order, but there is one nuance.

The site colors the status text depending on whether the competition is active or completed. If applications are being accepted, the status is highlighted in green. If the expert review is under way, it is orange. And if the contest is over, it is red. Here are the contests:

https://www.rscf.ru/contests/

I need the code to write to the JSON the text that is marked in red, orange or green in the HTML. Unfortunately, I couldn't find anything similar on the Internet. There are only examples that color text, not ones that extract text that is already colored.

I tried to write this code:

        redword = item.select_one('.contest-danger').text
        orangeword = item.select_one('.contest-danger').text
        greenword = item.select_one('.contest-success').text
        for synopsis in item.select_one('.contest-status').text:
            try:
                syn = re.sub(orangeword, str(synopsis))
            except:
                syn = re.sub(orangeword, str(greenword))
        items.append({
            'Номер': number,
            'Наименование конкурса': title,
            'Приём заявок': date,
            'Статус': syn,
            'Документы': documents,
        })

but it only gave me an error:

redword = item.select_one('.contest-danger').text
AttributeError: 'NoneType' object has no attribute 'text'

Can you help me please?

2 Answers


  1. Chosen as BEST ANSWER

    So, I decided to write the following code.

    import requests
    from bs4 import BeautifulSoup
    import json
    
    import warnings
    warnings.filterwarnings("ignore")
    
    BASE_URL = 'https://www.rscf.ru/contests'
    
    session = requests.Session()
    session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0'
    
    items = []
    max_page = 10
    for page in range(1, max_page + 1):
        url = f'{BASE_URL}/?PAGEN_2={page}/' if page > 1 else BASE_URL
        print(url)
    
        rs = session.get(url, verify=False)
        rs.raise_for_status()
    
        soup = BeautifulSoup(rs.content, 'html.parser')
        for item in soup.select('.classification-table-row.contest-table-row'):
            number = item.select_one('.contest-num').text
            title = item.select_one('.contest-name').text
            date = item.select_one('.contest-date').text.replace("\n", "").replace("Подать заявку", "")
            documents = item.select_one('.contest-docs').text.replace("\n", " ").replace("        ", " ").replace("    ", " ")
            # Collect the non-empty text of every .contest-status element in the row.
            # select() returns an empty list when nothing matches, so no try/except is needed.
            synopsis = [s.get_text(strip=True) for s in item.select(".contest-status") if s.get_text(strip=True)]
            # Drop the first entry and keep the status text that follows it
            del synopsis[:1]
            syn = ", ".join(synopsis)
            items.append({
                'Номер': number,
                'Наименование конкурса': title,
                'Приём заявок': date,
                'Статус': syn,
                'Документы': documents,
            })
    
    with open('out.json', 'w', encoding='utf-8') as f:
        json.dump(items, f, indent=4, ensure_ascii=False)
    

    The result is:

        {
            "Номер": "92",
            "Наименование конкурса": " Конкурс на получение грантов РНФ по мероприятию «Проведение фундаментальных научных исследований и поисковых научных исследований отдельными научными группами»",
            "Приём заявок": "до 15.11.2023 17:00",
            "Статус": "Прием заявок",
            "Документы": " Извещение Конкурсная документация    "
        },
        {
            "Номер": "3005",
            "Наименование конкурса": "Конкурс на получение грантов РНФ «Проведение пилотных проектов НИОКР в рамках стратегических инициатив Президента РФ в научно-технологической сфере» по теме: «Разработка нитрид-галлиевого СВЧ-транзистора S-диапазона с выходной мощностью не менее 120 Вт»",
            "Приём заявок": "до 02.06.2023 17:00",
            "Статус": "Конкурс завершен",
            "Документы": " Извещение Конкурсная документация Список победителей "
        },
        {
    

    You can try it yourself.
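
The error in the question comes from `select_one` returning `None` when a row has no element with the requested class, presumably because each row carries only one of the status classes. If the color itself should also end up in the JSON, one option is to map the status class to a color name. This is a minimal sketch, not the site's documented markup: `contest-success` (green) and `contest-danger` (red) are taken from the question, while `contest-warning` for orange and the helper name `color_for_classes` are assumptions.

```python
# Map status CSS classes to the color they are shown in.
# 'contest-success' and 'contest-danger' appear in the question;
# 'contest-warning' is an assumed name for the orange state.
STATUS_COLORS = {
    "contest-success": "green",
    "contest-warning": "orange",  # assumption, check the page source
    "contest-danger": "red",
}


def color_for_classes(classes):
    """Return the color name for a list of CSS classes, or None if no class matches."""
    for css_class in classes or []:
        if css_class in STATUS_COLORS:
            return STATUS_COLORS[css_class]
    return None


# BeautifulSoup exposes a tag's classes as a list, e.g. tag.get("class", []),
# so that list can be passed in directly. Plain lists are used here instead.
print(color_for_classes(["contest-status", "contest-success"]))  # green
print(color_for_classes(["contest-status"]))                     # None
print(color_for_classes(None))                                   # None
```

Inside the scraping loop this could be called as `color_for_classes(tag.get("class", []))` for each `tag` in `item.select(".contest-status, .contest-status *")`, keeping the first result that is not `None` and storing it under an extra key next to `'Статус'`.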


  2. You can get the color here.

    Here is the long version:

    from bs4 import BeautifulSoup
    
    # Assume 'html' is your HTML content
    soup = BeautifulSoup(html, 'html.parser')
    
    # Use a CSS selector to target the desired element
    element = soup.select_one('h1')  # Replace 'h1' with your target selector
    
    # Check if the element exists
    if element:
        # Get the 'style' attribute of the element
        style = element.get('style')
        # Parse the 'style' attribute to extract the color
        if style:
            # Split the 'style' attribute into individual styles
            styles = style.split(';')
            # Search for the 'color' property
            for declaration in styles:
                name, _, value = declaration.partition(':')
                # Match 'color' exactly so 'background-color' is not picked up
                if name.strip().lower() == 'color':
                    print("Color:", value.strip())
    else:
        print("Element not found.")
    

    Short version

    title = item.select_one('.contest-name')
    style = title.get('style')
    
    if style:
        # Split the style into individual styles
        styles = style.split(';')
    
        # Search for the color property
        for declaration in styles:
            name, _, value = declaration.partition(':')
            # Match 'color' exactly so 'background-color' is not picked up
            if name.strip().lower() == 'color':
                print("Color:", value.strip())
    else:
        print("Style attribute not found.")
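
Note that `get('style')` only sees inline styles. If the site sets the color through a CSS class, as the `.contest-success` and `.contest-danger` selectors in the question suggest, the attribute will be missing and the class has to be inspected instead. For pages that do use inline styles, splitting the declarations into a dictionary makes the lookup exact. A minimal sketch; the helper name `parse_inline_style` is hypothetical:

```python
def parse_inline_style(style):
    """Split an inline style string into a {property: value} dictionary."""
    declarations = {}
    for part in (style or "").split(";"):
        # partition keeps everything after the first ':' as the value
        name, sep, value = part.partition(":")
        if sep:  # skip empty or malformed fragments
            declarations[name.strip().lower()] = value.strip()
    return declarations


styles = parse_inline_style("background-color: #fff; color: red;")
print(styles.get("color"))  # red
```

With BeautifulSoup the input would be `element.get('style')`, which is `None` when the tag has no `style` attribute; the helper then returns an empty dictionary.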
    