
Json – How to find a coloured font from HTML and add it to parser?

GaloGalo
June 29, 2023
147 views
0 votes
2 Answers

I have code that parses information about competitions from the RSCF website. Yes, parsing again, but don't worry: I already know what to do and how, and I have written the code. It works like clockwork for me and doesn't give any errors.

import requests
from bs4 import BeautifulSoup
import re
import os
from urllib.request import urlopen
import json
from urllib.parse import unquote

import warnings
warnings.filterwarnings("ignore")

BASE_URL = 'https://www.rscf.ru/contests'

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0'

items = []
max_page = 10
for page in range(1, max_page + 1):
    url = f'{BASE_URL}/?PAGEN_2={page}/' if page > 1 else BASE_URL
    print(url)

    rs = session.get(url, verify=False)
    rs.raise_for_status()

    soup = BeautifulSoup(rs.content, 'html.parser')
    for item in soup.select('.classification-table-row.contest-table-row'):
        number = item.select_one('.contest-num').text
        title = item.select_one('.contest-name').text
        date = item.select_one('.contest-date').text.replace("\n", "").replace("Подать заявку", "")
        documents = item.select_one('.contest-docs').text.replace("\n", " ").replace("        ", " ").replace("    ", " ")
        synopsis = item.select_one('.contest-status').text.replace("\n", " ")
        items.append({
            'Номер': number,
            'Наименование конкурса': title,
            'Приём заявок': date,
            'Статус': synopsis,
            'Документы': documents,
        })

with open('out.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, indent=4, ensure_ascii=False)

Everything works and everything is in order, but there is one nuance.

The site uses text color to show a contest's state: the status is colored according to whether the contest is active or completed. If applications are being accepted, the status is highlighted in green; if an examination is under way, in orange; and if the contest is over, in red. Here are the contests:

https://www.rscf.ru/contests/
I need the code to write to the JSON the text that is marked in red, orange, or green in the HTML. Unfortunately, I couldn't find anything similar on the Internet: there are only examples that color text, not ones that extract text that is already colored.

I tried to write this code:

        redword = item.select_one('.contest-danger').text
        orangeword = item.select_one('.contest-danger').text
        greenword = item.select_one('.contest-success').text
        for synopsis in item.select_one('.contest-status').text:
            try:
                syn = re.sub(orangeword, str(synopsis))
            except:
                syn = re.sub(orangeword, str(greenword))
        items.append({
            'Номер': number,
            'Наименование конкурса': title,
            'Приём заявок': date,
            'Статус': syn,
            'Документы': documents,
        })

but it only gave me this error:

redword = item.select_one('.contest-danger').text
AttributeError: 'NoneType' object has no attribute 'text'

Can you help me please?

Tags: json python

Answers

Chosen as BEST ANSWER

So, I decided to write the following code.

import requests
from bs4 import BeautifulSoup
import re
import os
from urllib.request import urlopen
import json
from urllib.parse import unquote

import warnings
warnings.filterwarnings("ignore")

BASE_URL = 'https://www.rscf.ru/contests'

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0'

items = []
max_page = 10
for page in range(1, max_page + 1):
    url = f'{BASE_URL}/?PAGEN_2={page}/' if page > 1 else BASE_URL
    print(url)

    rs = session.get(url, verify=False)
    rs.raise_for_status()

    soup = BeautifulSoup(rs.content, 'html.parser')
    for item in soup.select('.classification-table-row.contest-table-row'):
        number = item.select_one('.contest-num').text
        title = item.select_one('.contest-name').text
        date = item.select_one('.contest-date').text.replace("\n", "").replace("Подать заявку", "")
        documents = item.select_one('.contest-docs').text.replace("\n", " ").replace("        ", " ").replace("    ", " ")
        try:
            # Collect the non-empty texts of all status elements in this row
            synopsis = [s.get_text(strip=True) for s in item.select(".contest-status") if s.get_text(strip=True)]
            # Drop the first entry and join the rest into one string
            del synopsis[:1]
            syn = ', '.join(synopsis)
        except Exception:
            # Fall back to the green (success) elements
            synopsis = [s.get_text(strip=True) for s in item.select(".contest-success") if s.get_text(strip=True)]
            del synopsis[:1]
            syn = ', '.join(synopsis)
        items.append({
            'Номер': number,
            'Наименование конкурса': title,
            'Приём заявок': date,
            'Статус': syn,
            'Документы': documents,
        })

with open('out.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, indent=4, ensure_ascii=False)

The result is:

{
        "Номер": "92",
        "Наименование конкурса": " Конкурс на получение грантов РНФ по мероприятию «Проведение фундаментальных научных исследований и поисковых научных исследований отдельными научными группами»",
        "Приём заявок": "до 15.11.2023 17:00",
        "Статус": "Прием заявок",
        "Документы": " Извещение Конкурсная документация    "
    },
    {
        "Номер": "3005",
        "Наименование конкурса": "Конкурс на получение грантов РНФ «Проведение пилотных проектов НИОКР в рамках стратегических инициатив Президента РФ в научно-технологической сфере» по теме: «Разработка нитрид-галлиевого СВЧ-транзистора S-диапазона с выходной мощностью не менее 120 Вт»",
        "Приём заявок": "до 02.06.2023 17:00",
        "Статус": "Конкурс завершен",
        "Документы": " Извещение Конкурсная документация Список победителей "
    },
    {
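
The "Документы" values in the output above still carry leading and trailing spaces, because the chained `.replace` calls only handle specific runs of spaces. A minimal sketch of a more general cleanup, assuming it is acceptable to collapse every run of whitespace into a single space (the helper name `normalize_space` and the sample string are hypothetical, not part of the original code):

```python
def normalize_space(text):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


# Invented sample resembling the text of a '.contest-docs' cell
raw = "\n Извещение \n   Конкурсная документация    \n"
print(normalize_space(raw))
```

In the parser this could be applied as `normalize_space(item.select_one('.contest-docs').text)` in place of the chained replacements.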

You can try it yourself.
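
The `AttributeError` in the question comes from calling `.text` on the result of `select_one`, which returns `None` when a row has no element with the requested class. A hedged sketch of reading the colored status directly from its class, with a guard against `None`: the class names `contest-success` and `contest-danger` are taken from the question, while `contest-warning` for orange, the helper `colored_status`, and the sample markup are assumptions made for illustration and may not match the real page.

```python
from bs4 import BeautifulSoup

# Mapping from CSS class to color. 'contest-success' and 'contest-danger'
# appear in the question; 'contest-warning' is a guess for the orange status.
STATUS_CLASSES = {
    "contest-success": "green",
    "contest-warning": "orange",
    "contest-danger": "red",
}


def colored_status(item):
    """Return (text, color) for the first colored element in a row, or (None, None)."""
    for css_class, color in STATUS_CLASSES.items():
        tag = item.select_one(f".{css_class}")
        # select_one returns None when nothing matches, so check before using it
        if tag is not None:
            return tag.get_text(strip=True), color
    return None, None


# Invented sample markup; the real page structure may differ
sample_html = """
<div class="contest-table-row">
  <div class="contest-status"><span class="contest-success">Прием заявок</span></div>
</div>
<div class="contest-table-row">
  <div class="contest-status"><span class="contest-danger">Конкурс завершен</span></div>
</div>
<div class="contest-table-row">
  <div class="contest-status">Статус</div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
results = [colored_status(row) for row in soup.select(".contest-table-row")]
print(results)
```

Rows without any colored element yield `(None, None)` instead of raising, so the loop can decide how to record them.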

(Edit)

You can get the color from the element's inline style attribute.

Here is the long version.

from bs4 import BeautifulSoup

# Assume 'html' is your HTML content
soup = BeautifulSoup(html, 'html.parser')

# Use a CSS selector to target the desired element
element = soup.select_one('h1')  # Replace 'h1' with your target selector

# Check if the element exists
if element:
    # Get the 'style' attribute of the element
    style = element.get('style')
    # Parse the 'style' attribute to extract the color
    if style:
        # Split the 'style' attribute into individual declarations
        for declaration in style.split(';'):
            # Separate the property name from its value
            name, _, value = declaration.partition(':')
            # Match 'color' exactly so 'background-color' is not picked up
            if name.strip().lower() == 'color':
                print("Color:", value.strip())
else:
    print("Element not found.")

Short version

title = item.select_one('.contest-name')
# select_one returns None when nothing matches, so guard before calling .get
style = title.get('style') if title else None

if style:
    # Split the style into individual declarations
    for declaration in style.split(';'):
        name, _, value = declaration.partition(':')
        # Match 'color' exactly so 'background-color' is not picked up
        if name.strip().lower() == 'color':
            print("Color:", value.strip())
else:
    print("Style attribute not found.")
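
One caveat: reading the `style` attribute only finds colors that are written inline. If a page colors its text through a stylesheet class, as the class names in the question suggest may be the case here, the attribute is absent and nothing is found. A small sketch that shows both situations, using a hypothetical helper `inline_color` and invented markup:

```python
from bs4 import BeautifulSoup


def inline_color(tag):
    """Return the value of the 'color' declaration in tag's style attribute, or None."""
    if tag is None:
        return None
    for declaration in (tag.get("style") or "").split(";"):
        name, sep, value = declaration.partition(":")
        # Require an exact 'color' property so 'background-color' is ignored
        if sep and name.strip().lower() == "color":
            return value.strip()
    return None


# Invented markup: one element colored inline, one colored only by a class
html = """
<p id="a" style="background-color: #fff; color: red">inline color</p>
<p id="b" class="contest-success">color set by a stylesheet class</p>
"""

soup = BeautifulSoup(html, "html.parser")
print(inline_color(soup.select_one("#a")))        # inline declaration is found
print(inline_color(soup.select_one("#b")))        # no style attribute
print(inline_color(soup.select_one("#missing")))  # element does not exist
```

When the helper returns `None` for an element that is visibly colored, checking the element's `class` attribute is the likely next step.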
