
I want to scrape the "key-statistics" tab from Yahoo Finance. The HTML page contains multiple tables, which I scraped using Beautiful Soup. Each table contains only two columns, and I managed to scrape them using both the HTML tags ("table", "td" and "tr") and Pandas' "read_html" function.

The tables are concatenated into a single DataFrame using this code:

 import requests
 import pandas as pd
 from bs4 import BeautifulSoup

 response = requests.get(url, headers={'user-agent': 'custom'})
 soup = BeautifulSoup(response.content, 'html.parser')
 key_stats = pd.DataFrame(columns=["indicator", "value"])
 tables = pd.read_html(str(soup))

 for table in tables:
     table.columns = ['indicator', 'value']
     key_stats = pd.concat([key_stats, table], axis=0)

 key_stats = key_stats.set_index("indicator")

The code works perfectly with a small list of stocks; however, when the same code is run on a large list (5,665 stocks), the following error occurs.

 ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements

This error appears randomly on certain stocks, and it implies that the scraped tables contain only one column, which is not true.

The most confusing part is that the code works fine when re-executed on the same stocks that generated the error.

I could not understand what's causing this issue. Could anyone help me with that?
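For reference, the same error message can be reproduced by assigning two column names to a DataFrame that was parsed with only one column (a minimal sketch, with made-up data):

```python
import pandas as pd

# A one-column table, like the malformed tables pd.read_html sometimes returns
table = pd.DataFrame({"a": [1, 2, 3]})

try:
    table.columns = ["indicator", "value"]  # two names for one column
except ValueError as e:
    print(e)
    # -> Length mismatch: Expected axis has 1 elements, new values have 2 elements
```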

2 Answers


  1. As commented, without seeing the HTML data, all we know is that you are getting inconsistencies in the data scraped from Yahoo Finance's "key-statistics" tab for many stocks. That means you need to implement some strategies to be more robust in the face of those inconsistencies:

    • Before concatenating the tables into your key_stats DataFrame, validate that each table indeed has the expected two columns. That can be done by checking the shape of the DataFrame.
    • Implement a try-except block to catch and handle the ValueError. That will allow you to log or investigate the problematic stocks without interrupting the entire scraping process.
    • Enhance your scraping logic to handle cases where the page layout might differ or where certain expected tables are absent.

    For instance:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    url = "your_yahoo_finance_url"
    response = requests.get(url, headers={'user-agent': 'custom'})
    soup = BeautifulSoup(response.content, 'html.parser')
    key_stats = pd.DataFrame(columns=["indicator", "value"])
    tables = pd.read_html(str(soup))
    
    for table in tables:
        try:
            # Validate the table structure
            if table.shape[1] == 2:
                table.columns = ['indicator', 'value']
                key_stats = pd.concat([key_stats, table], axis=0)
            else:
                print("Skipped a table with unexpected format.")
        except ValueError as e:
            print(f"Error processing a table: {e}")
    
    key_stats = key_stats.set_index("indicator")
    

    That way, your code checks that each table has exactly two columns before renaming them and concatenating. It also catches any ValueError that might occur during the process, allowing you to log or handle it as needed, so the code copes better with variability in the scraped data.
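    Since re-running the same stocks succeeds, the failures are likely transient (rate limiting or a partial response), so retrying a ticker before giving up is another robustness strategy. This is a sketch only: the `with_retries` helper and its parameters are illustrative, not part of the original code.

```python
import time

def with_retries(fn, retries=3, delay=0.0, exceptions=(Exception,)):
    """Call fn(), retrying up to `retries` times on the given exceptions."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except exceptions as e:
            last_error = e
            time.sleep(delay * attempt)  # back off a little more on each attempt
    raise last_error  # every attempt failed; surface the last error
```

    A ticker's scrape can then be wrapped as, for example, `with_retries(lambda: pd.read_html(str(soup)), retries=3, delay=2.0, exceptions=(ValueError,))`, so a transient parse failure is retried instead of aborting the whole run.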

  2. The error you’re encountering indicates that the number of column names being assigned doesn’t match the number of columns in the table. This can happen if some of the tables extracted from the HTML content have a different number of columns.

    To troubleshoot and handle this error, you can add some error handling and logging to understand which table is causing the issue and why. Here’s an updated version of your code with error handling and logging added:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    response = requests.get(url, headers={'user-agent': 'custom'})
    soup = BeautifulSoup(response.content, 'html.parser')
    key_stats = pd.DataFrame(columns=["indicator", "value"])
    
    try:
        tables = pd.read_html(str(soup))
    except ValueError as e:
        print("ValueError occurred while parsing HTML tables:", e)
        # Add logging or handle the error as needed
        tables = []
    
    for table in tables:
        try:
            if table.shape[1] == 2:  # Check if the table has two columns
                table.columns = ['indicator', 'value']
                key_stats = pd.concat([key_stats, table], axis=0)
            else:
                print("Skipping table with unexpected number of columns:", table)
                # Add logging or handle the table with unexpected columns
        except Exception as e:
            print("Error occurred while processing table:", e)
            # Add logging or handle the error as needed
    
    key_stats = key_stats.set_index("indicator")
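    Where the comments say "add logging", the standard `logging` module can record each skipped table to a file so a large run can be audited afterwards. A minimal sketch; the logger name, log file, and `report_bad_table` helper are illustrative, not part of the original code:

```python
import logging

# Write problems to a file so a 5,665-stock run can be reviewed afterwards
logging.basicConfig(filename="scrape_errors.log", level=logging.WARNING,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("key_stats")

def report_bad_table(ticker, table):
    """Record a table whose shape did not match the expected two columns."""
    logger.warning("%s: skipped table with shape %s", ticker, table.shape)
```

    Inside the loop, the `else` branch would then call something like `report_bad_table(ticker, table)` instead of printing, keeping the console output clean while preserving a record of every problematic stock.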
    