
I want to scrape the "key-statistics" tab from Yahoo Finance. The HTML page contains multiple tables, which I scraped using Beautiful Soup. Each table contains only two columns, and I managed to scrape them using both the HTML tags ("table", "td" and "tr") and Pandas' "read_html" function.

The tables are concatenated into a single DataFrame using this code:

 import requests
 import pandas as pd
 from bs4 import BeautifulSoup

 response = requests.get(url, headers={'user-agent': 'custom'})
 soup = BeautifulSoup(response.content, 'html.parser')
 key_stats = pd.DataFrame(columns=["indicator", "value"])
 tables = pd.read_html(str(soup))

 for table in tables:
     table.columns = ['indicator', 'value']
     key_stats = pd.concat([key_stats, table], axis=0)

 key_stats = key_stats.set_index("indicator")

The code works perfectly with a small list of stocks; however, when the same code is run on a large list (5,665 stocks), the following error occurs.

 ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements

This error appears randomly on certain stocks, and it implies that the scraped tables contain only one column, which is not true.

The most confusing part is that the code works fine when re-executed on the same stocks that generated the error.

I could not understand what's causing this issue. Could anyone help me with that?
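For reference, the same error message can be reproduced by assigning two column names to a DataFrame that was parsed with only one column (a minimal sketch, with made-up data):

```python
import pandas as pd

# A one-column table, like the malformed tables pd.read_html sometimes returns
table = pd.DataFrame({"a": [1, 2, 3]})

try:
    table.columns = ["indicator", "value"]  # two names for one column
except ValueError as e:
    print(e)
    # -> Length mismatch: Expected axis has 1 elements, new values have 2 elements
```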

2 Answers


  1. As commented, without seeing the HTML data, all we know is that you are getting inconsistencies in the data scraped from Yahoo Finance's "key-statistics" tab for many stocks. That means you need to implement some strategies to be more robust in the face of those inconsistencies:

    • Before concatenating the tables into your key_stats DataFrame, validate that each table indeed has the expected two columns. That can be done by checking the shape of the DataFrame.
    • Implement a try-except block to catch and handle the ValueError. That will allow you to log or investigate the problematic stocks without interrupting the entire scraping process.
    • Enhance your scraping logic to handle cases where the page layout might differ or where certain expected tables are absent.

    For instance:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    url = "your_yahoo_finance_url"
    response = requests.get(url, headers={'user-agent': 'custom'})
    soup = BeautifulSoup(response.content, 'html.parser')
    key_stats = pd.DataFrame(columns=["indicator", "value"])
    tables = pd.read_html(str(soup))
    
    for table in tables:
        try:
            # Validate the table structure
            if table.shape[1] == 2:
                table.columns = ['indicator', 'value']
                key_stats = pd.concat([key_stats, table], axis=0)
            else:
                print("Skipped a table with unexpected format.")
        except ValueError as e:
            print(f"Error processing a table: {e}")
    
    key_stats = key_stats.set_index("indicator")
    

    That way, your code checks that each table has exactly two columns before renaming them and concatenating. It also catches any ValueError that might occur during the process, allowing you to log or handle it as needed, so the code copes better with variability in the scraped data.
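    Since re-running the same stocks succeeds, the failures are likely transient (rate limiting or a partial response), so retrying a ticker before giving up is another robustness strategy. This is a sketch only: the `with_retries` helper and its parameters are illustrative, not part of the original code.

```python
import time

def with_retries(fn, retries=3, delay=0.0, exceptions=(Exception,)):
    """Call fn(), retrying up to `retries` times on the given exceptions."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except exceptions as e:
            last_error = e
            time.sleep(delay * attempt)  # back off a little more on each attempt
    raise last_error  # every attempt failed; surface the last error
```

    A ticker's scrape can then be wrapped as, for example, `with_retries(lambda: pd.read_html(str(soup)), retries=3, delay=2.0, exceptions=(ValueError,))`, so a transient parse failure is retried instead of aborting the whole run.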

  2. The error you’re encountering indicates that the number of column names being assigned doesn’t match the number of columns in the table. This can happen if some of the tables extracted from the HTML content have a different number of columns.

    To troubleshoot and handle this error, you can add some error handling and logging to understand which table is causing the issue and why. Here’s an updated version of your code with error handling and logging added:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    response = requests.get(url, headers={'user-agent': 'custom'})
    soup = BeautifulSoup(response.content, 'html.parser')
    key_stats = pd.DataFrame(columns=["indicator", "value"])
    
    try:
        tables = pd.read_html(str(soup))
    except ValueError as e:
        print("ValueError occurred while parsing HTML tables:", e)
        # Add logging or handle the error as needed
        tables = []
    
    for table in tables:
        try:
            if table.shape[1] == 2:  # Check if the table has two columns
                table.columns = ['indicator', 'value']
                key_stats = pd.concat([key_stats, table], axis=0)
            else:
                print("Skipping table with unexpected number of columns:", table)
                # Add logging or handle the table with unexpected columns
        except Exception as e:
            print("Error occurred while processing table:", e)
            # Add logging or handle the error as needed
    
    key_stats = key_stats.set_index("indicator")
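    Where the comments say "add logging", the standard `logging` module can record each skipped table to a file so a large run can be audited afterwards. A minimal sketch; the logger name, log file, and `report_bad_table` helper are illustrative, not part of the original code:

```python
import logging

# Write problems to a file so a 5,665-stock run can be reviewed afterwards
logging.basicConfig(filename="scrape_errors.log", level=logging.WARNING,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("key_stats")

def report_bad_table(ticker, table):
    """Record a table whose shape did not match the expected two columns."""
    logger.warning("%s: skipped table with shape %s", ticker, table.shape)
```

    Inside the loop, the `else` branch would then call something like `report_bad_table(ticker, table)` instead of printing, keeping the console output clean while preserving a record of every problematic stock.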
    