I want to scrape the "key-statistics" tab from Yahoo Finance. The HTML page contains multiple tables, which I scraped using Beautiful Soup. Each table contains only 2 columns, and I managed to scrape them using both the HTML tags "table, td and tr" and pandas' "read_html" function.
The tables are concatenated into a single dataframe using this code:
response = requests.get(url, headers={'user-agent': 'custom'})
soup = BeautifulSoup(response.content, 'html.parser')
key_stats = pd.DataFrame(columns=["indicator", "value"])
tables = pd.read_html(str(soup))
for table in tables:
    table.columns = ['indicator', 'value']
    key_stats = pd.concat([key_stats, table], axis=0)
key_stats = key_stats.set_index("indicator")
The code works perfectly with a small list of stocks. However, when I run the same code on a large list (5,665 stocks), the following error occurs:
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
This error appears randomly on certain stocks. It implies that one of the scraped tables contains only 1 column, which is not true.
The most confusing part is that the code works fine when re-executed on the same stocks that generated the error.
I could not work out what is causing this issue. Could anyone help me with it?
2 Answers
As commented, without seeing the HTML data, all we know is that you get inconsistencies in the data being scraped from Yahoo Finance’s "key-statistics" tab (visible here) for many stocks. That means you need to implement some strategies to be more robust in the face of those inconsistencies:
- Before concatenating a table into the key_stats DataFrame, validate that it indeed has the expected two columns. That can be done by checking the shape of the DataFrame.
- Catch the ValueError. That will allow you to log or investigate the problematic stocks without interrupting the entire scraping process.
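For instance, the two strategies above could be sketched as follows. This is a minimal sketch, not a definitive implementation: `combine_tables`, `collect` and `tables_by_ticker` are hypothetical names, and `tables_by_ticker` is assumed to map each ticker to the list of DataFrames that `pd.read_html` returned for its page.

```python
import pandas as pd


def combine_tables(tables):
    """Concatenate the two-column tables; raise ValueError if none qualify."""
    valid = []
    for table in tables:
        # Check the shape first: skip anything that is not a two-column table.
        if table.shape[1] != 2:
            continue
        renamed = table.copy()
        renamed.columns = ["indicator", "value"]
        valid.append(renamed)
    if not valid:
        raise ValueError("no two-column tables found")
    return pd.concat(valid, axis=0).set_index("indicator")


def collect(tables_by_ticker):
    """Return (results, failures) without stopping at the first bad ticker."""
    results, failures = {}, {}
    for ticker, tables in tables_by_ticker.items():
        try:
            results[ticker] = combine_tables(tables)
        except ValueError as exc:
            # Keep the reason so the ticker can be inspected or retried later.
            failures[ticker] = str(exc)
    return results, failures
```

Since the question reports that failing stocks succeed when re-executed, the tickers collected in `failures` are natural candidates for a second pass.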
That way, your code checks that each table has exactly two columns before attempting to rename the columns and concatenate it. It also catches any ValueError that might occur during the process, allowing you to log or handle it as needed. As a result, you can better handle variability in the data being scraped.
The error you're encountering indicates a length mismatch when the two column names are assigned to a table: pandas expected 1 name because the table has 1 column, but received 2. This could happen if the tables extracted from the HTML content have different numbers of columns.
To troubleshoot and handle this error, you can add some error handling and logging to understand which table is causing the issue and why. Here’s an updated version of your code with error handling and logging added:
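One way this could look is sketched below, assuming the list of tables has already been obtained from `pd.read_html`. The function name `concat_with_logging`, the logger name `key_stats` and the `ticker` argument are illustrative choices, not part of the original code.

```python
import logging

import pandas as pd

logger = logging.getLogger("key_stats")


def concat_with_logging(tables, ticker):
    """Concatenate one ticker's two-column tables, logging any that fail."""
    frames = []
    for position, table in enumerate(tables):
        try:
            # Raises ValueError when the table does not have two columns.
            table.columns = ["indicator", "value"]
        except ValueError as exc:
            # Record which table failed and what shape it had.
            logger.warning(
                "%s: table %d with shape %s was skipped (%s)",
                ticker, position, table.shape, exc,
            )
            continue
        frames.append(table)
    if not frames:
        # Nothing usable: return an empty frame with the expected index.
        return pd.DataFrame(columns=["indicator", "value"]).set_index("indicator")
    return pd.concat(frames, axis=0).set_index("indicator")
```

Calling `logging.basicConfig(level=logging.WARNING)` once at the start of the script makes the warnings visible with a consistent format. The logged shapes show which tables arrive with a single column.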