
I encountered the following problem while scraping https://nuforc.org/subndx/?id=all, which contains a long list of pages (1,497), each with a 100-row table.

I used the following Python code to scrape the web table:

import pandas as pd
insight_ufo = pd.read_html('https://nuforc.org/subndx/?id=all', displayed_only=False)[0]

It correctly loads the first page into the data frame, but only the first 100 rows (the first page). I don’t know how to loop over the full list of pages.

insight_ufo
    Link    Occurred    City    State   Country Shape   Summary Reported    Posted  Image
0   Open    02/02/1995 23:00    Shady Grove OR  USA NaN Man and wife witness very bright, moving light...   02/03/1995  11/02/1999  NaN
1   Open    02/02/1995 19:15    Denmark WI  USA Cone    Many witness strange craft streaking in night ...   02/03/1995  11/02/1999  NaN
2   Open    02/02/1995 20:10    Traverse City   MI  USA NaN 4 children report seeing disc above them; bath...   02/03/1995  11/02/1999  NaN
3   Open    12/13/1994 18:55    Murphy  NC  USA NaN Woman reports seeing strange, lighted obj. wit...   02/03/1995  11/02/1999  NaN
4   Open    02/03/1995 23:25    Fontana CA  USA NaN 8 adults witness five lights in northern sky f...   03/04/1995  11/02/1999  NaN
... ... ... ... ... ... ... ... ... ... ...
96  Open    01/01/1995 22:45    Anaheim CA  USA NaN Telephoned Report: Man witnessed a motionless ...   01/01/1995  11/20/2001  NaN
97  Open    01/01/1995 19:50    Warm Beach  WA  USA NaN Woman witnessed a red "ball" from her home nea...   01/02/1995  11/02/1999  NaN
98  Open    01/02/1995 06:25    New Port Richey FL  USA NaN Man witnesses huge, "coin shaped," orange obje...   01/02/1995  11/02/1999  NaN
99  Open    01/03/1995 23:45    Salinas CA  USA NaN Young woman witnessed "string of lights" for 1...   01/03/1995  11/02/1999  NaN
100 Link    Occurred    City    State   Country Shape   Summary Reported    Posted  Image
101 rows × 10 columns

Please, can you help me?

N.B.: When changing pages, the URL doesn’t change.

Thank you for your support.

Emilio

I tried the code shown above.

2 Answers


  1. If the table pages are generated dynamically, scraping the static page HTML will not work.

    What you can do is use Selenium and click the "next" button to generate each page. It is slow, because there are almost 1,500 pages, but if you don’t need to do this often it will work:

    from io import StringIO
    
    import pandas as pd
    
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    
    # custom wait condition: true once the element's text differs from `current_text`
    class wait_for_text_changed:
        def __init__(self, locator, current_text):
            self.locator = locator
            self.current_text = current_text
    
        def __call__(self, driver):
            return driver.find_element(*self.locator).text != self.current_text
    
    
    driver = webdriver.Firefox()
    
    driver.get('https://nuforc.org/subndx/?id=all')
    # wait until the table has rendered before reading the first page
    WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.ID, 'table_1'))
        )
    html = driver.page_source
    table = pd.read_html(StringIO(html), displayed_only=False)[0]
    while True:
        tab_info_current_text = driver.find_element(By.ID, "table_1_info").text
        button = driver.find_element(By.ID, "table_1_next")
        # on the last page the "next" button's class gains "disabled"
        if button.get_attribute('class') != "paginate_button next":
            break
        button.click()
        # wait for the "Showing x to y" info text to change, i.e. the next page loaded
        WebDriverWait(driver, 10).until(
            wait_for_text_changed((By.ID, 'table_1_info'), tab_info_current_text)
            )
        html = driver.page_source
        page = pd.read_html(StringIO(html), displayed_only=False)[0]
        table = pd.concat([table, page], axis=0)
    driver.quit()
    
    print(table)
    print(table.shape)
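    Note that with displayed_only=False each parsed page also includes the table's footer row, which repeats the header text (that is the extra "Link / Occurred / City / …" row showing up as row 100 in the question), so the concatenated table carries one junk row per page. A minimal sketch of dropping those rows afterwards, using a toy frame that mimics one parsed page:

```python
import pandas as pd

# toy frame mimicking one page parsed with read_html(displayed_only=False):
# the last row is the footer repeating the header text, captured as data
page = pd.DataFrame({
    'Link': ['Open', 'Open', 'Link'],
    'City': ['Denmark', 'Murphy', 'City'],
})

# keep only real data rows and rebuild a clean 0..n-1 index
clean = page[page['Link'] != 'Link'].reset_index(drop=True)
print(clean.shape)  # (2, 2)
```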
    
  2. Here’s one approach:

    • With the URL open in Chrome, press Ctrl + Shift + J to open Chrome DevTools, go to the Network tab, and click the page-2 button on the page (i.e. go to the next page).
    • There will appear a connection https://nuforc.org/wp-admin/admin-ajax.php?action=get_wdtable&table_id=1&wdt_var1=Post&wdt_var2=-1.
    • From the details, you can gather that different chunks from the data are retrieved with a POST request. The maximum length allowed per chunk appears to be 20_000.

    Hence, you can do something like this:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    from tqdm import tqdm
    
    # base url
    URL = 'https://nuforc.org/subndx/?id=all'
    
    # minimal post data
    DATA = {'action': 'get_wdtable',
            'table_id': '1',
            'wdt_var1': 'Post',
            'wdt_var2': '-1',
            'draw': '1',
            'start': '0',   # we'll be adjusting this one
            'length': '0'   # and this one
            }
    
    def get_data(url, data, step=20_000):
        
        # measure progress with `tqdm`
        progress = tqdm(position=0, leave=True)
        
        # list `dfs` to store `df` per chunk
        dfs = []
        start = 0
        
        # break on bad connection | empty `df` (= end reached)
        while True:
            
            data.update({'start': str(start),
                         'length': str(step)})    
            
            # use `post` with `data` param
            resp = requests.post(url=URL, data=data)
            
            if resp.status_code != 200:
                # failure to connect
                break
            
            # create `soup`, locate `table` and select `tr` tags
            soup = BeautifulSoup(resp.content, 'lxml')
            
            table = soup.find('table')
            table_rows = table.find_all('tr')
            
            # reached table with no data (2 means only header and footer)
            if len(table_rows) == 2:
                break
            
            # from first chunk retrieve the headers
            if start == 0:
                headers = [th.text for th in table_rows[0].find_all('th')]
            
            df_data = []
            
            # for data proper exclude first and last row (header, footer)
            for tr in table_rows[1:-1]:
                
                # create `row` list per `tr`
                row = []
                
                for td in tr.find_all('td'):
                    
                    # href for `Open` value in column "Link"
                    if td.find('a'):
                        value = f"https://nuforc.org{td.a['href']}"
                    # text
                    else:
                        value = td.text
                    
                    # add each `value` to list `row`
                    row.append(value)
    
                # add `row` to list `df_data`
                df_data.append(row)
            
            # create a `df` and append to list `dfs`
            df = pd.DataFrame(df_data, columns=headers)
            
            dfs.append(df.assign(source=start))
            
            # increment `start` to move to next chunk
            start += step
            
            # progress update
            progress.update()
        
        # after `while loop`, concat `dfs`
        df = pd.concat(dfs, axis=0)
        return df
    
    df = get_data(URL, dict(DATA), 20_000)
    
    pd.options.display.max_columns = 50
    
    print(f"""
    {df.head()}""")
    print(f"""
    {df.shape}""")
    

    Output:

    8it [02:38, 19.84s/it] # time to retrieve data, 8 iterations
                                      Link          Occurred           City State  
    0  https://nuforc.org/sighting/?id=111  02/02/1995 23:00    Shady Grove    OR   
    1  https://nuforc.org/sighting/?id=113  02/02/1995 19:15        Denmark    WI   
    2  https://nuforc.org/sighting/?id=114  02/02/1995 20:10  Traverse City    MI   
    3  https://nuforc.org/sighting/?id=115  12/13/1994 18:55         Murphy    NC   
    4  https://nuforc.org/sighting/?id=116  02/03/1995 23:25        Fontana    CA   
    
      Country Shape                                            Summary  
    0     USA        Man and wife witness very bright, moving light...   
    1     USA  Cone  Many witness strange craft streaking in night ...   
    2     USA        4 children report seeing disc above them; bath...   
    3     USA        Woman reports seeing strange, lighted obj. wit...   
    4     USA        8 adults witness five lights in northern sky f...   
    
         Reported      Posted Image  source  
    0  02/03/1995  11/02/1999             0  
    1  02/03/1995  11/02/1999             0  
    2  02/03/1995  11/02/1999             0  
    3  02/03/1995  11/02/1999             0  
    4  03/04/1995  11/02/1999             0  
    
    (147890, 11)
    

    N.B. Using this method, the data is by default ordered ascending on the "Occurred" column. Adding 'order[0][column]': 1, 'order[0][dir]': 'desc' to DATA should, by the looks of it, re-order the data, but for some reason I don’t see the output responding to this addition. Maybe someone else reading this will be able to figure out how to make that work.
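    Since the server-side order parameters don't seem to take effect, a workaround is to sort the result locally after retrieval. A minimal sketch on a toy frame using the same "Occurred" date format as the scraped data:

```python
import pandas as pd

# toy frame with the same date format as the scraped "Occurred" column
df = pd.DataFrame({'Occurred': ['02/02/1995 23:00',
                                '12/13/1994 18:55',
                                '02/03/1995 23:25']})

# parse the strings to datetimes, then sort descending (most recent first)
order = pd.to_datetime(df['Occurred'], format='%m/%d/%Y %H:%M')
df = df.loc[order.sort_values(ascending=False).index].reset_index(drop=True)
print(df['Occurred'].tolist())
# ['02/03/1995 23:25', '02/02/1995 23:00', '12/13/1994 18:55']
```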
