
I encountered the following problem while scraping https://nuforc.org/subndx/?id=all, which contains a long list of pages (1,497), each with a 100-row table.

I used the following Python code to scrape the web table:

import pandas as pd
insight_ufo = pd.read_html('https://nuforc.org/subndx/?id=all', displayed_only=False)[0]

It correctly loads the first page into the data frame, but only the first 100 rows (the first page). I don’t know how to loop over the full list of pages.

insight_ufo
    Link    Occurred    City    State   Country Shape   Summary Reported    Posted  Image
0   Open    02/02/1995 23:00    Shady Grove OR  USA NaN Man and wife witness very bright, moving light...   02/03/1995  11/02/1999  NaN
1   Open    02/02/1995 19:15    Denmark WI  USA Cone    Many witness strange craft streaking in night ...   02/03/1995  11/02/1999  NaN
2   Open    02/02/1995 20:10    Traverse City   MI  USA NaN 4 children report seeing disc above them; bath...   02/03/1995  11/02/1999  NaN
3   Open    12/13/1994 18:55    Murphy  NC  USA NaN Woman reports seeing strange, lighted obj. wit...   02/03/1995  11/02/1999  NaN
4   Open    02/03/1995 23:25    Fontana CA  USA NaN 8 adults witness five lights in northern sky f...   03/04/1995  11/02/1999  NaN
... ... ... ... ... ... ... ... ... ... ...
96  Open    01/01/1995 22:45    Anaheim CA  USA NaN Telephoned Report: Man witnessed a motionless ...   01/01/1995  11/20/2001  NaN
97  Open    01/01/1995 19:50    Warm Beach  WA  USA NaN Woman witnessed a red "ball" from her home nea...   01/02/1995  11/02/1999  NaN
98  Open    01/02/1995 06:25    New Port Richey FL  USA NaN Man witnesses huge, "coin shaped," orange obje...   01/02/1995  11/02/1999  NaN
99  Open    01/03/1995 23:45    Salinas CA  USA NaN Young woman witnessed "string of lights" for 1...   01/03/1995  11/02/1999  NaN
100 Link    Occurred    City    State   Country Shape   Summary Reported    Posted  Image
101 rows × 10 columns

Please, can you help me?

N.B.: When changing pages, the URL doesn’t change.

Thank you for your support.

Emilio

I tried the code shown above.

2 Answers


  1. If the table pages are generated dynamically, scraping the static page HTML will not work.

    What you can do is use Selenium and click the "next" button to generate each page. It is slow, because there are almost 1,500 pages, but if you don’t need to do this often it will work:

    from io import StringIO
    
    import pandas as pd
    
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    
    # custom wait condition: true once the element's text differs from `current_text`
    class wait_for_text_changed:
        def __init__(self, locator, current_text):
            self.locator = locator
            self.current_text = current_text
    
        def __call__(self, driver):
            return driver.find_element(*self.locator).text != self.current_text
    
    
    driver = webdriver.Firefox()
    
    driver.get('https://nuforc.org/subndx/?id=all')
    # wait until the table has rendered before reading the first page
    WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.ID, 'table_1'))
        )
    html = driver.page_source
    table = pd.read_html(StringIO(html), displayed_only=False)[0]
    while True:
        tab_info_current_text = driver.find_element(By.ID, "table_1_info").text
        button = driver.find_element(By.ID, "table_1_next")
        # on the last page the "next" button's class gains "disabled"
        if button.get_attribute('class') != "paginate_button next":
            break
        button.click()
        # wait for the "Showing x to y" info text to change, i.e. the next page loaded
        WebDriverWait(driver, 10).until(
            wait_for_text_changed((By.ID, 'table_1_info'), tab_info_current_text)
            )
        html = driver.page_source
        page = pd.read_html(StringIO(html), displayed_only=False)[0]
        table = pd.concat([table, page], axis=0)
    driver.quit()
    
    print(table)
    print(table.shape)
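    Note that with displayed_only=False each parsed page also includes the table's footer row, which repeats the header text (that is the extra "Link / Occurred / City / …" row showing up as row 100 in the question), so the concatenated table carries one junk row per page. A minimal sketch of dropping those rows afterwards, using a toy frame that mimics one parsed page:

```python
import pandas as pd

# toy frame mimicking one page parsed with read_html(displayed_only=False):
# the last row is the footer repeating the header text, captured as data
page = pd.DataFrame({
    'Link': ['Open', 'Open', 'Link'],
    'City': ['Denmark', 'Murphy', 'City'],
})

# keep only real data rows and rebuild a clean 0..n-1 index
clean = page[page['Link'] != 'Link'].reset_index(drop=True)
print(clean.shape)  # (2, 2)
```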
    
  2. Here’s one approach:

    • With the URL open in Chrome, press Ctrl + Shift + J to open Chrome DevTools, go to the Network tab, and click the page-2 button on the page (i.e. go to the next page).
    • There will appear a connection https://nuforc.org/wp-admin/admin-ajax.php?action=get_wdtable&table_id=1&wdt_var1=Post&wdt_var2=-1.
    • From the details, you can gather that different chunks from the data are retrieved with a POST request. The maximum length allowed per chunk appears to be 20_000.

    Hence, you can do something like this:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    from tqdm import tqdm
    
    # base url
    URL = 'https://nuforc.org/subndx/?id=all'
    
    # minimal post data
    DATA = {'action': 'get_wdtable',
            'table_id': '1',
            'wdt_var1': 'Post',
            'wdt_var2': '-1',
            'draw': '1',
            'start': '0',   # we'll be adjusting this one
            'length': '0'   # and this one
            }
    
    def get_data(url, data, step=20_000):
        
        # measure progress with `tqdm`
        progress = tqdm(position=0, leave=True)
        
        # list `dfs` to store `df` per chunk
        dfs = []
        start = 0
        
        # break on bad connection | empty `df` (= end reached)
        while True:
            
            data.update({'start': str(start),
                         'length': str(step)})    
            
            # use `post` with `data` param
            resp = requests.post(url=URL, data=data)
            
            if resp.status_code != 200:
                # failure to connect
                break
            
            # create `soup`, locate `table` and select `tr` tags
            soup = BeautifulSoup(resp.content, 'lxml')
            
            table = soup.find('table')
            table_rows = table.find_all('tr')
            
            # reached table with no data (2 means only header and footer)
            if len(table_rows) == 2:
                break
            
            # from first chunk retrieve the headers
            if start == 0:
                headers = [th.text for th in table_rows[0].find_all('th')]
            
            df_data = []
            
            # for data proper exclude first and last row (header, footer)
            for tr in table_rows[1:-1]:
                
                # create `row` list per `tr`
                row = []
                
                for td in tr.find_all('td'):
                    
                    # href for `Open` value in column "Link"
                    if td.find('a'):
                        value = f"https://nuforc.org{td.a['href']}"
                    # text
                    else:
                        value = td.text
                    
                    # add each `value` to list `row`
                    row.append(value)
    
                # add `row` to list `df_data`
                df_data.append(row)
            
            # create a `df` and append to list `dfs`
            df = pd.DataFrame(df_data, columns=headers)
            
            dfs.append(df.assign(source=start))
            
            # increment `start` to move to next chunk
            start += step
            
            # progress update
            progress.update()
        
        # after `while loop`, concat `dfs`
        df = pd.concat(dfs, axis=0)
        return df
    
    df = get_data(URL, dict(DATA), 20_000)
    
    pd.options.display.max_columns = 50
    
    print(f"""
    {df.head()}""")
    print(f"""
    {df.shape}""")
    

    Output:

    8it [02:38, 19.84s/it] # time to retrieve data, 8 iterations
                                      Link          Occurred           City State  
    0  https://nuforc.org/sighting/?id=111  02/02/1995 23:00    Shady Grove    OR   
    1  https://nuforc.org/sighting/?id=113  02/02/1995 19:15        Denmark    WI   
    2  https://nuforc.org/sighting/?id=114  02/02/1995 20:10  Traverse City    MI   
    3  https://nuforc.org/sighting/?id=115  12/13/1994 18:55         Murphy    NC   
    4  https://nuforc.org/sighting/?id=116  02/03/1995 23:25        Fontana    CA   
    
      Country Shape                                            Summary  
    0     USA        Man and wife witness very bright, moving light...   
    1     USA  Cone  Many witness strange craft streaking in night ...   
    2     USA        4 children report seeing disc above them; bath...   
    3     USA        Woman reports seeing strange, lighted obj. wit...   
    4     USA        8 adults witness five lights in northern sky f...   
    
         Reported      Posted Image  source  
    0  02/03/1995  11/02/1999             0  
    1  02/03/1995  11/02/1999             0  
    2  02/03/1995  11/02/1999             0  
    3  02/03/1995  11/02/1999             0  
    4  03/04/1995  11/02/1999             0  
    
    (147890, 11)
    

    N.B. Using this method, the data is by default ordered ascending on the "Occurred" column. Adding 'order[0][column]': 1, 'order[0][dir]': 'desc' to DATA should, by the looks of it, re-order the data, but for some reason I don’t see the output responding to this addition. Maybe someone else reading this will be able to figure out how to make that work.
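    Since the server-side order parameters don't seem to take effect, a workaround is to sort the result locally after retrieval. A minimal sketch on a toy frame using the same "Occurred" date format as the scraped data:

```python
import pandas as pd

# toy frame with the same date format as the scraped "Occurred" column
df = pd.DataFrame({'Occurred': ['02/02/1995 23:00',
                                '12/13/1994 18:55',
                                '02/03/1995 23:25']})

# parse the strings to datetimes, then sort descending (most recent first)
order = pd.to_datetime(df['Occurred'], format='%m/%d/%Y %H:%M')
df = df.loc[order.sort_values(ascending=False).index].reset_index(drop=True)
print(df['Occurred'].tolist())
# ['02/03/1995 23:25', '02/02/1995 23:00', '12/13/1994 18:55']
```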
