I encountered the following problem scraping the https://nuforc.org/subndx/?id=all URL, which is paginated across a long list of pages (1497), with a 100-row table on each page.
I used the following Python code to scrape the web table:
import pandas as pd
insight_ufo = pd.read_html('https://nuforc.org/subndx/?id=all', displayed_only=False)[0]
It correctly loads the table into the data frame, but only the first 100 rows (the first page). I don’t know how to loop over the full list of pages.
insight_ufo
Link Occurred City State Country Shape Summary Reported Posted Image
0 Open 02/02/1995 23:00 Shady Grove OR USA NaN Man and wife witness very bright, moving light... 02/03/1995 11/02/1999 NaN
1 Open 02/02/1995 19:15 Denmark WI USA Cone Many witness strange craft streaking in night ... 02/03/1995 11/02/1999 NaN
2 Open 02/02/1995 20:10 Traverse City MI USA NaN 4 children report seeing disc above them; bath... 02/03/1995 11/02/1999 NaN
3 Open 12/13/1994 18:55 Murphy NC USA NaN Woman reports seeing strange, lighted obj. wit... 02/03/1995 11/02/1999 NaN
4 Open 02/03/1995 23:25 Fontana CA USA NaN 8 adults witness five lights in northern sky f... 03/04/1995 11/02/1999 NaN
... ... ... ... ... ... ... ... ... ... ...
96 Open 01/01/1995 22:45 Anaheim CA USA NaN Telephoned Report: Man witnessed a motionless ... 01/01/1995 11/20/2001 NaN
97 Open 01/01/1995 19:50 Warm Beach WA USA NaN Woman witnessed a red "ball" from her home nea... 01/02/1995 11/02/1999 NaN
98 Open 01/02/1995 06:25 New Port Richey FL USA NaN Man witnesses huge, "coin shaped," orange obje... 01/02/1995 11/02/1999 NaN
99 Open 01/03/1995 23:45 Salinas CA USA NaN Young woman witnessed "string of lights" for 1... 01/03/1995 11/02/1999 NaN
100 Link Occurred City State Country Shape Summary Reported Posted Image
101 rows × 10 columns
Please, can you help me?
N.B.: The URL doesn’t change when I move to another page.
Thank you for your support.
Emilio
I tried the code shown above.
2 Answers
If the table pages are generated dynamically, scraping the static page HTML will not work.
What you can do is use Selenium and click the pagination buttons to generate each page. It is slow because there are almost 1500 pages, but if you don’t need to do this often it will work.
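A minimal sketch of the Selenium idea follows. It is not a tested scraper for this site: the CSS selector for the "next" button assumes DataTables' default pager markup, the fixed sleeps are a crude stand-in for proper waits, and the helper names `collect_pages` and `scrape_with_selenium` are made up for illustration. Check the real button in the browser's inspector before relying on it.

```python
import io
import time

# Assumption: the pager follows DataTables' default markup, where the "next"
# control carries these classes and gains "disabled" on the last page.
NEXT_BUTTON_CSS = ".paginate_button.next"


def collect_pages(driver, parse_table, max_pages, pause=1.0):
    """Parse the current page, click "next", and repeat.

    Stops after max_pages pages or when the "next" button is disabled.
    Returns a list with one parsed result per visited page.
    """
    pages = []
    for _ in range(max_pages):
        pages.append(parse_table(driver.page_source))
        # "css selector" is the string value of selenium's By.CSS_SELECTOR.
        button = driver.find_element("css selector", NEXT_BUTTON_CSS)
        if "disabled" in (button.get_attribute("class") or ""):
            break  # last page reached
        button.click()
        time.sleep(pause)  # crude wait for the table to redraw
    return pages


def scrape_with_selenium(url, max_pages):
    """Drive a real Chrome session (needs selenium, pandas and an HTML parser)."""
    import pandas as pd
    from selenium import webdriver

    def first_table(html):
        # Assumption: the data table is the first <table> in the page source.
        return pd.read_html(io.StringIO(html))[0]

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        time.sleep(5)  # crude wait for the first table to render
        frames = collect_pages(driver, first_table, max_pages)
    finally:
        driver.quit()
    return pd.concat(frames, ignore_index=True)
```

Usage would look like `df = scrape_with_selenium('https://nuforc.org/subndx/?id=all', max_pages=1497)`. Trying it with a small `max_pages` first is a cheap way to confirm the selector and the waits before committing to the full run.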
Here’s one approach:
Press ctrl + shift + j to open Chrome DevTools, navigate to Network, and click on the 2 button on the page (i.e. to go to the next page).
The request this triggers goes to https://nuforc.org/wp-admin/admin-ajax.php?action=get_wdtable&table_id=1&wdt_var1=Post&wdt_var2=-1.
[...] 20_000.
Hence, you can do something like this:
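A minimal sketch of that idea, under stated assumptions: the endpoint is taken to accept the standard DataTables server-side form fields (`draw`, `start`, `length`) in a POST request and to reply with JSON containing a `data` list and a `recordsFiltered` count. The real request may need further fields copied from the payload shown in DevTools. The chunk size of 20_000 mirrors the figure mentioned above; `build_payload`, `fetch_all` and `post_with_requests` are illustrative names, and the contents of `DATA` are a guess at the payload.

```python
AJAX_URL = (
    "https://nuforc.org/wp-admin/admin-ajax.php"
    "?action=get_wdtable&table_id=1&wdt_var1=Post&wdt_var2=-1"
)

# Assumption: these standard DataTables fields are enough; add any extra
# fields that DevTools shows in the real request payload.
DATA = {"draw": 1, "start": 0, "length": 20_000}


def build_payload(start, length, draw=1):
    """Copy DATA, overriding the slice to request."""
    return {**DATA, "draw": draw, "start": start, "length": length}


def fetch_all(post, page_size=20_000, max_requests=50):
    """Request consecutive slices until every row has been received.

    `post` is any callable that takes a payload dict and returns the decoded
    JSON reply, which keeps this loop independent of the HTTP library.
    """
    rows = []
    for i in range(max_requests):
        reply = post(build_payload(start=len(rows), length=page_size, draw=i + 1))
        chunk = reply.get("data", [])
        if not chunk:
            break  # nothing more to read
        rows.extend(chunk)
        total = reply.get("recordsFiltered")
        if total is not None and len(rows) >= int(total):
            break  # all rows collected
    return rows


def post_with_requests(payload):
    """Send one POST with the `requests` library and decode the JSON reply."""
    import requests

    response = requests.post(AJAX_URL, data=payload, timeout=60)
    response.raise_for_status()
    return response.json()
```

With `requests` and pandas installed, `rows = fetch_all(post_with_requests)` followed by `pd.DataFrame(rows)` would give one data frame for the whole table. The column names would still have to be assigned by hand, for example from the header of the frame returned by `pd.read_html`.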
N.B. Using this method, the data is by default sorted in ascending order on column "Occurred". Adding 'order[0][column]': 1, 'order[0][dir]': 'desc' to DATA should, by the looks of it, re-order the data, but for some reason I don’t see the output responding to this addition. Maybe someone else reading this will be able to figure out how that might be made to work.