Good afternoon dear community,

I have finally compiled a list of working XPaths for scraping all of the information I need from my URLs.

I would like to ask for your suggestions: for a newcomer to coding, what is the best way to scrape around 50k links using only XPaths (around 100 XPaths for each link)?

Import.io is my best tool at the moment, along with SEO Tools for Excel, but both have their limitations: Import.io is expensive, and SEO Tools for Excel isn't suited to extracting more than 1000 links.

I am willing to learn whatever system is suggested, so please suggest a good way of scraping for my project!

SOLVED! The SEO Tools crawler is actually super useful, and I believe I've found what I need. I guess I'll hold off on Python or Java until I encounter another tough obstacle.
Thank you all!

2 Answers


  1. That strongly depends on what you mean by “scraping information”. What exactly do you want to mine from the websites? All major languages (certainly Java and Python, which you mentioned) have good solutions for connecting to websites, reading content, parsing HTML into a DOM, and using XPath to extract certain fragments. For example, Java has JTidy, which lets you parse even “dirty” HTML from websites into a DOM and manipulate it somewhat. However, the tools you need will depend on the exact data processing needs of your project.

  2. I would encourage you to use Python (I use 2.7.x) with Selenium. I routinely automate scraping and testing of websites with this combo (in both headed and headless modes), and Selenium lets you interact with scripted sites that do not expose an explicit web call for each and every page.

    Here is a good, quick tutorial from the Selenium docs: 2. Getting Started

    There are a lot of great sources out there, and it would take forever to post them all, but you will find the Python community very helpful, and you’ll likely see that Python is a great language for this type of web interaction.

    Good luck!
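
As a rough illustration of the "parse into a DOM, then apply XPaths" approach described in the first answer, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration: the field names, the XPath expressions, the example URLs, and the inline page markup are all hypothetical stand-ins. To stay self-contained it uses only the standard library's `xml.etree.ElementTree`, which accepts well-formed markup only and supports just a subset of XPath, so for real, messy pages you would most likely swap in a lenient HTML parser with full XPath 1.0 support (lxml is one such library) and download each page before parsing it.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical field names and XPaths; replace these with your own list.
XPATHS = {
    "title": ".//h1",
    "price": ".//span[@class='price']",
}

# Stand-in pages keyed by hypothetical URLs. In a real run the markup
# would be downloaded for each URL instead of being hard-coded here.
PAGES = {
    "http://example.com/a": (
        "<html><body><h1>Widget A</h1>"
        "<span class='price'>9.99</span></body></html>"
    ),
    "http://example.com/b": "<html><body><h1>Widget B</h1></body></html>",
}


def extract_row(markup, xpaths):
    """Apply every named XPath to one page; return a dict of field -> text."""
    root = ET.fromstring(markup)
    row = {}
    for field, xpath in xpaths.items():
        node = root.find(xpath)
        # A missing node becomes an empty string, so every row keeps
        # the same set of columns.
        row[field] = "".join(node.itertext()).strip() if node is not None else ""
    return row


def rows_to_csv(rows, fieldnames):
    """Serialize the collected rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = []
for url, markup in PAGES.items():
    row = {"url": url}
    row.update(extract_row(markup, XPATHS))
    rows.append(row)

csv_text = rows_to_csv(rows, ["url"] + list(XPATHS))
print(csv_text)
```

Keeping the XPaths in one dictionary means that a list of around 100 expressions only changes the table, not the loop. For 50k links you would probably also want to write rows to a file as you go and log failed pages rather than holding everything in memory.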
