I want to make a crawler that keeps going indefinitely until a page has no links. Every time it crawls a page it should return the HTML of the page so I can parse it and get the title, meta tags, and information from article or p tags. I basically want it to look like this:
while(num_links_in_page > 0){
html = page.content
/* code to parse html */
insert_in_db(html, meta, title, info, url)
}
I am using PHP, JavaScript, and MySQL for the DB, but I have no problem switching to Python or any other language. I do not have much money for distributed systems, but I need it to be fast: the crawler I made from scratch takes 20 minutes to crawl 5 links and also stops after about 50 links.
2 Answers
What have you tried so far? You need a lot more than what you have above; you need something along these lines:
http://talkerscode.com/webtricks/create-simple-web-crawler-using-php-and-mysql.php
The slowest part of your crawler is fetching the page. This can be addressed by running several fetches at once, in separate processes or threads that run independently of one another. Perl can spawn processes, PHP 8 can use the "parallel" extension, and shell scripts (at least on Linux-like OSs) can run jobs in the background. I recommend about 10 simultaneous fetches as a reasonable tradeoff among the various competing resource limits.
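If you would rather stay in a single PHP process instead of spawning separate ones, PHP's built-in curl_multi API gives a similar effect by running several HTTP transfers concurrently. This is only a sketch of that alternative; the fetch_batch name and the option values are my own choices, not part of any existing API:

<?php
// Sketch: fetch a batch of URLs concurrently with curl_multi.
// Returns an array of url => html (or null on failure).
function fetch_batch(array $urls): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => 15,   // avoid hanging on slow or huge pages
            CURLOPT_USERAGENT      => 'MyCrawler/0.1',
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Run all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($multi, $active);
        if ($active) {
            curl_multi_select($multi); // wait for activity instead of busy-looping
        }
    } while ($active && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $html = curl_multi_getcontent($ch);
        $results[$url] = ($html === false || $html === '') ? null : $html;
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $results;
}

// Example: crawl 10 links at a time, as recommended above.
// $results = fetch_batch(array_slice($queue, 0, 10));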
For Perl, CPAN's WWW::Mechanize will do all the parsing for you and hand you an array of links.
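If you stay in PHP rather than Perl, a rough equivalent of that parsing can be done with the built-in DOMDocument and DOMXPath classes. This is only a sketch; extract_page_data is an illustrative name, and resolving relative links against the base URL is left to you:

<?php
// Sketch: extract the title, meta tags, and links from raw HTML.
function extract_page_data(string $html, string $baseUrl): array
{
    $doc = new DOMDocument();
    // Suppress warnings from real-world, badly formed HTML.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $titleNode = $xpath->query('//title')->item(0);
    $title = $titleNode ? trim($titleNode->textContent) : '';

    $meta = [];
    foreach ($xpath->query('//meta[@name]') as $m) {
        $meta[$m->getAttribute('name')] = $m->getAttribute('content');
    }

    $links = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = $a->getAttribute('href'); // may be relative; resolve against $baseUrl before queueing
    }

    return ['title' => $title, 'meta' => $meta, 'links' => $links];
}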
The second slowest part is inserting rows into the table one at a time. Collect them up and build a multi-row "batch" INSERT. I recommend limiting the batch to 100 rows.
You also need to avoid crawling the same site repeatedly. To assist with that, I suggest a TIMESTAMP be included with each row (plus other logic).
With the above advice, I would expect at least 10 new links per second, and no "stop after 50". On the other hand, there are several things that can cause slowdowns or hiccups: huge pages, distant domains, access denied, etc.
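Here is a minimal sketch of that batching with PDO against MySQL. The crawl_data table, its columns, and the flush_batch name are assumptions for illustration, not taken from your schema:

<?php
// Sketch: buffer parsed pages and insert them 100 rows at a time.
// Assumes a table roughly like:
//   CREATE TABLE crawl_data (
//       url VARCHAR(255) PRIMARY KEY,
//       title VARCHAR(255),
//       html MEDIUMTEXT,
//       crawled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
//   );
function flush_batch(PDO $db, array $rows): void
{
    if (!$rows) {
        return;
    }
    // One placeholder group per row: (?, ?, ?), (?, ?, ?), ...
    $placeholders = implode(', ', array_fill(0, count($rows), '(?, ?, ?)'));
    $sql = "INSERT INTO crawl_data (url, title, html)
            VALUES $placeholders
            ON DUPLICATE KEY UPDATE crawled_at = CURRENT_TIMESTAMP";

    $params = [];
    foreach ($rows as $row) {
        $params[] = $row['url'];
        $params[] = $row['title'];
        $params[] = $row['html'];
    }
    $db->prepare($sql)->execute($params);
}

// In the crawl loop: collect rows, flush every 100.
// $buffer[] = ['url' => $url, 'title' => $title, 'html' => $html];
// if (count($buffer) >= 100) { flush_batch($db, $buffer); $buffer = []; }

The ON DUPLICATE KEY UPDATE clause refreshes the timestamp when a URL is seen again, which gives you the "have I crawled this recently?" check in the same statement.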
Also, don't pound on a single domain. That may even explain why your crawler stops after about 50 links: a DoS monitor could have seen 50 requests arrive within a few seconds and blacklisted your IP address. Be sure to wait several seconds before following another link to any domain you have recently fetched a page from.
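A simple way to enforce that delay in a single-process crawler is to remember the last fetch time per host and sleep when the same host comes up again too soon. The wait_for_domain name and the 5-second gap below are just illustrative choices:

<?php
// Sketch: per-domain politeness delay.
// $lastFetch maps hostname => unix timestamp of the last request to it.
function wait_for_domain(array &$lastFetch, string $url, int $minGapSeconds = 5): void
{
    $host = parse_url($url, PHP_URL_HOST);
    if ($host === null || $host === false) {
        return; // unparseable URL; let the fetcher deal with it
    }
    if (isset($lastFetch[$host])) {
        $elapsed = time() - $lastFetch[$host];
        if ($elapsed < $minGapSeconds) {
            sleep($minGapSeconds - $elapsed); // politeness pause
        }
    }
    $lastFetch[$host] = time();
}

// Usage before each fetch:
// wait_for_domain($lastFetch, $url);
// $html = file_get_contents($url); // or the concurrent batch shown earlier, grouped by host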
Even without the advice above, your numbers ("20 minutes", "5 links", "stops after 50") suggest there are other bugs in your current crawler.