
I have a script that fetches several web pages and parses the info.

(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )

I ran cProfile on it, and as I assumed, urlopen takes up a lot of time. Is there a way to fetch the pages faster, or a way to fetch several pages at once? I'll do whatever is simplest, as I'm new to Python and web development.

Thanks in advance! 🙂

UPDATE: I have a function called fetchURLs(), which I use to make an array of the URLs I need, so something like urls = fetchURLs(). The URLs are all XML files from Amazon and eBay APIs (which confuses me as to why it takes so long to load; maybe my web host is slow?).

What I need to do is load each URL, read each page, and send that data to another part of the script which will parse and display the data.

Note that I can't do the latter part until ALL of the pages have been fetched; that's what my issue is.

Also, my host limits me to 25 processes at a time, I believe, so whatever is easiest on the server would be nice 🙂


Here is the profile output:

Sun Aug 15 20:51:22 2010    prof

         211352 function calls (209292 primitive calls) in 22.254 CPU seconds

   Ordered by: internal time
   List reduced from 404 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10   18.056    1.806   18.056    1.806 {_socket.getaddrinfo}
     4991    2.730    0.001    2.730    0.001 {method 'recv' of '_socket.socket' objects}
       10    0.490    0.049    0.490    0.049 {method 'connect' of '_socket.socket' objects}
     2415    0.079    0.000    0.079    0.000 {method 'translate' of 'unicode' objects}
       12    0.061    0.005    0.745    0.062 /usr/local/lib/python2.6/HTMLParser.py:132(goahead)
     3428    0.060    0.000    0.202    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1306(endData)
     1698    0.055    0.000    0.068    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1351(_smartPop)
     4125    0.053    0.000    0.056    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:118(setup)
     1698    0.042    0.000    0.358    0.000 /usr/local/lib/python2.6/HTMLParser.py:224(parse_starttag)
     1698    0.042    0.000    0.275    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1397(unknown_starttag)

11 Answers


  1. The actual wait is probably not in urllib2 but in the server and/or your network connection to the server.

    There are 2 ways of speeding this up.

    1. Keep the connection alive (see this question on how to do that: Python urllib2 with keep alive)
    2. Use multiple connections; you can use threads or an async approach, as Aaron Gallagher suggested. For that, simply use any threading example and you should do fine 🙂 You can also use the multiprocessing lib to make things pretty easy (a minimal sketch follows below).
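
    As a minimal sketch of option 2 (my illustration, not part of the original answer), multiprocessing.dummy provides a thread pool behind the multiprocessing API, so the whole fetch comes down to a map call. It assumes Python 2 with urllib2; the URLs are placeholders.

    # Sketch only: a thread pool via multiprocessing.dummy (threads, not processes).
    # Assumes Python 2 / urllib2; the URLs below are hypothetical placeholders.
    from multiprocessing.dummy import Pool
    import urllib2

    urls = ['http://example.com/a', 'http://example.com/b']  # placeholders

    def fetch(url):
        return urllib2.urlopen(url).read()

    pool = Pool(4)                  # 4 worker threads
    pages = pool.map(fetch, urls)   # blocks until every page has been fetched
    pool.close()
    pool.join()
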
  2. Fetching webpages obviously will take a while as you’re not accessing anything local. If you have several to access, you could use the threading module to run a couple at once.

    Here's a very crude example:

    import threading
    import urllib2
    import time
    
    urls = ['http://docs.python.org/library/threading.html',
            'http://docs.python.org/library/thread.html',
            'http://docs.python.org/library/multiprocessing.html',
            'http://docs.python.org/howto/urllib2.html']
    data1 = []
    data2 = []
    
    class PageFetch(threading.Thread):
        def __init__(self, url, datadump):
            self.url = url
            self.datadump = datadump
            threading.Thread.__init__(self)
        def run(self):
            page = urllib2.urlopen(self.url)
            self.datadump.append(page.read()) # don't do it like this.
    
    print "Starting threaded reads:"
    start = time.clock()
    for url in urls:
        PageFetch(url, data2).start()
    while len(data2) < len(urls): pass # don't do this either.
    print "...took %f seconds" % (time.clock() - start)
    
    print "Starting sequential reads:"
    start = time.clock()
    for url in urls:
        page = urllib2.urlopen(url)
        data1.append(page.read())
    print "...took %f seconds" % (time.clock() - start)
    
    for i,x in enumerate(data1):
        print len(data1[i]), len(data2[i])
    

    This was the output when I ran it:

    Starting threaded reads:
    ...took 2.035579 seconds
    Starting sequential reads:
    ...took 4.307102 seconds
    73127 19923
    19923 59366
    361483 73127
    59366 361483
    

    Grabbing the data from the threads by appending to a shared list is probably ill-advised (a Queue would be better), but it illustrates that there is a difference.
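
    As a hedged sketch of that suggestion (my addition, not part of the original answer), each thread could put its result on a Queue.Queue and the main thread block on get() instead of polling:

    # Sketch only: collect results through a Queue rather than appending to a
    # shared list and busy-waiting. Assumes Python 2 (Queue, urllib2).
    import threading
    import urllib2
    import Queue

    def fetch_into_queue(url, q):
        q.put((url, urllib2.urlopen(url).read()))

    def fetch_all(urls):
        q = Queue.Queue()
        for url in urls:
            threading.Thread(target=fetch_into_queue, args=(url, q)).start()
        # q.get() blocks, so there is no busy-waiting; one get() per URL
        return dict(q.get() for _ in urls)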

  3. Use twisted! It makes this kind of thing absurdly easy compared to, say, using threads.

    from twisted.internet import defer, reactor
    from twisted.web.client import getPage
    import time
    
    def processPage(page, url):
        # do something here.
        return url, len(page)
    
    def printResults(result):
        for success, value in result:
            if success:
                print 'Success:', value
            else:
                print 'Failure:', value.getErrorMessage()
    
    def printDelta(_, start):
        delta = time.time() - start
        print 'ran in %0.3fs' % (delta,)
        return delta
    
    urls = [
        'http://www.google.com/',
        'http://www.lycos.com/',
        'http://www.bing.com/',
        'http://www.altavista.com/',
        'http://achewood.com/',
    ]
    
    def fetchURLs():
        callbacks = []
        for url in urls:
            d = getPage(url)
            d.addCallback(processPage, url)
            callbacks.append(d)
    
        callbacks = defer.DeferredList(callbacks)
        callbacks.addCallback(printResults)
        return callbacks
    
    @defer.inlineCallbacks
    def main():
        times = []
        for x in xrange(5):
            d = fetchURLs()
            d.addCallback(printDelta, time.time())
            times.append((yield d))
        print 'avg time: %0.3fs' % (sum(times) / len(times),)
    
    reactor.callWhenRunning(main)
    reactor.run()
    

    This code also performs better than any of the other solutions posted (edited after I closed some things that were using a lot of bandwidth):

    Success: ('http://www.google.com/', 8135)
    Success: ('http://www.lycos.com/', 29996)
    Success: ('http://www.bing.com/', 28611)
    Success: ('http://www.altavista.com/', 8378)
    Success: ('http://achewood.com/', 15043)
    ran in 0.518s
    Success: ('http://www.google.com/', 8135)
    Success: ('http://www.lycos.com/', 30349)
    Success: ('http://www.bing.com/', 28611)
    Success: ('http://www.altavista.com/', 8378)
    Success: ('http://achewood.com/', 15043)
    ran in 0.461s
    Success: ('http://www.google.com/', 8135)
    Success: ('http://www.lycos.com/', 30033)
    Success: ('http://www.bing.com/', 28611)
    Success: ('http://www.altavista.com/', 8378)
    Success: ('http://achewood.com/', 15043)
    ran in 0.435s
    Success: ('http://www.google.com/', 8117)
    Success: ('http://www.lycos.com/', 30349)
    Success: ('http://www.bing.com/', 28611)
    Success: ('http://www.altavista.com/', 8378)
    Success: ('http://achewood.com/', 15043)
    ran in 0.449s
    Success: ('http://www.google.com/', 8135)
    Success: ('http://www.lycos.com/', 30349)
    Success: ('http://www.bing.com/', 28611)
    Success: ('http://www.altavista.com/', 8378)
    Success: ('http://achewood.com/', 15043)
    ran in 0.547s
    avg time: 0.482s
    

    And using Nick T’s code, rigged up to also give the average of five and show the output better:

    Starting threaded reads:
    ...took 1.921520 seconds ([8117, 30070, 15043, 8386, 28611])
    Starting threaded reads:
    ...took 1.779461 seconds ([8135, 15043, 8386, 30349, 28611])
    Starting threaded reads:
    ...took 1.756968 seconds ([8135, 8386, 15043, 30349, 28611])
    Starting threaded reads:
    ...took 1.762956 seconds ([8386, 8135, 15043, 29996, 28611])
    Starting threaded reads:
    ...took 1.654377 seconds ([8117, 30349, 15043, 8386, 28611])
    avg time: 1.775s
    
    Starting sequential reads:
    ...took 1.389803 seconds ([8135, 30147, 28611, 8386, 15043])
    Starting sequential reads:
    ...took 1.457451 seconds ([8135, 30051, 28611, 8386, 15043])
    Starting sequential reads:
    ...took 1.432214 seconds ([8135, 29996, 28611, 8386, 15043])
    Starting sequential reads:
    ...took 1.447866 seconds ([8117, 30028, 28611, 8386, 15043])
    Starting sequential reads:
    ...took 1.468946 seconds ([8153, 30051, 28611, 8386, 15043])
    avg time: 1.439s
    

    And using Wai Yip Tung’s code:

    Fetched 8117 from http://www.google.com/
    Fetched 28611 from http://www.bing.com/
    Fetched 8386 from http://www.altavista.com/
    Fetched 30051 from http://www.lycos.com/
    Fetched 15043 from http://achewood.com/
    done in 0.704s
    Fetched 8117 from http://www.google.com/
    Fetched 28611 from http://www.bing.com/
    Fetched 8386 from http://www.altavista.com/
    Fetched 30114 from http://www.lycos.com/
    Fetched 15043 from http://achewood.com/
    done in 0.845s
    Fetched 8153 from http://www.google.com/
    Fetched 28611 from http://www.bing.com/
    Fetched 8386 from http://www.altavista.com/
    Fetched 30070 from http://www.lycos.com/
    Fetched 15043 from http://achewood.com/
    done in 0.689s
    Fetched 8117 from http://www.google.com/
    Fetched 28611 from http://www.bing.com/
    Fetched 8386 from http://www.altavista.com/
    Fetched 30114 from http://www.lycos.com/
    Fetched 15043 from http://achewood.com/
    done in 0.647s
    Fetched 8135 from http://www.google.com/
    Fetched 28611 from http://www.bing.com/
    Fetched 8386 from http://www.altavista.com/
    Fetched 30349 from http://www.lycos.com/
    Fetched 15043 from http://achewood.com/
    done in 0.693s
    avg time: 0.715s
    

    I’ve gotta say, I do like that the sequential fetches performed better for me.

  4. EDIT: I'm expanding the answer to include a more polished example. I have found a lot of hostility and misinformation in this post regarding threading vs. async I/O, so I am also adding more arguments to refute certain invalid claims. I hope this will help people choose the right tool for the right job.

    This is a duplicate of a question from 3 days ago: Python urllib2.urlopen() is slow, need a better way to read several urls – Stack Overflow.

    I'm polishing the code to show how to fetch multiple web pages in parallel using threads.

    import time
    import threading
    import urllib2
    import Queue
    
    # utility - spawn a thread to execute target for each args
    def run_parallel_in_threads(target, args_list):
        result = Queue.Queue()
        # wrapper to collect return value in a Queue
        def task_wrapper(*args):
            result.put(target(*args))
        threads = [threading.Thread(target=task_wrapper, args=args) for args in args_list]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return result
    
    def dummy_task(n):
        for i in xrange(n):
            time.sleep(0.1)
        return n
    
    # below is the application code
    urls = [
        ('http://www.google.com/',),
        ('http://www.lycos.com/',),
        ('http://www.bing.com/',),
        ('http://www.altavista.com/',),
        ('http://achewood.com/',),
    ]
    
    def fetch(url):
        return urllib2.urlopen(url).read()
    
    run_parallel_in_threads(fetch, urls)
    

    As you can see, the application-specific code has only 3 lines, which can be collapsed into 1 line if you are aggressive. I don't think anyone can justify the claim that this is complex and unmaintainable.

    Unfortunately, most of the other threading code posted here has some flaw. Many of the examples do active polling to wait for the code to finish; join() is a better way to synchronize. I think this code has improved upon all the threading examples so far.

    keep-alive connection

    WoLpH's suggestion to use a keep-alive connection could be very useful if all your URLs point to the same server.

    twisted

    Aaron Gallagher is a fan of the twisted framework and he is hostile to anyone who suggests threads. Unfortunately, a lot of his claims are misinformation. For example, he said “-1 for suggesting threads. This is IO-bound; threads are useless here.” This is contrary to evidence, as both Nick T and I have demonstrated a speed gain from using threads. In fact, I/O-bound applications have the most to gain from using Python's threads (vs. no gain in CPU-bound applications). Aaron's misguided criticism of threads shows he is rather confused about parallel programming in general.

    Right tool for the right job

    I'm well aware of the issues pertaining to parallel programming with threads, Python, async I/O and so on. Each tool has its pros and cons. For each situation there is an appropriate tool. I'm not against twisted (though I have not deployed it myself). But I don't believe we can flatly say that threads are BAD and twisted is GOOD in all situations.

    For example, if the OP's requirement is to fetch 10,000 websites in parallel, async I/O will be preferable. Threading won't be appropriate (unless maybe with Stackless Python).

    Aaron's opposition to threads is mostly generalization. He fails to recognize that this is a trivial parallelization task: each task is independent and does not share resources. So most of his attacks do not apply.

    Given that my code has no external dependencies, I'll call it the right tool for the right job.

    Performance

    I think most people would agree that the performance of this task largely depends on the networking code and the external server, and that the performance of the platform code should have a negligible effect. However, Aaron's benchmark shows a 50% speed gain over the threaded code. I think it is necessary to respond to this apparent speed gain.

    In Nick's code, there is an obvious flaw that caused the inefficiency. But how do you explain the 233ms speed gain over my code? I think even twisted fans will refrain from jumping to the conclusion that this is due to the efficiency of twisted. There are, after all, a huge number of variables outside the system code, like the remote server's performance, the network, caching, the different implementations of urllib2 and the twisted web client, and so on.

    Just to make sure Python's threading does not incur a huge amount of inefficiency, I did a quick benchmark spawning 5 threads and then 500 threads. I am quite comfortable saying the overhead of spawning 5 threads is negligible and cannot explain the 233ms speed difference.

    In [274]: %time run_parallel_in_threads(dummy_task, [(0,)]*5)
    CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
    Wall time: 0.00 s
    Out[275]: <Queue.Queue instance at 0x038B2878>
    
    In [276]: %time run_parallel_in_threads(dummy_task, [(0,)]*500)
    CPU times: user 0.16 s, sys: 0.00 s, total: 0.16 s
    Wall time: 0.16 s
    
    In [278]: %time run_parallel_in_threads(dummy_task, [(10,)]*500)
    CPU times: user 1.13 s, sys: 0.00 s, total: 1.13 s
    Wall time: 1.13 s       <<<<<<<< This means 0.13s of overhead
    

    Further testing of my parallel fetching shows huge variability in the response time across 17 runs. (Unfortunately, I don't have twisted installed to verify Aaron's code.)

    0.75 s
    0.38 s
    0.59 s
    0.38 s
    0.62 s
    1.50 s
    0.49 s
    0.36 s
    0.95 s
    0.43 s
    0.61 s
    0.81 s
    0.46 s
    1.21 s
    2.87 s
    1.04 s
    1.72 s
    

    My testing does not support Aaron’s conclusion that threading is consistently slower than async I/O by a measurable margin. Given the number of variables involved, I have to say this is not a valid test to measure the systematic performance difference between async I/O and threading.

  5. Here is an example using Python threads. The other threaded examples here launch a thread per URL, which is not very friendly behaviour if it causes too many hits for the server to handle (for example, it is common for spiders to have many URLs on the same host).

    from threading import Thread
    from urllib2 import urlopen
    from time import time, sleep
    
    WORKERS=1
    urls = ['http://docs.python.org/library/threading.html',
            'http://docs.python.org/library/thread.html',
            'http://docs.python.org/library/multiprocessing.html',
            'http://docs.python.org/howto/urllib2.html']*10
    results = []
    
    class Worker(Thread):
        def run(self):
            while urls:
                url = urls.pop()
                results.append((url, urlopen(url).read()))
    
    start = time()
    threads = [Worker() for i in range(WORKERS)]
    for t in threads:
        t.start()
    
    # busy-wait until all 40 results are in (40 == the original number of urls)
    while len(results) < 40:
        sleep(0.1)
    print time()-start
    

    Note: The times given here are for 40 urls and will depend a lot on the speed of your internet connection and the latency to the server. Being in Australia, my ping is > 300ms

    With WORKERS=1 it took 86 seconds to run
    With WORKERS=4 it took 23 seconds to run
    With WORKERS=10 it took 10 seconds to run

    so having 10 threads downloading is 8.6 times as fast as a single thread.

    Here is an upgraded version that uses a Queue. There are at least a few advantages:
    1. The URLs are requested in the order that they appear in the list.
    2. q.join() can be used to detect when the requests have all completed.
    3. The results are kept in the same order as the URL list.

    from threading import Thread
    from urllib2 import urlopen
    from time import time, sleep
    from Queue import Queue
    
    WORKERS=10
    urls = ['http://docs.python.org/library/threading.html',
            'http://docs.python.org/library/thread.html',
            'http://docs.python.org/library/multiprocessing.html',
            'http://docs.python.org/howto/urllib2.html']*10
    results = [None]*len(urls)
    
    def worker():
        while True:
            i, url = q.get()
            # print "requesting ", i, url       # if you want to see what's going on
            results[i]=urlopen(url).read()
            q.task_done()
    
    start = time()
    q = Queue()
    for i in range(WORKERS):
        t=Thread(target=worker)
        t.daemon = True
        t.start()
    
    for i,url in enumerate(urls):
        q.put((i,url))
    q.join()
    print time()-start
    
  6. Nowadays there is an excellent Python library that does this for you, called requests.

    Use the standard API of requests if you want a thread-based solution, or the async API (using gevent under the hood) if you want a solution based on non-blocking IO.
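
    As a rough sketch of the thread-based route (my example, not from the original answer; it assumes the requests package is installed and that concurrent.futures is available, which it is in the Python 3 standard library and via the futures backport on Python 2):

    # Sketch only: plain requests.get calls run in a small thread pool.
    import requests
    from concurrent.futures import ThreadPoolExecutor

    def fetch_all(urls, workers=5):
        with ThreadPoolExecutor(max_workers=workers) as executor:
            return list(executor.map(lambda url: requests.get(url).content, urls))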

  7. Most of the answers focused on fetching multiple pages from different servers at the same time (threading), but not on reusing an already open HTTP connection, which matters if the OP is making multiple requests to the same server/site.

    In urllib2 a separate connection is created for each request, which impacts performance and, as a result, slows the rate of fetching pages. urllib3 solves this problem by using a connection pool. You can read more here: urllib3 [also thread-safe].

    There is also Requests, an HTTP library that uses urllib3.

    This, combined with threading, should increase the speed of fetching pages.
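
    As a minimal sketch of the urllib3 route (my example, assuming urllib3 is installed): PoolManager keeps a connection pool per host, so repeated requests to the same server reuse an already open connection.

    # Sketch only: urllib3's PoolManager reuses connections per host.
    import urllib3

    http = urllib3.PoolManager()

    def fetch(url):
        return http.request('GET', url).data

    pages = [fetch(url) for url in urls]  # `urls` as defined elsewhere in your script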

  8. Here’s a standard library solution. It’s not quite as fast, but it uses less memory than the threaded solutions.

    try:
        from http.client import HTTPConnection, HTTPSConnection
    except ImportError:
        from httplib import HTTPConnection, HTTPSConnection
    connections = []
    results = []
    
    # Issue every request first, without waiting for any response...
    for url in urls:
        scheme, _, host, path = url.split('/', 3)   # e.g. 'http:', '', host, rest-of-path
        h = (HTTPConnection if scheme == 'http:' else HTTPSConnection)(host)
        h.request('GET', '/' + path)
        connections.append(h)
    # ...then collect the responses one by one.
    for h in connections:
        results.append(h.getresponse().read())
    

    Also, if most of your requests are to the same host, then reusing the same http connection would probably help more than doing things in parallel.
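
    To illustrate that last point, here is a hedged sketch (my addition) that reuses a single keep-alive connection for several paths on the same host; each response has to be read fully before the next request is sent. The host and paths are placeholders.

    # Sketch only: several GETs over one HTTP/1.1 keep-alive connection.
    try:
        from http.client import HTTPConnection   # Python 3
    except ImportError:
        from httplib import HTTPConnection       # Python 2

    def fetch_from_host(host, paths):
        conn = HTTPConnection(host)
        pages = []
        for path in paths:
            conn.request('GET', path)
            pages.append(conn.getresponse().read())  # read fully before the next request
        conn.close()
        return pages

    pages = fetch_from_host('docs.python.org', ['/library/threading.html',
                                                '/howto/urllib2.html'])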

  9. Since this question was posted, it looks like there's a higher-level abstraction available, ThreadPoolExecutor:

    https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example

    The example from there, pasted here for convenience:

    import concurrent.futures
    import urllib.request
    
    URLS = ['http://www.foxnews.com/',
            'http://www.cnn.com/',
            'http://europe.wsj.com/',
            'http://www.bbc.co.uk/',
            'http://some-made-up-domain.com/']
    
    # Retrieve a single page and report the url and contents
    def load_url(url, timeout):
        with urllib.request.urlopen(url, timeout=timeout) as conn:
            return conn.read()
    
    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Start the load operations and mark each future with its URL
        future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
            else:
                print('%r page is %d bytes' % (url, len(data)))
    

    There's also Executor.map, which I think makes the code easier: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map
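
    For instance, reusing load_url and URLS from the example above, a map-based version might look roughly like this (a sketch, not taken from the docs). Note that map yields results in input order and re-raises any exception when its result is consumed, so there is no per-URL try/except here.

    # Sketch only: Executor.map instead of submit()/as_completed().
    import concurrent.futures
    import functools

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        fetched = executor.map(functools.partial(load_url, timeout=60), URLS)
        for url, data in zip(URLS, fetched):
            print('%r page is %d bytes' % (url, len(data)))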

  10. Here is a Python network benchmark script for identifying where a single connection is slow:

    """Python network test."""
    from socket import create_connection
    from time import time
    
    try:
        from urllib2 import urlopen
    except ImportError:
        from urllib.request import urlopen
    
    TIC = time()
    create_connection(('216.58.194.174', 80))
    print('Duration socket IP connection (s): {:.2f}'.format(time() - TIC))
    
    TIC = time()
    create_connection(('google.com', 80))
    print('Duration socket DNS connection (s): {:.2f}'.format(time() - TIC))
    
    TIC = time()
    urlopen('http://216.58.194.174')
    print('Duration urlopen IP connection (s): {:.2f}'.format(time() - TIC))
    
    TIC = time()
    urlopen('http://google.com')
    print('Duration urlopen DNS connection (s): {:.2f}'.format(time() - TIC))
    

    And example of results with Python 3.6:

    Duration socket IP connection (s): 0.02
    Duration socket DNS connection (s): 75.51
    Duration urlopen IP connection (s): 75.88
    Duration urlopen DNS connection (s): 151.42
    

    Python 2.7.13 has very similar results.

    In this case, DNS and urlopen slowness are easily identified.

  11. Ray offers an elegant way to do this (in both Python 2 and Python 3). Ray is a library for writing parallel and distributed Python.

    Simply define the fetch function with the @ray.remote decorator. Then you can fetch a URL in the background by calling fetch.remote(url).

    import ray
    import sys
    
    ray.init()
    
    @ray.remote
    def fetch(url):
        if sys.version_info >= (3, 0):
            import urllib.request
            return urllib.request.urlopen(url).read()
        else:
            import urllib2
            return urllib2.urlopen(url).read()
    
    urls = ['https://en.wikipedia.org/wiki/Donald_Trump',
            'https://en.wikipedia.org/wiki/Barack_Obama',
            'https://en.wikipedia.org/wiki/George_W._Bush',
            'https://en.wikipedia.org/wiki/Bill_Clinton',
            'https://en.wikipedia.org/wiki/George_H._W._Bush']
    
    # Fetch the webpages in parallel.
    results = ray.get([fetch.remote(url) for url in urls])
    

    If you also want to process the webpages in parallel, you can either put the processing code directly into fetch, or you can define a new remote function and compose them together.

    @ray.remote
    def process(html):
        tokens = html.split()
        return set(tokens)
    
    # Fetch and process the pages in parallel.
    results = []
    for url in urls:
        results.append(process.remote(fetch.remote(url)))
    results = ray.get(results)
    

    If you have a very long list of URLs that you want to fetch, you may wish to issue some tasks and then process them in the order that they complete. You can do this using ray.wait.

    urls = 100 * urls  # Pretend we have a long list of URLs.
    results = []
    
    in_progress_ids = []
    
    # Start pulling 10 URLs in parallel.
    for _ in range(10):
        url = urls.pop()
        in_progress_ids.append(fetch.remote(url))
    
    # Whenever one finishes, start fetching a new one.
    while len(in_progress_ids) > 0:
        # Get a result that has finished.
        [ready_id], in_progress_ids = ray.wait(in_progress_ids)
        results.append(ray.get(ready_id))
        # Start a new task.
        if len(urls) > 0:
            in_progress_ids.append(fetch.remote(urls.pop()))
    

    View the Ray documentation.
