I am 100% sure this is not an issue with my internet connection or speed. I want to download a text/JSON file with, say, 10K+ lines (24 MB) from GitHub using Python, but the download takes too long, whether I use urllib.request or requests. I can't seem to find any solutions for this online; all the references I found deal either with small text files or with huge files that get downloaded in chunks.

import requests

url = 'https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json'
r = requests.get(url)
open('large-file.json', 'wb').write(r.content)

It takes way too long for me (1+ minute). If I download it manually from a browser, it takes less than 10 seconds.
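
For reference, the chunked approach those references describe looks roughly like this (a minimal sketch using requests' streaming API; the 1 MB chunk size is an arbitrary choice):

import requests

url = 'https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json'

# Stream the body instead of loading it into memory in one piece.
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open('large-file.json', 'wb') as f:
        # iter_content yields the body in pieces as they arrive
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)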

2 Answers


  1. I used the same code and the execution time was less than 2 seconds.

    import requests
    from timeit import timeit
    
    def get_data():
        url = 'https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json'
        r = requests.get(url)
        open('large-file.json', 'wb').write(r.content)
        # r.elapsed: time from sending the request until the response headers were parsed
        t = r.elapsed.total_seconds()
        print(f"{t} seconds elapsed")
    
    print(f"{timeit(get_data, number=1)} seconds")
    

    This was the output:

    0.268242 seconds elapsed
    1.7471522999999252 seconds
    

    The amount of time elapsed between sending the request and the arrival of the response was 0.268242 seconds, while the code snippet took 1.747152 seconds to execute.
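
    If you want to see where the remaining time goes, you can split the timing into time-to-headers and body download (a minimal sketch; requests stops the elapsed clock once the response headers are parsed, so the body transfer is measured separately here):

    import time
    import requests

    url = 'https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json'

    start = time.perf_counter()
    with requests.get(url, stream=True) as r:
        headers_done = time.perf_counter()
        body = r.content  # accessing .content forces the actual body download
    downloaded = time.perf_counter()

    print(f"headers after {headers_done - start:.3f}s, "
          f"body after a further {downloaded - headers_done:.3f}s")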

  2. After multiple tests using different methods (requests, requests.Session, and httpx with HTTP/2), it is clear that the timings differ from run to run. This variation is likely due to several factors, such as network latency, server load, and TCP connection setup.

    Tests

    Here are the code and results from the tests conducted:

    Original code:

    Test 1: 0.300804 seconds elapsed, 1.0805199 seconds total

    Test 2: 0.120291 seconds elapsed, 2.7386789 seconds total

    Test 3: 0.125401 seconds elapsed, 0.7138218999999999 seconds total

    HTTPX with HTTP/2:

    import httpx
    from timeit import timeit
    
    def get_data():
        url = 'https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json'
        
        with httpx.Client(http2=True) as client:
            r = client.get(url)
            with open('large-file.json', 'wb') as f:
                f.write(r.content)
        
        # Timing the request itself
        t = r.elapsed.total_seconds()
        print(f"{t} seconds elapsed (HTTPX request)")
    
    # Timing the entire function execution
    print(f"{timeit(get_data, number=1)} seconds (total execution time)")
    

    Test 1: 0.941488 seconds elapsed (request), 0.9907317 seconds (total execution time)

    Test 2: 2.630461 seconds elapsed (request), 2.6783040 seconds (total execution time)

    Test 3: 0.604343 seconds elapsed (request), 0.6498959 seconds (total execution time)

    Requests with Session:

    import requests
    from timeit import timeit
    
    def get_data():
        url = 'https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json'
        
        with requests.Session() as session:
            response = session.get(url)
            with open('large-file.json', 'wb') as f:
                f.write(response.content)
        
        # Timing the request itself
        t = response.elapsed.total_seconds()
        print(f"{t} seconds elapsed (requests.Session request)")
    
    # Timing the entire function execution
    print(f"{timeit(get_data, number=1)} seconds (total execution time)")
    

    Test 1: 0.248372 seconds elapsed (request), 1.0908065 seconds (total execution time)

    Test 2: 0.131544 seconds elapsed (request), 2.547926 seconds (total execution time)

    Test 3: 0.135681 seconds elapsed (request), 0.7969455 seconds (total execution time)
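
    A Session mainly pays off when you reuse it for several requests to the same host; for a single GET it behaves essentially like requests.get. A minimal sketch of that reuse (the same URL is repeated purely for illustration):

    import requests
    from timeit import timeit

    url = 'https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json'

    with requests.Session() as session:
        # The first request pays for TCP/TLS setup; the following ones reuse the pooled connection.
        for i in range(3):
            t = timeit(lambda: session.get(url), number=1)
            print(f"request {i + 1}: {t:.3f} seconds")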

    My conclusion

    The variation across runs is expected, as I mentioned above.
    HTTPX with HTTP/2 is likely the best choice if you want fast overall performance, or when dealing with larger files where HTTP/2 multiplexing and connection reuse may change the results.
    Sticking with the original requests implementation is also a valid choice, especially for simplicity and ease of use.
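
    To put numbers on that variation, you can repeat the download a few times and look at the spread (a minimal sketch using timeit.repeat; the 5 repetitions are an arbitrary choice):

    import statistics
    import requests
    from timeit import repeat

    url = 'https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json'

    # Each of the 5 runs downloads the full file once; the spread shows the run-to-run variation.
    times = repeat(lambda: requests.get(url).content, number=1, repeat=5)
    print(f"min {min(times):.3f}s  mean {statistics.mean(times):.3f}s  "
          f"max {max(times):.3f}s  stdev {statistics.stdev(times):.3f}s")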
