I am 100% sure this is not an issue with my internet connection or speed. I want to download a text/JSON file with, let’s say, 10K+ lines (24 MB) from GitHub using Python, but the download takes too long, whether with urllib.request or requests. I can’t find any solutions for this online: all the references I found cover either small text files or huge files that get downloaded in chunks.
import requests

url = 'https://raw.githubusercontent.com/json-iterator/test-data/master/large-file.json'
r = requests.get(url)
# Write the response body to disk; the with-block closes the file afterwards.
with open('large-file.json', 'wb') as f:
    f.write(r.content)
It takes way too long for me (over a minute). If I download it manually from a browser, it takes less than 10 seconds.
2 Answers
I used the same code and the execution time was less than 2 seconds.
This was the output:
The amount of time elapsed between sending the request and the arrival of the response was 0.268242 seconds, while the code snippet took 1.747152 seconds to execute.
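For anyone who wants to reproduce these two numbers, here is a minimal sketch of how they could be measured. It is not necessarily the exact script used above. The function name `timed_download` and the `client` parameter are made up for illustration: `client` can be the `requests` module itself or a `requests.Session()`. The requests documentation describes `Response.elapsed` as the time between sending the request and finishing parsing the response headers, so it does not include downloading the body, which is why it is much smaller than the total.

```python
import time


def timed_download(client, url, dest, timeout=30):
    """Download url to dest and return (elapsed_seconds, total_seconds).

    client is any object with a requests-style get() method, for example
    the requests module or a requests.Session() (an assumption of this sketch).
    """
    start = time.perf_counter()
    response = client.get(url, timeout=timeout)
    response.raise_for_status()
    # With the default stream=False, the body has already been read here.
    with open(dest, "wb") as f:
        f.write(response.content)
    total = time.perf_counter() - start
    # elapsed is a datetime.timedelta set by the HTTP library.
    return response.elapsed.total_seconds(), total
```

Usage would look like `elapsed, total = timed_download(requests, url, 'large-file.json')` after `import requests`. If `total` is over a minute while `elapsed` stays under a second, the time is going into transferring the body rather than into connecting.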
After multiple tests using different methods (requests, requests.Session, and httpx with HTTP/2), it is clear that the measured times differ from run to run. This variation is possibly due to several factors, such as network latency, server load, and TCP connection setup.
Tests
Here are the code and results from the tests conducted:
Original code:
Test 1: 0.300804 seconds elapsed (request), 1.0805199 seconds (total execution time)
Test 2: 0.120291 seconds elapsed (request), 2.7386789 seconds (total execution time)
Test 3: 0.125401 seconds elapsed (request), 0.7138219 seconds (total execution time)
HTTPX with HTTP/2:
Test 1: 0.941488 seconds elapsed (request), 0.9907317 seconds (total execution time)
Test 2: 2.630461 seconds elapsed (request), 2.6783040 seconds (total execution time)
Test 3: 0.604343 seconds elapsed (request), 0.6498959 seconds (total execution time)
Requests with Session:
Test 1: 0.248372 seconds elapsed (request), 1.0908065 seconds (total execution time)
Test 2: 0.131544 seconds elapsed (request), 2.547926 seconds (total execution time)
Test 3: 0.135681 seconds elapsed (request), 0.7969455 seconds (total execution time)
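To make the comparison easier to read, the short script below summarizes the total execution times listed above. Only the totals are compared: as far as I understand the documentation, httpx measures `elapsed` up to the point where the response is closed, while requests stops at the end of the headers, so the "elapsed (request)" columns are not directly comparable between the two libraries. The names `TOTALS`, `summarize`, and `SUMMARY` are just illustrative.

```python
from statistics import mean

# Total execution times in seconds, copied from the test results above.
TOTALS = {
    "requests": [1.0805199, 2.7386789, 0.7138219],
    "httpx (HTTP/2)": [0.9907317, 2.6783040, 0.6498959],
    "requests.Session": [1.0908065, 2.547926, 0.7969455],
}


def summarize(samples):
    """Return the mean, minimum, and maximum of a list of timings."""
    return {"mean": mean(samples), "min": min(samples), "max": max(samples)}


SUMMARY = {name: summarize(values) for name, values in TOTALS.items()}

for name, stats in SUMMARY.items():
    print(
        f"{name:18s} mean={stats['mean']:.3f}s  "
        f"min={stats['min']:.3f}s  max={stats['max']:.3f}s"
    )
```

The three means land within about 0.07 seconds of each other, while the runs of any single method are spread over roughly 1.8 to 2 seconds. With only three runs per method, the difference between methods is well inside the run-to-run noise.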
My conclusion
The variation across different runs is expected, as I mentioned before.
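HTTPX with HTTP/2 had the lowest total time in two of the three runs, so it is likely the best choice if you want fast overall performance, although the margin is small compared with the run-to-run variation. It may also help with larger files or many requests, where HTTP/2 multiplexing and connection reuse may change the results.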
Sticking with the original requests implementation is a valid choice, especially for simplicity and ease of use.
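If you want to try the HTTPX with HTTP/2 option, here is a minimal sketch under a few assumptions: httpx is installed with its HTTP/2 extra (`pip install "httpx[http2]"`), and the helper names `make_http2_client` and `stream_to_file` are my own, not part of httpx. It streams the body to disk in chunks instead of holding the whole 24 MB in memory. This is not guaranteed to be faster, it is just one way to set it up.

```python
def make_http2_client():
    """Create an httpx client with HTTP/2 enabled (hypothetical helper)."""
    # Imported lazily so the rest of the sketch does not require httpx.
    import httpx  # needs: pip install "httpx[http2]"

    return httpx.Client(http2=True, timeout=30.0)


def stream_to_file(client, url, dest, chunk_size=64 * 1024):
    """Stream url to dest in chunks and return the number of bytes written.

    client is expected to offer httpx's Client.stream() interface.
    """
    written = 0
    with client.stream("GET", url) as response:
        response.raise_for_status()
        with open(dest, "wb") as f:
            # iter_bytes yields the decoded body piece by piece.
            for chunk in response.iter_bytes(chunk_size=chunk_size):
                f.write(chunk)
                written += len(chunk)
    return written
```

Usage would be along the lines of `with make_http2_client() as client: stream_to_file(client, url, 'large-file.json')`. Reusing the same client for several downloads keeps the connection open, which is where connection reuse can start to matter.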