The following code gives different results each time: sometimes correct, human-readable text, but other times output that looks like some other, non-ASCII encoding.
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
}
page = requests.get('https://www.powerball.com/', headers=HEADERS)
print(page.text)
print(page.encoding)
The encoding of the page is always utf-8. What could be the reason for the difference?
I tried copying the HTTP headers from the request sent by the browser, but I get the same result.
2 Answers
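.content is a bytes object, while .text always returns a string (the encoding is taken from the response headers or, failing that, guessed automatically from .content). To get the response in another encoding, set page.encoding to the encoding you want before reading page.text, or decode page.content yourself.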
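As a minimal sketch of decoding the body with an explicitly chosen encoding: the helper name text_with_encoding is hypothetical, and the sample bytes stand in for page.content so that the snippet runs without a network call.

```python
def text_with_encoding(content: bytes, encoding: str) -> str:
    """Decode raw response bytes with an explicitly chosen encoding.

    Hypothetical helper for illustration; `content` plays the role of
    page.content from the question.
    """
    # errors="replace" substitutes undecodable bytes instead of raising.
    return content.decode(encoding, errors="replace")


# Stand-in for page.content: UTF-8 bytes containing a non-ASCII character.
raw = "café".encode("utf-8")

print(text_with_encoding(raw, "utf-8"))    # café
print(text_with_encoding(raw, "latin-1"))  # cafÃ© (wrong encoding, mojibake)
```

With Requests itself, assigning page.encoding = 'utf-8' before reading page.text should give the same result as text_with_encoding(page.content, 'utf-8').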
For your question about the inconsistent encoding, I suggest checking page.headers and page.encoding to verify whether the encoding can be fetched from the headers or can only be guessed from the content. It's also worth noting that requests cannot read the encoding from the HTML data, so if an encoding is specified in the HTML, you should use BeautifulSoup4 or a similar parser to read it, rather than relying on page.encoding.
Reference: Response Content – Requests docs
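The two places an encoding can be declared can be compared with a short sketch like the one below. It uses only the standard library; the function names are hypothetical, and the regular expression is a deliberate simplification standing in for a real HTML parser such as BeautifulSoup4.

```python
import re
from email.message import Message
from typing import Optional


def charset_from_header(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Simplified pattern: matches <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def charset_from_html(content: bytes) -> Optional[str]:
    """Return the charset declared in an HTML <meta> tag, or None."""
    # The declaration normally sits near the top of the document.
    match = _META_CHARSET.search(content[:4096])
    return match.group(1).decode("ascii").lower() if match else None


# Stand-ins for page.headers['Content-Type'] and page.content.
print(charset_from_header("text/html; charset=UTF-8"))
print(charset_from_html(b'<html><head><meta charset="utf-8"></head></html>'))
```

If charset_from_header returns None for the real response, the value Requests uses for page.text did not come from an explicit charset in the header, which is the case the answer above suggests checking for.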
The page.headers dictionary contains 'Content-Encoding': 'br'. This indicates Brotli compression, which requests does not support by default (as of version 2.31.0 anyway). Per the requests documentation: