Html - Python requests.get gives response in different encoding

s_c
February 8, 2024
152 views
0 votes
2 Answers

The following code gives different results each time: sometimes correct, human-readable text, and other times output that looks like it is in some non-ASCII encoding.

import requests

HEADERS = ({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7'})

page = requests.get('https://www.powerball.com/', headers=HEADERS)
print(page.text)
print(page.encoding)

The encoding of the page is always utf-8. What could be the reason for the difference?

I tried copying the HTTP headers from the request sent by the browser, but got the same result.

Answers

- YoungLord
- February 8, 2024 at 5:45 am
- 0 votes
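.content is a bytes object, while .text is always a str. requests decodes .text using the charset declared in the Content-Type header if there is one; otherwise it falls back to a default or to a guess made from .content.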

To decode the response with a different encoding, use one of the following:
```
# replace gbk with your encoding

# method 1: set encoding manually
page.encoding = "gbk"
print(page.text)

# or

# method 2: convert bytes data to str, with a specific encoding
print(page.content.decode("gbk"))
```
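As a small offline illustration of why the codec matters, the sketch below decodes the same bytes with two different codecs. The sample string is made up and stands in for page.content; no request is sent.

```python
# Hypothetical sample data standing in for page.content.
raw = "café".encode("utf-8")

as_utf8 = raw.decode("utf-8")      # the codec the bytes were written in
as_latin1 = raw.decode("latin-1")  # wrong codec: no error, just garbled text

print(as_utf8)    # café
print(as_latin1)  # cafÃ©
```

A wrong codec often raises no exception at all, so unreadable text is not by itself proof that the bytes are corrupt.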
As for the inconsistent output, I suggest checking page.headers and page.encoding to see whether the encoding is declared in the headers or can only be guessed from the content.
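One way to do that check is sketched below. describe_encoding is a hypothetical helper, not part of requests; it only reads a mapping of header names to values, so it can be tried without a network connection.

```python
from email.message import Message


def describe_encoding(headers):
    """Return (charset, content_encoding) found in a mapping of HTTP headers.

    Hypothetical helper for illustration; either value is None when absent.
    """
    # Normalise header names, since a plain dict is case-sensitive.
    lowered = {name.lower(): value for name, value in headers.items()}

    msg = Message()
    content_type = lowered.get("content-type")
    if content_type:
        msg["Content-Type"] = content_type

    charset = msg.get_content_charset()             # None if no charset parameter
    compression = lowered.get("content-encoding")   # e.g. "gzip", or None
    return charset, compression


print(describe_encoding({"Content-Type": "text/html; charset=UTF-8"}))
```

With a live response this could be called as describe_encoding(page.headers). A missing charset would mean requests had to fall back to a default or a guess, and a non-None second value would mean the body was compressed in transit.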

It’s also worth noting that requests does not read the encoding from the HTML itself, so if an encoding is specified in the HTML, you should use BeautifulSoup4 or a similar parser to read it, rather than relying on page.encoding.
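For a rough idea of what reading the encoding from the HTML involves, here is a minimal sketch that uses the standard library's html.parser as a stand-in for BeautifulSoup4. MetaCharsetParser and find_meta_charset are hypothetical names, and only the two most common meta forms are handled.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remember the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        # Form 1: <meta charset="utf-8">
        if attrs.get("charset"):
            self.charset = attrs["charset"].strip().lower()
            return
        # Form 2: <meta http-equiv="Content-Type" content="text/html; charset=...">
        http_equiv = (attrs.get("http-equiv") or "").lower()
        content = (attrs.get("content") or "").lower()
        if http_equiv == "content-type" and "charset=" in content:
            self.charset = content.split("charset=", 1)[1].split(";")[0].strip()


def find_meta_charset(raw_html: bytes):
    """Return the charset declared in the HTML, or None if none is found."""
    parser = MetaCharsetParser()
    # Latin-1 maps every byte to a character, so this decode never fails.
    parser.feed(raw_html[:4096].decode("latin-1"))
    return parser.charset


print(find_meta_charset(b'<html><head><meta charset="UTF-8"></head></html>'))
```

With a real response, the result of find_meta_charset(page.content) could be assigned to page.encoding before reading page.text, assuming the declared value names a codec that Python knows.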

Reference: Response Content – Requests docs

- MarkTolonen
- February 8, 2024 at 5:56 am
- 0 votes
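The page.headers dictionary contains 'Content-Encoding': 'br'. This indicates Brotli compression, which requests does not decode by default (as of version 2.31.0, anyway). When the body is left undecoded, page.text is built from the still-compressed bytes, which would explain the unreadable output.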

Per the requests documentation:

When either the brotli or brotlicffi package is installed, requests also decodes Brotli-encoded responses.
```
# Option 1: pip install brotli (or brotlicffi).
# requests then decodes Brotli responses automatically; an explicit
# "import brotli" in your own code is not required.
import requests

HEADERS = ({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng;q=0.8,application/signed-exchange;v=b3;q=0.7',
            # Option 2: advertise gzip only, so a well-behaved server
            # should not reply with Brotli in the first place.
            'Accept-Encoding': 'gzip'})

page = requests.get('https://www.powerball.com/', headers=HEADERS)
print(page.text)
```
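If you would rather not hard-code the header, one possible approach is to advertise Brotli only when a decoder is installed. This is a sketch under that assumption; build_accept_encoding is a hypothetical helper, not part of requests.

```python
import importlib.util


def build_accept_encoding():
    """Build an Accept-Encoding value that lists 'br' only if it can be decoded."""
    encodings = ["gzip", "deflate"]
    # The requests docs name these two packages as enabling Brotli decoding.
    has_brotli = any(
        importlib.util.find_spec(name) is not None
        for name in ("brotli", "brotlicffi")
    )
    if has_brotli:
        encodings.append("br")
    return ", ".join(encodings)


print(build_accept_encoding())
```

The returned string could be used as the 'Accept-Encoding' value in HEADERS, so the request stays consistent with what the local installation can decode.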