
The following code gives different results on each run: sometimes the output is correct, human-readable text, and other times it looks like some other, non-ASCII encoding.

import requests

HEADERS = ({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7'})

page = requests.get('https://www.powerball.com/', headers=HEADERS)
print(page.text)
print(page.encoding)

The encoding of the page is always utf-8. What could be the reason for the difference?

I tried copying the HTTP headers from the request my browser sends, but I get the same result.

2 Answers


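  1. .content is a bytes object, while .text always returns a string. The string is decoded with page.encoding, which requests derives from the HTTP headers, falling back to a guess based on .content when the headers give it nothing to go on.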

    To decode the response with a different encoding, use the following code:

    # replace gbk with your encoding
    
    # method 1: set encoding manually
    page.encoding = "gbk"
    print(page.text)
    
    # or
    
    # method 2: convert bytes data to str, with a specific encoding
    print(page.content.decode("gbk"))
    

    As for the inconsistent output, I suggest checking page.headers and page.encoding to verify whether the encoding is declared in the headers or can only be guessed from the content.
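
One way to carry out that check is sketched below. The helper name `describe_encoding` is made up for illustration; it reads only the `headers`, `encoding` and `apparent_encoding` attributes that a requests response exposes, and the stand-in object at the bottom exists only so the snippet runs without a network call.

```python
from types import SimpleNamespace


def describe_encoding(response):
    """Summarise where the text encoding of a response comes from.

    `response` only needs `headers`, `encoding` and `apparent_encoding`
    attributes, as a requests response has.
    """
    content_type = response.headers.get("Content-Type", "")
    return {
        # True when the server declared a charset in the Content-Type header
        "charset_in_header": "charset=" in content_type.lower(),
        # Compression applied to the body (e.g. "gzip" or "br"), if any
        "content_encoding": response.headers.get("Content-Encoding"),
        # Encoding derived from the headers (may be None)
        "encoding": response.encoding,
        # Encoding guessed from the raw bytes
        "apparent_encoding": response.apparent_encoding,
    }


# Stand-in for a real response (hypothetical values, no network needed)
fake = SimpleNamespace(
    headers={"Content-Type": "text/html; charset=UTF-8", "Content-Encoding": "br"},
    encoding="UTF-8",
    apparent_encoding="utf-8",
)
info = describe_encoding(fake)
print(info)
```

With a real response, `describe_encoding(page)` should show at a glance whether the charset was declared by the server and whether the body was compressed.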

    It’s also worth noting that requests does not read the encoding declared inside the HTML itself, so if the encoding is specified there (for example in a <meta> tag), use BeautifulSoup4 or a similar parser to read it rather than relying on page.encoding.
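
As a rough illustration of reading the declared encoding from the HTML, here is a sketch that uses the standard library's `html.parser` in place of BeautifulSoup4, to avoid an extra dependency. The names `MetaCharsetParser` and `charset_from_html` are hypothetical, and the approach assumes an ASCII-compatible encoding, so it would not work for UTF-16 pages.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Record the first charset declared in a <meta> tag, if any."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        # HTML5 form: <meta charset="utf-8">
        if attrs.get("charset"):
            self.charset = attrs["charset"].strip().lower()
            return
        # Older form: <meta http-equiv="Content-Type" content="text/html; charset=gbk">
        http_equiv = (attrs.get("http-equiv") or "").lower()
        content = (attrs.get("content") or "").lower()
        if http_equiv == "content-type" and "charset=" in content:
            self.charset = content.split("charset=", 1)[1].split(";")[0].strip()


def charset_from_html(raw_bytes):
    """Return the charset declared in the HTML bytes, or None if there is none."""
    parser = MetaCharsetParser()
    # In ASCII-compatible encodings the declaration itself is plain ASCII,
    # so a lossy ASCII decode is enough to find it.
    parser.feed(raw_bytes.decode("ascii", errors="replace"))
    return parser.charset


sample = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=GBK"></head></html>'
declared = charset_from_html(sample)
print(declared)
```

With a real response, `charset_from_html(page.content)` would give the declared value, which could then be assigned to `page.encoding` before reading `page.text`.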

    Reference: Response Content – Requests docs

  2. The page.headers dictionary contains 'Content-Encoding': 'br'. This indicates Brotli compression, which requests does not decode by default (as of version 2.31.0, anyway).

    Per the requests documentation:

    When either the brotli or brotlicffi package is installed, requests also decodes Brotli-encoded responses.

    # Note: pip install brotli
    import requests
    import brotli  # not called directly; the import only confirms the package is installed
    
    HEADERS = ({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
                'Accept-Language': 'en-US,en;q=0.9',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng;q=0.8,application/signed-exchange;v=b3;q=0.7',
                'Accept-Encoding': 'gzip'})  # request gzip only, which requests decodes without extra packages
    
    page = requests.get('https://www.powerball.com/', headers=HEADERS)
    print(page.text)
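
To confirm whether Brotli is the culprit in a given environment, a check along these lines may help. Both function names are hypothetical, and the snippet only tests whether one of the two packages named in the requests documentation can be imported; it does not prove that the installed requests version will use it.

```python
import importlib.util


def brotli_decoder_installed():
    """Return True if the brotli or brotlicffi package is importable."""
    return any(
        importlib.util.find_spec(name) is not None
        for name in ("brotli", "brotlicffi")
    )


def body_may_be_undecoded(headers):
    """Return True when the body is Brotli-compressed but no decoder is installed.

    `headers` is any mapping of response headers, e.g. page.headers.
    """
    is_brotli = (headers.get("Content-Encoding") or "").strip().lower() == "br"
    return is_brotli and not brotli_decoder_installed()


print(brotli_decoder_installed())
print(body_may_be_undecoded({"Content-Encoding": "gzip"}))
```

If `body_may_be_undecoded(page.headers)` returns True, the unreadable output is most likely the still-compressed body, and either installing a Brotli package or requesting gzip should resolve it.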
    