html - Strange Output from Python urllib2
I read the source code of a webpage using urllib2; however, I'm seeing strange output that I've not seen before. Here's the code (Python 2.7, Linux):
import urllib2

open_url = urllib2.urlopen("http://www.elegantthemes.com/gallery/")
site_html = open_url.read()
site_html[50:]
which gives this output:
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xe5\\ms\xdb\xb6\xb2\xfel\xcf\xe4?\xc0<s[\x9a\x8a\xa4^\xe28u,\xa5\x8e\x93\xf4\xa4\x93&\x99:9\xbdw\x9a\x8e\x07"'
Does anyone know why it's showing this output and not the correct HTML?
The HTTP response sent by the site is gzipped content, hence the strange output. urllib2 does not automatically decode gzipped content. There are two ways to solve this:
1) Decode the gzipped content before printing:
import urllib2
import io
import gzip

open_url = urllib2.urlopen("http://www.elegantthemes.com/gallery/")
site_html = open_url.read()
# Wrap the raw bytes in a file-like object so GzipFile can read from them
bi = io.BytesIO(site_html)
gf = gzip.GzipFile(fileobj=bi, mode="rb")
s = gf.read()
print s[50:]
2) Use the requests library, which decodes gzipped responses for you:
import requests

r = requests.get('http://www.elegantthemes.com/gallery/')
print r.content
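If you want to stay with the first approach but not assume every response is compressed, one option is to look at the first two bytes of the body: a gzip stream always starts with `\x1f\x8b`, which is exactly what the strange output in the question begins with. The following is only a rough sketch, not something from the original answer. The helper names `maybe_gunzip` and `gzip_bytes` and the sample page are made up for illustration, and the gzipped body is built in memory so the example runs without a network connection.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def maybe_gunzip(body):
    """Return body decompressed if it starts with the gzip magic bytes,
    otherwise return it unchanged. (Hypothetical helper name.)"""
    if body[:2] != GZIP_MAGIC:
        return body
    with gzip.GzipFile(fileobj=io.BytesIO(body), mode="rb") as gz:
        return gz.read()


def gzip_bytes(data):
    """Compress data in memory; stands in for a server's gzipped response."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(data)
    return buf.getvalue()


# Made-up page content, used instead of a real HTTP response
sample_html = b"<html><body>Hello</body></html>"
compressed = gzip_bytes(sample_html)

print(repr(compressed[:2]))                      # the gzip magic bytes
print(maybe_gunzip(compressed) == sample_html)   # compressed body is decoded
print(maybe_gunzip(sample_html) == sample_html)  # plain body passes through
```

In the code from the answer you would call `maybe_gunzip(site_html)` right after `open_url.read()`. Checking the response's Content-Encoding header is the more conventional test; the magic-byte check is just a simple fallback that needs only the body.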