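Inspecting the history attribute of the response shows that most of these Baidu links are redirected (HTTP 302), while some return an HTTP 200 reply directly. For the encrypted URLs scraped from Baidu, call requests.get() with redirects disabled (allow_redirects=False), then handle the two kinds of server reply separately:
HTTP 302 redirect: the original URL is available from the 'location' field of the response headers;
HTTP 200 reply: extract the original URL from the response content with a regular expression.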
import re
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36"}
url = "http://www.baidu.com/link?url=WQGtCgFOk23GwLOgB1oGv0i2NDijQShztDtiKskUlolsvn4zHYrRKyYkVFNhzV2hxiRdi6Mi9EzRBEJiqwtvkq"

# Do not follow redirects, so that a 302 reply can be inspected directly.
tmpPage = requests.get(url, headers=headers, allow_redirects=False)

originalURL = None
if tmpPage.status_code == 200:
    # HTTP 200: the target is embedded in the page body as URL='...'.
    # Search the decoded text; a str pattern cannot be applied to bytes.
    urlMatch = re.search(r"URL='(.*?)'", tmpPage.text, re.S)
    if urlMatch:
        originalURL = urlMatch.group(1)
elif tmpPage.status_code == 302:
    # HTTP 302: the target is carried in the Location header.
    originalURL = tmpPage.headers.get('location')

# Covers a missing header, a failed match, and any other status code.
if originalURL and originalURL.startswith('http'):
    print(originalURL)
else:
    print('No URL found!!')
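
The two cases above can also be put into a small function so that they can be tried without any network access. This is only a sketch: extract_original_url is a hypothetical helper name, the sample responses are made up, example.com is a placeholder, and the meta-refresh shape of the HTTP 200 body is an assumption based on the URL='...' pattern used in the regular expression.

```python
import re

# Hypothetical helper, not part of the original post. It works on the
# individual pieces of a response, so it can be exercised offline.
_URL_IN_BODY = re.compile(r"URL='(.*?)'", re.S)


def extract_original_url(status_code, headers, body):
    """Return the URL hidden behind a Baidu link, or None if none is found."""
    candidate = None
    if status_code == 302:
        # Redirect case: the target is in the Location header.
        # A plain dict is case-sensitive, so try both common spellings.
        candidate = headers.get("Location") or headers.get("location")
    elif status_code == 200:
        # Direct reply: look for URL='...' in the page body.
        match = _URL_IN_BODY.search(body or "")
        if match:
            candidate = match.group(1)
    # Accept only absolute http(s) URLs, as in the script above.
    if candidate and candidate.startswith("http"):
        return candidate
    return None


# Made-up responses for illustration only.
sample_body = "<meta http-equiv=\"refresh\" content=\"0;URL='https://example.com/b'\">"
print(extract_original_url(302, {"Location": "https://example.com/a"}, ""))
print(extract_original_url(200, {}, sample_body))
print(extract_original_url(404, {}, ""))
```

With a real response, the call would be extract_original_url(tmpPage.status_code, tmpPage.headers, tmpPage.text). Returning None rather than printing inside the function makes it easier to loop over a whole list of scraped links and skip the ones that could not be resolved.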