Notes on Web Scraping (Web Crawlers: Logging In with a Username, Password, and CAPTCHA)
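Notes:
- Scraping is done with the requests package. The script first tries to log in automatically with the username and password; if that fails, it falls back to logging in with cookies.
- The configuration file config.ini holds the username and password. If the site asks for a CAPTCHA, log in manually once in a browser and copy the resulting cookie values into the file.
- To tell whether the login succeeded, check whether the generated HTML file contains your user information.

The config.ini file looks like this: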
[info]
email = xxxx@163.com
password = xxxx

[cookies]
q_c1 =
cap_id =
_za =
__utmt =
__utma =
__utmb =
__utmc =
__utmz =
__utmv =
z_c0 =
unlock_ticket =
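Filling in the [cookies] section by hand is error-prone, so the step can be scripted. The following is only a minimal sketch, assuming you have copied the raw Cookie request header from your browser's developer tools after a manual login. The helper names parse_cookie_header and write_cookies are hypothetical and are not part of the original script.

```python
import configparser


def parse_cookie_header(header):
    """Split a raw 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for pair in header.split(';'):
        # Split on the first '=' only, because cookie values may contain '='.
        name, sep, value = pair.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


def write_cookies(config_path, cookies):
    """Store the cookies in the [cookies] section of an existing config file."""
    # interpolation=None keeps '%' characters in cookie values literal.
    cf = configparser.ConfigParser(interpolation=None)
    cf.read(config_path)
    if not cf.has_section('cookies'):
        cf.add_section('cookies')
    for name, value in cookies.items():
        cf.set('cookies', name, value)
    with open(config_path, 'w') as fp:
        cf.write(fp)
```

A call such as write_cookies('config.ini', parse_cookie_header(raw_header)) would then update the file in place, where raw_header is the string copied from the browser. Values are stored exactly as copied, including any surrounding quotes.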
# -*- coding: utf-8 -*-
"""
Web crawler login with username/password and CAPTCHA: scraping the Zhihu site
"""
import configparser
from pprint import pprint

import requests


def create_session():
    # interpolation=None keeps '%' characters in cookie values literal
    cf = configparser.ConfigParser(interpolation=None)
    cf.read('config.ini')
    cookies = dict(cf.items('cookies'))
    pprint(cookies)
    email = cf.get('info', 'email')
    password = cf.get('info', 'password')

    session = requests.session()
    login_data = {'email': email, 'password': password}
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36',
        'Host': 'www.zhihu.com',
        'Referer': 'http://www.zhihu.com/'
    }
    # First attempt: log in with email and password
    r = session.post('http://www.zhihu.com/login/email', data=login_data, headers=header)
    result = r.json()
    if result['r'] == 1:
        print('Login failed, reason is:', end=' ')
        for m in result['data']:
            print(result['data'][m])
        print('So we use cookies to log in...')
        # At least one cookie value must be filled in
        has_cookies = any(value for key, value in cookies.items() if key != '__name__')
        if not has_cookies:
            raise ValueError('Please fill in the [cookies] section of config.ini.')
        # Fallback: log in with cookies (covers the CAPTCHA case)
        r = session.get('http://www.zhihu.com/login/email', cookies=cookies)

    # r.content is bytes, so write in binary mode
    with open('login.html', 'wb') as fp:
        fp.write(r.content)

    return session, cookies


if __name__ == '__main__':
    requests_session, requests_cookies = create_session()

    url = 'http://www.zhihu.com/topic/19552832'
    # content = requests_session.get(url).content  # not logged in
    content = requests_session.get(url, cookies=requests_cookies).content  # logged in
    with open('url.html', 'wb') as fp:
        fp.write(content)
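The notes say to judge login success by looking for user information in the generated HTML file. That check can be automated with a simple substring search. This is a rough sketch under assumptions: the functions looks_logged_in and check_login_file are hypothetical, and which marker string (for example your account email or display name) actually appears on a logged-in page is something you would have to confirm by inspecting the saved file.

```python
def looks_logged_in(html, marker):
    """Return True if the page text contains the marker (case-insensitive)."""
    return marker.lower() in html.lower()


def check_login_file(path, marker, encoding='utf-8'):
    """Read a saved page such as login.html and look for the user marker."""
    # errors='replace' avoids a crash if the page is not valid UTF-8.
    with open(path, encoding=encoding, errors='replace') as fp:
        return looks_logged_in(fp.read(), marker)
```

For instance, check_login_file('login.html', 'xxxx@163.com') would report whether the configured email shows up in the page saved by the script. A plain substring match can give false positives if the marker also appears on the logged-out page, so pick a string that only a logged-in page would show.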