WeChat Official Account Article Scraper (Scraping Sogou WeChat Official Account Articles with Python)
I'm new to Python; this script scrapes WeChat official account articles through Sogou WeChat search and stores them in MySQL. For each account name read from the database, it searches weixin.sogou.com, follows the first result to the account's profile page, parses the article list out of the inline `msgList` JavaScript variable, downloads each article page, and inserts it into MySQL.
MySQL tables:
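The original post showed the table layout only as a screenshot, which is missing here. Below is a minimal sketch reconstructed from how the script uses the tables; only the columns the code actually touches are certain (the second column of `hd_gzh` holding the account name, and `title`/`picture`/`author`/`content` in `gzh_article`), everything else is an assumption:

```python
import pymysql

# Assumed reconstruction of the two tables; the id columns and type/length
# choices are guesses, only the named columns are used by the scraper below.
DDL = [
    """CREATE TABLE IF NOT EXISTS hd_gzh (
        id   INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255) NOT NULL    -- official account name, read as row[1]
    ) DEFAULT CHARSET = utf8""",
    """CREATE TABLE IF NOT EXISTS gzh_article (
        id      INT AUTO_INCREMENT PRIMARY KEY,
        title   VARCHAR(255),
        picture VARCHAR(512),         -- cover image URL
        author  VARCHAR(255),
        content MEDIUMTEXT            -- full article HTML
    ) DEFAULT CHARSET = utf8""",
]

conn = pymysql.connect(host='your_db_host', port=3306, user='your_user',
                       passwd='your_password', db='your_database', charset='utf8')
with conn.cursor() as cursor:
    for stmt in DDL:
        cursor.execute(stmt)
conn.commit()
conn.close()
```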
Code:
```python
import json
import re
import socket
import time

import pymysql
import requests
from bs4 import BeautifulSoup

socket.setdefaulttimeout(60)

# Connect to MySQL (fill in your own connection details)
conn = pymysql.connect(host='your_db_host', port=3306, user='your_user',
                       passwd='your_password', db='your_database', charset='utf8')
cursor = conn.cursor()

# Load the official accounts to crawl; row[1] is the account name
cursor.execute("select * from hd_gzh")
effect_row = cursor.fetchall()

count = 1
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) '
                         'Gecko/20100101 Firefox/65.0'}

# Abuyun IP proxy, unused for now
# proxyhost = "http-cla.abuyun.com"
# proxyport = "9030"
# # proxy tunnel credentials
# proxyuser = "h56761606429t7uc"
# proxypass = "9168eb00c4167176"
# proxymeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
#     "host": proxyhost,
#     "port": proxyport,
#     "user": proxyuser,
#     "pass": proxypass,
# }
# proxies = {"http": proxymeta, "https": proxymeta}


def checkdata(name):
    """Return True if no article with this title is stored yet."""
    # Parameterized query instead of the original string interpolation,
    # so titles containing quotes cannot break the SQL.
    count = cursor.execute("select * from gzh_article where title = %s", (name,))
    conn.commit()
    return count == 0


def insertdata(title, picture, author, content):
    """Insert one article row."""
    sql = ("insert into gzh_article (title, picture, author, content) "
           "values (%s, %s, %s, %s)")
    cursor.execute(sql, (title, picture, author, content))
    conn.commit()
    print("inserted one row")


for row in effect_row:
    # Search Sogou WeChat for the account name
    newsurl = ('https://weixin.sogou.com/weixin?type=1&s_from=input&query='
               + row[1] + '&ie=utf8&_sug_=n&_sug_type_=')
    res = requests.get(newsurl, headers=headers)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # Follow the first search result to the account's profile page;
    # the real profile URL is assembled piecewise by an inline script
    url = 'https://weixin.sogou.com' + soup.select('.tit a')[0]['href']
    res2 = requests.get(url, headers=headers)
    res2.encoding = 'utf-8'
    soup2 = BeautifulSoup(res2.text, 'html.parser')
    pattern = re.compile(r"url \+= '(.*?)';", re.MULTILINE | re.DOTALL)
    script = soup2.find("script")
    url2 = pattern.search(script.text).group(1)
    res3 = requests.get(url2, headers=headers)
    res3.encoding = 'utf-8'
    soup3 = BeautifulSoup(res3.text, 'html.parser')
    # The article list is embedded as a JavaScript variable: var msgList = {...};
    pattern2 = re.compile(r"var msgList = (.*?);$", re.MULTILINE | re.DOTALL)
    script2 = soup3.find("script", text=pattern2)
    s2 = json.loads(pattern2.search(script2.text).group(1))
    # wait 10s
    time.sleep(10)
    for news in s2["list"]:
        # Main article of each push; content_url is relative and HTML-escaped
        articleurl = "https://mp.weixin.qq.com" + news["app_msg_ext_info"]["content_url"]
        articleurl = articleurl.replace('&amp;', '&')
        res4 = requests.get(articleurl, headers=headers)
        res4.encoding = 'utf-8'
        soup4 = BeautifulSoup(res4.text, 'html.parser')
        if checkdata(news["app_msg_ext_info"]["title"]):
            insertdata(news["app_msg_ext_info"]["title"],
                       news["app_msg_ext_info"]["cover"],
                       news["app_msg_ext_info"]["author"],
                       str(soup4))
            count += 1
        # wait 10s
        time.sleep(10)
        # Secondary articles bundled in the same push
        for news2 in news["app_msg_ext_info"]["multi_app_msg_item_list"]:
            articleurl2 = "https://mp.weixin.qq.com" + news2["content_url"]
            articleurl2 = articleurl2.replace('&amp;', '&')
            res5 = requests.get(articleurl2, headers=headers)
            res5.encoding = 'utf-8'
            soup5 = BeautifulSoup(res5.text, 'html.parser')
            if checkdata(news2["title"]):
                insertdata(news2["title"], news2["cover"], news2["author"], str(soup5))
                count += 1
            # wait 10s
            time.sleep(10)

cursor.close()
conn.close()
print("done")
```
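For reference, this is the shape of the `msgList` JSON the loop above walks, showing only the keys the code actually reads. The field names come straight from the code; the sample values are placeholders, and the many other fields WeChat returns are omitted:

```python
# Assumed shape of the parsed msgList data, inferred from the keys the
# scraper accesses; values here are illustrative placeholders only.
s2 = {
    "list": [
        {
            "app_msg_ext_info": {
                "title": "Main article title",
                "content_url": "/s?__biz=...&amp;mid=...",  # relative, HTML-escaped
                "cover": "https://mmbiz.qpic.cn/...",
                "author": "Author name",
                # secondary articles pushed in the same batch
                "multi_app_msg_item_list": [
                    {
                        "title": "Secondary article title",
                        "content_url": "/s?__biz=...&amp;mid=...",
                        "cover": "https://mmbiz.qpic.cn/...",
                        "author": "Author name",
                    }
                ],
            }
        }
    ]
}
```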
That's all for this article. I hope it helps with your learning, and please keep supporting 开心学习网.
Original article: https://blog.csdn.net/a2398936046/article/details/88814078