WeChat Official Account Article Crawler (Scraping Sogou WeChat Official Account Articles with Python)
Date: 2021-10-26 11:09:27
As a Python beginner, I wrote a script that scrapes WeChat official account articles from Sogou WeChat search and stores them in MySQL.
MySQL tables:
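The original post showed the table layout as an image, which did not survive here. Below is a hedged sketch of the two tables the script assumes: `hd_gzh` holds the account names to crawl and `gzh_article` holds the scraped articles. Column names are inferred from the code; the exact types and any extra columns in the original schema are guesses. sqlite3 is used here only so the sketch runs without a MySQL server; in production these would be MySQL `CREATE TABLE` statements.

```python
import sqlite3

# Schema inferred from the scraper code below; types are assumptions.
DDL = """
CREATE TABLE hd_gzh (           -- public accounts to crawl
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL           -- account name, used as the Sogou query (row[1])
);
CREATE TABLE gzh_article (      -- scraped articles
    id      INTEGER PRIMARY KEY,
    title   TEXT,
    picture TEXT,                -- cover image URL
    author  TEXT,
    content TEXT                 -- full article HTML
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)  # ['gzh_article', 'hd_gzh']
```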
Code:
```python
import json
import re
import socket
import time

import pymysql
import requests
from bs4 import BeautifulSoup

# Create the database connection (fill in your own credentials)
conn = pymysql.connect(host='your-db-host', port=3306, user='your-user',
                       passwd='your-password', db='your-db', charset='utf8')
# Create a cursor and load the list of public accounts to crawl
cursor = conn.cursor()
cursor.execute("select * from hd_gzh")
effect_row = cursor.fetchall()

socket.setdefaulttimeout(60)
count = 1
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) '
                         'Gecko/20100101 Firefox/65.0'}

# Abuyun proxy (unused for now)
# proxyHost = "http-cla.abuyun.com"
# proxyPort = "9030"
# # Proxy tunnel credentials
# proxyUser = "h56761606429t7uc"
# proxyPass = "9168eb00c4167176"
# proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
#     "host": proxyHost,
#     "port": proxyPort,
#     "user": proxyUser,
#     "pass": proxyPass,
# }
# proxies = {
#     "http": proxyMeta,
#     "https": proxyMeta,
# }

# Check whether an article with this title is already stored
def checkdata(name):
    # NOTE: building SQL with % formatting is injection-prone and breaks
    # on titles containing quotes; parameterized queries are safer.
    sql = "select * from gzh_article where title = '%s'"
    data = (name,)
    count = cursor.execute(sql % data)
    conn.commit()
    if count != 0:
        return False
    else:
        return True

# Insert one article
def insertdata(title, picture, author, content):
    sql = ("insert into gzh_article (title,picture,author,content) "
           "values ('%s','%s','%s','%s')")
    data = (title, picture, author, content)
    cursor.execute(sql % data)
    conn.commit()
    print("Inserted one row")

for row in effect_row:
    # Search Sogou for the public account by name (row[1])
    newsurl = ('https://weixin.sogou.com/weixin?type=1&s_from=input&query='
               + row[1] + '&ie=utf8&_sug_=n&_sug_type_=')
    res = requests.get(newsurl, headers=headers)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # The first search result links to the account's profile page
    url = 'https://weixin.sogou.com' + soup.select('.tit a')[0]['href']
    res2 = requests.get(url, headers=headers)
    res2.encoding = 'utf-8'
    soup2 = BeautifulSoup(res2.text, 'html.parser')
    # The real profile URL is assembled piecewise in a <script> via url += '...'
    pattern = re.compile(r"url \+= '(.*?)';", re.MULTILINE | re.DOTALL)
    script = soup2.find("script")
    url2 = pattern.search(script.text).group(1)
    res3 = requests.get(url2, headers=headers)
    res3.encoding = 'utf-8'
    soup3 = BeautifulSoup(res3.text, 'html.parser')
    # The article list is embedded as JSON in `var msgList = {...};`
    pattern2 = re.compile(r"var msgList = (.*?);$", re.MULTILINE | re.DOTALL)
    script2 = soup3.find("script", text=pattern2)
    s2 = json.loads(pattern2.search(script2.text).group(1))
    # Wait 10s between accounts to avoid being blocked
    time.sleep(10)
    for news in s2["list"]:
        articleurl = ("https://mp.weixin.qq.com"
                      + news["app_msg_ext_info"]["content_url"])
        articleurl = articleurl.replace('&amp;', '&')
        res4 = requests.get(articleurl, headers=headers)
        res4.encoding = 'utf-8'
        soup4 = BeautifulSoup(res4.text, 'html.parser')
        if checkdata(news["app_msg_ext_info"]["title"]):
            insertdata(news["app_msg_ext_info"]["title"],
                       news["app_msg_ext_info"]["cover"],
                       news["app_msg_ext_info"]["author"],
                       pymysql.escape_string(str(soup4)))
        count += 1
        # Wait 10s between articles
        time.sleep(10)
        # Each push may carry additional articles in multi_app_msg_item_list
        for news2 in news["app_msg_ext_info"]["multi_app_msg_item_list"]:
            articleurl2 = "https://mp.weixin.qq.com" + news2["content_url"]
            articleurl2 = articleurl2.replace('&amp;', '&')
            res5 = requests.get(articleurl2, headers=headers)
            res5.encoding = 'utf-8'
            soup5 = BeautifulSoup(res5.text, 'html.parser')
            if checkdata(news2["title"]):
                insertdata(news2["title"], news2["cover"], news2["author"],
                           pymysql.escape_string(str(soup5)))
            count += 1
            # Wait 10s
            time.sleep(10)

cursor.close()
conn.close()
print("Done")
```
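The trickiest step in the scraper is pulling the article list out of the account's history page: WeChat embeds it as a JavaScript assignment `var msgList = {...};` inside a `<script>` tag, and the regex captures the JSON literal so `json.loads` can parse it. A minimal sketch of that extraction on a hand-made script snippet (the sample data below is hypothetical, not a real WeChat response):

```python
import json
import re

# Hypothetical stand-in for the <script> text on a WeChat history page.
script_text = ('var msgList = {"list": [{"app_msg_ext_info": '
               '{"title": "demo article", "content_url": "/s?__biz=..."}}]};')

# Same pattern as the scraper: capture everything between "var msgList = "
# and the semicolon at the end of the line.
pattern = re.compile(r"var msgList = (.*?);$", re.MULTILINE | re.DOTALL)
msglist = json.loads(pattern.search(script_text).group(1))
first = msglist["list"][0]["app_msg_ext_info"]
print(first["title"])  # demo article
```

`re.DOTALL` matters on the real page because the embedded JSON spans multiple lines; the non-greedy `(.*?)` stops at the first line-ending semicolon.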
That's all for this article. I hope it helps with your studies, and thank you for supporting 开心学习网.
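One caveat about the `checkdata`/`insertdata` helpers above: they build SQL with `%` string formatting, which crashes on titles containing quotes and is open to SQL injection. A safer sketch using parameterized queries is shown below, with sqlite3's `?` placeholders so the example runs without a server; with pymysql the placeholder is `%s`, passed as a second argument to `execute` rather than formatted into the string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gzh_article "
             "(title TEXT, picture TEXT, author TEXT, content TEXT)")
cursor = conn.cursor()

def checkdata(title):
    # The driver escapes `title` itself, so a quote in an article
    # title can neither break the statement nor inject SQL.
    cursor.execute("SELECT 1 FROM gzh_article WHERE title = ?", (title,))
    return cursor.fetchone() is None  # True means "not stored yet"

def insertdata(title, picture, author, content):
    cursor.execute(
        "INSERT INTO gzh_article (title, picture, author, content) "
        "VALUES (?, ?, ?, ?)",
        (title, picture, author, content))
    conn.commit()

title = 'an "awkward" title with \'quotes\''  # breaks the % version
if checkdata(title):
    insertdata(title, "cover.jpg", "someone", "<html>...</html>")
print(checkdata(title))  # False: already stored
```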
Original post: https://blog.csdn.net/a2398936046/article/details/88814078