Analyzing HTML files with Python (a summary of several ways to parse HTML in Python 3)
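Parsing HTML is an important data-processing step that follows crawling a page. This post records several ways to parse HTML.
First, a basic helper function, used to fetch the HTML and print the parsed result.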

    import gzip
    import io
    import urllib.request

    # The parser function is passed in as an argument so it can be swapped easily.
    # bs4_paraser and lxml_parser are defined in the sections below and must
    # already exist when this code runs.
    def get_html(url, paraser=bs4_paraser):
        headers = {
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Host': 'www.360kan.com',
            'Proxy-Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
        }
        request = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(request)
        if response.status == 200:
            # The body is assumed to be gzip-compressed, as requested via Accept-Encoding.
            data = io.BytesIO(response.read())
            gzipper = gzip.GzipFile(fileobj=data)
            data = gzipper.read()
            value = paraser(data)  # offline alternative: open('E:/h5/haPkY0osd0r5UB.html').read()
            return value
        else:
            return []

    value = get_html('http://www.360kan.com/m/haPkY0osd0r5UB.html', paraser=lxml_parser)
    for row in value:
        print(row)

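The helper wraps the raw response bytes in an in-memory file object and hands it to `gzip.GzipFile` to decompress them. The following is a minimal sketch of that one step, runnable without network access. The function name `decompress_body` and the sample HTML are hypothetical, and the compressed body is simulated with `gzip.compress` rather than fetched from a server.

```python
import gzip
import io


def decompress_body(raw_bytes):
    """Decompress a gzip-encoded body via an in-memory file object."""
    with gzip.GzipFile(fileobj=io.BytesIO(raw_bytes)) as gzipper:
        return gzipper.read()


# Simulate a gzip-compressed response body instead of hitting the network.
original = "<html><body>你好</body></html>".encode("utf-8")
compressed = gzip.compress(original)
restored = decompress_body(compressed)
print(restored.decode("utf-8"))
```

In real use it would be safer to check the response's `Content-Encoding` header before decompressing, since a server may ignore `Accept-Encoding` and send an uncompressed body.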
1. Parsing with lxml
The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.6 to 3.5. See the introduction for more information about background and goals of the lxml project. Some common questions are answered in the FAQ. [Official site](http://lxml.de/)

    import re
    from lxml import etree

    def lxml_parser(page):
        data = []
        doc = etree.HTML(page)
        all_li = doc.xpath('//li[@class="yingping-list-wrap"]')
        for row in all_li:
            # Each review item (bs4 equivalent: find_all('li', attrs={'class': 'item'}))
            all_li_item = row.xpath('.//li[@class="item"]')
            for r in all_li_item:
                value = {}
                # Title section of the review
                title = r.xpath('.//li[@class="g-clear title-wrap"][1]')
                value['title'] = title[0].xpath('./a/text()')[0]
                value['title_href'] = title[0].xpath('./a/@href')[0]
                score_text = title[0].xpath('./li/span/span/@style')[0]
                score_text = re.search(r'\d+', score_text).group()
                value['score'] = int(score_text) / 20
                # Time
                value['time'] = title[0].xpath('./li/span[@class="time"]/text()')[0]
                # Number of people who liked the review
                value['people'] = int(
                    re.search(r'\d+', title[0].xpath('./li[@class="num"]/span/text()')[0]).group())
                data.append(value)
        return data

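To try the same idea without installing lxml, here is a rough sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in. It supports only a subset of XPath (no `text()` or `@attr` steps) and requires well-formed markup, so it is not a replacement for lxml on real pages. The sample fragment, the `width:NN%` style format, and the function name `parse_snippet` are assumptions made for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment loosely shaped like a list of review items.
SNIPPET = """
<ul>
  <li class="item">
    <a href="/review/1">Great film</a>
    <span style="width:80%"></span>
  </li>
  <li class="item">
    <a href="/review/2">Not bad</a>
    <span style="width:60%"></span>
  </li>
</ul>
"""


def parse_snippet(xml_text):
    root = ET.fromstring(xml_text.strip())
    rows = []
    # Attribute predicates such as [@class='item'] are part of the supported subset.
    for item in root.findall(".//li[@class='item']"):
        link = item.find("a")
        style = item.find("span").get("style")
        # Pull the first run of digits out of the style string, as the parser above does.
        percent = int(re.search(r"\d+", style).group())
        rows.append({
            "title": link.text,
            "title_href": link.get("href"),
            "score": percent / 20,  # assumed: a 0-100 percentage mapped onto a 0-5 scale
        })
    return rows


rows = parse_snippet(SNIPPET)
for row in rows:
    print(row)
```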
2. Using BeautifulSoup. There is plenty of material about it online, so I won't go into detail here.

    import re
    from bs4 import BeautifulSoup

    def bs4_paraser(html):
        all_value = []
        value = {}
        soup = BeautifulSoup(html, 'html.parser')
        # The review section
        all_li = soup.find_all('li', attrs={'class': 'yingping-list-wrap'}, limit=1)
        for row in all_li:
            # Each review item
            all_li_item = row.find_all('li', attrs={'class': 'item'})
            for r in all_li_item:
                # Title section of the review
                title = r.find_all('li', attrs={'class': 'g-clear title-wrap'}, limit=1)
                if title is not None and len(title) > 0:
                    value['title'] = title[0].a.string
                    value['title_href'] = title[0].a['href']
                    score_text = title[0].li.span.span['style']
                    score_text = re.search(r'\d+', score_text).group()
                    value['score'] = int(score_text) / 20
                    # Time
                    value['time'] = title[0].li.find_all('span', attrs={'class': 'time'})[0].string
                    # Number of people who liked the review
                    value['people'] = int(
                        re.search(r'\d+', title[0].find_all('li', attrs={'class': 'num'})[0].span.string).group())
                # print(r)
                all_value.append(value)
                value = {}
        return all_value

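3. Using SGMLParser. Parsing is driven by start-tag and end-tag callbacks. The parsing process is easy to follow but somewhat tedious, and to be honest this particular case is not a great fit for the approach. Note that SGMLParser comes from the sgmllib module, which exists only in Python 2 and was removed in Python 3, so this variant runs only under Python 2.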

    import re
    from sgmllib import SGMLParser  # Python 2 only; sgmllib was removed in Python 3

    class CommentParaser(SGMLParser):
        def __init__(self):
            SGMLParser.__init__(self)
            # Flags recording which <li> elements are currently open
            self.__start_li_yingping = False
            self.__start_li_item = False
            self.__start_li_gclear = False
            self.__start_li_ratingwrap = False
            self.__start_li_num = False
            # a
            self.__start_a = False
            # span state: 0 = none, 1 = rating, 2 = time, 3 = score, 4 = likes
            self.__span_state = 0
            # Data
            self.__value = {}
            self.data = []

        def start_li(self, attrs):
            for k, v in attrs:
                if k == 'class' and v == 'yingping-list-wrap':
                    self.__start_li_yingping = True
                elif k == 'class' and v == 'item':
                    self.__start_li_item = True
                elif k == 'class' and v == 'g-clear title-wrap':
                    self.__start_li_gclear = True
                elif k == 'class' and v == 'rating-wrap g-clear':
                    self.__start_li_ratingwrap = True
                elif k == 'class' and v == 'num':
                    self.__start_li_num = True

        def end_li(self):
            # Close the innermost open <li> first, working outwards
            if self.__start_li_yingping:
                if self.__start_li_item:
                    if self.__start_li_gclear:
                        if self.__start_li_num or self.__start_li_ratingwrap:
                            if self.__start_li_num:
                                self.__start_li_num = False
                            if self.__start_li_ratingwrap:
                                self.__start_li_ratingwrap = False
                        else:
                            self.__start_li_gclear = False
                    else:
                        # End of one review item: store it and start a new one
                        self.data.append(self.__value)
                        self.__value = {}
                        self.__start_li_item = False
                else:
                    self.__start_li_yingping = False

        def start_a(self, attrs):
            if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear:
                self.__start_a = True
                for k, v in attrs:
                    if k == 'href':
                        self.__value['href'] = v

        def end_a(self):
            if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear and self.__start_a:
                self.__start_a = False

        def start_span(self, attrs):
            if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear:
                if self.__start_li_ratingwrap:
                    if self.__span_state != 1:
                        for k, v in attrs:
                            if k == 'class' and v == 'rating':
                                self.__span_state = 1
                            elif k == 'class' and v == 'time':
                                self.__span_state = 2
                    else:
                        # Nested span inside the rating span carries the score in its style
                        for k, v in attrs:
                            if k == 'style':
                                score_text = re.search(r'\d+', v).group()
                                self.__value['score'] = int(score_text) / 20
                                self.__span_state = 3
                elif self.__start_li_num:
                    self.__span_state = 4

        def end_span(self):
            self.__span_state = 0

        def handle_data(self, data):
            if self.__start_a:
                self.__value['title'] = data
            elif self.__span_state == 2:
                self.__value['time'] = data
            elif self.__span_state == 4:
                score_text = re.search(r'\d+', data).group()
                self.__value['people'] = int(score_text)

    def sgl_parser(html):
        parser = CommentParaser()
        parser.feed(html)
        return parser.data

4. Using HTMLParser. The principle is similar to approach 3; only the methods that get called differ, and most of the logic can be shared.

    import re
    from html.parser import HTMLParser

    class CommentHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            # Flags recording which <li> elements are currently open
            self.__start_li_yingping = False
            self.__start_li_item = False
            self.__start_li_gclear = False
            self.__start_li_ratingwrap = False
            self.__start_li_num = False
            # a
            self.__start_a = False
            # span state: 0 = none, 1 = rating, 2 = time, 3 = score, 4 = likes
            self.__span_state = 0
            # Data
            self.__value = {}
            self.data = []

        def handle_starttag(self, tag, attrs):
            if tag == 'li':
                for k, v in attrs:
                    if k == 'class' and v == 'yingping-list-wrap':
                        self.__start_li_yingping = True
                    elif k == 'class' and v == 'item':
                        self.__start_li_item = True
                    elif k == 'class' and v == 'g-clear title-wrap':
                        self.__start_li_gclear = True
                    elif k == 'class' and v == 'rating-wrap g-clear':
                        self.__start_li_ratingwrap = True
                    elif k == 'class' and v == 'num':
                        self.__start_li_num = True
            elif tag == 'a':
                if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear:
                    self.__start_a = True
                    for k, v in attrs:
                        if k == 'href':
                            self.__value['href'] = v
            elif tag == 'span':
                if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear:
                    if self.__start_li_ratingwrap:
                        if self.__span_state != 1:
                            for k, v in attrs:
                                if k == 'class' and v == 'rating':
                                    self.__span_state = 1
                                elif k == 'class' and v == 'time':
                                    self.__span_state = 2
                        else:
                            # Nested span inside the rating span carries the score in its style
                            for k, v in attrs:
                                if k == 'style':
                                    score_text = re.search(r'\d+', v).group()
                                    self.__value['score'] = int(score_text) / 20
                                    self.__span_state = 3
                    elif self.__start_li_num:
                        self.__span_state = 4

        def handle_endtag(self, tag):
            if tag == 'li':
                # Close the innermost open <li> first, working outwards
                if self.__start_li_yingping:
                    if self.__start_li_item:
                        if self.__start_li_gclear:
                            if self.__start_li_num or self.__start_li_ratingwrap:
                                if self.__start_li_num:
                                    self.__start_li_num = False
                                if self.__start_li_ratingwrap:
                                    self.__start_li_ratingwrap = False
                            else:
                                self.__start_li_gclear = False
                        else:
                            # End of one review item: store it and start a new one
                            self.data.append(self.__value)
                            self.__value = {}
                            self.__start_li_item = False
                    else:
                        self.__start_li_yingping = False
            elif tag == 'a':
                if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear and self.__start_a:
                    self.__start_a = False
            elif tag == 'span':
                self.__span_state = 0

        def handle_data(self, data):
            if self.__start_a:
                self.__value['title'] = data
            elif self.__span_state == 2:
                self.__value['time'] = data
            elif self.__span_state == 4:
                score_text = re.search(r'\d+', data).group()
                self.__value['people'] = int(score_text)

    def html_parser(html):
        parser = CommentHTMLParser()
        # HTMLParser.feed() expects str in Python 3; bytes are assumed to be UTF-8.
        if isinstance(html, bytes):
            html = html.decode('utf-8')
        parser.feed(html)
        return parser.data

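Both callback-based approaches come down to the same pattern: set a flag when a start tag of interest opens, clear it on the matching end tag, and read text in `handle_data` while the flag is set. The following is a stripped-down sketch of that pattern using Python 3's `html.parser`. The class name `LinkTextParser` and the sample markup are hypothetical and far simpler than the review page above.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the href and text of each <a> found inside <li class="item">."""

    def __init__(self):
        super().__init__()
        self.in_item = False   # True while inside <li class="item">
        self.in_link = False   # True while inside an <a> within such an item
        self.current = {}
        self.data = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "item":
            self.in_item = True
        elif tag == "a" and self.in_item:
            self.in_link = True
            self.current["href"] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "li" and self.in_item:
            # End of one item: store the collected record and start a fresh one.
            self.data.append(self.current)
            self.current = {}
            self.in_item = False

    def handle_data(self, data):
        if self.in_link:
            self.current["title"] = data


parser = LinkTextParser()
parser.feed('<ul><li class="item"><a href="/review/1">Great film</a></li>'
            '<li class="other"><a href="/x">skip</a></li></ul>')
result = parser.data
print(result)
```

Every additional level of nesting needs another flag, which is why the full parsers above grow so quickly compared with the XPath and BeautifulSoup versions.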
Approaches 3 and 4 really are not well suited to this case; I am writing them down while I have the time, for learning purposes.
Original link: https://blog.csdn.net/yilovexing/article/details/79675672