Python HTML file analysis (a summary of several ways to parse HTML in Python 3)
Parsing HTML is an important data-processing step that follows crawling a page. This post records several ways to do it.
First, a basic helper function. It fetches the HTML, hands it to a parser function, and returns the parsed result.

    import gzip
    import urllib.request

    # The parser is passed in as an argument so it is easy to swap below.
    # The default, bs4_paraser (approach 2), must be defined before this function.
    def get_html(url, paraser=bs4_paraser):
        headers = {
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Host': 'www.360kan.com',
            'Proxy-Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        }
        request = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(request)
        if response.status == 200:
            # The request asks for gzip, so the body is decompressed before parsing
            data = gzip.decompress(response.read())
            # Offline alternative: open('E:/h5/haPkY0osd0r5UB.html').read()
            value = paraser(data)
            return value
        else:
            # Nothing to parse on a non-200 response
            return []

    value = get_html('http://www.360kan.com/m/haPkY0osd0r5UB.html', paraser=lxml_parser)
    for row in value:
        print(row)
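The helper above assumes the server always answers with a gzip-compressed body. The following is a minimal sketch of a more defensive decoding step, assuming the caller passes in the raw bytes together with the value of the `Content-Encoding` response header. The name `decode_body` and its parameters are hypothetical and do not come from the original post.

```python
import gzip


def decode_body(raw, content_encoding=None, charset='utf-8'):
    """Return the body as text, decompressing only when it is gzip-encoded.

    decode_body is a hypothetical helper used for illustration.
    """
    if content_encoding and 'gzip' in content_encoding.lower():
        raw = gzip.decompress(raw)
    # Undecodable bytes are replaced rather than raising an error
    return raw.decode(charset, errors='replace')


# Usage with an in-memory payload instead of a real HTTP response
page = '<li class="item">review</li>'
compressed = gzip.compress(page.encode('utf-8'))
print(decode_body(compressed, 'gzip') == page)
print(decode_body(page.encode('utf-8')) == page)
```

Returning text rather than bytes also suits parsers that only accept strings, such as the standard library's `html.parser`.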
1. Parsing with lxml.html
The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.6 to 3.5. See the introduction for more information about background and goals of the lxml project. Some common questions are answered in the FAQ. [Official site](http://lxml.de/)

    import re
    from lxml import etree

    def lxml_parser(page):
        data = []
        doc = etree.HTML(page)
        all_li = doc.xpath('//li[@class="yingping-list-wrap"]')
        for row in all_li:
            # Get every review, i.e. each review item
            # (bs4 equivalent: find_all('li', attrs={'class': 'item'}))
            all_li_item = row.xpath('.//li[@class="item"]')
            for r in all_li_item:
                value = {}
                # Title block of the review
                title = r.xpath('.//li[@class="g-clear title-wrap"][1]')
                value['title'] = title[0].xpath('./a/text()')[0]
                value['title_href'] = title[0].xpath('./a/@href')[0]
                # The score is encoded as a percentage in the style attribute
                score_text = title[0].xpath('./li/span/span/@style')[0]
                score_text = re.search(r'\d+', score_text).group()
                value['score'] = int(score_text) / 20
                # Time
                value['time'] = title[0].xpath('./li/span[@class="time"]/text()')[0]
                # Number of people who liked the review
                value['people'] = int(
                    re.search(r'\d+', title[0].xpath('./li[@class="num"]/span/text()')[0]).group())
                data.append(value)
        return data
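The kind of XPath lookup used in `lxml_parser` can be tried out without lxml installed. The following is a minimal sketch that swaps in the standard library's `xml.etree.ElementTree`, which supports only a small XPath subset and requires well-formed markup, unlike lxml's HTML parser. The sample markup and the name `parse_items` are hypothetical, loosely modeled on the class names above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page structure is more deeply nested.
SAMPLE = """
<ul>
  <li class="item">
    <a href="/review/1">Great film</a>
    <span class="rating"><span style="width:80%"></span></span>
    <span class="time">2016-08-01</span>
  </li>
</ul>
"""


def parse_items(markup):
    """Extract title, link, score and time from each <li class="item">."""
    root = ET.fromstring(markup.strip())
    rows = []
    for item in root.findall(".//li[@class='item']"):
        link = item.find("./a")
        style = item.find("./span[@class='rating']/span").get("style")
        rows.append({
            "title": link.text,
            "title_href": link.get("href"),
            # The width percentage maps to a score out of 5
            "score": int(re.search(r"\d+", style).group()) / 20,
            "time": item.find("./span[@class='time']").text,
        })
    return rows


print(parse_items(SAMPLE))
```

For real-world pages, which are rarely well-formed XML, `etree.HTML` from lxml remains the more forgiving choice.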
2. Using BeautifulSoup. It is widely documented elsewhere, so it gets no further introduction here.

    import re
    from bs4 import BeautifulSoup

    def bs4_paraser(html):
        all_value = []
        soup = BeautifulSoup(html, 'html.parser')
        # Get the review section
        all_li = soup.find_all('li', attrs={'class': 'yingping-list-wrap'}, limit=1)
        for row in all_li:
            # Get every review, i.e. each review item
            all_li_item = row.find_all('li', attrs={'class': 'item'})
            for r in all_li_item:
                value = {}
                # Title block of the review
                title = r.find_all('li', attrs={'class': 'g-clear title-wrap'}, limit=1)
                if title is not None and len(title) > 0:
                    value['title'] = title[0].a.string
                    value['title_href'] = title[0].a['href']
                    # The score is encoded as a percentage in the style attribute
                    score_text = title[0].li.span.span['style']
                    score_text = re.search(r'\d+', score_text).group()
                    value['score'] = int(score_text) / 20
                    # Time
                    value['time'] = title[0].li.find_all('span', attrs={'class': 'time'})[0].string
                    # Number of people who liked the review
                    value['people'] = int(
                        re.search(r'\d+', title[0].find_all('li', attrs={'class': 'num'})[0].span.string).group())
                # print(r)
                all_value.append(value)
        return all_value
3. Using SGMLParser. This approach works through start-tag and end-tag callbacks. The parsing flow is fairly clear, but it is somewhat tedious, and this particular use case does not really suit it. Note that sgmllib ships only with Python 2 (it was removed in Python 3), so this example targets Python 2.

    import re
    from sgmllib import SGMLParser  # Python 2 only

    class CommentParaser(SGMLParser):
        def __init__(self):
            SGMLParser.__init__(self)
            # Flags recording which <li> elements are currently open
            self.__start_li_yingping = False
            self.__start_li_item = False
            self.__start_li_gclear = False
            self.__start_li_ratingwrap = False
            self.__start_li_num = False
            # a
            self.__start_a = False
            # span state: 0 = none, 1 = rating, 2 = time, 3 = score, 4 = like count
            self.__span_state = 0
            # Data
            self.__value = {}
            self.data = []

        def start_li(self, attrs):
            for k, v in attrs:
                if k == 'class' and v == 'yingping-list-wrap':
                    self.__start_li_yingping = True
                elif k == 'class' and v == 'item':
                    self.__start_li_item = True
                elif k == 'class' and v == 'g-clear title-wrap':
                    self.__start_li_gclear = True
                elif k == 'class' and v == 'rating-wrap g-clear':
                    self.__start_li_ratingwrap = True
                elif k == 'class' and v == 'num':
                    self.__start_li_num = True

        def end_li(self):
            # Close the innermost <li> that is still open
            if self.__start_li_yingping:
                if self.__start_li_item:
                    if self.__start_li_gclear:
                        if self.__start_li_num or self.__start_li_ratingwrap:
                            if self.__start_li_num:
                                self.__start_li_num = False
                            if self.__start_li_ratingwrap:
                                self.__start_li_ratingwrap = False
                        else:
                            self.__start_li_gclear = False
                    else:
                        # End of one review item: store it and start a new one
                        self.data.append(self.__value)
                        self.__value = {}
                        self.__start_li_item = False
                else:
                    self.__start_li_yingping = False

        def start_a(self, attrs):
            if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear:
                self.__start_a = True
                for k, v in attrs:
                    if k == 'href':
                        self.__value['href'] = v

        def end_a(self):
            if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear and self.__start_a:
                self.__start_a = False

        def start_span(self, attrs):
            if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear:
                if self.__start_li_ratingwrap:
                    if self.__span_state != 1:
                        for k, v in attrs:
                            if k == 'class' and v == 'rating':
                                self.__span_state = 1
                            elif k == 'class' and v == 'time':
                                self.__span_state = 2
                    else:
                        # Inner span of the rating: the score is in its style attribute
                        for k, v in attrs:
                            if k == 'style':
                                score_text = re.search(r'\d+', v).group()
                                self.__value['score'] = int(score_text) / 20
                                self.__span_state = 3
                elif self.__start_li_num:
                    self.__span_state = 4

        def end_span(self):
            self.__span_state = 0

        def handle_data(self, data):
            if self.__start_a:
                self.__value['title'] = data
            elif self.__span_state == 2:
                self.__value['time'] = data
            elif self.__span_state == 4:
                score_text = re.search(r'\d+', data).group()
                self.__value['people'] = int(score_text)

    def sgl_parser(html):
        parser = CommentParaser()
        parser.feed(html)
        return parser.data
4. Using HTMLParser. The principle is similar to approach 3; only the methods being overridden differ, and most of the logic can be shared.

    import re
    from html.parser import HTMLParser

    class CommentHTMLParser(HTMLParser):
        def __init__(self):
            super().__init__()
            # Flags recording which <li> elements are currently open
            self.__start_li_yingping = False
            self.__start_li_item = False
            self.__start_li_gclear = False
            self.__start_li_ratingwrap = False
            self.__start_li_num = False
            # a
            self.__start_a = False
            # span state: 0 = none, 1 = rating, 2 = time, 3 = score, 4 = like count
            self.__span_state = 0
            # Data
            self.__value = {}
            self.data = []

        def handle_starttag(self, tag, attrs):
            if tag == 'li':
                for k, v in attrs:
                    if k == 'class' and v == 'yingping-list-wrap':
                        self.__start_li_yingping = True
                    elif k == 'class' and v == 'item':
                        self.__start_li_item = True
                    elif k == 'class' and v == 'g-clear title-wrap':
                        self.__start_li_gclear = True
                    elif k == 'class' and v == 'rating-wrap g-clear':
                        self.__start_li_ratingwrap = True
                    elif k == 'class' and v == 'num':
                        self.__start_li_num = True
            elif tag == 'a':
                if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear:
                    self.__start_a = True
                    for k, v in attrs:
                        if k == 'href':
                            self.__value['href'] = v
            elif tag == 'span':
                if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear:
                    if self.__start_li_ratingwrap:
                        if self.__span_state != 1:
                            for k, v in attrs:
                                if k == 'class' and v == 'rating':
                                    self.__span_state = 1
                                elif k == 'class' and v == 'time':
                                    self.__span_state = 2
                        else:
                            # Inner span of the rating: the score is in its style attribute
                            for k, v in attrs:
                                if k == 'style':
                                    score_text = re.search(r'\d+', v).group()
                                    self.__value['score'] = int(score_text) / 20
                                    self.__span_state = 3
                    elif self.__start_li_num:
                        self.__span_state = 4

        def handle_endtag(self, tag):
            if tag == 'li':
                # Close the innermost <li> that is still open
                if self.__start_li_yingping:
                    if self.__start_li_item:
                        if self.__start_li_gclear:
                            if self.__start_li_num or self.__start_li_ratingwrap:
                                if self.__start_li_num:
                                    self.__start_li_num = False
                                if self.__start_li_ratingwrap:
                                    self.__start_li_ratingwrap = False
                            else:
                                self.__start_li_gclear = False
                        else:
                            # End of one review item: store it and start a new one
                            self.data.append(self.__value)
                            self.__value = {}
                            self.__start_li_item = False
                    else:
                        self.__start_li_yingping = False
            elif tag == 'a':
                if self.__start_li_yingping and self.__start_li_item and self.__start_li_gclear and self.__start_a:
                    self.__start_a = False
            elif tag == 'span':
                self.__span_state = 0

        def handle_data(self, data):
            if self.__start_a:
                self.__value['title'] = data
            elif self.__span_state == 2:
                self.__value['time'] = data
            elif self.__span_state == 4:
                score_text = re.search(r'\d+', data).group()
                self.__value['people'] = int(score_text)

    def html_parser(html):
        # HTMLParser.feed() expects text, so raw bytes are decoded first (UTF-8 assumed)
        if isinstance(html, bytes):
            html = html.decode('utf-8')
        parser = CommentHTMLParser()
        parser.feed(html)
        return parser.data
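The start-tag and end-tag approach used in 3 and 4 is easier to follow on a smaller example. The following is a minimal sketch, assuming a simplified page in which every review is an `<li class="item">` containing one link. The class `TitleCollector` and the sample markup are hypothetical and are not taken from the original post.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the href and text of each <a> found inside <li class="item">."""

    def __init__(self):
        super().__init__()
        self.in_item = False   # True while inside <li class="item">
        self.in_link = False   # True while inside an <a> of such an item
        self.current = {}
        self.data = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'li' and attrs.get('class') == 'item':
            self.in_item = True
        elif tag == 'a' and self.in_item:
            self.in_link = True
            self.current = {'href': attrs.get('href')}

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_link:
            self.in_link = False
            self.data.append(self.current)
        elif tag == 'li':
            self.in_item = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so it is accumulated
        if self.in_link:
            self.current['title'] = self.current.get('title', '') + data


collector = TitleCollector()
collector.feed('<ul><li class="item"><a href="/r/1">First</a></li>'
               '<li class="other"><a href="/x">Skip</a></li></ul>')
print(collector.data)
```

Each flag plays the role of one of the `__start_*` attributes above: the callbacks only see one tag at a time, so the parser has to remember where in the document it currently is.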
Approaches 3 and 4 really are not well suited to this case. I am recording them while I have the time, for learning purposes.
Original article: https://blog.csdn.net/yilovexing/article/details/79675672