
How the Python Scrapy framework works (a step-by-step guide to writing a crawler with Python's Scrapy)

Updated: 2021-10-15 00:10:42  Category: Scripts  Views: 1964


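Introduction

This article describes how I stumbled through writing a Python crawler one pitfall at a time and eventually got it working. I explain every problem I ran into so that you can quickly identify the cause if you hit the same thing. Only the core code is shown below; the more detailed code is not included for now.

Process overview

I wanted to crawl all the article content on a website, but I had never written a crawler before and did not know which language would be most convenient. I settled on Python (after all, it is named after a creature that crawls). My plan is to store everything crawled from the site in Elasticsearch first. I chose Elasticsearch because it is fast and comes with tokenizer plugins and an inverted index, so queries stay efficient when the data is needed (there is a lot to crawl). I will then visualize the data in Kibana, Elasticsearch's companion tool, and analyze the article content. The look I am aiming for is the sample dashboard shipped with Kibana 6.4 (that is, what you can build, and what I would like to build). I will post a Dockerfile later (it is not done yet).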


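Requirements

JDK (required by Elasticsearch)
Elasticsearch (stores the data)
Kibana (for working with Elasticsearch and visualizing the data)
Python (for writing the crawler)
Redis (for deduplicating URLs)

You can install these by following their respective tutorials; I only have an installation guide for Elasticsearch.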
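Before going further it can help to confirm that the three services are actually listening. The following is a minimal sketch using only the standard library. The host and port values are assumptions: localhost:9200 is the Elasticsearch address used later in this article, while 5601 and 6379 are the usual Kibana and Redis defaults and may differ on your machine.

```python
import socket

# Assumed addresses: adjust them to your own installation.
SERVICES = {
    "elasticsearch": ("localhost", 9200),
    "kibana": ("localhost", 5601),
    "redis": ("localhost", 6379),
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, (host, port) in SERVICES.items():
        state = "reachable" if port_open(host, port) else "not reachable"
        print(f"{name:<14}{host}:{port} {state}")
```

An open port only shows that something is listening there; it does not prove that the service behind it is healthy.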

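Step 1: install the required packages with Python's pip (the first pitfall is here)

1. tomd: converts HTML to Markdown

`pip3 install tomd`

2. redis: the Python client package for Redis

`pip3 install redis`

3. scrapy: installing the framework (pitfall)

(1) First I ran the same kind of command as above:

`pip3 install scrapy`

(2) It failed with a gcc-related error: error: command 'gcc' failed with exit status 1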


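(3) After a lot of searching (and trying many wrong answers along the way) I finally found the fix: install python34-devel with yum. The package name depends on your own Python version. It may be python-devel, or you may need to replace the 34 in the middle with your version; mine is 3.4.6.

`yum install python34-devel`

(4) Once the installation finishes, try running the scrapy command.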
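As far as I can tell, the gcc error typically appears when pip has to compile a C extension and either the compiler or the Python development headers are missing, which would explain why the -devel package helped. The sketch below checks both, and then checks that the packages from this step are importable. The function names are my own, and "elasticsearch" is included only because the pipeline code later in the article imports it.

```python
import os
import shutil
import sysconfig
from importlib.util import find_spec


def build_prerequisites():
    """Report whether gcc and the Python development headers can be found."""
    include_dir = sysconfig.get_paths()["include"]
    return {
        "gcc": shutil.which("gcc") is not None,
        # Python.h is the header that C extensions compile against.
        "Python.h": os.path.isfile(os.path.join(include_dir, "Python.h")),
    }


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    for item, found in build_prerequisites().items():
        print(f"{item}: {'found' if found else 'missing'}")
    missing = missing_modules(["scrapy", "redis", "tomd", "elasticsearch"])
    print("missing packages:", ", ".join(missing) if missing else "none")
```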


Step 2: create your project with Scrapy

Run scrapy startproject scrapydemo to create a crawler project:

liaochengdemacbook-pro:scrapy liaocheng$ scrapy startproject scrapydemo
New Scrapy project 'scrapydemo', using template directory '/usr/local/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /users/liaocheng/script/scrapy/scrapydemo

You can start your first spider with:
    cd scrapydemo
    scrapy genspider example example.com
liaochengdemacbook-pro:scrapy liaocheng$

Use genspider to generate a basic spider with the command scrapy genspider demo juejin.im. The last argument is the site you want to crawl; we start with our own home site.

liaochengdemacbook-pro:scrapy liaocheng$ scrapy genspider demo juejin.im
Created spider 'demo' using template 'basic'
liaochengdemacbook-pro:scrapy liaocheng$

Take a look at the generated directory structure.
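If you want to print the structure yourself, here is a small sketch that walks the project folder with pathlib. It assumes you run it from the directory that contains scrapydemo. In the Scrapy versions I am aware of, you would expect to see scrapy.cfg at the top level and an inner scrapydemo package containing items.py, middlewares.py, pipelines.py, settings.py and a spiders folder holding demo.py.

```python
from pathlib import Path


def tree_lines(root, prefix=""):
    """Return the directory tree under root as indented lines, directories first."""
    entries = sorted(Path(root).iterdir(), key=lambda p: (p.is_file(), p.name))
    lines = []
    for entry in entries:
        if entry.name == "__pycache__":
            continue  # skip bytecode caches, they are not part of the template
        lines.append(f"{prefix}{entry.name}{'/' if entry.is_dir() else ''}")
        if entry.is_dir():
            lines.extend(tree_lines(entry, prefix + "    "))
    return lines


if __name__ == "__main__":
    project = Path("scrapydemo")
    if project.is_dir():
        print("\n".join(tree_lines(project)))
```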


Step 3: open the project and start coding

Look at the contents of the generated demo.py:

# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'  # name of the spider
    allowed_domains = ['juejin.im']  # domain filter: only URLs under this domain are crawled
    start_urls = ['https://juejin.im/post/5c790b4b51882545194f84f0']  # initial URL

    def parse(self, response):  # default callback; a new spider needs to implement it
        pass

A second option is to move start_urls into a start_requests method:

# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'  # name of the spider
    allowed_domains = ['juejin.im']  # domain filter: only URLs under this domain are crawled

    def start_requests(self):
        start_urls = ['http://juejin.im/']  # initial URL
        for url in start_urls:
            # hand each response to parse
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):  # default callback; a new spider needs to implement it
        pass

Write the articleitem.py file (an item file is similar to an entity class in Java):

import scrapy


class ArticleItem(scrapy.Item):  # must subclass scrapy.Item
    # article id
    id = scrapy.Field()
    # article title
    title = scrapy.Field()
    # article content
    content = scrapy.Field()
    # author
    author = scrapy.Field()
    # publication time
    createtime = scrapy.Field()
    # number of reads
    readnum = scrapy.Field()
    # number of likes
    praise = scrapy.Field()
    # avatar
    photo = scrapy.Field()
    # number of comments
    commentnum = scrapy.Field()
    # article link
    link = scrapy.Field()
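One practical consequence of declaring fields is that a scrapy.Item only accepts keys that were declared, so a misspelled key in the spider fails with a KeyError instead of silently creating a new field. The sketch below imitates that behavior with a plain dict subclass so it runs without Scrapy. DeclaredFieldsItem and ArticleSketch are hypothetical names for illustration, not Scrapy's implementation.

```python
class DeclaredFieldsItem(dict):
    """Hypothetical stand-in for scrapy.Item: only declared keys may be set."""

    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class ArticleSketch(DeclaredFieldsItem):
    # Same field names as the item class in this article.
    fields = ("id", "title", "content", "author", "createtime",
              "readnum", "praise", "photo", "commentnum", "link")


article = ArticleSketch()
article["praise"] = "12"        # declared, so it is accepted
try:
    article["badge"] = "12"     # not declared, so it is rejected
except KeyError as exc:
    print("rejected:", exc)
```

If a pipeline later reads a field that the spider never set, the lookup fails in the same way, so it is worth keeping the field names in the item, the spider and the pipeline in sync.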

Write the code for the parse method:

# Methods of DemoSpider in demo.py. The file also needs "import tomd" and an import of ArticleItem.
# self.headers, self.redis and self.redisconnection are set up in code that is not shown here.

def parse(self, response):
    # collect every URL on the page
    nextpage = response.css("a::attr(href)").extract()
    # iterate over all the URLs on the page; time complexity is O(n)
    for i in nextpage:
        if i is not None:
            # build an absolute URL
            url = response.urljoin(i)
            # only follow Juejin links
            if "juejin.im" in str(url):
                # store it in Redis; if the insert succeeds, the link has not been crawled yet
                if self.insertredis(url):
                    # dont_filter controls the duplicate filter: True means do not filter, False means filter.
                    # One page is enough here; crawling the whole site would be unfriendly to Juejin, and this is only a test.
                    yield scrapy.Request(url=url, callback=self.parse, headers=self.headers, dont_filter=False)

    # we only analyze articles and ignore everything else
    if "/post/" in response.url and "#comment" not in response.url:
        # create the ArticleItem defined earlier
        article = ArticleItem()
        # use the article id as the id
        article['id'] = str(response.url).split("/")[-1]
        # title
        article['title'] = response.css("#juejin > li.view-container > main > li > li.main-area.article-area.shadow > article > h1::text").extract_first()
        # content
        parameter = response.css("#juejin > li.view-container > main > li > li.main-area.article-area.shadow > article > li.article-content").extract_first()
        article['content'] = self.parsetomarkdown(parameter)
        # author
        article['author'] = response.css("#juejin > li.view-container > main > li > li.main-area.article-area.shadow > article > li:nth-child(6) > meta:nth-child(1)::attr(content)").extract_first()
        # creation time, converted from "YYYY年MM月DD日" to "YYYY-MM-DD"
        createtime = response.css("#juejin > li.view-container > main > li > li.main-area.article-area.shadow > article > li.author-info-block > li > li > time::text").extract_first()
        createtime = str(createtime).replace("年", "-").replace("月", "-").replace("日", "")
        article['createtime'] = createtime
        # number of reads
        article['readnum'] = int(str(response.css("#juejin > li.view-container > main > li > li.main-area.article-area.shadow > article > li.author-info-block > li > li > span::text").extract_first()).split(" ")[1])
        # number of likes (the item declares this field as 'praise', which is also what the pipeline reads)
        article['praise'] = response.css("#juejin > li.view-container > main > li > li.article-suspended-panel.article-suspended-panel > li.like-btn.panel-btn.like-adjust.with-badge::attr(badge)").extract_first()
        # number of comments
        article['commentnum'] = response.css("#juejin > li.view-container > main > li > li.article-suspended-panel.article-suspended-panel > li.comment-btn.panel-btn.comment-adjust.with-badge::attr(badge)").extract_first()
        # article link
        article['link'] = response.url
        # This step is important (pitfall): earlier, because I had not executed yield article, the pipeline never received any data
        yield article

# convert the content to Markdown
def parsetomarkdown(self, param):
    return tomd.Tomd(str(param)).markdown

# Store the URL in Redis: if it can be added, the link was not there yet; if it cannot, the link already exists
def insertredis(self, url):
    if self.redis is not None:
        return self.redis.sadd("articleurllist", url) == 1
    else:
        self.redis = self.redisconnection.getclient()
        # return the result of the retry so the caller sees True or False
        return self.insertredis(url)
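The link handling above can be tried out without Scrapy or Redis. The sketch below uses urllib.parse.urljoin, which does essentially the same job as response.urljoin, and an in-memory set that mimics the reply of Redis SADD (1 when the member is new, 0 when it already exists). SeenUrls, new_links, article_id and normalize_date are hypothetical helper names introduced for illustration, and the sample date is made up.

```python
from urllib.parse import urljoin


class SeenUrls:
    """In-memory stand-in for the Redis set; add() mimics the 1/0 reply of SADD."""

    def __init__(self):
        self._seen = set()

    def add(self, url):
        if url in self._seen:
            return 0
        self._seen.add(url)
        return 1


def new_links(base_url, hrefs, seen, domain="juejin.im"):
    """Resolve hrefs against base_url and keep unseen links that mention domain."""
    result = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if domain in url and seen.add(url) == 1:
            result.append(url)
    return result


def article_id(url):
    """Use the last path segment of the article URL as its id."""
    return url.split("/")[-1]


def normalize_date(text):
    """Turn a date such as 2019年03月01日 into 2019-03-01."""
    return text.replace("年", "-").replace("月", "-").replace("日", "")


seen = SeenUrls()
hrefs = ["/post/5c790b4b51882545194f84f0",
         "/post/5c790b4b51882545194f84f0",   # duplicate, skipped
         "https://example.com/x"]            # other site, skipped
links = new_links("https://juejin.im/", hrefs, seen)
print(links)
print(article_id(links[0]), normalize_date("2019年03月01日"))
```

Note that the substring test on the domain is as loose as the one in the spider: any URL that merely contains the text juejin.im passes it.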

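Write the pipeline class. A pipeline receives every item that the spider returns with the yield keyword and processes it, but it only takes effect once the pipeline is configured in settings.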
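A minimal sketch of that settings entry follows. The dotted path is an assumption: it presumes the pipeline class is named ArticlePipelines and lives in scrapydemo/pipelines.py, so adjust it to wherever your class actually is.

```python
# settings.py (fragment)
# Assumed module path and class name; change them to match your project.
ITEM_PIPELINES = {
    "scrapydemo.pipelines.ArticlePipelines": 300,
}
```

The number sets the order when several pipelines are enabled: values conventionally range from 0 to 1000, and lower values run first.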

from elasticsearch import Elasticsearch


class ArticlePipelines(object):
    # initialization
    def __init__(self):
        # Elasticsearch index
        self.index = "article"
        # Elasticsearch type
        self.type = "type"
        # Elasticsearch host and port
        self.es = Elasticsearch(hosts="localhost:9200")

    # Required method: handles the items returned by yield
    def process_item(self, item, spider):
        # only handle items that come from the demo spider
        if spider.name != "demo":
            return item
        result = self.checkdocumentexists(item)
        if not result:
            self.createdocument(item)
        else:
            self.updatedocument(item)
        # pass the item on to any later pipeline
        return item

    # add a document
    def createdocument(self, item):
        body = {
            "title": item['title'],
            "content": item['content'],
            "author": item['author'],
            "createtime": item['createtime'],
            "readnum": item['readnum'],
            "praise": item['praise'],
            "link": item['link'],
            "commentnum": item['commentnum']
        }
        try:
            self.es.create(index=self.index, doc_type=self.type, id=item["id"], body=body)
        except Exception:
            # errors while creating the document are ignored
            pass

    # update a document
    def updatedocument(self, item):
        parm = {
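The create-or-update decision made in process_item can be illustrated without an Elasticsearch server. This is only a sketch under stated assumptions: InMemoryStore and upsert are hypothetical names, and a plain dict keyed by document id stands in for the index. It is not the code of the methods above.

```python
class InMemoryStore:
    """Hypothetical stand-in for the Elasticsearch index, keyed by document id."""

    def __init__(self):
        self.docs = {}

    def exists(self, doc_id):
        return doc_id in self.docs

    def create(self, doc_id, body):
        self.docs[doc_id] = dict(body)

    def update(self, doc_id, body):
        self.docs[doc_id].update(body)


def upsert(store, item):
    """Create the document if its id is new, otherwise update it; return the action."""
    body = {key: value for key, value in item.items() if key != "id"}
    if store.exists(item["id"]):
        store.update(item["id"], body)
        return "updated"
    store.create(item["id"], body)
    return "created"


store = InMemoryStore()
print(upsert(store, {"id": "a1", "title": "t", "readnum": 1}))
print(upsert(store, {"id": "a1", "readnum": 2}))
```

Using the article id as the document id is what makes a second crawl of the same article update the existing document instead of adding a duplicate.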

  
 