尚硅谷 Crawler Note 15
1. Multiple pipelines
Enabling multiple pipelines (2 steps):
(1) Define a pipeline class
(2) Enable the pipeline in settings
In pipelines.py:
import urllib.request

# Multiple pipelines:
# (1) define the pipeline class
# (2) enable it in settings, e.g.
#     "demo_nddw.pipelines.dangdangDownloadPipeline": 301
class dangdangDownloadPipeline:
    def process_item(self, item, spider):
        url = 'http:' + item.get('src')
        # the ./books/ folder must already exist
        filename = './books/' + item.get('name') + '.jpg'
        # download the cover image to the local books folder
        urllib.request.urlretrieve(url=url, filename=filename)
        return item
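For step (2), each pipeline class has to be registered in settings.py with a priority number (lower runs first). A minimal sketch, assuming the project is demo_nddw; the default pipeline class name DemoNddwPipeline is an assumption, and the 301 entry comes from the comment above:
# settings.py
ITEM_PIPELINES = {
    # assumed default pipeline, runs first (lower number = higher priority)
    "demo_nddw.pipelines.DemoNddwPipeline": 300,
    # the image-download pipeline defined above runs second
    "demo_nddw.pipelines.dangdangDownloadPipeline": 301,
}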
2. Multi-page download
The scraping logic is the same for every page, so you only need to issue the request for the next page and call parse again.
# For multi-page downloads, allowed_domains must be widened; usually only the domain is listed
allowed_domains = ["category.dangdang.com"]
In ddw.py:
# Multi-page download
# The scraping logic is the same for every page, so just request the next page and call parse again
# https://category.dangdang.com/cp01.27.01.06.00.00.html
# https://category.dangdang.com/pg2-cp01.27.01.06.00.00.html
# https://category.dangdang.com/pg3-cp01.27.01.06.00.00.html
if self.page < 100:
    self.page = self.page + 1
    url = self.basic_url + str(self.page) + '-cp01.27.01.06.00.00.html'
    # How to call parse again:
    # scrapy.Request is scrapy's GET request
    # url is the request address, callback is the function to run (no parentheses)
    yield scrapy.Request(url=url, callback=self.parse)
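For context, a minimal sketch of the ddw.py spider showing where page and basic_url live; the class name, start URL, and everything in parse besides the snippet above are assumptions:
import scrapy


class DdwSpider(scrapy.Spider):
    name = "ddw"
    allowed_domains = ["category.dangdang.com"]
    start_urls = ["https://category.dangdang.com/cp01.27.01.06.00.00.html"]

    # pieces used to build the next page's URL
    basic_url = "https://category.dangdang.com/pg"
    page = 1

    def parse(self, response):
        # ... extract name/src for each book and yield the items here ...

        # multi-page download: request the next page with the same callback
        if self.page < 100:
            self.page = self.page + 1
            url = self.basic_url + str(self.page) + '-cp01.27.01.06.00.00.html'
            yield scrapy.Request(url=url, callback=self.parse)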
3. 电影天堂 (Movie Heaven)
Goal:
the name from the first page
the image from the second page
Two pages are involved: use meta to pass data between them
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class DemoDyttPipeline:
    # runs once when the spider starts
    def open_spider(self, spider):
        self.fp = open('dytt.json', 'w', encoding='utf-8')

    # runs for every item
    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    # runs once when the spider closes
    def close_spider(self, spider):
        self.fp.close()
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class DemoDyttItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # movie name
    name = scrapy.Field()
    # poster image
    src = scrapy.Field()
dytt.py
import scrapy
# import the item class from the project's items module
from demo_dytt.items import DemoDyttItem


class DyttSpider(scrapy.Spider):
    name = "dytt"
    # narrow allowed_domains to the domain only
    allowed_domains = ["www.dydytt.net"]
    start_urls = ["https://www.dydytt.net/html/gndy/dyzz/20250306/65993.html"]

    def parse(self, response):
        # we want the name from the first page and the image from the second page
        a_list = response.xpath('//div[@class="co_content8"]//tr[2]//a[2]')
        for a in a_list:
            # get the name and the link to follow from the first page
            name = a.xpath('./text()').extract_first()
            href = a.xpath('./@href').extract_first()
            # address of the second page
            url = 'https://www.dydytt.net' + href
            # request the second page
            # 1) meta dict: pass the name along with the request
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name})

    def parse_second(self, response):
        # if no data comes back, check whether the xpath is wrong
        src = response.xpath('//div[@id="Zoom"]/span/img/@src').extract_first()
        print(src)
        # 2) read the name back out of the meta dict
        name = response.meta['name']
        dytt = DemoDyttItem(src=src, name=name)
        # hand dytt to the pipeline; the pipeline must be enabled in settings
        # (uncommenting ITEM_PIPELINES is what enables it):
        # ITEM_PIPELINES = {
        #     "demo_dytt.pipelines.DemoDyttPipeline": 300,
        # }
        yield dytt
Enable the pipeline:
Uncommenting the ITEM_PIPELINES block in settings.py enables the pipeline
ITEM_PIPELINES = {
"demo_dytt.pipelines.DemoDyttPipeline": 300,
}
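With the pipeline enabled, run the spider from the project directory; assuming the spider name dytt from above, the items end up in dytt.json:
scrapy crawl dytt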
4. CrawlSpider
Inherits from scrapy.Spider
What does CrawlSpider do?
1) define rules
2) extract the links that match the rules
3) parse them
Link extractor
1) Import the link extractor
from scrapy.linkextractors import LinkExtractor
2) Common parameters
allow = (): regular expression
restrict_xpaths = (): XPath
restrict_css = (): not recommended
Run scrapy shell <URL>, then do the link extraction in 3) and 4)
Import the link extractor (inside the shell):
from scrapy.linkextractors import LinkExtractor
3) allow = () syntax
link = LinkExtractor(allow=r'/book/1188_\d+\.html')
\d matches a digit
+ means one or more
Check:
link.extract_links(response)
4) restrict_xpaths = () syntax (the XPath must select elements, not @href attributes)
link1 = LinkExtractor(restrict_xpaths='//div[@class="pages"]/a')
Check:
link1.extract_links(response)
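A minimal sketch of the shell workflow in 2)-4), run against a 读书网 list page (the exact URL is an assumption based on the regex above):
scrapy shell https://www.dushu.com/book/1188_1.html

>>> from scrapy.linkextractors import LinkExtractor
>>> link = LinkExtractor(allow=r'/book/1188_\d+\.html')
>>> link.extract_links(response)     # pagination links that match the regex
>>> link1 = LinkExtractor(restrict_xpaths='//div[@class="pages"]/a')
>>> link1.extract_links(response)    # links found inside the pagination div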
5. CrawlSpider example
1) Create the spider file (a full sketch follows at the end of this section):
scrapy genspider -t crawl <spider_name> <URL>
2) The start page itself does not match the extraction rule, so it would not be crawled.
start_urls before the change:
start_urls = ["https://www.dushu.com/book/1157.html"]
After the change:
start_urls = ["https://www.dushu.com/book/1157_1.html"]