【scrapy】Signals: Extension Notes
Signals
In Scrapy, signals are the mechanism for passing events and information between the framework's components. Scrapy ships with a number of built-in signals that let developers hook custom code into different stages of a crawl, for example when the spider opens or when an item is scraped.
Code example:
```python
import logging

from scrapy import signals

logger = logging.getLogger(__name__)


class HooksasyncExtension:
    @classmethod
    def from_crawler(cls, crawler):
        logger.info("HooksasyncExtension from_crawler")
        return cls(crawler)

    def __init__(self, crawler):
        logger.info("HooksasyncExtension constructor called")
        # connect the extension object to signals
        cs = crawler.signals.connect
        cs(self.engine_started, signal=signals.engine_started)
        cs(self.engine_stopped, signal=signals.engine_stopped)
        cs(self.spider_opened, signal=signals.spider_opened)
        cs(self.spider_idle, signal=signals.spider_idle)
        cs(self.spider_closed, signal=signals.spider_closed)
        cs(self.spider_error, signal=signals.spider_error)
        cs(self.request_scheduled, signal=signals.request_scheduled)
        cs(self.response_received, signal=signals.response_received)
        cs(self.response_downloaded, signal=signals.response_downloaded)
        cs(self.item_scraped, signal=signals.item_scraped)
        cs(self.item_dropped, signal=signals.item_dropped)

    def engine_started(self):
        logger.info("HooksasyncExtension, signals.engine_started fired")

    def engine_stopped(self):
        logger.info("HooksasyncExtension, signals.engine_stopped fired")

    def spider_opened(self, spider):
        logger.info("HooksasyncExtension, signals.spider_opened fired")

    def spider_idle(self, spider):
        logger.info("HooksasyncExtension, signals.spider_idle fired")

    def spider_closed(self, spider, reason):
        logger.info("HooksasyncExtension, signals.spider_closed fired")

    def spider_error(self, failure, response, spider):
        logger.info("HooksasyncExtension, signals.spider_error fired")

    def request_scheduled(self, request, spider):
        logger.info("HooksasyncExtension, signals.request_scheduled fired")

    def response_received(self, response, request, spider):
        logger.info("HooksasyncExtension, signals.response_received fired")

    def response_downloaded(self, response, request, spider):
        logger.info("HooksasyncExtension, signals.response_downloaded fired")

    def item_scraped(self, item, response, spider):
        logger.info("HooksasyncExtension, signals.item_scraped fired")

    def item_dropped(self, item, spider, exception):
        logger.info("HooksasyncExtension, signals.item_dropped fired")
```
Using signals: crawler.signals.connect
Bind a signal to a callback with the crawler.signals.connect method:
```python
cs = crawler.signals.connect
cs(self.engine_started, signal=signals.engine_started)
cs(self.engine_stopped, signal=signals.engine_stopped)
cs(self.spider_opened, signal=signals.spider_opened)
cs(self.spider_idle, signal=signals.spider_idle)
cs(self.spider_closed, signal=signals.spider_closed)
cs(self.spider_error, signal=signals.spider_error)
cs(self.request_scheduled, signal=signals.request_scheduled)
cs(self.response_received, signal=signals.response_received)
cs(self.response_downloaded, signal=signals.response_downloaded)
cs(self.item_scraped, signal=signals.item_scraped)
cs(self.item_dropped, signal=signals.item_dropped)
```
crawler.signals.connect takes two arguments: the receiver and the signal.
receiver: the callback function to invoke
signal: the signal to listen for
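Connecting is one half of the mechanism; the other half is dispatching. Scrapy fires the built-in signals itself, but the same SignalManager can dispatch signals of our own via send_catch_log. A minimal sketch, where item_validated and ValidatingPipeline are hypothetical names invented for illustration:

```python
# item_validated and ValidatingPipeline are hypothetical names for illustration
item_validated = object()  # any unique object can serve as a signal


class ValidatingPipeline:
    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        # send_catch_log dispatches to every connected receiver and logs
        # (instead of raising) any exception a receiver throws
        self.crawler.signals.send_catch_log(
            signal=item_validated, item=item, spider=spider
        )
        return item
```

A receiver connected with cs(self.on_item_validated, signal=item_validated) would then run every time process_item dispatches the signal.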
The built-in signals

| Signal | Fired when |
|---|---|
| engine_started | the Scrapy engine starts crawling |
| engine_stopped | the Scrapy engine stops |
| spider_opened | a spider is opened and ready to start crawling |
| spider_idle | a spider goes idle, i.e. has no further requests pending |
| spider_closed | a spider is closed, typically after the crawl finishes |
| spider_error | a spider callback raises an exception |
| request_scheduled | the engine schedules a request, i.e. it reaches the scheduler |
| request_dropped | a scheduled request is rejected, e.g. by the duplicates filter |
| request_reached_downloader | a request reaches the downloader, about to be downloaded |
| request_left_downloader | a request leaves the downloader, even in case of failure |
| response_received | the engine receives a response from the downloader |
| response_downloaded | the downloader finishes downloading a response |
| headers_received | the response headers arrive, before the body is downloaded |
| bytes_received | a chunk of response data arrives; useful for tracking download progress |
| item_scraped | an item is scraped, after passing all pipeline stages without being dropped |
| item_dropped | an item is dropped, i.e. a pipeline raised DropItem |
| item_error | an item pipeline raises an error other than DropItem |
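Extensions are not the only place to listen: any of the signals above can also be connected from inside a spider. A minimal sketch, where QuotesSpider and its start URL are made up for illustration:

```python
import scrapy
from scrapy import signals


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # hypothetical spider, for illustration only
    start_urls = ["https://quotes.toscrape.com"]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # bind the callback exactly as the extension does
        crawler.signals.connect(spider.on_spider_closed, signal=signals.spider_closed)
        return spider

    def on_spider_closed(self, spider, reason):
        self.logger.info("spider closed, reason: %s", reason)

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```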
How signals are invoked
Register the extension in settings.py via the EXTENSIONS setting, which is configured exactly like ITEM_PIPELINES (see the sketch below). After that, whenever a signal fires, Scrapy automatically invokes every callback connected to it; we never call the callbacks explicitly in our own code.
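A minimal sketch of that registration, assuming the extension lives in a hypothetical myproject/extensions.py (adjust the dotted path to your project layout):

```python
# settings.py
# "myproject.extensions" is a hypothetical module path; the number is the
# load-order priority, just as in ITEM_PIPELINES
EXTENSIONS = {
    "myproject.extensions.HooksasyncExtension": 500,
}
```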