Scrapy spider middleware
Spider middleware
SPIDER_MIDDLEWARES
Note: on the execution order of Scrapy spider middlewares.
View the default spider middlewares with `scrapy settings --get SPIDER_MIDDLEWARES_BASE`:
{
"scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
"scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
"scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
"scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": 800,
"scrapy.spidermiddlewares.depth.DepthMiddleware": 900
}
The SPIDER_MIDDLEWARES setting is merged with (not overridden by) Scrapy's built-in SPIDER_MIDDLEWARES_BASE setting, and the result is sorted by order value to produce the final ordered list of enabled middlewares: the first middleware is the one closest to the engine, and the last is the one closest to the spider.

To decide which order value to assign to your own middleware, look at the SPIDER_MIDDLEWARES_BASE values and pick a number according to where you want it to sit. The order matters: each middleware performs a different action, and yours may depend on middlewares that run before (or after) it.

To disable a built-in middleware (one defined in SPIDER_MIDDLEWARES_BASE and enabled by default), you must list it in your project's SPIDER_MIDDLEWARES setting and assign None as its value.

Default middlewares worth a closer look
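The merge-and-sort behaviour can be sketched in plain Python. This is a simplified illustration of the mechanism, not Scrapy's actual implementation; `build_middleware_list` and `myproject.middlewares.MySpiderMiddleware` are made-up names for the example:

```python
# Defaults as printed by `scrapy settings --get SPIDER_MIDDLEWARES_BASE`.
SPIDER_MIDDLEWARES_BASE = {
    "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
    "scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
    "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": 800,
    "scrapy.spidermiddlewares.depth.DepthMiddleware": 900,
}

# What you would put in settings.py: add one custom middleware (the
# module path here is hypothetical) and disable a built-in one with None.
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.MySpiderMiddleware": 543,
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
}

def build_middleware_list(base, user):
    """Merge user settings over the base, drop None entries, sort by order."""
    merged = {**base, **user}          # user values override base values
    enabled = {k: v for k, v in merged.items() if v is not None}
    return [name for name, _ in sorted(enabled.items(), key=lambda kv: kv[1])]
```

With these settings the resulting list starts with HttpErrorMiddleware (order 50, closest to the engine), the custom middleware slots in at 543 between it and RefererMiddleware, and OffsiteMiddleware is gone.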
from scrapy.spidermiddlewares.httperror import HttpErrorMiddleware
import logging

from scrapy.exceptions import IgnoreRequest

logger = logging.getLogger(__name__)


class HttpError(IgnoreRequest):
    """A non-2xx response was filtered out by this middleware."""

    def __init__(self, response, *args, **kwargs):
        self.response = response
        super().__init__(*args, **kwargs)


class HttpErrorMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.handle_httpstatus_all = settings.getbool('HTTPERROR_ALLOW_ALL')
        self.handle_httpstatus_list = settings.getlist('HTTPERROR_ALLOWED_CODES')

    def process_spider_input(self, response, spider):
        if 200 <= response.status < 300:  # common case
            return
        meta = response.meta
        if 'handle_httpstatus_all' in meta:
            return
        if 'handle_httpstatus_list' in meta:
            allowed_statuses = meta['handle_httpstatus_list']
        elif self.handle_httpstatus_all:
            return
        else:
            allowed_statuses = getattr(spider, 'handle_httpstatus_list',
                                       self.handle_httpstatus_list)
        if response.status in allowed_statuses:
            return
        raise HttpError(response, 'Ignoring non-200 response')

    def process_spider_exception(self, response, exception, spider):
        if isinstance(exception, HttpError):
            spider.crawler.stats.inc_value('httperror/response_ignored_count')
            spider.crawler.stats.inc_value(
                f'httperror/response_ignored_status_count/{response.status}'
            )
            logger.info(
                "Ignoring response %(response)r: HTTP status code is not handled or not allowed",
                {'response': response}, extra={'spider': spider},
            )
            return []

Note: if REDIRECT_ENABLED is False and none of the HTTPERROR_* settings (such as HTTPERROR_ALLOW_ALL or HTTPERROR_ALLOWED_CODES) are configured, Scrapy will only process responses with 200 <= response.status < 300.
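The branching in process_spider_input can be condensed into a standalone sketch of the precedence it applies. The function `is_allowed` and its parameter names are illustrative only, not part of the Scrapy API:

```python
# Standalone sketch of the allow-list decision made by
# HttpErrorMiddleware.process_spider_input (illustrative names).
def is_allowed(status, meta=None, spider_allowed=None,
               allow_all=False, settings_allowed=()):
    if 200 <= status < 300:                # common case: always passed through
        return True
    meta = meta or {}
    if 'handle_httpstatus_all' in meta:    # per-request override: allow everything
        return True
    if 'handle_httpstatus_list' in meta:   # per-request allow-list wins ...
        allowed = meta['handle_httpstatus_list']
    elif allow_all:                        # ... then the HTTPERROR_ALLOW_ALL setting ...
        return True
    else:                                  # ... then spider attribute, then HTTPERROR_ALLOWED_CODES
        allowed = spider_allowed if spider_allowed is not None else settings_allowed
    return status in allowed
```

In practice you rarely write this logic yourself: set a `handle_httpstatus_list = [404]` attribute on the spider class, pass the same key in Request.meta, or configure HTTPERROR_ALLOWED_CODES in settings, and the middleware applies exactly this precedence.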
Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit the source, desperado, when reposting!
