Scrapy Douban spider
1.1 Create the Douban spider project and change into its directory
scrapy startproject Douban
cd Douban
1.2 Modify the Douban item model to define the fields that will be collected later
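A minimal sketch of what items.py might contain after this step, assuming only the movie title is collected (the name field is the one the spider fills in later):

import scrapy


class DoubanItem(scrapy.Item):
    # Title of the movie, filled in by the spider
    name = scrapy.Field()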
1.3 Generate the movie spider for the douban.com domain
scrapy genspider movie douban.com
1.4 Change the target URL
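The generated spider targets douban.com by default; the only change in this step is to point start_urls at the Top 250 page (the full spider is shown in the next step):

start_urls = ["https://movie.douban.com/top250"]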
1.5 Run a test crawl; the response comes back as 403
import scrapy


class MovieSpider(scrapy.Spider):
    name = "movie"
    allowed_domains = ["douban.com"]
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        el_list = response.xpath("//div[@class='info']")
        print(len(el_list))
scrapy crawl movie
1.6 Modify the spider's request headers and stop obeying robots.txt
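A sketch of the corresponding settings.py changes, assuming a desktop browser User-Agent is enough to avoid the 403 (the exact UA string below is only illustrative):

# settings.py
ROBOTSTXT_OBEY = False  # stop obeying robots.txt

# Illustrative desktop browser User-Agent; any recent UA string works
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)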
1.7 Set a cookie for Scrapy to simulate a logged-in session
COOKIES_ENABLED = False
When COOKIES_ENABLED is left commented out, Scrapy uses its default cookie handling and the custom cookie is not applied.
When COOKIES_ENABLED is uncommented and set to False, the cookies middleware is disabled, so the Cookie string placed in settings.py (in DEFAULT_REQUEST_HEADERS) is sent as-is.
When COOKIES_ENABLED is set to True, Scrapy ignores the cookie from settings and uses custom cookies, e.g. those passed via the cookies argument of Request.
With these settings in place, the crawl now gets the correct result instead of a 403.
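A sketch of the cookie setup in settings.py, following the rules above; the Cookie value is a placeholder to be replaced with the string copied from a logged-in browser session:

# settings.py
COOKIES_ENABLED = False  # send the Cookie header defined below as-is

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    # Placeholder: paste the Cookie string from a logged-in Douban session here
    "Cookie": "<your Douban cookie string>",
}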
1.8 Extract the titles from the Douban ranking
This actually continues from the XPath above: each title is selected with a relative XPath inside the info element found in the previous step.
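One way to check the relative XPath interactively is scrapy shell, overriding the User-Agent on the command line so the request is not rejected (the UA value is illustrative and assumes the header override alone is enough); the selectors mirror the ones used in the spider:

scrapy shell -s USER_AGENT="Mozilla/5.0" "https://movie.douban.com/top250"

# inside the shell:
el_list = response.xpath("//div[@class='info']")               # one selector per movie entry
el_list[0].xpath("./div[1]/a/span[1]/text()").extract_first()  # title of the first entry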
1.9 Run the Douban spider
import scrapy

from ..items import DoubanItem


class MovieSpider(scrapy.Spider):
    name = "movie"
    allowed_domains = ["douban.com"]
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        # Each movie entry on the Top 250 page sits in a div with class "info"
        el_list = response.xpath("//div[@class='info']")
        print(len(el_list))

        for el in el_list:
            item = DoubanItem()
            # Relative XPath: the title is the first span of the entry's link
            item["name"] = el.xpath("./div[1]/a/span[1]/text()").extract_first()
            yield item

        # Follow the "next page" link until there is none (last page)
        url = response.xpath('//span[@class="next"]/a/@href').extract_first()
        if url is not None:
            url = response.urljoin(url)
            # No callback given, so Scrapy calls parse() on the next page as well
            yield scrapy.Request(url=url)
At this point the spider follows the pagination links and crawls the full Douban ranking.
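To keep the extracted titles rather than just printing counts, the same crawl can write the yielded items to a feed file (the filename here is arbitrary; the format is inferred from the extension):

scrapy crawl movie -o movies.json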