Scraping car and car-review data with scrapy (Part 2)

article 2024/10/4 16:35:50

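I plan to write up this crawler in three parts:
✅ Part 1 covers how to scrape car model information;
✅ Part 2 covers how to scrape car reviews;
✅ Part 3 covers how to attach sentiment-analysis results to the reviews, and a simple way to insert the data into MySQL.
The stack is the scrapy framework, the BERT language model, and a MySQL database.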

1 Writing the comment spider

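As with the car models, the comments are fetched from an API endpoint, which can be tried out in an API testing tool before writing any code.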

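The key parameters here are series_or_motor_id, count, and offset. count and offset control pagination, and series_or_motor_id is the car series ID scraped in the previous part, so the reviews can be fetched per car series.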
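As a minimal sketch of how these three parameters fit together, the request URL can be assembled with the standard library. The helper name `build_comment_url` is made up for illustration; the fixed parameters (aid, app_name, sort_type, tab_name) are simply copied from the URL used in this article, and their exact meaning is not verified here.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.dongchedi.com/motor/pc/ugc/community/cheyou_list'


def build_comment_url(series_or_motor_id, count=30, offset=0):
    """Assemble the comment-list URL for one car series and one page.

    Hypothetical helper: the fixed parameter values are copied from the
    article's example URL, not taken from any official documentation.
    """
    params = {
        'aid': 1839,
        'app_name': 'auto_web_pc',
        'series_or_motor_id': series_or_motor_id,
        'sort_type': 2,
        'tab_name': 'dongtai',
        'count': count,    # page size
        'offset': offset,  # index of the first comment on the page
    }
    return f'{BASE_URL}?{urlencode(params)}'
```

Calling `build_comment_url(393)` yields the first page for series 393, and `build_comment_url(393, offset=30)` the second.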
Next, write the review spider. Create a comment.py file under the spiders directory of the scrapy project:

import json

import scrapy

from ..items import CommentItem  # assumes the standard scrapy project layout


class CommentSpider(scrapy.Spider):
    name = 'comment'
    allowed_domains = ['dongchedi.com']
    # Not used while start_requests() below is overridden; kept for reference.
    start_urls = [
        'https://www.dongchedi.com/motor/pc/ugc/community/cheyou_list?aid=1839&app_name=auto_web_pc&series_or_motor_id=393&sort_type=2&tab_name=dongtai&count=30&offset=0'
    ]

    series_or_motor_ids = [393]  # can be extended to several IDs
    count = 30  # page size

    def start_requests(self):
        # Request the first page (offset=0) for every car series ID
        for series_or_motor_id in self.series_or_motor_ids:
            url = f'https://www.dongchedi.com/motor/pc/ugc/community/cheyou_list?aid=1839&app_name=auto_web_pc&series_or_motor_id={series_or_motor_id}&sort_type=2&tab_name=dongtai&count={self.count}&offset=0'
            yield scrapy.Request(url, callback=self.parse, meta={'series_or_motor_id': series_or_motor_id, 'offset': 0})

    def __init__(self, *args, **kwargs):
        super(CommentSpider, self).__init__(*args, **kwargs)
        self.item_count = 0  # initialize the counter of scraped comments

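For now the spider crawls a fixed ID. Later this can be extended to loop over the car IDs read from the database; here the list is hard-coded as [393] for development and testing.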
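A rough sketch of that later extension, using an in-memory SQLite database as a stand-in so the example runs on its own. The table name `series`, the column name `series_id`, and the second ID 394 are all hypothetical; the real project would query whatever table Part 1 wrote the car models to.

```python
import sqlite3


def load_series_ids(conn):
    """Return all car series IDs from the (hypothetical) `series` table."""
    rows = conn.execute('SELECT series_id FROM series ORDER BY series_id').fetchall()
    return [row[0] for row in rows]


# Demo: an in-memory database standing in for the real one.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE series (series_id INTEGER PRIMARY KEY)')
# 394 is a made-up ID, used only to show more than one row.
conn.executemany('INSERT INTO series (series_id) VALUES (?)', [(393,), (394,)])
series_ids = load_series_ids(conn)
```

The resulting list could then replace the hard-coded `series_or_motor_ids` in the spider.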
The parsing code is as follows:

    def parse(self, response):
        data = json.loads(response.text)
        total_count = data['data']['total_count']
        series_or_motor_id = response.meta['series_or_motor_id']
        offset = response.meta['offset']

        # Extract and process cheyou_list
        cheyou_list = data['data']['cheyou_list']
        for item in cheyou_list:
            comment_item = CommentItem()
            comment_item['title'] = item.get('title')
            comment_item['content'] = item.get('content')
            comment_item['username'] = item['profile_info'].get('name')
            comment_item['avatar'] = item['profile_info'].get('avatar_url')
            self.item_count += 1  # increment the counter
            yield comment_item

        # Check whether another page needs to be fetched
        if offset + self.count < total_count:
            next_offset = offset + self.count
            next_url = f'https://www.dongchedi.com/motor/pc/ugc/community/cheyou_list?aid=1839&app_name=auto_web_pc&series_or_motor_id={series_or_motor_id}&sort_type=2&tab_name=dongtai&count={self.count}&offset={next_offset}'
            yield scrapy.Request(next_url, callback=self.parse, meta={'series_or_motor_id': series_or_motor_id, 'offset': next_offset})

    def closed(self, reason):
        """Called when the spider closes."""
        purple_text = f"\033[95mScraped {self.item_count} comments in total.\033[0m"
        self.logger.info(purple_text)
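To make the paging condition easier to reason about, here is a small standalone sketch that lists every offset the spider would request for one car series. The function name `page_offsets` is made up for illustration; the loop condition is the same `offset + count < total_count` check used in `parse`.

```python
def page_offsets(total_count, count=30):
    """Return every offset that would be requested for one car series."""
    offsets = [0]  # the first page is always requested by start_requests()
    # Same condition as in parse(): continue while more comments remain.
    while offsets[-1] + count < total_count:
        offsets.append(offsets[-1] + count)
    return offsets
```

For example, with 95 comments and a page size of 30, the spider would request offsets 0, 30, 60 and 90; with exactly 30 comments it would stop after the first page.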

2 items.py

Write the corresponding Item. The label field is reserved for storing the sentiment-analysis result later:

class CommentItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    username = scrapy.Field()
    avatar = scrapy.Field()
    label = scrapy.Field()

3 pipelines.py

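In the pipeline, distinguish between item types. One Item was already written in the previous part, so here the sentiment analysis of the comment text item['content'] can be hooked in specifically for the comment Item.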

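from .items import CommentItem, DongchediItem  # assumes pipelines.py sits next to items.py


class SentimentPipeline:
    def process_item(self, item, spider):
        # Logic for saving to a database or file can be added here
        # print(item)
        if isinstance(item, CommentItem):
            print('Running sentiment analysis...')
            # TODO
            print(item)
        if isinstance(item, DongchediItem):
            print(item)
        return item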
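
For the pipeline to run at all, it has to be enabled in settings.py, if that was not already done in the previous part. A minimal sketch, assuming the project package is called `myproject` (a placeholder, since the article does not name the package):

```python
# settings.py (fragment)
# 'myproject' is a placeholder for the real project package name.
ITEM_PIPELINES = {
    # The number sets the run order among pipelines (lower runs first).
    'myproject.pipelines.SentimentPipeline': 300,
}
```

With this in place, `scrapy crawl comment` passes every yielded item through `SentimentPipeline.process_item`.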