
Douban Book Excerpts | Web Scraper | Python

Fetch the excerpts for a book on Douban and store them in MongoDB.

import logging
import time

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'cache-control': 'max-age=0',
    'priority': 'u=0, i',
    'sec-ch-ua': '"Chromium";v="130", "Microsoft Edge";v="130", "Not?A_Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 Edg/130.0.0.0',
}

params = {
    'sort': 'score',
    'start': 0,
}
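# Connect to the MongoDB server (assumed to run locally on the default port 27017)
client = MongoClient('localhost', 27017)

# Select the database (if it does not exist, MongoDB creates it on the first insert)
db = client['douban_database']

# Select the collection (if it does not exist, MongoDB creates it on the first insert)
collection = db['blockquotes_1009393']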
for start in range(0, 1260, 20):
    params['start'] = start
    # A timeout keeps a stalled connection from hanging the whole run
    response = requests.get('https://book.douban.com/subject/1009393/blockquotes', params=params, headers=headers, timeout=10)
    text = response.text
    soup = BeautifulSoup(text, 'lxml')
    # find() returns None when the container is missing, so one lookup is enough
    blockquote_list = soup.find("div", attrs={"class": "blockquote-list"})
    if blockquote_list is None:
        logging.error("blockquote-list does not exist")
        raise SystemExit(1)
    figures = blockquote_list.find_all("figure")
    for figure in figures:
        if figure is None:
            logging.warning("figure is None")
            continue
        data = {
            'author_avatar': None,
            'author_name': None,
            'likes': None,
            'datetime': None,
            'page_reference': None
        }
        # Each field is extracted independently so that one missing element
        # does not discard the rest of the record.
        try:
            data['author_avatar'] = figure.find('img')['src']
        except (AttributeError, KeyError, TypeError):
            data['author_avatar'] = None
            logging.error(figure)

        try:
            data['author_name'] = figure.find('a', class_='author-name').text.strip()
        except (AttributeError, KeyError, TypeError):
            data['author_name'] = None
            logging.error(figure)

        try:
            data['likes'] = figure.find('span').text.strip().replace('赞', '')
        except (AttributeError, KeyError, TypeError):
            data['likes'] = None
            logging.error(figure)

        try:
            data['datetime'] = figure.find('datetime').text.strip()
        except (AttributeError, KeyError, TypeError):
            data['datetime'] = None
            logging.error(figure)

        try:
            data['page_reference'] = figure.find('figcaption')['title']
        except (AttributeError, KeyError, TypeError):
            data['page_reference'] = None
            logging.error(figure)

        try:
            # Remove the metadata block and the first link so that only the
            # excerpt text remains in the figure
            blockquote_extra = figure.find('div', class_='blockquote-extra')
            a_href = figure.find('a')
            blockquote_extra.decompose()
            a_href.decompose()
            content = figure.text.strip().replace('()', '')
            data['content'] = content
        except (AttributeError, KeyError, TypeError):
            data['content'] = None
            logging.error(figure)
        try:
            collection.insert_one(data)
        except Exception as e:
            logging.error("insert failed: %s", e)
    time.sleep(3)
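
The loop above hard-codes 1260 as the upper bound for `start`, which only fits this one book at the time of writing. A minimal sketch of an alternative is to keep requesting pages until one comes back empty. Here `iter_pages`, `fetch_page`, and `max_pages` are hypothetical names, not part of the original script, and `fake_fetch` stands in for the real request-and-parse step so the example runs offline.

```python
from typing import Callable, Iterator, List


def iter_pages(fetch_page: Callable[[int], List[str]],
               page_size: int = 20,
               max_pages: int = 1000) -> Iterator[List[str]]:
    """Yield pages of items until a page comes back empty.

    fetch_page is a hypothetical callable that takes the `start` offset and
    returns the items parsed from that page (for the script above, the
    <figure> elements of one listing page). max_pages is a safety cap.
    """
    for page_index in range(max_pages):
        items = fetch_page(page_index * page_size)
        if not items:
            return  # an empty page means we are past the last one
        yield items


# Stand-in for a real request: 45 fake excerpts served 20 at a time.
fake_excerpts = [f"excerpt {i}" for i in range(45)]


def fake_fetch(start: int) -> List[str]:
    return fake_excerpts[start:start + 20]


pages = list(iter_pages(fake_fetch))
print([len(page) for page in pages])  # [20, 20, 5]
```

In the real script, `fetch_page` would wrap the `requests.get` call and the `figure` lookup, and the `time.sleep(3)` pause would stay inside the consuming loop.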

Result:
[Screenshot: the excerpts as stored in the database]
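
Because the script calls `insert_one`, running it a second time stores every excerpt again. One possible way around this, sketched here under the assumption that author name, timestamp, and content together identify an excerpt, is to derive a stable key from those fields and use it as the document `_id`. The function name `excerpt_key` and the sample record are made up for illustration.

```python
import hashlib
import json


def excerpt_key(record: dict) -> str:
    """Return a stable hex digest built from the fields assumed to identify an excerpt."""
    identity = {
        "author_name": record.get("author_name"),
        "datetime": record.get("datetime"),
        "content": record.get("content"),
    }
    # sort_keys makes the serialized form independent of dict ordering
    payload = json.dumps(identity, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical record in the same shape as the script's `data` dict
record = {"author_name": "reader", "datetime": "2024-11-20",
          "content": "sample text", "likes": "3"}
key = excerpt_key(record)

# With pymongo, the key could then drive an upsert instead of insert_one:
# collection.replace_one({"_id": key}, {**record, "_id": key}, upsert=True)
```

Fields such as `likes` are left out of the key on purpose, since they change between runs and would otherwise produce a new document each time.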

