Python Web Scraping - Crawling, Processing, and Storing Douban Books Data

article 2025/3/10 10:18:33

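Contents

Preface
I. Versions Used
II. Requirements Analysis
- 1. Analyzing what to scrape
  - 1.1 Analyzing the information on a single book page
  - 1.2 Scraping steps
    - 1.2.1 Scraping the Douban Books tag index page
    - 1.2.2 Scraping the tag listing pages
    - 1.2.3 Scraping individual book pages
  - 1.3 Locating the elements that hold the content
- 2. Uses of the data
  - 2.1 Basic analysis
  - 2.2 Advanced analysis
- 3. Strategies for dealing with anti-scraping measures
  - 3.1 Using `User-Agent` to mimic real browser requests
  - 3.2 Applying random delays
  - 3.3 Building and using a proxy pool
  - 3.4 Other measures
III. Writing the Scraper
- 1. Scraping the tag index HTML
- 2. Scraping every page of a single tag
- 3. Scraping the HTML of individual books
IV. Data Processing and Storage
- 1. Parsing the HTML and saving the data to a CSV file
  - 1.1 Field descriptions
  - 1.2 Implementation
- 2. Data cleaning and storage
  - 2.1 Data cleaning
  - 2.2 Data storage
    - 2.2.1 Table design
    - 2.2.2 Table implementation
  - 2.3 Implementation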
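Preface

Web scraping makes it practical to collect data from websites at scale. Douban Books (豆瓣读书) is a widely used book rating and recommendation platform that holds a large amount of book information, including titles, authors, publishers, and ratings. This information helps readers choose books and is also a useful data source for publishers and researchers.

This project uses Python to systematically scrape book information from Douban Books and store it in a structured format for later analysis. requests and BeautifulSoup handle page requests and HTML parsing, pandas handles data processing, and the cleaned data is finally stored in a MySQL database.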

I. Versions Used

Package	Version
python	3.8.5
requests	2.31.0
bs4	0.0.2
beautifulsoup4	4.12.3
soupsieve	2.6
lxml	4.9.3
pandas	2.0.3
sqlalchemy	2.0.36
mysql-connector-python	9.0.0
selenium	4.15.2

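II. Requirements Analysis

1. Analyzing what to scrape

1.1 Analyzing the information on a single book page

Open the Douban Books home page: https://book.douban.com/

Open any book.

The book's page shows the title, author, publisher, publication date, page count, price, rating, and other details. The goal is to scrape this information and save it to a CSV file as the raw data.

Right-click the page and choose Inspect to find the corresponding HTML source.

Hovering over the `div` with class `subjectwrap clearfix` in the source highlights the region of the page that contains the information we want, which shows that all of the details for a single book sit inside that element.

The content of that element, copied from the page, is shown below; it contains every field we need to scrape.

The plan is to download the HTML file for each book, parse it with BeautifulSoup, and save the extracted data to a CSV file.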

<div class="subjectwrap clearfix">
<div class="subject clearfix">
<div id="mainpic" class="">
  <a class="nbg" href="https://img1.doubanio.com/view/subject/l/public/s34971089.jpg" title="再造乡土">
      <img src="https://img1.doubanio.com/view/subject/s/public/s34971089.jpg" title="点击看大图" alt="再造乡土" rel="v:photo" style="max-width: 135px;max-height: 200px;">
  </a>
</div>
<div id="info" class="">
    <span>
      <span class="pl"> 作者</span>:
            <a class="" href="/author/4639586">（美）萨拉·法默</a>
    </span><br>
    <span class="pl">出版社:</span>
      <a href="https://book.douban.com/press/2476">广西师范大学出版社</a>
    <br>
    <span class="pl">出品方:</span>
      <a href="https://book.douban.com/producers/795">望mountain</a>
    <br>
    <span class="pl">副标题:</span> 1945年后法国农村社会的衰落与重生<br>
    <span class="pl">原作名:</span> Rural Inventions: The French Countryside after 1945<br>
    <span>
      <span class="pl"> 译者</span>:
            <a class="" href="/search/%E5%8F%B6%E8%97%8F">叶藏</a>
    </span><br>
    <span class="pl">出版年:</span> 2024-11<br>
    <span class="pl">页数:</span> 288<br>
    <span class="pl">定价:</span> 79.20元<br>
    <span class="pl">装帧:</span> 精装<br>
      <span class="pl">ISBN:</span> 9787559874597<br>
</div>
</div>
<div id="interest_sectl" class="">
  <div class="rating_wrap clearbox" rel="v:rating">
    <div class="rating_logo">
            豆瓣评分
    </div>
    <div class="rating_self clearfix" typeof="v:Rating">
      <strong class="ll rating_num " property="v:average"> 8.5 </strong>
      <span property="v:best" content="10.0"></span>
      <div class="rating_right ">
          <div class="ll bigstar45"></div>
            <div class="rating_sum">
                <span class="">
                    <a href="comments" class="rating_people"><span property="v:votes">55</span>人评价</a>
                </span>
            </div>
      </div>
    </div>
<span class="stars5 starstop" title="力荐">
    5星
</span>
<div class="power" style="width:37px"></div>
            <span class="rating_per">29.1%</span>
            <br>
<span class="stars4 starstop" title="推荐">
    4星
</span>
<div class="power" style="width:64px"></div>
            <span class="rating_per">49.1%</span>
            <br>
<span class="stars3 starstop" title="还行">
    3星
</span>
<div class="power" style="width:26px"></div>
            <span class="rating_per">20.0%</span>
            <br>
<span class="stars2 starstop" title="较差">
    2星
</span>
<div class="power" style="width:2px"></div>
            <span class="rating_per">1.8%</span>
            <br>
<span class="stars1 starstop" title="很差">
    1星
</span>
<div class="power" style="width:0px"></div>
            <span class="rating_per">0.0%</span>
            <br>
  </div>
</div>
</div>

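1.2 Scraping steps

1.2.1 Scraping the Douban Books tag index page

Douban Books tag index: https://book.douban.com/tag/?view=type&icn=index-sorttags-all

Download the tag index page and save it as ../douban/douban_book/douban_book_tag/douban_book_all_tag.html. Then parse that file with BeautifulSoup to get the name and link of every tag.

1.2.2 Scraping the tag listing pages

Clicking the 小说 (fiction) tag, for example, opens https://book.douban.com/tag/小说, so the first page of any tag can be assumed to live at https://book.douban.com/tag/<tag name>.

Each listing page shows only summary information for a number of books, which is not enough for our purposes. We therefore need the link of every page in the listing so that the HTML of each page can be saved.

Inspecting the pagination links shows that each page holds 20 entries and that the URL carries two parameters, start and type, where start is the offset of the first entry on the page. The URL of each page is therefore https://book.douban.com/tag/<tag name>?start=<multiple of 20>&type=T. The link of each book is then parsed out of every listing page.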
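
The URL pattern inferred above can be sketched with the standard library alone. The helper name `build_page_url` and the constant names are illustrative assumptions, not part of the original scraper, which passes the same two parameters through `params=` in requests instead.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://book.douban.com/tag/"  # tag URL prefix taken from the article
PAGE_SIZE = 20  # entries per listing page, as observed in the article


def build_page_url(tag_name, page_index):
    """Return the listing URL for a zero-based page index of a tag (hypothetical helper)."""
    query = urlencode({"start": page_index * PAGE_SIZE, "type": "T"})
    # quote() percent-encodes non-ASCII tag names such as 小说
    return f"{BASE_URL}{quote(tag_name)}?{query}"


if __name__ == "__main__":
    for page in range(3):
        print(build_page_url("小说", page))
```

Percent-encoding the tag name by hand is only needed when the URL is assembled as a string; requests performs the same encoding when it is given the raw tag name.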

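1.2.3 Scraping individual book pages

Once we have each book's link, we can save the HTML of every book page and then use BeautifulSoup to parse the book's information out of it.

1.3 Locating the elements that hold the content

CSS selectors can be used to locate the elements that hold the content to scrape.
Example: locating the title element.
Right-click the title and choose Inspect to reveal its source. Then right-click the highlighted element in the source, choose Copy, and then Copy selector.

The copied selector is:

#wrapper > h1 > span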
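
As a minimal sketch of how a copied selector is used, the snippet below runs it against a trimmed stand-in for a book page. The HTML string is a hypothetical fragment written for this example, and `html.parser` is used so that no extra parser is needed; the article's own code uses lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed stand-in for a book detail page
html = """
<div id="wrapper">
  <h1><span>再造乡土</span></h1>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
title_tag = soup.select_one("#wrapper > h1 > span")
# select_one returns None when nothing matches, so guard before reading the text
title = title_tag.get_text(strip=True) if title_tag is not None else ""
print(title)
```

Guarding against `None` matters in practice because a blocked or error page saved in place of a real book page will not contain the element.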

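2. Uses of the data

2.1 Basic analysis

Descriptive statistics:
- Compute the mean, median, maximum, minimum, and standard deviation of numeric fields such as price and page count.
- Count the number of books per binding type (binding) or publisher (publisher).
Frequency distributions:
- Plot the frequency distribution of publication year (publication_year) to see the publishing trend over time.
- Analyze the share of each star rating (stars5_starstop through stars1_starstop) to understand the overall rating distribution.
Simple relationships:
- Explore the correlation between price and rating.
- Check whether page count and rating are noticeably related.
Grouped summaries:
- Group books by author (author), publisher (publisher), or series (series) and compute per-group metrics such as the average rating. Sales totals would need an additional data source, since the fields scraped here do not include sales figures.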
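
The descriptive statistics listed above can be sketched as follows. The article's pipeline uses pandas for this kind of work; this sketch swaps in the standard library so that it runs without dependencies, and the sample rows are made-up values, not scraped data.

```python
import statistics
from collections import Counter

# Hypothetical, already-cleaned sample rows (not real scraped data)
books = [
    {"price": 79.2, "page_count": 288, "binding": "精装"},
    {"price": 45.0, "page_count": 320, "binding": "平装"},
    {"price": 58.0, "page_count": 256, "binding": "平装"},
    {"price": 128.0, "page_count": 512, "binding": "精装"},
]


def describe(values):
    """Return the summary statistics for one numeric field."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


price_stats = describe([b["price"] for b in books])
binding_counts = Counter(b["binding"] for b in books)

print(price_stats)
print(binding_counts)
```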

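2.2 Advanced analysis

Predictive modeling:
- Use machine learning to predict a book's likely rating from factors such as author, publisher, price, and publication year.
- Build a model that predicts book sales to help publishers or bookstores manage inventory.
Cluster analysis:
- Cluster books to find groups with similar characteristics, such as similar topics, readership, or market performance.
- Run topic modeling on the text behind the user comment links to identify common reader concerns or types of feedback.
Causal analysis:
- Study the effect of specific factors (such as cover design or translation quality) on ratings or sales while controlling for other variables.
- Use experimental or quasi-experimental designs to evaluate the effect of marketing campaigns on sales.
Time series analysis:
- With several consecutive years of data, run time series analysis on publication year, sales, and similar fields to forecast future trends.
- Study how specific events (such as an author winning an award) affect sales over time.
Network analysis:
- Build author collaboration networks or book citation networks to explore collaboration patterns and the spread of influence in academic or literary fields.
Sentiment analysis:
- Run sentiment analysis on the content behind the user comment links to understand how readers feel about a book.
Multivariate regression:
- Study how several variables (such as price, page count, and publication year) jointly affect a book's rating or sales.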

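3. Strategies for dealing with anti-scraping measures

3.1 Using `User-Agent` to mimic real browser requests

Many sites inspect the User-Agent field of the HTTP request headers to decide whether a request comes from a real browser. By default, Python HTTP libraries identify themselves in this header (requests sends a value of the form python-requests/<version>), which makes the client easy to recognize as an automated tool. Setting User-Agent to values used by real browsers on different operating systems helps avoid this particular check.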
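
A minimal sketch of User-Agent rotation is shown below. The strings are three of the values used in the article's scraper; the function name `build_headers` and its `rng` parameter are assumptions made for this example.

```python
import random

# Example browser User-Agent strings (three of the values used in the article's scraper)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
]


def build_headers(rng=random):
    """Return request headers with a randomly chosen User-Agent (hypothetical helper)."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


if __name__ == "__main__":
    print(build_headers())
```

The returned dictionary can be passed as the `headers=` argument of `requests.get`, which is how the scraper in this article applies it.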

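3.2 Applying random delays

Requests sent at a high, regular rate are a typical signature of a crawler. Adding a random delay between requests makes the access pattern look more like a human visitor, reduces the load on the server, and lowers the risk of being blocked.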
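
The random delay can be isolated in a small helper, sketched here. The default bounds mirror the `random.uniform(0.1, 2)` call in the article's scraper; the function name and the injectable `sleep` and `rng` parameters are assumptions added so the helper can be exercised without actually waiting.

```python
import random
import time


def polite_sleep(min_seconds=0.1, max_seconds=2.0, sleep=time.sleep, rng=random):
    """Pause for a random duration in [min_seconds, max_seconds] and return it."""
    delay = rng.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


if __name__ == "__main__":
    # Short bounds keep the demonstration quick; a real crawl would use longer ones
    for page in range(3):
        waited = polite_sleep(0.01, 0.05)
        print(f"page {page}: waited {waited:.3f}s before requesting")
```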

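3.3 Building and using a proxy pool

Sending a large number of requests from a single IP address is likely to get that address blocked. With a proxy pool, requests rotate across different IP addresses, which spreads the risk, makes the crawler less conspicuous, and reduces the load attributed to any one address.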
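
One simple way to rotate proxies is a round-robin generator, sketched below under stated assumptions: the addresses are placeholders from the documentation-only range 203.0.113.0/24, and `proxy_rotation` is a hypothetical helper. The dictionaries it yields follow the `proxies=` format that requests accepts, matching the shape built in the article's scraper.

```python
from itertools import cycle

# Placeholder addresses from the documentation-only range 203.0.113.0/24 (not real proxies)
PROXY_POOL = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"]


def proxy_rotation(pool, username="", password=""):
    """Yield requests-style proxies dicts, cycling through the pool indefinitely."""
    auth = f"{username}:{password}@" if username else ""
    for address in cycle(pool):
        proxy_url = f"http://{auth}{address}/"
        yield {"http": proxy_url, "https": proxy_url}


if __name__ == "__main__":
    rotation = proxy_rotation(PROXY_POOL)
    for _ in range(4):
        print(next(rotation))
```

Round-robin is the simplest policy; a production pool would typically also drop proxies that fail repeatedly, which this sketch does not attempt.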

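3.4 Other measures

CAPTCHA handling: some sites also use CAPTCHAs to verify that the visitor is human. In that case, consider a third-party OCR service or a dedicated CAPTCHA recognition API.
JavaScript-rendered pages: some modern sites load content dynamically with JavaScript, so parsing the raw HTML may not yield the complete data. A tool such as Selenium can be used here, since it drives a real browser instance that executes the JavaScript.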

III. Writing the Scraper

1. Scraping the tag index HTML

Implementation:

import random
import time
from pathlib import Path

import requests


def get_request(url, **kwargs):
    time.sleep(random.uniform(0.1, 2))
    print(f'===============================地址：{url} ===============================')
    # 定义一组User-Agent字符串
    user_agents = [
        # Chrome
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        # Firefox
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (X11; Linux i686; rv:109.0) Gecko/20100101 Firefox/117.0',
        # Edge
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2040.0',
        # Safari
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15',
    ]

    # 请求头
    headers = {
        'User-Agent': random.choice(user_agents)
    }

    # 用户名密码认证(私密代理/独享代理)
    username = ""
    password = ""
    proxies = {
        "http": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password,
                                                        "proxy": '36.25.243.5:11768'},
        "https": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password,
                                                         "proxy": '36.25.243.5:11768'}
    }

    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = requests.get(url=url, timeout=10, headers=headers, **kwargs)
            # response = requests.get(url=url, timeout=10, headers=headers, proxies=proxies, **kwargs)
            if response.status_code == 200:
                return response
            else:
                print(f"请求失败，状态码: {response.status_code}，正在重新发送请求 (尝试 {attempt + 1}/{max_retries})")
        except requests.exceptions.RequestException as e:
            print(f"请求过程中发生异常: {e}，正在重新发送请求 (尝试 {attempt + 1}/{max_retries})")

        # 如果不是最后一次尝试，则等待一段时间再重试
        if attempt < max_retries - 1:
            time.sleep(random.uniform(1, 2))
    print('================多次请求失败，请查看异常情况================')
    return None  # 或者返回最后一次的响应，取决于你的需求


def save_book_html_file(save_dir, file_name, content):
    dir_path = Path(save_dir)
    # 确保保存目录存在，如果不存在则创建所有必要的父级目录
    dir_path.mkdir(parents=True, exist_ok=True)
    # 使用 'with' 语句打开文件以确保正确关闭文件流
    with open(save_dir + file_name, 'w', encoding='utf-8') as fp:
        print(f"==============================={save_dir + file_name} 文件已保存===============================")
        fp.write(str(content))


def download_book_tag():
    save_dir = '../douban/douban_book/douban_book_tag/'
    file_name = 'douban_book_all_tag.html'
    book_tag_url = 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all'
    tag_file_path = Path(save_dir + file_name)
    if tag_file_path.exists() and tag_file_path.is_file():
        print(f'\n===============================文件 {tag_file_path} 已存在===============================')
    else:
        print(f'===============================文件 {tag_file_path} 不存在，正在下载...===============================')
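        response = get_request(book_tag_url)
        # get_request returns None once every retry has failed, so check before reading .text
        if response is not None:
            save_book_html_file(save_dir=save_dir, file_name=file_name, content=response.text)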


if __name__ == '__main__':
    download_book_tag()

The script can be run repeatedly: each run first checks whether the file has already been downloaded and skips the download if it has.
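
The re-run behaviour can be captured in a small standalone helper, sketched here. `download_once` and its `fetch` callback are hypothetical names for this example; in the article's scraper the same check is written inline in each download function.

```python
from pathlib import Path


def download_once(file_path, fetch):
    """Write fetch()'s text to file_path unless the file already exists.

    Returns True if a download happened, False if it was skipped or failed.
    """
    file_path = Path(file_path)
    if file_path.is_file():
        return False  # already downloaded on an earlier run
    content = fetch()
    if content is None:
        return False  # fetch failed; nothing is written, so a later run retries
    file_path.parent.mkdir(parents=True, exist_ok=True)
    file_path.write_text(content, encoding="utf-8")
    return True
```

Because a failed fetch leaves no file behind, rerunning the script naturally retries only the pages that are still missing.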


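2. Scraping every page of a single tag

Building on the tag index code above, this version parses the tag index HTML with BeautifulSoup and then loops over the tag names and links to download every listing page under each tag.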

import random
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup


# 快代理试用：https://www.kuaidaili.com/freetest/


def get_request(url, **kwargs):
    time.sleep(random.uniform(0.1, 2))
    print(f'===============================地址：{url} ===============================')
    # 定义一组User-Agent字符串
    user_agents = [
        # Chrome
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        # Firefox
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (X11; Linux i686; rv:109.0) Gecko/20100101 Firefox/117.0',
        # Edge
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2040.0',
        # Safari
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15',
    ]

    # 请求头
    headers = {
        'User-Agent': random.choice(user_agents)
    }
    # Username/password authentication (private or dedicated proxy); fill in your own credentials
    username = ""
    password = ""
    proxies = {
        "http": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password,
                                                        "proxy": '36.25.243.5:11768'},
        "https": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password,
                                                         "proxy": '36.25.243.5:11768'}
    }

    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = requests.get(url=url, timeout=10, headers=headers, **kwargs)
            # response = requests.get(url=url, timeout=10, headers=headers, proxies=proxies, **kwargs)
            if response.status_code == 200:
                return response
            else:
                print(f"请求失败，状态码: {response.status_code}，正在重新发送请求 (尝试 {attempt + 1}/{max_retries})")
        except requests.exceptions.RequestException as e:
            print(f"请求过程中发生异常: {e}，正在重新发送请求 (尝试 {attempt + 1}/{max_retries})")

        # 如果不是最后一次尝试，则等待一段时间再重试
        if attempt < max_retries - 1:
            time.sleep(random.uniform(1, 2))
    print('================多次请求失败，请查看异常情况================')
    return None  # 或者返回最后一次的响应，取决于你的需求


def save_book_html_file(save_dir, file_name, content):
    dir_path = Path(save_dir)
    # 确保保存目录存在，如果不存在则创建所有必要的父级目录
    dir_path.mkdir(parents=True, exist_ok=True)
    # 使用 'with' 语句打开文件以确保正确关闭文件流
    with open(save_dir + file_name, 'w', encoding='utf-8') as fp:
        print(f"==============================={save_dir + file_name} 文件已保存===============================")
        fp.write(str(content))


def download_book_tag():
    save_dir = '../douban/douban_book/douban_book_tag/'
    file_name = 'douban_book_all_tag.html'
    book_tag_url = 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all'
    tag_file_path = Path(save_dir + file_name)
    if tag_file_path.exists() and tag_file_path.is_file():
        print(f'\n===============================文件 {tag_file_path} 已存在===============================')
    else:
        print(f'===============================文件 {tag_file_path} 不存在，正在下载...===============================')
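        response = get_request(book_tag_url)
        # get_request returns None once every retry has failed, so check before reading .text
        if response is not None:
            save_book_html_file(save_dir=save_dir, file_name=file_name, content=response.text)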


def get_soup(markup):
    return BeautifulSoup(markup=markup, features='lxml')


def get_book_type_and_href():
    # 定义HTML文件路径
    file = '../douban/douban_book/douban_book_tag/douban_book_all_tag.html'
    # 初始化一个空字典用于存储标签名称和对应的链接
    name_href_result = {}
    # 定义豆瓣书籍的基础URL，用于拼接完整的链接
    book_base_url = 'https://book.douban.com'
    # 打开并读取HTML文件内容
    with open(file=file, mode='r', encoding='utf-8') as fp:
        # 使用BeautifulSoup解析HTML内容
        soup = get_soup(fp)
        # 选择包含所有标签链接的主要容器
        tag = soup.select_one('#content > div > div.article > div:nth-child(2)')
        # 选择所有包含标签链接的表格行（每个类别下的标签表）
        tables = tag.select('div > a.tag-title-wrapper + table.tagCol')
        # 遍历每个表格
        for table in tables:
            # 选择表格中的所有行（tr标签）
            tr_tags = table.select('tr')
            # 遍历每一行
            for tr_tag in tr_tags:
                # 选择行中的所有单元格（td标签）
                td_tags = tr_tag.select('td')
                # 遍历每个单元格
                for td_tag in td_tags:
                    # 选择单元格中的第一个a标签（如果存在）
                    a_tag = td_tag.select_one('a')
                    # 如果找到了a标签，则提取文本和href属性
                    if a_tag:
                        # Extract the link text with surrounding whitespace stripped
                        tag_text = a_tag.get_text(strip=True)
                        # 获取a标签的href属性，并与基础URL拼接成完整链接
                        tag_href = book_base_url + a_tag.attrs.get('href')
                        # 将提取到的标签文本和链接添加到结果字典中
                        name_href_result[tag_text] = tag_href
    # 返回包含所有标签名称和对应链接的字典
    return name_href_result


def get_book_data_dagai(name, start):
    book_tag_base_url = 'https://book.douban.com/tag/' + name
    payload = {
        'start': start,
        'type': 'T'
    }
    response = get_request(book_tag_base_url, params=payload)
    if response is None:
        return None
    return response.text


def download_book_data_dagai(name, start):
    save_dir = '../douban/douban_book/douban_book_data_dagai/'
    file_name = f'douban_book_data_dagai_{name}_{start}.html'
    dagai_file_path = Path(save_dir + file_name)
    if dagai_file_path.exists() and dagai_file_path.is_file():
        print(f'===============================文件 {dagai_file_path} 已存在===============================')
    else:
        print(
            f'===============================文件 {dagai_file_path} 不存在，正在下载...===============================')
        content = get_book_data_dagai(name, start)
        if content is None:
            return None
        # 判断是否是最后一页
        soup = get_soup(content)
        p_tag = soup.select_one('#subject_list > p')
        if p_tag is not None:
            print(f"===============================分类 {name} 的网页爬取完成===============================")
            return True
        save_book_html_file(save_dir=save_dir, file_name=file_name, content=content)


if __name__ == '__main__':
    download_book_tag()

    book_type = get_book_type_and_href()
    book_type_name = book_type.keys()
    print(book_type_name)
    for type_name in book_type_name:
        print(f'===============================图书分类标签：{type_name}===============================')
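        start_ = 0
        failures = 0
        while True:
            flag = download_book_data_dagai(type_name, start_)
            if flag:
                print(f'======================================图书分类标签 {type_name} 的大概html下载完成======================================')
                break
            # None is returned both when the page was saved (or already present) and when
            # the request failed; the saved file tells the two cases apart
            page_file = Path(f'../douban/douban_book/douban_book_data_dagai/douban_book_data_dagai_{type_name}_{start_}.html')
            if page_file.exists():
                failures = 0
                start_ = start_ + 20
            else:
                # Retry the same page, but give up on this tag after repeated failures
                failures += 1
                if failures >= 3:
                    print(f'Giving up on tag {type_name} at start={start_} after repeated failures')
                    break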

3. Scraping the HTML of individual books

Building on the listing-page code above, this version parses each saved listing page with BeautifulSoup and downloads the HTML page behind every book link it finds.

import random
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup


# 快代理试用：https://www.kuaidaili.com/freetest/


def get_request(url, **kwargs):
    time.sleep(random.uniform(0.1, 2))
    print(f'===============================地址：{url} ===============================')
    # 定义一组User-Agent字符串
    user_agents = [
        # Chrome
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        # Firefox
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (X11; Linux i686; rv:109.0) Gecko/20100101 Firefox/117.0',
        # Edge
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2040.0',
        # Safari
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15',
    ]

    # 请求头
    headers = {
        'User-Agent': random.choice(user_agents)
    }
    # 用户名密码认证(私密代理/独享代理)
    username = ""
    password = ""
    proxies = {
        "http": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password,
                                                        "proxy": '36.25.243.5:11768'},
        "https": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password,
                                                         "proxy": '36.25.243.5:11768'}
    }

    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = requests.get(url=url, timeout=10, headers=headers, **kwargs)
            # response = requests.get(url=url, timeout=10, headers=headers, proxies=proxies, **kwargs)
            if response.status_code == 200:
                return response
            else:
                print(f"请求失败，状态码: {response.status_code}，正在重新发送请求 (尝试 {attempt + 1}/{max_retries})")
        except requests.exceptions.RequestException as e:
            print(f"请求过程中发生异常: {e}，正在重新发送请求 (尝试 {attempt + 1}/{max_retries})")

        # 如果不是最后一次尝试，则等待一段时间再重试
        if attempt < max_retries - 1:
            time.sleep(random.uniform(1, 2))
    print('================多次请求失败，请查看异常情况================')
    return None  # 或者返回最后一次的响应，取决于你的需求


def save_book_html_file(save_dir, file_name, content):
    dir_path = Path(save_dir)
    # 确保保存目录存在，如果不存在则创建所有必要的父级目录
    dir_path.mkdir(parents=True, exist_ok=True)
    # 使用 'with' 语句打开文件以确保正确关闭文件流
    with open(save_dir + file_name, 'w', encoding='utf-8') as fp:
        print(f"==============================={save_dir + file_name} 文件已保存===============================")
        fp.write(str(content))


def download_book_tag():
    save_dir = '../douban/douban_book/douban_book_tag/'
    file_name = 'douban_book_all_tag.html'
    book_tag_url = 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all'
    tag_file_path = Path(save_dir + file_name)
    if tag_file_path.exists() and tag_file_path.is_file():
        print(f'\n===============================文件 {tag_file_path} 已存在===============================')
    else:
        print(f'===============================文件 {tag_file_path} 不存在，正在下载...===============================')
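        response = get_request(book_tag_url)
        # get_request returns None once every retry has failed, so check before reading .text
        if response is not None:
            save_book_html_file(save_dir=save_dir, file_name=file_name, content=response.text)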


def get_soup(markup):
    return BeautifulSoup(markup=markup, features='lxml')


def get_book_type_and_href():
    # 定义HTML文件路径
    file = '../douban/douban_book/douban_book_tag/douban_book_all_tag.html'
    # 初始化一个空字典用于存储标签名称和对应的链接
    name_href_result = {}
    # 定义豆瓣书籍的基础URL，用于拼接完整的链接
    book_base_url = 'https://book.douban.com'
    # 打开并读取HTML文件内容
    with open(file=file, mode='r', encoding='utf-8') as fp:
        # 使用BeautifulSoup解析HTML内容
        soup = get_soup(fp)
        # 选择包含所有标签链接的主要容器
        tag = soup.select_one('#content > div > div.article > div:nth-child(2)')
        # 选择所有包含标签链接的表格行（每个类别下的标签表）
        tables = tag.select('div > a.tag-title-wrapper + table.tagCol')
        # 遍历每个表格
        for table in tables:
            # 选择表格中的所有行（tr标签）
            tr_tags = table.select('tr')
            # 遍历每一行
            for tr_tag in tr_tags:
                # 选择行中的所有单元格（td标签）
                td_tags = tr_tag.select('td')
                # 遍历每个单元格
                for td_tag in td_tags:
                    # 选择单元格中的第一个a标签（如果存在）
                    a_tag = td_tag.select_one('a')
                    # 如果找到了a标签，则提取文本和href属性
                    if a_tag:
                        # Extract the link text with surrounding whitespace stripped
                        tag_text = a_tag.get_text(strip=True)
                        # 获取a标签的href属性，并与基础URL拼接成完整链接
                        tag_href = book_base_url + a_tag.attrs.get('href')
                        # 将提取到的标签文本和链接添加到结果字典中
                        name_href_result[tag_text] = tag_href
    # 返回包含所有标签名称和对应链接的字典
    return name_href_result


def get_book_data_dagai(name, start):
    book_tag_base_url = 'https://book.douban.com/tag/' + name
    payload = {
        'start': start,
        'type': 'T'
    }
    response = get_request(book_tag_base_url, params=payload)
    if response is None:
        return None
    return response.text


def download_book_data_dagai(name, start):
    save_dir = '../douban/douban_book/douban_book_data_dagai/'
    file_name = f'douban_book_data_dagai_{name}_{start}.html'
    dagai_file_path = Path(save_dir + file_name)
    if dagai_file_path.exists() and dagai_file_path.is_file():
        print(f'===============================文件 {dagai_file_path} 已存在===============================')
    else:
        print(
            f'===============================文件 {dagai_file_path} 不存在，正在下载...===============================')
        content = get_book_data_dagai(name, start)
        if content is None:
            return None
        # 判断是否是最后一页
        soup = get_soup(content)
        p_tag = soup.select_one('#subject_list > p')
        if p_tag is not None:
            print(f"===============================分类 {name} 的网页爬取完成===============================")
            return True
        save_book_html_file(save_dir=save_dir, file_name=file_name, content=content)


def download_book_data_detail():
    save_dir = '../douban/douban_book/douban_book_data_detail/'
    dagai_dir = Path('../douban/douban_book/douban_book_data_dagai/')
    dagai_file_list = dagai_dir.rglob('*.html')
    for dagai_file in dagai_file_list:
        # Open the file in a with-block so the handle is closed after parsing
        with open(file=dagai_file, mode='r', encoding='utf-8') as fp:
            soup = get_soup(markup=fp)
        a_tag_list = soup.select('#subject_list > ul > li  h2 > a')
        for a_tag in a_tag_list:
            href = a_tag.attrs.get('href')
            book_id = href.split('/')[-2]
            file_name = f'douban_book_data_detail_{book_id}.html'
            detail_file_path = Path(save_dir + file_name)
            if detail_file_path.exists() and detail_file_path.is_file():
                print(f'===============================文件 {detail_file_path} 已存在===============================')
            else:
                print(
                    f'===============================文件 {detail_file_path} 不存在，正在下载...===============================')
                response = get_request(href)
                if response is None:
                    continue
                save_book_html_file(save_dir, file_name, response.text)


def print_in_rows(items, items_per_row=20):
    for index, name in enumerate(items, start=1):
        print(f'{name}', end=' ')
        if index % items_per_row == 0:
            print()


if __name__ == '__main__':
    download_book_tag()

    book_type = get_book_type_and_href()
    book_type_name = book_type.keys()
    print(book_type_name)
    for type_name in book_type_name:
        print(f'===============================图书分类标签：{type_name}===============================')
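        start_ = 0
        failures = 0
        while True:
            flag = download_book_data_dagai(type_name, start_)
            if flag:
                print(f'======================================图书分类标签 {type_name} 的大概html下载完成======================================')
                break
            # None is returned both when the page was saved (or already present) and when
            # the request failed; the saved file tells the two cases apart
            page_file = Path(f'../douban/douban_book/douban_book_data_dagai/douban_book_data_dagai_{type_name}_{start_}.html')
            if page_file.exists():
                failures = 0
                start_ = start_ + 20
            else:
                # Retry the same page, but give up on this tag after repeated failures
                failures += 1
                if failures >= 3:
                    print(f'Giving up on tag {type_name} at start={start_} after repeated failures')
                    break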
    download_book_data_detail()

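IV. Data Processing and Storage

1. Parsing the HTML and saving the data to a CSV file

BeautifulSoup is used to parse the information for a single book out of each HTML document. The parsed records for many books are accumulated in a loop and then saved to a CSV file.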

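1.1 Field descriptions

Field	Description
book_id	Unique identifier of the book.
title	Book title.
img_src	URL of the cover image.
author	Author name.
publisher	Publisher name.
producer	Producer (出品方), if any.
original_title	Original title (for translated works, the title in the original language).
translator	Translator name, if any.
publication_year	Publication date as listed on the page (for example 2024-11).
page_count	Number of pages.
price	List price.
binding	Binding type (for example paperback or hardcover).
series	Name of the series the book belongs to, if any.
isbn	International Standard Book Number.
rating	Average rating.
rating_sum	Number of people who rated the book.
comment_link	Link to the user comments.
stars5_starstop	Share of five-star ratings.
stars4_starstop	Share of four-star ratings.
stars3_starstop	Share of three-star ratings.
stars2_starstop	Share of two-star ratings.
stars1_starstop	Share of one-star ratings.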

1.2 Implementation

Every time 100 records have been parsed, they are saved to the CSV file.
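
Before the full script, the batching idea on its own can be sketched with the standard csv module. This is not the article's implementation, which uses pandas; the helper names `append_batch` and `save_in_batches` are assumptions for this example.

```python
import csv
from pathlib import Path

BATCH_SIZE = 100  # flush threshold described in the text


def append_batch(csv_path, rows, fieldnames):
    """Append rows to csv_path, writing the header only when the file is new or empty."""
    csv_path = Path(csv_path)
    write_header = not csv_path.exists() or csv_path.stat().st_size == 0
    with csv_path.open("a", newline="", encoding="utf-8") as fp:
        writer = csv.DictWriter(fp, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


def save_in_batches(records, csv_path, fieldnames, batch_size=BATCH_SIZE):
    """Consume an iterable of dicts and flush to CSV every batch_size records."""
    batch = []
    total = 0
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            append_batch(csv_path, batch, fieldnames)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        append_batch(csv_path, batch, fieldnames)
        total += len(batch)
    return total
```

Writing in batches keeps memory use bounded and means that an interruption loses at most the records of the batch in progress.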

from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup


def get_soup(markup):
    return BeautifulSoup(markup=markup, features='lxml')


def parse_detail_html_to_csv():
    # 定义CSV文件路径
    csv_file_dir = '../douban/douban_book/data_csv/'
    csv_file_name = 'douban_books.csv'
    csv_file_path = Path(csv_file_dir + csv_file_name)
    csv_file_dir_path = Path(csv_file_dir)
    csv_file_dir_path.mkdir(parents=True, exist_ok=True)

    detail_dir = Path('../douban/douban_book/douban_book_data_detail/')
    detail_file_list = detail_dir.rglob('*.html')

    book_data = []
    count = 0
    for detail_file in detail_file_list:
        book_id = str(detail_file).split('_')[-1].split('.')[0]
        # Open the file in a with-block so the handle is closed after parsing
        with open(file=detail_file, mode='r', encoding='utf-8') as fp:
            soup = get_soup(fp)
        title = soup.select_one('#wrapper > h1 > span').string
        tag_subjectwrap = soup.select_one('#content > div > div.article > div.indent > div.subjectwrap.clearfix')
        img_src = tag_subjectwrap.select_one('#mainpic > a > img').attrs.get('src')
        tag_info = tag_subjectwrap.select_one('div.subject.clearfix > #info')
        tag_author = tag_info.find(name='span', attrs={'class': 'pl'}, string=' 作者')
        if tag_author is None:
            author = ''
        else:
            author = tag_author.next_sibling.next_sibling.text.strip()
        tag_publisher = tag_info.find(name='span', attrs={'class': 'pl'}, string='出版社:')
        if tag_publisher is None:
            publisher = ''
        else:
            publisher = tag_publisher.next_sibling.next_sibling.text.strip()
        tag_producer = tag_info.find(name='span', attrs={'class': 'pl'}, string='出品方:')
        if tag_producer is None:
            producer = ''
        else:
            producer = tag_producer.next_sibling.next_sibling.text.strip()
        tag_original_title = tag_info.find(name='span', attrs={'class': 'pl'}, string='原作名:')
        if tag_original_title is None:
            original_title = ''
        else:
            original_title = tag_original_title.next_sibling.strip()
        tag_translator = tag_info.find(name='span', attrs={'class': 'pl'}, string=' 译者')
        if tag_translator is None:
            translator = ''
        else:
            translator = tag_translator.next_sibling.next_sibling.text.strip()
        tag_publication_year = tag_info.find(name='span', attrs={'class': 'pl'}, string='出版年:')
        if tag_publication_year is None:
            publication_year = ''
        else:
            publication_year = tag_publication_year.next_sibling.strip()
        tag_page_count = tag_info.find(name='span', attrs={'class': 'pl'}, string='页数:')
        if tag_page_count is None:
            page_count = ''
        else:
            page_count = tag_page_count.next_sibling.strip()
        tag_price = tag_info.find(name='span', attrs={'class': 'pl'}, string='定价:')
        if tag_price is None:
            price = ''
        else:
            price = tag_price.next_sibling.strip()
        tag_binding = tag_info.find(name='span', attrs={'class': 'pl'}, string='装帧:')
        if tag_binding is None:
            binding = ''
        else:
            binding = tag_binding.next_sibling.strip()
        tag_series = tag_info.find(name='span', attrs={'class': 'pl'}, string='丛书:')
        if tag_series is None:
            series = ''
        else:
            series = tag_series.next_sibling.next_sibling.text.strip()
        tag_isbn = tag_info.find(name='span', attrs={'class': 'pl'}, string='ISBN:')
        if tag_isbn is None:
            isbn = ''
        else:
            isbn = tag_isbn.next_sibling.strip()

        # 评分信息
        tag_rating_wrap_clearbox = tag_subjectwrap.select_one('#interest_sectl > div')
        # 评分
        tag_rating = (tag_rating_wrap_clearbox.select_one('#interest_sectl > div > div.rating_self.clearfix > strong'))
        if tag_rating is None:
            rating = ''
        else:
            rating = tag_rating.string.strip()
        # 评论人数
        tag_rating_sum = tag_rating_wrap_clearbox.select_one('#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span')
        if tag_rating_sum is None:
            rating_sum = ''
        else:
            rating_sum = tag_rating_sum.string.strip()
        # 评论链接
        comment_link = f'https://book.douban.com/subject/{book_id}/comments/'
        # 五星比例
        tag_stars5_starstop = tag_rating_wrap_clearbox.select_one('#interest_sectl > div > span.stars5.starstop')
        if tag_stars5_starstop is None:
            stars5_starstop = ''
        else:
            stars5_starstop = tag_stars5_starstop.next_sibling.next_sibling.next_sibling.next_sibling.text.strip()
        # 四星比例
        tag_stars4_starstop = tag_rating_wrap_clearbox.select_one('#interest_sectl > div > span.stars4.starstop')
        if tag_stars4_starstop is None:
            stars4_starstop = ''
        else:
            stars4_starstop = tag_stars4_starstop.next_sibling.next_sibling.next_sibling.next_sibling.text.strip()
        # Share of three-star ratings
        tag_stars3_starstop = tag_rating_wrap_clearbox.select_one('#interest_sectl > div > span.stars3.starstop')
        if tag_stars3_starstop is None:
            stars3_starstop = ''
        else:
            stars3_starstop = tag_stars3_starstop.next_sibling.next_sibling.next_sibling.next_sibling.text.strip()
        # Share of two-star ratings
        tag_stars2_starstop = tag_rating_wrap_clearbox.select_one('#interest_sectl > div > span.stars2.starstop')
        if tag_stars2_starstop is None:
            stars2_starstop = ''
        else:
            stars2_starstop = tag_stars2_starstop.next_sibling.next_sibling.next_sibling.next_sibling.text.strip()
        # Share of one-star ratings
        tag_stars1_starstop = tag_rating_wrap_clearbox.select_one('#interest_sectl > div > span.stars1.starstop')
        if tag_stars1_starstop is None:
            stars1_starstop = ''
        else:
            stars1_starstop = tag_stars1_starstop.next_sibling.next_sibling.next_sibling.next_sibling.text.strip()

        data_dict = {
            'book_id': book_id,
            'title': title,
            'img_src': img_src,
            'author': author,
            'publisher': publisher,
            'producer': producer,
            'original_title': original_title,
            'translator': translator,
            'publication_year': publication_year,
            'page_count': page_count,
            'price': price,
            'binding': binding,
            'series': series,
            'isbn': isbn,
            'rating': rating,
            'rating_sum': rating_sum,
            'comment_link': comment_link,
            'stars5_starstop': stars5_starstop,
            'stars4_starstop': stars4_starstop,
            'stars3_starstop': stars3_starstop,
            'stars2_starstop': stars2_starstop,
            'stars1_starstop': stars1_starstop
        }
        print(f'===========================File path: {detail_file}, parsed data: ===========================')
        print(data_dict)
        print('===========================================================')
        # Append the parsed record to the current batch
        book_data.append(data_dict)
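        count = count + 1
        # Write to the CSV file in batches of 100 records
        if count == 100:
            df = pd.DataFrame(book_data)
            if not csv_file_path.exists():
                df.to_csv(csv_file_dir + csv_file_name, index=False, encoding='utf-8-sig')
            else:
                df.to_csv(csv_file_dir + csv_file_name, index=False, encoding='utf-8-sig', mode='a', header=False)
            book_data = []
            count = 0

    # Write the final batch (fewer than 100 records) that is left over when the loop ends
    if book_data:
        df = pd.DataFrame(book_data)
        if not csv_file_path.exists():
            df.to_csv(csv_file_dir + csv_file_name, index=False, encoding='utf-8-sig')
        else:
            df.to_csv(csv_file_dir + csv_file_name, index=False, encoding='utf-8-sig', mode='a', header=False)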


if __name__ == '__main__':
    parse_detail_html_to_csv()

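Part of the output printed during execution is shown below:

[Image: console output printed while parsing]

The location and contents of the CSV file are shown below:

[Image: CSV file location and contents]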

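2. Data cleaning and storage

2.1 Data cleaning

Data cleaning is done with pandas, using the following rules.
Missing values: unless stated otherwise below, missing values are filled with 未知 ("unknown").
Date: missing values are filled with 1970-01-01; a missing month or day is filled with 01.
Page count: missing values are filled with 0.
Price: missing values are filled with 0.
Rating: missing values are filled with 0.
Number of ratings: missing values are filled with 0.
Star ratings: missing values are filled with 0.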
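
To make the rules above concrete, here is a minimal plain-Python sketch that applies them to a single record. It is only an illustration and not the pandas code used later in this article; the helper names `normalize_date` and `fill_defaults` and the constant names are hypothetical.

```python
import datetime
import re

# Fields that are filled with 0 when empty (names follow the CSV columns).
NUMERIC_FIELDS = {
    'page_count', 'price', 'rating', 'rating_sum',
    'stars5_starstop', 'stars4_starstop', 'stars3_starstop',
    'stars2_starstop', 'stars1_starstop',
}
DEFAULT_DATE = '1970-01-01'


def normalize_date(raw):
    """Return a YYYY-MM-DD string; a missing month or day becomes 01."""
    if not raw:
        return DEFAULT_DATE
    parts = re.findall(r'\d+', raw)
    # The first number must be a four-digit year.
    if not parts or len(parts[0]) != 4:
        return DEFAULT_DATE
    year = int(parts[0])
    month = int(parts[1]) if len(parts) > 1 else 1
    day = int(parts[2]) if len(parts) > 2 else 1
    try:
        return datetime.date(year, month, day).isoformat()
    except ValueError:
        # Impossible dates (for example month 13) fall back to the default.
        return DEFAULT_DATE


def fill_defaults(record):
    """Apply the fill rules to one record given as a dict of strings."""
    cleaned = {}
    for field, value in record.items():
        if field == 'publication_year':
            cleaned[field] = normalize_date(value)
        elif value not in (None, ''):
            cleaned[field] = value
        elif field in NUMERIC_FIELDS:
            cleaned[field] = 0
        else:
            cleaned[field] = '未知'
    return cleaned
```

For example, `fill_defaults({'author': '', 'price': '', 'publication_year': '2021-3'})` fills the author with 未知, the price with 0, and completes the date to 2021-03-01.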

2.2 Data storage

The cleaned data is saved to MySQL.

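2.2.1 Table design

The MySQL table below is designed from the fields of the parsed CSV data. Each column is listed with its data type and meaning.

Column	Data type	Description
book_id	INT	Unique identifier of the book.
title	VARCHAR(255)	Book title.
img_src	VARCHAR(255)	URL of the cover image.
author	VARCHAR(255)	Author name.
publisher	VARCHAR(255)	Publisher name.
producer	VARCHAR(255)	Producer or presenting imprint, if any.
original_title	VARCHAR(255)	Original title (for translated works, the title in the original language).
translator	VARCHAR(255)	Translator name, if any.
publication_year	DATE	Publication date.
page_count	INT	Number of pages.
price	DECIMAL(10, 2)	List price.
binding	VARCHAR(255)	Binding type (paperback, hardcover, etc.).
series	VARCHAR(255)	Series name, if any.
isbn	VARCHAR(20)	International Standard Book Number.
rating	DECIMAL(3, 1)	Average rating.
rating_sum	INT	Number of people who rated the book.
comment_link	VARCHAR(255)	Link to user comments.
stars5_starstop	DECIMAL(5, 2)	Share of five-star ratings (percent).
stars4_starstop	DECIMAL(5, 2)	Share of four-star ratings (percent).
stars3_starstop	DECIMAL(5, 2)	Share of three-star ratings (percent).
stars2_starstop	DECIMAL(5, 2)	Share of two-star ratings (percent).
stars1_starstop	DECIMAL(5, 2)	Share of one-star ratings (percent).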

2.2.2 Table implementation

Create the database douban.

create database douban;

Switch to the database douban.

use douban;

Create the table cleaned_douban_books to store the cleaned data.

CREATE TABLE cleaned_douban_books (
    book_id INT PRIMARY KEY,
    title VARCHAR(255),
    img_src VARCHAR(255),
    author VARCHAR(255),
    publisher VARCHAR(255),
    producer VARCHAR(255),
    original_title VARCHAR(255),
    translator VARCHAR(255),
    publication_year DATE,
    page_count INT,
    price DECIMAL(10, 2),
    binding VARCHAR(255),
    series VARCHAR(255),
    isbn VARCHAR(20),
    rating DECIMAL(3, 1),
    rating_sum INT,
    comment_link VARCHAR(255),
    stars5_starstop DECIMAL(5, 2),
    stars4_starstop DECIMAL(5, 2),
    stars3_starstop DECIMAL(5, 2),
    stars2_starstop DECIMAL(5, 2),
    stars1_starstop DECIMAL(5, 2)
);
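
If no MySQL server is at hand, the table definition can be dry-run against an in-memory SQLite database from the Python standard library. This is only a sketch under that assumption: SQLite accepts these type names but does not enforce lengths or precision the way MySQL does, and the sample row below is made up for illustration. The `insert_book` helper is hypothetical.

```python
import sqlite3

# Same column list as the MySQL table above.
SCHEMA = """
CREATE TABLE cleaned_douban_books (
    book_id INT PRIMARY KEY,
    title VARCHAR(255),
    img_src VARCHAR(255),
    author VARCHAR(255),
    publisher VARCHAR(255),
    producer VARCHAR(255),
    original_title VARCHAR(255),
    translator VARCHAR(255),
    publication_year DATE,
    page_count INT,
    price DECIMAL(10, 2),
    binding VARCHAR(255),
    series VARCHAR(255),
    isbn VARCHAR(20),
    rating DECIMAL(3, 1),
    rating_sum INT,
    comment_link VARCHAR(255),
    stars5_starstop DECIMAL(5, 2),
    stars4_starstop DECIMAL(5, 2),
    stars3_starstop DECIMAL(5, 2),
    stars2_starstop DECIMAL(5, 2),
    stars1_starstop DECIMAL(5, 2)
)
"""


def insert_book(conn, row):
    """Insert one row given as a dict of column name -> value."""
    columns = ', '.join(row)
    placeholders = ', '.join('?' for _ in row)
    conn.execute(
        f'INSERT INTO cleaned_douban_books ({columns}) VALUES ({placeholders})',
        tuple(row.values()),
    )


conn = sqlite3.connect(':memory:')
conn.execute(SCHEMA)

# Made-up sample record, not real Douban data.
sample = {'book_id': 1, 'title': 'Sample title', 'publication_year': '2020-05-01',
          'price': 45.0, 'rating': 8.5, 'rating_sum': 120}
insert_book(conn, sample)

# The primary key on book_id rejects a second row with the same id.
try:
    insert_book(conn, sample)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute('SELECT COUNT(*) FROM cleaned_douban_books').fetchone()[0]
```

The second insert fails because book_id is the primary key, which is why duplicate book_id values have to be removed during cleaning before the data is written.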

2.3 Code implementation

import re

import pandas as pd
from sqlalchemy import create_engine


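def read_csv_to_df(file_path):
    # Load the CSV file into a DataFrame. utf-8-sig matches the encoding used when the
    # file was written, and ISBNs are read as text so they are not converted to numbers.
    df = pd.read_csv(file_path, encoding='utf-8-sig', dtype={'isbn': str})
    return df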


def unify_date_format(date_str):
    # Missing values (NaN or None) are filled with the default date
    if pd.isna(date_str):
        return '1970-01-01'

    # Normalize Chinese-style dates, e.g. 2020年5月1日 -> 2020-5-1, 2020年5月 -> 2020-5, 2020年 -> 2020
    def preprocess_date(date_str):
        if isinstance(date_str, str):
            return date_str.replace('年', '-').replace('月', '-').replace('日', '').strip().rstrip('-')
        return date_str

    processed_date = preprocess_date(date_str)

    try:
        # pandas fills a missing month or day with 01 when parsing
        date_obj = pd.to_datetime(processed_date, errors='coerce')
    except Exception as e:
        print(f"Error parsing date '{date_str}': {e}")
        return '1970-01-01'

    # Values that cannot be parsed also fall back to the default date
    if pd.isna(date_obj):
        return '1970-01-01'

    # Return the date in the standard YYYY-MM-DD format
    return date_obj.strftime('%Y-%m-%d')


def clean_price(price_str):
    # Missing values are filled with 0
    if pd.isna(price_str):
        return 0
    # Values that pandas already parsed as numbers are kept
    if isinstance(price_str, (int, float)):
        return round(float(price_str), 2)
    if not isinstance(price_str, str):
        return 0

    # Keep only digits, decimal points and '/' (the separator between multiple prices)
    cleaned = re.sub(r'[^\d./]+', '', price_str)

    # A field may contain several prices; collect the first number of each part
    prices = []
    for part in cleaned.split('/'):
        sub_parts = re.findall(r'\d+\.\d+|\d+', part)
        if sub_parts:
            prices.append(float(sub_parts[0]))

    if not prices:
        return 0

    # Other strategies are possible; here the average is used, rounded to two decimals
    return round(sum(prices) / len(prices), 2)


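def clean_percentage(percentage_str):
    if pd.isna(percentage_str) or not isinstance(percentage_str, str):
        return 0
    # Strip the percent sign and any other non-numeric characters
    cleaned = re.sub(r'[^\d.]+', '', percentage_str)
    # Fall back to 0 when nothing numeric is left
    try:
        return round(float(cleaned), 2)
    except ValueError:
        return 0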


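def clean_page_count(page_str):
    # Values that pandas already parsed as numbers are kept (NaN becomes 0)
    if isinstance(page_str, (int, float)):
        return 0 if pd.isna(page_str) else int(page_str)
    if not isinstance(page_str, str) or not page_str.strip():
        return 0

    # Keep only digits and semicolons (half-width and full-width)
    cleaned = re.sub(r'[^\d;；]+', '', page_str)

    # Split multiple page counts on either kind of semicolon
    pages = [int(p) for p in re.split(r'[;；]', cleaned) if p]

    if not pages:
        return 0

    # Other strategies are possible; here the largest value is used
    return max(pages)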


# Clean the data and convert it to the target formats
def clean_and_transform(df):
    # Drop rows that share a book_id, keeping the first occurrence
    # (drop_duplicates returns a new DataFrame, so the result must be assigned)
    df = df.drop_duplicates(subset=['book_id']).copy()

    # Text fields: missing values are filled with '未知' ("unknown")
    text_columns = ['author', 'publisher', 'producer', 'original_title',
                    'translator', 'binding', 'series', 'isbn']
    for col in text_columns:
        df[col] = df[col].fillna('未知')

    # Date: missing values become 1970-01-01; a missing month or day becomes 01
    df['publication_year'] = df['publication_year'].apply(unify_date_format)

    df['page_count'] = df['page_count'].fillna(0)
    df['page_count'] = df['page_count'].apply(clean_page_count)
    df['page_count'] = df['page_count'].astype(int)
    df['price'] = df['price'].apply(clean_price)
    df['rating'] = df['rating'].fillna(0)
    df['rating_sum'] = df['rating_sum'].fillna(0)
    df['rating_sum'] = df['rating_sum'].astype(int)

    # Star ratings: strip the percent sign and convert to numbers
    star_columns = ['stars5_starstop', 'stars4_starstop', 'stars3_starstop',
                    'stars2_starstop', 'stars1_starstop']
    for col in star_columns:
        df[col] = df[col].apply(clean_percentage)

    return df


def save_df_to_db(df):
    # Database connection settings
    db_user = 'root'
    db_password = 'zxcvbq'
    db_host = '127.0.0.1'  # or the address of your database host
    db_port = '3306'  # the default MySQL port is 3306
    db_name = 'douban'

    # Create the database engine
    engine = create_engine(f'mysql+mysqlconnector://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}')
    # Write the DataFrame to the MySQL table
    df.to_sql(name='cleaned_douban_books', con=engine, if_exists='append', index=False)
    print("The CSV data has been cleaned and written to the MySQL database")


if __name__ == '__main__':
    csv_file = r'..\douban\douban_book\data_csv\douban_books.csv'
    df = read_csv_to_df(csv_file)
    df = clean_and_transform(df)
    save_df_to_db(df)
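
Because `to_sql` is called with `if_exists='append'` and the table has a primary key on book_id, running the script a second time on the same CSV file would hit a duplicate-key error. One possible way around this, which is not part of the original code, is to write rows with MySQL's `INSERT ... ON DUPLICATE KEY UPDATE`. The sketch below only builds the statement text; the helper name `build_upsert_sql` is hypothetical, and executing the statement (for example with a cursor's `executemany`) is left out.

```python
def build_upsert_sql(table, columns, key='book_id'):
    """Build a MySQL upsert statement with %s placeholders.

    Every column except the key is overwritten when the key already exists.
    """
    col_list = ', '.join(columns)
    placeholders = ', '.join(['%s'] * len(columns))
    updates = ', '.join(f'{c} = VALUES({c})' for c in columns if c != key)
    return (f'INSERT INTO {table} ({col_list}) VALUES ({placeholders}) '
            f'ON DUPLICATE KEY UPDATE {updates}')


# Example with two columns of the table used in this article
sql = build_upsert_sql('cleaned_douban_books', ['book_id', 'title'])
```

With this approach a re-run updates existing books instead of failing, at the cost of bypassing the convenience of `to_sql`.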

View the book data in the cleaned_douban_books table:

select * from cleaned_douban_books limit 10;

[Image: first 10 rows of the cleaned_douban_books table]
