当前位置：首页 > article >正文

静态网页的爬虫（以电影天堂为例）

article 2025/3/11 13:02:55

一、电影天堂的网址（url）

电影天堂_免费电影_迅雷电影下载_电影天堂网最好的迅雷电影下载网，分享最新电影，高清电影、综艺、动漫、电视剧等下载！https://dydytt.net/index.htm

我们要爬取这个页面上的内容

二、代码详情解释和介绍功能中所用到的知识点

这个代码涵盖了以下主要知识点：

HTTP 请求与响应处理。
HTML 解析与 XPath 使用。
字符串处理与数据格式化。
异常处理与程序健壮性。
列表、字典与数据结构操作。
编码处理与国际化支持。
请求间隔与反爬虫策略。
函数定义与模块化编程。
日志输出与调试技巧。

import requests
from lxml import etree
import time
import warnings

# 禁用 SSL 警告
warnings.filterwarnings("ignore", category=requests.packages.urllib3.exceptions.InsecureRequestWarning)

# 请求头，模拟浏览器访问
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Referer': 'https://dydytt.net/',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
}

# 基础域名
BASE_DOMAIN = "https://dydytt.net"


def get_detail_urls(url):
    """
    获取电影详情页的 URL
    """
    try:
        # 发送 HTTP 请求
        response = requests.get(url, headers=HEADERS, verify=False)
        response.raise_for_status()  # 检查请求是否成功
        # 解码响应内容（电影天堂使用 gbk 编码）
        text = response.content.decode("gbk", "ignore")
        # 使用 lxml 解析 HTML
        html = etree.HTML(text)
        # 提取电影详情页的链接
        detail_urls = html.xpath("//div[@class='co_content8']//table//a/@href")
        # 将相对路径转换为完整 URL
        detail_urls = [BASE_DOMAIN + url if url.startswith("/") else BASE_DOMAIN + "/" + url for url in detail_urls]
        return detail_urls
    except Exception as e:
        print(f"Error fetching detail URLs: {e}")
        return []


def parse_detail(url):
    """
    解析电影详情页，提取电影信息
    """
    try:
        # 发送 HTTP 请求
        response = requests.get(url, headers=HEADERS, verify=False)
        response.raise_for_status()  # 检查请求是否成功
        # 解码响应内容
        text = response.content.decode("gbk", "ignore")
        # 使用 lxml 解析 HTML
        html = etree.HTML(text)

        # 提取电影标题
        title_elements = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")
        title = title_elements[0] if title_elements else "未知标题"

        # 提取电影图片
        image_elements = html.xpath("//div[@id='Zoom']//img/@src")
        image = image_elements[0] if image_elements else "未知图片"

        # 提取电影信息（年份、评分、导演、主演、简介等）
        infos = html.xpath("//div[@id='Zoom']//text()")
        movie_info = {}
        for info in infos:
            info = info.strip()
            if info.startswith("◎年 代"):
                movie_info["year"] = info.replace("◎年 代", "").strip()
            elif info.startswith("◎豆瓣评分"):
                movie_info["score"] = info.replace("◎豆瓣评分", "").strip()
            elif info.startswith("◎片 长"):
                movie_info["duration"] = info.replace("◎片 长", "").strip()
            elif info.startswith("◎导 演"):
                movie_info["director"] = info.replace("◎导 演", "").strip()
            elif info.startswith("◎主 演"):
                movie_info["actors"] = info.replace("◎主 演", "").strip()
            elif info.startswith("◎简 介"):
                movie_info["profile"] = info.replace("◎简 介", "").strip()

        # 提取下载链接
        downurl_elements = html.xpath("//td[@bgcolor='#fdfddf']/a/@href")
        downurl = downurl_elements[0] if downurl_elements else "未知下载链接"

        # 返回电影信息
        return {
            "title": title,
            "image": image,
            **movie_info,
            "downurl": downurl,
        }
    except Exception as e:
        print(f"Error parsing detail page: {e}")
        return None


def spider():
    """
    主爬虫函数
    """
    # 基础 URL（电影列表页）
    base_url = "https://dydytt.net/index.htm"
    movies = []

    # 获取电影详情页的 URL
    print("Fetching movie list from:", base_url)
    detail_urls = get_detail_urls(base_url)

    # 遍历每个详情页，提取电影信息
    for detail_url in detail_urls:
        print("Fetching movie details from:", detail_url)
        time.sleep(5)  # 增加请求间隔，避免被封禁
        movie = parse_detail(detail_url)
        if movie:
            movies.append(movie)

    # 打印所有电影信息
    for movie in movies:
        print(movie)


if __name__ == '__main__':
    spider()

代码爬虫成功标志：

这个代码涉及多个 Python 编程和网络爬虫的知识点。以下是代码中用到的关键知识点及其作用：

1. HTTP 请求 (`requests` 库)

作用：发送 HTTP 请求，获取网页内容。
知识点：
- requests.get(url, headers=headers, verify=False)：发送 GET 请求，获取网页内容。
- response.content：获取响应的二进制内容。
- response.text：获取响应的文本内容（自动解码）。
- response.raise_for_status()：检查请求是否成功，失败时抛出异常。

2. HTML 解析 (`lxml` 和 `etree`)

作用：解析 HTML 文档，提取所需的数据。
知识点：
- etree.HTML(text)：将 HTML 文本转换为可解析的 etree 对象。
- html.xpath("XPath表达式")：使用 XPath 表达式定位 HTML 元素。
- 示例：html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()") 提取电影标题。

3. XPath 表达式

作用：用于定位 HTML 文档中的特定元素。
知识点：
- //：从根节点开始查找。
- [@属性名='值']：根据属性值筛选元素。
- /text()：提取元素的文本内容。
- /@属性名：提取元素的属性值。
- 示例：//div[@class='co_content8']//table//a/@href 提取电影详情页的链接。

4. 字符串处理

作用：清理和格式化提取的数据。
知识点：
- str.strip()：去除字符串两端的空白字符。
- str.replace(old, new)：替换字符串中的部分内容。
- 示例：info.replace("◎年代", "").strip() 提取电影年份。

5. 异常处理 (`try-except`)

作用：捕获和处理程序运行中的异常，避免程序崩溃。
知识点：
- try：尝试执行可能出错的代码。
- except：捕获异常并处理。
- 示例：捕获 requests.get 的异常，防止网络请求失败导致程序崩溃。

6. 列表和字典操作

作用：存储和处理提取的数据。
知识点：
- list.append(item)：向列表中添加元素。
- dict[key] = value：向字典中添加键值对。
- 示例：movies.append({"title": title, "href": href}) 将电影信息存储到列表中。

7. 编码处理

作用：正确处理网页的编码问题。
知识点：
- response.content.decode("gbk", "ignore")：将二进制内容解码为文本，忽略无法解码的字符。
- 电影天堂使用 gbk 编码，必须正确解码才能获取可读的文本内容。

8. 请求间隔 (`time.sleep`)

作用：避免频繁请求导致被封禁。
知识点：
- time.sleep(seconds)：暂停程序执行指定的秒数。
- 示例：time.sleep(5) 每次请求后暂停 5 秒。

9. 禁用 SSL 验证

作用：忽略 SSL 证书验证，避免 HTTPS 请求失败。
知识点：
- requests.get(verify=False)：禁用 SSL 验证。
- warnings.filterwarnings("ignore", category=InsecureRequestWarning)：禁用 SSL 警告。

10. 函数定义与调用

作用：将代码模块化，提高可读性和复用性。
知识点：
- def function_name(parameters):：定义函数。
- function_name(arguments)：调用函数。
- 示例：get_detail_urls(url) 获取电影详情页的 URL。

11. 循环与条件判断

作用：遍历数据并根据条件执行不同的操作。
知识点：
- for item in list:：遍历列表中的每个元素。
- if condition:：根据条件执行代码。
- 示例：遍历电影详情页的 URL 并解析每个页面。

12. 日志输出 (`print`)

作用：打印程序运行状态和结果，便于调试。
知识点：
- print(message)：输出信息到控制台。
- 示例：print("Fetching movie list from:", base_url) 打印当前正在处理的 URL。

13. 模块导入 (`import`)

作用：引入外部库或模块，扩展 Python 的功能。
知识点：
- import requests：导入 requests 库。
- from lxml import etree：从 lxml 库中导入 etree 模块。
- import time：导入 time 模块。

14. URL 拼接

作用：将相对路径转换为完整 URL。
知识点：
- BASE_DOMAIN + url：拼接基础域名和相对路径。
- 示例：BASE_DOMAIN + "/" + url 处理相对路径。

15. 默认值处理

作用：在数据缺失时提供默认值，避免程序崩溃。
知识点：
- list[0] if list else "默认值"：如果列表不为空，返回第一个元素，否则返回默认值。
- 示例：title_elements[0] if title_elements else "未知标题" 处理电影标题缺失的情况。