Advanced Python Web Scraping: Bypassing Anti-Scraping Measures and Scraping Diverse Data

article 2025/3/4 6:05:02

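Over the previous three days, we covered the basics of distributed crawlers and large-scale data processing. This post goes further into more advanced scraping techniques: applying anti-scraping bypass strategies in more depth, scraping diverse kinds of data (images, videos, and form data), and pulling data from APIs and combining it with scrapers in practical scenarios.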

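I. Advanced Anti-Scraping Bypass Techniques

1. Simulating Dynamic Requests with Playwright

Selenium is a powerful tool for scraping dynamic pages, but Playwright offers a more modern API and generally better performance.

Installing Playwright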

pip install playwright
playwright install

Basic usage

The following example uses Playwright to scrape a dynamic page:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # launch a headless browser
    page = browser.new_page()
    page.goto("https://example.com")
    content = page.content()  # HTML after dynamic content has loaded
    print(content)
    browser.close()

Playwright supports several browser engines (Chromium, Firefox, and WebKit) and can operate directly on DOM elements.
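As a rough sketch of the DOM access mentioned above, the helper below waits for a selector to appear and then reads the text of every matching element. The function name `fetch_texts`, the `SUPPORTED_BROWSERS` tuple, the selector, and the timeout value are illustrative assumptions and are not part of the original example.

```python
SUPPORTED_BROWSERS = ("chromium", "firefox", "webkit")


def fetch_texts(url, selector, browser_name="chromium", timeout_ms=10_000):
    """Return the inner text of every element matching `selector` on `url`."""
    if browser_name not in SUPPORTED_BROWSERS:
        raise ValueError(f"unsupported browser: {browser_name!r}")

    # Imported lazily so the argument check above works without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # p.chromium, p.firefox and p.webkit all expose the same launch() method
        browser = getattr(p, browser_name).launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            # Block until at least one matching element is visible
            page.wait_for_selector(selector, timeout=timeout_ms)
            return page.locator(selector).all_inner_texts()
        finally:
            browser.close()
```

Calling `fetch_texts("https://example.com", "h1")` would return the heading text of that page. Passing `browser_name="firefox"` or `"webkit"` runs the same code against a different engine.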

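2. Keeping Cookies and Sessions

Many sites require users to log in before certain content becomes accessible. In that case, we can use cookies or a session to simulate a logged-in state.

Simulating login with Requests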

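import requests

# Simulate the login request
login_url = "https://example.com/login"
data = {
    "username": "your_username",
    "password": "your_password"
}
session = requests.Session()
login_response = session.post(login_url, data=data)
login_response.raise_for_status()  # stop early if the login request failed

# Visit other pages after logging in; the session resends the login cookies
response = session.get("https://example.com/profile")
print(response.text)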

Importing browser cookies

If the login flow is complicated, log in with a browser first, export the cookies, and load them into the scraper:

import requests

cookies = {
    "sessionid": "your_session_cookie_value",
    # add more cookies here
}

response = requests.get("https://example.com/profile", cookies=cookies)
print(response.text)
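Browser developer tools usually let you copy the whole `Cookie` request header as a single string. A minimal sketch of turning that string into the dict shown above follows; the helper name `parse_cookie_header` and the sample cookie values are made up for illustration.

```python
def parse_cookie_header(header):
    """Convert a copied Cookie header such as 'a=1; b=2' into a dict."""
    cookies = {}
    for pair in header.split(";"):
        # Split on the first '=' only, since cookie values may contain '='
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


raw = "sessionid=abc123; csrftoken=xyz789; theme=dark"
cookies = parse_cookie_header(raw)
print(cookies)
```

The resulting dict can be passed directly as the `cookies=` argument of `requests.get`.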

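3. Getting Past CAPTCHAs

CAPTCHAs are a common anti-scraping measure. They can be handled in the following ways:

Manual entry: save the CAPTCHA image and prompt the user to type it in.
OCR: use an OCR tool such as Tesseract.
CAPTCHA-solving services: use a third-party service to recognize it automatically.

Using Tesseract OCR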

pip install pytesseract pillow

from PIL import Image
import pytesseract

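# Recognize the CAPTCHA (pytesseract is a wrapper; the Tesseract engine must be installed separately)
image = Image.open("captcha.png")
captcha_text = pytesseract.image_to_string(image).strip()  # drop surrounding whitespace
print("Recognized CAPTCHA:", captcha_text)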
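OCR output often contains stray spaces, line breaks, or punctuation. The sketch below normalizes the recognized text before it is submitted. It assumes the CAPTCHA consists of ASCII letters and digits with a known length (4 here); both are assumptions about the target site, and `clean_captcha_text` is a hypothetical helper name.

```python
import re


def clean_captcha_text(raw_text, expected_length=4):
    """Keep only ASCII letters and digits; return None if the length looks wrong."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", raw_text)
    if len(cleaned) != expected_length:
        return None  # likely a misread; the caller can retry or ask the user
    return cleaned


print(clean_captcha_text(" a7 Kq\n"))
print(clean_captcha_text("a7K"))
```

Returning `None` on a length mismatch gives the caller a simple signal to refresh the CAPTCHA or fall back to manual entry.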

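II. Scraping Diverse Kinds of Data

1. Downloading Images

To scrape images from a site, first extract the src attribute of each <img> tag, then download the image from that URL.

Example code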

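import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

url = "https://example.com/gallery"
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Create the directory for saved images
os.makedirs("images", exist_ok=True)

# Download each image
for img in soup.find_all("img"):
    src = img.get("src")
    if not src:
        continue  # skip <img> tags without a src attribute
    img_url = urljoin(url, src)  # resolve relative paths against the page URL
    img_data = requests.get(img_url, timeout=10).content
    # Build the file name from the URL path so query strings are ignored
    file_name = os.path.basename(urlparse(img_url).path) or "image"
    img_name = os.path.join("images", file_name)
    with open(img_name, "wb") as f:
        f.write(img_data)
    print(f"Downloaded image: {img_name}")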
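Image URLs on real pages often carry query strings, percent-encoded characters, or repeated file names, any of which can produce broken or overwritten files. The following is a minimal sketch of deriving a safe, unique file name from a URL. The helper name `image_filename` and the sample URLs are illustrative assumptions.

```python
import os
import re
from urllib.parse import unquote, urlparse


def image_filename(img_url, used_names):
    """Derive a safe, unique file name from an image URL."""
    # Use only the URL path so query strings such as ?w=200 are dropped
    name = os.path.basename(unquote(urlparse(img_url).path)) or "image"
    # Replace characters that are awkward in file names
    name = re.sub(r"[^\w.\-]", "_", name)
    stem, ext = os.path.splitext(name)
    candidate, counter = name, 1
    while candidate in used_names:  # avoid overwriting an earlier download
        candidate = f"{stem}_{counter}{ext}"
        counter += 1
    used_names.add(candidate)
    return candidate


used = set()
for u in ["https://example.com/a/cat.jpg?w=200",
          "https://example.com/b/cat.jpg",
          "https://example.com/gallery/"]:
    print(image_filename(u, used))
```

The `used_names` set is shared across calls, so two different URLs that end in the same file name are saved side by side instead of one replacing the other.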

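2. Scraping Videos

Videos are usually loaded through a <video> tag. They may also be split into segments that are listed in an M3U8 playlist.

Downloading an M3U8 video

Use ffmpeg to download and merge the segments: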

ffmpeg -i "https://example.com/video.m3u8" -c copy output.mp4

Automating the download with Python

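import subprocess

m3u8_url = "https://example.com/video.m3u8"
# Pass the arguments as a list so the URL is never interpreted by the shell
subprocess.run(["ffmpeg", "-i", m3u8_url, "-c", "copy", "output.mp4"], check=True)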
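ffmpeg reads the playlist and fetches the segments internally. To see what an M3U8 media playlist actually contains, the sketch below lists the segment URLs it references. The sample playlist text is hand-written for illustration, `list_segments` is a hypothetical helper, and the code assumes a media playlist (a master playlist would list variant playlists instead of segments).

```python
from urllib.parse import urljoin


def list_segments(playlist_text, playlist_url):
    """Return absolute URLs for every media segment in an M3U8 playlist."""
    segments = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments; the rest are URIs
        if line and not line.startswith("#"):
            segments.append(urljoin(playlist_url, line))
    return segments


sample = """#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg0.ts
#EXTINF:10.0,
seg1.ts
#EXT-X-ENDLIST
"""
print(list_segments(sample, "https://example.com/video/index.m3u8"))
```

Relative segment names are resolved against the playlist URL, which is how a player or downloader locates each piece.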

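3. Scraping Form Data

To scrape form data, capture the request (for example, in the browser's network panel), analyze its parameters, and then simulate the form submission.

Simulating a form submission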

import requests

form_url = "https://example.com/search"
data = {
    "query": "Python web scraping",
    "category": "articles"
}

response = requests.post(form_url, data=data)
print(response.text)
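Forms often include hidden inputs, such as a CSRF token, that must be sent back along with the visible fields. The sketch below collects every named input from a form using only the standard library. The sample HTML, the `csrf_token` field name, and the `FormFieldParser` class are illustrative assumptions.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect name/value pairs from <input> tags, including hidden ones."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("name"):  # inputs without a name are not submitted
            self.fields[attr["name"]] = attr.get("value") or ""


sample_html = """
<form action="/search" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="query">
  <input type="submit" value="Search">
</form>
"""
parser = FormFieldParser()
parser.feed(sample_html)
data = parser.fields
data["query"] = "Python web scraping"  # fill in the visible field
print(data)
```

The resulting `data` dict has the same shape as the one passed to `requests.post` above, with the hidden fields already filled in.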

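4. Scraping API Data

An API is usually the most efficient way to get data. Inspecting a site's network requests often reveals the API endpoints it uses.

Example: fetching weather data from an API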

import requests

api_url = "https://api.openweathermap.org/data/2.5/weather"
params = {
    "q": "London",
    "appid": "your_api_key"
}

response = requests.get(api_url, params=params)
weather_data = response.json()
print(weather_data)
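A JSON response is usually nested, and individual keys may be missing when a request fails or a field is unavailable. The sketch below pulls a few fields out of a payload defensively. The sample dict is hand-written and only assumed to resemble the service's response layout, and `summarize_weather` is a hypothetical helper. Note that this API reports temperature in Kelvin unless a `units` parameter such as `"metric"` is added to the request.

```python
def summarize_weather(payload):
    """Pull a few fields out of a weather payload, tolerating missing keys."""
    main = payload.get("main", {})
    weather = payload.get("weather") or [{}]
    return {
        "city": payload.get("name"),
        "temp": main.get("temp"),
        "description": weather[0].get("description"),
    }


# Hand-written sample, not a real API response
sample = {
    "name": "London",
    "main": {"temp": 284.6, "humidity": 80},
    "weather": [{"main": "Clouds", "description": "overcast clouds"}],
}
print(summarize_weather(sample))
print(summarize_weather({}))  # missing keys become None instead of raising
```

Using `.get` with defaults keeps the scraper running when one record is incomplete, so a single bad response does not abort a long crawl.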

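III. Putting It Together: Scraping Product Data from an E-commerce Site

The following is a complete example that scrapes product information (names, prices, and images) from an e-commerce site.

Example code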

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import os
import time
import requests

# Set up WebDriver (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(service=Service(executable_path="path/to/chromedriver"))
driver.get("https://example-ecommerce.com")

time.sleep(2)  # wait for the page to load

# Create the output directory
os.makedirs("product_images", exist_ok=True)

# Scrape product information
products = driver.find_elements(By.CLASS_NAME, "product-item")
for product in products:
    name = product.find_element(By.CLASS_NAME, "product-title").text
    price = product.find_element(By.CLASS_NAME, "product-price").text
    img_url = product.find_element(By.TAG_NAME, "img").get_attribute("src")

    # Download the image
    img_data = requests.get(img_url).content
    img_name = os.path.join("product_images", img_url.split("/")[-1])
    with open(img_name, "wb") as f:
        f.write(img_data)

    print(f"Product: {name}, price: {price}, image saved: {img_name}")

driver.quit()
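The example above prints each price as raw text. For later analysis it usually helps to convert prices to numbers and store the records in a structured file. The following is a minimal sketch under stated assumptions: prices use a comma as the thousands separator and a dot as the decimal point, and the helper names `parse_price` and `to_csv`, along with the sample rows, are made up for illustration.

```python
import csv
import io
import re
from decimal import Decimal


def parse_price(text):
    """Extract a numeric price from text such as '¥1,299.00'; None if absent."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if not match:
        return None
    return Decimal(match.group().replace(",", ""))


def to_csv(rows):
    """Serialize product dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price", "image"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"name": "Sample Phone", "price": parse_price("¥1,299.00"), "image": "phone.jpg"},
    {"name": "Sample Case", "price": parse_price("$15"), "image": "case.jpg"},
]
print(to_csv(rows))
```

`Decimal` is used instead of `float` so that prices keep their exact decimal digits when written out.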