[Python Data Processing] Saving Web Pages

article 2025/2/19 6:48:21

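When processing web page content, the intuitive first step for many people is to save the page as HTML and then run text analysis on it. That is not enough, though: a page may embed images, and in the HTML each image is only a link, so it cannot be restored during offline processing and the image information is effectively lost. A better approach is to save the entire page in one go.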

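Path one: save the page as mhtml, then convert it to images. No ready-made tool does this conversion directly. An mhtml file can, however, be opened as a Word document simply by changing its extension to doc, and a separate tool can then convert the doc to images.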

Path two: save the page as mhtml, then use chromedriver to save it as a PDF, then convert the PDF to images.

Path three: use chromedriver to save the page directly as a PDF, then convert the PDF to images.
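Paths one and two both start from an mhtml file. As a minimal sketch of that first step, the function below asks Chrome for an mhtml snapshot through the DevTools Protocol command `Page.captureSnapshot`. The helper name `save_mhtml` and its parameters are illustrative assumptions and do not come from the original script; `driver` is assumed to be a Selenium Chrome WebDriver that has already loaded the page. The full script that follows takes path three instead.

```python
def save_mhtml(driver, output_path):
    """Save the page currently loaded in `driver` as a single mhtml file.

    Assumption: `driver` is a Selenium Chrome WebDriver, which exposes
    execute_cdp_cmd(). Drivers for other browsers may not support this call.
    """
    # Page.captureSnapshot returns the serialized page as text under "data".
    snapshot = driver.execute_cdp_cmd("Page.captureSnapshot", {"format": "mhtml"})
    # newline="" leaves the line endings of the MIME payload untouched.
    with open(output_path, "w", encoding="utf-8", newline="") as fout:
        fout.write(snapshot["data"])
    return output_path
```

A typical call would be `save_mhtml(driver, "page.mhtml")` right after `driver.get(url)`. Unlike `Page.printToPDF`, the returned data is plain text rather than base64, so no decoding step is needed.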

import os
import csv
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import base64

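def get_url_list(csv_path):
    """Read (title, url) rows from a GBK-encoded CSV file."""
    content_list = []
    # newline="" is what the csv module expects for files passed to csv.reader
    with open(csv_path, "r", encoding="gbk", newline="") as fin:
        csv_reader = csv.reader(fin)
        for line in csv_reader:
            if line:  # skip blank rows, which would break the unpacking below
                content_list.append(line)
    title_list, url_list = list(zip(*content_list))
    return title_list, url_list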

if __name__ == "__main__":
    url_file_path = "title_url.csv"
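    driver_location = 'absolute path to chromedriver.exe'
    service = Service(driver_location)
    # Create the Chrome options
    options = Options()
    # Headless mode: run without a browser window
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")

    # Pass both by keyword so the call does not depend on positional order
    driver = webdriver.Chrome(options=options, service=service)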

    # PDF options
    pdf_options = {
        # 'paperWidth': 33.1,  # paper width, in inches
        # 'paperHeight': 46.8,  # paper height, in inches
        'printBackground': True,  # whether to print background graphics
        'landscape': False  # whether to print in landscape orientation
    }

    title_list, url_list = get_url_list(url_file_path)
    # Create the output directory up front; open() does not create it
    os.makedirs("pdf_output", exist_ok=True)
    for i, url_path in enumerate(url_list):
        driver.get(url_path)  # Open the page

        # Print the page to PDF through the Chrome DevTools Protocol
        pdf_data = driver.execute_cdp_cmd('Page.printToPDF', pdf_options)

        # Decode the base64 payload and save the PDF file
        pdf_content = base64.b64decode(pdf_data['data'])
        cur_title = title_list[i]
        cur_title = cur_title.replace("/", "_").replace("\\", "_")
        output_path = "pdf_output/" + cur_title + ".pdf"
        print(output_path)
        try:
            with open(output_path, 'wb') as file:
                file.write(pdf_content)
        except OSError as err:
            # Report the failure and move on to the next URL
            print("fail", output_path, err)

    # Shut down the WebDriver
    driver.quit()
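
The script above only replaces `/` and `\` in a title before using it as a file name, so a title containing characters such as `?`, `:` or `|` can still make `open()` fail on Windows. A possible hardening, sketched here with a hypothetical helper `safe_filename` (the name, the `fallback` default and the `max_length` limit are assumptions, not part of the original script), replaces every character Windows rejects:

```python
import re

# Characters Windows does not allow in file names, plus ASCII control characters.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title, fallback="untitled", max_length=150):
    """Turn a page title into a string that is usable as a file name."""
    # Replace each forbidden character with an underscore.
    cleaned = _INVALID_CHARS.sub("_", title)
    # Windows also rejects names that end in a space or a dot.
    cleaned = cleaned.strip(" .")
    # Keep the name well below common path-length limits.
    cleaned = cleaned[:max_length].rstrip(" .")
    # An empty result would produce a file called ".pdf", so use a fallback.
    return cleaned or fallback
```

In the loop above, `cur_title = safe_filename(title_list[i])` could then stand in for the two chained `replace` calls. Two different titles can still map to the same file name, so appending the loop index `i` may be worthwhile when titles are not unique.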
