Which method should you use to scrape web page data into JSON with Python?

article 2025/2/6 10:12:01

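How to Scrape Web Page Data with Python and Save It as JSON

Scraping web page data and saving it as JSON is a common task in Python. The sections below walk through several commonly used approaches, each with a concrete code example.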

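Method 1: Using the `requests` and `json` modules

`requests` is a widely used HTTP library for sending network requests. Combined with the standard-library `json` module, it makes saving scraped data as JSON straightforward.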

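import json

import requests

# Fetch the page; fail fast on network stalls and HTTP errors
url = "https://en.wikipedia.org/wiki/Web_scraping"
response = requests.get(url, timeout=10)
response.raise_for_status()

# Wrap the raw HTML in a dict so the JSON file records where it came from
data = {"url": url, "status_code": response.status_code, "html": response.text}

# Save the data as JSON
with open("data.json", "w", encoding="utf-8") as file:
    json.dump(data, file, ensure_ascii=False, indent=4)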
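
The `ensure_ascii=False` and `indent=4` arguments passed to `json.dump` are easy to overlook. As a minimal offline sketch (the sample record is made up for illustration), the following compares the default escaping with the readable output and checks that both decode back to the same data:

```python
import json

# Hypothetical record standing in for scraped data
record = {"title": "网页抓取", "url": "https://en.wikipedia.org/wiki/Web_scraping"}

# Default: non-ASCII characters are written as \uXXXX escape sequences
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters readable; indent=4 pretty-prints
readable = json.dumps(record, ensure_ascii=False, indent=4)

print(escaped)
print(readable)

# Both forms decode back to the same Python object
assert json.loads(escaped) == json.loads(readable) == record
```

Because `ensure_ascii=False` writes non-ASCII characters directly, the output file should be opened with an explicit `encoding="utf-8"`, as the examples in this article do.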

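Method 2: Using `BeautifulSoup` and the `json` module

BeautifulSoup is a library for parsing HTML and XML documents. Combined with the `json` module, it lets you extract specific parts of a page and save them as JSON.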

import json

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse the HTML
url = "https://en.wikipedia.org/wiki/Web_scraping"
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Extract specific content: every heading from h1 to h5
titles = []
for tag in soup.find_all(["h1", "h2", "h3", "h4", "h5"]):
    titles.append({"tag": tag.name, "text": tag.text.strip()})

# Save the data as JSON
with open("titles.json", "w", encoding="utf-8") as file:
    json.dump(titles, file, ensure_ascii=False, indent=4)
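
To see what the heading extraction produces without touching the network, here is a rough offline sketch of the same idea. It is not BeautifulSoup: it uses the standard-library `html.parser` so it runs with nothing installed, and the `HeadingCollector` class and the inline HTML sample are hypothetical names invented for illustration.

```python
import json
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5"}


class HeadingCollector(HTMLParser):
    """Collect {"tag", "text"} records for h1-h5 elements."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # name of the heading tag being read, if any
        self._parts = []      # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS and self._current is None:
            self._current = tag
            self._parts = []

    def handle_data(self, data):
        if self._current is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._parts).strip()
            self.headings.append({"tag": tag, "text": text})
            self._current = None


# Made-up HTML standing in for a downloaded page
SAMPLE_HTML = "<h1>Web scraping</h1><p>Intro</p><h2>History <i>and</i> use</h2>"

collector = HeadingCollector()
collector.feed(SAMPLE_HTML)
collector.close()

headings_json = json.dumps(collector.headings, ensure_ascii=False, indent=4)
print(headings_json)
```

The resulting list has the same shape as `titles` in the BeautifulSoup example, so it can be written to a file with `json.dump` in the same way.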

Method 3: Using `Scrapy` and the `json` module

Scrapy is a powerful crawling framework suited to large-scale scraping jobs. Below is a simple Scrapy spider that fetches data and saves it as JSON.

import json

import scrapy


class BoredSpider(scrapy.Spider):
    name = "bored"

    def start_requests(self):
        # Request the API endpoint and hand the response to parse()
        url = "https://www.boredapi.com/api/activity"
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Decode the JSON body and keep only the fields of interest
        data = json.loads(response.text)
        yield {
            "Activity": data["activity"],
            "Type": data["type"],
            "Participants": data["participants"],
        }

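Run the spider and save the results to a JSON file (`-o` appends to an existing file, so remove any old `output.json` first; recent Scrapy versions also accept `-O` to overwrite):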

scrapy runspider <file_name> -o output.json
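
The transformation inside `parse` can be tried without running a crawl. The sketch below is only an illustration: `extract_item` is a hypothetical helper, and the sample body uses invented values with the same field names the spider reads.

```python
import json


def extract_item(body: str) -> dict:
    """Turn a JSON response body into the item the spider yields."""
    data = json.loads(body)
    return {
        "Activity": data["activity"],
        "Type": data["type"],
        "Participants": data["participants"],
    }


# Hypothetical response body; the field values are invented for illustration
sample_body = (
    '{"activity": "Learn a new recipe", "type": "cooking", '
    '"participants": 1, "price": 0.1}'
)

item = extract_item(sample_body)

# Scrapy's JSON feed export writes all yielded items as one JSON array
feed = json.dumps([item], indent=4)
print(feed)
```

Fields the spider does not copy (here the made-up `price`) are simply dropped from the exported item.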

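Method 4: Using a third-party API service (such as Scrapeless)

If you would rather not handle complex scraping logic yourself, a third-party API service such as Scrapeless can simplify the process.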

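import http.client
import json

# Open an HTTPS connection to the service
conn = http.client.HTTPSConnection("api.scrapeless.com")

# Request body: which scraper ("actor") to run and its input
payload = json.dumps({
    "actor": "scraper.amazon",
    "input": {
        "url": "https://www.amazon.com/dp/B0BQXHK363",
        "action": "product"
    }
})
headers = {
    # Assumption: hosted scraping APIs usually also require an API-key header;
    # check the provider's documentation for its exact name.
    'Content-Type': 'application/json'
}
conn.request("POST", "/api/v1/scraper/request", payload, headers)
res = conn.getresponse()
data = res.read()
conn.close()

# Print the raw response body
print(data.decode("utf-8"))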
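
The snippet above only prints the response. As a hedged sketch of the remaining step, the helper below writes a response body to a JSON file. `save_body_as_json`, the output path, and the sample bytes are all hypothetical; the real response format depends on the service.

```python
import json
import os
import tempfile


def save_body_as_json(raw: bytes, path: str):
    """Decode a response body and write it to `path` as pretty-printed JSON."""
    text = raw.decode("utf-8")
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Not JSON (for example an HTML error page): keep the text in a wrapper
        parsed = {"raw_text": text}
    with open(path, "w", encoding="utf-8") as file:
        json.dump(parsed, file, ensure_ascii=False, indent=4)
    return parsed


# Made-up bytes standing in for `data` returned by res.read()
sample = b'{"status": "ok", "result": {"title": "Example product"}}'

out_path = os.path.join(tempfile.gettempdir(), "response.json")
saved = save_body_as_json(sample, out_path)
print(saved)
```

Parsing before saving means the file holds structured JSON rather than one long escaped string, and the fallback keeps non-JSON error pages from raising an exception.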

Method 5: Using `Selenium` and the `json` module

Selenium is a browser automation tool, which makes it suitable for scraping dynamic pages.

import json

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Set up a headless Chrome WebDriver
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(service=Service(), options=options)

try:
    # Load the page in the browser
    url = "https://en.wikipedia.org/wiki/Web_scraping"
    driver.get(url)

    # Extract specific content: every heading from h1 to h5
    titles = []
    for tag in driver.find_elements(By.CSS_SELECTOR, "h1, h2, h3, h4, h5"):
        titles.append({"tag": tag.tag_name, "text": tag.text.strip()})
finally:
    # Always close the browser, even if scraping fails
    driver.quit()

# Save the data as JSON
with open("titles.json", "w", encoding="utf-8") as file:
    json.dump(titles, file, ensure_ascii=False, indent=4)
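
Selenium's `text` property returns only the text that is visible in the rendered page, so headings inside hidden elements come back as empty strings. Below is a small sketch of tidying the list before saving, assuming the same `tag`/`text` structure as above; `clean_titles` and the sample entries are hypothetical.

```python
import json


def clean_titles(titles):
    """Drop entries with empty text and collapse repeated whitespace."""
    cleaned = []
    for entry in titles:
        text = " ".join(entry["text"].split())
        if text:
            cleaned.append({"tag": entry["tag"], "text": text})
    return cleaned


# Made-up scrape result, including one hidden (empty) heading
raw_titles = [
    {"tag": "h1", "text": "Web scraping"},
    {"tag": "h2", "text": ""},
    {"tag": "h2", "text": "Legal   issues\n"},
]

cleaned_titles = clean_titles(raw_titles)
print(json.dumps(cleaned_titles, ensure_ascii=False, indent=4))
```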

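Summary

There are several ways to scrape web page data and save it as JSON, and the right choice depends on your needs and preferences. For simple tasks, `requests` and BeautifulSoup are good options; for more complex jobs, Scrapy offers powerful features; for dynamic pages, Selenium drives a real browser; and a third-party API service such as Scrapeless can simplify the process further. Whichever approach you pick, the `json` module makes it easy to save the data in JSON format.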
