
Toutiao (今日头条) Article Crawler Tutorial

article 2025/3/11 20:05:14

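As the internet has grown, news platforms such as Toutiao have accumulated massive amounts of data. For data analysts, researchers, and similar groups, collecting this data for analysis is valuable. This article shows how to write a Python crawler that collects article data from Toutiao.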

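I. Preparation

Environment setup

Install Python: make sure Python is installed on your machine; version 3.7 or later is recommended.
Install the required libraries: use pip to install the following:
```
pip install requests
pip install pandas
pip install openpyxl
pip install selenium
pip install beautifulsoup4
```
Here, requests sends HTTP requests, pandas handles data processing and saving, openpyxl is the engine pandas uses to write .xlsx files, selenium simulates browser operations, and beautifulsoup4 parses HTML documents.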

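Analyzing the Toutiao interface

Toutiao usually returns its data as JSON through API endpoints. We need to find the relevant endpoint and analyze its request parameters and the structure of the data it returns. For trending news, for example, the endpoint may look something like:

https://www.toutiao.com/api/news/hot/

By analyzing the JSON returned by the endpoint, we can obtain each news item's title, link, publication time, and similar fields.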
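
Before writing extraction code, it helps to look at the shape of a response. The sketch below runs on a hand-written sample payload rather than a real response: the field names `data`, `title`, `article_url`, and `publish_time` are assumptions for illustration, and `describe_payload` is a hypothetical helper, not part of any library.

```python
import json

# Hand-written sample, NOT a real response; all field names are assumptions.
SAMPLE_RESPONSE = """
{
  "data": [
    {"title": "Sample headline A", "article_url": "https://example.com/a", "publish_time": 1741694714},
    {"title": "Sample headline B", "article_url": "https://example.com/b", "publish_time": 1741698314}
  ]
}
"""


def describe_payload(payload):
    """Return the top-level keys and the field names of the first list item."""
    top_level = sorted(payload.keys())
    items = payload.get("data") or []
    item_fields = sorted(items[0].keys()) if items else []
    return top_level, item_fields


payload = json.loads(SAMPLE_RESPONSE)
top_level, item_fields = describe_payload(payload)
print("top-level keys:", top_level)
print("fields per item:", item_fields)
```

Running the same helper on a real response (saved from the browser's developer tools, for instance) shows which fields are actually available before any of them are hard-coded.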

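II. Implementing the Crawler

Step 1: Get the article list

Send the request: use the requests library to send a GET request to the Toutiao news endpoint and retrieve the news list as JSON.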

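import requests

url = 'https://www.toutiao.com/api/news/hot/'
# A browser-like User-Agent; requests sends its own default otherwise
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
# Without a timeout, requests can wait indefinitely on a stalled connection
response = requests.get(url, headers=headers, timeout=10)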

Parse the JSON: convert the returned JSON into a Python dictionary and extract each news item's title, link, and other fields.

import json

# Define the list up front so it exists even when the request fails
articles = []
if response.status_code == 200:
    data = json.loads(response.text)
    for item in data['data']:
        article = {
            'title': item['title'],
            'link': item['article_url']
        }
        articles.append(article)
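
Direct indexing such as `item['title']` raises `KeyError` as soon as one entry lacks a field. A more defensive variant could look like the sketch below; `extract_articles` is a hypothetical helper, and the payload shape (`data`, `title`, `article_url`) is the same assumption as in the step above.

```python
def extract_articles(payload):
    """Collect title/link pairs, skipping incomplete or duplicate entries."""
    articles = []
    seen_links = set()
    for item in payload.get("data") or []:
        title = item.get("title")
        link = item.get("article_url")
        if not title or not link or link in seen_links:
            continue  # incomplete entry or a link we already have
        seen_links.add(link)
        articles.append({"title": title.strip(), "link": link})
    return articles


# Hand-written sample payload (assumed shape, not a real response)
sample = {
    "data": [
        {"title": " Headline A ", "article_url": "https://example.com/a"},
        {"title": "No link here"},
        {"title": "Headline A again", "article_url": "https://example.com/a"},
    ]
}
articles = extract_articles(sample)
print(articles)
```

Only the first entry survives here: the second has no link and the third repeats a link that was already collected.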

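Step 2: Get the article details

Simulate a browser: for article pages that require login or load their content dynamically, use selenium to drive a browser and obtain the complete page HTML.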

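import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # headless mode: no browser window is shown
driver = webdriver.Chrome(options=options)
try:
    # `article` is one entry of the `articles` list built in step 1
    driver.get(article['link'])
    time.sleep(3)  # fixed wait for the page to finish loading
    html = driver.page_source
finally:
    driver.quit()  # always release the browser, even if loading fails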

Parse the HTML: use BeautifulSoup to parse the HTML and extract the article body, publication time, publisher, and other fields.

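from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# Defaults, so later steps still work when an element is missing
content = time_text = publisher_text = ''
# Extract the article body
article_content = soup.find('div', class_='article-content')
if article_content:
    content = article_content.get_text()
# Extract the publication time and the publisher
article_meta = soup.find('div', class_='article-meta')
if article_meta:
    time_tag = article_meta.find('span', class_='time')
    author_tag = article_meta.find('a', class_='author')
    time_text = time_tag.text if time_tag else ''
    publisher_text = author_tag.text if author_tag else ''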

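Step 3: Process and save the data

Clean the data: clean the extracted values, for example by removing illegal characters and normalizing time formats.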

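import re

# ASCII control characters; tab, newline, and carriage return are kept.
# Compiled once at module level instead of on every call.
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')

def remove_illegal_characters(text):
    """Strip control characters from the given string."""
    return ILLEGAL_CHARACTERS_RE.sub('', text)

content = remove_illegal_characters(content)
time_text = remove_illegal_characters(time_text)
publisher_text = remove_illegal_characters(publisher_text)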
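
The cleaning step also mentions normalizing times, which the snippet above does not cover. A minimal sketch follows, assuming the page shows absolute timestamps in one of a few common layouts; the candidate formats are guesses to be adjusted against real pages, and `normalize_time` is a hypothetical helper.

```python
from datetime import datetime

# Candidate input formats are assumptions; adjust to what the pages actually show.
CANDIDATE_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
    "%Y/%m/%d %H:%M:%S",
)


def normalize_time(text):
    """Return 'YYYY-MM-DD HH:MM:SS', or the stripped input if no format matches."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # try the next candidate format
        return parsed.strftime("%Y-%m-%d %H:%M:%S")
    return cleaned


print(normalize_time(" 2025-03-11 20:05 "))
print(normalize_time("2025/3/11 20:05:14"))
print(normalize_time("3 hours ago"))
```

Returning the input unchanged when nothing matches keeps relative strings such as "3 hours ago" visible in the output, so they can be handled separately rather than silently dropped.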

Save the data: write the cleaned data to an Excel file for later analysis.

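import pandas as pd

# One dict per article; when crawling several articles, append inside the loop
rows = []
rows.append({
    'Title': remove_illegal_characters(article['title']),
    'Time': time_text,
    'Publisher': publisher_text,
    'Content': content
})
df = pd.DataFrame(rows)
df.to_excel("result.xlsx", index=False)  # writing .xlsx requires openpyxl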
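
The snippets above process a single article. One way to tie them together for the whole list is sketched below. `crawl_articles` is a hypothetical helper, not part of any library: the caller passes in `fetch_html` and `parse_article` (for example, thin wrappers around the selenium and BeautifulSoup steps), and the pause between requests is an assumption added to avoid hitting the site in a tight loop.

```python
import random
import time


def crawl_articles(articles, fetch_html, parse_article, delay_range=(1.0, 3.0)):
    """Fetch and parse every article, returning one row dict per success.

    fetch_html(link) -> str and parse_article(html) -> dict are supplied
    by the caller.
    """
    rows = []
    for index, article in enumerate(articles):
        if index > 0:
            time.sleep(random.uniform(*delay_range))  # pause between requests
        try:
            html = fetch_html(article["link"])
            details = parse_article(html)
        except Exception as exc:  # one bad page should not stop the crawl
            print(f"skipped {article['link']}: {exc}")
            continue
        rows.append({"title": article["title"], **details})
    return rows


# Stand-in functions so the sketch runs without a browser
def fake_fetch(link):
    if link.endswith("/bad"):
        raise RuntimeError("page failed to load")
    return "<html>" + link + "</html>"


def fake_parse(html):
    return {"content": html, "time": "", "publisher": ""}


demo = [
    {"title": "A", "link": "https://example.com/a"},
    {"title": "B", "link": "https://example.com/bad"},
]
rows = crawl_articles(demo, fake_fetch, fake_parse, delay_range=(0, 0))
print(rows)
```

The resulting list of row dicts can be passed to `pd.DataFrame` and saved with `to_excel` in the same way as a single row.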