
[A Super-Simple Crawler Demo] Exploring Sina: Using a Python Crawler to Fetch Dynamic Web Page Data

article 2025/1/17 21:32:40

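Exploring Sina: Using a Python Crawler to Fetch Dynamic Web Page Data

Introduction
Preparation
- Choosing a target
- The structure of Sina
Writing the crawler code
- Scraping example.com
- Scraping part of the Sina homepage
- Walking through the code
- Note: `KeyError: 'href'`
- Results
Miscellaneous
- Modifying and adapting
- Things to keep in mind
Summary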

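Introduction

"Can you teach web scraping hands-on? Set up an environment, try crawling a site, and try collecting some data."

A follower wanted to learn about crawlers, so today we start from the very basics!

This article shows how to use Python to scrape content from the Sina homepage. Sina is a content-rich, frequently updated news site, which makes it a good example for understanding how to scrape dynamic web pages.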

Preparation

First, make sure you have Python installed, along with the requests, BeautifulSoup, and lxml libraries.

All three can be installed with a single command:

pip install requests beautifulsoup4 lxml
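Before moving on, it can help to confirm that the three libraries named above are actually importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` mapping are assumptions introduced here for illustration, not part of any of these libraries.

```python
import importlib.util

# Import names can differ from pip package names:
# the beautifulsoup4 package is imported as "bs4".
REQUIRED = {"requests": "requests", "bs4": "beautifulsoup4", "lxml": "lxml"}


def missing_packages(required=REQUIRED):
    """Return the pip names of the modules that cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```

If anything is reported as missing, rerunning the pip command for those package names should resolve it.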

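Choosing a target

For our first project, let's pick a simple website to scrape. To keep things easy, a news site or a weather forecast site is a good choice. These sites usually have a clear structure, which makes them suitable practice material for beginners.

The structure of Sina

The Sina homepage contains several news categories, such as domestic news, international news, and sports news. Our goal is to extract the news headlines and links under a specific category.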
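To make the goal concrete, here is a small sketch of pulling headlines and links out of one category block with a CSS selector. The HTML snippet, the `domestic` and `sports` ids, and the helper `links_in_category` are all hypothetical: the real Sina homepage uses its own ids and class names, which you would need to look up in the browser's developer tools.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page is structured differently.
SAMPLE_HTML = """
<div id="domestic">
  <h2>Domestic</h2>
  <ul>
    <li><a href="https://news.example/d1">Headline D1</a></li>
    <li><a href="https://news.example/d2">Headline D2</a></li>
  </ul>
</div>
<div id="sports">
  <h2>Sports</h2>
  <ul>
    <li><a href="https://news.example/s1">Headline S1</a></li>
  </ul>
</div>
"""


def links_in_category(html, category_id):
    """Return (headline, link) pairs for anchors inside one category block."""
    soup = BeautifulSoup(html, "html.parser")
    # "#id a[href]" matches every <a> with an href inside the element with that id.
    anchors = soup.select(f"#{category_id} a[href]")
    return [(a.get_text(strip=True), a["href"]) for a in anchors]


if __name__ == "__main__":
    for headline, link in links_in_category(SAMPLE_HTML, "domestic"):
        print(headline, link)
```

Restricting the selector to one container is what separates "links in the sports section" from "every link on the page".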

Writing the crawler code

Scraping example.com

As a first example, we will use a very simple website: "example.com".

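import requests
from bs4 import BeautifulSoup

def scrape_example_com():
    url = 'https://example.com'
    # Send a GET request; the timeout keeps the call from hanging indefinitely
    response = requests.get(url, timeout=10)
    # Parse the returned HTML with Python's built-in parser
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract the page's text content and trim surrounding whitespace
    text = soup.get_text().strip()
    return text

print(scrape_example_com())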


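requests.get sends a request to the website and returns the response.
BeautifulSoup parses the response body so that it is easier to work with.
The get_text method extracts the text content of the page.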
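The three steps described above can also be split into a fetch step and a parse step, which makes the parsing easy to try without a network connection. This is only a sketch: the function names `fetch_html` and `html_to_text` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


def html_to_text(html):
    """Return the text of an HTML document, one text fragment per line."""
    soup = BeautifulSoup(html, "html.parser")
    # strip=True trims each fragment and drops whitespace-only ones.
    return soup.get_text(separator="\n", strip=True)


if __name__ == "__main__":
    print(html_to_text(fetch_html("https://example.com")))
```

Keeping the download separate from the parsing also means a saved copy of a page can be reused while experimenting, rather than requesting the site on every run.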

Scraping part of the Sina homepage

Below is an example Python script that scrapes part of the Sina homepage:

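import requests
from bs4 import BeautifulSoup

def scrape_sina_news():
    url = 'https://www.sina.com.cn/'
    response = requests.get(url, timeout=10)
    # Parse the raw bytes with lxml, telling BeautifulSoup to decode them as UTF-8
    soup = BeautifulSoup(response.content, 'lxml', from_encoding='utf-8')
    # Every <a> tag on the page, in document order
    news_titles = soup.find_all('a')
    # Look at the first 10 anchors only; those without an href are skipped
    for title in news_titles[:10]:
        if 'href' in title.attrs:
            print(title.text.strip(), title['href'])

scrape_sina_news()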
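Because the loop above takes the first 10 anchors and only then checks for an href, it can print fewer than 10 results, and any relative links are printed as they appear in the page. One possible variation, shown here as a sketch on an inline HTML string, filters first and resolves links against the page URL. The helper name `first_links` and the sample markup are assumptions made for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def first_links(html, base_url, limit=10):
    """Return up to `limit` (text, absolute_url) pairs for anchors with text and an href."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # href=True matches only <a> tags that actually carry an href attribute.
    for anchor in soup.find_all("a", href=True):
        if len(results) >= limit:
            break
        text = anchor.get_text(strip=True)
        if not text:
            continue  # skip anchors with no visible label
        # urljoin turns relative paths into absolute URLs and leaves absolute ones alone.
        results.append((text, urljoin(base_url, anchor["href"])))
    return results


if __name__ == "__main__":
    sample = (
        '<a name="top"></a>'
        '<a href="/news/1.html">First</a>'
        '<a href="https://sports.example/2">Second</a>'
        '<a href="#"> </a>'
    )
    for text, link in first_links(sample, "https://www.sina.com.cn/"):
        print(text, link)
```

The same function could be fed `response.text` from the script above in place of the sample string.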