
Web Scraping Practice Exercise

Scraping comments from a trending Weibo topic, using two slightly different approaches to fetch the data. (A third approach, selenium + XPath, would also work, but for this task it is less convenient than the re module.)

1. Method 1:

After locating the request URL that returns the data, simply copy the request out of the browser's developer tools:

Paste the response into a formatting tool to get a readable view of the content we need:
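An external formatter isn't strictly necessary: Python's built-in json module can pretty-print the response just as well. A minimal sketch, using a hypothetical trimmed fragment of the comments JSON:

```python
import json

# Hypothetical trimmed fragment of the buildComments response
raw = '{"ok":1,"data":[{"id":1,"user":{"screen_name":"test_user"},"text_raw":"nice"}]}'

# indent=2 re-serializes the JSON with readable nesting; ensure_ascii=False
# keeps Chinese characters readable instead of \uXXXX escapes
formatted = json.dumps(json.loads(raw), ensure_ascii=False, indent=2)
print(formatted)
```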


Then extract the fields we want directly with regular expressions:

# -*- coding: utf-8 -*-
# Scraped fields: user, region, comment text, timestamp
import re
import requests


headers = {
    "accept": "application/json, text/plain, */*",
    "accept-language": "zh-CN,zh-TW;q=0.9,zh;q=0.8",
    "client-version": "v2.46.22",
    "priority": "u=1, i",
    "referer": "https://weibo.com/5976780114/OCu0h7C2m",
    "sec-ch-ua": "\"Chromium\";v=\"130\", \"Microsoft Edge\";v=\"130\", \"Not?A_Brand\";v=\"99\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Windows\"",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
    "server-version": "v2024.10.17.1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 Edg/130.0.0.0",
    "x-requested-with": "XMLHttpRequest",
    "x-xsrf-token": "OlQZogwGxB3UWIX8qLDBb17C"
}
cookies = {
    "SUB": "_2AkMRpEM_f8NxqwFRmf0dyGjmZI10yw3EieKn-LLkJRMxHRl-yT9vqnYNtRB6OiRt0XkgWF5nAjxILq2DuCY2u9MVblIg",
    "SUBP": "0033WrSXqPxfM72-Ws9jqgMF55529P9D9Whc2Yb7zCg4iBUi8WCoPA6o",
    "PC_TOKEN": "99fecfe7f5",
    "XSRF-TOKEN": "OlQZogwGxB3UWIX8qLDBb17C",
    "WBPSESS": "Av_uyMf5J_yRg2sn7ncLQbTYjuXCausbP61QbP_ZrvZQj6b2Bo8ZgymyIGPtlYQV7tXXVrP1AvVp3YSm6bWI5wmC-mAp7Zl2UTVzd3v4cFTZsJbJKBXIeC9HfBqpwePIiCb6rP2_3-nBZC5PBwFWubW9MnYH1HFFORkcygVZhkA=",
    "ariaDefaultTheme": "default",
    "ariaFixed": "true",
    "ariaReadtype": "1",
    "ariaMouseten": "null",
    "ariaStatus": "false"
}
url = "https://weibo.com/ajax/statuses/buildComments"
params = {
    "is_reload": "1",
    "id": "5091718011814514",
    "is_show_bulletin": "3",
    "is_mix": "0",
    "count": "10",
    "uid": "5976780114",
    "fetch_level": "0",
    "locale": "zh-CN"
}
response = requests.get(url, headers=headers, cookies=cookies, params=params)
# print(response.text)

# Extract with regular expressions, using reasonably precise patterns
names = re.findall(r'"screen_name":"([^"]+)"', response.text)
areas = re.findall(r'"source":"来自([^"]+)"', response.text)
comments = re.findall(r'"text":"([^"<]+)', response.text)
times = re.findall(r'"created_at":"([^"]+)"', response.text)

# Print the extracted fields
for name, area, comment, time in zip(names, areas, comments, times):
    print(f"Name: {name}")
    print(f"Area: {area}")
    print(f"Comment: {comment}")
    print(f"Time: {time}")
    print('-' * 40)
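To see what each pattern matches, here are the same regexes run against a hypothetical one-comment fragment of the response text. Note how `[^"<]+` in the text pattern deliberately stops at the first `<`, trimming embedded HTML such as emoji `<img>` tags:

```python
import re

# Hypothetical fragment mimicking one comment in the raw response text
sample = ('{"screen_name":"test_user","source":"来自广东",'
          '"text":"nice post<img src=emoji>","created_at":"Wed Oct 23 10:00:00 +0800 2024"}')

print(re.findall(r'"screen_name":"([^"]+)"', sample))  # ['test_user']
print(re.findall(r'"source":"来自([^"]+)"', sample))    # ['广东']
# [^"<]+ stops at '<', so the embedded <img> tag is dropped:
print(re.findall(r'"text":"([^"<]+)', sample))         # ['nice post']
print(re.findall(r'"created_at":"([^"]+)"', sample))   # ['Wed Oct 23 10:00:00 +0800 2024']
```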



Output:


2. Method 2:

Find the URL of the Ajax request that stores the data, fetch the JSON directly, then use jsonpath to locate and print the fields we want (jsonpath syntax is covered in an earlier post on this blog):

# -*- coding: utf-8 -*-
# Scraped fields: user, region, comment text, timestamp
import requests
import json
import jsonpath

url = 'https://weibo.com/ajax/statuses/buildComments?is_reload=1&id=5091718011814514&is_show_bulletin=3&is_mix=0&count=10&uid=5976780114&fetch_level=0&locale=zh-CN'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 Edg/130.0.0.0',
    'cookie': 'SUB=_2AkMRpEM_f8NxqwFRmf0dyGjmZI10yw3EieKn-LLkJRMxHRl-yT9vqnYNtRB6OiRt0XkgWF5nAjxILq2DuCY2u9MVblIg; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9Whc2Yb7zCg4iBUi8WCoPA6o; XSRF-TOKEN=OlQZogwGxB3UWIX8qLDBb17C; ariaDefaultTheme=default; ariaFixed=true; ariaReadtype=1; ariaMouseten=null; ariaStatus=false; WBPSESS=Av_uyMf5J_yRg2sn7ncLQbTYjuXCausbP61QbP_ZrvZQj6b2Bo8ZgymyIGPtlYQVeEGwKEi_ZOjBTLN5NaAfgUx4Qhnjv-JHgGuD5n9k-W0nT8OX-r4BsyBOOf_atoAWLFRIzZjiZifsKkmXzMdVaNVMLol06oE8JgI3RF3HTRU='
}

response = requests.get(url, headers=headers)
# print(response.status_code)

dict_data = json.loads(response.content)
names = jsonpath.jsonpath(dict_data, '$..screen_name')
areas = jsonpath.jsonpath(dict_data, '$..source')  # note: values keep the "来自" prefix here
comments = jsonpath.jsonpath(dict_data, '$..text_raw')
times = jsonpath.jsonpath(dict_data, '$..created_at')
comment_list = []
for name, area, comment, time in zip(names, areas, comments, times):
    # print(f"User: {name}")
    # print(f"Region: {area}")
    # print(f"Comment: {comment}")
    # print(f"Time: {time}")
    # print("-" * 50)  # separator
    # Store each record as a dict
    comment_dict = {
        "user": name,
        "region": area,
        "comment": comment,
        "time": time
    }
    comment_list.append(comment_dict)

print(json.dumps(comment_list, ensure_ascii=False, indent=4))
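In expressions like `$..screen_name`, `$` is the document root and `..` is jsonpath's recursive-descent operator: it searches every level of the structure for the named key. A stdlib-only sketch of what that descent does, on hypothetical sample data:

```python
def find_all(obj, key):
    # Recursively collect every value stored under `key`,
    # mimicking the jsonpath expression '$..key'
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                found.append(v)
            found.extend(find_all(v, key))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_all(item, key))
    return found

sample = {"data": [{"user": {"screen_name": "alice"}, "text_raw": "hi"},
                   {"user": {"screen_name": "bob"}, "text_raw": "yo"}]}
print(find_all(sample, "screen_name"))  # ['alice', 'bob']
print(find_all(sample, "text_raw"))     # ['hi', 'yo']
```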

Output:

