阿九's Advanced Python Web Scraping Course 18.3: Study Notes
Contents
- Preface
- 1. Scraping the main headlines
- 2. Scraping the sub-headlines
- 3. Headlines under the Securities column
- 4. The body text of a specific article
Preface
- Course video: https://www.bilibili.com/video/BV1kV4y1576b/
- Sina Finance homepage: https://finance.sina.com.cn/
- The HTML parser library needs to be installed first:
conda install lxml
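If you are not using conda, the equivalent pip command works as well; note that requests and beautifulsoup4 are also needed (the course may assume they are already installed):
pip install requests beautifulsoup4 lxml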
- Common setup code shared by all the scraping steps below:
import requests
from bs4 import BeautifulSoup

# Fetch the Sina Finance homepage, fix the encoding, and parse it with lxml
html = requests.get('https://finance.sina.com.cn/')
html.encoding = 'utf-8'
soup = BeautifulSoup(html.text, 'lxml')
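A small sanity check before parsing can save debugging time. This is not part of the course code, just a sketch using the same html object; raise_for_status() and apparent_encoding are standard features of the requests library:
# Abort early on a 4xx/5xx response, then confirm what encoding requests detected
html.raise_for_status()
print(html.status_code, html.apparent_encoding)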
1. Scraping the main headlines
## Main headlines
bigTitle = soup.select("#blk_hdline_01 h3 a")
for bg in bigTitle:
    print("Main headline:", bg.text)
    print("Link:", bg.get('href'))
    print("-"*60)
Result:
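As an optional extension (my own addition, not from the course), the headlines can be saved instead of printed. This sketch reuses the bigTitle list above and writes a headlines.csv file with the standard csv module; the file name is an arbitrary choice:
import csv
with open("headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url"])          # header row
    for bg in bigTitle:
        writer.writerow([bg.text, bg.get('href')])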
2. Scraping the sub-headlines
## Sub-headlines
smallTitle = soup.select("#blk_hdline_01 p a")
for st in smallTitle:
    print("Sub-headline:", st.text)
    print("Link:", st.get('href'))
    print("-"*60)
Result:
3. Headlines under the Securities column
Press F12 to open the developer tools and build the selector path from the element's "class" attribute: wherever the class value contains a space, replace it with a "." so that each class name gets its own dot.
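For example, if the developer tools show a container such as <div class="m-p1-mb2-list m-list-container">, the two space-separated class names are chained into one selector (purely an illustration of the rule; the real element may carry other attributes too):
# class="m-p1-mb2-list m-list-container"  ->  ".m-p1-mb2-list.m-list-container"
container = soup.select(".m-p1-mb2-list.m-list-container")   # "container" is just an example name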
## Securities
zq = soup.select(".m-p1-mb2-list.m-list-container ul li a")
for z in zq:
    print("Securities headline:", z.text)
    print("Link:", z['href'])
    print("-"*60)
Result:
4. The body text of a specific article
An id is unique on a page, but a class may appear on several elements.
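A quick way to see this difference, using the selectors already introduced above (the exact counts depend on the page at the moment you run it):
# An id selector should match at most one element on the page
print(len(soup.select("#blk_hdline_01")))       # expect 0 or 1
# A class selector can match many elements
print(len(soup.select(".m-list-container")))    # may well be more than 1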
## Securities
zq = soup.select(".m-p1-mb2-list.m-list-container ul li a")
for z in zq:
    print("Securities headline:", z.text)
    print("Link:", z['href'])
    # Follow the link and scrape the article body
    innerHtml = requests.get(z['href'])
    innerHtml.encoding = 'utf-8'
    soup2 = BeautifulSoup(innerHtml.text, 'lxml')
    articles = soup2.select("div .article p")
    content = ""                       # renamed from "str" to avoid shadowing the built-in
    for article in articles:
        content += article.text
    print(content)
    print("-"*30)
Result:
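One practical refinement, my own addition rather than something from the course: follow only the first few links and pause briefly between requests so the site is not hit too hard. The helper name get_article_text, the limit of 3 and the 1-second delay are arbitrary choices for this sketch:
import time

def get_article_text(url):
    # Fetch one article page and join its paragraph texts
    page = requests.get(url)
    page.encoding = 'utf-8'
    s = BeautifulSoup(page.text, 'lxml')
    return "".join(p.text for p in s.select("div .article p"))

for z in zq[:3]:                 # only the first three securities links
    print(z.text, z['href'])
    print(get_article_text(z['href']))
    time.sleep(1)                # small pause between requests
    print("-"*30)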