当前位置: 首页 > article >正文

python数据分析——网络爬虫和API

HTML

什么是超文本标记语言?

它包含由标签标记的多层内容,包括开始标签和带有‘/’的结束标签
“head”:用于浏览器特定信息
“style”:层叠样式表(CSS)用于设置HTML页面的样式
“body”:用于可见内容

<html>
<head>
	<title>My First Web Page</title>
	<style>
		.highlight {
			background-color: lightblue;
		}
		div.section h1 {
			color: red;
			font-size: 24px;
		}
		#important {
			font-weight: bold;
		}
		</style>
</head>
<body>
	<h1>This is my first web page!</h1>
	<p>I'm excited to learn HTML.</p>
</html>

数据检索

1. 使用requests.get()下载整个网页的全部内容

2. 使用BeautifulSoup导航并提取精确信息(位于开始标签和结束标签之间)

3. 逐步遍历网页的层次结构从而到达目标位置

import requests
from bs4 import BeautifulSoup

# store the response from the website in a varieble
response = requests.get("https://en.wikipedia.org/wiki/...")
# extract the actual content of the web page
content = response.content

# create a BeautifulSoup object
# 'html.parser' is the default parser for BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')

# get the head element of BeautifulSoup object
body = soup.head
# get the title element of the head element
t = body.title

print(t.text)

选取元素

find_all()

定位并提取网页中所有带有某一tag的元素

  • id: unique
# find_all() with an id attribute passed
links = soup.find_all('li', id = "toc-Computing")

for link in links:
	href = link.get('href') # get the URL
	text = link.text
	print(f"URL: {href}\\nText: {text}\\n")
  • class_: not unique
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_Nobel_laureates"

response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")

# scrape a table from web page
# find an HTML table with the specified class_ attribute
table = soup.find("table", class_ = "wikitable sortable")

# read the HTML table into a DataFrame
# select the first DataFrame from the list
df = pd.read_htnl(str(table))[0]

CSS selectors

  • tag selector
  • .class selector
  • #ID selector
# select all elements with the class "highlight"
highlighted_elements = soup.select(".highlight")
for element in highlighted_elements:
	print("Highlighted Element Text:", element.text)

# select <h2> elements inside a <div> with class "section"	
section_h2_elements = soup.select("div.section h2")
for element in section_h2_elements:
	print("Section <h2> Text:", element.text)
	
# select the element with the ID "important"
important_elements = soup.select("#important")
print("Important Element Text:", important_elements[0].text)

API

它是一种软件组件之间相互交互的方式

它可以用来从外部源(如数据库、Web服务和云存储)提取数据

获取一个API密钥来向API发送请求

端点(endpoint)

一个用于从API访问特定资源或功能的URL

Google Maps API 
/geocode/json: get the latitude and longitude of a given address
/directions/json: get directions between two points
/places/nearby: get a list of places nearby a given location

GitHub API 
/users/{username}/repos: get a list of repositories for a specified user
/repos/{owner}/{repo}/commits: get a list of commits for a repository
/repos/{owner}/{repo}/issues: get a list issues for a repository

OpenWeatherMap API 
/weather: get current weather data for a specified location
/forecast/hourly: get hourly weather forecast for 4 days of a specified location
/history/city: get hourly historical weather data for specified location

向API发送请求

使用HTTP客户端:一个可以发送和接收HTTP请求的软件应用程序

requests.get(url):向URL发送HTTP请求,并从API端点检索数据,其中URL作为参数传入

import requests

url = 'https://api.openweathermap.org/data/2.5/weather?q=Singapore&APPID=YOUR_API_KEY'

response = requests.get(url)

# check whether the request is successful
# 200 indicates successful
# 400 indicates the request was invalid
# 500 indicates an error occureed on the surver
if response.status_code == 200:
	# convert response content into a dictionary or list
	weather_data = response.json() # output
	
	weather_description = weather_data['weather'][0]['description']
	print(weather_description)
else:
	print("An error occurred.")

JSON

JavaScript Object Notation

  • 一种轻量级数据交换格式(无需额外标签)
  • 基于文本,且与平台无关,在不同应用程序之间交换数据时非常流行
  • 由对象({})和数组([])组成,以层次化的树状格式进行结构化

response data

import json
import requests
import pandas as pd

url = 'https://api.openweathermap.org/data/2.5/weather?q=Singapore&APPID=YOUR_API_KEY'

response = requests.get(url)

# parse the JSON response data into a dictionary
weather_data = json.loads(response.content)

# convert JSON data into a DataFrame
df = pd.json_normalize(weather_data)
print(pd)

pass URL parameters

  • 在?后添加URL参数
  • q=value: 特定的查询条件
  • appid=your_api_key: 传入你的API key
  • &: 不同参数用&分隔开
import requests

URL = 'https://api.openweathermap.org/data/2.5/weather'

# parameters are stored in key-value pair structure
PARAMETERS = {
		"q": "Singapore",
		"appid": "your_api_key",
		"units": "imperial"
}

# params require a dictionary
response = requests.get(URL, params=PARAMETERS)

# parse the JSON response data into a dictionary
data = response.json()

# convert temperature form Kelvins to Celsius
temperature = data['main']['temp'] - 273.15

print(f"The current temperature in Singapore is {round(temperatire,2)} degrees Celsius.")

http://www.kler.cn/a/283971.html

相关文章:

  • nginx部署H5端程序与PC端进行区分及代理多个项目及H5内页面刷新出现404问题。
  • AI写作(二)NLP:开启自然语言处理的奇妙之旅(2/10)
  • 如何查看电脑关机时间
  • 图论基本术语
  • pycharm快速更换虚拟环境
  • 将python下载的依赖包传到没网的服务器
  • 图灵盾IOS SDK
  • 数据结构之拓扑排序
  • 【王树森】RNN模型与NLP应用(6/9):Text Generation(个人向笔记)
  • 【C#】属性的声明
  • Elasticsearch中修改mapping的字段类型该怎么操作
  • Go语言结构快速说明
  • JAVA后端框架--【Mybatis】
  • 【单片机原理及应用】实验:数字秒表显示器
  • ubuntu录屏解决ubuntu下无法播放MP4格式文件的方法
  • 【栈】| 力扣高频题: 基本计算器二
  • 忘掉 Siri 吧:苹果可能会推出拥有自己AI“个性”的机器人设备|TodayAI
  • linux信号处理机制基础(下)
  • 【 WPF 中常用的 `Effect` 类的介绍、使用示例和适用场景】
  • Qt Creator 配置pcl1.14.1
  • 物理机安装Centos后无法连接网络(网线网络)怎么办?-呕心沥血总结版-超简单
  • CSRF漏洞的预防
  • CMake基本语法大全
  • 2024.08.30
  • JVM面试(一)什么是虚拟机?什么是class文件?
  • ASP.NET Core6.0-wwwroot文件夹无法访问解决方法