
Python Web Scraping Tutorial 003: Customizing the Request Object, and the quote and urlencode Methods for GET Requests

article 2025/3/31 9:58:49

2.4 Customizing the Request Object

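In Python web scraping, User-Agent (UA) blocking is an anti-scraping measure in which a website inspects the User-Agent request header to identify and block scrapers. Many sites check whether the UA belongs to a common scraping client (such as Python-urllib or Scrapy) and reject requests that do not appear to come from a browser. To get past this check, a scraper can send a browser-like UA, pick a UA at random, or rotate through a pool of UAs.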
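To see what a server receives when no UA is set, the short sketch below reads the default User-Agent that urllib attaches to its requests. It uses only the standard library and sends nothing over the network; the variable names are illustrative, and the version suffix depends on the local Python installation.

```python
import urllib.request

# build_opener() returns an OpenerDirector whose addheaders list holds
# the headers urllib adds to every request by default.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

default_ua = default_headers['User-agent']
print(default_ua)  # e.g. Python-urllib/3.x
```

A value of this form is easy for a server to recognize, which is why the example in this section replaces it with a browser UA.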

# -*- coding: utf-8 -*-
# @Time: 2022/9/25 0025 11:32
# @Author: Wang
# @File: 04_urllib_请求对象的控制.py

import urllib.request

url = 'https://www.baidu.com'

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}

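# urlopen() has no headers parameter, so the headers dict cannot be passed to it directly

# Anatomy of a URL
# http/https   www.baidu.com   80/443   s      wd=周杰伦
#  protocol        host         port    path   parameters
# Default ports:
# http    80
# https   443
# mysql   3306

# Request's second positional parameter is data, not headers,
# so url and headers are passed as keyword arguments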
request = urllib.request.Request(url=url, headers=headers)

response = urllib.request.urlopen(request)

content = response.read().decode('utf-8')

print(content)
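The URL anatomy listed in the comments above can be checked with urllib.parse.urlsplit. This is a minimal sketch; the DEFAULT_PORTS table is an illustrative helper rather than part of the standard library, and the sample URL is the encoded search URL used later in this tutorial.

```python
from urllib.parse import urlsplit

# Illustrative lookup table for the default ports named in the comments
DEFAULT_PORTS = {'http': 80, 'https': 443}

parts = urlsplit('https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6')

# .port is None when the URL does not spell out a port,
# so fall back to the default port for the scheme
port = parts.port or DEFAULT_PORTS[parts.scheme]

print(parts.scheme)    # https
print(parts.hostname)  # www.baidu.com
print(port)            # 443
print(parts.path)      # /s
print(parts.query)     # wd=%E5%91%A8%E6%9D%B0%E4%BC%A6
```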

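Explanation:

The User-Agent header imitates a real browser, so the server is less likely to identify the request as coming from a Python scraper.
The UA used here is a Chrome UA string; an up-to-date UA can be obtained from a UA generator.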
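The random UA and UA pool ideas mentioned at the start of this section could look roughly like the following. This is a sketch under assumptions: UA_POOL and build_request are hypothetical names, and apart from the Chrome string used above, the pool entries are sample values that should be replaced with current browser UAs.

```python
import random
import urllib.request

# Illustrative pool: the first entry is the UA used above, the others are
# sample strings in the usual Firefox and Safari formats.
UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15',
]


def build_request(url):
    """Return a Request that carries a randomly chosen User-Agent."""
    headers = {'User-Agent': random.choice(UA_POOL)}
    return urllib.request.Request(url=url, headers=headers)


request = build_request('https://www.baidu.com')

# Request stores header names capitalized, so the key is 'User-agent'
print(request.get_header('User-agent'))
```

Calling build_request once per page means consecutive requests do not all share the same UA; the request is then passed to urllib.request.urlopen as before.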

2.5 Encoding and Decoding

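Requesting https://www.baidu.com/s?wd=杜兰特 directly with urllib raises an error, because the URL contains Chinese characters that cannot be sent as-is (urllib typically reports a UnicodeEncodeError from the ASCII codec). The characters must first be converted with quote into a percent-encoded form that the request can carry.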
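The failure can be reproduced without touching the network. The sketch below assumes the underlying problem is that the request line has to be ASCII, and simply tries to encode both versions of the URL; the variable names are illustrative.

```python
from urllib.parse import quote

raw_url = 'https://www.baidu.com/s?wd=杜兰特'

# The raw URL contains non-ASCII characters, so ASCII encoding fails
try:
    raw_url.encode('ascii')
    raw_is_ascii = True
except UnicodeEncodeError:
    raw_is_ascii = False

# quote() replaces each UTF-8 byte of the keyword with a %XX escape
encoded_url = 'https://www.baidu.com/s?wd=' + quote('杜兰特')

print(raw_is_ascii)  # False
print(encoded_url)   # https://www.baidu.com/s?wd=%E6%9D%9C%E5%85%B0%E7%89%B9
```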

Example:

# -*- coding: utf-8 -*-
# @Time: 2022/9/25 0025 11:50
# @Author: Wang
# @File: 05_urllib_get请求的quote方法.py

import urllib.request
import urllib.parse

# url = 'https://www.baidu.com/s?wd=%E6%9D%9C%E5%85%B0%E7%89%B9'

# Fetch the page source of https://www.baidu.com/s?wd=周杰伦

url = 'https://www.baidu.com/s?wd='

# Customizing the request object is the first countermeasure against anti-scraping checks
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}

# Percent-encode the three characters 周杰伦 (their UTF-8 bytes written as %XX escapes)
# This relies on urllib.parse
name = urllib.parse.quote('周杰伦')
print(name)  # %E5%91%A8%E6%9D%B0%E4%BC%A6

# Customize the request object
request = urllib.request.Request(url=url+name, headers=headers)

# Send the request to the server, imitating a browser
response = urllib.request.urlopen(request)

# Read the response body
content = response.read().decode('utf-8')

print(content)

Result: the program prints the encoded keyword, followed by the HTML source of the response.
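Since this section covers decoding as well as encoding, here is a small sketch of the round trip with unquote, along with how quote treats spaces and slashes compared with quote_plus. The sample strings are illustrative.

```python
from urllib.parse import quote, quote_plus, unquote

encoded = quote('周杰伦')
print(encoded)           # %E5%91%A8%E6%9D%B0%E4%BC%A6
print(unquote(encoded))  # 周杰伦

# quote() leaves '/' unescaped by default and writes a space as %20
print(quote('a b/c'))           # a%20b/c

# safe='' escapes '/' as well
print(quote('a b/c', safe=''))  # a%20b%2Fc

# quote_plus() writes a space as '+' and escapes '/'
print(quote_plus('a b/c'))      # a+b%2Fc
```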

2.6 urlencode

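In Python web scraping, urlencode (urllib.parse.urlencode) encodes URL parameters so that they can be transmitted safely. It is typically used to build the query string of a GET request, whether the request is then sent with requests or with urllib.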

Example 1:

import urllib.parse

data = {
    'wd': '周杰伦',
    'sex': '男'
}

a = urllib.parse.urlencode(data)

print(a)

Output: urlencode converts the dictionary into a URL-encoded query string.

wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&sex=%E7%94%B7
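To confirm that nothing is lost in the encoding, the query string can be decoded again with urllib.parse.parse_qs. This is a minimal sketch; note that parse_qs returns each value as a list, because a key may appear more than once in a query string. The second dictionary is an illustrative example of how spaces are handled.

```python
from urllib.parse import parse_qs, urlencode

data = {'wd': '周杰伦', 'sex': '男'}

query = urlencode(data)
print(query)    # wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&sex=%E7%94%B7

decoded = parse_qs(query)
print(decoded)  # {'wd': ['周杰伦'], 'sex': ['男']}

# urlencode uses quote_plus by default, so a space becomes '+'
spaced = urlencode({'q': 'python urllib'})
print(spaced)   # q=python+urllib
```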

Example 2:

# -*- coding: utf-8 -*-
# @Time: 2022/9/27 0027 10:48
# @Author: Wang
# @File: 06_urllib_get请求的urlencode.py

# https://www.baidu.com/s?wd=周杰伦&sex=男

import urllib.request
import urllib.parse

base_url = 'https://www.baidu.com/s?'

data = {
    'wd': '周杰伦',
    'sex': '男',
    'location': '中国台湾省'
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}

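# Encode the parameter dictionary into a query string
new_data = urllib.parse.urlencode(data)

# Append the query string to the base URL
url = base_url + new_data

print(url)  # https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&sex=%E7%94%B7&location=%E4%B8%AD%E5%9B%BD%E5%8F%B0%E6%B9%BE%E7%9C%81

# Customize the request object
request = urllib.request.Request(url=url, headers=headers)

# Send the request to the server
response = urllib.request.urlopen(request)

# Read and decode the response body
content = response.read().decode('utf-8')

print(content)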
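The steps in this section can be collected into two small helpers. This is only a sketch: build_search_request and fetch are hypothetical names, the 10-second timeout is an arbitrary choice, and the error handling is an addition that the examples above do not include. The request is built but not sent, so the block runs without network access.

```python
import urllib.error
import urllib.parse
import urllib.request

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}


def build_search_request(params, base_url='https://www.baidu.com/s'):
    """Encode params with urlencode and wrap the URL in a Request."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(url=f'{base_url}?{query}', headers=HEADERS)


def fetch(request, timeout=10):
    """Return the decoded response body, or None if the request fails."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status code
        print(f'HTTP error {exc.code} for {request.full_url}')
    except urllib.error.URLError as exc:
        # The server could not be reached at all
        print(f'Connection problem: {exc.reason}')
    return None


request = build_search_request({'wd': '周杰伦', 'sex': '男'})
print(request.full_url)

# content = fetch(request)  # uncomment to actually send the request
```

HTTPError is caught before URLError because it is a subclass of URLError; with the order reversed, the more specific branch would never run.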