当前位置：首页 > article >正文

网络爬虫-2:正则化

article 2025/3/13 19:42:33

1.正则化

一.正则化

1.转义字符

转义字符	含义
`\s`	空白字符（空格、制表符等）
`\d`	数字字符（0-9）
`\w`	字母、数字或下划线
`.`	除换行符外的任意字符
`\n`	换行符
`\t`	制表符

import re
result = re.findall(r'\s', 'Hello World\nPython')
print(result)  # 输出: [' ', '\n']
result = re.findall(r'\d', 'Python 3.10')
print(result)  # 输出: ['3', '1', '0']
result = re.findall(r'\w', 'Python_3.10!')
print(result)  # 输出: ['P', 'y', 't', 'h', 'o', 'n', '_', '3', '1', '0']
result = re.findall(r'P.th.n', 'Python Pathon Pithon')
print(result)  # 输出: ['Python', 'Pathon', 'Pithon']
result = re.findall(r'\.', 'Python 3.10')
print(result)  # 输出: ['.']

2.转义字符

`{}`	量词，指定匹配次数
`*`	匹配前面的字符 0 次或多次
`+`	匹配前面的字符 1 次或多次
`?`	匹配前面的字符 0 次或 1 次
`$`	匹配字符串结尾
`^`	匹配字符串开头

import re
result = re.findall(r'\d{3}', '123 4567 89')
print(result)  # 输出: ['123', '456']
result = re.findall(r'ba*', 'ba baa baaa b')
print(result)  # 输出: ['ba', 'baa', 'baaa', 'b']
result = re.findall(r'ba+', 'ba baa baaa b')
print(result)  # 输出: ['ba', 'baa', 'baaa']
result = re.findall(r'ba?', 'ba baa baaa b')
print(result)  # 输出: ['ba', 'ba', 'ba', 'b']
result = re.findall(r'world$', 'world1 hello world')
print(result)  # 输出: ['world']
result = re.findall(r'^hello', 'hello world hello1')
print(result)  # 输出: ['hello']

1.re.match():仅从字符串开头

1.1常规匹配

re.match 是 Python 中 re 模块的一个函数，用于从字符串的起始位置匹配正则表达式。如果匹配成功，返回一个匹配对象；否则返回 None

import re
content="Hello 123 456789 World_This is a Regex Demo"
res=re.match("^Hello\s\d\d\d\s\d{6}\s\w{10}.*Demo$",content)
print(res)  #返回一个匹配的对象
print(res.group())#获取匹配的内容
print(res.span()) #获取匹配长度
print(len(content))

<re.Match object; span=(0, 43), match='Hello 123 456789 World_This is a Regex Demo'>
Hello 123 456789 World_This is a Regex Demo
(0, 43)
43

1.2泛匹配

content="Hello 123 456789 World_This is a Regex Demo"
res=re.match("He.*?Demo",content)
print(res)
print(res.group())
print(res.span())
print(len(content))

<re.Match object; span=(0, 43), match='Hello 123 456789 World_This is a Regex Demo'>
Hello 123 456789 World_This is a Regex Demo
(0, 43)
43

1.3分组匹配

import re
content="Hello 123 456789 World_This is a Regex Demo"
res=re.match("Hello\s(\d+)\s(\d{3})\d{3}\s(\w+)",content)
print(res)
print(res.group())
print(res.group(1))
print(res.group(2))
print(res.group(3))

<re.Match object; span=(0, 27), match='Hello 123 456789 World_This'>
Hello 123 456789 World_This
123
456
World_This

1.4贪婪匹配:尽可能多的去匹配

import re
content="Hello 123 w 456789 World_This is a Regex Demo"
res = re.match("^Hello.*(\d+)\s",content)#最后的\s会匹配到一个空格
res2=re.match("^Hello.*(\d*)",content)
print(res)
print(res.group())
print(res.group(1))
a=res2.group(1)
if not a:
    print("匹配到0个")

<re.Match object; span=(0, 19), match='Hello 123 w 456789 '>
Hello 123 w 456789 
9
匹配到0个

1.6非贪婪匹配:尽可能少的去匹配

.*?：非贪婪匹配任意字符（尽可能短）。
.+?：非贪婪匹配至少一个任意字符。

import re
content="Hello 123 w 456789 World_This is a Regex Demo"
res = re.match("^Hello.*?(\d+)\s",content)
print(res)
print(res.group(1))

<re.Match object; span=(0, 10), match='Hello 123 '>
123

1.6匹配模式

换行使用re.s或\n

import re
content=("""Hello 123 w 456789
World_This is a Regex Demo""")
res = re.match("^Hello.*Demo$",content,re.S)
res2=re.match("^Hello.*\n.*Demo$",content)
print(res)
print(res2)

<re.Match object; span=(0, 45), match='Hello 123 w 456789\nWorld_This is a Regex Demo'>
<re.Match object; span=(0, 45), match='Hello 123 w 456789\nWorld_This is a Regex Demo'>

1.7转义

在特殊字符前加\

import re
content="price is $5"
res=re.match("^price\s(.*)$5",content)
res1=re.match("^price\s(.*)\$5",content)
print(res)
print(res1)

None
<re.Match object; span=(0, 11), match='price is $5'>

2.re.search():搜索整个字符串

import re
content="Hello 123 w 456789 World_This is a Regex Demo price is $5"
res=re.search("price\s(.*)\$5$",content)
print(res)
print(res.group())
print(res.group(1))

<re.Match object; span=(46, 57), match='price is $5'>
price is $5
is

3.re.findall():因为re.match()和re.search(),都只能查找到符合的第一个字符使用findall查找所有的符合标准的字符

import re
content="""
<div class="songlist__artist">
    <a class="playlist__author" title="虞书欣" href="/n/ryqq/singer/0031rIlo4Xka96">虞书欣</a><!-- -->
    /<a class="playlist__author" title="丁禹兮" href="/n/ryqq/singer/004fOu5r1U3AJh">丁禹兮</a><!-- -->
    /<a class="playlist__author" title="祝绪丹" href="/n/ryqq/singer/003IHuTa1HGoKK">祝绪丹</a><!-- -->
    /<a class="playlist__author" title="杨仕泽" href="/n/ryqq/singer/0007YOgR1AUf1l">杨仕泽</a><!-- -->
    /<a class="playlist__author" title="费启鸣" href="/n/ryqq/singer/000ic7PL1ViRKA">费启鸣</a><!-- -->
    /<a class="playlist__author" title="李奕臻" href="/n/ryqq/singer/001w0v9Z0P1YuO">李奕臻</a><!-- -->
    /<a class="playlist__author" title="卢禹豪" href="/n/ryqq/singer/000bfUG63tBcwb">卢禹豪</a>
    </div>
<div class="songlist__time">03:49</div>
"""
res=re.findall('<a class="playlist__author"\stitle="(.*?)"\shref="(.*?)">(.*?)</a>',content,re.S)
print(res)
for i in res:
    print(i)

4.re.sub:替换原文中的字符为新的字符串

import re
content="timme time timese"
res=re.sub("m","7",content)
print(res)  #ti77e ti7e ti7ese

查看全文

http://www.kler.cn/a/583150.html

大语言模型-01-语言模型发展历程-03-预训练语言模型到大语言模型

两会期间的科技强音：DeepSeek技术引领人工智能新篇章

Node.js：快速启动你的第一个Web服务器

无人机快速发展，无人机反制如何应对？

第44天：WEB攻防-PHP应用SQL盲注布尔回显延时判断报错处理增删改查方式

【写作模板】JosieBook的写作模板

使用CPR库编写的爬虫程序

大语言模型微调和大语言模型应用区别

机器学习常见激活函数

vscode出现：No module named ‘requests‘ 问题的解决方法

ubuntu 在VirtualBox 打不开终端

Oracle RAC环境下自动清理归档日志实战指南

Java静态变量与PHP静态变量的对比

HTMLCSS绘制三角形

数据结构-队列（详解）

软件安全分析与应用之综合安全应用 (三)

深度学习基础：线性代数本质4——矩阵乘法

从零开始学机器学习——初探分类器

CI/CD—GitLab部署

【C++】string类的相关成员函数以及string的模拟实现