
Python (Regular Expressions)

article 2025/3/21 23:42:34

The re module

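# When you need to match strings against regular expressions in Python, use the re module
'''
Using the re module in three steps
# Step 1: import the re module
import re
# Step 2: call match() to attempt a match
result = re.match(pattern, string, flags=0)  # pattern: the regular expression; string: the text to match against
# Step 3: if the match succeeded, call group() to extract the matched text
result.group()

re.match(pattern, string, flags=0)
Matches only at the start of the string. Returns a match object on success, otherwise None.

re.findall(pattern, string, flags=0)
- Scans the whole string and returns a list of all non-overlapping matches of pattern
- Note: if pattern contains groups, the list holds the group contents instead of the full matches
- Example: `re.findall(r"\d", "chuan1zhi2")` returns `["1", "2"]`

re.finditer(pattern, string, flags=0)
 Same as findall above, except that it returns an iterator of match objects
'''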
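The difference between the three functions is easiest to see on one input. The sketch below reuses the "chuan1zhi2" string from the note above; the variable names are only illustrative.

```python
import re

text = "chuan1zhi2"

# re.match only tries position 0, so a digit pattern fails on this string.
at_start = re.match(r"\d", text)

# re.findall returns plain strings when the pattern has no groups...
all_digits = re.findall(r"\d", text)

# ...and tuples of group contents when it has more than one group.
grouped = re.findall(r"([a-z]+)(\d)", text)

# re.finditer yields match objects, which also know where each match starts.
positions = [(m.group(), m.start()) for m in re.finditer(r"\d", text)]

print(at_start)    # None
print(all_digits)  # ['1', '2']
print(grouped)     # [('chuan', '1'), ('zhi', '2')]
print(positions)   # [('1', 5), ('2', 9)]
```

finditer is usually the better choice when the position of each match matters or when the input is large, since it does not build the whole list up front.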

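'''
Parameters of match():

| Parameter | Description |
| --------- | ----------- |
| pattern   | The regular expression to match |
| string    | The string to match against |
| flags     | Flags that control how matching is done, e.g. case sensitivity or multi-line matching. See the flag table below. |

On success re.match returns a match object; otherwise it returns None.
Use the match object's group(num) or groups() methods to get the matched data.

A regular expression can take optional flags that modify how it matches. Combine several flags with bitwise OR (|).
For example, re.I | re.M sets both the I and M flags:

| Flag | Description |
| ---- | ----------- |
| re.I | Case-insensitive matching |
| re.L | Locale-aware matching: \w, \W, \b, \B and case-insensitive matching depend on the current locale. In Python 3 this flag is allowed only with bytes patterns and its use is discouraged. It does not help with matching Chinese characters. |
| re.M | Multi-line matching; affects ^ and $ |
| re.S | Makes . match any character, including a newline |
| re.U | Interprets characters according to the Unicode character set; affects \w, \W, \b, \B. In Python 3 this is already the default for str patterns. |
| re.X | VERBOSE mode: whitespace and # comments inside the pattern are ignored, so a long pattern (for example one that matches email addresses) can be laid out and commented to make it easier to read. |
'''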
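As a rough illustration of the flags in the table, the following sketch applies re.I, re.M, re.S and re.X to small inputs. The sample text and the decimal-number pattern are made up for the example.

```python
import re

text = "Hello\nworld"

# re.I: case-insensitive, so "hello" matches "Hello".
ci = re.match(r"hello", text, flags=re.I)

# re.I | re.M: ^ now also matches at the start of every line.
line_starts = re.findall(r"^[a-z]+", text, flags=re.I | re.M)

# Without re.S, "." stops at the newline; with re.S it runs across it.
without_s = re.match(r".*", text).group()
with_s = re.match(r".*", text, flags=re.S).group()

# re.X: whitespace and # comments inside the pattern are ignored.
number = re.compile(r"""
    \d+      # integer part
    \.       # literal dot
    \d+      # fractional part
""", flags=re.X)

print(ci.group())       # Hello
print(line_starts)      # ['Hello', 'world']
print(repr(without_s))  # 'Hello'
print(repr(with_s))     # 'Hello\nworld'
print(number.fullmatch("3.14") is not None)  # True
```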

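'''
Parameters of re.sub(pattern, repl, string, count=0, flags=0):
- pattern: the pattern string.
- repl: the replacement string; may also be a function.
- string: the original string to search and replace in.
- count: the maximum number of replacements; the default 0 replaces every match.
- flags: matching mode:
  - re.I makes matching case-insensitive (I stands for IGNORECASE)
  - re.S makes . match any character, including a newline
  - re.M multi-line mode; affects ^ and $
'''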
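A small sketch of how count and a function-valued repl behave in re.sub. The input string and the helper name double are invented for illustration.

```python
import re

text = "a1b22c333"

# Default count=0: every run of digits is replaced.
all_replaced = re.sub(r"\d+", "#", text)

# count=2: only the first two matches are replaced.
first_two = re.sub(r"\d+", "#", text, count=2)

# repl as a function: it receives the match object and returns the replacement string.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, text)

print(all_replaced)  # a#b#c#
print(first_two)     # a#b#c333
print(doubled)       # a2b44c666
```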

import re

def demo1_match():
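    result=re.match(pattern='.it.',string='aitbdaf')
    # match works character by character from the start of the string and cannot skip ahead
    # result=re.match(pattern='.it.',string='aaitbdaf')  # no match: the string does not start with . + it + .
    # . matches any single character (digits included) except \n
    if result:
        print(result.group())
    else:
        print('No characters matched the pattern')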

if __name__=='__main__':
    demo1_match()

def demo2_search():
    result=re.search(pattern=r'\d.*',string='sda1abc2efg')
    # \d matches the first digit found; .* then matches any number of characters after it
    if result:
        print(result.group())
    else:
        print('No characters matched the pattern')

if __name__=='__main__':
    demo2_search()
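To make the contrast with match explicit, the sketch below runs the same pattern through re.match and re.search on the string used in demo2_search.

```python
import re

text = "sda1abc2efg"

# match: the pattern must succeed at position 0, and "s" is not a digit.
m = re.match(r"\d.*", text)

# search: scans forward until the pattern succeeds somewhere.
s = re.search(r"\d.*", text)

print(m)          # None
print(s.group())  # 1abc2efg
print(s.span())   # (3, 11)
```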

def demo3_replace替换字符串():

    import re
    sentence = "车主说:你的刹车片应该更换了啊,嘿嘿"

    # Regular expression: remove filler particles
    p = r"呢|吧|哈|啊|啦|嘿|嘿嘿"
    r = re.compile(pattern=p)
    mystr = r.sub('', sentence)
    print('mystr-->', mystr)

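    # Regular expression: delete every character other than Chinese characters, digits, letters, _ and ，！？。.-
    # \u4e00-\u9fa5 is the range commonly used to test for Chinese characters
    p = r"[^，！？。.\-\u4e00-\u9fa5_a-zA-Z0-9]"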
    r = re.compile(pattern=p)
    mystr = r.sub('', sentence)
    print('mystr-->', mystr)

    # Half-width to full-width, e.g. sentence.replace(",", "，") for commas, exclamation marks, question marks
    sentence = "你好."
    mystr = sentence.replace(".", "。")
    print('mystr-->', mystr)

if __name__=='__main__':
    demo3_replace替换字符串()

Matching a single character

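# Matching a single character: feature overview
# 1 .   matches any single character (except \n)
# 2 [ ]  matches any one of the characters listed inside [ ]
# 3 \d  matches a digit, i.e. 0-9 => [0123456789] => [0-9]
# 4 \D  matches a non-digit  # the uppercase form is usually the negation
# 5 \s  matches whitespace, e.g. space, tab, newline
# 6 \S  matches non-whitespace
# 7 \w  matches a word character, i.e. a-z, A-Z, 0-9, _, and Chinese characters
# 8 \W  matches a non-word character, i.e. not a letter, digit, _ or Chinese character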



import re

# # 1 . matches any single character (except \n)
# # Matching proceeds left to right, one character after another
# result = re.match("itcast.", "itcast2")
#
# # Get the data
# if result:
#     info = result.group()
#     print(info)
# else:
#     print("No match")



# # 2 [ ]  matches any one of the characters listed inside [ ]
# # [a-z]  [A-Z] [0-9]   [a-zA-Z0-9]
# # Match the data
# result = re.match("itcast[123abc]", "itcast376")
#
# # Get the data
# if result:
#     info = result.group()
#     print(info)
# else:
#     print("No match")



# # 3 \d  matches a digit, i.e. 0-9 => [0123456789] => [0-9]
# # Match the data
# result = re.match(r"itcast\d", "itcast5")
#
# # Get the data
# if result:
#     info = result.group()
#     print(info)
# else:
#     print("No match")



# # 4 \D  matches a non-digit
# # Match the data
# result = re.match(r"itcast\D", "itcast-")
#
# # Get the data
# if result:
#     info = result.group()
#     print(info)
# else:
#     print("No match")



# # 5 \s  matches whitespace, e.g. space, tab
# # Match the data
# result = re.match(r"itcast\s111", "itcast\t111")
#
# # Get the data
# if result:
#     info = result.group()
#     print(info)
# else:
#     print("No match")



# # 6 \S  matches non-whitespace
# # Match the data (no match here: the character after itcast is a tab)
# result = re.match(r"itcast\S", "itcast\t")
#
# # Get the data
# if result:
#     info = result.group()
#     print(info)
# else:
#     print("No match")



# # 7 \w  matches a word character, i.e. a-z, A-Z, 0-9, _, and Chinese characters
# # Match the data
# result = re.match(r"itcast\w", "itcasta")
# # result = re.match(r"itcast\w", "itcast!")
#
# # Get the data
# if result:
#     info = result.group()
#     print(info)
# else:
#     print("No match")



# 8 \W  matches a non-word character, i.e. not a letter, digit, _ or Chinese character
# Match the data
result = re.match(r"itcast\W", "itcast\t2aa")

# Get the data
if result:
    info = result.group()
    print(info)
else:
    print("No match")
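The character classes above can be compared side by side with findall. This is a minimal sketch on an invented sample string; re.A (ASCII-only matching) is not covered in the notes above and is included only to show how \w changes without Unicode support.

```python
import re

sample = "ab_12 中文\t!"

digits = re.findall(r"\d", sample)                    # digits only
whitespace = re.findall(r"\s", sample)                # space and tab
non_word = re.findall(r"\W", sample)                  # everything \w rejects
word_chars = "".join(re.findall(r"\w", sample))       # letters, digits, _, Chinese characters
ascii_only = "".join(re.findall(r"\w", sample, flags=re.A))  # ASCII word characters only

print(digits)      # ['1', '2']
print(whitespace)  # [' ', '\t']
print(non_word)    # [' ', '\t', '!']
print(word_chars)  # ab_12中文
print(ascii_only)  # ab_12
```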

Matching multiple characters

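# 1 *   matches the preceding character 0 or more times, i.e. it is optional
# 2 +   matches the preceding character 1 or more times, i.e. at least once
# 3 ?   matches the preceding character 0 or 1 times, i.e. once or not at all
# 4 {m}  matches the preceding character exactly m times
# 5 {m,n}  matches the preceding character from m to n times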

import re
# 1 *   matches the preceding character 0 or more times, i.e. it is optional
# result = re.match("itcast1*", "itcast111123333itcast")
# result = re.match("itcast1*itcast", "itcast111123333itcast")
# result = re.match("itcast1*itcast", "itcast1111itcast")
result = re.match(r"itcast\d*itcast", "itcast11112222itcast")
if result:
    info = result.group()
    print(info)
else:
    print("No match")
print('_'*30)

# 2 +   matches the preceding character 1 or more times, i.e. at least once
# result = re.match("itcast1+itcast", "itcastitcast")
result = re.match("itcast1+itcast", "itcast1itcast")
if result:
    info = result.group()
    print(info)
else:
    print("No match")
print('_'*30)

# 3 ?   matches the preceding character 0 or 1 times, i.e. once or not at all
# result = re.match("itcast1?", "itcast1itcast")
result = re.match("itcast1?itcast", "itcast1itcast")
if result:
    info = result.group()
    print(info)
else:
    print("No match")
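The {m} and {m,n} forms are listed above but not demonstrated, so here is a short sketch. The lazy form with a trailing ? is an addition not covered in the list above.

```python
import re

# {m}: exactly m repetitions.
exact = re.match(r"\d{3}", "12345").group()

# {m,n}: between m and n repetitions; quantifiers are greedy and take as many as they can.
ranged = re.match(r"\d{2,4}", "12345").group()

# {m,}: at least m repetitions, no upper limit.
at_least = re.match(r"\d{2,}", "12345").group()

# Too few repetitions means no match at all.
too_short = re.match(r"\d{3}", "12a")

# A trailing ? makes the quantifier lazy: it takes as few characters as possible.
lazy = re.match(r"\d{2,4}?", "12345").group()

print(exact, ranged, at_least, too_short, lazy)  # 123 1234 12345 None 12
```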

Matching the start and end

# 1 ^   matches the start of the string
# 2 $   matches the end of the string
# 3 [^chars]  matches any character other than the listed ones


import re


# # 1-1 ^ matches the start of the string
# # Match a string that starts with exactly one digit followed by itcast
# result = re.match(r"^\ditcast", "2itcast")   # 1 one digit at the start + itcast
# # result = re.match(r"^\ditcast", "22itcast")  # 2 no match
#
# # Get the data
# if result:
#     info = result.group()
#     print(info)
# else:
#     print("No match")



# # 1-2 A string that starts with a digit
# result = re.match(r"^\d.*", "22itcast")  # "^\d": starts with a digit, ".*": followed by any characters
# # result = re.match(r"^\d{1,3}it", "1itcast")  # "^\d{1,3}": starts with 1 to 3 digits, followed by "it"
#
# # Get the data
# if result:
#     info = result.group()
#     print(info)
# else:
#     print("No match")



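# # 2-1 $  matches the end of the string
# result = re.match(r".*\d$", "itcast66")  # ".*": zero or more characters at the start, "\d$": ends with a digit
# # result = re.match(r".*\d{5}$", "itcast666")  # ".*": zero or more characters at the start, "\d{5}$": ends with 5 digits (no match here)
#
# # Get the data
# if result:
#     info = result.group()
#     print(info)
# else:
#     print("No match")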



# # 3 Match a string that starts with a digit and ends with a digit
# result = re.match(r"^\d.*\d$", "11itcast22")
#
# # Get the data
# if result:
#     info = result.group()
#     print(info)
# else:
#     print("No match")



# 4 [^chars]  matches any character other than the listed ones
result = re.match(r"^\d.*[^4]$", "11itcast@")
# result = re.match(r"^\d.*[^4]$", "11itcast4")

# Get the data
if result:
    info = result.group()
    print(info)
else:
    print("No match")
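Two details of the anchors are easy to miss: re.match is already anchored at the start, and $ also matches just before a single trailing newline. The sketch below shows both; re.fullmatch is not used elsewhere in these notes and is shown only as a stricter alternative.

```python
import re

# re.match already starts at position 0, so the leading ^ changes nothing here.
a = re.match(r"\d.*\d$", "11itcast22")
b = re.match(r"^\d.*\d$", "11itcast22")

# $ matches at the very end and also just before one trailing newline.
trailing_newline = re.match(r"^\d+$", "123\n")

# re.fullmatch requires the entire string to match, newline included.
strict = re.fullmatch(r"\d+", "123\n")

# Inside [ ], a leading ^ negates the set instead of anchoring.
not_four = re.findall(r"[^4]", "1424")

print(a.group() == b.group())    # True
print(trailing_newline.group())  # 123
print(strict)                    # None
print(not_four)                  # ['1', '2']
```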

Matching groups

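# 1 Task: from the list ["apple", "banana", "orange", "pear"], match apple and pear
# 2 Task: match 163, 126 and qq email addresses
# 3 Task: match data such as qq:10567 and extract the qq label and the qq number


# Topics covered:
# 1 |  matches either the expression on its left or the one on its right
# 2 (ab)  treats the characters in parentheses as a group
# 3 \ escape character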

import re



# # 1 Task: from the list ["apple", "banana", "orange", "pear"], match apple and pear
# fruit = ["apple", "banana", "orange", "pear"]
#
# # Check each string
# for value in fruit:
#     result = re.match("apple|pear", value)
#     # Check whether the match succeeded
#     if result:
#         info = result.group()
#         print("Fruit I want to eat:",value)
#     else:
#         print(f"Not a fruit I want to eat: {value}")




# # 2 Task: match 163, 126 and qq email addresses
# # |  matches either the expression on its left or the one on its right
# # (ab)  treats the characters in parentheses as a group
# # \ escape character
#
# # 2-1
# # result = re.match("[a-zA-Z0-9_]{4,20}@163|126|qq.com", "hello@163.com")  # only "hello@163" is matched
#
# # 2-2
# # No match, because the pattern splits into: "[a-zA-Z0-9_]{4,20}@163" | "126" | "qq.com"
# # result = re.match("[a-zA-Z0-9_]{4,20}@163|126|qq.com", "hello@qq.com")  # no match
#
# # 2-3 (ab)  treats the characters in parentheses as a group
# # result = re.match("[a-zA-Z0-9_]{4,20}@(163|126|qq).com", "hello@qq.com")
# # result = re.match("[a-zA-Z0-9_]{4,20}@(163|126|qq).com", "hello@qqxcom")  # also matches, because . is unescaped
# result = re.match(r"[a-zA-Z0-9_]{4,20}@(163|126|qq)\.com", "hello@qq.com")  # the dot must be escaped
#
# info = result.group()
#
# print('result-->', result)
# print(info)




# # 3 Task: match data such as qq:10567 and extract the qq label and the qq number
# # group(0)/group() is the whole match; 1 is the first group, 2 is the second group, numbered left to right
# # result = re.match(r"(qq):([0-9]\d{4,11})", "qq:10567")  # qq numbers normally do not start with 0
# result = re.match(r"(qq):([1-9]\d{4,11})", "qq:10567")
# if result:
#     info = result.group(0)
#     print(info)
#
#     num = result.group(2)
#     print(num)
#
#     type = result.group(1)
#     print(type)
# else:
#     print("Match failed")



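# Groups, backreferences, and naming a group
# 4 Task: match <html>hh</html>
# \num  refers back to the string matched by group num
# result = re.match("<([a-zA-Z1-6]{4})>.*</([a-zA-Z1-6]{4})>", "<html>hh</html>")
# result = re.match("<([a-zA-Z1-6]{4})>.*</\\1>", "<html>hh</html>")
result = re.match(r"<([a-zA-Z1-6]{4})>.*</\1>", "<html>hh</html>")  # with the r prefix (raw string) the backslash needs no escaping
if result:
    info = result.group()
    print(info)
else:
    print("Match failed")

# Test prints to compare the two: '\1' is the control character chr(1), '\\1' is a backslash followed by 1
print('\1')
print('\\1')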




# # 5 Task: match <html><h1>www.itcast.cn</h1></html>
# result = re.match("<([a-zA-Z1-6]{4})><([a-zA-Z1-6]{2})>.*</\\2></\\1>", "<html><h1>www.itcast.cn</h1></html>")
# if result:
#     info = result.group()
#     print(info)
# else:
#     print("Match failed")



# # 6 Task: match <html><h1>www.itcast.cn</h1></html>
# # (?P<name>)  gives the group a name
# # (?P=name)  refers back to the string matched by the group called name
# result = re.match("<(?P<html>[a-zA-Z1-6]{4})><(?P<h1>[a-zA-Z1-6]{2})>.*</(?P=h1)></(?P=html)>", "<html><h1>www.itcast.cn</h1></html>")
# # (?P<name>...) names the group that follows; (?P=name) then matches the same text that group captured, not the same rule applied again.
# if result:
#     info = result.group()
#     print(info)
# else:
#     print("Match failed")
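The group features from this section can be combined in one runnable sketch. The group names label, number and tag are chosen for illustration, and the non-capturing form (?:...) is an addition not covered in the notes above.

```python
import re

# Named groups capture like (...) but can also be read by name.
m = re.match(r"(?P<label>qq):(?P<number>[1-9]\d{4,11})", "qq:10567")
whole = m.group(0)            # the entire match
label = m.group("label")      # same as m.group(1)
number = m.group(2)           # same as m.group("number")
as_dict = m.groupdict()

# (?P=tag) must repeat the exact text captured by the group named tag.
tag_pattern = re.compile(r"<(?P<tag>[a-zA-Z1-6]+)>.*</(?P=tag)>")
balanced = tag_pattern.match("<html>hh</html>")
unbalanced = tag_pattern.match("<html>hh</body>")

# (?:...) groups without capturing, so findall returns the full matches.
emails = re.findall(r"\w{4,20}@(?:163|126|qq)\.com", "hello@qq.com, world@163.com")

print(whole, label, number)  # qq:10567 qq 10567
print(as_dict)               # {'label': 'qq', 'number': '10567'}
print(balanced.group())      # <html>hh</html>
print(unbalanced)            # None
print(emails)                # ['hello@qq.com', 'world@163.com']
```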

Exercise

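'''
# Small exercise
1 Read a string with the input function
2 Required pattern: the first 5 characters are letters, the last 5 are digits
3 Print "match ok" on success, otherwise print "match failed"
'''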
import re

str_input=input('Enter a string to match: ')
result=re.match(pattern=r'^[a-zA-Z]{5}.*[0-9]{5}$',string=str_input)
if result:
    print('match ok')
else:
    print('match failed')
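For checking the exercise without typing at a prompt, the same rule can be wrapped in a function. This is one possible sketch: the names PATTERN and check are hypothetical, and re.fullmatch replaces the explicit ^ and $ anchors.

```python
import re

# First 5 characters are letters, last 5 are digits, anything (except \n) in between.
PATTERN = re.compile(r"[a-zA-Z]{5}.*[0-9]{5}")

def check(text):
    """Return 'match ok' if text follows the exercise rule, else 'match failed'."""
    return "match ok" if PATTERN.fullmatch(text) else "match failed"

print(check("hello12345"))         # match ok
print(check("hello-world-12345"))  # match ok
print(check("hell012345"))         # match failed: the 5th character is a digit
print(check("hello1234"))          # match failed: only 4 digits at the end
```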

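'''
Writing a regular expression in three steps: what to match, how many, and where
A regular expression is usually made up of two kinds of characters: ordinary characters and metacharacters
Ordinary characters: 0123456789abcd@...
Metacharacters: symbols with a special meaning in regular expressions => [0-9], ^, *, +, ?

 1. What to match
| Code | Meaning |
| ---- | ------- |
| . (period) | matches any single character (except \n) |
| [ ]  | matches any one of the characters listed inside [ ]; the technical term is character class |
| [^chars] | matches any one character other than the listed ones; ^ is called the caret |
| \d   | matches a digit, i.e. 0-9 |
| \D   | matches a non-digit |
| \s   | matches whitespace, e.g. space, tab |
| \S   | matches non-whitespace |
| \w   | matches a word character, i.e. a-z, A-Z, 0-9, _ |
| \W   | matches a non-word character, i.e. not a letter, digit or underscore |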
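Common character class forms:
① [abcdefg] matches any one of the characters a, b, c, d, e, f, g
② [aeiou] matches any one of the five characters a, e, i, o, u
③ [a-z] matches any one of the 26 characters from a to z
④ [A-Z] matches any one of the 26 characters from A to Z
⑤ [0-9] matches any one of the 10 characters from 0 to 9
⑥ [0-9a-zA-Z] matches any one character from 0-9, a-z or A-Z


A character class combined with a caret means negation:
① [^aeiou] matches any one character other than a, e, i, o, u
② [^a-z] matches any one character other than a-z
\d is equivalent to [0-9] (for ASCII digits) and matches any one digit from 0 to 9
\D is equivalent to [^0-9] and matches exactly one non-digit character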
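2. How many
| Code  | Meaning |
| ----- | ------- |
| *     | matches the preceding character 0 or more times, i.e. it is optional (0 to many) |
| +     | matches the preceding character 1 or more times, i.e. at least once (1 to many) |
| ?     | matches the preceding character 0 or 1 times, i.e. once or not at all (0 or 1) |
| {m}   | matches the preceding character exactly m times, e.g. \d{11} for a mobile phone number |
| {m,}  | matches the preceding character at least m times; \w{3,} means at least 3 occurrences with no upper limit |
| {m,n} | matches the preceding character from m to n times; \w{6,10} means 6 to 10 occurrences |

Basic syntax:
a character matcher such as . or \w or \S, followed by a quantifier
e.g. \w{6,10} (no space inside the braces, otherwise it is not read as a quantifier)
e.g. .* matches any character (except \n) 0 or more times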

3. Where
| Code | Meaning |
| ---- | ------- |
| ^    | matches at the start of the string |
| $    | matches at the end of the string |
'''
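The three questions (what, how many, where) can be read off directly from a finished pattern. The sketch below builds two small validators this way. The function names and the specific rules (a 6 to 10 character username, an 11-digit phone number) are assumptions based on the examples in the tables, not a complete validation scheme.

```python
import re

# what: \w (word character), how many: {6,10}, where: from ^ to $
USERNAME = re.compile(r"^\w{6,10}$")

def is_username(text):
    return USERNAME.match(text) is not None

# what: \d, how many: {11}, where: the whole string (fullmatch needs no anchors)
def is_phone(text):
    return re.fullmatch(r"\d{11}", text) is not None

print(is_username("itcast_01"))   # True
print(is_username("abc"))         # False: too short
print(is_username("itcast 01"))   # False: a space is not a word character
print(is_phone("13812345678"))    # True
print(is_phone("1381234567"))     # False: only 10 digits
```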
