Python Text Processing

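Text processing in Python means using string operations, regular expressions, and related libraries to clean, transform, analyze, and extract information from text data. It is a foundational skill for data processing, natural language processing (NLP), log analysis, and similar fields. This chapter covers common text processing techniques in Python: basic string operations, regular expressions, encoding handling, and practical examples.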

Basic String Operations

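A Python string (str) is an immutable sequence that supports indexing, slicing, and many built-in methods. Because strings are immutable, any method that appears to modify a string actually returns a new one. The most common operations are covered in this section.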
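Before turning to specific operations, here is a minimal sketch of indexing, slicing, and immutability. The variable names are illustrative only and do not come from the original text.

```python
s = "Python"

first = s[0]           # indexing starts at 0
last = s[-1]           # negative indices count from the end
middle = s[1:4]        # slice from index 1 up to (not including) 4
reversed_s = s[::-1]   # a step of -1 walks the string backwards

# Strings are immutable: item assignment raises TypeError.
try:
    s[0] = "J"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# To "change" a string, build a new one from pieces of the old one.
changed = "J" + s[1:]

print(first, last, middle, reversed_s, changed)
```

The original string `s` is untouched throughout; every expression above produces a new str object.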

String Concatenation and Formatting

Strings can be concatenated with the + operator, or formatted with f-strings (Python 3.6+) or the str.format() method:

# String concatenation
str1 = "Hello"
str2 = "World"
result = str1 + ", " + str2 + "!"
print(result)  # Output: Hello, World!

# f-string formatting
name = "Alice"
age = 25
print(f"{name} is {age} years old.")  # Output: Alice is 25 years old.
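The str.format() method is mentioned above but not shown. The following sketch illustrates it, along with format specifiers and str.join(). The latter is an addition here rather than something the original covers, and is often preferred over repeated + when combining many pieces.

```python
name = "Alice"
age = 25

# str.format() with positional and with named replacement fields
positional = "{} is {} years old.".format(name, age)
named = "{n} is {a} years old.".format(n=name, a=age)

# Format specifiers use the same mini-language in f-strings and format()
price = 3.14159
formatted_price = f"{price:.2f}"      # round to two decimal places
padded = "{:>8}".format("right")      # right-align within 8 characters

# str.join() combines an iterable of strings with a separator
words = ["Hello", "World"]
joined = ", ".join(words) + "!"

print(positional)
print(formatted_price, repr(padded), joined)
```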

Common String Methods

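Examples of common string methods
Method	Description	Example
`str.lower()`	Convert to lowercase	`"HELLO".lower() → "hello"`
`str.upper()`	Convert to uppercase	`"hello".upper() → "HELLO"`
`str.strip()`	Remove leading and trailing whitespace	`" text ".strip() → "text"`
`str.split()`	Split on a separator	`"a,b,c".split(",") → ["a", "b", "c"]`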
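The methods in the table are usually chained. The sketch below normalizes a comma-separated line and also shows str.replace(), which the chapter summary mentions. The sample data is made up for illustration.

```python
raw = "  Alice,BOB , carol  "

# Strip the whole line, split on commas, then normalize each field.
names = [field.strip().lower() for field in raw.strip().split(",")]

# split() with no argument splits on runs of whitespace
# and never produces empty strings.
tokens = "  too   many    spaces ".split()

# replace() returns a new string with every occurrence substituted.
date_with_slashes = "2023-01-21".replace("-", "/")

print(names)
print(tokens)
print(date_with_slashes)
```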

Regular Expressions

Regular expressions are a powerful tool for matching patterns in text. Python supports them through the re module in the standard library.

Basic Matching

import re

text = "The rain in Spain"
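# Find every word that matches the pattern
matches = re.findall(r"\b\w{4}\b", text)  # match words of exactly 4 characters
print(matches)  # Output: ['rain']  ("Spain" has 5 letters, so it does not match)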
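Besides re.findall, the re module offers several other entry points. The following sketch applies re.search, re.finditer, re.sub, and re.compile to the same sample sentence. The patterns are chosen for illustration and are not from the original text.

```python
import re

text = "The rain in Spain"

# re.search returns the first match object, or None if nothing matches.
m = re.search(r"ain", text)
first_span = m.span() if m else None

# re.finditer yields match objects, which carry positions as well as text.
positions = [(mo.group(), mo.start()) for mo in re.finditer(r"\b\w+ain\b", text)]

# re.sub replaces every match and returns a new string.
replaced = re.sub(r"ain\b", "AIN", text)

# re.compile builds a reusable pattern object, handy when a pattern is used often.
capitalized_word = re.compile(r"\b[A-Z]\w*")
capitalized = capitalized_word.findall(text)

print(first_span, positions, replaced, capitalized)
```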

Common Regular Expression Symbols

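Regular expression metacharacters
Symbol	Meaning	Example
`.`	Matches any character except a newline	`r"a.c"` matches "abc" and "a c"
`\d`	Matches a digit	`r"\d+"` matches "123"
`\w`	Matches a letter, digit, or underscore	`r"\w+"` matches "word_1"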
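The entries in the table can be checked directly with re.findall. The sketch below also shows two flags that change their behavior: re.DOTALL lets `.` match a newline, and re.ASCII restricts `\d` and `\w` to ASCII characters (by default, str patterns in Python 3 match Unicode digits and word characters as well). The sample strings are invented for illustration.

```python
import re

# '.' matches any character except a newline
dot_matches = re.findall(r"a.c", "abc a c a\nc")

# With re.DOTALL, '.' matches a newline too
dotall_matches = re.findall(r"a.c", "a\nc", flags=re.DOTALL)

# '\d+' matches runs of digits
digit_runs = re.findall(r"\d+", "Order 123 shipped in 4 boxes")

# '\w+' matches letters, digits, and underscores, but not '-' or ','
word_runs = re.findall(r"\w+", "word_1, word-2")

# By default '\d' also matches non-ASCII decimal digits such as
# ARABIC-INDIC DIGIT ONE..THREE; re.ASCII limits it to 0-9.
mixed = "\u0661\u0662\u0663 456"
unicode_digits = re.findall(r"\d+", mixed)
ascii_digits = re.findall(r"\d+", mixed, flags=re.ASCII)

print(dot_matches, dotall_matches, digit_runs, word_runs, ascii_digits)
```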

Encoding and Decoding

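Text processing often involves different encodings, such as UTF-8 and ASCII. In Python 3, str holds Unicode text, and str.encode() and bytes.decode() use UTF-8 when no encoding is given. The default encoding used by open() in text mode, by contrast, can depend on the platform locale, so it is safer to pass encoding="utf-8" explicitly when reading or writing files. Encoding and decoding a string looks like this: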

text = "中文"
encoded = text.encode("utf-8")  # b'\xe4\xb8\xad\xe6\x96\x87'
decoded = encoded.decode("utf-8")  # "中文"
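Encoding problems usually surface as UnicodeEncodeError or UnicodeDecodeError. The sketch below triggers both and shows how the errors argument changes the outcome. The use of GBK as the mismatched codec is an assumption made for illustration; the original text mentions only UTF-8 and ASCII.

```python
text = "中文 text"

utf8_bytes = text.encode("utf-8")   # 3 bytes per Chinese character here
gbk_bytes = text.encode("gbk")      # 2 bytes per Chinese character

# ASCII cannot represent non-ASCII characters, so strict encoding fails.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The errors argument controls what happens instead of raising.
replaced = text.encode("ascii", errors="replace")   # unencodable chars become '?'
ignored = text.encode("ascii", errors="ignore")     # unencodable chars are dropped

# Decoding bytes with the wrong codec also fails in strict mode.
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

# Decoding with the codec that produced the bytes round-trips cleanly.
round_trip = gbk_bytes.decode("gbk")

print(len(text), len(utf8_bytes), len(gbk_bytes), replaced, ignored)
```

Note that len() counts characters for str and bytes for bytes, which is why the three lengths differ.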

Practical Examples

Example 1: Log Analysis

Extract the IP address and access time from a server log line:

import re

log = "192.168.1.1 - - [21/Jan/2023:10:15:32] GET /index.html"
pattern = r"(\d+\.\d+\.\d+\.\d+).*?\[(.*?)\]"
match = re.search(pattern, log)
print(match.groups())  # Output: ('192.168.1.1', '21/Jan/2023:10:15:32')
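In practice a log has many lines, some of them malformed. The following sketch extends the same idea with named groups, a None check, and conversion of the timestamp to a datetime. The names LOG_PATTERN and parse_line are hypothetical, and the line layout is assumed to be the simplified one from the example above; real server logs often include a time zone and a quoted request string.

```python
import re
from datetime import datetime

# Hypothetical pattern for the simplified log layout used in this chapter.
LOG_PATTERN = re.compile(
    r"(?P<ip>\d+\.\d+\.\d+\.\d+).*?"
    r"\[(?P<time>.*?)\]\s+"
    r"(?P<method>[A-Z]+)\s+(?P<path>\S+)"
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_PATTERN.search(line)
    if m is None:
        return None
    record = m.groupdict()
    # %b parses abbreviated month names according to the current locale;
    # "Jan" assumes an English or C locale.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S")
    return record

lines = [
    "192.168.1.1 - - [21/Jan/2023:10:15:32] GET /index.html",
    "malformed line",
    "10.0.0.7 - - [21/Jan/2023:10:16:01] POST /login",
]

# Skip lines that do not match instead of crashing on match.groups().
records = [r for r in map(parse_line, lines) if r is not None]

for r in records:
    print(r["ip"], r["time"].isoformat(), r["method"], r["path"])
```

Checking for None matters because re.search returns None on a non-matching line, and calling .groups() on it would raise AttributeError.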

Example 2: Data Cleaning

Clean up irregular spacing in a line of CSV data:

dirty_data = "John Doe, 25,  New York"
clean_data = ",".join([x.strip() for x in dirty_data.split(",")])
print(clean_data)  # Output: John Doe,25,New York
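The same approach can be wrapped in small helpers and applied to several rows. The sketch below also collapses runs of whitespace inside a field, which strip() alone does not do. The function names clean_field and clean_row are hypothetical, and the naive split on commas assumes no field contains a quoted comma; for such data the standard csv module is the better fit.

```python
import re

def clean_field(field):
    """Strip the ends and collapse internal whitespace runs to one space."""
    return re.sub(r"\s+", " ", field.strip())

def clean_row(row, sep=","):
    """Clean every field of one delimited line (assumes no quoted separators)."""
    return sep.join(clean_field(f) for f in row.split(sep))

rows = [
    "John   Doe, 25,  New York",
    " Jane Roe ,30,Boston  ",
]
cleaned = [clean_row(r) for r in rows]

for line in cleaned:
    print(line)
```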

Math Formula Example (Optional)

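In text similarity calculations, the cosine similarity of two vectors $A$ and $B$ is: $\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$, where $A$ and $B$ are vector representations of the two texts (for example, word counts) and $\theta$ is the angle between them.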
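As a rough illustration of the formula, the sketch below computes cosine similarity over simple word-count vectors. The function name cosine_similarity, the bag-of-words representation, and the whitespace tokenization are all assumptions made for this example; the original text does not specify how the vectors are built.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts using lowercase word counts as vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())

    # A . B: only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())

    # ||A|| and ||B||: Euclidean norms of the count vectors.
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))

    # An empty text has a zero vector; define its similarity as 0.0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity("the cat", "the dog"))
```

Identical texts score close to 1, and texts with no words in common score 0.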

Summary

Key points of Python text processing:

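Master the basic string operations (concatenation, splitting, replacement)
Become fluent with regular expressions for pattern matching
Pay attention to text encoding (UTF-8/ASCII)
Choose the method that fits the situation (for example, str methods vs. the re module)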
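The last point can be illustrated with a short sketch: fixed substrings are usually handled more simply with str methods, while the re module earns its place when the target is a pattern. The sample text is invented for illustration.

```python
import re

text = "price: 10 USD, 20 USD"

# Fixed substring: plain str operations are enough and easier to read.
has_usd = "USD" in text
in_euros = text.replace("USD", "EUR")

# Variable pattern (any run of digits): this is where re is needed.
amounts = [int(n) for n in re.findall(r"\d+", text)]

# re.sub accepts a function, so each match can be transformed individually.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)

print(has_usd, in_euros, amounts, doubled)
```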