
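= Python Text Processing =

'''Text processing''' in Python means using string operations, regular expressions, and related libraries to clean, transform, analyze, and extract text data. It is a foundational skill for data processing, natural language processing (NLP), and log analysis. This chapter covers the common techniques: basic string operations, regular expressions, encoding handling, and practical examples.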

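== Basic String Operations ==
A Python string (<code>str</code>) is an immutable sequence that supports indexing, slicing, and many built-in methods. Common string operations include: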

=== String Concatenation and Formatting ===
Strings can be concatenated with the <code>+</code> operator, or formatted with f-strings (Python 3.6+) or the <code>format()</code> method:

<syntaxhighlight lang="python">
# String concatenation
str1 = "Hello"
str2 = "World"
result = str1 + ", " + str2 + "!"
print(result)  # Output: Hello, World!

# f-string formatting
name = "Alice"
age = 25
print(f"{name} is {age} years old.")  # Output: Alice is 25 years old.
</syntaxhighlight>
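
The <code>format()</code> method mentioned above is not shown in the example, so here is a minimal sketch of it alongside <code>str.join()</code>, which is often a cleaner alternative to repeated <code>+</code>. The variable names are illustrative only.

<syntaxhighlight lang="python">
name = "Alice"
age = 25

# str.format() fills the {} placeholders by position...
by_position = "{} is {} years old.".format(name, age)
# ...or by keyword (n and a are hypothetical placeholder names)
by_keyword = "{n} is {a} years old.".format(n=name, a=age)

# str.join() builds one string from a list of pieces
words = ["Hello", "World"]
joined = ", ".join(words) + "!"

print(by_position)  # Output: Alice is 25 years old.
print(by_keyword)   # Output: Alice is 25 years old.
print(joined)       # Output: Hello, World!
</syntaxhighlight>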

=== Common String Methods ===
{| class="wikitable"
|+ Examples of common string methods
! Method !! Description !! Example
|-
| <code>str.lower()</code> || Convert to lowercase || <code>"HELLO".lower() → "hello"</code>
|-
| <code>str.upper()</code> || Convert to uppercase || <code>"hello".upper() → "HELLO"</code>
|-
| <code>str.strip()</code> || Remove leading and trailing whitespace || <code>"  text  ".strip() → "text"</code>
|-
| <code>str.split()</code> || Split on a separator || <code>"a,b,c".split(",") → ["a", "b", "c"]</code>
|}
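
As a small sketch of how these methods combine in practice, the example below applies them step by step to one sample string. It also uses <code>str.replace()</code>, which is not in the table but is a standard <code>str</code> method. Because strings are immutable, each call returns a new string and leaves the original unchanged.

<syntaxhighlight lang="python">
raw = "  Hello, WORLD  "

stripped = raw.strip()        # "Hello, WORLD"
lowered = stripped.lower()    # "hello, world"
parts = lowered.split(", ")   # ["hello", "world"]
replaced = lowered.replace("world", "python")  # "hello, python"

print(parts)     # Output: ['hello', 'world']
print(replaced)  # Output: hello, python
print(repr(raw)) # The original string is unchanged
</syntaxhighlight>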

== Regular Expressions ==
A '''regular expression''' is a powerful tool for matching patterns in text. Python supports regular expressions through the <code>re</code> module.

=== Basic Matching ===
<syntaxhighlight lang="python">
import re

text = "The rain in Spain"
# Find every word that is exactly four characters long
matches = re.findall(r"\b\w{4}\b", text)
print(matches)  # Output: ['rain']  ("Spain" has five letters, so it does not match)
</syntaxhighlight>

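=== Common Regex Symbols ===
{| class="wikitable"
|+ Regular expression metacharacters
! Symbol !! Meaning !! Example
|-
| <code>.</code> || Matches any character except a newline || <code>r"a.c"</code> matches "abc", "a c"
|-
| <code>\d</code> || Matches a digit || <code>r"\d+"</code> matches "123"
|-
| <code>\w</code> || Matches a letter, digit, or underscore || <code>r"\w+"</code> matches "word_1"
|}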
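
The table rows can be checked directly with <code>re.findall()</code>. The sample string below is made up for illustration; note that <code>\d+</code> also picks up the digit inside "word_1".

<syntaxhighlight lang="python">
import re

sample = "abc a c word_1 123"  # illustrative input, not from a real dataset

dot_hits = re.findall(r"a.c", sample)    # "." matches any single character
digit_hits = re.findall(r"\d+", sample)  # one or more digits
word_hits = re.findall(r"\w+", sample)   # runs of letters, digits, underscores

print(dot_hits)    # Output: ['abc', 'a c']
print(digit_hits)  # Output: ['1', '123']
print(word_hits)   # Output: ['abc', 'a', 'c', 'word_1', '123']
</syntaxhighlight>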

<mermaid>
graph LR
    A[Raw text] --> B(Regex match)
    B --> C{Match found?}
    C -->|Yes| D[Extract result]
    C -->|No| E[Skip]
</mermaid>
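
The flow in the diagram can be sketched as a loop: try the pattern on each piece of text, extract the result when there is a match, and skip the text otherwise. The input lines and the pattern here are assumptions chosen for illustration.

<syntaxhighlight lang="python">
import re

# Hypothetical input lines
lines = ["order 1001 shipped", "no number here", "order 1002 pending"]
pattern = re.compile(r"\d+")  # compile once, reuse for every line

extracted = []
for line in lines:
    match = pattern.search(line)
    if match is None:
        continue                    # no match: skip this line
    extracted.append(match.group()) # match: extract the result

print(extracted)  # Output: ['1001', '1002']
</syntaxhighlight>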

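== Encoding and Decoding ==
Text processing often involves different encodings (such as UTF-8 and ASCII). In Python 3, a <code>str</code> holds Unicode text, and <code>str.encode()</code> and <code>bytes.decode()</code> use UTF-8 by default: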

<syntaxhighlight lang="python">
text = "中文"
encoded = text.encode("utf-8")  # b'\xe4\xb8\xad\xe6\x96\x87'
decoded = encoded.decode("utf-8")  # "中文"
</syntaxhighlight>
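
Encoding problems usually show up as exceptions when the codec cannot represent the text. The sketch below, using a made-up sample string, shows the failure with ASCII and how the standard <code>errors</code> parameter changes the behavior. Whether replacing characters is acceptable depends on the application.

<syntaxhighlight lang="python">
text = "中文 text"

# ASCII cannot represent the Chinese characters, so strict encoding fails
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# errors="replace" substitutes "?" for each unencodable character
lossy = text.encode("ascii", errors="replace")

# Decoding UTF-8 bytes with the wrong codec also fails unless errors are handled
utf8_bytes = text.encode("utf-8")
recovered = utf8_bytes.decode("ascii", errors="replace")

print(ascii_failed)  # Output: True
print(lossy)         # Output: b'?? text'
print(recovered.endswith(" text"))  # Output: True
</syntaxhighlight>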

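== Practical Examples ==
=== Example 1: Log Analysis ===
Extract the IP address and access time from a server log line:
<syntaxhighlight lang="python">
import re

log = "192.168.1.1 - - [21/Jan/2023:10:15:32] GET /index.html"
# Group 1: dotted IP address; group 2: the text inside the square brackets
pattern = r"(\d+\.\d+\.\d+\.\d+).*?\[(.*?)\]"
match = re.search(pattern, log)
print(match.groups())  # Output: ('192.168.1.1', '21/Jan/2023:10:15:32')
</syntaxhighlight>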
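
For more than one line, a possible extension is to use named groups and convert the timestamp to a <code>datetime</code>. This is a sketch under assumptions: the extra log lines are invented, the group names <code>ip</code> and <code>time</code> are arbitrary, and <code>%b</code> parses "Jan" only when the process uses an English or default C locale.

<syntaxhighlight lang="python">
import re
from datetime import datetime

# Named groups make the extracted fields self-describing
LOG_PATTERN = re.compile(r"(?P<ip>\d+\.\d+\.\d+\.\d+).*?\[(?P<time>[^\]]+)\]")

# Hypothetical log lines, including one that does not fit the format
logs = [
    "192.168.1.1 - - [21/Jan/2023:10:15:32] GET /index.html",
    "malformed line",
    "10.0.0.7 - - [21/Jan/2023:10:16:05] GET /about.html",
]

records = []
for line in logs:
    m = LOG_PATTERN.search(line)
    if m is None:
        continue  # skip lines that do not match
    when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S")
    records.append((m.group("ip"), when))

for ip, when in records:
    print(ip, when.isoformat())
</syntaxhighlight>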

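=== Example 2: Data Cleaning ===
Remove irregular whitespace around the fields of a CSV line:
<syntaxhighlight lang="python">
dirty_data = "John Doe, 25,  New York"
# Split on commas, strip each field, then join the fields back together
clean_data = ",".join([x.strip() for x in dirty_data.split(",")])
print(clean_data)  # Output: John Doe,25,New York
</syntaxhighlight>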
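
Splitting on <code>","</code> breaks when a quoted field itself contains a comma. One alternative, sketched here with invented sample data, is the standard-library <code>csv</code> module: <code>skipinitialspace=True</code> drops spaces after each delimiter, and a final <code>strip()</code> removes any trailing spaces.

<syntaxhighlight lang="python">
import csv
import io

# Hypothetical data; the second row has a quoted field containing a comma
dirty = 'John Doe, 25,  New York\n"Smith, Jane", 31 , Boston\n'

reader = csv.reader(io.StringIO(dirty), skipinitialspace=True)
rows = [[field.strip() for field in row] for row in reader]

for row in rows:
    print(row)
# Output:
# ['John Doe', '25', 'New York']
# ['Smith, Jane', '31', 'Boston']
</syntaxhighlight>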

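== Math Formula Example (Optional) ==
In text similarity computation, the '''cosine similarity''' of two text vectors <math>A</math> and <math>B</math> is: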
<math>
\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}
</math>
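
A minimal sketch of this formula, assuming each text is represented as a term-frequency vector built by lowercasing and splitting on whitespace. The function name <code>cosine_similarity</code> and this tokenization are illustrative choices, not part of any library.

<syntaxhighlight lang="python">
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts using word-count vectors (a sketch)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product A . B over the words the two texts share
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    # Euclidean norms ||A|| and ||B||
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an empty text as having no similarity
    return dot / (norm_a * norm_b)

print(cosine_similarity("the cat", "the dog"))  # one shared word out of two each
</syntaxhighlight>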

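== Summary ==
Key points of Python text processing:
* Master the basic string operations (concatenation, splitting, replacement)
* Use regular expressions fluently for pattern matching
* Pay attention to text encoding (UTF-8/ASCII)
* Choose the method that fits the task (for example, <code>str</code> methods vs. the <code>re</code> module)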

[[Category:编程语言]]
[[Category:Python]]
[[Category:Python 字符串]]