Python Web Scraping

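Python web scraping is the technique of using a program to automatically fetch data from the web and parse it. It is an important part of network programming and is widely used in data analysis, market research, search engine optimization, and similar fields. Python's rich library ecosystem (for example Requests, BeautifulSoup, and Scrapy) makes it a popular language for scraping.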

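Core Concepts

HTTP Requests and Responses

Web scraping builds on the HTTP protocol. The client (the scraper) sends a request to a server, and the server returns a response. In Python, the requests library is commonly used for this exchange: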

import requests
response = requests.get("https://example.com", timeout=10)  # timeout avoids hanging forever
print(response.status_code)  # 200 if the request succeeded
print(response.text[:100])   # first 100 characters of the page
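
To make the request side of the exchange more concrete, here is a minimal sketch that prepares a GET request without sending it, so the final URL and headers can be inspected offline. The helper names `build_request` and `fetch`, the `/search` path, the `q` parameter, and the `example-scraper/0.1` User-Agent string are assumptions made for illustration, not part of the original example.

```python
import requests


def build_request(url, params=None, user_agent="example-scraper/0.1"):
    """Prepare (but do not send) a GET request so it can be inspected."""
    request = requests.Request(
        "GET", url, params=params, headers={"User-Agent": user_agent}
    )
    return request.prepare()


def fetch(url, params=None, timeout=10):
    """Send the prepared request; raise an exception on 4xx/5xx responses."""
    with requests.Session() as session:
        response = session.send(build_request(url, params), timeout=timeout)
        response.raise_for_status()
        return response.text


# Inspect the request that would go over the wire (no network access here).
prepared = build_request("https://example.com/search", params={"q": "python"})
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

Calling `fetch("https://example.com")` would send the request for real. Checking the status with `raise_for_status()` keeps later parsing code from silently working on an error page.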

HTML Parsing

The server usually returns an HTML document, so a parsing library is needed to extract structured data from it. BeautifulSoup is a common choice:

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')  # response comes from the request above
h1 = soup.find('h1')  # first <h1> tag, or None if the page has none
title = h1.text if h1 else None
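
The same parsing calls can be tried without any network access by feeding BeautifulSoup an inline HTML string. The markup below (a heading plus a short list of links) is made up for illustration; it is a sketch of typical usage rather than a recipe for any particular site.

```python
from bs4 import BeautifulSoup

# Hypothetical page content used only for this example.
html = """
<html><body>
  <h1>Product List</h1>
  <ul>
    <li class="item"><a href="/a">Apple</a></li>
    <li class="item"><a href="/b">Banana</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None when nothing matches.
heading = soup.find("h1")
title = heading.get_text(strip=True) if heading else None

# find_all() returns every match; attributes are read like dictionary keys.
links = [
    (li.a.get_text(strip=True), li.a["href"])
    for li in soup.find_all("li", class_="item")
]

missing = soup.find("h2")  # this page has no <h2>, so the result is None

print(title)
print(links)
```

Checking for `None` before reading `.text` matters in practice, because real pages often lack the element a scraper expects.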

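Technology Stack

Comparison of common Python scraping libraries
Library	Purpose	Typical use case
Requests	Sending HTTP requests	Fetching simple pages
BeautifulSoup	HTML/XML parsing	Parsing static pages
Scrapy	Full-featured crawling framework	Large-scale crawling projects
Selenium	Browser automation	Pages rendered dynamically with JavaScript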

Practical Examples

Example 1: Scraping a Static Page

As an example, the following fetches the first paragraph of the Wikipedia article on Python:

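import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
# The User-Agent value is a placeholder; Wikimedia asks clients to identify themselves
response = requests.get(url, headers={"User-Agent": "example-scraper/0.1"}, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')
content = soup.find('div', class_='mw-parser-output')
# Take the first paragraph that contains text (empty placeholders can precede the lead)
first_paragraph = next(p.text for p in content.find_all('p') if p.text.strip())
print(first_paragraph)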

Example 2: Scraping Dynamic Content

For pages rendered with JavaScript, use Selenium:

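from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://dynamic-website.com")
# Selenium 4 locates elements with find_element(By.ID, ...)
dynamic_content = driver.find_element(By.ID, "content").text
driver.quit()  # close the browser when done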