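A Detailed Guide to Python Parsing Libraries

Data extraction is the core step of a web crawler. Once the page source has been fetched, the useful information has to be pulled out of it. Python offers several capable parsing tools:

Regular expressions: the most basic and also the most general text-matching tool.
lxml (XPath): based on the XML Path Language; fast parsing, well suited to large-scale scraping.
Beautiful Soup: friendly API and tolerant of malformed markup; well suited to beginners and to pages with irregular HTML.
pyquery: jQuery-style syntax that front-end developers pick up very quickly.

This chapter introduces each of these tools in turn.

1. Regular Expressions

Regular expressions are a powerful tool for working with strings: searching, replacing, and validating matches.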

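1.1 Common Matching Rules

Pattern	Description
`\w`	Matches a letter, digit, or underscore
`\s`	Matches any whitespace character
`\d`	Matches any digit
`.`	Matches any character except a newline
`*`	Matches the preceding expression 0 or more times
`+`	Matches the preceding expression 1 or more times
`?`	Matches the preceding expression 0 or 1 times; placed after a quantifier (as in `*?` or `+?`) it makes that quantifier non-greedy
`^`	Matches the start of the string
`$`	Matches the end of the string
`[...]`	Matches any one character from the set
`(...)`	Group; used to capture content for extraction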
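
To make the table concrete, here is a minimal sketch that exercises several of these rules with the standard re module. The sample string and the variable names are made up for illustration and are not part of the original examples.

```python
import re

sample = 'item_42 costs 7 USD'

words = re.findall(r'\w+', sample)        # runs of letters, digits, underscores
numbers = re.findall(r'\d+', sample)      # runs of digits
any_char = re.findall(r'c.sts', sample)   # . stands for exactly one character
optional = re.findall(r'costs?', 'cost costs')           # ? makes the final "s" optional
vowels = re.findall(r'[aeiou]', 'regex')                 # any one character from the set
at_start = re.match(r'^item', sample) is not None        # ^ anchors at the start
at_end = re.search(r'USD$', sample) is not None          # $ anchors at the end
price = re.search(r'costs\s(\d+)', sample).group(1)      # (...) captures the digits

print(words, numbers, price)
```

Each call uses a raw string (the r prefix) so that backslashes reach the regex engine unchanged.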

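1.2 The match Method

Matches from the beginning of the string:

python

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match(r'^Hello\s\d{3}\s\d{4}\s\w{10}', content)

print(result.group())  # The matched text: Hello 123 4567 World_This
print(result.span())   # The (start, end) span of the match: (0, 25)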

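Extracting the target content (with a group):

python

# The digits must form one run here; with 'Hello 123 4567 ...' this pattern would not match
content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^Hello\s(\d+)\sWorld', content)
print(result.group(1))  # 1234567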

Generic matching with .*:

python

result = re.match('^Hello.*Demo$', content)

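Greedy vs. non-greedy matching:

python

content = 'Hello 1234567 World_This is a Regex Demo'

# Greedy .* matches as much as possible
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result.group(1))  # 7 (.* swallows every digit except the last one)

# Non-greedy .*? matches as little as possible
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result.group(1))  # 1234567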

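Flags:

python

# re.S makes . match every character, including newlines
content = '''Hello 1234567 World_This
is a Regex Demo
'''
result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)

Flag	Description
`re.I`	Ignore case
`re.S`	Make `.` match newlines as well
`re.M`	Multi-line mode: `^` and `$` also match at the start and end of each line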
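
The effect of each flag is easiest to see side by side. The following is a small sketch using only the re module; the two-line sample text and the variable names are invented for this illustration.

```python
import re

text = 'First line\nsecond LINE'

# re.I: case-insensitive matching finds both spellings.
case_insensitive = re.findall(r'line', text, re.I)

# Without re.S the dot stops at the newline; with re.S it crosses it.
without_dotall = re.match(r'First.*LINE', text)
with_dotall = re.match(r'First.*LINE', text, re.S)

# re.M: ^ also matches right after each newline.
line_starts = re.findall(r'^\w+', text, re.M)

# Flags can be combined with the | operator.
combined = re.findall(r'^first|^second', text, re.I | re.M)

print(case_insensitive, line_starts, combined)
```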

Escaping special characters:

python

content = '(百度) www.baidu.com'
result = re.match(r'\(百度\) www\.baidu\.com', content)
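
When the literal text comes from a variable, escaping it by hand is error-prone. As a possible alternative, the standard library's re.escape adds the backslashes for you. The sketch below reuses the string from the example above; the variable names are illustrative.

```python
import re

content = '(百度) www.baidu.com'

# re.escape backslash-escapes the regex metacharacters in a literal string.
escaped = re.escape(content)
strict = re.match(escaped, content)

# Unescaped, "(" and ")" form a group and "." is a wildcard, so the
# pattern expects the text to start with 百度 and fails on the "(".
loose = re.match('(百度) www.baidu.com', content)

print(strict is not None, loose is None)
```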

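1.3 The search Method

Scans the whole string and returns the first successful match:

python

content = 'Extra strings Hello 1234567 World_This is a Regex Demo'
result = re.search(r'Hello.*?(\d+).*?Demo', content)
print(result.group(1))  # 1234567

Tip: match only succeeds when the pattern matches at the very start of the string, so search is usually the more convenient choice.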
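
The difference is easy to check by running both functions on the same input. This is a minimal sketch; the None guard at the end is a suggested habit rather than something the original example does.

```python
import re

content = 'Extra strings Hello 1234567 World_This is a Regex Demo'
pattern = r'Hello.*?(\d+).*?Demo'

# match anchors at position 0, and the string does not start with "Hello".
match_result = re.match(pattern, content)

# search scans forward and returns the first place the pattern fits.
search_result = re.search(pattern, content)

# Both functions return None on failure, so check before calling .group().
number = search_result.group(1) if search_result else None

print(match_result, number)
```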

1.4 The findall Method

Returns every match as a list:

python

html = '''
<li data-view="2">一路上有你</li>
<li data-view="7"><a href="/2.mp3" singer="任贤齐">沧海一声笑</a></li>
<li data-view="4"><a href="/3.mp3" singer="齐秦">往事随风</a></li>
'''

results = re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
for result in results:
    print(result[0], result[1], result[2])
# /2.mp3 任贤齐 沧海一声笑
# /3.mp3 齐秦 往事随风
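
The shape of the list that findall returns depends on how many groups the pattern contains. The sketch below illustrates the three cases on a shortened, made-up variant of the markup above.

```python
import re

html = '<a href="/2.mp3">沧海一声笑</a><a href="/3.mp3">往事随风</a>'

# No group: each item is the whole matched text.
whole = re.findall(r'<a.*?</a>', html)

# One group: each item is the text captured by that group.
links = re.findall(r'href="(.*?)"', html)

# Several groups: each item is a tuple with one entry per group.
pairs = re.findall(r'href="(.*?)">(.*?)</a>', html)

print(links)
print(pairs)
```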

1.5 The sub Method

Replaces every part of the string that matches the pattern:

python

content = '54aK54yr5oiR54ix5L2g'
content = re.sub(r'\d+', '', content)
print(content)  # aKyroiRixLg
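
sub is also handy as a clean-up step before findall. As a sketch, assuming we want every song title from markup like the one in section 1.4 whether or not it is wrapped in a link, we can strip the a tags first and then use a much simpler pattern. The markup is shortened here and the variable names are illustrative.

```python
import re

html = '''
<li data-view="2">一路上有你</li>
<li data-view="7"><a href="/2.mp3" singer="任贤齐">沧海一声笑</a></li>
'''

# Remove opening and closing a tags, keeping the text between them.
without_links = re.sub(r'<a.*?>|</a>', '', html)

# With the links gone, one simple pattern covers every li.
titles = re.findall(r'<li.*?>(.*?)</li>', without_links, re.S)

print(titles)
```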

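1.6 The compile Method

Compiles a pattern into a regular expression object so that it can be reused:

python

pattern = re.compile(r'\d{2}:\d{2}')

result1 = re.sub(pattern, '', '2016-12-15 12:00')  # '2016-12-15 '
result2 = re.sub(pattern, '', '2016-12-17 12:55')  # '2016-12-17 '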
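
A compiled pattern object also exposes sub, search, findall and the other functions as methods, which reads a little more naturally when the same pattern is applied to many strings. A brief sketch follows; the list of timestamps is made up for the example.

```python
import re

time_pattern = re.compile(r'\d{2}:\d{2}')

stamps = ['2016-12-15 12:00', '2016-12-17 12:55', '2016-12-22 13:21']

# pattern.sub(repl, string) is equivalent to re.sub(pattern, repl, string).
dates = [time_pattern.sub('', stamp).strip() for stamp in stamps]

# The same object can be reused for searching.
times = [time_pattern.search(stamp).group() for stamp in stamps]

print(dates)
print(times)
```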

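2. Hands-on: Scraping the Maoyan Top 100 Movies with Regular Expressions

This section shows how to combine the requests HTTP library with the re module to build a simple crawler.

2.1 Target Analysis

Target URL: http://maoyan.com/board/4
Pagination: offset=0, 10, 20, ..., 90
Fields to extract: rank, image, title, actors, release date, score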
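
The pagination rule translates directly into a list of URLs. The helper below is a sketch of that step only; build_page_urls and its parameters are hypothetical names introduced here, and no request is sent.

```python
BASE_URL = 'http://maoyan.com/board/4'

def build_page_urls(pages=10, page_size=10):
    """Return one listing URL per page, stepping offset by page_size."""
    return [f'{BASE_URL}?offset={page * page_size}' for page in range(pages)]

urls = build_page_urls()
print(urls[0])
print(urls[-1])
```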

2.2 Complete Code

python

import json
import requests
import re
import time

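def get_one_page(url):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    try:
        # A timeout keeps a stalled connection from hanging the crawler
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except requests.RequestException:
        return None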

def parse_one_page(html):
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
        re.S
    )
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            'score': item[5] + item[6]
        }

def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main(offset):
    url = f'http://maoyan.com/board/4?offset={offset}'
    html = get_one_page(url)
    if html:
        for item in parse_one_page(html):
            print(item)
            write_to_file(item)

if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)  # Pause between requests to avoid being blocked

2.3 Output

python

{'index': '1', 'image': 'http://...', 'title': '霸王别姬', 
 'actor': '张国荣，张丰毅，巩俐', 'time': '1993-01-01(中国香港)', 'score': '9.6'}
{'index': '2', 'image': 'http://...', 'title': '肖申克的救赎', 
 'actor': '蒂姆·罗宾斯，摩根·弗里曼', 'time': '1994-10-14(美国)', 'score': '9.5'}
...
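
Because write_to_file stores one JSON object per line, the results can be loaded back line by line. The sketch below shows one way to do that. The path parameter and the read_results helper are assumptions added for this illustration, and a temporary directory stands in for result.txt so that the example leaves no files behind.

```python
import json
import os
import tempfile

def write_to_file(content, path):
    # Same format as the crawler: one JSON object per line, UTF-8 encoded.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def read_results(path):
    # Parse each non-empty line back into a dict.
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'result.txt')
    write_to_file({'index': '1', 'title': '霸王别姬', 'score': '9.6'}, path)
    write_to_file({'index': '2', 'title': '肖申克的救赎', 'score': '9.5'}, path)
    movies = read_results(path)

print([movie['title'] for movie in movies])
```

Passing ensure_ascii=False keeps the Chinese titles readable in the file instead of writing them as \u escapes.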

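3. Using XPath (lxml)

XPath, short for XML Path Language, was originally designed for locating nodes in XML documents, but it works equally well on HTML.

Official documentation: https://www.w3.org/TR/xpath/

3.1 Common XPath Rules

Expression	Description
`nodename`	Selects the child nodes of the current node that are named `nodename`
`/`	Selects direct children of the current node (at the start of an expression, begins at the document root)
`//`	Selects descendants of the current node at any depth
`.`	Selects the current node
`..`	Selects the parent of the current node
`@`	Selects attributes

Example: //title[@lang='eng'] selects every node named title whose lang attribute has the value eng.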
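
To try that expression without installing anything, the sketch below uses the standard library's xml.etree.ElementTree instead of lxml. ElementTree supports only a limited XPath subset and expects relative paths, so the expression is written as .//title[@lang='eng']. The bookstore document is a made-up sample.

```python
import xml.etree.ElementTree as ET

xml_text = '''
<bookstore>
    <book><title lang="eng">Harry Potter</title></book>
    <book><title lang="fra">Le Petit Prince</title></book>
    <book><title lang="eng">Learning XML</title></book>
</bookstore>
'''

root = ET.fromstring(xml_text)

# // selects descendants, [@lang='eng'] filters on the attribute value.
english_titles = [node.text for node in root.findall(".//title[@lang='eng']")]

# Attribute values are read through .get() in ElementTree.
languages = [node.get('lang') for node in root.findall('.//title')]

print(english_titles)
print(languages)
```

With lxml the same query can be written in full XPath syntax, as the following sections show.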

3.2 Basic Usage

python

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

html = etree.HTML(text)  # Parses the string and repairs the markup (the unclosed last li gets closed)
result = etree.tostring(html)
print(result.decode('utf-8'))

A document can also be loaded from a file:

python

html = etree.parse('./test.html', etree.HTMLParser())

3.3 Selecting Nodes

All nodes

python

result = html.xpath('//*')  # Matches every node
result = html.xpath('//li')  # Matches every li node

Child nodes

python

# Use / for direct children
result = html.xpath('//li/a')  # a nodes that are direct children of li

# Use // for descendants
result = html.xpath('//ul//a')  # Every a node under ul, at any depth

Parent nodes

python

# Use .. to reach the parent node
result = html.xpath('//a[@href="link4.html"]/../@class')  # ['item-1']

# Or use the parent:: axis
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')

3.4 Matching and Reading Attributes

Filtering by attribute

python

result = html.xpath('//li[@class="item-0"]')  # li nodes whose class is item-0

Reading attribute values

python

result = html.xpath('//li/a/@href')  # The href attribute of every a node under li
# ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

Matching multi-valued attributes (contains)

python

# When class holds several values, an exact @class="li" test fails; use contains
text = '<li class="li li-first"><a href="link.html">first item</a></li>'
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')  # ['first item']

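Matching multiple attributes (and)

python

# Assumes the li also carries name="item", e.g. <li class="li li-first" name="item">
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')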

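3.5 Getting Text

python

# html here is the five-item list parsed in section 3.2
# Text of the direct child a nodes
result = html.xpath('//li[@class="item-0"]/a/text()')
# ['first item', 'fifth item']

# Text of all descendant nodes
result = html.xpath('//li[@class="item-0"]//text()')
# ['first item', 'fifth item', '\n     ']  (the last entry is whitespace inside the auto-closed final li)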

3.6 Selecting by Position

python

result = html.xpath('//li[1]/a/text()')         # The first li (positions start at 1)
result = html.xpath('//li[last()]/a/text()')    # The last li
result = html.xpath('//li[position()<3]/a/text()')  # The first two li nodes
result = html.xpath('//li[last()-2]/a/text()')  # The third li from the end

3.7 Selecting Along Axes

python

# Ancestor nodes
result = html.xpath('//li[1]/ancestor::*')
result = html.xpath('//li[1]/ancestor::div')

# Attributes
result = html.xpath('//li[1]/attribute::*')

# Child nodes
result = html.xpath('//li[1]/child::a[@href="link1.html"]')

# Descendant nodes
result = html.xpath('//li[1]/descendant::span')

# Nodes that follow the current node in the document
result = html.xpath('//li[1]/following::*[2]')

# Following sibling nodes
result = html.xpath('//li[1]/following-sibling::*')

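4. Using Beautiful Soup

Beautiful Soup is a Python library for parsing HTML and XML; it navigates a page through its structure and attributes.

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

4.1 Choosing a Parser

Parser	Usage	Advantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built in; nothing to install
lxml HTML	`BeautifulSoup(markup, "lxml")`	Fast and tolerant of malformed markup (recommended)
lxml XML	`BeautifulSoup(markup, "xml")`	The only parser here that supports XML
html5lib	`BeautifulSoup(markup, "html5lib")`	Most tolerant of malformed markup

The lxml parser is recommended: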

python

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

4.2 Basic Usage

python

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

print(soup.prettify())      # Pretty-printed output
print(soup.title.string)    # The Dormouse's story

4.3 Node Selectors

Selecting elements

python

print(soup.title)           # <title>The Dormouse's story</title>
print(type(soup.title))     # <class 'bs4.element.Tag'>
print(soup.title.string)    # The Dormouse's story
print(soup.head)            # <head>...</head>
print(soup.p)               # The first p node

Getting the name, attributes, and text

python

# Node name
print(soup.title.name)  # title

# All attributes
print(soup.p.attrs)  # {'class': ['title'], 'name': 'dromouse'}

# A single attribute
print(soup.p['class'])  # ['title']
print(soup.p.attrs['name'])  # dromouse

# Text content
print(soup.p.string)  # The Dormouse's story

Nested selection

python

print(soup.head.title.string)  # The Dormouse's story

4.4 Navigating Related Nodes

Child nodes

python

# Direct children, as a list
print(soup.p.contents)

# Direct children, as an iterator
for child in soup.p.children:
    print(child)

# All descendants
for descendant in soup.p.descendants:
    print(descendant)

Parent nodes

python

# Direct parent
print(soup.a.parent)

# All ancestors
for parent in soup.a.parents:
    print(parent.name)

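Sibling nodes

python

print(soup.a.next_sibling)       # Next sibling node (may be a text node rather than a tag)
print(soup.a.previous_sibling)   # Previous sibling node

# All following siblings (previous_siblings yields the preceding ones)
for sibling in soup.a.next_siblings:
    print(sibling)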

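4.5 Method Selectors

find_all

python

# These queries assume a document that contains ul/li nodes with these ids and classes
# Query by node name
soup.find_all(name='ul')
soup.find_all(name='li')

# Query by attribute
soup.find_all(attrs={'id': 'list-1'})
soup.find_all(id='list-1')
soup.find_all(class_='element')  # class is a Python keyword, hence the trailing underscore

# Query by text (regular expressions are supported)
import re
soup.find_all(string=re.compile('link'))  # string= replaces the older text= argument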

find

find returns the first matching element rather than a list:

python

soup.find(name='ul')
soup.find(class_='list')

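Other methods

Method	Description
`find_parents()` / `find_parent()`	All ancestor nodes / the direct parent
`find_next_siblings()` / `find_next_sibling()`	All following siblings / the first following sibling
`find_previous_siblings()` / `find_previous_sibling()`	All preceding siblings / the first preceding sibling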

4.6 CSS Selectors

python

# Use the select method
soup.select('.panel .panel-heading')  # Class selectors
soup.select('ul li')                   # Descendant selector
soup.select('#list-2 .element')        # id selector combined with a class selector

# Nested selection
for ul in soup.select('ul'):
    print(ul.select('li'))

# Reading attributes
for ul in soup.select('ul'):
    print(ul['id'])

# Reading text
for li in soup.select('li'):
    print(li.get_text())  # li.string also works when the node has a single text child

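5. Using pyquery

pyquery is a parsing library whose API is modeled on jQuery; if you already know jQuery, it will feel very natural.

Official documentation: http://pyquery.readthedocs.io

5.1 Initialization

python

from pyquery import PyQuery as pq

# Initialize from a string
doc = pq(html)

# Initialize from a URL
doc = pq(url='http://example.com')

# Initialize from a file
doc = pq(filename='demo.html')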

5.2 Basic CSS Selectors

python

html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''

from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))  # Selects every li node

5.3 Finding Nodes

Child nodes

python

items = doc('.list')

# All matching descendants
lis = items.find('li')

# Direct children
lis = items.children()

# Direct children filtered by a selector
lis = items.children('.active')

Parent nodes

python

# Direct parent
container = items.parent()

# All ancestors
parents = items.parents()

# Ancestors filtered by a selector
parent = items.parents('.wrap')

Sibling nodes

python

li = doc('.list .item-0.active')

print(li.siblings())           # All siblings
print(li.siblings('.active'))  # Siblings filtered by a selector

5.4 Iteration

python

lis = doc('li').items()  # Returns a generator

for li in lis:
    print(li, type(li))  # Each item is a PyQuery object

5.5 Getting Information

Getting attributes

python

a = doc('.item-0.active a')
print(a.attr('href'))  # link3.html
print(a.attr.href)     # link3.html

# attr returns only the first node's value, so iterate when several nodes match
for item in doc('a').items():
    print(item.attr('href'))

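Getting text

python

a = doc('.item-0.active a')
print(a.text())  # third item (plain text)

li = doc('.item-0.active')
print(li.html())  # <a href="link3.html"><span class="bold">third item</span></a>

Note: when several nodes are selected, text() returns the text of all of them joined by spaces, whereas html() returns the inner HTML of the first node only.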

5.6 Manipulating Nodes

python

li = doc('.item-0.active')

# Add or remove a class
li.removeClass('active')
li.addClass('active')

# Set an attribute
li.attr('name', 'link')

# Replace the content
li.text('changed item')
li.html('<span>changed item</span>')

# Remove nodes (assumes a document that has a .wrap element containing p nodes)
wrap = doc('.wrap')
wrap.find('p').remove()
print(wrap.text())

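5.7 Pseudo-class Selectors

python

doc('li:first-child')      # The first li
doc('li:last-child')       # The last li
doc('li:nth-child(2)')     # The second li
doc('li:gt(2)')            # li nodes after the third (zero-based index greater than 2)
doc('li:nth-child(2n)')    # li nodes at even positions
doc('li:contains(second)') # li nodes whose text contains second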

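6. Summary and Recommendations

Library	Characteristics	Best suited for
lxml (XPath)	Fastest of the three; powerful	Large-scale crawlers; performance-sensitive work
Beautiful Soup	Tolerant of malformed markup; friendly API	Beginners; irregular HTML
pyquery	jQuery style; CSS selectors	Developers familiar with jQuery

Recommended combinations:

High-performance scenarios: lxml + XPath
General-purpose scenarios: Beautiful Soup with the lxml parser
jQuery users: pyquery

All three libraries handle page parsing well, so pick the one you know best.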
