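A Detailed Guide to Python HTTP Request Libraries

When learning web scraping, the first step is to simulate a browser sending a request to a server. One of Python's strengths is its full-featured set of libraries for making these requests.

Beyond the basic urllib and the classic requests, modern scraper development also uses httpx, which supports async; curl_cffi, which can get past TLS fingerprint detection; and the browser automation tools Selenium and Playwright, which handle dynamically rendered pages. To cope with increasingly complex anti-scraping measures, the newer hybrid tool DrissionPage is also becoming popular with developers.

1. Using urllib

urllib is Python's built-in HTTP request library and needs no extra installation.

Official documentation: https://docs.python.org/3/library/urllib.html

It contains the following four modules:

Module	Purpose
`request`	The basic HTTP request module; sends requests
`error`	The exception-handling module; catches request errors
`parse`	URL utilities for splitting, parsing, and joining URLs
`robotparser`	Parses a site's `robots.txt` file

1.1 Sending Requests

Basic urlopen usage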

python

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

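The returned object is of type http.client.HTTPResponse and offers these commonly used attributes and methods:

python

response.status        # status code, e.g. 200
response.getheaders()  # all response headers
response.getheader('Server')  # a specific response header
response.read()        # read the response body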

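urlopen API in detail

python

# Signature as of Python 3.13; earlier versions also accepted the
# cafile, capath, and cadefault parameters (now removed; use context instead)
urllib.request.urlopen(url, data=None, [timeout, ]*, context=None)

The data parameter: passing it changes the request method to POST.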

python

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))
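To see the switch from GET to POST without touching the network, a minimal sketch can build Request objects and inspect them. Nothing is sent here, and the URLs are only placeholders reused from the example above.

```python
import urllib.parse
import urllib.request

# urlencode() returns a str; the body must be bytes
payload = urllib.parse.urlencode({'word': 'hello'}).encode('utf-8')

get_req = urllib.request.Request('http://httpbin.org/get')
post_req = urllib.request.Request('http://httpbin.org/post', data=payload)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(post_req.data)          # b'word=hello'
```

The method is derived from whether data is present, which is why urlopen needs no separate method argument.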

The timeout parameter: sets the timeout in seconds.

python

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

The Request object

When you need to add headers or other information, use the Request class:

python

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Host': 'httpbin.org'
}
data = bytes(parse.urlencode({'name': 'Germey'}), encoding='utf8')

req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

You can also add request headers dynamically with the add_header method:

python

req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0 ...')

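Advanced usage: Handlers

Handlers are processors of various kinds that implement advanced features such as login authentication, cookies, and proxy settings.

Proxy settings: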

python

from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
response = opener.open('https://www.baidu.com')

Handling cookies:

python

import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')

for item in cookie:
    print(f"{item.name}={item.value}")

Saving cookies to a file:

python

import http.cookiejar
import urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

1.2 Handling Exceptions

URLError

python

from urllib import request, error

try:
    response = request.urlopen('https://example.com/notfound')
except error.URLError as e:
    print(e.reason)

HTTPError

HTTPError is a subclass of URLError and carries more information, so catch it before URLError:

python

from urllib import request, error

try:
    response = request.urlopen('https://example.com/notfound')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
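The ordering of the except clauses matters because of the subclass relationship. A small offline sketch makes this visible by building the two exceptions by hand; the URL, status code, and messages are illustrative stand-ins for real failures, and describe is a hypothetical helper.

```python
from urllib import error

def describe(exc):
    """Return a short label, checking the subclass before its parent."""
    if isinstance(exc, error.HTTPError):
        return f'HTTP {exc.code}: {exc.reason}'
    if isinstance(exc, error.URLError):
        return f'URL error: {exc.reason}'
    return 'other error'

# Hand-built exceptions standing in for real failures (illustrative values)
http_err = error.HTTPError('http://example.invalid/missing', 404, 'Not Found', None, None)
url_err = error.URLError('connection refused')

print(issubclass(error.HTTPError, error.URLError))  # True
print(describe(http_err))  # HTTP 404: Not Found
print(describe(url_err))   # URL error: connection refused
```

If the URLError check came first, the 404 would be reported as a generic URL error and its status code would be lost.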

1.3 Parsing URLs

The urllib.parse module provides a rich set of URL-handling functions.

urlparse - parsing a URL

python

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', 
#             params='user', query='id=5', fragment='comment')

Standard URL format: scheme://netloc/path;params?query#fragment

urlunparse - building a URL

python

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
# http://www.baidu.com/index.html;user?a=6#comment

urljoin - joining URLs

python

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
# http://www.baidu.com/FAQ.html

print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
# https://cuiqingcai.com/FAQ.html
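When resolving links scraped from a page, the base URL usually has a path of its own, and the result depends on the form of the relative link. A brief sketch of the common cases, using a made-up base path on the same host:

```python
from urllib.parse import urljoin

# Hypothetical page URL used as the base for resolving links
base = 'http://www.baidu.com/docs/guide/index.html'

print(urljoin(base, 'faq.html'))       # http://www.baidu.com/docs/guide/faq.html
print(urljoin(base, '../about.html'))  # http://www.baidu.com/docs/about.html
print(urljoin(base, '/robots.txt'))    # http://www.baidu.com/robots.txt
print(urljoin(base, '?page=2'))        # http://www.baidu.com/docs/guide/index.html?page=2
```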

urlencode - encoding parameters

python

from urllib.parse import urlencode

params = {'name': 'germey', 'age': 22}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
# http://www.baidu.com?name=germey&age=22

quote / unquote - URL encoding and decoding

python

from urllib.parse import quote, unquote

keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
# https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

print(unquote(url))
# https://www.baidu.com/s?wd=壁纸
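The encoding functions above have a counterpart for reading query strings back. As a rough sketch, urlencode can build the query (it percent-encodes non-ASCII values itself) and urlsplit plus parse_qs can recover it; the pn parameter is an arbitrary example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Build a query string; non-ASCII values are percent-encoded as UTF-8
query = urlencode({'wd': '壁纸', 'pn': 10})
url = 'https://www.baidu.com/s?' + query
print(url)  # https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8&pn=10

# Parse it back; parse_qs returns a list of values per key
parts = urlsplit(url)
params = parse_qs(parts.query)
print(parts.path)  # /s
print(params)      # {'wd': ['壁纸'], 'pn': ['10']}
```

Note that every value comes back as a string inside a list, since a key may appear more than once in a query.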

1.4 The Robots Protocol

The robots.txt file tells crawlers which pages may be fetched and which may not.

python

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()

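# Results depend on the site's live robots.txt; if the download is refused (401/403), can_fetch returns False
print(rp.can_fetch('*', 'http://www.jianshu.com/p/xxx'))  # True if the rules allow it
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python'))  # False if the rules disallow it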
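Because the live result depends on the remote file, it can help to check the matching logic offline. The sketch below feeds RobotFileParser a hand-written rule set through parse(); the rules and the example.com host are invented for illustration and are not any real site's robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration only
rules = """\
User-agent: *
Disallow: /search
Allow: /p/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # parse from memory instead of downloading

print(rp.can_fetch('*', 'http://example.com/p/xxx'))            # True
print(rp.can_fetch('*', 'http://example.com/search?q=python'))  # False
```

Rules are matched by path prefix in the order they appear, so /search?q=python is caught by the Disallow line.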

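2. Using requests

The requests library is simpler and easier to use than urllib and is the de facto standard for Python scraping.

Official documentation: https://requests.readthedocs.io/en/latest/

2.1 Basic Usage

GET requests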

python

import requests

r = requests.get('https://www.baidu.com/')
print(r.status_code)  # 200
print(r.text)         # page source
print(r.cookies)      # cookies

GET request with parameters:

python

data = {'name': 'germey', 'age': 22}
r = requests.get('http://httpbin.org/get', params=data)
print(r.url)  # http://httpbin.org/get?name=germey&age=22
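To check how params will be encoded before sending anything, one option is to prepare a Request object and look at its URL. This is a sketch that makes no network call; the tags and wd parameters are made-up examples showing list values and non-ASCII text.

```python
import requests

# Illustrative parameters: a plain value, a list, and non-ASCII text
params = {'name': 'germey', 'tags': ['python', 'spider'], 'wd': '壁纸'}

# prepare() builds the final URL without sending the request
prepared = requests.Request('GET', 'http://httpbin.org/get', params=params).prepare()
print(prepared.url)
# http://httpbin.org/get?name=germey&tags=python&tags=spider&wd=%E5%A3%81%E7%BA%B8
```

List values become repeated keys, and non-ASCII text is percent-encoded as UTF-8.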

Parsing a JSON response:

python

r = requests.get('http://httpbin.org/get')
print(r.json())  # parsed into a dict

Adding headers:

python

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
r = requests.get('https://www.zhihu.com/explore', headers=headers)

Fetching binary data (images/video):

python

r = requests.get('https://github.com/favicon.ico')
with open('favicon.ico', 'wb') as f:
    f.write(r.content)  # use content to get the raw bytes

POST requests

python

data = {'name': 'germey', 'age': '22'}
r = requests.post('http://httpbin.org/post', data=data)
print(r.text)

The response object

python

r = requests.get('http://www.jianshu.com')

print(r.status_code)  # status code
print(r.headers)      # response headers
print(r.cookies)      # cookies
print(r.url)          # final URL
print(r.history)      # redirect history

Checking the status code:

python

if r.status_code == requests.codes.ok:
    print('Request Successfully')

2.2 Advanced Usage

File upload

python

# Open the file in a with block so it is closed after the upload
with open('favicon.ico', 'rb') as f:
    r = requests.post('http://httpbin.org/post', files={'file': f})

Handling cookies

Getting cookies:

python

r = requests.get('https://www.baidu.com')
for key, value in r.cookies.items():
    print(f"{key}={value}")

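Setting cookies:

python

# Option 1: via headers
headers = {'Cookie': 'your_cookie_string'}
r = requests.get('https://www.zhihu.com', headers=headers)

# Option 2: via the cookies parameter
# (do not also pass an explicit Cookie header; requests will not override it with the jar)
jar = requests.cookies.RequestsCookieJar()
jar.set('key', 'value')
r = requests.get('http://www.zhihu.com', cookies=jar)

Maintaining a session with Session

A Session simulates multiple requests from the same browser and handles cookies automatically: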

python

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
# {"cookies": {"number": "123456789"}}

SSL certificate verification

python

# Skip SSL verification (insecure; for testing only)
import requests
import urllib3

urllib3.disable_warnings()
r = requests.get('https://www.12306.cn', verify=False)

Proxy settings

python

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
r = requests.get('https://www.taobao.com', proxies=proxies)

# SOCKS proxy (requires installing requests[socks])
proxies = {
    'http': 'socks5://user:password@host:port',
    'https': 'socks5://user:password@host:port'
}

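Timeout settings

python

# A single value applies to both the connect and the read timeout
# (it is not a cap on the total request time)
r = requests.get('https://www.taobao.com', timeout=1)

# Set the connect and read timeouts separately
r = requests.get('https://www.taobao.com', timeout=(5, 30))

Authentication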

python

from requests.auth import HTTPBasicAuth

r = requests.get('http://localhost:5000', auth=HTTPBasicAuth('user', 'pass'))
# Shorthand
r = requests.get('http://localhost:5000', auth=('user', 'pass'))

Prepared Request

A Prepared Request represents a request as a data structure, which is convenient for queue-based scheduling:

python

from requests import Request, Session

url = 'http://httpbin.org/post'
data = {'name': 'germey'}
headers = {'User-Agent': 'Mozilla/5.0'}

s = Session()
req = Request('POST', url, data=data, headers=headers)
prepped = s.prepare_request(req)
r = s.send(prepped)
print(r.text)
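The queue-scheduling idea can be sketched roughly as follows: Request objects wait in a queue as plain data and are prepared only when a worker picks them up. This is an illustrative outline rather than a full scheduler; the page parameter is invented, and the send call is left commented out so the sketch runs without network access.

```python
from collections import deque

from requests import Request, Session

# Requests are plain data until prepared, so they can sit in a queue
queue = deque(
    Request('GET', 'http://httpbin.org/get', params={'page': n}) for n in (1, 2, 3)
)

session = Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

prepared_urls = []
while queue:
    req = queue.popleft()
    prepped = session.prepare_request(req)  # merges session headers and cookies
    prepared_urls.append(prepped.url)
    # r = session.send(prepped, timeout=10)  # uncomment to actually send

print(prepared_urls)
# ['http://httpbin.org/get?page=1', 'http://httpbin.org/get?page=2', 'http://httpbin.org/get?page=3']
```

Using session.prepare_request rather than Request.prepare means session-level settings such as the User-Agent above are applied to every queued request.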

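3. The Next-Generation Tool: httpx

httpx is a modern HTTP client that can be seen as a next-generation alternative to requests. It has two main advantages:

HTTP/2 support: requests only supports HTTP/1.1.
Async support: native support for Python's async/await syntax.

Official documentation: https://www.python-httpx.org/

3.1 Basic Usage

The API design is highly compatible with requests: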

python

import httpx

# Synchronous request (almost identical to requests)
r = httpx.get('https://www.example.com')
print(r.status_code)
print(r.text)

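3.2 Enabling HTTP/2

python

# Must be enabled explicitly on a client
# (requires the optional dependency: pip install httpx[http2])
with httpx.Client(http2=True) as client:
    r = client.get('https://www.example.com')
    print(r.http_version)  # "HTTP/2" when the server negotiates it

3.3 Async Requests

This is httpx's core feature: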

python

import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as client:
        r = await client.get('https://www.example.com')
        print(r.text)

if __name__ == '__main__':
    asyncio.run(main())

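httpx vs. aiohttp:
httpx: the sync and async APIs both closely follow requests, so it is easy to pick up, and HTTP/2 is supported. A good default choice for an async library.
aiohttp: a long-established, asyncio-only framework (both client and server). At very high concurrency (for example, tens of thousands of concurrent requests), aiohttp may be somewhat faster.

4. Getting Past Fingerprinting: curl_cffi

When facing heavily protected sites (such as those behind Cloudflare), ordinary Python request libraries can be blocked even with a spoofed User-Agent, because their TLS fingerprint (JA3 fingerprint) differs from that of a real browser.

curl_cffi is built on curl-impersonate and can closely mimic a real browser's TLS fingerprint.

GitHub repository: https://github.com/yifeikong/curl_cffi

4.1 Installation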

bash

pip install curl_cffi

4.2 Impersonating a Browser

Use the impersonate parameter to choose which browser version to mimic:

python

from curl_cffi import requests

# Impersonate Chrome
r = requests.get('https://tls.browserleaks.com/json', impersonate='chrome110')
print(r.json())

# Impersonate Safari
r = requests.get('https://www.example.com', impersonate='safari15_3')

It also supports sessions:

python

from curl_cffi import requests

s = requests.Session()
s.impersonate = "chrome110"

r = s.get("https://www.google.com")

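5. Browser Automation (Dynamic Rendering)

For modern pages that rely heavily on JavaScript rendering, use complex encrypted parameters, or enforce strict anti-scraping measures, working with a request library directly can be very difficult. In those cases, a browser automation tool is often the better choice.

5.1 Selenium

Selenium is the long-standing leader in browser automation, with a mature ecosystem, extensive documentation, and support for multiple browsers (Chrome, Firefox, Edge, and others).

Official documentation: https://www.selenium.dev/documentation/

Features:

Mature and stable: backed by a large community and plugin ecosystem.
Broad compatibility: supports nearly all mainstream browsers and operating systems.
Best suited for: complex enterprise-grade test automation and maintaining legacy scraping projects.

Basic usage: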

python

from selenium import webdriver
from selenium.webdriver.common.by import By

# Start the browser
driver = webdriver.Chrome()

try:
    driver.get("https://www.python.org")
    print(driver.title)

    # Find an element
    search_bar = driver.find_element(By.NAME, "q")
    search_bar.clear()
    search_bar.send_keys("pycon")

    # Get the page source
    print(driver.page_source[:100])
finally:
    driver.quit()

5.2 Playwright

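Playwright is a newer automation tool developed by Microsoft. It is designed for modern web applications and is generally faster, with a more ergonomic API.

Official documentation: https://playwright.dev/python/

Features:

Fast: native async support and WebSocket-based communication with the browser, which typically makes it noticeably faster than Selenium.
Auto-waiting: built-in waiting for elements, so most time.sleep calls and explicit waits become unnecessary.
Feature-rich: supports script recording, network request interception, mobile emulation, screenshots, and video recording.
Recommended for: new projects started in 2025 or later.

Synchronous mode (Sync):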

python

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # headed mode (visible window)
    page = browser.new_page()
    page.goto("http://playwright.dev")
    print(page.title())
    browser.close()

Asynchronous mode (Async):

python

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("http://playwright.dev")
        print(await page.title())
        await browser.close()

asyncio.run(main())

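5.3 Recommendations (2025 Edition)

Feature	Selenium	Playwright
Learning curve	Moderate	Easy (more modern API design)
Execution speed	Slower	Very fast
Stability	Waits must be handled manually	Built-in auto-waiting
Network interception	Difficult (needs extra setup)	Natively supported (very powerful)
Anti-bot resistance	Easily detected (needs an undetected variant)	Better (but still leaves detectable traits)
Ecosystem	Very large	Growing quickly
Verdict	Use when maintaining legacy projects	First choice for new projects

6. The Hybrid Tool: DrissionPage

In the 2025 scraping toolbox, DrissionPage has risen quickly. It combines browser control (Chromium-based) and packet-level HTTP requests (built on requests) in a single tool.

Official documentation: https://drissionpage.cn/

6.1 Key Advantages

Seamless switching: within one session, use browser mode to handle login (including CAPTCHAs), then switch to request mode for efficient data collection.
No WebDriver dependency: it drives the browser directly over the CDP protocol, so it starts quickly and carries no WebDriver signature, which gets it past some anti-bot checks out of the box.
Concise syntax: the author has wrapped the API extensively, so finding elements, handling dialogs, and working across iframes takes less code than in Selenium or Playwright.

6.2 Basic Example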

python

from DrissionPage import ChromiumPage

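# Control the browser much like Selenium
page = ChromiumPage()
page.get('https://gitee.com/explore')

# Find an element and click it
page.ele('text:开源软件').click()

# Read data directly from the browser context
for item in page.eles('.project-title'):
    print(item.text, item.link)

# A session ("s") mode that only sends requests is also supported;
# mode switching is offered by the WebPage class rather than ChromiumPage
# page.change_mode()  # switch modes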

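7. Engineering and Practical Guide

7.1 Automatic Retries (Retrying)

Network requests are inherently unreliable, so a retry mechanism is essential in production. Rather than hand-writing while True loops with try-except, consider the tenacity library.

bash

pip install tenacity

Clean retry code: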

python

from tenacity import retry, stop_after_attempt, wait_fixed
import requests

# Retry on any exception: up to 3 attempts in total, 2 seconds apart
@retry(stop=stop_after_attempt(3), wait=wait_fixed(2))
def fetch_url(url):
    print(f"Trying to fetch {url}...")
    response = requests.get(url, timeout=5)
    response.raise_for_status()  # raises on 4xx/5xx responses, which triggers a retry
    return response.text

try:
    content = fetch_url("http://httpbin.org/flaky")
except Exception:
    print("Still failing after 3 attempts")

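7.2 Library Selection Guide

Scenario	Recommended	Rationale
Simple scripts / learning	`requests`	Simple and intuitive, with the most learning material
High performance / async	`httpx`	Modern API, HTTP/2 support, async-compatible
Hard anti-bot targets (TLS)	`curl_cffi`	Closely mimics browser fingerprints; easy to use
Dynamic pages / CAPTCHAs	`DrissionPage` / `Playwright`	DrissionPage has terser syntax; Playwright has the stronger community
Production	`httpx` + `tenacity`	Balances performance and reliability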
