[Python Third-Party Libraries] pyppeteer

2024-01-25 00:00:00

#python

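Introduction to the pyppeteer module
Getting the page title and content
Locating elements
Executing JS code
Taking screenshots
Generating PDFs
Intercepting and monitoring network requests
Loading a local HTML file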

Introduction to the pyppeteer module

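Pyppeteer is an unofficial Python port of Google's Puppeteer (a Node.js library for controlling Chrome/Chromium). It lets developers drive Chromium or Chrome programmatically from Python to automate web pages, for example scraping dynamically rendered content, running automated tests, and generating screenshots and PDFs.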

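Compared with traditional Selenium, Pyppeteer needs no separate driver setup and is built on asynchronous coroutines, which tends to make it more efficient and a strong tool for handling JavaScript-rendered pages.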

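Under the hood it is built on Python's asyncio library. All major operations are coroutines, so concurrency comes naturally, which can significantly improve throughput when processing many pages. For example, in a comparison experiment that scraped net asset values for multiple funds, asynchronous execution was roughly 6 times faster than sequential execution.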
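To make the sequential-versus-concurrent difference concrete, here is a minimal sketch that uses only the standard library. The fetch_title coroutine is a hypothetical stand-in that simulates the wait of a page.goto() plus page.title() call with asyncio.sleep, and the URLs are placeholders; it does not reproduce the fund experiment mentioned above.

```python
import asyncio
import time


async def fetch_title(url: str, delay: float = 0.05) -> str:
    # Hypothetical stand-in for `page.goto(url)` + `page.title()`:
    # it only simulates the I/O wait, no browser is involved.
    await asyncio.sleep(delay)
    return f"title of {url}"


async def run_sequential(urls):
    # Each page is awaited before the next one starts.
    return [await fetch_title(u) for u in urls]


async def run_concurrent(urls):
    # All coroutines are scheduled together; the waits overlap.
    return await asyncio.gather(*(fetch_title(u) for u in urls))


def timed(coro_func, urls):
    # Run one strategy and return (results, elapsed seconds).
    start = time.perf_counter()
    results = asyncio.run(coro_func(urls))
    return list(results), time.perf_counter() - start
```

Calling timed(run_sequential, urls) and timed(run_concurrent, urls) on the same list returns the same titles, with the concurrent run taking roughly the time of a single simulated fetch. Real speedups depend on the site, the network, and how many pages one browser can handle at once.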

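A headless browser is itself fairly memory-hungry. In production, use headless=True and pass flags such as --no-sandbox and --disable-gpu through args to improve stability. For infinite-scroll pages, setting a sensible viewport (setViewport) and window size may help content load correctly.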
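The production settings above could be collected in one place, roughly as sketched below. The helper names production_launch_options and open_page, and the 1280x800 viewport, are assumptions made for illustration; only headless, args, and setViewport come from the text.

```python
def production_launch_options(extra_args=None):
    # Flags named in the text; extend the list through `extra_args` if needed.
    args = ['--no-sandbox', '--disable-gpu']
    if extra_args:
        args.extend(extra_args)
    return {'headless': True, 'args': args}


async def open_page(width=1280, height=800):
    # Imported lazily so the options helper can be used without pyppeteer.
    from pyppeteer import launch

    browser = await launch(**production_launch_options())
    page = await browser.newPage()
    # The viewport size is an assumed value; tune it for the target site.
    await page.setViewport({'width': width, 'height': height})
    return browser, page
```

Keeping the options in a plain function makes them easy to inspect or override without starting a browser.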

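For anti-scraping defenses, besides using --disable-infobars and injecting JS to hide the webdriver property, you can also consider incognito mode (createIncognitoBrowserContext) to isolate the environment.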

Install the dependency:

$ pip install pyppeteer

After installation, it is a good idea to run the pyppeteer-install command to download Chromium in advance.

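If Chrome is already installed on your machine, you can use the executablePath parameter to point at a custom Chrome/Chromium executable and skip the automatic download. Here is an example that uses the local Chrome on macOS: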

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(
        headless=False, 
        args=['--disable-infobars'], 
        executablePath="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    )
    page = await browser.newPage()
    await page.goto('https://www.baidu.com')
    # ... further operations
    await browser.close()

asyncio.run(main())
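A hard-coded executablePath only works on machines where that file exists. As a rough sketch, the path could be chosen at runtime and the option left out when nothing is found, so pyppeteer falls back to its own Chromium. The helper names and the Linux candidate paths below are assumptions, not part of pyppeteer.

```python
from pathlib import Path

# Candidate locations are assumptions; adjust them for your own machines.
CANDIDATE_PATHS = [
    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    "/usr/bin/google-chrome",
    "/usr/bin/chromium",
]


def find_chrome(candidates=CANDIDATE_PATHS):
    # Return the first candidate that exists as a file, or None.
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


def options_with_local_chrome(candidates=CANDIDATE_PATHS):
    # Only set executablePath when a local browser was actually found.
    options = {'headless': False, 'args': ['--disable-infobars']}
    path = find_chrome(candidates)
    if path is not None:
        options['executablePath'] = path
    return options
```

The resulting dict can be passed as launch(**options_with_local_chrome()).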

Getting the page title and content

page.title() returns the page title. Note that all of these methods are asynchronous, so each call must be prefixed with await, for example:

print(await page.title())

page.content() returns the full HTML of the page and also needs await.

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(
        headless=False, 
        args=['--disable-infobars'], 
        executablePath="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    )
    page = await browser.newPage()
    await page.goto('https://www.baidu.com')
    print(f'Page title: {await page.title()}')
    html_content = await page.content()
    # print(f'Page content: {html_content}')
    await browser.close()

asyncio.run(main())

Locating elements

# Locate the search box and type text into it
await page.type('#kw', 'Python Pyppeteer', {'delay': 100})  # delay simulates the pause between keystrokes
# Locate and click the search button
await page.click('#su')
# Or locate the element first and then act on it
search_box = await page.querySelector('#kw')
await search_box.type('Hello World')

# Read attributes and text content from elements
elements = await page.querySelectorAll('a.news-title')
for elem in elements:
    text = await (await elem.getProperty('textContent')).jsonValue()
    link = await (await elem.getProperty('href')).jsonValue()
    print(f'Title: {text}, link: {link}')

Executing JS code

page.evaluate() executes JS code in the page.

For example, to get the viewport size:

dimensions = await page.evaluate('''() => {
    return {
        width: document.documentElement.clientWidth,
        height: document.documentElement.clientHeight,
        deviceScaleFactor: window.devicePixelRatio,
    }
}''')
print(dimensions)

Or to scroll to the bottom of the page:

await page.evaluate('window.scrollTo(0, document.body.scrollHeight);')
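On infinite-scroll pages a single scroll usually loads only one more batch. A possible approach, sketched here under assumptions, is to keep scrolling until the document height stops growing. The helper name scroll_to_bottom, the pause length, and the round limit are all illustrative choices.

```python
import asyncio


async def scroll_to_bottom(page, pause=0.5, max_rounds=20):
    # `page` is expected to offer pyppeteer's `page.evaluate()` coroutine.
    last_height = await page.evaluate('() => document.body.scrollHeight')
    for _ in range(max_rounds):
        await page.evaluate('() => window.scrollTo(0, document.body.scrollHeight)')
        # Give lazy-loaded content time to arrive (assumed delay).
        await asyncio.sleep(pause)
        height = await page.evaluate('() => document.body.scrollHeight')
        if height == last_height:
            break  # nothing new was loaded, assume the end was reached
        last_height = height
    return last_height
```

The max_rounds cap keeps the loop from running forever on feeds that never end.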

Or to hide the WebDriver fingerprint (an important counter-anti-scraping measure):

await page.evaluate('''() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => false });
}''')
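page.evaluate() runs in the document that is currently loaded, so an override set this way does not carry over to the next navigation. One variant, sketched below, registers the same script with page.evaluateOnNewDocument() before calling goto, so it runs ahead of the page's own scripts. The helper name open_stealth_page is an assumption, and this alone is unlikely to defeat every detection technique.

```python
HIDE_WEBDRIVER_JS = """() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => false });
}"""


async def open_stealth_page(browser, url):
    # `browser` is expected to be a pyppeteer Browser instance.
    page = await browser.newPage()
    # Register the override first so it applies to every new document.
    await page.evaluateOnNewDocument(HIDE_WEBDRIVER_JS)
    await page.goto(url)
    return page
```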

Taking screenshots

page.screenshot() makes it easy to save the page as an image.

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(
        headless=False, 
        args=['--disable-infobars'], 
        executablePath="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    )
    page = await browser.newPage()
    await page.goto('https://blog.dkvirus.com')
    await page.screenshot({'path': 'screenshot.png', 'fullPage': True})  # take a full-page screenshot
    await browser.close()

asyncio.run(main())

Generating PDFs

page.pdf() makes it easy to save the page as a PDF document.

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(
        headless=True,  # the pyppeteer docs note that PDF generation is only supported in headless Chrome
        args=['--disable-infobars'], 
        executablePath="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    )
    page = await browser.newPage()
    await page.goto('https://blog.dkvirus.com')
    await page.pdf({
        'path': 'page.pdf', 
        'format': 'A4', 
        'printBackground': True,    # include background graphics
        'preferCSSPageSize': True,  # let CSS @page rules control the page size
    })  # generate the PDF
    await browser.close()

asyncio.run(main())

The content of the generated PDF is vector-based, so it stays sharp when zoomed in, and generation is also very fast.

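Support for box-shadow is poor, and printing to PDF directly from the browser has the same problem. A workaround is to disable box shadows in the global styles by adding the following CSS: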

* { box-shadow: none!important; }

Intercepting and monitoring network requests

By enabling request interception and listening for the request and response events, you can speed up scraping or inspect the data APIs a page calls.

await page.setRequestInterception(True)

async def intercept_request(req):
    # Block non-essential requests such as images and media to speed things up
    if req.resourceType in ['image', 'stylesheet', 'font', 'media']:
        await req.abort()
    else:
        await req.continue_()

page.on('request', intercept_request)

# Listen for responses to capture API data
async def intercept_response(res):
    if 'api/data' in res.url:
        data = await res.json()
        print(data)

page.on('response', intercept_response)
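The blocking rule can be separated from the browser wiring, which makes it easy to check and adjust on its own. The sketch below is one way to arrange that. The names should_block, handle_request, and enable_blocking are hypothetical, and scheduling the handler with asyncio.ensure_future is a defensive choice that avoids depending on how the event emitter treats coroutine callbacks.

```python
import asyncio

# Resource types to drop; the same set as in the snippet above.
BLOCKED_TYPES = frozenset({'image', 'stylesheet', 'font', 'media'})


def should_block(resource_type, blocked=BLOCKED_TYPES):
    # Pure decision function: no browser needed to test it.
    return resource_type in blocked


async def handle_request(req):
    # `req` is expected to be a pyppeteer Request object.
    if should_block(req.resourceType):
        await req.abort()
    else:
        await req.continue_()


async def enable_blocking(page):
    await page.setRequestInterception(True)
    # Schedule the coroutine explicitly as a task on the running loop.
    page.on('request', lambda req: asyncio.ensure_future(handle_request(req)))
```

Note that blocking stylesheets can change page layout, so drop 'stylesheet' from the set if screenshots or rendered positions matter.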

Loading a local HTML file

import asyncio
from pathlib import Path
from pyppeteer import launch

async def main():
    browser = await launch(
        headless=False, 
        args=['--disable-infobars'], 
        executablePath="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    )
    page = await browser.newPage()
    await page.goto(Path("test.html").absolute().as_uri())  # as_uri() builds a well-formed file:// URL
    print(await page.title())
    await browser.close()

asyncio.run(main())
