[Python Third-Party Libraries] w3lib

2024-01-25 00:00:00

#python

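Introduction to w3lib
Removing HTML tags
Removing comments
Replacing HTML entities with readable characters
Replacing newlines, tabs, and similar characters with spaces
Extracting the base URL
Canonicalizing URLs
Working with URL parameters
Safe URL encoding
Checking whether a URL is valid
Encoding detection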

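Introduction to w3lib

w3lib is a collection of helper functions for web-related tasks involving URLs, HTML, forms, and HTTP. It was originally part of the Scrapy framework and was later split out into a standalone package.

Install it with pip: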

$ pip install w3lib

Removing HTML tags

You can remove all tags, keep only specified tags, or remove only specified tags, which makes it easy to extract plain text.

from w3lib.html import remove_tags, remove_tags_with_content

html = '<p>Hello <b>World</b>!</p>'
# Remove all tags
clean_text = remove_tags(html)
print(clean_text)  # Output: Hello World!

# Keep only <b> tags
result = remove_tags(html, keep=('b',))
print(result)  # Output: Hello <b>World</b>!

# Remove <script> tags together with everything inside them
html_with_js = '<p>content</p><script>alert(1);</script>'
print(remove_tags_with_content(html_with_js, ('script',)))  # Output: <p>content</p>

Removing comments

from w3lib.html import remove_comments

html_with_comment = '<div><!-- a comment --><p>body text</p></div>'
print(remove_comments(html_with_comment))  # Output: <div><p>body text</p></div>

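Replacing HTML entities with readable characters

Converts HTML entities such as &amp; (&), &lt; (<), and &nbsp; (non-breaking space) back into the characters they represent.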

from w3lib.html import replace_entities

text = "Hello &amp; World &gt; Python"
clean_text = replace_entities(text)
print(clean_text)  # Output: Hello & World > Python

Replacing newlines, tabs, and similar characters with spaces

from w3lib.html import replace_escape_chars

# Replace escape characters such as newlines and tabs with spaces
text = """
<p>hello world</p>

<p>I love Python</p>
"""
text = replace_escape_chars(text, replace_by=" ")
print(text)
# Output (each newline becomes one space):  <p>hello world</p>  <p>I love Python</p>

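Extracting the base URL

Extracts the base URL declared by the <base> tag in an HTML fragment, falling back to the URL you pass in when no such tag exists. The result is used to resolve relative links.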

from w3lib.html import get_base_url

html = '<html><head><base href="https://example.com/blog/"></head></html>'
# The <base> tag holds an absolute URL, so it wins regardless of the fallback URL passed in
base_url = get_base_url(html, "https://example.com/default/")
print(base_url)  # Output: https://example.com/blog/

base_url = get_base_url(html, "https://example.com/")
print(base_url)  # Output: https://example.com/blog/

base_url = get_base_url(html, "https://example.com/blog/sample.html")
print(base_url)  # Output: https://example.com/blog/

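Canonicalizing URLs

Lowercases the host name, sorts the query parameters, and by default drops the fragment, so that URLs that look different can be recognized as the same resource.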

from w3lib.url import canonicalize_url

url = "HTTP://www.Example.com/a%20b?b=2&a=1#fragment"
canonical_url = canonicalize_url(url)
print(canonical_url)  # Output: http://www.example.com/a%20b?a=1&b=2

Working with URL parameters

Query parameters in a URL can be added, replaced, or filtered with a single call.

from w3lib.url import add_or_replace_parameter, url_query_cleaner

# Add a parameter
url = "http://www.example.com/page?param1=value1"
new_url = add_or_replace_parameter(url, "param2", "value2")
print(new_url)  # Output: http://www.example.com/page?param1=value1&param2=value2

# Clean parameters: keep only the listed ones
url_with_many_params = "http://example.com/?a=1&b=2&c=3"
cleaned_url = url_query_cleaner(url_with_many_params, parameterlist=["a", "b"])
print(cleaned_url)  # Output: http://example.com/?a=1&b=2

Safe URL encoding

from w3lib.url import safe_url_string

url = "http://www.example.com/文件"
safe_url = safe_url_string(url)
print(safe_url)  # Output: http://www.example.com/%E6%96%87%E4%BB%B6

Checking whether a URL is valid

from w3lib.url import is_url

print(is_url("http://example.com"))  # True
print(is_url("not_a_url"))           # False

Encoding detection

Detects the character encoding that a page declares in its HTML <meta> tag.

from w3lib.encoding import html_body_declared_encoding

html = '<meta charset="UTF-8"><p>Hello!</p>'
encoding = html_body_declared_encoding(html)
print(encoding)  # Output: utf-8 (the normalized codec name)
