w3lib handles tasks involving URLs, HTML, forms, and HTTP for web pages. It was originally part of the Scrapy framework and was later split out into a standalone package.
Install the dependency:
$ pip install w3lib
You can remove all tags, keep only specified tags, or remove only specified tags (together with their content), making it easy to extract plain text quickly.
from w3lib.html import remove_tags, remove_tags_with_content
html = '<p>Hello <b>World</b>!</p>'
# Remove all tags
clean_text = remove_tags(html)
print(clean_text)  # Output: Hello World!
# Keep only <b> tags
result = remove_tags(html, keep=('b',))
print(result)  # Output: Hello <b>World</b>!
# Remove <script> tags together with their content
html_with_js = '<p>Content</p><script>alert(1);</script>'
print(remove_tags_with_content(html_with_js, ('script',)))  # Output: <p>Content</p>
from w3lib.html import remove_comments
html_with_comment = '<div><!-- a comment --><p>Body</p></div>'
print(remove_comments(html_with_comment))  # Output: <div><p>Body</p></div>
Converts HTML entities such as &amp; (&), &lt; (<), and &nbsp; (space) back to the corresponding characters.
from w3lib.html import replace_entities
text = "Hello &amp; World &gt; Python"
clean_text = replace_entities(text)
print(clean_text)  # Output: Hello & World > Python
from w3lib.html import replace_escape_chars
# Replace escape characters such as newlines and tabs with spaces
text = """
<p>hello world</p>
<p>I love Python</p>
"""
text = replace_escape_chars(text, replace_by=" ")
print(text)
# Output (surrounding spaces trimmed here): <p>hello world</p> <p>I love Python</p>
Extracts or infers the base URL (the <base> tag) from an HTML snippet, for use when resolving relative links.
from w3lib.html import get_base_url
html = '<html><head><base href="https://example.com/blog/"></head></html>'
base_url = get_base_url(html, "https://example.com/default/")
print(base_url)  # Output: https://example.com/blog/
base_url = get_base_url(html, "https://example.com/")
print(base_url)  # Output: https://example.com/blog/
base_url = get_base_url(html, "https://example.com/blog/sample.html")
print(base_url)  # Output: https://example.com/blog/
Lowercases the host name, sorts the query parameters, and so on, so that superficially different URLs can be recognized as the same resource.
from w3lib.url import canonicalize_url
url = "HTTP://www.Example.com/a%20b?b=2&a=1#fragment"
canonical_url = canonicalize_url(url)
print(canonical_url)  # Output: "http://www.example.com/a%20b?a=1&b=2"
Conveniently add, replace, or clean up query parameters in a URL.
from w3lib.url import add_or_replace_parameter, url_query_cleaner
# Add a parameter
url = "http://www.example.com/page?param1=value1"
new_url = add_or_replace_parameter(url, "param2", "value2")
print(new_url)  # Output: "http://www.example.com/page?param1=value1&param2=value2"
# Clean parameters: keep only the specified ones
url_with_many_params = "http://example.com/?a=1&b=2&c=3"
cleaned_url = url_query_cleaner(url_with_many_params, parameterlist=["a", "b"])
print(cleaned_url)  # Output: "http://example.com/?a=1&b=2"
from w3lib.url import safe_url_string
url = "http://www.example.com/文件"
safe_url = safe_url_string(url)
print(safe_url)  # Output: "http://www.example.com/%E6%96%87%E4%BB%B6"
from w3lib.url import is_url
print(is_url("http://example.com")) # True
print(is_url("not_a_url")) # False
Detects the character encoding declared in a page's <meta> tags.
from w3lib.encoding import html_body_declared_encoding
html = '<meta charset="UTF-8"><p>Hello!</p>'
encoding = html_body_declared_encoding(html)
print(encoding)  # Output: utf-8 (the name is normalized to lowercase)