Using custom Element classes in lxml

lxml has very sophisticated support for custom Element classes. You can provide your own classes for Elements and have lxml use them by default for all elements generated by a specific parser, only for a specific tag name in a specific namespace or even for an exact element at a specific position in the tree.

Custom Elements must inherit from the lxml.etree.ElementBase class, which provides the Element interface for subclasses:

>>> from lxml import etree

>>> class honk(etree.ElementBase):
...    @property
...    def honking(self):
...       return self.get('honking') == 'true'

This defines a new Element class honk with a property honking.

The following document describes how you can make lxml.etree use these custom Element classes.

Background on Element proxies

Being based on libxml2, lxml.etree holds the entire XML tree in a C structure. To communicate with Python code, it creates Python proxy objects for the XML elements on demand.

The mapping between C elements and Python Element classes is completely configurable. When you ask lxml.etree for an Element by using its API, it will instantiate your classes for you. All you have to do is tell lxml which class to use for which kind of Element. This is done through a class lookup scheme, as described in the sections below.

Element initialization

There is one thing to know up front. Element classes must not have an __init__ or __new__ method. There should not be any internal state either, except for the data stored in the underlying XML tree. Element instances are created and garbage collected as needed, so there is normally no way to predict when and how often a proxy is created for them. Even worse, when the __init__ method is called, the object is not even initialized yet to represent the XML tag, so there is not much use in providing an __init__ method in subclasses.

Most use cases will not require any class initialisation or proxy state, so you can skip to the next section for now. However, if you really need to set up your element class on instantiation, or need a way to persistently store state in the proxy instances instead of the XML tree, here is a way to do so.

There is one important guarantee regarding Element proxies. Once a proxy has been instantiated, it is kept alive as long as there is a Python reference to it, and any access to the XML element in the tree will return this very instance. Therefore, if you need to store local state in a custom Element class (which is generally discouraged), you can do so by keeping the Elements in a tree alive. If the tree doesn't change, you can simply do this:

proxy_cache = list(root.iter())

or

proxy_cache = set(root.iter())

or use any other suitable container. Note that you have to keep this cache up to date manually if the tree changes, which can get tricky in some cases.
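The guarantee above can be sketched as follows. This is a minimal illustration, assuming a hypothetical class `Tracked` and a made-up `visits` attribute stored on the proxy; neither is part of lxml.

```python
from lxml import etree

class Tracked(etree.ElementBase):
    """Hypothetical element class that keeps state on the proxy itself."""

parser = etree.XMLParser()
parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=Tracked))
root = etree.fromstring("<root><a/><b/></root>", parser)

# Holding a reference to every proxy keeps them alive as long as the
# cache exists. iter() yields root first, then <a>, then <b>.
proxy_cache = list(root.iter())

# Illustrative state that lives on the Python proxy, not in the XML tree.
root[0].visits = 3

# Accessing <a> again returns the cached proxy, so the state is still there.
same_proxy = root[0] is proxy_cache[1]
visits = root[0].visits
```

If `proxy_cache` were dropped, the proxy for `<a>` could be collected and a later `root[0]` would be a fresh object without the `visits` attribute.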

For proxy initialisation, ElementBase classes have an _init() method that can be overridden, as opposed to the normal __init__() method. It can be used to modify the XML tree, e.g. to construct special children or verify and update attributes.

The semantics of _init() are as follows:

  • It is called once at Element class instantiation time, that is, when lxml creates a Python representation of the element. At that time, the element object is completely initialized to represent a specific XML element within the tree.
  • The method has complete access to the XML tree. Modifications can be done in exactly the same way as anywhere else in the program.
  • Python representations of elements may be created multiple times during the lifetime of an XML element in the underlying C tree. The _init() code provided by subclasses must itself ensure that multiple executions are either harmless or prevented by some kind of flag in the XML tree. The latter can be achieved by modifying an attribute value or by removing or adding a specific child node and then verifying this before running through the init process.
  • Any exceptions raised in _init() will be propagated through the API call that led to the creation of the Element. So be careful with the code you write here as its exceptions may turn up in various unexpected places.
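The semantics listed above suggest writing _init() so that repeated runs are harmless. A minimal sketch, assuming a hypothetical `Task` class and a made-up `status` attribute that doubles as the "already initialised" flag in the tree:

```python
from lxml import etree

class Task(etree.ElementBase):
    """Hypothetical element class with an idempotent _init()."""

    def _init(self):
        # The attribute itself is the flag: once it exists, later proxy
        # creations for the same XML element leave it untouched.
        if self.get("status") is None:
            self.set("status", "new")

parser = etree.XMLParser()
parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=Task))

root = etree.fromstring('<tasks><task/><task status="done"/></tasks>', parser)

# Iterating creates a proxy per child, which triggers _init() for each.
statuses = [child.get("status") for child in root]
```

Because the state is written into the XML tree rather than onto the proxy, it survives garbage collection of the proxy objects.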

Setting up a class lookup scheme

The first thing to do when deploying custom element classes is to register a class lookup scheme on a parser. lxml.etree provides quite a number of different schemes that also support class lookup based on namespaces or attribute values. Most lookups support fallback chaining, which allows the next lookup mechanism to take over when the previous one fails to find a class.

For example, setting the honk Element as a default element class for a parser works as follows:

>>> parser_lookup = etree.ElementDefaultClassLookup(element=honk)
>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(parser_lookup)

There is one drawback of the parser based scheme: the Element() factory does not know about your specialised parser and creates a new document that deploys the default parser:

>>> el = etree.Element("root")
>>> print(isinstance(el, honk))
False

You should therefore avoid using this factory function in code that uses custom classes. The makeelement() method of parsers provides a simple replacement:

>>> el = parser.makeelement("root")
>>> print(isinstance(el, honk))
True

If you use a parser at the module level, you can easily redirect a module level Element() factory to the parser method by adding code like this:

>>> module_level_parser = etree.XMLParser()
>>> Element = module_level_parser.makeelement

While the XML() and HTML() factories also depend on the default parser, you can pass them a different parser as second argument:

>>> element = etree.XML("<test/>")
>>> print(isinstance(element, honk))
False

>>> element = etree.XML("<test/>", parser)
>>> print(isinstance(element, honk))
True

Whenever you create a document with a parser, it will inherit the lookup scheme and all subsequent element instantiations for this document will use it:

>>> element = etree.fromstring("<test/>", parser)
>>> print(isinstance(element, honk))
True
>>> el = etree.SubElement(element, "subel")
>>> print(isinstance(el, honk))
True

For testing code in the Python interpreter and for small projects, you may also consider setting a lookup scheme on the default parser. To avoid interfering with other modules, however, it is usually a better idea to use a dedicated parser for each module (or a parser pool when using threads) and then register the required lookup scheme only for this parser.
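A minimal sketch of the dedicated-parser setup recommended above. The module-private name `_parser` and the wrappers `Element` and `fromstring` are illustrative choices, not part of lxml.

```python
from lxml import etree

class honk(etree.ElementBase):
    @property
    def honking(self):
        return self.get('honking') == 'true'

# One parser owned by this module; the default parser stays untouched.
_parser = etree.XMLParser()
_parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=honk))

# Module-level factories that always go through the dedicated parser.
Element = _parser.makeelement

def fromstring(text):
    """Parse text with this module's parser (hypothetical helper)."""
    return etree.fromstring(text, _parser)

made = Element("root")
parsed = fromstring("<a><b/></a>")
plain = etree.Element("root")   # still uses the default parser
```

Code elsewhere that calls `etree.Element()` or `etree.fromstring()` without a parser argument is unaffected by this setup.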

Default class lookup

This is the simplest lookup mechanism. It always returns the default element class. Consequently, no further fallbacks are supported, but this scheme is a nice fallback for other custom lookup mechanisms. Specifically, it also handles comments and processing instructions, which are easy to forget about when mapping proxies to classes.

Usage:

>>> lookup = etree.ElementDefaultClassLookup()
>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(lookup)

Note that the default for new parsers is to use the global fallback, which is also the default lookup (if not configured otherwise).

To change the default element implementation, you can pass your new class to the constructor. While it accepts classes for element, comment and pi nodes, most use cases will only override the element class:

>>> el = parser.makeelement("myelement")
>>> print(isinstance(el, honk))
False

>>> lookup = etree.ElementDefaultClassLookup(element=honk)
>>> parser.set_element_class_lookup(lookup)

>>> el = parser.makeelement("myelement")
>>> print(isinstance(el, honk))
True
>>> el.honking
False
>>> el = parser.makeelement("myelement", honking='true')
>>> etree.tostring(el)
b'<myelement honking="true"/>'
>>> el.honking
True

>>> root = etree.fromstring(
...     '<root honking="true"><!--comment--></root>', parser)
>>> root.honking
True
>>> print(root[0].text)
comment
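The constructor's comment argument can be sketched in the same way. This example assumes that custom comment classes derive from etree.CommentBase, which the text above does not spell out; the class `Note` and its `shouted` property are hypothetical.

```python
from lxml import etree

class honk(etree.ElementBase):
    @property
    def honking(self):
        return self.get('honking') == 'true'

class Note(etree.CommentBase):
    """Hypothetical comment class, derived from CommentBase."""

    @property
    def shouted(self):
        return self.text.upper()

# Override both the element class and the comment class.
lookup = etree.ElementDefaultClassLookup(element=honk, comment=Note)
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

root = etree.fromstring(
    '<root honking="true"><!--comment--></root>', parser)
comment = root[0]
```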

Namespace class lookup

This is an advanced lookup mechanism that supports namespace/tag-name specific element classes. You can select it by calling:

>>> lookup = etree.ElementNamespaceClassLookup()
>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(lookup)

See the separate section on implementing namespaces below to learn how to make use of it.

This scheme supports a fallback mechanism that is used in the case where the namespace is not found or no class was registered for the element name. Normally, the default class lookup is used here. To change it, pass the desired fallback lookup scheme to the constructor:

>>> fallback = etree.ElementDefaultClassLookup(element=honk)
>>> lookup = etree.ElementNamespaceClassLookup(fallback)
>>> parser.set_element_class_lookup(lookup)

>>> root = etree.fromstring(
...     '<root honking="true"><!--comment--></root>', parser)
>>> root.honking
True
>>> print(root[0].text)
comment

Attribute based lookup

This scheme uses a mapping from attribute values to classes. An attribute name is set at initialisation time and is then used to find the corresponding value in a dictionary. It is set up as follows:

>>> id_class_mapping = {'1234' : honk} # maps attribute values to classes

>>> lookup = etree.AttributeBasedElementClassLookup(
...                                      'id', id_class_mapping)
>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(lookup)

And here is how to use it:

>>> xml = '<a id="123"><b id="1234"/><b id="1234" honking="true"/></a>'
>>> a = etree.fromstring(xml, parser)

>>> a.honking       # id does not match !
Traceback (most recent call last):
AttributeError: 'lxml.etree._Element' object has no attribute 'honking'

>>> a[0].honking
False
>>> a[1].honking
True

This lookup scheme uses its fallback if the attribute is not found or its value is not in the mapping. Normally, the default class lookup is used here. If you want to use the namespace lookup, for example, you can use this code:

>>> fallback = etree.ElementNamespaceClassLookup()
>>> lookup = etree.AttributeBasedElementClassLookup(
...                       'id', id_class_mapping, fallback)
>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(lookup)
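To see the fallback chain take effect, the namespace lookup needs at least one registered class. A minimal sketch, assuming a hypothetical namespace URI `http://example.com/ns` and a hypothetical class `NsItem`:

```python
from lxml import etree

class honk(etree.ElementBase):
    @property
    def honking(self):
        return self.get('honking') == 'true'

class NsItem(etree.ElementBase):
    """Hypothetical class for <item> elements in the example namespace."""

# Second stage: namespace/tag based lookup.
fallback = etree.ElementNamespaceClassLookup()
fallback.get_namespace('http://example.com/ns')['item'] = NsItem

# First stage: the 'id' attribute decides; anything else falls through.
lookup = etree.AttributeBasedElementClassLookup(
    'id', {'1234': honk}, fallback)
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

root = etree.fromstring(
    '<root xmlns="http://example.com/ns">'
    '<item id="1234" honking="true"/><item id="99"/><item/>'
    '</root>', parser)

classes = [type(child).__name__ for child in root]
```

The first child matches the attribute mapping, while the other two fall through to the namespace lookup.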

Custom element class lookup

This is the most customisable way of finding element classes on a per-element basis. It allows you to implement a custom lookup scheme in a subclass:

>>> class MyLookup(etree.CustomElementClassLookup):
...     def lookup(self, node_type, document, namespace, name):
...         if node_type == 'element':
...             return honk  # be a bit more selective here ...
...         else:
...             return None  # pass on to (default) fallback

>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(MyLookup())

>>> root = etree.fromstring(
...     '<root honking="true"><!--comment--></root>', parser)
>>> root.honking
True
>>> print(root[0].text)
comment

The .lookup() method must return either None (which triggers the fallback mechanism) or a subclass of lxml.etree.ElementBase. It can take any decision it wants based on the node type (one of "element", "comment", "PI", "entity"), the XML document of the element, or its namespace or tag name.
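A slightly more selective variant of the lookup above might look like the following sketch. The class name `TagLookup` is hypothetical, and the check assumes that un-namespaced elements are reported with an empty or None namespace.

```python
from lxml import etree

class honk(etree.ElementBase):
    @property
    def honking(self):
        return self.get('honking') == 'true'

class TagLookup(etree.CustomElementClassLookup):
    def lookup(self, node_type, document, namespace, name):
        # Only un-namespaced <honk> elements get the custom class.
        if node_type == 'element' and not namespace and name == 'honk':
            return honk
        return None  # everything else goes to the (default) fallback

parser = etree.XMLParser()
parser.set_element_class_lookup(TagLookup())

root = etree.fromstring(
    '<root><honk honking="true"/><other/><!--note--></root>', parser)
```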

Tree based element class lookup in Python

Making more elaborate decisions than the custom scheme allows is difficult to achieve in pure Python, as it results in a chicken-and-egg problem. It would require access to the tree before the elements in the tree have been instantiated as Python Element proxies.

Luckily, there is a way to do this. The PythonElementClassLookup works similarly to the custom lookup scheme:

>>> class MyLookup(etree.PythonElementClassLookup):
...     def lookup(self, document, element):
...         return MyElementClass # defined elsewhere

>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(MyLookup())

As before, the first argument to the lookup() method is the opaque document instance that contains the Element. The second argument is a lightweight Element proxy implementation that is only valid during the lookup. Do not try to keep a reference to it. Once the lookup is finished, the proxy will become invalid. You will get an AssertionError if you access any of the properties or methods outside the scope of the lookup call where they were instantiated.

During the lookup, the element object behaves mostly like a normal Element instance. It provides the properties tag, text, tail etc. and supports indexing, slicing and the getchildren(), getparent() etc. methods. It does not support iteration, nor does it support any kind of modification. All of its properties are read-only and it cannot be removed or inserted into other trees. You can use it as a starting point to freely traverse the tree and collect any kind of information that its elements provide. Once you have decided which class to use for this element, you can simply return it and have lxml take care of cleaning up the instantiated proxy classes.

Sidenote: this lookup scheme originally lived in a separate module called lxml.pyclasslookup.
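As an illustration of a tree-based decision, the following sketch picks a class depending on the parent element. The names `ListItem` and `ParentAwareLookup` are hypothetical, and the sample document deliberately contains only elements.

```python
from lxml import etree

class ListItem(etree.ElementBase):
    """Hypothetical class for elements directly inside a <list>."""

class ParentAwareLookup(etree.PythonElementClassLookup):
    def lookup(self, document, element):
        # 'element' is the read-only proxy; it is only valid inside
        # this call, so nothing derived from it is stored.
        parent = element.getparent()
        if parent is not None and parent.tag == 'list':
            return ListItem
        return None  # defer to the default fallback

parser = etree.XMLParser()
parser.set_element_class_lookup(ParentAwareLookup())

root = etree.fromstring(
    '<doc><list><entry/><entry/></list><entry/></doc>', parser)

inside = root[0][0]    # <entry> whose parent is <list>
outside = root[1]      # <entry> whose parent is <doc>
```

The same tag name ends up with different classes depending on where it sits in the tree, which the name-based schemes cannot express.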

Generating XML with custom classes

Up to lxml 2.1, you could not instantiate proxy classes yourself. Only lxml.etree could do that when creating an object representation of an existing XML element. Since lxml 2.2, however, instantiating this class will simply create a new Element:

>>> el = honk(honking='true')
>>> el.tag
'honk'
>>> el.honking
True

Note, however, that the proxy you create here will be garbage collected just like any other proxy. You can therefore not count on lxml.etree using the same class that you instantiated when you access this Element a second time after letting its reference go. You should therefore always use a corresponding class lookup scheme that returns your Element proxy classes for the elements that they create. The ElementNamespaceClassLookup is generally a good match.

You can use custom Element classes to quickly create XML fragments:

>>> class hale(etree.ElementBase): pass
>>> class bopp(etree.ElementBase): pass

>>> el = hale( "some ", honk(honking = 'true'), bopp, " text" )

>>> print(etree.tostring(el, encoding='unicode'))
<hale>some <honk honking="true"/><bopp/> text</hale>
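The advice to pair instantiated classes with a matching lookup can be sketched as follows. This is one possible arrangement, not the only one: the root is created through a parser that carries a namespace lookup, so that elements appended to it are later resolved by that lookup. Passing None to get_namespace() selects the empty namespace.

```python
from lxml import etree

class honk(etree.ElementBase):
    @property
    def honking(self):
        return self.get('honking') == 'true'

lookup = etree.ElementNamespaceClassLookup()
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

# Register honk for un-namespaced <honk> tags.
lookup.get_namespace(None)['honk'] = honk

# The root's document is bound to our parser and thus to the lookup.
root = parser.makeelement('root')

# The proxy created here is temporary and is typically released
# right after the call returns.
root.append(honk(honking='true'))

# Accessing the child again goes through the class lookup.
child = root[0]
```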

Implementing namespaces

lxml allows you to implement namespaces, in a rather literal sense. After setting up the namespace class lookup mechanism as described above, you can build a new element namespace (or retrieve an existing one) by calling the get_namespace(uri) method of the lookup:

>>> lookup = etree.ElementNamespaceClassLookup()
>>> parser = etree.XMLParser()
>>> parser.set_element_class_lookup(lookup)

>>> namespace = lookup.get_namespace('http://hui.de/honk')

and then register the new element type with that namespace, say, under the tag name honk:

>>> namespace['honk'] = honk

If you have many Element classes declared in one module, and they are all named like the elements they create, you can simply use namespace.update(globals()) at the end of your module to declare them automatically. The implementation is smart enough to ignore everything that is not an Element class.

After this, you create and use your XML elements through the normal API of lxml:

>>> xml = '<honk xmlns="http://hui.de/honk" honking="true"/>'
>>> honk_element = etree.XML(xml, parser)
>>> print(honk_element.honking)
True

The same works when creating elements by hand:

>>> honk_element = parser.makeelement('{http://hui.de/honk}honk',
...                                   honking='true')
>>> print(honk_element.honking)
True

Essentially, this allows you to give Elements a custom API based on their namespace and tag name.

A somewhat related topic is extension functions, which use a similar mechanism for registering Python functions for use in XPath and XSLT.

In the setup example above, we associated the honk Element class only with the 'honk' element. If an XML tree contains different elements in the same namespace, they do not pick up the same implementation:

>>> xml = ('<honk xmlns="http://hui.de/honk" honking="true">'
...        '<bla/><!--comment-->'
...        '</honk>')
>>> honk_element = etree.XML(xml, parser)
>>> print(honk_element.honking)
True
>>> print(honk_element[0].honking)
Traceback (most recent call last):
  ...
AttributeError: 'lxml.etree._Element' object has no attribute 'honking'
>>> print(honk_element[1].text)
comment

You can therefore provide one implementation per element name in each namespace and have lxml select the right one on the fly. If you want one element implementation per namespace (ignoring the element name) or prefer having a common class for most elements except a few, you can specify a default implementation for an entire namespace by registering that class with the empty element name (None).

You may consider following an object oriented approach here. If you build a class hierarchy of element classes, you can also implement a base class for a namespace that is used if no specific element class is provided. Again, you can just pass None as an element name:

>>> class HonkNSElement(etree.ElementBase):
...    def honk(self):
...       return "HONK"
>>> namespace[None] = HonkNSElement  # default Element for namespace

>>> class HonkElement(HonkNSElement):
...    @property
...    def honking(self):
...       return self.get('honking') == 'true'
>>> namespace['honk'] = HonkElement  # Element for specific tag

Now you can rely on lxml to always return objects of type HonkNSElement or its subclasses for elements of this namespace:

>>> xml = ('<honk xmlns="http://hui.de/honk" honking="true">'
...        '<bla/><!--comment-->'
...        '</honk>')
>>> honk_element = etree.fromstring(xml, parser)

>>> print(type(honk_element))
<class 'HonkElement'>
>>> print(type(honk_element[0]))
<class 'HonkNSElement'>

>>> print(honk_element.honking)
True
>>> print(honk_element.honk())
HONK

>>> print(honk_element[0].honk())
HONK
>>> print(honk_element[0].honking)
Traceback (most recent call last):
...
AttributeError: 'HonkNSElement' object has no attribute 'honking'

>>> print(honk_element[1].text)  # uses fallback for non-elements
comment

Since lxml 4.1, the registration is more conveniently done with class decorators. The namespace registry object is callable with a name (or None) as argument and can then be used as a decorator.

>>> honk_elements = lookup.get_namespace('http://hui.de/honk')

>>> @honk_elements(None)
... class HonkNSElement(etree.ElementBase):
...    def honk(self):
...       return "HONK"

If the class has the same name as the tag, you can also leave out the call and use the blank decorator instead:

>>> @honk_elements
... class honkel(HonkNSElement):
...    @property
...    def honking(self):
...       return self.get('honking') == 'true'

>>> xml = '<honkel xmlns="http://hui.de/honk" honking="true"><bla/><!--comment--></honkel>'
>>> honk_element = etree.fromstring(xml, parser)

>>> print(type(honk_element))
<class 'honkel'>
>>> print(type(honk_element[0]))
<class 'HonkNSElement'>
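Because the registry can be called with a tag name, the class name does not have to match the tag. A minimal sketch, assuming lxml 4.1 or later; the class name `HonkElement` is just an example:

```python
from lxml import etree

lookup = etree.ElementNamespaceClassLookup()
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

honk_elements = lookup.get_namespace('http://hui.de/honk')

# Register the class under the tag name 'honk', independent of its own name.
@honk_elements('honk')
class HonkElement(etree.ElementBase):
    @property
    def honking(self):
        return self.get('honking') == 'true'

root = etree.fromstring(
    '<honk xmlns="http://hui.de/honk" honking="true"/>', parser)
```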

Last updated: April 11, 2024
Created: April 11, 2024