14.3 元数据¶
14.3 Metadata
14.3.1 概述¶
14.3.1 General
PDF 文档可以包含一般信息,例如文档的标题、作者以及创建和修改日期。这些关于文档的全局信息(与其内容或结构相对)称为元数据,旨在帮助在外部数据库中对文档进行编目和搜索。从 PDF 1.4 开始,元数据也可以为文档的各个组件指定。
元数据可以通过以下两种方式存储在 PDF 文档中:
注意
文档信息字典是最初在 PDF 文件中包含元数据的方式。元数据流在 PDF 1.4 中引入,现在是包含元数据的首选方法。
A PDF document may include general information, such as the document’s title, author, and creation and modification dates. Such global information about the document (as opposed to its content or structure) is called metadata and is intended to assist in cataloguing and searching for documents in external databases. Beginning with PDF 1.4, metadata may also be specified for individual components of a document.
Metadata may be stored in a PDF document in either of the following ways:
- In a metadata stream (PDF 1.4) associated with the document or a component of the document (14.3.2, “Metadata Streams”)
- In a document information dictionary associated with the document (14.3.3, “Document Information Dictionary”)
NOTE
Document information dictionaries is the original way that metadata was included in a PDF file. Metadata streams were introduced in PDF 1.4 and is now the preferred method to include metadata.
14.3.2 元数据流¶
14.3.2 Metadata Streams
元数据可以用于整个文档,也可以用于文档中的各个组件,并且可以存储在称为 metadata 流的 PDF 流中(PDF 1.4)。
注意 1
元数据流相比于文档信息字典具有以下优势:
- 基于 PDF 的工作流程通常会将包含元数据的艺术作品嵌入到更大的文档组件中。元数据流提供了一种标准方式来保留这些组件的元数据,以便后续检查。支持 PDF 的符合标准的产品应该能够从 PDF 文档本身派生出所有包含元数据的文档组件列表。
- PDF 文档通常会在 Web 或其他环境中提供,其中许多工具会例行检查、编目和分类文档。即使这些工具不理解 PDF,也应该能够理解文档的自包含描述。
除了所有流字典共有的常规条目(参见 表 5),元数据流字典还应包含 表 315 中列出的附加条目。
元数据流的内容应采用可扩展标记语言(XML)格式表示元数据。
注意 2
仅当元数据流既未经过滤也未加密时,对于不支持 PDF 的工具,该信息才能以纯文本的形式可见。
键 | 类型 | 值 |
---|---|---|
Type | name | (必需)此字典描述的 PDF 对象类型;对于元数据流,该值应为 Metadata。 |
Subtype | name | (必需)此字典描述的元数据流类型;该值应为 XML。 |
注意 3
表示元数据的 XML 格式定义为一个名为可扩展元数据平台(XMP)的框架的一部分,并在 Adobe 文档 XMP: Extensible Metadata Platform 中进行了描述(参见 参考文献)。该框架提供了一种使用 XML 来表示文档及其组件的元数据的方法,旨在被比仅处理 PDF 更广泛的产品类别采用。它包括一种方法,可以在非 XML 数据文件中嵌入 XML 数据,以平台无关的格式存储,使其可以通过简单扫描轻松定位和访问,而无需解析整个文档文件。
可以通过文档目录中的 Metadata 条目(参见 7.7.2,“文档目录”)将元数据流附加到文档。元数据框架提供了一个时间戳,用于标记框架中的元数据。如果该时间戳等于或晚于文档信息字典中记录的文档修改日期,则应以元数据流为准。然而,如果文档信息字典中的文档修改日期晚于元数据流的时间戳,则表明该文档可能是由不支持元数据流的编写器保存的。在这种情况下,文档信息字典中存储的信息应覆盖元数据流中的任何语义等价项。此外,以流或字典形式表示的 PDF 文档组件可能包含一个 Metadata 条目(参见 表 316)。
键 | 类型 | 值 |
---|---|---|
Metadata | stream | (可选;PDF 1.4)包含该组件元数据的元数据流。 |
通常,任何 PDF 流或字典都可以附加元数据,只要该流或字典表示的是实际的信息资源,而不是仅仅作为实现的附属物。一些 PDF 结构被视为实现性构造,因此可能无法附加元数据。
当确切的流或字典是否可以包含 Metadata 条目存在歧义时,应尽可能将元数据附加到实际存储所描述数据资源的对象上。
注意 4
用于描述平铺图案的元数据应附加到图案流的字典中,而着色(shading)的元数据应附加到着色字典中,而不是引用它的着色图案字典中。同样,用于描述 ICCBased 颜色空间的元数据应附加到描述它的 ICC 配置文件流中,而字体的元数据应附加到字体文件流中,而不是字体字典中。
注意 5
在本规范中描述文档组件的表格中,Metadata 条目仅列出最可能使用的组件。然而,请注意,该条目可能会出现在其他以流或字典表示的组件中。
- 此外,元数据还可以与内容流中的标记内容相关联。这种关联应通过在属性列表字典中包含一个键为 Metadata、值为元数据流字典的条目来创建。由于此结构引用了内容流之外的对象,因此属性列表是间接引用的,作为命名资源(参见 14.6.2,“属性列表”)。
Metadata, both for an entire document and for components within a document, may be stored in PDF streams called metadata streams (PDF 1.4).
NOTE 1
Metadata streams have the following advantages over the document information dictionary:
- PDF-based workflows often embed metadata-bearing artwork as components within larger documents. Metadata streams provide a standard way of preserving the metadata of these components for examination downstream. PDF-aware conforming products should be able to derive a list of all metadata- bearing document components from the PDF document itself.
- PDF documents are often made available on the Web or in other environments, where many tools routinely examine, catalogue, and classify documents. These tools should be able to understand the self-contained description of the document even if they do not understand PDF.
Besides the usual entries common to all stream dictionaries (see Table 5), the metadata stream dictionary shall contain the additional entries listed in Table 315.
The contents of a metadata stream shall be the metadata represented in Extensible Markup Language (XML).
NOTE 2
This information is visible as plain text to tools that are not PDF-aware only if the metadata stream is both unfiltered and unencrypted.
Key | Type | Value |
---|---|---|
Type | name | (Required) The type of PDF object that this dictionary describes; shall be Metadata for a metadata stream. |
Subtype | name | (Required) The type of metadata stream that this dictionary describes; shall be XML. |
NOTE 3
The format of the XML representing the metadata is defined as part of a framework called the Extensible Metadata Platform (XMP) and described in the Adobe document XMP: Extensible Metadata Platform (see the Bibliography). This framework provides a way to use XML to represent metadata describing documents and their components and is intended to be adopted by a wider class of products than just those that process PDF. It includes a method to embed XML data within non-XML data files in a platform-independent format that can be easily located and accessed by simple scanning rather than requiring the document file to be parsed.
A metadata stream may be attached to a document through the Metadata entry in the document catalogue (see 7.7.2, “Document Catalog”). The metadata framework provides a date stamp for metadata expressed in the framework. If this date stamp is equal to or later than the document modification date recorded in the document information dictionary, the metadata stream shall be taken as authoritative. If, however, the document modification date recorded in the document information dictionary is later than the metadata stream’s date stamp, the document has likely been saved by a writer that is not aware of metadata streams. In this case, information stored in the document information dictionary shall be taken to override any semantically equivalent items in the metadata stream. In addition, PDF document components represented as a stream or dictionary may have a Metadata entry (see Table 316).
Key | Type | Value |
---|---|---|
Metadata | stream | (Optional; PDF 1.4) A metadata stream containing metadata for the component. |
In general, any PDF stream or dictionary may have metadata attached to it as long as the stream or dictionary represents an actual information resource, as opposed to serving as an implementation artifact. Some PDF constructs are considered implementational, and hence may not have associated metadata.
When there is ambiguity about exactly which stream or dictionary may bear the Metadata entry, the metadata shall be attached as close as possible to the object that actually stores the data resource described.
NOTE 4
Metadata describing a tiling pattern should be attached to the pattern stream’s dictionary, but a shading should have metadata attached to the shading dictionary rather than to the shading pattern dictionary that refers to it. Similarly, metadata describing an ICCBased colour space should be attached to the ICC profile stream describing it, and metadata for fonts should be attached to font file streams rather than to font dictionaries.
NOTE 5
In tables describing document components in this specification, the Metadata entry is listed only for those in which it is most likely to be used. Keep in mind, however, that this entry may appear in other components represented as streams or dictionaries.
- In addition, metadata may also be associated with marked content within a content stream. This association shall be created by including an entry in the property list dictionary whose key shall be Metadata and whose value shall be the metadata stream dictionary. Because this construct refers to an object outside the content stream, the property list is referred to indirectly as a named resource (see 14.6.2, “Property Lists”).
14.3.3 文档信息字典¶
14.3.3 Document Information Dictionary
PDF 文件的尾部可选的 Info 条目(参见 7.5.5,“文件尾部”)应包含一个 文档信息字典,其中存储了文档的元数据。表 317 显示了其内容。任何值未知的条目应从字典中省略,而不是以空字符串作为其值包含在内。
某些符合标准的阅读器可能允许对文档信息字典的内容进行搜索。为了便于浏览和编辑,字典中的所有键应完整拼写,不得使用缩写。新增的键应经过谨慎选择,以确保对用户而言具有合理的意义。
表 317 中未明确提及的任何键,其关联值应为文本字符串。
尽管符合标准的阅读器可以在文档信息字典中存储自定义元数据,但不得在其中存储私有内容或结构信息。此类信息应存储在文档目录中(参见 7.7.2,“文档目录”)。
键 | 类型 | 值 |
---|---|---|
Title | 文本字符串 | (可选;PDF 1.1)文档的标题。 |
Author | 文本字符串 | (可选)创建文档的人员姓名。 |
Subject | 文本字符串 | (可选;PDF 1.1)文档的主题。 |
Keywords | 文本字符串 | (可选;PDF 1.1)与文档关联的关键词。 |
Creator | 文本字符串 | (可选)如果文档是从其他格式转换为 PDF,则该字段记录创建原始文档的符合标准的产品名称。 |
Producer | 文本字符串 | (可选)如果文档是从其他格式转换为 PDF,则该字段记录执行转换的符合标准的产品名称。 |
CreationDate | 日期 | (可选)以可读形式记录文档的创建日期和时间(参见 7.9.4,“日期”)。 |
ModDate | 日期 | (如果文档目录中存在 PieceInfo 则必需,否则为可选;PDF 1.1)以可读形式记录文档最近修改的日期和时间(参见 7.9.4,“日期”)。 |
Trapped | 名称 | (可选;PDF 1.3)一个名称对象,指示文档是否已修改以包含陷印(trapping)信息(参见 14.11.6,“陷印支持”): True 文档已完全进行陷印,不需要进一步的陷印处理。该值应为名称 True,而非布尔值 true。 False 文档尚未进行陷印。该值应为名称 False,而非布尔值 false。 Unknown 可能未知文档是否已进行陷印,或者仅部分陷印但尚未完全陷印,仍可能需要额外的陷印处理。 默认值:Unknown。 注意 该条目的值可能由创建文档陷印信息的软件自动设置,或者只能由人工操作员手动输入。 |
示例
下面是一个典型的文档信息字典示例:
1 0 obj
<< /Title ( PostScript Language Reference, Third Edition )
/Author ( Adobe Systems Incorporated )
/Creator ( Adobe FrameMaker 5 . 5 . 3 for Power Macintosh® )
/Producer ( Acrobat Distiller 3 . 01 for Power Macintosh )
/CreationDate ( D : 19970915110347 - 08 ' 00 ' )
/ModDate ( D : 19990209153925 - 08 ' 00 ' )
>>
endobj
The optional Info entry in the trailer of a PDF file (see 7.5.5, “File Trailer”) shall hold a document information dictionary containing metadata for the document; Table 317 shows its contents. Any entry whose value is not known should be omitted from the dictionary rather than included with an empty string as its value.
Some conforming readers may choose to permit searches on the contents of the document information dictionary. To facilitate browsing and editing, all keys in the dictionary shall be fully spelled out, not abbreviated. New keys should be chosen with care so that they make sense to users.
The value associated with any key not specifically mentioned in Table 317 shall be a text string.
Although conforming readers may store custom metadata in the document information dictionary, they may not store private content or structural information there. Such information shall be stored in the document catalogue instead (see 7.7.2, “Document Catalog”).
Key | Type | Value |
---|---|---|
Title | text string | (Optional; PDF 1.1) The document’s title. |
Author | text string | (Optional) The name of the person who created the document. |
Subject | text string | (Optional; PDF 1.1) The subject of the document. |
Keywords | text string | (Optional; PDF 1.1) Keywords associated with the document. |
Creator | text string | (Optional) If the document was converted to PDF from another format, the name of the conforming product that created the original document from which it was converted. |
Producer | text string | (Optional) If the document was converted to PDF from another format, the name of the conforming product that converted it to PDF. |
CreationDate | date | (Optional) The date and time the document was created, in human-readable form (see 7.9.4, “Dates”). |
ModDate | date | (Required if PieceInfo is present in the document catalogue; otherwise optional; PDF 1.1) The date and time the document was most recently modified, in human-readable form (see 7.9.4, “Dates”). |
Trapped | name | (Optional; PDF 1.3) A name object indicating whether the document has been modified to include trapping information (see 14.11.6, “Trapping Support”): True The document has been fully trapped; no further trapping shall be needed. This shall be the name True, not the boolean value true. False The document has not yet been trapped. This shall be the name False, not the boolean value false. Unknown Either it is unknown whether the document has been trapped or it has been partly but not yet fully trapped; some additional trapping may still be needed. Default value: Unknown. NOTE The value of this entry may be set automatically by the software creating the document’s trapping information, or it may be known only to a human operator and entered manually. |
EXAMPLE
This example shows a typical document information dictionary.
1 0 obj
<< /Title ( PostScript Language Reference, Third Edition )
/Author ( Adobe Systems Incorporated )
/Creator ( Adobe FrameMaker 5 . 5 . 3 for Power Macintosh® )
/Producer ( Acrobat Distiller 3 . 01 for Power Macintosh )
/CreationDate ( D : 19970915110347 - 08 ' 00 ' )
/ModDate ( D : 19990209153925 - 08 ' 00 ' )
>>
endobj