14.10 Web Capture

14.10.1 General

The information in the Web Capture data structures enables conforming products to perform the following operations:

  • Save locally and preserve the visual appearance of material from the Web
  • Retrieve additional material from the Web and add it to an existing PDF file
  • Update or modify existing material previously captured from the Web
  • Find source information for material captured from the Web, such as the URL (if any) from which it was captured
  • Find all material in a PDF file that was generated from a given URL
  • Find all material in a PDF file that matches a given digital identifier (MD5 hash)

The information needed to perform these operations shall be recorded in two data structures in the PDF file:

  • The Web Capture information dictionary, which shall hold document-level information related to Web Capture.
  • The Web Capture content database, which shall hold a complete registry of the source content resources retrieved by Web Capture and where they came from.

NOTE 3

The Web Capture content database enables the capturing process to avoid downloading material that is already present in the file.

14.10.2 Web Capture Information Dictionary

The optional SpiderInfo entry in the document catalog (see 7.7.2, “Document Catalog”), if present, shall hold a Web Capture information dictionary.

Table 350 – Entries in the Web Capture information dictionary
Key Type Value
V number (Required) The Web Capture version number. The version number shall be 1.0 in a conforming file.

This value shall be a single real number, not a major and minor version number.

EXAMPLE   A version number of 1.2 would be considered greater than 1.15.
C array (Optional) An array of indirect references to Web Capture command dictionaries (see 14.10.5.3, “Command Dictionaries”) describing commands that were used in building the PDF file. The commands shall appear in the array in the order in which they were executed in building the file.

14.10.3 Content Database

14.10.3.1 General

When a PDF file, or part of a PDF file, is built from a content resource stored in another format, such as an HTML page, the resulting PDF file (or portion thereof) may contain content from more than a single content resource. Conversely, since many content formats do not have static pagination, a single content resource may give rise to multiple PDF pages.

To keep track of the correspondence between PDF content and the resources from which the content was derived, a PDF file may contain a content database that maps URLs and digital identifiers to PDF objects such as pages and XObjects.

NOTE 4

By looking up digital identifiers in the database, Web Capture can determine whether newly downloaded content is identical to content already retrieved from a different URL. Thus, it can perform optimizations such as storing only one copy of an image that is referenced by multiple HTML pages.

Web Capture’s content database shall be organized into content sets. Each content set shall be a dictionary holding information about a group of related PDF objects generated from the same source data. A content set shall have for the value of its S (subtype) entry either the value SPS, for a page set, or SIS, for an image set.

The mapping from a source content resource to a content set in a PDF document may be saved in the PDF file. The mapping may be an association from the resource's URL to the content set, stored in the PDF document's URLS name tree. The mapping may also be an association from a digital identifier (see 14.10.3.3, “Digital Identifiers”) generated from the resource's data to the content set, stored in the PDF document's IDS name tree. Both associations may be present in the PDF file.

Entries in the URLS and IDS name trees may refer to an array of content sets or a single content set. If the entry is an array, the content sets need not have the same subtype; the array may include both page sets and image sets.
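The lookup described above can be sketched as follows. This is a minimal illustration only: the name trees are modelled as plain Python dictionaries, and the function name `content_sets_for` is hypothetical, not part of any PDF library.

```python
def content_sets_for(name_tree, key):
    """Return the content sets registered under `key` as a list.

    `name_tree` stands in for the URLS or IDS name tree (assumption: a dict
    mapping a canonical URL or digital identifier to a tree entry).
    An entry may be a single content set or an array of content sets.
    """
    entry = name_tree.get(key)
    if entry is None:
        return []
    if isinstance(entry, (list, tuple)):
        return list(entry)  # an array may mix page sets and image sets
    return [entry]          # a single content set


# Hypothetical content sets, reduced to their S (subtype) entry.
page_set = {"S": "SPS"}
image_set = {"S": "SIS"}
urls_tree = {
    "http://www.example.com/index.html": page_set,
    "http://www.example.com/anim.gif": [image_set, page_set],
}
```

Normalizing every entry to a list lets calling code treat the single-set and array forms uniformly.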

14.10.3.2 URL Strings

URLs associated with Web Capture content sets shall be reduced to a predictable, canonical form before being used as keys in the URLS name tree. The following steps describe how to perform this reduction, using terminology from Internet RFCs 1738, Uniform Resource Locators, and 1808, Relative Uniform Resource Locators (see the Bibliography). This algorithm shall be applied for HTTP, FTP, and file URLs:

Algorithm: URL strings

a) If the URL is relative, it shall be converted into an absolute URL.
b) If the URL contains one or more NUMBER SIGN (23h) characters, it shall be truncated before the first NUMBER SIGN.
c) Any uppercase ASCII characters within the scheme section of the URL shall be replaced with the corresponding lowercase ASCII characters.
d) If there is a host section, any uppercase ASCII characters therein shall be converted to lowercase ASCII.
e) If the scheme is file and the host is localhost, the host section shall be removed.
f) If there is a port section and the port is the default port for the given protocol (80 for HTTP or 21 for FTP), the port section shall be removed.
g) If the path section contains PERIOD (2Eh) ( . ) or DOUBLE PERIOD ( .. ) subsequences, the path shall be transformed as described in section 4 of RFC 1808.
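Steps a) through g) can be sketched in Python with the standard `urllib.parse` module. This is an illustrative approximation, not a reference implementation: the function names `canonical_url` and `remove_dot_segments` are hypothetical, the dot-segment handling is a simplified reading of RFC 1808 section 4 that assumes an absolute path, and percent-encoding is left untouched.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "ftp": 21}  # step f


def remove_dot_segments(path):
    """Step g (simplified): resolve "." and ".." segments of an absolute path."""
    segments = path.split("/")
    out = []
    for index, seg in enumerate(segments):
        is_last = index == len(segments) - 1
        if seg == ".":
            if is_last:
                out.append("")  # "/a/." becomes "/a/"
        elif seg == ".." and len(out) > 1 and out[-1] != "..":
            out.pop()           # drop the preceding segment
            if is_last:
                out.append("")  # "/a/b/.." becomes "/a/"
        else:
            out.append(seg)     # ordinary segment, or an unresolvable ".."
    return "/".join(out)


def canonical_url(url, base=None):
    if base is not None:
        url = urljoin(base, url)                 # step a: make absolute
    url = url.split("#", 1)[0]                   # step b: truncate at first '#'
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                # step c
    host = parts.hostname or ""                  # step d: hostname is lowercased
    if ":" in host:
        host = "[" + host + "]"                  # restore IPv6 brackets
    port = parts.port
    if scheme == "file" and host == "localhost":
        host, port = "", None                    # step e
    if port is not None and DEFAULT_PORTS.get(scheme) == port:
        port = None                              # step f
    userinfo = parts.netloc.rpartition("@")[0]
    netloc = host
    if port is not None:
        netloc += ":" + str(port)
    if userinfo:
        netloc = userinfo + "@" + netloc
    path = remove_dot_segments(parts.path)       # step g
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

For example, `canonical_url("HTTP://WWW.Example.COM:80/a/./b/../c.html#frag")` yields `http://www.example.com/a/c.html` under these assumptions.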

NOTE

Because the PERCENT SIGN (25h) is unsafe according to RFC 1738 and is also the escape character for encoded characters, it is not possible in general to distinguish a URL with unencoded characters from one with encoded characters. For example, it is impossible to decide whether the sequence %00 represents a single encoded null character or a sequence of three unencoded characters. Hence, no number of encoding or decoding passes on a URL can ever cause it to reach a stable state. Empirically, URLs embedded in HTML files have unsafe characters encoded with one encoding pass, and Web servers perform one decoding pass on received paths (though CGI scripts can make their own decisions).

Canonical URLs are thus assumed to have undergone one and only one encoding pass. A URL whose initial encoding state is known can be safely transformed into a URL that has undergone only one encoding pass.

14.10.3.3 Digital Identifiers

Digital identifiers, used to associate source content resources with content sets by the IDS name tree, shall be generated using the MD5 message-digest algorithm (Internet RFC 1321).

NOTE 1

The exact data passed to the algorithm depends on the type of content set and the nature of the identifier being calculated.

For a page set, the source data shall be passed to the MD5 algorithm first, followed by strings representing the digital identifiers of any auxiliary data files (such as images) referenced in the source data, in the order in which they are first referenced. If an auxiliary file is referenced more than once, its identifier shall be passed only the first time. The resulting string shall be used as the digital identifier for the source content resource.

NOTE 2

This sequence produces a composite identifier representing the visual appearance of the pages in the page set.

NOTE 3

Two HTML source files that are identical, but for which the referenced images contain different data—for example, if they have been generated by a script or are pointed to by relative URLs—do not produce the same identifier.

When the source data is a PDF file, the identifier shall be generated solely from the contents of that file; there shall be no auxiliary data.

A page set may also have a text identifier, calculated by applying the MD5 algorithm to just the text present in the source data.

EXAMPLE 1

For an HTML file, the text identifier is based solely on the text between markup tags; no images are used in the calculation.

For an image set, the digital identifier shall be calculated by passing the source data for the original image to the MD5 algorithm.

EXAMPLE 2

The identifier for an image set created from a GIF image is calculated from the contents of the GIF.
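The identifier rules in this subclause can be sketched with Python's `hashlib`. The function names are hypothetical, and one point is an assumption: the text above does not fix the byte representation of the auxiliary identifiers, so this sketch feeds each one as its raw 16-byte MD5 digest.

```python
import hashlib


def image_set_id(image_bytes):
    """Image set: MD5 over the source data of the original image."""
    return hashlib.md5(image_bytes).digest()


def page_set_id(source_bytes, aux_ids_in_reference_order):
    """Page set: MD5 over the source data followed by the identifiers of
    auxiliary files, each passed only at its first reference.

    Assumption: each auxiliary identifier is supplied as raw digest bytes.
    """
    md5 = hashlib.md5()
    md5.update(source_bytes)
    seen = set()
    for aux_id in aux_ids_in_reference_order:
        if aux_id in seen:
            continue  # later references to the same file are skipped
        seen.add(aux_id)
        md5.update(aux_id)
    return md5.digest()


# Hypothetical page that references logo, photo, then logo again.
logo = image_set_id(b"logo image bytes")
photo = image_set_id(b"photo image bytes")
html = b"<html><img src='logo.gif'><img src='photo.gif'><img src='logo.gif'></html>"
identifier = page_set_id(html, [logo, photo, logo])
```

Because the order of first reference is part of the hashed input, the same HTML with the two images swapped produces a different identifier.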

14.10.3.4 Unique Name Generation

In generating PDF pages from a data source, items such as hypertext links and HTML form fields are converted into corresponding named destinations and interactive form fields. These items shall be given names that do not conflict with those of other such items in the file.

NOTE

As used here, the term name refers to a string, not a name object.

Furthermore, when updating an existing file, a conforming processor shall ensure that each destination or field is given a unique name that shall be derived from its original name but constructed so that it avoids conflicts with similarly named items elsewhere.

The unique name shall be formed by appending an encoded form of the page set’s digital identifier string to the original name of the destination or field. The identifier string shall be encoded to remove characters that have special meaning in destinations and fields. The characters listed in the first column of Table 351 have special meaning and shall be encoded using the corresponding escape sequences listed in Table 351.

Table 351 – Characters with special meaning in destinations and fields and their byte values
Character Byte value Escape sequence
(nul) 0x00 \0 (0x5c 0x30)
. (PERIOD) 0x2e \p (0x5c 0x70)
\ (backslash) 0x5c \\ (0x5c 0x5c)

EXAMPLE

Since the PERIOD character (2Eh) is used as the field separator in interactive form field names, it does not appear in the identifier portion of the unique name.

If the name is used for an interactive form field, there is an additional encoding to ensure uniqueness and compatibility with interactive forms. Each byte in the source string, encoded as described previously, shall be replaced by two bytes in the destination string. The first byte in each pair is 65 (corresponding to the ASCII character A) plus the high-order 4 bits of the source byte; the second byte is 65 plus the low-order 4 bits of the source byte.
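The two encoding stages can be sketched as follows. The function names are hypothetical; the escape sequences are those of Table 351, and the second stage maps each byte to two bytes in the range A to P as described above.

```python
def escape_identifier(ident):
    """Stage 1: escape nul, PERIOD and backslash (Table 351)."""
    out = bytearray()
    for byte in ident:
        if byte == 0x00:
            out += b"\\0"    # 0x5c 0x30
        elif byte == 0x2E:
            out += b"\\p"    # 0x5c 0x70
        elif byte == 0x5C:
            out += b"\\\\"   # 0x5c 0x5c
        else:
            out.append(byte)
    return bytes(out)


def field_encode(escaped):
    """Stage 2 (form fields only): each byte becomes two bytes,
    65 + high-order 4 bits, then 65 + low-order 4 bits."""
    out = bytearray()
    for byte in escaped:
        out.append(65 + (byte >> 4))
        out.append(65 + (byte & 0x0F))
    return bytes(out)


def unique_name(original, ident, for_form_field=False):
    """Append the encoded identifier to the original name."""
    suffix = escape_identifier(ident)
    if for_form_field:
        suffix = field_encode(suffix)
    return original + suffix
```

For instance, the single identifier byte 2Eh is escaped to the two bytes 5Ch 70h, which the form-field stage then turns into the four ASCII characters `FMHA`.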

14.10.4 Content Sets

14.10.4.1 General

A Web Capture content set is a dictionary describing a set of PDF objects generated from the same source data. It may include information common to all the objects in the set as well as about the set itself. Table 352 defines the contents of this type of dictionary.

14.10.4.2 Page Sets

A page set is a content set containing a group of PDF page objects generated from a common source, such as an HTML file. The pages shall be listed in the O array of the page set dictionary (see Table 352) in the same order in which they were initially added to the file. A single page object shall not belong to more than one page set. Table 353 defines the content set dictionary entries specific to Page Sets.

The TID (text identifier) entry may be used to store an identifier generated from the text of the pages belonging to the page set (see 14.10.3.3, “Digital Identifiers”). A text identifier may not be appropriate for some page sets (such as those with no text) and may be omitted in these cases.

EXAMPLE

This identifier may be used to determine whether the text of a document has changed.

Table 352 – Entries common to all Web Capture content sets
Key Type Value
Type name (Optional) The type of PDF object that this dictionary describes; if present, shall be SpiderContentSet for a Web Capture content set.
S name (Required) The subtype of content set that this dictionary describes. The value shall be one of:

SPS   (“Spider page set”) A page set

SIS   (“Spider image set”) An image set

ID byte string (Required) The digital identifier of the content set (see 14.10.3.3, “Digital Identifiers”).
O array (Required) An array of indirect references to the objects belonging to the content set. The order of objects in the array is restricted when the content set subtype (S entry) is SPS (see 14.10.4.2, “Page Sets”).
SI dictionary or array (Required) A source information dictionary (see 14.10.5, “Source Information”) or an array of such dictionaries, describing the sources from which the objects belonging to the content set were created.
CT ASCII string (Optional) The content type, an ASCII string characterizing the source from which the objects belonging to the content set were created. The string shall conform to the content type specification described in Internet RFC 2045, Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies (see the Bibliography).

EXAMPLE   For a page set consisting of a group of PDF pages created from an HTML file, the content type would be text/html.
TS date (Optional) A time stamp giving the date and time at which the content set was created.

Table 353 – Additional entries specific to a Web Capture page set
Key Type Value
S name (Required) The subtype of content set that this dictionary describes; shall be SPS.
T text string (Optional) The title of the page set, a human-readable text string.
TID byte string (Optional) A text identifier generated from the text of the page set, as described in 14.10.3.3, “Digital Identifiers.”

14.10.4.3 Image Sets

An image set is a content set containing a group of image XObjects generated from a common source, such as multiple frames of an animated GIF image. A single XObject shall not belong to more than one image set. Table 354 shows the content set dictionary entries specific to Image Sets.

Table 354 – Additional entries specific to a Web Capture image set
Key Type Value
S name (Required) The subtype of content set that this dictionary describes; shall be SIS.
R integer or array (Required) The reference counts for the image XObjects belonging to the image set. For an image set containing a single XObject, the value shall be the integer reference count for that XObject. For an image set containing multiple XObjects, the value shall be an array of reference counts parallel to the O array (see Table 352); that is, each element in the R array shall hold the reference count for the image XObject at the corresponding position in the O array.

Each image XObject in an image set has a reference count indicating the number of PDF pages referring to that XObject. The reference count shall be incremented whenever Web Capture creates a new page referring to the XObject (including copies of already existing pages) and decremented whenever such a page is destroyed. The reference count shall be incremented or decremented only once per page, regardless of the number of times the XObject may be referenced by that page. If the reference count reaches 0, it shall be assumed that there are no remaining pages referring to the XObject and that the XObject can be removed from the image set’s O array. When removing an XObject from the O array of an image set, the corresponding entry in the R array shall be removed also.
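The reference-counting rules can be sketched as below. The `ImageSet` class and its method names are hypothetical, XObject references are modelled as plain strings, and the sketch chooses to remove an XObject as soon as its count reaches 0, which the text permits but does not require.

```python
from dataclasses import dataclass, field


@dataclass
class ImageSet:
    O: list = field(default_factory=list)  # XObject references
    R: list = field(default_factory=list)  # counts, parallel to O

    def page_created(self, xobjects_on_page):
        # set() ensures one increment per page, however often the page
        # refers to the same XObject.
        for xobj in set(xobjects_on_page):
            if xobj in self.O:
                self.R[self.O.index(xobj)] += 1

    def page_destroyed(self, xobjects_on_page):
        for xobj in set(xobjects_on_page):
            if xobj in self.O:
                i = self.O.index(xobj)
                self.R[i] -= 1
                if self.R[i] == 0:
                    # Remove the XObject and its parallel count together.
                    del self.O[i]
                    del self.R[i]

    def r_entry(self):
        """Value for the R entry: an integer for a single XObject,
        otherwise an array parallel to O."""
        return self.R[0] if len(self.R) == 1 else list(self.R)


frames = ImageSet(O=["frame1", "frame2"], R=[0, 0])
frames.page_created(["frame1", "frame1", "frame2"])  # counts become [1, 1]
frames.page_created(["frame1"])                      # counts become [2, 1]
frames.page_destroyed(["frame2"])                    # frame2 drops to 0, removed
```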

14.10.5 Source Information

14.10.5.1 General

The SI entry in a content set dictionary (see Table 352) shall contain one or more source information dictionaries, each containing information about the locations from which the source data for the content set was retrieved.

Table 355 – Entries in a source information dictionary
Key Type Value
AU ASCII string or dictionary (Required) An ASCII string or URL alias dictionary (see 14.10.5.2, “URL Alias Dictionaries”) which shall identify the URLs from which the source data was retrieved.
TS date (Optional) A time stamp which, if present, shall contain the most recent date and time at which the content set’s contents were known to be up to date with the source data.
E date (Optional) An expiration stamp which, if present, shall contain the date and time at which the content set’s contents shall be considered out of date with the source data.
S integer (Optional) A code which, if present, shall indicate the type of form submission, if any, by which the source data was accessed (see 12.7.5.2, “Submit-Form Action”). If present, the value of the S entry shall be 0, 1, or 2, in accordance with the following meanings:

0  Not accessed by means of a form submission
1  Accessed by means of an HTTP GET request
2  Accessed by means of an HTTP POST request

This entry may be present only in source information dictionaries associated with page sets. Default value: 0.
C dictionary (Optional; if present, shall be an indirect reference) A command dictionary (see 14.10.5.3, “Command Dictionaries”) describing the command that caused the source data to be retrieved. This entry may be present only in source information dictionaries associated with page sets.

A content set's SI entry may contain a single source information dictionary. However, a PDF processor may attempt to detect situations in which the same source data has been located via two or more distinct URLs. If a processor detects such a situation, it may generate a single content set from the source data, containing a single copy of the relevant PDF pages or image XObjects. In this case, the SI entry shall be an array containing one source information dictionary for each distinct URL from which the original source content was found.

The determination that distinct URLs produce the same source data shall be made by comparing digital identifiers for the source data.

A source information dictionary’s AU (aliased URLs) entry shall identify the URLs from which the source data was retrieved. If there is only one such URL, the value of this entry may be a string. If multiple URLs map to the same location through redirection, the AU value shall be a URL alias dictionary (see 14.10.5.2, “URL Alias Dictionaries”).

NOTE 1

For file size efficiency, the entire URL alias dictionary (excluding the URL strings) should be represented as a direct object because its internal structure should never be shared or externally referenced.

The TS (time stamp) entry allows each source location associated with a content set to have its own time stamp.

NOTE 2

This is necessary because the time stamp in the content set dictionary (see Table 352) merely refers to the creation date of the content set. A hypothetical “Update Content Set” command might reset the time stamp in the source information dictionary to the current time if it found that the source data had not changed since the time stamp was last set.

The E (expiration) entry specifies an expiration date for each source location associated with a content set. If the current date and time are later than those specified, the contents of the content set shall be considered out of date with respect to the original source.
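The expiration test can be sketched as follows, assuming the E entry has already been parsed from a PDF date string into a timezone-aware `datetime`. The function name `is_out_of_date` and the dictionary model are hypothetical.

```python
from datetime import datetime, timezone


def is_out_of_date(source_info, now=None):
    """True if the current time is later than the E (expiration) entry.

    Assumption: source_info["E"], when present, is a timezone-aware datetime.
    A missing E entry is treated here as "no expiration recorded".
    """
    expires = source_info.get("E")
    if expires is None:
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    return now > expires


info = {
    "AU": "http://www.example.com/index.html",
    "E": datetime(2020, 1, 1, tzinfo=timezone.utc),
}
check_time = datetime(2021, 6, 1, tzinfo=timezone.utc)
```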

14.10.5.2 URL Alias Dictionaries

When a URL is accessed via HTTP, a response header may be returned indicating that the requested data is at a different URL. This redirection process may be repeated in turn at the new URL and can potentially continue indefinitely. It is not uncommon to find multiple URLs that all lead eventually to the same destination through one or more redirections. A URL alias dictionary represents such a set of URL chains leading to a common destination. Table 356 shows the contents of this type of dictionary.

Table 356 – Entries in a URL alias dictionary
Key Type Value
U ASCII string (Required) The destination URL to which all of the chains specified by the C entry lead.
C array (Optional) An array of one or more arrays of strings, each representing a chain of URLs leading to the common destination specified by U.

The C (chains) entry may be omitted if the URL alias dictionary contains only one URL. If C is present, its value shall be an array of arrays, each representing a chain of URLs leading to the common destination. Within each chain, the URLs shall be stored as ASCII strings in the order in which they occur in the redirection sequence. The common destination (the last URL in a chain) may be omitted, since it is already identified by the U entry.
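As an illustration of the structure, the sketch below expands a URL alias dictionary into complete redirection chains, restoring the common destination where a chain omits it. The dictionary is modelled as a plain Python dict, and the function name and URLs are hypothetical.

```python
def full_chains(alias):
    """Return every chain in C with the destination U as its last element."""
    destination = alias["U"]
    chains = []
    for chain in alias.get("C", []):  # C may be omitted entirely
        chain = list(chain)
        if not chain or chain[-1] != destination:
            chain.append(destination)  # destination was omitted from the chain
        chains.append(chain)
    return chains


alias = {
    "U": "http://www.example.com/final.html",
    "C": [
        ["http://www.example.com/old.html", "http://www.example.com/moved.html"],
        ["http://example.org/link", "http://www.example.com/final.html"],
    ],
}
```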

14.10.5.3 Command Dictionaries

A Web Capture command dictionary represents a command executed by Web Capture to retrieve one or more pieces of source data that were used to create new pages or modify existing pages. The entries in this dictionary represent parameters that were originally specified interactively by the user who requested that the Web content be captured. This information is recorded so that the command can subsequently be repeated to update the captured content. Table 357 shows the contents of this type of dictionary.

Table 357 – Entries in a Web Capture command dictionary
Key Type Value
URL ASCII string (Required) The initial URL from which source data was requested.
L integer (Optional) The number of levels of pages retrieved from the initial URL. Default value: 1.
F integer (Optional) A set of flags specifying various characteristics of the command (see Table 358). Default value: 0.
P string or stream (Optional) Data that was posted to the URL.
CT ASCII string (Optional) A content type describing the data posted to the URL. Default value: application/x-www-form-urlencoded.
H string (Optional) Additional HTTP request headers sent to the URL.
S dictionary (Optional) A command settings dictionary containing settings used in the conversion process (see 14.10.5.4, “Command Settings”).

The URL entry shall contain the initial URL for the retrieval command. The L (levels) entry shall contain the number of levels of the hyperlinked URL hierarchy to follow from this URL, creating PDF pages from the retrieved material. If the L entry is omitted, its value shall be assumed to be 1, denoting retrieval of the initial URL only.

The value of the command dictionary’s F entry shall be an integer that shall be interpreted as an array of flags specifying various characteristics of the command. The flags shall be interpreted as defined in Table 358. Only those flags defined in Table 358 may be set to 1; all other flags shall be 0. Flags not defined in Table 358 are reserved for future use, and shall not be used by a conforming reader.

NOTE 3

The low-order bit of the flags value is referred to as being at bit-position 1.

Table 358 – Web Capture command flags
Bit position Name Meaning
1 SameSite If set, pages were retrieved only from the host specified in the initial URL.
2 SamePath If set, pages were retrieved only from the path specified in the initial URL.
3 Submit If set, the command represents a form submission.
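Decoding the F entry can be sketched with Python's `enum.IntFlag`. Bit position 1 is the low-order bit, so it corresponds to the value 1. The class and function names are hypothetical, and rejecting undefined bits with an exception is one possible policy, not something the text mandates.

```python
from enum import IntFlag


class CommandFlags(IntFlag):
    SameSite = 1 << 0  # bit position 1
    SamePath = 1 << 1  # bit position 2
    Submit = 1 << 2    # bit position 3


DEFINED_BITS = int(CommandFlags.SameSite | CommandFlags.SamePath | CommandFlags.Submit)


def decode_flags(value):
    """Interpret the integer F entry; bits outside Table 358 are reserved."""
    if value & ~DEFINED_BITS:
        raise ValueError("reserved flag bits are set: %d" % value)
    return CommandFlags(value)
```

With this model, an F value of 6 decodes to SamePath and Submit, while an F value of 8 is rejected because bit position 4 is not defined.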

The SamePath flag shall be set if the retrieval of source content was restricted to source content in the same path as specified in the initial URL. Source content shall be considered to be in the same path if its scheme and network location components (as defined in Internet RFC 1808, Relative Uniform Resource Locators) match those of the initial URL and its path component matches up to and including the last forward slash ( / ) character in the initial URL.

EXAMPLE 1

The URL

http://www.adobe.com/fiddle/faddle/foo.html

is considered to be in the same path as the initial URL

http://www.adobe.com/fiddle/initial.html

The comparison shall be case-insensitive for the scheme and network location components and case-sensitive for the path component.
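The same-path test can be sketched with `urllib.parse.urlsplit`. The function name `in_same_path` is hypothetical, and the sketch compares the network location as a whole string, which is an approximation (it does not, for example, treat an explicit default port as equal to an omitted one).

```python
from urllib.parse import urlsplit


def in_same_path(candidate, initial):
    cand, init = urlsplit(candidate), urlsplit(initial)
    # Scheme and network location: case-insensitive comparison.
    if cand.scheme.lower() != init.scheme.lower():
        return False
    if cand.netloc.lower() != init.netloc.lower():
        return False
    # Path: case-sensitive match up to and including the last "/" of the
    # initial URL's path.
    prefix = init.path[: init.path.rfind("/") + 1]
    return cand.path.startswith(prefix)


initial_url = "http://www.adobe.com/fiddle/initial.html"
```

Applied to EXAMPLE 1, the prefix taken from the initial URL is `/fiddle/`, which the path `/fiddle/faddle/foo.html` begins with.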

The Submit flag shall be set when the command represents a form submission. If no P (posted data) entry is present, the submitted data shall be encoded in the URL (an HTTP GET request). If P is present, the command shall be an HTTP POST request. In this case, the value of the Submit flag shall be ignored.

NOTE 4

If the posted data is small enough, it may be represented by a string. For large amounts of data, a stream should be used because it can be compressed.

The CT (content type) entry shall only be present for POST requests. It shall describe the content type of the posted data, as described in Internet RFC 2045, Multipurpose Internet Mail Extensions (MIME), Part One: Format of Internet Message Bodies (see the Bibliography).
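
The interaction between the P entry, the CT entry, and the Submit flag can be summarised in a small Python sketch. Here a command dictionary is modelled as a plain Python `dict`; the function name `classify_command`, the returned labels, and the treatment of a missing F entry as 0 are assumptions made for illustration only.

```python
SUBMIT = 1 << 2  # bit position 3 in Table 358


def classify_command(command: dict) -> str:
    """Classify a command dictionary by the kind of HTTP request it records."""
    has_posted_data = "P" in command
    if "CT" in command and not has_posted_data:
        # CT shall only be present for POST requests.
        raise ValueError("CT is only allowed together with posted data (P)")
    if has_posted_data:
        # P present: an HTTP POST request; the Submit flag is ignored.
        return "POST"
    if command.get("F", 0) & SUBMIT:
        # Form submission without P: the data is encoded in the URL.
        return "GET form submission"
    return "GET"
```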

The H (headers) entry, if present, shall specify additional HTTP request headers that were sent in the request for the URL. Each header line in the string shall be terminated with a CARRIAGE RETURN and a LINE FEED, as in this example:

EXAMPLE 2

(Referer: http://frumble.com\015\012From:veeble@frotz.com\015\012)

The HTTP request header format is specified in Internet RFC 2616, Hypertext Transfer Protocol—HTTP/1.1 (see the Bibliography).
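
As a rough illustration of the H entry's layout, the following Python sketch splits a header string into name and value pairs. It assumes the PDF string has already been decoded, so that the octal escapes `\015\012` in EXAMPLE 2 have become a CARRIAGE RETURN and LINE FEED pair; the function name `parse_header_lines` is hypothetical.

```python
def parse_header_lines(h: str) -> list:
    """Split the decoded value of an H entry into (name, value) pairs."""
    if h and not h.endswith("\r\n"):
        raise ValueError("every header line must be terminated by CR LF")
    headers = []
    # The final piece after the last CR LF is an empty string, so drop it.
    for line in h.split("\r\n")[:-1]:
        name, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"malformed header line: {line!r}")
        headers.append((name.strip(), value.strip()))
    return headers
```

Applied to the string in EXAMPLE 2, the sketch yields one pair for `Referer` and one for `From`.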

The S (settings) entry specifies a command settings dictionary (see 14.10.5.4, “Command Settings”), which holds settings specific to the conversion engines.

14.10.5.4 命令设置

14.10.5.4 Command Settings

S(设置)条目在命令字典中,如果存在,应包含一个 命令设置字典,该字典保存用于将命令结果转换为 PDF 的转换引擎设置。表 359 显示了此类字典的内容。如果此条目省略,则假定使用默认值。命令设置字典可以被任何使用相同设置的命令字典共享。

表 359 – Web Capture 命令设置字典中的条目
类型
G 字典 (可选) 包含与所有转换引擎相关的全局转换引擎设置。如果此条目缺失,则应使用默认设置。
C 字典 (可选) 特定转换引擎的设置。该字典中的每个键都是转换引擎的内部名称。关联的值是包含该转换引擎相关设置的字典。如果字典中找不到特定转换引擎的设置,则应使用默认设置。

C 字典中的每个键表示转换引擎的内部名称,它应为以下形式的名称对象:

/company:product:version:contentType

其中

  • company 表示创建转换引擎的公司的名称(或缩写)。
  • product 表示转换引擎的名称。此字段可以为空,但仍需要尾随的冒号字符(3Ah)。
  • version 表示转换引擎的版本。
  • contentType 表示与该设置关联的内容类型标识符,因为某些转换器可能处理多种内容类型。

示例

/ADBE:H2PDF:1.0:HTML

内部名称中的所有字段都是区分大小写的。company 字段应符合 附录 E 中描述的命名指南。其他字段的值没有限制,但不能包含冒号。

由命令设置字典根节点的 PDF 对象的有向图应完全自包含;即,它不应包含任何在 PDF 文件其他位置引用的对象。

注释

这有助于在没有明确了解其可能包含的设置的情况下,对命令设置字典进行深拷贝操作。

The S (settings) entry in a command dictionary, if present, shall contain a command settings dictionary, which holds settings for conversion engines that shall be used in converting the results of the command to PDF. Table 359 shows the contents of this type of dictionary. If this entry is omitted, default values are assumed. Command settings dictionaries may be shared by any command dictionaries that use the same settings.

Table 359 – Entries in a Web Capture command settings dictionary
Key Type Value
G dictionary (Optional) A dictionary containing global conversion engine settings relevant to all conversion engines. If this entry is absent, default settings shall be used.
C dictionary (Optional) Settings for specific conversion engines. Each key in this dictionary is the internal name of a conversion engine. The associated value is a dictionary containing the settings associated with that conversion engine. If the settings for a particular conversion engine are not found in the dictionary, default settings shall be used.

Each key in the C dictionary represents the internal name of a conversion engine, which shall be a name object of the following form:

/company:product:version:contentType

where

company denotes the name (or abbreviation) of the company that created the conversion engine.

product denotes the name of the conversion engine. This field may be left blank, but the trailing COLON character (3Ah) is still required.

version denotes the version of the conversion engine.

contentType denotes an identifier for the content type with which the associated settings shall be used. It is needed because some converters may handle multiple content types.

EXAMPLE

/ADBE:H2PDF:1.0:HTML

All fields in the internal name are case-sensitive. The company field shall conform to the naming guidelines described in Annex E. The values of the other fields shall be unrestricted, except that they shall not contain a COLON.
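
The structure of an internal name can be illustrated with a minimal Python sketch that splits a name into its four fields. The names `EngineName` and `parse_engine_name` are hypothetical, and the sketch checks only the field count; it does not attempt to verify the Annex E naming guidelines for the company field.

```python
from typing import NamedTuple


class EngineName(NamedTuple):
    company: str
    product: str
    version: str
    content_type: str


def parse_engine_name(name: str) -> EngineName:
    """Split /company:product:version:contentType into its four fields."""
    body = name[1:] if name.startswith("/") else name
    # No field may contain a COLON, so a valid name splits into exactly four
    # pieces. An empty product still leaves its trailing COLON in place.
    fields = body.split(":")
    if len(fields) != 4:
        raise ValueError(f"expected four colon-separated fields in {name!r}")
    return EngineName(*fields)  # fields are kept as-is: they are case-sensitive
```

A name with a blank product field, such as `/ADBE::1.0:HTML`, still parses into four fields because the COLON after the empty product is retained.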

The directed graph of PDF objects rooted at the command settings dictionary shall be entirely self-contained; that is, it shall not contain any object referred to from elsewhere in the PDF file.

NOTE

This facilitates the operation of making a deep copy of a command settings dictionary without explicit knowledge of the settings it may contain.

14.10.6 与网页捕获相关的对象属性

14.10.6 Object Attributes Related to Web Capture

给定的页面对象或图像 XObject 最多可以属于一个 Web Capture 内容集,该内容集被称为其父内容集。然而,该对象不应直接指向其父内容集。因为这样的指针可能会给应用程序带来问题,尤其是在应用程序跟踪对象的所有指针,以确定对象所依赖的资源时。相反,该对象的 ID 条目(见 表 30表 89)包含父内容集的数字标识符,应该通过文档名称字典中的 IDS 名称树来定位父内容集。(如果 IDS 条目包含一个内容集数组,则可以通过搜索数组来找到包含该子对象的父内容集,其 O 条目应包括该子对象。)

在从 HTML 文件创建 PDF 页面时,Web Capture 通常会缩放内容以适应固定大小的页面。页面对象中的 PZ(首选缩放)条目(见 7.7.3.3,"页面对象")指定了一个缩放因子,通过该因子可以将页面的缩放恢复到原始大小。换句话说,当页面以首选的放大倍数查看时,默认用户空间中的一个单位对应一个原始源像素。

A given page object or image XObject may belong to at most one Web Capture content set, called its parent content set. However, the object shall not have a direct pointer to its parent content set. Such a pointer may present problems for an application that traces all pointers from an object to determine what resources the object depends on. Instead, the object’s ID entry (see Table 30 and Table 89) contains the digital identifier of the parent content set, which shall be used to locate the parent content set via the IDS name tree in the document’s name dictionary. (If the IDS entry for the identifier contains an array of content sets, the parent may be found by searching the array for the content set whose O entry includes the child object.)
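
The lookup of a parent content set can be sketched in Python under some simplifying assumptions: the IDS name tree is modelled as a plain `dict` keyed by digital identifier, PDF dictionaries are modelled as Python `dict` objects, and object identity (`is`) stands in for comparing indirect object references. The function name `find_parent_content_set` is hypothetical.

```python
def find_parent_content_set(child: dict, ids_tree: dict):
    """Locate the parent content set of a page object or image XObject."""
    entry = ids_tree.get(child.get("ID"))
    if entry is None:
        return None  # no content set is registered under this identifier
    if not isinstance(entry, list):
        return entry  # a single content set: it is the parent
    # An array of content sets: the parent is the one whose O entry
    # includes the child object itself.
    for content_set in entry:
        if any(obj is child for obj in content_set.get("O", [])):
            return content_set
    return None
```

Comparing by identity rather than by value matters here, because two distinct objects may carry the same digital identifier.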

In the course of creating PDF pages from HTML files, Web Capture frequently scales the contents down to fit on fixed-sized pages. The PZ (preferred zoom) entry in a page object (see 7.7.3.3, “Page Objects”) specifies a magnification factor by which the page may be scaled to undo the downscaling and view the page at its original size. That is, when the page is viewed at the preferred magnification factor, one unit in default user space corresponds to one original source pixel.