附录 G(资料性)线性化 PDF 访问策略¶
Annex G (informative) Linearized PDF Access Strategies
G.1 概述¶
G.1 General
本节概述了符合规范的阅读器如何利用线性化 PDF 文件的结构来高效地检索和显示该文件。此材料可能有助于解释该组织的基本原理。
This section outlines how the conforming reader can take advantage of the structure of a Linearized PDF file to retrieve and display it efficiently. This material may help explain the rationale for the organization.
G.2 打开第一页¶
G.2 Opening at the First Page
如前所述,首次访问文档时,会发出请求以从头开始检索整个文件。因此,线性化 PDF 的组织方式是,显示第一页所需的所有数据都位于文件的开头。这包括从第一页引用的所有资源,无论它们是否也从其他页面引用。
第一页通常但不一定是第 0 页。如果文档目录包含一个 OpenAction 条目,该条目指定从第 0 页以外的某个页面打开,则该页面是物理上位于文档开头的页面。因此,在默认位置(而不是特定目标)打开文档只需等待第一页数据到达即可;无需其他事务。
在普通的符合要求的阅读器中,打开文档需要先定位到末尾以获取
startxref 行。由于线性化 PDF 文件的开头有第一页的交叉引用表,因此无需读取 startxref 行。只需验证文件开头的线性化参数字典中给出的文件长度是否与文件的实际长度相匹配,这表明 PDF 文件未附加任何更新。
主提示流位于第一页部分之前或之后,这意味着它也是作为文件初始顺序读取的一部分进行检索的。符合要求的阅读器应解释并保留提示表中的所有信息。这些表相当紧凑,并非设计为从文件中随机获取。
符合要求的阅读器现在必须决定是继续按顺序读取文档的其余部分,还是中止初始事务并使用请求字节范围的单独事务访问后续页面。此决定取决于文件的大小、通道的数据速率和事务的开销成本。
As described earlier, when a document is initially accessed, a request is issued to retrieve the entire file, starting at the beginning. Consequently, Linearized PDF is organized so that all the data required to display the first page is at the beginning of the file. This includes all resources that are referenced from the first page, regardless of whether they are also referenced from other pages.
The first page is usually but not necessarily page 0. If the document catalogue contains an OpenAction entry that specifies opening at some page other than page 0, that page is the one physically located at the beginning of the document. Thus, opening a document at the default place (rather than a specific destination) requires simply waiting for the first-page data to arrive; no additional transactions are required.
In an ordinary conforming reader, opening a document requires first positioning to the end to obtain the startxref line. Since a Linearized PDF file has the first page’s cross-reference table at the beginning, reading the startxref line is not necessary. All that is required is to verify that the file length given in the linearization parameter dictionary at the beginning of the file matches the actual length of the file, indicating that no updates have been appended to the PDF file.
The primary hint stream is located either before or after the first-page section, which means that it is also retrieved as part of the initial sequential read of the file. The conforming reader is expected to interpret and retain all the information in the hint tables. The tables are reasonably compact and are not designed to be obtained from the file in random pieces.
The conforming reader must now decide whether to continue reading the remainder of the document sequentially or to abort the initial transaction and access subsequent pages by using separate transactions requesting byte ranges. This decision is a function of the size of the file, the data rate of the channel, and the overhead cost of a transaction.
G.3 打开任意页面¶
G.3 Opening at an Arbitrary Page
可以请求符合要求的阅读器打开 PDF 文件的任意页面。页面可以通过以下三种方式之一指定:
- 按页码(远程转到操作、整数页面说明符)
- 按命名目标(远程转到操作、名称或字符串页面说明符)
- 按文章线程(线程操作)
此外,索引搜索会按页码打开文档。有效处理这种情况尤为重要。
如上所述,当文档首次打开时,会从开头开始按顺序检索。一旦收到提示表,符合要求的阅读器就有足够的信息来请求检索给定页码的文档的任何页面。因此,符合要求的阅读器可以中止初始事务并为目标页面发出新事务,如 G.4“转到打开文档的另一页”中所述。
主提示流(F.3.1“常规”中的第 5 部分)相对于首页部分(F.3.1“常规”中的第 6 部分)的位置决定了完成此操作的速度。如果主提示流位于首页部分之前,则可以非常快速地中止初始事务;但是,这会增加打开第一页文档时的延迟。另一方面,如果主提示流位于首页部分之后,则显示第一页会更快(因为不需要提示表),但在任意页面上打开会延迟接收第一页所需的时间。在线性化 PDF 文件时,必须决定是倾向于在第一页打开还是在任意页面上打开。
如果存在溢出提示流,则获取它需要发出额外的事务。因此,尽管允许在线性化 PDF 中包含溢出提示流,但不建议这样做。该功能允许线性化器在写入 PDF 文件时为估计大小的主提示流保留空间,然后返回并填充提示表。如果估计值太小,线性化器可以附加包含剩余提示表数据的溢出流。因此,PDF 文件可以一次性写入,如果认为写入 PDF 的性能很重要,这可能是一个优势。
在命名目标处打开需要符合要求的阅读器首先读取整个 Dests 或 Names 字典,其中存在提示。使用此信息,可以确定包含名称标识的特定目标的页面。
打开文章需要符合要求的阅读器首先读取整个 Threads 数组,该数组与文档开头的文档目录一起位于一起。使用此信息,可以确定包含任何线程的第一个珠子的页面。在线程的第一个珠子以外的位置打开需要链接所有珠子,直到到达所需的珠子;没有提示可以加速此过程。
The conforming reader may be requested to open a PDF file at an arbitrary page. The page can be specified in one of three ways:
- By page number (remote go-to action, integer page specifier)
- By named destination (remote go-to action, name or string page specifier)
- By article thread (thread action)
Additionally, an indexed search results in opening a document by page number. Handling this case efficiently is especially important.
As indicated above, when the document is initially opened, it is retrieved sequentially starting at the beginning. As soon as the hint tables have been received, the conforming reader has sufficient information to request retrieval of any page of the document given its page number. Therefore, the conforming reader can abort the initial transaction and issue a new transaction for the target page, as described in G.4, "Going to Another Page of an Open Document".
The position of the primary hint stream (part 5 in F.3.1, "General") with respect to the first-page section (part 6 in F.3.1, "General") determines how quickly this can be done. If the primary hint stream precedes the first-page section, the initial transaction can be aborted very quickly; however, this is at the cost of increased delay when opening the document at the first page. On the other hand, if the primary hint stream follows the first-page section, displaying the first page is quicker (since the hint tables are not needed for that), but opening at an arbitrary page is delayed by the time required to receive the first page. The decision whether to favour opening at the first page or opening at an arbitrary page must be made at the time a PDF file is linearized.
If an overflow hint stream exists, obtaining it requires issuing an additional transaction. For this reason, inclusion of an overflow hint stream in Linearized PDF, although permitted, is not recommended. The feature exists to allow the linearizer to write the PDF file with space reserved for a primary hint stream of an estimated size and then go back and fill in the hint tables. If the estimate is too small, the linearizer can append an overflow stream containing the remaining hint table data. Thus, the PDF file can be written in one pass, which may be an advantage if the performance of writing PDF is considered important.
Opening at a named destination requires the conforming reader first to read the entire Dests or Names dictionary, for which a hint is present. Using this information, it is possible to determine the page containing the specific destination identified by the name.
Opening to an article requires the conforming reader first to read the entire Threads array, which is located with the document catalogue at the beginning of the document. Using this information, it is possible to determine the page containing the first bead of any thread. Opening at other than the first bead of a thread requires chaining through all the beads until the desired one is reached; there are no hints to accelerate this.
G.4 访问打开文档的其他页面¶
G.4 Going to Another Page of an Open Document
给定页码和提示表中的信息,符合规范的阅读器现在可以直接构造单个请求来检索文档的任意页面。请求应包括以下项目:
- 页面本身的对象,其字节范围可以从页面偏移提示表中的条目确定。
- 主交叉引用表中引用这些对象的部分。这可以从主交叉引用表位置(线性化参数字典中的 T 条目)和页面偏移提示表中的累积对象号计算得出。
- 从页面引用的共享对象,其字节范围可以从共享对象提示表中的信息确定。
- 主交叉引用表中引用这些对象的部分或多个部分,如上所述。
页面偏移提示表中分数的目的是使符合规范的阅读器能够以允许在数据到达时增量显示的方式安排页面的检索。它通过构造一个请求来实现这一点,该请求将页面内容的片段与内容引用的共享资源交错。这与对第一页进行的物理交错的目的大致相同。
Given a page number and the information in the hint tables, it is now straightforward for the conforming reader to construct a single request to retrieve any arbitrary page of the document. The request should include the following items:
- The objects of the page itself, whose byte range can be determined from the entry in the page offset hint table.
- The portion of the main cross-reference table referring to those objects. This can be computed from main cross-reference table location (the T entry in the linearization parameter dictionary) and the cumulative object number in the page offset hint table.
- The shared objects referenced from the page, whose byte ranges can be determined from information in the shared object hint table.
- The portion or portions of the main cross-reference table referring to those objects, as described above.
The purpose of the fractions in the page offset hint table is to enable the conforming reader to schedule retrieval of the page in a way that allows incremental display of the data as it arrives. It accomplishes this by constructing a request that interleaves pieces of the page contents with the shared resources that the contents refer to. This serves much the same purpose as the physical interleaving that is done for the first page.
G.5 逐步绘制页面¶
G.5 Drawing a Page Incrementally
页面中对象的排序和提示表的组织旨在允许逐步更新显示,并在数据到达缓慢时尽早为用户提供交互机会。符合要求的阅读器必须识别间接对象引用的目标尚未到达的情况,并在可能的情况下重新排列其对页面中对象的作用顺序。
建议采取以下操作顺序:
a)激活注释,但不要绘制它们。还要激活页面中任何文章线程的光标反馈。 b)开始绘制内容。每当有对尚未到达的图像 XObject 的引用时,跳过它。每当有对字体的引用(其定义是尚未到达的嵌入字体文件)时,使用替代字体绘制文本(如果可能)。 c)绘制注释。 d)在图像到达时绘制它们,以及与它们重叠的任何内容。 e)一旦嵌入字体定义到达,使用正确的字体重新绘制文本,以及与文本重叠的任何内容。
如果可能,最后两个步骤应使用屏幕外缓冲区完成,以避免在重绘过程中出现令人反感的闪烁。
遇到引用 XObject(请参阅 8.10.4,“引用 XObject”)时,符合要求的阅读器可以选择最初将该对象显示为代理,并推迟对导入内容的检索和渲染。请注意,由于线性化 PDF 文件中的所有 XObject 都遵循它们出现的页面的内容流,因此它们的检索已被推迟;使用引用 XObject 会导致额外的延迟级别。
The ordering of objects in pages and the organization of the hint tables are intended to allow progressive update of the display and early opportunities for user interaction when the data is arriving slowly. The conforming reader must recognize instances in which the targets of indirect object references have not yet arrived and, where possible, rearrange the order in which it acts on the objects in the page.
The following sequence of actions is recommended:
a)Activate the annotations, but do not draw them yet. Also activate the cursor feedback for any article threads in the page. b)Begin drawing the contents. Whenever there is a reference to an image XObject that has not yet arrived, skip over it. Whenever there is a reference to a font whose definition is an embedded font file that has not yet arrived, draw the text using a substitute font (if that is possible). c)Draw the annotations. d)Draw the images as they arrive, together with anything that overlaps them. e)Once the embedded font definitions have arrived, redraw the text using the correct fonts, together with anything that overlaps the text.
The last two steps should be done using an off-screen buffer, if possible, to avoid objectionable flashing during the redraw process.
On encountering a reference XObject (see 8.10.4, "Reference XObjects"), the conforming reader may choose to initially display the object as a proxy and defer the retrieval and rendering of the imported content. Note that, since all XObjects in a Linearized PDF file follow the content stream of the page on which they appear, their retrieval is already deferred; the use of a reference XObject results in an additional level of deferral.
G.6 跟随文章线索¶
G.6 Following an Article Thread
如前所述,访问给定页面的任何文章线索的珠子词典都与该页面一起定位。这样就可以激活珠子矩形并显示正确的光标反馈。
如果用户关注某个线索,则符合要求的阅读器可以从珠子词典的 N 或 P 条目中获取对象编号。这标识了一个目标珠子,它与它所属的页面一起定位。给定此对象编号,符合要求的阅读器可以转到该页面,如 G.4“转到打开文档的另一页”中所述。
As indicated earlier, the bead dictionaries for any article thread that visits a given page are located with that page. This enables the bead rectangles to be activated and proper cursor feedback to be shown.
If the user follows a thread, the conforming reader can obtain the object number from the N or P entry of the bead dictionary. This identifies a target bead, which is located with the page to which it belongs. Given this object number, the conforming reader can go to that page, as discussed in G.4, "Going to Another Page of an Open Document."
G.7 访问更新后的文件¶
G.7 Accessing an Updated File
如前所述,如果线性化 PDF 文件随后附加了增量更新,则线性化和提示不再有效。实际上,这不一定是真的,但符合要求的阅读器必须做一些额外的工作来验证信息。
当符合要求的阅读器看到文件长度超过线性化参数字典中给出的长度时,它必须发出额外的事务来读取附加的所有内容。然后,它必须分析该更新中的对象,以查看其中是否有任何对象修改了第一页中的对象或提示的目标。如果是这样,它必须根据需要扩充其内部数据结构以考虑更新。
对于仅收到少量更新的 PDF 文件,这种方法可能是值得的。以这种方式访问文件比在没有提示的情况下访问文件或在显示任何内容之前检索整个文件更快。
As stated earlier, if a Linearized PDF file subsequently has an incremental update appended to it, the linearization and hints are no longer valid. Actually, this is not necessarily true, but the conforming reader must do some additional work to validate the information.
When the conforming reader sees that the file is longer than the length given in the linearization parameter dictionary, it must issue an additional transaction to read everything that was appended. It must then analyse the objects in that update to see whether any of them modify objects that are in the first page or that are the targets of hints. If so, it must augment its internal data structures as necessary to take the updates into account.
For a PDF file that has received only a small update, this approach may be worthwhile. Accessing the file this way is quicker than accessing it without hints or retrieving the entire file before displaying any of it.