7.2 词汇惯例¶
Lexical Conventions
7.2.1 概述¶
General
PDF文件在最基础的层面上是一系列字节的序列。根据本子条款中描述的语法规则,这些字节可以被组合成标记。一个或多个标记被组装成更高层次的语法实体,主要是对象,这些对象是构成PDF文档的基本数据值。
一个未加密的PDF可以完全使用对应于ANSI X3.4-1986定义的字符集的可见打印子集的字节值,加上空白字符来表示。然而,PDF文件并不局限于ASCII字符集;它可能包含任意字节,但需遵守以下考虑:
- 定界对象并描述PDF文件结构的标记应使用ASCII字符集。此外,所有保留字和用作PDF标准字典中的键的名称,以及某些类型的数组应使用ASCII字符集定义。
- 字符串和流对象的数据值可以完全使用ASCII字符集书写,或者完全用二进制数据书写。在实际操作中,自然为二进制的数据,如采样图像,通常以二进制形式表示以实现紧凑和高效。
- 包含二进制数据的PDF文件应作为二进制文件传输,而不是作为文本文件,以确保文件中的所有字节都被完整保留。
NOTE1
二进制文件不便于移植到那些强加保留字符代码、最大行长度、行结束约定或其他限制的环境中。
NOTE2
在本条款中,字符一词的使用完全独立于当它作为特定上下文中的数据时可能具有的任何逻辑含义,如表示可读的文本或从字体中选择字形。
At the most fundamental level, a PDF file is a sequence of bytes. These bytes can be grouped into tokens according to the syntax rules described in this sub-clause. One or more tokens are assembled to form higher-level syntactic entities, principally objects, which are the basic data values from which a PDF document is constructed.
A non-encrypted PDF can be entirely represented using byte values corresponding to the visible printable subset of the character set defined in ANSI X3.4-1986, plus white space characters. However, a PDF file is not restricted to the ASCII character set; it may contain arbitrary bytes, subject to the following considerations:
- The tokens that delimit objects and that describe the structure of a PDF file shall use the ASCII character set. In addition all the reserved words and the names used as keys in PDF standard dictionaries and certain types of arrays shall be defined using the ASCII character set.
- The data values of strings and streams objects may be written either entirely using the ASCII character set or entirely in binary data. In actual practice, data that is naturally binary, such as sampled images, is usually represented in binary for compactness and efficiency.
- A PDF file containing binary data shall be transported as a binary file rather than as a text file to insure that all bytes of the file are faithfully preserved.
NOTE1
A binary file is not portable to environments that impose reserved character codes, maximum line lengths, end- of-line conventions, or other restrictions
NOTE2
In this clause, the usage of the term character is entirely independent of any logical meaning that the value may have when it is treated as data in specific contexts, such as representing human-readable text or selecting a glyph from a font.
7.2.2 字符集¶
Character Set
PDF字符集被分为三个类别,称为普通(regular)、定界符(delimiter)和空白字符(white-space)。这种分类决定了字符如何被分组为标记。本子条款中定义的规则适用于文件中的所有字符,除了在字符串、流和注释中。
表1中显示的空白(White-space)字符将诸如名称和数字这样的语法构造彼此分开。除了在注释、字符串和流中,所有空白字符都是等价的。在所有其他上下文中,PDF将连续的空白字符序列视为一个字符。
十进制 | 十六进制 | 八进制 | 名称 |
---|---|---|---|
0 | 00 | 000 | 空字符 (NUL) |
0 | 09 | 011 | 水平制表(HORIZONTAL TAB (HT)) |
10 | 0A | 012 | 换行符(LINE FEED (LF)) |
12 | 0C | 014 | 表单馈送(FORM FEED (FF)) |
13 | 0D | 015 | 回车符(CARRIAGE RETURN (CR)) |
32 | 20 | 040 | 空格(SPACE (SP)) |
回车符(0Dh)和换行符(0Ah)字符,也称为换行字符,应被视为行结束(EOL)标记。紧跟在回车符后面的换行符的组合应被视为一个EOL标记。EOL标记可能被视为与其他空白字符相同。然而,有时需要或建议使用EOL标记——即在必须出现在一行开始的标记之前。
注
本标准中的示例采用了一种将标记排成行的约定。然而,示例中使用空白进行缩进只是为了阐述清晰,并不需要在实际使用中包含。
定界符字符((
, )
, <
, >
, [
, ]
, {
, }
, /
, 和 %
)是特殊的(左括号(28h)、右括号(29h)、小于号(3Ch)、大于号(3Eh)、左方括号(5Bh)、右方括号(5Dh)、左大括号(7Bh)、右大括号(7Dh)、实线斜杠(2Fh)和百分号(25h))。它们定界了诸如数组、名称和注释这样的语法实体。这些字符中的任何一个都终止了它之前的实体,并且不包含在实体中。在遵循构成字符串的规则时,定界符字符是允许在字符串的作用域内的;见7.3.4.2,“文字字符串”。字符串的开头(左括号)确实定界了前面的实体,而字符串的结尾(右括号)定界了字符串的结束。
字形 | 十进制 | 十六进制 | 八进制 | 名称 |
---|---|---|---|---|
( | 40 | 28 | 50 | 左括号(LEFT PARENTHESIS) |
) | 41 | 29 | 51 | 右括号(RIGHT PARENTHESIS) |
< | 60 | 3C | 60 | 小于号(LESS-THAN SIGN) |
> | 62 | 3E | 62 | 大于号(GREATER-THAN SIGN) |
[ | 91 | 5B | 133 | 左方括号(LEFT SQUARE BRACKET) |
] | 93 | 5D | 135 | 右方括号(RIGHT SQUARE BRACKET) |
{ | 123 | 7B | 173 | 左大括号(LEFT CURLY BRACKET) |
} | 125 | 7D | 175 | 右大括号(RIGHT CURLY BRACKET) |
/ | 47 | 2F | 57 | 实线斜杠(SOLIDUS) |
% | 37 | 25 | 45 | 百分号(PERCENT SIGN) |
除了空白字符和定界符之外的所有字符被称为普通字符(regular characters)。这些字符包括在ASCII字符集之外的字节。连续的普通字符序列构成一个单独的标记。PDF是区分大小写的;相应的大写和小写字母将被视为不同的。
The PDF character set is divided into three classes, called regular, delimiter, and white-space characters. This classification determines the grouping of characters into tokens. The rules defined in this sub-clause apply to all characters in the file except within strings, streams, and comments.
The White-space characters shown in Table 1 separate syntactic constructs such as names and numbers from each other. All white-space characters are equivalent, except in comments, strings, and streams. In all other contexts, PDF treats any sequence of consecutive white-space characters as one character.
Decimal | Hexadecimal | Octal | Name |
---|---|---|---|
0 | 00 | 000 | Null (NUL) |
0 | 09 | 011 | HORIZONTAL TAB (HT) |
10 | 0A | 012 | LINE FEED (LF) |
12 | 0C | 014 | FORM FEED (FF) |
13 | 0D | 015 | CARRIAGE RETURN (CR) |
32 | 20 | 040 | SPACE (SP) |
The CARRIAGE RETURN (0Dh) and LINE FEED (0Ah) characters, also called newline characters, shall be treated as end-of-line (EOL) markers. The combination of a CARRIAGE RETURN followed immediately by a LINE FEED shall be treated as one EOL marker. EOL markers may be treated the same as any other white- space characters. However, sometimes an EOL marker is required or recommended—that is, preceding a token that must appear at the beginning of a line.
NOTE
The examples in this standard use a convention that arranges tokens into lines. However, the examples’ use of white space for indentation is purely for clarity of exposition and need not be included in practical use.
The delimiter characters (
, )
, <
, >
, [
, ]
, {
, }
, /
, and %
are special (LEFT PARENTHESIS (28h), RIGHT PARENTHESIS (29h), LESS-THAN SIGN (3Ch), GREATER-THAN SIGN (3Eh), LEFT SQUARE BRACKET (5Bh), RIGHT SQUARE BRACKET (5Dh), LEFT CURLY BRACE (7Bh), RIGHT CURLY BRACE (07Dh), SOLIDUS (2Fh) and PERCENT SIGN (25h), respectively). They delimit syntactic entities such as arrays, names, and comments. Any of these characters terminates the entity preceding it and is not included in the entity. Delimiter characters are allowed within the scope of a string when following the rules for composing strings; see 7.3.4.2, “Literal Strings”. The leading ( of a string does delimit a preceding entity and the closing ) of a string delimits the string’s end.
Glyph | Decimal | Hexadecimal | Octal | Name |
---|---|---|---|---|
( | 40 | 28 | 50 | LEFT PARENTHESIS |
) | 41 | 29 | 51 | RIGHT PARENTHESIS |
< | 60 | 3C | 60 | LESS-THAN SIGN |
> | 62 | 3E | 62 | GREATER-THAN SIGN |
[ | 91 | 5B | 133 | LEFT SQUARE BRACKET |
] | 93 | 5D | 135 | RIGHT SQUARE BRACKET |
{ | 123 | 7B | 173 | LEFT CURLY BRACKET |
} | 125 | 7D | 175 | RIGHT CURLY BRACKET |
/ | 47 | 2F | 57 | SOLIDUS |
% | 37 | 25 | 45 | PERCENT SIGN |
All characters except the white-space characters and delimiters are referred to as regular characters. These characters include bytes that are outside the ASCII character set. A sequence of consecutive regular characters comprises a single token. PDF is case-sensitive; corresponding uppercase and lowercase letters shall be considered distinct.
7.2.3 批注¶
Comments
百分号(25h)的任何出现,如果在字符串或流之外,都会引入一个注释。注释由百分号之后的所有字符组成,直到但不包括行的结束,包括普通字符、定界符、空格(20h)和水平制表符(09h)。符合规范的阅读器应该忽略注释,并将它们视为单个空白字符。也就是说,注释将其前面的标记与后面的标记分开。
Example
此示例中的PDF片段在语法上等同于仅标记abc和123。
abc% comment ( /% ) blah blah blah
123
(注:示例中的 "blah blah blah" 意为无意义的填充词,相当于中文里的“废话”。)
注释(不同于7.5中描述的%PDF–n.m和%%EOF注释的“文件结构”)没有语义。编辑PDF文件的应用程序不一定保留注释。
Any occurrence of the PERCENT SIGN (25h) outside a string or stream introduces a comment. The comment consists of all characters after the PERCENT SIGN and up to but not including the end of the line, including regular, delimiter, SPACE (20h), and HORZONTAL TAB characters (09h). A conforming reader shall ignore comments, and treat them as single white-space characters. That is, a comment separates the token preceding it from the one following it.\
Example
The PDF fragment in this example is syntactically equivalent to just the tokens abc and 123.
abc% comment ( /% ) blah blah blah
123
Comments (other than the %PDF–n.m and %%EOF comments described in 7.5, "File Structure") have no semantics. They are not necessarily preserved by applications that edit PDF files.