7.2 Lexical Conventions
7.2.1 General
At the most fundamental level, a PDF file is a sequence of bytes. These bytes can be grouped into tokens according to the syntax rules described in this sub-clause. One or more tokens are assembled to form higher-level syntactic entities, principally objects, which are the basic data values from which a PDF document is constructed.
A non-encrypted PDF can be entirely represented using byte values corresponding to the visible printable subset of the character set defined in ANSI X3.4-1986, plus white space characters. However, a PDF file is not restricted to the ASCII character set; it may contain arbitrary bytes, subject to the following considerations:
- The tokens that delimit objects and that describe the structure of a PDF file shall use the ASCII character set. In addition, all the reserved words and the names used as keys in PDF standard dictionaries and certain types of arrays shall be defined using the ASCII character set.
- The data values of string and stream objects may be written either entirely using the ASCII character set or entirely in binary data. In actual practice, data that is naturally binary, such as sampled images, is usually represented in binary for compactness and efficiency.
- A PDF file containing binary data shall be transported as a binary file rather than as a text file to ensure that all bytes of the file are faithfully preserved.
A binary file is not portable to environments that impose reserved character codes, maximum line lengths, end-of-line conventions, or other restrictions.
In this clause, the usage of the term character is entirely independent of any logical meaning that the value may have when it is treated as data in specific contexts, such as representing human-readable text or selecting a glyph from a font.
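As a rough illustration of the statement that a non-encrypted PDF can be written using only printable ASCII plus white space, the following Python sketch tests whether a byte string stays inside that subset. The function name `is_ascii_only` and the constant `WHITE_SPACE` are hypothetical names chosen for this example; the white-space byte values are the six listed in Table 1 of 7.2.2.

```python
# White-space byte values from Table 1: NUL, HT, LF, FF, CR, SP.
WHITE_SPACE = frozenset({0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20})


def is_ascii_only(data: bytes) -> bool:
    """Return True if every byte is printable ASCII (21h to 7Eh) or white space."""
    return all(0x21 <= b <= 0x7E or b in WHITE_SPACE for b in data)
```

A file for which this check returns False contains binary data and, per the considerations above, should be transported as a binary file.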
7.2.2 Character Set
The PDF character set is divided into three classes, called regular, delimiter, and white-space characters. This classification determines the grouping of characters into tokens. The rules defined in this sub-clause apply to all characters in the file except within strings, streams, and comments.
The white-space characters shown in Table 1 separate syntactic constructs such as names and numbers from each other. All white-space characters are equivalent, except in comments, strings, and streams. In all other contexts, PDF treats any sequence of consecutive white-space characters as one character.
Decimal | Hexadecimal | Octal | Name |
0 | 00 | 000 | Null (NUL) |
9 | 09 | 011 | HORIZONTAL TAB (HT) |
10 | 0A | 012 | LINE FEED (LF) |
12 | 0C | 014 | FORM FEED (FF) |
13 | 0D | 015 | CARRIAGE RETURN (CR) |
32 | 20 | 040 | SPACE (SP) |
The CARRIAGE RETURN (0Dh) and LINE FEED (0Ah) characters, also called newline characters, shall be treated as end-of-line (EOL) markers. The combination of a CARRIAGE RETURN followed immediately by a LINE FEED shall be treated as one EOL marker. EOL markers may be treated the same as any other white-space characters. However, sometimes an EOL marker is required or recommended, that is, when it precedes a token that must appear at the beginning of a line.
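The EOL rule can be sketched in Python as follows. This is a minimal illustration, not a conforming reader: the names `EOL` and `split_lines` are hypothetical, and the sketch assumes the input contains no strings or streams, where these bytes are not EOL markers.

```python
import re

# The CR LF pair is listed first so that it is matched as one EOL marker
# before a lone CR or a lone LF is considered.
EOL = re.compile(rb"\r\n|\r|\n")


def split_lines(data: bytes) -> list[bytes]:
    """Split data at every EOL marker (CR LF, CR, or LF)."""
    return EOL.split(data)
```

For instance, `split_lines(b"a\r\nb\rc\nd")` yields four lines, because the CR LF pair after `a` counts as a single marker.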
The examples in this standard use a convention that arranges tokens into lines. However, the examples’ use of white space for indentation is purely for clarity of exposition and need not be included in practical use.
The delimiter characters (, ), <, >, [, ], {, }, /, and % are special (LEFT PARENTHESIS (28h), RIGHT PARENTHESIS (29h), LESS-THAN SIGN (3Ch), GREATER-THAN SIGN (3Eh), LEFT SQUARE BRACKET (5Bh), RIGHT SQUARE BRACKET (5Dh), LEFT CURLY BRACE (7Bh), RIGHT CURLY BRACE (7Dh), SOLIDUS (2Fh) and PERCENT SIGN (25h), respectively). They delimit syntactic entities such as arrays, names, and comments. Any of these characters terminates the entity preceding it and is not included in the entity. Delimiter characters are allowed within the scope of a string when following the rules for composing strings; see “Literal Strings”. The leading ( of a string does delimit a preceding entity and the closing ) of a string delimits the string’s end.
Glyph | Decimal | Hexadecimal | Octal | Name |
( | 40 | 28 | 50 | LEFT PARENTHESIS |
) | 41 | 29 | 51 | RIGHT PARENTHESIS |
< | 60 | 3C | 74 | LESS-THAN SIGN |
> | 62 | 3E | 76 | GREATER-THAN SIGN |
[ | 91 | 5B | 133 | LEFT SQUARE BRACKET |
] | 93 | 5D | 135 | RIGHT SQUARE BRACKET |
{ | 123 | 7B | 173 | LEFT CURLY BRACKET |
} | 125 | 7D | 175 | RIGHT CURLY BRACKET |
/ | 47 | 2F | 57 | SOLIDUS |
% | 37 | 25 | 45 | PERCENT SIGN |
All characters except the white-space characters and delimiters are referred to as regular characters. These characters include bytes that are outside the ASCII character set. A sequence of consecutive regular characters comprises a single token. PDF is case-sensitive; corresponding uppercase and lowercase letters shall be considered distinct.
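The three character classes can be sketched in Python as below. The names `classify` and `regular_runs` are hypothetical, and the sketch deliberately ignores strings, streams, and comments, where these rules do not apply. It only collects maximal runs of regular characters; a full tokenizer would also emit the delimiters themselves.

```python
# Iterating over a bytes literal yields integers, so these are sets of byte values.
WHITE_SPACE = frozenset(b"\x00\t\n\x0c\r ")
DELIMITERS = frozenset(b"()<>[]{}/%")


def classify(byte: int) -> str:
    """Return the class of a single byte value."""
    if byte in WHITE_SPACE:
        return "white-space"
    if byte in DELIMITERS:
        return "delimiter"
    return "regular"  # includes bytes outside the ASCII range


def regular_runs(data: bytes) -> list[bytes]:
    """Collect each maximal run of consecutive regular characters."""
    runs: list[bytes] = []
    current = bytearray()
    for b in data:
        if classify(b) == "regular":
            current.append(b)
        elif current:
            # A white-space or delimiter byte ends the run and is not part of it.
            runs.append(bytes(current))
            current.clear()
    if current:
        runs.append(bytes(current))
    return runs
```

Because the comparison is on raw bytes, `regular_runs(b"true TRUE")` returns two distinct tokens, which reflects the case sensitivity described above.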
7.2.3 Comments
Any occurrence of the PERCENT SIGN (25h) outside a string or stream introduces a comment. The comment consists of all characters after the PERCENT SIGN and up to but not including the end of the line, including regular, delimiter, SPACE (20h), and HORIZONTAL TAB (09h) characters. A conforming reader shall ignore comments and treat them as single white-space characters. That is, a comment separates the token preceding it from the one following it.
The PDF fragment in this example is syntactically equivalent to just the tokens abc and 123.
abc% comment ( /% ) blah blah blah
123
Comments (other than the %PDF–n.m and %%EOF comments described in 7.5, "File Structure") have no semantics. They are not necessarily preserved by applications that edit PDF files.
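The comment rule can be sketched in Python as follows. The names `COMMENT` and `strip_comments` are hypothetical, and the sketch assumes the input contains no strings or streams, since a PERCENT SIGN inside those does not introduce a comment. It also discards the %PDF–n.m and %%EOF comments, which a real reader must interpret.

```python
import re

# A comment runs from % up to, but not including, the next CR or LF.
COMMENT = re.compile(rb"%[^\r\n]*")


def strip_comments(data: bytes) -> bytes:
    """Replace each comment with a single SPACE so neighbouring tokens stay separate."""
    return COMMENT.sub(b" ", data)
```

Applied to the fragment `abc% comment ( /% ) blah blah blah` followed by `123` on the next line, the result splits on white space into just the two tokens `abc` and `123`.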