跳转至

9.10 文本内容提取

9.10 Extraction of Text Content

9.10.1 概述

9.10.1 General

前面的子条款描述了所有在页面上显示文本和绘制字形的功能。除了显示文本,符合规范的读取器有时需要确定文本的信息内容——即,根据某种标准字符识别,它的含义,而不是其渲染外观。这种需求在进行搜索、索引和将文本导出到其他文件格式等操作时出现。

Unicode标准定义了一个为大量语言中常见字符编号的系统。它是表示文本信息内容的合适方案,但不是表示其外观的方案,因为Unicode值标识的是字符,而不是字形。有关Unicode的信息,请参见Unicode联盟的Unicode标准(见参考书目)。

在提取字符内容时,如果字体的字符是根据符合规范的读取器已知的标准字符集进行识别的,符合规范的读取器可以轻松地将文本转换为Unicode值。如果字体使用标准命名编码,或者字体中的字符通过标准字符名称或CID在一个著名集合中进行识别,那么字符识别就可以发生。9.10.2,“字符代码到Unicode值的映射”,详细描述了将字符代码映射到Unicode值的总体算法。

如果字体没有以这些方式定义,字形仍然可以显示,但字符不能在没有额外信息的情况下转换为Unicode值:

  • 这些信息可以通过字体字典中的一个可选ToUnicode条目提供(PDF 1.2;参见9.10.3,“ToUnicode CMaps”),其值应为一个流对象,包含一种特殊的CMap文件,用于将字符代码映射到Unicode值。
  • 结构元素或标记内容序列的ActualText条目(参见14.9.4,“替换文本”)可用于直接指定文本内容。

The preceding sub-clauses describe all the facilities for showing text and causing glyphs to be painted on the page. In addition to displaying text, conforming readers sometimes need to determine the information content of text—that is, its meaning according to some standard character identification as opposed to its rendered appearance. This need arises during operations such as searching, indexing, and exporting of text to other file formats.

The Unicode standard defines a system for numbering all of the common characters used in a large number of languages. It is a suitable scheme for representing the information content of text, but not its appearance, since Unicode values identify characters, not glyphs. For information about Unicode, see the Unicode Standard by the Unicode Consortium (see the Bibliography).

When extracting character content, a conforming reader can easily convert text to Unicode values if a font’s characters are identified according to a standard character set that is known to the conforming reader. This character identification can occur if either the font uses a standard named encoding or the characters in the font are identified by standard character names or CIDs in a well-known collection. 9.10.2, "Mapping Character Codes to Unicode Values", describes in detail the overall algorithm for mapping character codes to Unicode values.

If a font is not defined in one of these ways, the glyphs can still be shown, but the characters cannot be converted to Unicode values without additional information:

  • This information can be provided as an optional ToUnicode entry in the font dictionary (PDF 1.2; see 9.10.3, "ToUnicode CMaps"), whose value shall be a stream object containing a special kind of CMap file that maps character codes to Unicode values.
  • An ActualText entry for a structure element or marked-content sequence (see 14.9.4, "Replacement Text") may be used to specify the text content directly.

9.10.2 将字符代码映射到 Unicode 值

9.10.2 Mapping Character Codes to Unicode Values

符合规范的读取器可以按照优先顺序使用以下方法将字符代码映射到Unicode值。特别是,标签化PDF文档必须提供至少一种这些方法(参见14.8.2.4.2,“标签化PDF中的Unicode映射”):

  • 如果字体字典包含一个ToUnicode CMap(参见9.10.3,“ToUnicode CMaps”),则使用该CMap将字符代码转换为Unicode。
  • 如果字体是一个简单字体,且使用预定义编码之一MacRomanEncodingMacExpertEncodingWinAnsiEncoding,或者它的编码的Differences数组仅包含来自Adobe标准拉丁字符集和符号字体中命名字符集的字符名称(参见[附录D]):

    a) 根据表D.1和字体的Differences数组将字符代码映射到字符名称。

    b) 在Adobe字形列表中查找字符名称(参见参考书目)以获取相应的Unicode值。

  • 如果字体是一个复合字体,且使用表118中列出的预定义CMaps之一(除了Identity–H和Identity–V)或其后代CIDFont使用Adobe-GB1、Adobe-CNS1、Adobe-Japan1或Adobe-Korea1字符集:

    a) 根据字体的CMap将字符代码映射到字符标识符(CID)。

    b) 从字体的CIDSystemInfo字典中获取字体CMap使用的字符集的注册表和排序(例如,Adobe和Japan1)。

    c) 通过将步骤(b)中获得的注册表和排序连接起来,构造第二个CMap名称,格式为注册表–排序–UCS2(例如,Adobe–Japan1–UCS2)。

    d) 获取步骤©中构造的CMap名称(可从ASN网站获得;参见参考书目)。

    e) 根据步骤(d)中获取的CMap映射步骤(a)中获得的CID,生成Unicode值。

注意

Type 0 字体的后代CIDFont使用Adobe-GB1、Adobe-CNS1、Adobe-Japan1或Adobe-Korea1字符集(如CIDSystemInfo字典中所指定)时,应具有与符合规范的读取器支持的PDF版本对应的补充号。参见表3以查看与给定PDF版本对应的字符集列表。(这些字符集的其他补充可以使用,但如果补充号高于与所支持PDF版本对应的补充号,则只有后者补充中的CID才被视为标准CID。)

如果这些方法未能生成Unicode值,则无法确定字符代码所表示的内容,在这种情况下,符合规范的读取器可以选择自己选择的字符代码。

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

  • If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.
  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

    a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

    b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

  • If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

a) Map the character code to a character identifier (CID) according to the font’s CMap.

b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).

d) Obtain the CMap with the name constructed in step © (available from the ASN Web site; see the Bibliography).

e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

NOTE

Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection (as specified in the CIDSystemInfo dictionary) shall have a supplement number corresponding to the version of PDF supported by the conforming reader. See Table 3 for a list of the character collections corresponding to a given PDF version. (Other supplements of these character collections can be used, but if the supplement is higher-numbered than the one corresponding to the supported PDF version, only the CIDs in the latter supplement are considered to be standard CIDs.)

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.

9.10.3 ToUnicode CMaps

9.10.3 ToUnicode CMaps

在字体字典的ToUnicode条目中定义的CMap应遵循9.7.5中介绍的CMap语法,并在Adobe技术笔记#5014《Adobe CMap和CIDFont文件规范》中有详细记录。有关此条目中定义的CMap的附加指导,请参阅Adobe技术笔记#5411,ToUnicode映射文件教程。该CMap与普通CMap的不同之处如下:

  • CMap流字典中唯一相关的条目(参见表120)是UseCMap,如果CMap是基于另一个ToUnicode CMap,则可以使用此条目。
  • CMap文件应包含begincodespacerangeendcodespacerange运算符,这些运算符与字体使用的编码一致。特别地,对于简单字体,代码空间应为一个字节长。
  • 它应使用beginbfcharendbfcharbeginbfrangeendbfrange运算符来定义字符代码到Unicode字符序列(以UTF-16BE编码表示)的映射。

示例 1

该示例演示了一个使用Identity-H CMap将字符代码映射到CID的Type 0字体,并且其后代CIDFont使用从CID到TrueType字形索引的Identity映射。使用该字体显示的文本仅使用每个字形的2字节字形索引。如果没有ToUnicode条目,则无法得知字形的含义。

14 0 obj
    <<  /Type /Font
        /Subtype /Type0
        /BaseFont /Ryumin−Light
        /Encoding /Identity−H
        /DescendantFonts [ 15 0 R ]
        /ToUnicode 16 0 R
    >>
endobj

15 0 obj
    <<  /Type /Font
        /Subtype /CIDFontType2
        /BaseFont /Ryumin−Light
        /CIDSystemInfo 17 0 R
        /FontDescriptor 18 0 R
        /CIDToGIDMap /Identity
    >>
endobj

示例 2

在这个示例中,ToUnicode条目的值是一个流对象,其中包含CMap的定义。

begincodespacerangeendcodespacerange运算符定义源字符代码范围为从<00 00>到的2字节字符代码。几个字符代码的具体映射如下所示。

16 0 obj
    << /Length 433 >>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry ( Adobe )
/Ordering ( UCS )
/Supplement 0
>> def
/CMapName /Adobe−Identity−UCS def
/CMapType 2 def
1 begincodespacerange
<0000>     <FFFF>
endcodespacerange
2 beginbfrange
<0000>     <005E>    <0020>
<005F>     <0061>    [ <00660066> <00660069> <00660066006C> ]
endbfrange
1 beginbfchar
<3A51>     <D840DC3E>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream
endobj

<00 00> 到 <00 5E> 被映射为Unicode值U+0020到U+007E,接下来是定义的映射,其中每个字符代码表示多个Unicode值:

<005F> <0061> [ <00660066> <00660069> <00660066006C> ]

在这种情况下,原始字符代码是字形联接符ff、fi和ffi的字形索引。该条目定义了从字符代码<00 5F>、<00 60>和<00 61>到Unicode值序列的映射,其中每个字形的Unicode标量值为该联接符中的每个字符:U+0066 U+0066是字符序列ff的Unicode值,U+0066 U+0069是fi的Unicode值,U+0066 U+0066 U+006C是ffi的Unicode值。

最后,字符代码<3A 51>被映射到Unicode值U+2003E,该值在UTF-16BE编码中由字节序列表示。

示例2在本子条款中说明了几种目标值定义方式的扩展。为了支持从源代码到目标代码串的映射,这一扩展已应用于beginbfchar运算符之后定义的范围:

n beginbfchar
srcCode dstString
endbfchar

其中dstString可以是最多512字节的字符串。同样,beginbfrange运算符之后的映射可以定义为:

n beginbfrange
srcCode1 srcCode2 dstString
endbfrange

在这种情况下,字符串中的最后一个字节应针对源代码范围内的每个连续代码递增。

在定义这种类型的范围时,字符串中的最后一个字节的值应小于或等于255 − (srcCode2srcCode1)。这样可以确保字符串的最后一个字节不会超过255;否则,映射结果是未定义的。

为了支持从源字符代码范围到不连续的目标代码范围的更紧凑的映射表示,ToUnicode条目使用的CMap可以使用以下语法进行映射,定义在beginbfrange定义之后。

n beginbfrange
srcCode1 srcCode2 [ dstString1 dstString2 … dstStringm ]
endbfrange

srcCode1开始并以srcCode2结束的连续代码将被映射到从dstString1dstStringm的目标字符串数组中。dstString的值可以是最多512字节的字符串。m的值表示源字符代码范围中的连续字符代码数量。

\(\text{m} = \text{srcCode}_2 - \text{srcCode}_1 + 1\)

The CMap defined in the ToUnicode entry of the font dictionary shall follow the syntax for CMaps introduced in 9.7.5, "CMaps" and fully documented in Adobe Technical Note #5014, Adobe CMap and CIDFont Files Specification. Additional guidance regarding the CMap defined in this entry is provided in Adobe Technical Note #5411, ToUnicode Mapping File Tutorial. This CMap differs from an ordinary one in these ways:

  • The only pertinent entry in the CMap stream dictionary (see Table 120) is UseCMap, which may be used if the CMap is based on another ToUnicode CMap.
  • The CMap file shall contain begincodespacerange and endcodespacerange operators that are consistent with the encoding that the font uses. In particular, for a simple font, the codespace shall be one byte long.
  • It shall use the beginbfchar, endbfchar, beginbfrange, and endbfrange operators to define the mapping from character codes to Unicode character sequences expressed in UTF-16BE encoding.

EXAMPLE 1

This example illustrates a Type 0 font that uses the Identity-H CMap to map from character codes to CIDs and whose descendant CIDFont uses the Identity mapping from CIDs to TrueType glyph indices. Text strings shown using this font simply use a 2-byte glyph index for each glyph. In the absence of a ToUnicode entry, no information would be available about what the glyphs mean.

14 0 obj
    <<  /Type /Font
        /Subtype /Type0
        /BaseFont /Ryumin−Light
        /Encoding /Identity−H
        /DescendantFonts [ 15 0 R ]
        /ToUnicode 16 0 R
    >>
endobj

15 0 obj
    <<  /Type /Font
        /Subtype /CIDFontType2
        /BaseFont /Ryumin−Light
        /CIDSystemInfo 17 0 R
        /FontDescriptor 18 0 R
        /CIDToGIDMap /Identity
    >>
endobj

EXAMPLE 2

In this example, the value of the ToUnicode entry is a stream object that contains the definition of the CMap.

The begincodespacerange and endcodespacerange operators define the source character code range to be the 2-byte character codes from <00 00> to <FF FF>. The specific mappings for several of the character codes are shown.

16 0 obj
    << /Length 433 >>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry ( Adobe )
/Ordering ( UCS )
/Supplement 0
>> def
/CMapName /Adobe−Identity−UCS def
/CMapType 2 def
1 begincodespacerange
<0000>     <FFFF>
endcodespacerange
2 beginbfrange
<0000>     <005E>    <0020>
<005F>     <0061>    [ <00660066> <00660069> <00660066006C> ]
endbfrange
1 beginbfchar
<3A51>     <D840DC3E>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream
endobj

<00 00> to <00 5E> are mapped to the Unicode values U+0020 to U+007E This is followed by the definition of a mapping where each character code represents more than one Unicode value:

<005F> <0061> [ <00660066> <00660069> <00660066006C> ]

In this case, the original character codes are the glyph indices for the ligatures ff, fi, and ffl. The entry defines the mapping from the character codes <00 5F>, <00 60>, and <00 61> to the strings of Unicode values with a Unicode scalar value for each character in the ligature: U+0066 U+0066 are the Unicode values for the character sequence f f, U+0066 U+0069 for f i, and U+0066 U+0066 U+006c for f f l.

Finally, the character code <3A 51> is mapped to the Unicode value U+2003E, which is expressed by the byte sequence <D840DC3E> in UTF-16BE encoding.

EXAMPLE 2 in this sub-clause illustrates several extensions to the way destination values may be defined. To support mappings from a source code to a string of destination codes, this extension has been made to the ranges defined after a beginbfchar operator:

n beginbfchar
srcCode dstString
endbfchar

where dstString may be a string of up to 512 bytes. Likewise, mappings after the beginbfrange operator may be defined as:

n beginbfrange
srcCode1 srcCode2 dstString
endbfrange

In this case, the last byte of the string shall be incremented for each consecutive code in the source code range.

When defining ranges of this type, the value of the last byte in the string shall be less than or equal to 255 − (srcCode2srcCode1). This ensures that the last byte of the string shall not be incremented past 255; otherwise, the result of mapping is undefined.

To support more compact representations of mappings from a range of source character codes to a discontiguous range of destination codes, the CMaps used for the ToUnicode entry may use this syntax for the mappings following a beginbfrange definition.

n beginbfrange
srcCode1 srcCode2 [ dstString1 dstString2 … dstStringm ]
endbfrange

Consecutive codes starting with srcCode1 and ending with srcCode2 shall be mapped to the destination strings in the array starting with dstString1 and ending with dstStringm . The value of dstString can be a string of up to 512 bytes. The value of m represents the number of continuous character codes in the source character code range.

\(\text{m} = \text{srcCode}_2 - \text{srcCode}_1 + 1\)