Reading PDF Annotations
PDF 2.0 defines the following annotation types:
Text
Link
FreeText
Line
Square
Circle
Polygon
PolyLine
Highlight
Underline
Squiggly
StrikeOut
Caret
Stamp
Ink
Popup
FileAttachment
Sound
Movie
Screen
Widget
PrinterMark
TrapNet
Watermark
3D
Redact
Projection
RichMedia
In general, annotations can be read like this:
from pypdf import PdfReader

reader = PdfReader("annotated.pdf")
for page in reader.pages:
    if "/Annots" in page:
        for annotation in page["/Annots"]:
            obj = annotation.get_object()
            print({"subtype": obj["/Subtype"], "location": obj["/Rect"]})
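To get an overview of a file before picking out particular subtypes, the same loop can be turned into a tally. The following is a minimal sketch and not part of pypdf: `count_subtypes` and `_resolve` are hypothetical helper names, and the function assumes only that each page behaves like a mapping, as pypdf page objects do.

```python
from collections import Counter


def _resolve(obj):
    """Follow an indirect reference if the object offers get_object(); otherwise return it unchanged."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def count_subtypes(pages):
    """Count annotations per /Subtype across an iterable of page-like mappings."""
    counts = Counter()
    for page in pages:
        if "/Annots" not in page:
            continue
        for annotation in _resolve(page["/Annots"]):
            obj = _resolve(annotation)
            # /Subtype is required by the PDF specification, but malformed files may omit it.
            subtype = _resolve(obj.get("/Subtype", "unknown"))
            counts[str(subtype)] += 1
    return counts


# With pypdf installed, this could be called as:
#   from pypdf import PdfReader
#   print(count_subtypes(PdfReader("annotated.pdf").pages))
```

Because the helper only relies on mapping access, it can be tried out with plain dictionaries before pointing it at a real file.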
Examples of reading three common annotation types:
Text
from pypdf import PdfReader

reader = PdfReader("example.pdf")
for page in reader.pages:
    if "/Annots" in page:
        for annotation in page["/Annots"]:
            subtype = annotation.get_object()["/Subtype"]
            if subtype == "/Text":
                print(annotation.get_object()["/Contents"])
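`/Contents` is an optional entry, so indexing it directly raises `KeyError` for a note that has no text. A more defensive variant, sketched here under the hypothetical helper name `text_note`, uses `.get` instead. It also reads `/T`, the entry that holds the annotation's title and that viewers conventionally fill with the author's name.

```python
def text_note(obj):
    """Return author and contents of a /Text annotation, or None for any other subtype."""
    if obj.get("/Subtype") != "/Text":
        return None
    return {
        "author": obj.get("/T"),           # optional, None when absent
        "contents": obj.get("/Contents"),  # optional, None when absent
    }


# Inside the loop shown earlier, this could be used as:
#   note = text_note(annotation.get_object())
#   if note is not None:
#       print(note)
```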
Highlights
from pypdf import PdfReader

reader = PdfReader("example.pdf")
for page in reader.pages:
    if "/Annots" in page:
        for annotation in page["/Annots"]:
            subtype = annotation.get_object()["/Subtype"]
            if subtype == "/Highlight":
                coords = annotation.get_object()["/QuadPoints"]
                # Works for a single quadrilateral (8 numbers) only
                x1, y1, x2, y2, x3, y3, x4, y4 = coords
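`/QuadPoints` holds eight numbers per quadrilateral, and a highlight that spans several lines of text usually carries one quadrilateral per line, so unpacking into exactly eight names fails for such highlights. The sketch below, with the hypothetical helper name `quads_to_boxes`, splits the flat array into groups of eight and returns one bounding box per group. It takes the minimum and maximum of the coordinates, so the result does not depend on the order in which the corners are listed.

```python
def quads_to_boxes(quad_points):
    """Split a flat /QuadPoints array into (left, bottom, right, top) boxes."""
    values = [float(v) for v in quad_points]
    if len(values) % 8 != 0:
        raise ValueError("expected a multiple of 8 numbers")
    boxes = []
    for i in range(0, len(values), 8):
        quad = values[i:i + 8]
        xs, ys = quad[0::2], quad[1::2]  # even positions are x, odd positions are y
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


# Inside the loop shown earlier, this could be used as:
#   for box in quads_to_boxes(annotation.get_object()["/QuadPoints"]):
#       print(box)
```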
Attachments
from pypdf import PdfReader

reader = PdfReader("example.pdf")
attachments = {}
for page in reader.pages:
    if "/Annots" in page:
        for annotation in page["/Annots"]:
            subtype = annotation.get_object()["/Subtype"]
            if subtype == "/FileAttachment":
                fileobj = annotation.get_object()["/FS"]
                attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].get_data()
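The names stored in `/F` come from the PDF itself and may contain directory parts, so it is safer not to use them directly as output paths. As a minimal sketch, the hypothetical helper `save_attachments` writes a name-to-bytes dictionary such as `attachments` into a directory and keeps only the final path component of each name.

```python
from pathlib import Path


def save_attachments(attachments, out_dir):
    """Write each attachment into out_dir and return the list of paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, data in attachments.items():
        # Keep only the last path component so a name like "../x" stays inside out_dir.
        safe = Path(str(name).replace("\\", "/")).name
        if safe in ("", ".."):
            safe = "attachment"  # fallback for names without a usable file part
        target = out / safe
        target.write_bytes(data)
        written.append(target)
    return written


# Example call, assuming the `attachments` dictionary built earlier:
#   save_attachments(attachments, "extracted")
```

Two attachments that reduce to the same file name overwrite each other in this sketch; adding a counter suffix would be one way to avoid that.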