Reading PDF Annotations

PDF 2.0 defines the following annotation types:

  • Text

  • Link

  • FreeText

  • Line

  • Square

  • Circle

  • Polygon

  • PolyLine

  • Highlight

  • Underline

  • Squiggly

  • StrikeOut

  • Caret

  • Stamp

  • Ink

  • Popup

  • FileAttachment

  • Sound

  • Movie

  • Screen

  • Widget

  • PrinterMark

  • TrapNet

  • Watermark

  • 3D

  • Redact

  • Projection

  • RichMedia

In general, annotations can be read like this:

from pypdf import PdfReader

reader = PdfReader("annotated.pdf")

for page in reader.pages:
    if "/Annots" in page:
        for annotation in page["/Annots"]:
            obj = annotation.get_object()
            print({"subtype": obj["/Subtype"], "location": obj["/Rect"]})
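Before reading individual annotation types, it can help to see which subtypes a document actually contains. The following is a minimal sketch, not part of pypdf: the helper names `resolve`, `count_subtypes`, and `count_subtypes_in_file` are hypothetical, and the code assumes each page behaves like a dictionary whose optional `/Annots` entry is a list of annotation dictionaries or indirect references to them.

```python
from collections import Counter


def resolve(obj):
    """Follow a pypdf indirect reference if there is one; pass plain objects through."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def count_subtypes(pages):
    """Count annotations per /Subtype across an iterable of page dictionaries."""
    counts = Counter()
    for page in pages:
        # /Annots is optional, so pages without annotations are skipped.
        if "/Annots" not in page:
            continue
        for annotation in resolve(page["/Annots"]):
            obj = resolve(annotation)
            counts[str(obj["/Subtype"])] += 1
    return counts


def count_subtypes_in_file(path):
    """Hypothetical convenience wrapper that opens the PDF with pypdf."""
    # Imported lazily so the helpers above can be used without pypdf installed.
    from pypdf import PdfReader

    return count_subtypes(PdfReader(path).pages)
```

Calling `count_subtypes_in_file("annotated.pdf")` would return a `Counter` keyed by subtype name such as `/Text` or `/Highlight`, which shows at a glance which of the examples in this section apply to a given file.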

The following examples show how to read three common annotation types:

Text

from pypdf import PdfReader

reader = PdfReader("example.pdf")

for page in reader.pages:
    if "/Annots" in page:
        for annotation in page["/Annots"]:
            obj = annotation.get_object()
            if obj["/Subtype"] == "/Text":
                # /Contents is optional, so fall back to an empty string.
                print(obj.get("/Contents", ""))

Highlights

from pypdf import PdfReader

reader = PdfReader("example.pdf")

for page in reader.pages:
    if "/Annots" in page:
        for annotation in page["/Annots"]:
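            obj = annotation.get_object()
            if obj["/Subtype"] == "/Highlight":
                # /QuadPoints holds 8 numbers per highlighted region; a
                # highlight that spans several lines has several regions.
                coords = obj["/QuadPoints"]
                for i in range(0, len(coords), 8):
                    x1, y1, x2, y2, x3, y3, x4, y4 = coords[i : i + 8]
                    print((x1, y1), (x2, y2), (x3, y3), (x4, y4))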
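The `/QuadPoints` array stores eight numbers (four corner points) per highlighted region, so a highlight that wraps across lines carries a multiple of eight values. As a rough sketch, the hypothetical helper below (`quads_to_boxes` is not a pypdf function) groups the flat array into quadrilaterals and reduces each one to an axis-aligned bounding box. It takes the minimum and maximum of the coordinates rather than relying on a particular corner order, because the order varies between the tools that write these files.

```python
def quads_to_boxes(quad_points):
    """Convert a flat /QuadPoints sequence into (left, bottom, right, top) boxes."""
    values = [float(v) for v in quad_points]
    if len(values) % 8 != 0:
        raise ValueError("expected a multiple of 8 numbers in /QuadPoints")

    boxes = []
    for i in range(0, len(values), 8):
        quad = values[i : i + 8]
        xs = quad[0::2]  # x coordinates of the four corners
        ys = quad[1::2]  # y coordinates of the four corners
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

Each returned box is in PDF user space units, the same coordinate system as the annotation's `/Rect` entry.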

Attachments

from pypdf import PdfReader

reader = PdfReader("example.pdf")

attachments = {}
for page in reader.pages:
    if "/Annots" in page:
        for annotation in page["/Annots"]:
            subtype = annotation.get_object()["/Subtype"]
            if subtype == "/FileAttachment":
                fileobj = annotation.get_object()["/FS"]
                attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].get_data()
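Once the attachments are collected in a dictionary of file names to bytes, they usually need to be written somewhere. The sketch below is one possible way to do that with the standard library; `save_attachments` is a hypothetical helper, and it assumes the dictionary values are `bytes`. File names come from the PDF and are therefore untrusted, so only the final path component is kept, which prevents a name like `../evil.bin` from escaping the output directory.

```python
from pathlib import Path


def save_attachments(attachments, out_dir):
    """Write each attachment into out_dir and return the list of written paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    for name, data in attachments.items():
        # Keep only the last path component; fall back to a generic name if empty.
        safe_name = Path(str(name)).name or "attachment"
        target = out_path / safe_name
        target.write_bytes(data)
        written.append(target)
    return written
```

Note that two attachments with the same base name would overwrite each other in this sketch; a real tool might add a numeric suffix to avoid that.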