使用 Python 从 PDF 中提取元素#
Extract elements from a PDF using Python
高级函数可用于实现常见任务。在本例中,我们可以使用 extract_pages:
from pdfminer.high_level import extract_pages
for page_layout in extract_pages("test.pdf"):
for element in page_layout:
print(element)
每个 element
将是一个 LTTextBox
, LTFigure
, LTLine
, LTRect
或一个 LTImage
。其中一些可以进一步迭代,例如迭代 LTTextBox
将得到一个 LTTextLine
,而这些又可以迭代得到一个 LTChar
。请参阅此处的图表:布局分析算法。
假设我们想提取所有文本。我们可以这样做:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages("test.pdf"):
for element in page_layout:
if isinstance(element, LTTextContainer):
print(element.get_text())
或者,我们可以提取每个字符的字体名称或大小:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
for page_layout in extract_pages("test.pdf"):
for element in page_layout:
if isinstance(element, LTTextContainer):
for text_line in element:
for character in text_line:
if isinstance(character, LTChar):
print(character.fontname)
print(character.size)
The high level functions can be used to achieve common tasks. In this case, we can use extract_pages:
from pdfminer.high_level import extract_pages
for page_layout in extract_pages("test.pdf"):
for element in page_layout:
print(element)
Each element
will be an LTTextBox
, LTFigure
, LTLine
, LTRect
or an LTImage
. Some of these can be iterated further, for example iterating
though an LTTextBox
will give you an LTTextLine
, and these in turn can
be iterated through to get an LTChar
. See the diagram here:
布局分析算法.
Let’s say we want to extract all of the text. We could do:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages("test.pdf"):
for element in page_layout:
if isinstance(element, LTTextContainer):
print(element.get_text())
Or, we could extract the fontname or size of each individual character:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
for page_layout in extract_pages("test.pdf"):
for element in page_layout:
if isinstance(element, LTTextContainer):
for text_line in element:
for character in text_line:
if isinstance(character, LTChar):
print(character.fontname)
print(character.size)