Appendix 4: Performance Comparison Methodology
This appendix documents how PyMuPDF’s performance is measured, and which tools and example files are used for the comparisons.
The following three sections deal with different performance aspects:
Document Copying - Opening and parsing PDFs, then writing them to an output file. Because joining (merging) PDFs relies on the same basic activities, the results also apply to those use cases.
Text Extraction - Extracting plain text from PDFs and writing it to an output text file.
Page Rendering - Converting PDF pages to image files that look identical to the pages. This ability is the basic prerequisite for using a tool in Python GUI scripts that scroll through documents. We have chosen a medium-quality version (resolution 150 DPI).
Please note that in no case is the actual speed of dealing with PDF structures measured directly: the timings also include the time spent writing files to the operating system’s file system. This cannot be avoided, because tools other than PyMuPDF do not offer the option to separate, for example, the image creation step from the subsequent step that writes the image to a file.
All documented timings therefore include a common, OS-related base effort. As a consequence, the relative performance differences between tools are actually larger than the numbers suggest.
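A small worked example may help to show why a shared base effort makes the measured ratios understate the real differences. The numbers below are hypothetical and chosen only for illustration; they are not measurements from this appendix.

```python
# Hypothetical numbers, chosen only to illustrate the effect.
pure_a = 1.0    # seconds tool A spends on the PDF work itself
pure_b = 3.0    # seconds tool B spends on the PDF work itself
overhead = 0.5  # seconds of file-system writing, common to both tools

# Ratio of the pure processing times (what we would like to know).
true_ratio = pure_b / pure_a

# Ratio of the times that can actually be measured.
measured_ratio = (pure_b + overhead) / (pure_a + overhead)

print(f"true ratio:     {true_ratio:.2f}")
print(f"measured ratio: {measured_ratio:.2f}")
```

Adding the same constant to both timings always pulls their ratio towards 1, so the measured ratio is the smaller of the two.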
Files used
A set of eight files is used for the performance testing. With each file we have the following information:
Name of the file and download link.
Size in bytes.
Total number of pages in file.
Total number of bookmarks (Table of Contents entries).
Total number of links.
KB size per page.
Textsize per page is the amount of text in the whole file in KB, divided by the number of pages.
Any notes to generally describe the type of file.
| Name | Size (bytes) | Pages | TOC size | Links | KB/page | Textsize/page | Notes |
|---|---|---|---|---|---|---|---|
| adobe.pdf | 32,472,771 | 1,310 | 794 | 32,096 | 24 | 1,942 | linearized, many links / bookmarks |
| artifex-website.pdf | 31,570,732 | 47 | 46 | 2,035 | 656 | 3,538 | graphics oriented |
| db-systems.pdf | 29,326,355 | 1,241 | 0 | 0 | 23 | 2,142 | |
| fontforge.pdf | 8,222,384 | 214 | 31 | 242 | 38 | 1,058 | mix of text & graphics |
| pandas.pdf | 10,585,962 | 3,071 | 536 | 16,554 | 3 | 1,539 | many pages |
| pymupdf.pdf | 6,805,176 | 478 | 276 | 5,277 | 14 | 1,937 | text oriented |
| pythonbook.pdf | 9,983,856 | 669 | 198 | 1,953 | 15 | 1,929 | |
| sample-50-MB-pdf-file.pdf | 52,521,850 | 1 | 0 | 0 | 51,291 | 23,860 | single page, graphics oriented, large file size |
Note
adobe.pdf and pymupdf.pdf are clearly text oriented, artifex-website.pdf and sample-50-MB-pdf-file.pdf are graphics oriented. Other files are a mix of both.
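The KB/page column can be recomputed from the size and page count of each file. The sketch below assumes that 1 KB means 1024 bytes and that values are rounded to the nearest integer; both are inferred from the listed numbers rather than stated in the text. `kb_per_page` is a helper name introduced here for illustration.

```python
# (size in bytes, page count, KB/page as listed) for the eight table rows
rows = [
    (32_472_771, 1310, 24),
    (31_570_732, 47, 656),
    (29_326_355, 1241, 23),
    (8_222_384, 214, 38),
    (10_585_962, 3071, 3),
    (6_805_176, 478, 14),
    (9_983_856, 669, 15),
    (52_521_850, 1, 51_291),
]


def kb_per_page(size_bytes, pages):
    """Average file size per page in KB (assuming 1 KB = 1024 bytes), rounded."""
    return round(size_bytes / pages / 1024)


computed = [kb_per_page(size, pages) for size, pages, _ in rows]
for (size, pages, listed), value in zip(rows, computed):
    print(f"{size:>11,} bytes / {pages:>5,} pages -> {value:,} KB/page (listed: {listed:,})")
```

Under these assumptions the computed values agree with all eight listed entries.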
Tools used
In each section, the same fixed set of PDF files is processed by a set of tools. The set of tools varies per performance aspect, however, depending on the features each tool supports.
All tools are either platform independent or can at least run on both Windows and Unix / Linux.
| Tool | Description |
|---|---|
| PyMuPDF | The tool of this manual. |
| PDFrw | A pure Python tool, used by rst2pdf, with an interface to ReportLab. |
| PyPDF2 | A pure Python tool with a large function set. |
| PDFMiner | A pure Python tool to extract text and other data from PDF. |
| XPDF | A command line utility with multiple functions. |
| PikePDF | A Python package similar to PDFrw, but based on the C++ library QPDF. |
| PDF2JPG | A Python package specialized in rendering PDF pages to JPG images. |
Copying / Joining / Merging
How fast is a PDF file read and its content parsed for further processing? Pure parsing performance cannot be compared directly, because batch utilities always execute a requested task completely, in one go, from start to finish. In addition, PDFrw uses a lazy parsing strategy: it parses only those parts of a document that are required at any given moment.
To answer the question, we therefore measure the time each tool needs to copy a PDF file to an output file, doing nothing else.
These are the Python commands used for each tool:
PyMuPDF
import pymupdf
doc = pymupdf.open("input.pdf")
doc.save("output.pdf")
PDFrw
from pdfrw import PdfReader, PdfWriter
doc = PdfReader("input.pdf")
writer = PdfWriter()
writer.trailer = doc
writer.write("output.pdf")
PikePDF
from pikepdf import Pdf
doc = Pdf.open("input.pdf")
doc.save("output.pdf")
PyPDF2
import PyPDF2
pdfmerge = PyPDF2.PdfMerger()
pdfmerge.append("input.pdf")
pdfmerge.write("output.pdf")
pdfmerge.close()
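The text does not say how the run times were taken. One plausible way, sketched here as an assumption rather than the authors’ actual harness, is to wrap each copy snippet in a function and time it with `time.perf_counter`. The helper `measure`, the best-of-three policy, and the dummy input file are all hypothetical; a plain byte copy stands in for a real tool so that the sketch runs without any PDF library installed.

```python
import os
import shutil
import tempfile
import time


def measure(task, repeat=3):
    """Return the shortest wall-clock time in seconds over `repeat` runs of task()."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        task()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload: a plain byte copy of a dummy file. Replace the lambda
# below with one of the tool-specific copy snippets to time a real tool.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "input.pdf")
dst = os.path.join(workdir, "output.pdf")
with open(src, "wb") as f:
    f.write(b"%PDF-1.7\n" + b"0" * 1_000_000)

seconds = measure(lambda: shutil.copyfile(src, dst))
print(f"copy took {seconds:.4f} s")

shutil.rmtree(workdir)  # clean up the temporary files
```

Taking the minimum over several runs reduces the influence of unrelated system activity; a single run or an average would be equally defensible choices.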
Observations
These are our measured run times in seconds. The last row gives each tool’s total run time relative to PyMuPDF:
| Name | PyMuPDF | PDFrw | PikePDF | PyPDF2 |
|---|---|---|---|---|
| adobe.pdf | 1.75 | 5.15 | 22.37 | 374.05 |
| artifex-website.pdf | 0.26 | 0.38 | 1.41 | 2.81 |
| db-systems.pdf | 0.15 | 0.8 | 1.68 | 2.46 |
| fontforge.pdf | 0.09 | 0.14 | 0.28 | 1.1 |
| pandas.pdf | 0.38 | 2.21 | 2.73 | 70.3 |
| pymupdf.pdf | 0.11 | 0.56 | 0.83 | 6.05 |
| pythonbook.pdf | 0.19 | 1.2 | 1.34 | 37.19 |
| sample-50-MB-pdf-file.pdf | 0.12 | 0.1 | 2.93 | 0.08 |
| Total | 3.05 | 10.54 | 33.57 | 494.04 |
| Rate compared to PyMuPDF | 1.0 | 3.5 | 11.0 | 162 |
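The two summary rows follow directly from the per-file timings: the total is the sum over all eight files, and the rate is each tool’s total divided by PyMuPDF’s total. A short sketch using the copy timings listed above:

```python
# Copy timings in seconds, one value per file, in table order.
timings = {
    "PyMuPDF": [1.75, 0.26, 0.15, 0.09, 0.38, 0.11, 0.19, 0.12],
    "PDFrw": [5.15, 0.38, 0.8, 0.14, 2.21, 0.56, 1.2, 0.1],
    "PikePDF": [22.37, 1.41, 1.68, 0.28, 2.73, 0.83, 1.34, 2.93],
    "PyPDF2": [374.05, 2.81, 2.46, 1.1, 70.3, 6.05, 37.19, 0.08],
}

# "Total" row: sum of all per-file timings for each tool.
totals = {tool: sum(values) for tool, values in timings.items()}

# "Rate compared to PyMuPDF" row: each total divided by PyMuPDF's total.
baseline = totals["PyMuPDF"]
rates = {tool: total / baseline for tool, total in totals.items()}

for tool in timings:
    print(f"{tool}: total {totals[tool]:.2f} s, rate {rates[tool]:.1f}")
```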
Text Extraction
The following table shows plain text extraction durations. All tools were used with their most basic functionality, i.e. without layout rearrangement or similar options.
Observations
These are our measured run times in seconds. The last row gives each tool’s total run time relative to PyMuPDF:
| Name | PyMuPDF | XPDF | PyPDF2 | PDFMiner |
|---|---|---|---|---|
| adobe.pdf | 2.01 | 6.19 | 22.2 | 49.15 |
| artifex-website.pdf | 0.18 | 0.3 | 1.1 | 4.06 |
| db-systems.pdf | 1.57 | 4.26 | 25.75 | 42.19 |
| fontforge.pdf | 0.24 | 0.47 | 2.69 | 4.2 |
| pandas.pdf | 2.41 | 10.54 | 25.38 | 76.56 |
| pymupdf.pdf | 0.49 | 2.34 | 6.44 | 13.55 |
| pythonbook.pdf | 0.84 | 2.88 | 9.28 | 24.27 |
| sample-50-MB-pdf-file.pdf | 0.27 | 0.44 | 8.8 | 13.29 |
| Total | 8.01 | 27.42 | 101.64 | 227.27 |
| Rate compared to PyMuPDF | 1.0 | 3.42 | 12.69 | 28.37 |
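The overall rate is dominated by the slowest files, so it can hide how much the ratio varies from file to file. The following sketch computes, for each tool, the smallest and largest per-file ratio against PyMuPDF from the timings above. The variable names are chosen for this illustration only.

```python
# Text extraction timings in seconds, copied from the table above:
# file name -> (PyMuPDF, XPDF, PyPDF2, PDFMiner)
timings = {
    "adobe.pdf": (2.01, 6.19, 22.2, 49.15),
    "artifex-website.pdf": (0.18, 0.3, 1.1, 4.06),
    "db-systems.pdf": (1.57, 4.26, 25.75, 42.19),
    "fontforge.pdf": (0.24, 0.47, 2.69, 4.2),
    "pandas.pdf": (2.41, 10.54, 25.38, 76.56),
    "pymupdf.pdf": (0.49, 2.34, 6.44, 13.55),
    "pythonbook.pdf": (0.84, 2.88, 9.28, 24.27),
    "sample-50-MB-pdf-file.pdf": (0.27, 0.44, 8.8, 13.29),
}
tools = ["XPDF", "PyPDF2", "PDFMiner"]

spread = {}  # tool -> (file with smallest ratio, file with largest ratio)
for column, tool in enumerate(tools, start=1):
    # Per-file ratio of this tool's time to PyMuPDF's time (column 0).
    ratios = {name: row[column] / row[0] for name, row in timings.items()}
    low = min(ratios, key=ratios.get)
    high = max(ratios, key=ratios.get)
    spread[tool] = (low, high)
    print(f"{tool}: {ratios[low]:.1f}x ({low}) to {ratios[high]:.1f}x ({high})")
```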
Page Rendering
We have tested the rendering speed of PyMuPDF against pdf2jpg and XPDF at a resolution of 150 DPI.
These are the commands used for each tool:
PyMuPDF
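import pymupdf

def ProcessFile(datei):
    print("processing:", datei)
    doc = pymupdf.open(datei)
    for p in doc:  # iterate over all pages of the document
        pix = p.get_pixmap(dpi=150)  # render the page at 150 DPI
        pix.save("t-%s.png" % p.number)  # write the page image as PNG
        pix = None  # release the pixmap
    doc.close()
    return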
XPDF
pdftopng.exe -r 150 file.pdf ./
PDF2JPG
from pdf2jpg import pdf2jpg

def ProcessFile(datei):
    print("processing:", datei)
    # convert all pages to JPG files inside the "images" folder
    pdf2jpg.convert_pdf2jpg(datei, "images", pages="ALL", dpi=150)
    return
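As noted in the introduction, PyMuPDF allows image creation to be timed separately from writing the image file. The sketch below shows one way this could be done; it is not part of the measurements reported here. The helper names `timed` and `render_and_write_times` and the `out_dir` parameter are introduced for this illustration. The PyMuPDF import is made optional so that the timing helper can be used on its own.

```python
import os
import time

try:
    import pymupdf  # PyMuPDF; optional so that timed() works without it
except ImportError:
    pymupdf = None


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def render_and_write_times(path, out_dir=".", dpi=150):
    """Return (render_seconds, write_seconds), summed over all pages of a PDF."""
    if pymupdf is None:
        raise RuntimeError("PyMuPDF is not installed")
    render_total = write_total = 0.0
    doc = pymupdf.open(path)
    for page in doc:
        # Step 1: create the image in memory.
        pix, t_render = timed(page.get_pixmap, dpi=dpi)
        # Step 2: write the image to the file system.
        target = os.path.join(out_dir, "t-%s.png" % page.number)
        _, t_write = timed(pix.save, target)
        render_total += t_render
        write_total += t_write
    doc.close()
    return render_total, write_total
```

With PyMuPDF installed, `render_and_write_times("input.pdf")` returns the two accumulated durations, which gives a rough idea of how much of the total is file-system effort.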
Observations
These are our measured run times in seconds. The last row gives each tool’s total run time relative to PyMuPDF:
| Name | PyMuPDF | XPDF | PDF2JPG |
|---|---|---|---|
| adobe.pdf | 51.33 | 98.16 | 75.71 |
| artifex-website.pdf | 26.35 | 51.28 | 54.11 |
| db-systems.pdf | 84.59 | 143.16 | 405.22 |
| fontforge.pdf | 12.23 | 22.18 | 20.14 |
| pandas.pdf | 138.74 | 241.67 | 202.06 |
| pymupdf.pdf | 22.35 | 39.11 | 33.38 |
| pythonbook.pdf | 30.44 | 49.12 | 55.68 |
| sample-50-MB-pdf-file.pdf | 1.01 | 1.32 | 5.22 |
| Total | 367.04 | 646 | 851.52 |
| Rate compared to PyMuPDF | 1.0 | 1.76 | 2.32 |
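Combining the totals above with the page counts from the "Files used" section gives an average rendering throughput per tool. This is a derived figure, not one reported by the original measurements, and it averages over files of very different character.

```python
# Page counts of the eight test files, from the "Files used" section.
pages = [1310, 47, 1241, 214, 3071, 478, 669, 1]

# Total rendering time in seconds per tool, from the "Total" row above.
totals = {"PyMuPDF": 367.04, "XPDF": 646.0, "PDF2JPG": 851.52}

total_pages = sum(pages)
throughput = {tool: total_pages / seconds for tool, seconds in totals.items()}

print(f"pages rendered per tool: {total_pages}")
for tool, rate in throughput.items():
    print(f"{tool}: {rate:.1f} pages/s")
```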