PyMuPDF, LLM & RAG

Integrating PyMuPDF into your Large Language Model (LLM) framework and overall RAG (Retrieval-Augmented Generation) solution provides the fastest and most reliable way to deliver document data.

Several well-known LLM solutions have their own interfaces to PyMuPDF. This is a fast-growing area, so please let us know if you discover any more!

If you need to export to Markdown or obtain a LlamaIndex Document from a file, see the PyMuPDF4LLM package described under Outputting as Markdown.

Integration with LangChain

It is simple to integrate directly with LangChain by using their dedicated loader as follows:

from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("example.pdf")
data = loader.load()

See LangChain Using PyMuPDF for full details.

Integration with LlamaIndex

Use the dedicated PyMuPDFReader from LlamaIndex 🦙 to manage your document loading.

from llama_index.readers.file import PyMuPDFReader
loader = PyMuPDFReader()
documents = loader.load(file_path="example.pdf")

See Building RAG from Scratch for more.

Preparing Data for Chunking

Chunking (or splitting) data is essential for giving your LLM relevant context. Because PyMuPDF now supports Markdown output, Level 3 chunking is supported as well, that is, splitting a document along its own structure, such as Markdown headings.
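
To make the idea of structure-aware chunking concrete, here is a minimal sketch in plain Python that splits Markdown text at its headings. The function name `split_markdown_by_heading` and the `max_level` parameter are hypothetical, chosen for illustration only; this is not how PyMuPDF or any LLM framework implements splitting, and a production splitter would handle more Markdown edge cases.

```python
import re

FENCE = "`" * 3  # marker that opens or closes a fenced code block
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_by_heading(md_text: str, max_level: int = 2) -> list[dict]:
    """Split Markdown into sections starting at headings of level <= max_level."""
    sections = []
    title, lines = None, []
    in_fence = False
    for line in md_text.splitlines():
        # Track fenced code blocks so a "#" comment inside code is not a heading.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match and len(match.group(1)) <= max_level:
            # A qualifying heading closes the current section and opens a new one.
            sections.append((title, lines))
            title, lines = match.group(2), [line]
        else:
            lines.append(line)
    sections.append((title, lines))

    # Drop sections that contain only whitespace (e.g. an empty preamble).
    result = []
    for section_title, section_lines in sections:
        text = "\n".join(section_lines).strip()
        if text:
            result.append({"title": section_title, "text": text})
    return result


sample = (
    "# Title\n\nIntro text.\n\n"
    "## Setup\n\nInstall it.\n\n### Details\n\nMore.\n\n"
    "## Usage\n\nRun it.\n"
)
sections = split_markdown_by_heading(sample)
for section in sections:
    print(section["title"], "->", len(section["text"]), "characters")
```

Each section keeps its heading line, so a chunk carries its own context when it is later passed to an LLM. Deeper headings (here `### Details`) stay inside their parent section.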

Outputting as Markdown

To export your document in Markdown format you need a separate helper package. PyMuPDF4LLM is a high-level wrapper around PyMuPDF functions that extracts the standard text and table text of each page and combines them into a single Markdown-formatted string covering all document pages:

# convert the document to markdown
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf")

# write the text to a file in UTF-8 encoding
import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())

For further information please refer to: PyMuPDF4LLM.
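
If you want one Markdown chunk per page instead of a single string, `to_markdown` also accepts `page_chunks=True`, which returns a list of dictionaries. The sketch below assumes each dictionary exposes the page's Markdown under the `"text"` key; the helper names `load_page_chunks` and `non_empty_pages` are hypothetical and used for illustration only.

```python
from typing import Any


def load_page_chunks(pdf_path: str) -> list[dict[str, Any]]:
    """Convert a PDF into one Markdown chunk per page (requires pymupdf4llm)."""
    # Imported lazily so the helper below can be used without the package.
    import pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path, page_chunks=True)


def non_empty_pages(page_chunks: list[dict[str, Any]]) -> list[tuple[int, str]]:
    """Return (position, markdown) pairs for chunks that contain text.

    The position is the 1-based index in the list, which equals the page
    number only when the whole document was converted.
    """
    pairs = []
    for position, chunk in enumerate(page_chunks, start=1):
        text = chunk["text"].strip()  # assumption: Markdown is stored under "text"
        if text:
            pairs.append((position, text))
    return pairs


# Usage, assuming "input.pdf" exists:
#   for position, markdown in non_empty_pages(load_page_chunks("input.pdf")):
#       print(position, markdown[:80])
```

Keeping the page position next to each chunk makes it possible to cite the source page when the chunk is later retrieved.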

How to use Markdown output

Once your data is in Markdown format, you can chunk (split) it and supply it to your LLM. For example, with LangChain:

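import pymupdf4llm
# older LangChain releases expose this class as langchain.text_splitter.MarkdownTextSplitter
from langchain_text_splitters import MarkdownTextSplitter

# get the Markdown text for all pages
md_text = pymupdf4llm.to_markdown("input.pdf")

# split into chunks of roughly 40 characters, with no overlap between chunks
splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)

# wrap each chunk in a LangChain Document
documents = splitter.create_documents([md_text])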

For more, see 5 Levels of Text Splitting.
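
To illustrate what the `chunk_size` and `chunk_overlap` parameters mean, here is a simplified fixed-window splitter in plain Python. It is not LangChain's algorithm: `MarkdownTextSplitter` prefers to break at Markdown separators such as headings and blank lines, whereas this sketch cuts at exact character positions. The function name `fixed_window_chunks` is hypothetical.

```python
def fixed_window_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(fixed_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=0))
# → ['abcd', 'efgh', 'ij']
print(fixed_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

A non-zero overlap repeats the tail of one chunk at the head of the next, which helps when a sentence would otherwise be cut in half at a chunk boundary.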
