Simplifying Document Conversion: A Modern Document Processing Stack for LLM Applications

Rifx.Online
Natural Language Processing , Generative AI , Technology
28 Feb, 2025

A comprehensive guide to converting PDF, DOCX, and web content into clean Markdown with metadata for scalable, efficient AI pipelines

Clean Markdown with metadata

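Building applications powered by large language models (LLMs) is challenging, especially when user-supplied data arrives in a variety of often messy formats. Whether the source is a PDF, a Word document, a spreadsheet, or a web page, each has its own quirks.

A modern document processing stack addresses these challenges by converting many document types into a consistent, LLM-friendly format (typically Markdown) and enriching the output with useful metadata.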

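This unified approach simplifies downstream processing (such as embedding or retrieval-augmented generation) and enables rapid prototyping without reinventing the wheel.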

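Why a Modern Document Processing Stack?

The Problem

Data diversity: Documents arrive in many formats, with different layouts, structures, and embedded elements.
LLM limitations: LLMs perform best on clean, structured input. Raw PDFs or messy HTML can lead to hallucinations or poor-quality embeddings.
Customization needs: Custom parsers tend to be brittle, and requirements often change after the initial deployment.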

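The Solution

The stack combines a set of specialized open-source libraries, each aimed at one aspect of document processing:

Format conversion: Convert PDF, DOCX, and spreadsheets to Markdown.
Visual understanding: Use vision-based models (for example, Zerox) for image-heavy documents or complex layouts.
Web scraping: Use tools such as Jina AI Reader to extract clean text from web pages, even when the content is dynamic.
Metadata extraction: Automatically tag documents with language, token counts, and other metadata to improve downstream processing.
Service-oriented architecture: Expose the functionality over HTTP with a framework such as FastAPI for centralized, scalable processing.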

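Key Components and Libraries

Core Requirements

A production-ready document-to-Markdown conversion engine should:

Minimize custom code by building on existing libraries.
Support multiple formats (PDF, DOCX, XLSX, HTML, and so on).
Use a vision LLM (such as GPT-4o via Zerox) for highly visual documents.
Use a dedicated scraping tool (such as the Jina AI Reader API) to fetch and clean HTML.
Extract useful metadata (for example, language detection and token counts) for downstream applications.
Run on an HTTP server for centralized, scalable access.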
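
The requirements above imply a routing decision for every input: web pages go to a scraper, ordinary files go to a library-based converter, and highly visual files go to a vision LLM. A minimal sketch of that decision follows. The `Route` enum, the `choose_route` function, and the extension allow-list are hypothetical names introduced for illustration, not part of the stack described here.

```python
from enum import Enum
from pathlib import Path


class Route(str, Enum):
    STANDARD = "standard"      # library-based conversion (for example, Docling)
    VISION_LLM = "vision_llm"  # vision-model conversion (for example, Zerox)
    WEB_READER = "web_reader"  # URL scraping (for example, Jina AI Reader)


# Hypothetical allow-list; a real service would define its own.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".html"}


def choose_route(source: str, use_llm: bool = False) -> Route:
    """Pick a processing route for a URL or a file name."""
    # URLs always go to the web reader.
    if source.startswith(("http://", "https://")):
        return Route.WEB_READER
    # Reject files outside the allow-list before doing any work.
    suffix = Path(source).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    # The caller decides whether a document is visual enough for the LLM path.
    return Route.VISION_LLM if use_llm else Route.STANDARD
```

Leaving the vision-versus-standard choice to the caller through a flag keeps the router simple, at the cost of asking the client to know which documents are layout-heavy.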

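Libraries Used

Docling: Simplifies handling of multiple formats and provides advanced PDF understanding.
Zerox: Provides a vision-based OCR pipeline that converts documents with complex layouts into Markdown.
Jina AI Reader: Converts any URL into LLM-friendly Markdown by handling dynamic content and stripping clutter.
langdetect: Automatically detects a document's language.
FastAPI: Provides an asynchronous, high-performance HTTP server.
Pandoc/MarkItDown: (Optional) Additional tools for converting office documents to Markdown.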

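Implementation Details

The stack is exposed through two endpoints:

**/process/document**: Accepts a file upload (for example, PDF or DOCX) and converts it to Markdown using either the standard pipeline or the LLM-based approach.
**/process/url**: Accepts a URL and uses the Jina AI Reader API to fetch the web page and convert its content to Markdown.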
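
From the client side, a call to `/process/url` might be built as in the sketch below, using only the standard library. The base URL and API key are placeholders. The sketch also assumes the target URL travels as a query parameter, which is how FastAPI usually treats a single non-body parameter, so check the service's generated OpenAPI docs before relying on it. The `X-API-Key` header name follows from the `x_api_key` header dependency in the server code.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: where the service is running
API_KEY = "change-me"               # assumption: must match the server's configured key


def build_url_request(page_url: str) -> urllib.request.Request:
    """Build a POST request for /process/url without sending it."""
    # Assumption: the target URL is passed as a query parameter named "url".
    query = urllib.parse.urlencode({"url": page_url})
    return urllib.request.Request(
        f"{BASE_URL}/process/url?{query}",
        method="POST",
        headers={"X-API-Key": API_KEY},
    )


def fetch_markdown(page_url: str) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_url_request(page_url)) as response:
        return json.load(response)
```

`/process/document` differs in that it expects multipart form data with a `file` field and an optional `use_llm` field, which is easier to send with an HTTP client library than with `urllib`.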

Example Code: FastAPI Endpoints

from datetime import datetime

from fastapi import Depends, FastAPI, File, Form, Header, HTTPException, UploadFile
from pydantic import HttpUrl

from document_processing import (
    get_markdown_from_url,
    process_doc_standard,
    process_doc_with_llm,
)
from file_utils import (
    count_tokens,
    detect_language,
    get_sample_text,
    validate_uploaded_file,
)
from logger import setup_logger
from models import ProcessDocumentResponse, Settings, TokenCount

settings = Settings()
logger = setup_logger(__name__)


def api_key_auth(x_api_key: str = Header(None)):
    # Reject any request whose X-API-Key header does not match the configured key.
    if x_api_key != settings.api_key.get_secret_value():
        raise HTTPException(status_code=401, detail="Invalid API key")


# The auth dependency applies to every route, including the health check.
app = FastAPI(dependencies=[Depends(api_key_auth)])


@app.get("/", include_in_schema=False)
async def health_check():
    return {
        "status": "ok",
        "timestamp": datetime.now().isoformat(),
        "service": "Modern Document Processing Stack",
    }


@app.post("/process/document", response_model=ProcessDocumentResponse)
async def process_document(
    file: UploadFile = File(...),
    use_llm: bool = Form(default=False),
) -> ProcessDocumentResponse:
    # Validate size and type first; this also yields the raw bytes and MIME type.
    contents, filename, mime_type = await validate_uploaded_file(
        file=file, max_file_size=settings.max_file_size, use_llm=use_llm
    )
    if use_llm:
        markdown = await process_doc_with_llm(file_path=filename)
    else:
        markdown = await process_doc_standard(filename, contents)
    logger.info(
        f"Successfully processed document: {filename}. Detected format: {mime_type}."
    )
    # Detect the language from a sample rather than the full document.
    language = detect_language(get_sample_text(markdown))
    await file.close()
    return ProcessDocumentResponse(
        markdown=markdown,
        language=language,
        mimetype=mime_type,
        token_count=TokenCount(
            cl100k_base=count_tokens(text=markdown, encoding_name="cl100k_base"),
            o200k_base=count_tokens(text=markdown, encoding_name="o200k_base"),
        ),
    )


@app.post("/process/url", response_model=ProcessDocumentResponse)
async def process_url(url: HttpUrl) -> ProcessDocumentResponse:
    markdown = get_markdown_from_url(url)
    if markdown is None:
        raise HTTPException(
            status_code=400, detail=f"Unable to fetch markdown from URL: {url}"
        )
    language = detect_language(get_sample_text(markdown))
    return ProcessDocumentResponse(
        markdown=markdown,
        language=language,
        mimetype="text/html",
        token_count=TokenCount(
            cl100k_base=count_tokens(text=markdown, encoding_name="cl100k_base"),
            o200k_base=count_tokens(text=markdown, encoding_name="o200k_base"),
        ),
    )
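
The endpoint code imports `get_sample_text`, `detect_language`, and `count_tokens` from a `file_utils` module that is not shown. The sketch below is one plausible shape for those helpers, not the author's implementation. It assumes `langdetect` for language detection and `tiktoken` for token counting (`cl100k_base` and `o200k_base` are tiktoken encoding names, but the article does not name the tokenizer library). The 1,000-character sample size is also an assumption.

```python
from typing import Optional


def get_sample_text(markdown: str, max_chars: int = 1000) -> str:
    """Return a short prefix of the document to use for language detection."""
    # Assumption: a prefix is representative enough; the real helper may differ.
    return markdown[:max_chars]


def detect_language(text: str) -> Optional[str]:
    """Return a language code such as "en", or None if detection fails."""
    # Third-party imports are kept local so the module loads without them.
    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException

    try:
        return detect(text)
    except LangDetectException:
        # Raised, for example, when the text has no usable features.
        return None


def count_tokens(text: str, encoding_name: str) -> int:
    """Count tokens under a named tiktoken encoding."""
    import tiktoken

    encoding = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() encodes text that looks like a special token
    # as ordinary text instead of raising an error.
    return len(encoding.encode(text, disallowed_special=()))
```

Sampling keeps language detection cheap on long documents, while token counts are computed over the full Markdown because downstream budgeting needs the exact figure.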

Example Code: Document Conversion Functions

from io import BytesIO
from typing import Optional

import requests
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
from docling_core.types.io import DocumentStream
from pydantic import HttpUrl
from pyzerox import zerox

from logger import setup_logger
from models import AcceptedMimeTypes

logger = setup_logger(__name__)


async def process_doc_with_llm(file_path: str) -> str:
    """
    Process a document with the LLM-based approach via pyzerox.

    :param file_path: Path to the document.
    :return: Markdown string joined from all pages.
    """
    result = await zerox(file_path=file_path)
    return "\n\n".join([page.content for page in result.pages])


async def process_doc_standard(filename: str, contents: bytes) -> str:
    """
    Process a document with the standard conversion pipeline.

    :param filename: Name of the document.
    :param contents: Byte content of the document.
    :return: Converted Markdown string.
    """
    converter = DocumentConverter(
        allowed_formats=AcceptedMimeTypes().get_accepted_input_formats(),
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=StandardPdfPipeline, backend=PyPdfiumDocumentBackend
            ),
            InputFormat.DOCX: WordFormatOption(
                pipeline_cls=SimplePipeline,
            ),
        },
    )
    # Convert from an in-memory stream so the upload never touches disk.
    converter_result = converter.convert(
        DocumentStream(name=filename, stream=BytesIO(contents))
    )
    return converter_result.document.export_to_markdown()


def get_markdown_from_url(url: HttpUrl) -> Optional[str]:
    """
    Retrieve and clean Markdown content from a URL using the Jina Reader API.

    :param url: URL to process.
    :return: Cleaned Markdown string on success, otherwise None.
    """
    jina_ai_prefix = "https://r.jina.ai/"
    markdown_url = jina_ai_prefix + str(url)
    try:
        response = requests.get(markdown_url)
    except requests.RequestException as e:
        logger.error(f"Error fetching URL {markdown_url}: {e}")
        return None
    if response.status_code != 200:
        logger.error(
            f"Unable to fetch markdown from URL: {markdown_url} "
            f"(status code: {response.status_code})"
        )
        return None
    # The reader prepends metadata; keep only what follows the content marker.
    split_markdown = response.text.split("Markdown Content:")
    if len(split_markdown) < 2:
        logger.error(f"Unable to clean markdown for URL: {markdown_url}")
        return None
    clean_markdown = split_markdown[1].strip()
    return clean_markdown
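
The cleaning step in `get_markdown_from_url` depends on the reader's response carrying a few metadata lines before a `Markdown Content:` marker. That step can be isolated as a pure function and tested without a network call, as sketched below. `extract_markdown` is a hypothetical helper, and the sample response is a simplified illustration of the layout rather than captured output.

```python
from typing import Optional

MARKER = "Markdown Content:"


def extract_markdown(reader_response: str) -> Optional[str]:
    """Return the text after the first content marker, or None if it is absent."""
    # partition splits on the first occurrence only, so a page that itself
    # contains the marker text is not truncated at the second occurrence.
    _, found, body = reader_response.partition(MARKER)
    if not found:
        return None
    return body.strip()


sample = (
    "Title: Example\n\n"
    "URL Source: https://example.com\n\n"
    "Markdown Content:\n"
    "# Hello\n\n"
    "Body text.\n"
)
print(extract_markdown(sample))
```

Compared with `split(...)[1]`, `partition` keeps everything after the first marker, which matters only in the rare case where the page body repeats the marker string.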

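Deployment and Scalability

To deploy the stack:

Dockerize: Containerize the service so it can be deployed easily on any cloud platform.
Cloud platform: Deploy with Railway, Kubernetes, or a similar solution for high availability.
Caching and rate limiting: Cache processed documents and apply rate limiting to keep performance stable under heavy load.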
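
The caching and rate-limiting item can be sketched as follows: a cache keyed by a hash of the uploaded bytes, and a per-client sliding-window limiter. `MarkdownCache` and `SlidingWindowLimiter` are hypothetical classes written for illustration. Both keep state in process memory, so a multi-instance deployment would likely need a shared store instead.

```python
import hashlib
import time
from collections import deque
from typing import Callable, Deque, Dict, Optional


class MarkdownCache:
    """In-memory cache keyed by the SHA-256 digest of the document bytes."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get_or_convert(self, contents: bytes, convert: Callable[[bytes], str]) -> str:
        key = hashlib.sha256(contents).hexdigest()
        if key not in self._store:
            # Only run the expensive conversion on a cache miss.
            self._store[key] = convert(contents)
        return self._store[key]


class SlidingWindowLimiter:
    """Allow at most max_calls per client within window_seconds."""

    def __init__(self, max_calls: int, window_seconds: float) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls: Dict[str, Deque[float]] = {}

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        calls = self._calls.setdefault(client_id, deque())
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True
```

If the same file can be processed through both the standard and the LLM path, the cache key should also include that choice, since the two paths can produce different Markdown.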

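Real-World Use Cases and Benefits

Use Cases

LLM ingestion: Convert messy user documents into clean Markdown for better embeddings.
Retrieval-augmented generation (RAG): Supply RAG pipelines with structured input.
Knowledge base construction: Ingest content from many sources to build a searchable, structured database.
Content migration: Simplify migration from legacy systems by standardizing document formats.

Benefits

Efficiency: Eliminates manual parsing and custom conversion scripts.
Scalability: Handles high throughput through asynchronous processing and containerized deployment.
Flexibility: Easy to extend to new formats and to integrate with different AI models.
Cost-effectiveness: Uses open-source libraries and cloud-native deployment to reduce overhead.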

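Conclusion

A modern document processing stack gives LLM-based applications a solid foundation by converting many document formats into clean, metadata-rich Markdown.

Whether you are building a RAG system, training embeddings, or assembling a knowledge base, this stack lets you focus on building intelligent applications while keeping the data ingestion pipeline consistent and scalable.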
