[community] Propose PDFRouterParser and Loader #30847

Open: pprados wants to merge 5 commits into master from pprados/pdf-router

@pprados (Contributor) commented Apr 15, 2025

Load PDFs using different parsers based on the metadata of the PDF or the body of the first page.

Routes are defined as a list of tuples. Each tuple contains a name, a dictionary mapping metadata keys to regex patterns, and the parser to use. The special key "page1" matches a regex against the text of the first page. List the routes in priority order, because the first matching route is used. Add a default route ("default", {}, parser) at the end to catch all remaining PDFs. This code is similar to MimeTypeBasedParser, but it routes on the content of the PDF file rather than on its MIME type.

Sample:

    from langchain_community.document_loaders.parsers import PDFPlumberParser
    from langchain_community.document_loaders.parsers.pdf import (
        PyMuPDFParser,
        PyPDFium2Parser,
    )
    # PDFRouterLoader is the loader proposed in this PR; this import path is assumed.
    from langchain_community.document_loaders import PDFRouterLoader

    filename = "example.pdf"  # placeholder path

    routes = [
        # (name, {metadata key or "page1": regex}, parser)
        ("Microsoft", {"producer": "Microsoft", "creator": "Microsoft"}, PyMuPDFParser()),
        ("LibreOffice", {"producer": "LibreOffice"}, PDFPlumberParser()),
        ("Xdvipdfmx", {"producer": "xdvipdfmx.*", "page1": "Hello"}, PDFPlumberParser()),
        # Empty pattern dict: catches every PDF not matched above
        ("default", {}, PyPDFium2Parser()),
    ]
    loader = PDFRouterLoader(filename, routes)
    docs = loader.load()
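
The first-match rule described above can be illustrated with a small standalone sketch. This is not the code from the PR: `select_route` is a hypothetical helper, the parsers are replaced by plain strings, and two behaviours are assumptions made here for illustration (patterns are applied with `re.search`, and a metadata key that is absent from the PDF counts as a non-match).

```python
import re
from typing import Any, Mapping, Optional, Sequence, Tuple

# (name, {metadata key or "page1": regex}, parser)
Route = Tuple[str, Mapping[str, str], Any]


def select_route(
    routes: Sequence[Route],
    metadata: Mapping[str, Any],
    first_page_text: str,
) -> Optional[Tuple[str, Any]]:
    """Return (name, parser) of the first route whose patterns all match.

    Hypothetical helper: re.search semantics and "missing key means no
    match" are assumptions, not behaviour confirmed by the PR.
    """
    for name, patterns, parser in routes:
        matched = True
        for key, pattern in patterns.items():
            # "page1" targets the first page text; other keys target metadata.
            value = first_page_text if key == "page1" else metadata.get(key)
            if value is None or re.search(pattern, str(value)) is None:
                matched = False
                break
        if matched:
            # An empty pattern dict always reaches this point (default route).
            return name, parser
    return None


# Strings stand in for parser instances so the sketch has no dependencies.
routes = [
    ("Microsoft", {"producer": "Microsoft", "creator": "Microsoft"}, "parser_a"),
    ("Xdvipdfmx", {"producer": "xdvipdfmx.*", "page1": "Hello"}, "parser_b"),
    ("default", {}, "parser_c"),
]

chosen = select_route(routes, {"producer": "xdvipdfmx (20200315)"}, "Hello world")
print(chosen)
```

Because the routes are scanned in order, placing the default route anywhere but last would shadow every route after it.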

@pprados mentioned this pull request Apr 15, 2025
2 tasks
@pprados force-pushed the pprados/pdf-router branch from cbdaac0 to 8356398 on April 15, 2025 14:21
@pprados force-pushed the pprados/pdf-router branch 5 times, most recently from 41bd854 to fdf5c9a, on April 15, 2025 14:57
@pprados force-pushed the pprados/pdf-router branch from fdf5c9a to 007180d on April 16, 2025 07:03
@pprados marked this pull request as ready for review April 17, 2025 12:05
@dosubot added the labels size:L (this PR changes 100-499 lines, ignoring generated files), community (related to langchain-community), and Ɑ: doc loader (related to the document loader module, not documentation) Apr 17, 2025
@pprados force-pushed the pprados/pdf-router branch from 52a8dc0 to dfda5a0 on April 17, 2025 13:55