[community] Propose PDFRouterParser and Loader #30847

pprados · 2025-04-15T14:15:32Z

Load PDFs using different parsers based on the metadata of the PDF or the body of the first page.

Routes are defined as a list of tuples. Each tuple contains a route name, a dictionary that maps metadata keys to regex patterns, and the parser to use. The special key "page1" applies its regex to the text of the first page instead of to a metadata field. List the routes in priority order, because the first matching route is used. Add a default route ("default", {}, parser) at the end to catch all remaining PDFs. This is similar to MimeTypeBasedParser, but the routing is based on the content of the PDF file.

Sample:

    from langchain_community.document_loaders.parsers import PDFPlumberParser
    from langchain_community.document_loaders.parsers.pdf import (
        PyMuPDFParser,
        PyPDFium2Parser,
    )

    # PDFRouterLoader is the loader proposed in this PR; import it from the
    # module where the PR defines it.

    filename = "example.pdf"  # placeholder path to the PDF to load

    routes = [
        # (name, {metadata key or "page1": regex}, parser)
        ("Microsoft", {"producer": "Microsoft", "creator": "Microsoft"}, PyMuPDFParser()),
        ("LibreOffice", {"producer": "LibreOffice"}, PDFPlumberParser()),
        ("Xdvipdfmx", {"producer": "xdvipdfmx.*", "page1": "Hello"}, PDFPlumberParser()),
        # Catch-all route: an empty pattern dict matches every PDF.
        ("default", {}, PyPDFium2Parser()),
    ]
    loader = PDFRouterLoader(filename, routes)
    docs = loader.load()
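The first-match rule described above can be illustrated without any PDF library. The following is only a sketch, not the code from this PR: `select_route` is a hypothetical helper, the parsers are replaced by placeholder strings, and it assumes that every pattern in a route must match (logical AND) and that patterns are applied with `re.search`. The PR's actual matching semantics may differ.

```python
import re
from typing import Any, Mapping, Optional, Sequence, Tuple

# Hypothetical route shape mirroring the sample: (name, {key: regex}, parser).
Route = Tuple[str, Mapping[str, str], Any]


def select_route(
    routes: Sequence[Route],
    metadata: Mapping[str, Any],
    first_page_text: str,
) -> Optional[Route]:
    """Return the first route whose patterns all match, or None if none do."""
    for route in routes:
        _name, patterns, _parser = route
        matched = True
        for key, pattern in patterns.items():
            # "page1" targets the first page's text; other keys target metadata.
            # Assumption: a missing metadata key is treated as an empty string.
            value = first_page_text if key == "page1" else str(metadata.get(key, ""))
            if re.search(pattern, value) is None:
                matched = False
                break
        if matched:
            return route
    return None


# Placeholder strings stand in for real parser instances.
routes = [
    ("Microsoft", {"producer": "Microsoft", "creator": "Microsoft"}, "parser-a"),
    ("Xdvipdfmx", {"producer": "xdvipdfmx.*", "page1": "Hello"}, "parser-b"),
    ("default", {}, "parser-c"),
]

chosen = select_route(routes, {"producer": "xdvipdfmx (20200315)"}, "Hello world")
print(chosen[0])  # prints "Xdvipdfmx"
```

Because the default route has an empty pattern dictionary, its inner loop never runs and it always matches, which is why it belongs last in the list.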


pprados mentioned this pull request Apr 15, 2025

Refactoring PDF loaders: all #28970

Closed

2 tasks

Propose PDFRouterParser and Loader

8356398

pprados force-pushed the pprados/pdf-router branch from cbdaac0 to 8356398 Compare April 15, 2025 14:21

Merge remote-tracking branch 'upstream/master' into pprados/pdf-router

b5221f2

pprados force-pushed the pprados/pdf-router branch 5 times, most recently from 41bd854 to fdf5c9a Compare April 15, 2025 14:57

Propose PDFRouterParser and Loader

007180d

pprados force-pushed the pprados/pdf-router branch from fdf5c9a to 007180d Compare April 16, 2025 07:03

Merge branch 'master' into pprados/pdf-router

5c2f21b

pprados marked this pull request as ready for review April 17, 2025 12:05

dosubot bot added labels on Apr 17, 2025: size:L (This PR changes 100-499 lines, ignoring generated files), community (Related to langchain-community), Ɑ: doc loader (Related to document loader module, not documentation)

Merge branch 'master' into pprados/pdf-router

dfda5a0

pprados force-pushed the pprados/pdf-router branch from 52a8dc0 to dfda5a0 Compare April 17, 2025 13:55
