pypdf vs PyMuPDF: A Security-Focused Comparison for Python PDF Handling

When choosing a PDF library in Python, developers often weigh pypdf (pure Python, lightweight, BSD-licensed) against PyMuPDF (also known by its legacy import name, fitz; a high-performance binding to the MuPDF C library, available under AGPLv3 with commercial licensing options). PyMuPDF excels in speed, rendering, and advanced features such as accurate text extraction and annotation support, but security is a critical differentiator, especially for applications that process untrusted PDFs (uploads, email attachments, or AI RAG pipelines).


Here's a balanced, up-to-date comparison centered on security, with notes on architecture implications.

Core Architectural Differences Impacting Security

  • pypdf: Entirely pure Python. No compiled binaries or external C dependencies. This removes the attack surface of native-code vulnerabilities (buffer overflows, memory corruption) and leaves logic errors in the Python code, such as infinite loops or excessive resource usage during parsing, as the main class of bug.
  • PyMuPDF: Python bindings to the mature MuPDF (Fitz) engine. It leverages highly optimized C code for parsing and rendering, which delivers superior performance but inherits risks from the underlying C library. Installation can be more complex (wheels help, but platform issues occur), and the AGPL license may require commercial licensing for closed-source use.

Pure Python (pypdf) generally means easier sandboxing and fewer supply-chain surprises from native libraries. C-based engines (PyMuPDF) benefit from decades of hardening in MuPDF, although older MuPDF CVEs have included memory-corruption issues that could lead to crashes or, in theory, code execution.

Recent and Historical Vulnerabilities

pypdf Security Profile (as of March 2026):

  • CVE-2026-33699 (Infinite loop in DictionaryObject.read_from_stream): A recent DoS via specially crafted PDFs in non-strict mode. Fixed in 6.9.2 (March 2026). This is a classic recovery-logic bug where malformed structures cause endless loops, leading to high CPU usage.
  • Multiple prior DoS/resource-exhaustion issues in 2025–2026:
    • Long runtimes or high memory from large arrays, oversized /Length values, ASCIIHexDecode/RunLengthDecode/LZWDecode filters, missing /Root objects, and outlines/bookmarks.
    • Examples: CVE-2026-33123, CVE-2026-31826, CVE-2026-24688, CVE-2025-66019, etc. Many involve crafted PDFs triggering quadratic behavior or excessive allocation.
  • Pattern: pypdf has seen a cluster of availability-focused vulnerabilities (DoS via loops/memory) in recent releases. The maintainers respond quickly with patches, but the forgiving parser (non-strict mode) has been a repeated vector. Latest version (6.9.2+) resolves the most recent issues.

PyMuPDF Security Profile:

  • CVE-2026-3029 (Path Traversal/Arbitrary File Write in embedded_get()): A more severe issue (March 2026) allowing attackers to write files anywhere on the filesystem via crafted embedded file metadata when extracting without supplying an output path. Fixed in 1.26.7. This impacts confidentiality/integrity more directly than pure DoS.
  • Other issues:
    • NULL pointer dereference (CVE-2025-55780) leading to crashes on malformed EPUBs.
    • Older MuPDF-derived CVEs (pre-2020s) included memory corruption, double-free, and potential RCE in rendering/JBIG2/font handling.
    • Occasional logic issues like infinite loops in specific functions (e.g., fill_textbox or circular bookmarks), but these are rarer and often fixed promptly.
  • Pattern: Fewer recent DoS-style logic bugs than pypdf, but flaws tend to be higher-severity when they do occur (path handling, parsing edge cases in native code). The underlying MuPDF has a long history of security fixes for rendering and font issues. Some vulnerability scanners report no open high-severity issues against recent PyMuPDF releases (around 1.27.x) themselves, though MuPDF continues to receive CVEs.

Severity and Exploitability Comparison

  • pypdf: Vulnerabilities are mostly Moderate (CVSS ~4–7), focused on Availability (CPU/memory exhaustion). Easy to trigger remotely with a small crafted PDF in upload scenarios. No known RCE or data exfiltration. Impact is higher in serverless/cloud environments (costly hangs, worker crashes).
  • PyMuPDF: Can have higher-impact issues (path traversal leading to arbitrary file writes; potential memory corruption inherited from MuPDF). These are more dangerous in untrusted environments because they could lead to filesystem compromise or crashes that are harder to contain. Rendering-heavy features also expose more attack surface when run on untrusted input.

Both libraries improve rapidly: pypdf through frequent Python-level fixes, PyMuPDF through upstream MuPDF hardening.

Mitigation and Best Practices for Each

For pypdf:

  • Upgrade immediately to >=6.9.2.
  • Use strict=True on untrusted input where possible. Non-strict (recovery) mode has been the repeated vector, though strict mode may reject imperfect PDFs.
  • Implement timeouts, resource limits (CPU/memory caps on workers), and sandboxing (e.g., containers). Pre-validating files with qpdf adds a further check.
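
The strict-mode and timeout advice above can be combined by parsing each upload in a short-lived child interpreter. The following is a minimal sketch under stated assumptions, not a mechanism pypdf itself provides: run_isolated and CHILD_SCRIPT are hypothetical names, the 10-second budget is arbitrary, and the child script assumes pypdf is installed in the interpreter that runs it.

```python
import subprocess
import sys

# Child script (assumes pypdf is installed): parse in strict mode and
# print the page count. It runs in a separate process, so a hang or
# crash cannot take the calling worker down with it.
CHILD_SCRIPT = """
import sys
from pypdf import PdfReader

reader = PdfReader(sys.argv[1], strict=True)
print(len(reader.pages))
"""


def run_isolated(script: str, args: list[str], timeout_s: float = 10.0) -> tuple[str, str]:
    """Run `script` in a fresh interpreter and return (status, detail)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script, *args],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising this exception.
        return ("timeout", "")
    if proc.returncode != 0:
        return ("error", proc.stderr.strip())
    return ("ok", proc.stdout.strip())
```

A caller might use run_isolated(CHILD_SCRIPT, ["upload.pdf"]) and reject the file on any status other than "ok". The wall-clock timeout bounds CPU time only indirectly; memory caps would still come from the container or worker configuration.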

For PyMuPDF:

  • Upgrade to >=1.26.7 (or latest 1.27.x) to address path traversal.
  • When extracting embedded files, always provide an explicit, safe output path; never rely on the file name stored in the PDF's metadata.
  • Run in isolated processes/containers due to native code.
  • Be cautious with rendering or complex operations on untrusted files.
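
One way to act on the embedded-file advice is to treat the stored file name as attacker-controlled and reduce it to a bare name inside a directory you chose. This is a sketch using only the standard library; safe_embedded_path and the "embedded.bin" fallback are hypothetical names for illustration, not part of PyMuPDF.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def safe_embedded_path(output_dir: Path, untrusted_name: str) -> Path:
    """Map an attacker-controlled file name to a path directly inside output_dir."""
    # Keep only the last path component, under both "/" and "\" separators.
    name = PureWindowsPath(PurePosixPath(untrusted_name).name).name
    # Drop control characters, then leading/trailing dots and spaces ("..", ".").
    name = "".join(ch for ch in name if ch.isprintable()).strip(" .")
    if not name:
        name = "embedded.bin"  # hypothetical fallback name
    base = output_dir.resolve()
    candidate = (base / name).resolve()
    # Final containment check; also catches an existing symlink pointing elsewhere.
    if candidate.parent != base:
        raise ValueError(f"refusing to write outside {base}")
    return candidate
```

The embedded bytes themselves would come from the library (to my understanding, Document.embfile_get() in PyMuPDF's Python API) and be written to the returned path, so the name in the PDF never decides where the file lands.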

Shared Defenses (Recommended for Both):

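  • Pre-process PDFs with a tool that rewrites and normalizes the file, such as qpdf (e.g., qpdf --linearize), or with a commercial PDF cleaner. Rewriting normalizes structure; it does not by itself remove active content.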
  • Use temporary directories with strict permissions.
  • Monitor for anomalous resource usage.
  • Prefer strict parsing/validation modes.
  • For high-risk apps (public uploads), consider combining libraries or using serverless functions with timeouts.
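
The temporary-directory advice can be sketched as a small staging step that runs before either library touches the file. Everything here is an assumption made for illustration: stage_upload is a hypothetical helper, the 20 MB cap is arbitrary, and the check for "%PDF-" in the first 1024 bytes is a cheap sanity test chosen for this sketch, not real validation.

```python
import os
import tempfile
from pathlib import Path

MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # assumed cap; tune for your workload


def stage_upload(data: bytes, max_bytes: int = MAX_UPLOAD_BYTES) -> Path:
    """Write an upload into a private temporary directory and return its path."""
    if len(data) > max_bytes:
        raise ValueError("upload exceeds size cap")
    if b"%PDF-" not in data[:1024]:
        raise ValueError("missing PDF header")
    # mkdtemp() creates a directory accessible only to the creating user.
    workdir = Path(tempfile.mkdtemp(prefix="pdf-stage-"))
    target = workdir / "input.pdf"
    # O_EXCL refuses to reuse an existing file; 0o600 keeps it owner-only.
    fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return target
```

The caller is responsible for removing the directory afterwards, for example with shutil.rmtree(path.parent) in a finally block.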

Which Is "More Secure"?

  • If your priority is minimizing severe impact (RCE, file writes): pypdf currently has the edge because it is pure Python and its recent issues have been DoS-only. It is also simpler to audit and sandbox.
  • If you need performance and advanced features: PyMuPDF is excellent but requires stricter input validation and isolation because of its C foundation and occasional higher-severity flaws.
  • Neither is immune. PDF parsing is inherently complex and risky, and no library is perfectly secure against maliciously crafted files.

In production, many teams use pypdf for simple manipulation (safer defaults for basic workflows) and PyMuPDF for heavy lifting (text extraction, rendering), with strong perimeter controls around both.

Recommendation

Audit your usage:

  • For lightweight, dependency-free PDF reading/writing → pypdf 6.9.2+ with strict mode and timeouts.
  • For speed, accuracy, or rich features → PyMuPDF latest with careful embedded-file handling and sandboxing.

Always keep both updated, as the PDF ecosystem evolves quickly with new attack vectors. If you handle untrusted documents, fuzz your parsing code (for example, by feeding it mutated or deliberately malformed PDFs).
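
A startup check can catch a deployment that has fallen behind the patched releases named in this post. The sketch below uses importlib.metadata from the standard library. The function names are hypothetical, the version parsing is deliberately naive (the first three numeric components only, so a pre-release such as "6.9.2rc1" counts as 6.9.2), and it assumes the installed distributions are registered under the names "pypdf" and "PyMuPDF". The packaging library offers stricter version comparison if it is available.

```python
from importlib import metadata
from itertools import takewhile

# Minimum patched versions named in this post.
MIN_VERSIONS = {"pypdf": (6, 9, 2), "PyMuPDF": (1, 26, 7)}


def parse_version(text: str) -> tuple[int, int, int]:
    """Naive parse: leading digits of the first three dot-separated parts."""
    numbers = []
    for piece in text.split(".")[:3]:
        leading = "".join(takewhile(str.isdigit, piece))
        numbers.append(int(leading) if leading else 0)
    numbers += [0] * (3 - len(numbers))  # pad "1.27" to (1, 27, 0)
    return (numbers[0], numbers[1], numbers[2])


def below_floor(installed: dict[str, str]) -> list[str]:
    """Return the names of packages older than their minimum version."""
    return sorted(
        name
        for name, version in installed.items()
        if name in MIN_VERSIONS and parse_version(version) < MIN_VERSIONS[name]
    )


def installed_versions() -> dict[str, str]:
    """Look up installed versions; packages that are absent are skipped."""
    found = {}
    for name in MIN_VERSIONS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return found
```

Calling below_floor(installed_versions()) at service start and refusing to run when the list is non-empty turns the upgrade advice into an enforced check.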

Bottom line: pypdf's recent issues are annoying but lower-impact DoS; PyMuPDF's are potentially more damaging but less frequent in the Python wrapper. Choose based on your threat model, performance needs, and willingness to isolate native code.

Have you encountered PDF-related security issues in your projects? Which library do you lean toward for sensitive workflows? Share in the comments.
