Sprint 10 Contract: Pre-Conversion PDF Page Chunking
Status: Implemented
Last updated: 2026-05-08
Objective
Add an opt-in pre-conversion workflow for long PDFs:
- Split each source PDF into fixed-size chunk PDFs of 20 pages.
- Convert each chunk PDF independently through the existing MinerU conversion pipeline.
- Do not merge the generated Markdown files.
The feature is intended to reduce long-document memory/runtime pressure and make partial progress usable when one chunk fails. It must preserve strict-local execution, keep MinerU 3.1.0 as the only conversion engine, and keep default tests independent of real MinerU, GPU, CUDA, model files, network access, Obsidian, LaTeX tooling, and samples/.
Research Summary
Sources checked on 2026-05-08:
- pypdf PyPI: current release observed as 6.10.2, uploaded 2026-04-15; metadata lists BSD-3-Clause, Python >=3.9, Python 3.12 support, and describes pypdf as a pure-Python PDF library capable of splitting, merging, cropping, and transforming PDF pages.
- pypdf merging docs: PdfWriter.append() can append a complete or partial source PDF; examples use zero-based page ranges such as (0, 10). The docs recommend append or merge over low-level add_page/insert_page.
- pypdf streaming docs: PdfReader and PdfWriter support file-like objects, but the project should write chunk PDFs to local disk because MinerU accepts local file paths.
- pypdf PdfWriter docs: writer operations clone/copy PDF objects into the destination. The docs warn that cloning linked objects can copy more than just the visible page object in some cases, so chunk output size must be checked in tests.
- MinerU CLI tools docs: the direct mineru CLI accepts -p/--path, -o/--output, -s/--start, and -e/--end; without --api-url, it launches a temporary local mineru-api.
- PyMuPDF PyPI: PyMuPDF is fast and local, but PyPI lists dual licensing under GNU AGPL v3 or an Artifex commercial license.
- pikepdf page assembly docs: pikepdf can split pages and transfer page-associated data; it is a capable fallback candidate but adds a QPDF-backed dependency and is not needed for a first implementation.
Recommended Package Decision
Use pypdf for Sprint 10.
Rationale:
- It is pure Python and fits the current Python 3.12 + uv workflow.
- It has permissive BSD-3-Clause metadata on PyPI.
- It directly supports page-level PDF assembly with PdfReader/PdfWriter.
- It avoids adding PyMuPDF's AGPL/commercial licensing considerations for a simple split-only feature.
- It avoids adding pikepdf/QPDF native dependency complexity before there is evidence that pypdf cannot handle the project samples.
Recommended dependency range for implementation:
dependencies = [
"pypdf>=6.10.2,<7",
]
The implementation adds this dependency to pyproject.toml and uv.lock.
Current Precondition
- pdf2md convert already converts one PDF or a directory of PDFs.
- Existing conversion output per input includes:
  - Markdown
  - optional metadata JSON, enabled by default
  - <stem>.report.md
  - assets directory
  - optional raw MinerU output
- plan_outputs() already enforces overwrite and output-root safety.
- convert_input() already handles directory batches and continues after per-file failures.
- The MinerU adapter accepts one PDF path at a time and runs the direct local mineru CLI.
- samples/ is local and untracked; do not commit sample PDFs or generated outputs.
Touched Surfaces
Allowed during implementation:
- pyproject.toml
- uv.lock
- src/pdf2md/pdf_splitter.py
- src/pdf2md/paths.py
- src/pdf2md/conversion.py
- src/pdf2md/cli.py
- src/pdf2md/ir.py only if new warning codes or chunk provenance records are required
- src/pdf2md/metadata.py only for chunk provenance fields
- src/pdf2md/report.py only to expose chunk provenance in reports
- tests/test_pdf_splitter.py
- tests/test_conversion.py
- tests/test_cli.py
- tests/test_paths.py
- tests/test_metadata.py
- tests/integration/ for mocked chunk workflow coverage
- README.md
- docs/V1IMPLEMENTATIONPLAN.md
- docs/Sprints/SPRINT10CONTRACT.md
- PLAN.md
- PROGRESS.md
Not allowed:
- Runtime engine selection or alternate conversion engines.
- Use of cloud OCR, remote LLM/VLM, hosted renderers, hosted document parsers, --api-url, router mode, HTTP client backends, remote APIs, or remote OpenAI-compatible backends.
- Mandatory default tests requiring real MinerU, GPU, CUDA, model files, network, Obsidian, LaTeX tooling, or samples/.
- Committed files under samples/.
- Committed generated conversion outputs.
- Automatic model or package downloads triggered by import time, doctor, convert, or tests.
- Markdown merge behavior for chunk outputs.
- Claims that chunking improves formula correctness; it is only a processing-control feature.
Product Behavior
Activation:
- Chunking is opt-in and existing conversion behavior is unchanged when chunk_pages is unset.
- CLI: pdf2md convert INPUT --out OUTPUT_DIR --chunk-pages uses the default chunk size of 20 pages.
- CLI: pdf2md convert INPUT --out OUTPUT_DIR --chunk-pages 20 uses an explicit positive chunk size.
- Python API: convert_pdf(..., chunk_pages=20) and convert_input(..., chunk_pages=20).
- convert_pdf() returns ConversionResult without chunking and BatchConversionResult when chunk mode is active.
- chunk_pages must be None or a positive integer.
Chunking behavior:
- If chunk_pages is unset, current behavior remains unchanged.
- If chunk_pages=20 and a PDF has 20 or fewer pages, conversion may either:
  - convert the original PDF directly, or
  - create one chunk PDF and convert that chunk.
- Recommended: convert the original directly when total_pages <= chunk_pages to avoid unnecessary intermediate files.
- If a PDF has more than 20 pages, split it into chunk PDFs with ranges:
  - chunk 1: source pages 1-20
  - chunk 2: source pages 21-40
  - chunk N: remaining pages
- Convert chunk PDFs sequentially, not in parallel. GTX 1070 Ti 8GB memory pressure makes sequential conversion the safer default.
- If one chunk conversion fails, continue with later chunks and report the failed chunk clearly.
- Do not merge Markdown outputs.
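The range rules above can be sketched as a pure planning step that touches no files. This is only an illustration under stated assumptions: ChunkRange and plan_chunk_ranges are hypothetical names, not the project's PdfChunkPlan. Ranges are kept zero-based and half-open internally and exposed as one-based inclusive page numbers for user-facing output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRange:
    """Hypothetical planning record; not the project's PdfChunkPlan."""

    index: int  # 1-based chunk index
    start: int  # zero-based first page, inclusive
    stop: int   # zero-based end page, exclusive

    @property
    def first_page(self) -> int:
        """One-based first source page, for filenames and reports."""
        return self.start + 1

    @property
    def last_page(self) -> int:
        """One-based last source page, inclusive."""
        return self.stop


def plan_chunk_ranges(total_pages: int, chunk_pages: int = 20) -> list[ChunkRange]:
    """Return the chunk ranges for a PDF without reading or writing any file."""
    if isinstance(chunk_pages, bool) or not isinstance(chunk_pages, int) or chunk_pages <= 0:
        raise ValueError(f"chunk_pages must be a positive integer, got {chunk_pages!r}")
    if total_pages <= 0:
        raise ValueError(f"total_pages must be positive, got {total_pages!r}")
    return [
        ChunkRange(index=i, start=start, stop=min(start + chunk_pages, total_pages))
        for i, start in enumerate(range(0, total_pages, chunk_pages), start=1)
    ]
```

For a 41-page PDF this yields three ranges covering source pages 1-20, 21-40, and 41-41. A caller that follows the recommendation above would skip splitting entirely when the plan contains a single range.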
Recommended chunk output naming:
<stem>.part-001.pages-001-020.md
<stem>.part-001.pages-001-020.metadata.json
<stem>.part-001.pages-001-020.report.md
<stem>.part-001.pages-001-020.assets/
<stem>.part-002.pages-021-040.md
...
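As a rough sketch of the naming scheme above, the stem and its sibling outputs can be derived from the chunk index and the one-based inclusive page range. The helper names chunk_stem and chunk_output_names are hypothetical, and the three-digit zero padding is simply read off the examples.

```python
def chunk_stem(stem: str, index: int, first_page: int, last_page: int) -> str:
    """Build '<stem>.part-NNN.pages-AAA-BBB' from one-based, inclusive values."""
    return f"{stem}.part-{index:03d}.pages-{first_page:03d}-{last_page:03d}"


def chunk_output_names(stem: str, index: int, first_page: int, last_page: int) -> dict[str, str]:
    """Return the per-chunk output names listed in the contract (hypothetical helper)."""
    base = chunk_stem(stem, index, first_page, last_page)
    return {
        "markdown": f"{base}.md",
        "metadata": f"{base}.metadata.json",
        "report": f"{base}.report.md",
        "assets": f"{base}.assets/",
    }
```

Values above 999 still format correctly because the padding is a minimum width, but names would then stop sorting lexicographically in page order.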
Recommended chunk PDF staging:
- Use a temporary working directory.
- Delete temporary chunk PDFs after conversion completes, including when --keep-raw is enabled.
- Do not add a separate --keep-chunks flag in Sprint 10.
Provenance Requirements
Each chunk conversion must preserve original-source context.
Required chunk fields in metadata or engine options:
- original source PDF path
- original source SHA-256
- chunk PDF path when retained, or chunk PDF filename when temporary
- chunk index, 1-based
- total chunk count
- source page start, 1-based inclusive
- source page end, 1-based inclusive
- chunk page count
Page provenance must distinguish:
- chunk-local page index, starting at 0 for MinerU output
- original source page number, starting at 1 for user-facing reports
The report should include a short chunk context line, for example:
- Chunk: 2/5, source pages: 21-40
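One way to carry these fields is a small immutable record that also converts a chunk-local page index into the original source page number. This is a sketch only: the class name ChunkProvenance and its field names are assumptions, not the project's metadata schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkProvenance:
    """Hypothetical provenance record for one chunk conversion."""

    source_path: str
    source_sha256: str
    chunk_filename: str
    chunk_index: int        # 1-based
    chunk_count: int
    source_page_start: int  # 1-based, inclusive
    source_page_end: int    # 1-based, inclusive

    @property
    def chunk_page_count(self) -> int:
        return self.source_page_end - self.source_page_start + 1

    def source_page_number(self, chunk_local_index: int) -> int:
        """Map a zero-based chunk-local page index to a one-based source page."""
        if not 0 <= chunk_local_index < self.chunk_page_count:
            raise IndexError(f"page index {chunk_local_index} is outside this chunk")
        return self.source_page_start + chunk_local_index

    def report_line(self) -> str:
        """Format the short chunk context line used in reports."""
        return (
            f"- Chunk: {self.chunk_index}/{self.chunk_count}, "
            f"source pages: {self.source_page_start}-{self.source_page_end}"
        )
```

With chunk 2 of 5 covering source pages 21-40, chunk-local index 0 maps to source page 21 and index 19 maps to source page 40.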
Architecture Plan
WP10.1: PDF Splitter Module
Owner:
feature-generator-agent, mineru-integration-agent
Actions:
- Add src/pdf2md/pdf_splitter.py.
- Define project-owned PdfChunkPlan.
- Implement page counting with pypdf.PdfReader.
- Implement chunk planning without writing files.
- Implement chunk writing with pypdf.PdfWriter.append(source, (start, end)) or an equivalent tested PdfReader/PdfWriter path.
- Use zero-based half-open page ranges internally and one-based inclusive ranges for filenames and reports.
- Reject invalid chunk sizes with a clear ValueError.
- Fail clearly on encrypted/password-protected PDFs unless a later sprint adds password handling.
Expected output:
- Deterministic chunk plans and local chunk PDFs suitable for the existing MinerU adapter.
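A minimal sketch of the chunk-writing step, assuming pypdf is installed. The function name write_chunk_pdfs and the chunk PDF filename pattern are illustrative and do not describe the actual pdf_splitter.py. The pypdf calls used are PdfReader, its is_encrypted attribute, PdfWriter.append with a pages tuple, and PdfWriter.write.

```python
from pathlib import Path


def write_chunk_pdfs(source: Path, out_dir: Path, chunk_pages: int = 20) -> list[Path]:
    """Split `source` into chunk PDFs of at most `chunk_pages` pages each."""
    # Imported here so that merely importing this sketch does not require pypdf.
    from pypdf import PdfReader, PdfWriter

    if chunk_pages <= 0:
        raise ValueError(f"chunk_pages must be a positive integer, got {chunk_pages!r}")

    reader = PdfReader(source)
    if reader.is_encrypted:
        # Password handling is out of scope, so fail clearly instead of guessing.
        raise ValueError(f"encrypted PDF is not supported: {source}")

    total_pages = len(reader.pages)
    out_dir.mkdir(parents=True, exist_ok=True)

    written: list[Path] = []
    for index, start in enumerate(range(0, total_pages, chunk_pages), start=1):
        stop = min(start + chunk_pages, total_pages)  # zero-based, exclusive
        writer = PdfWriter()
        writer.append(reader, pages=(start, stop))
        # Filenames use one-based inclusive pages; this pattern is an assumption.
        target = out_dir / f"{source.stem}.part-{index:03d}.pages-{start + 1:03d}-{stop:03d}.pdf"
        writer.write(target)
        written.append(target)
    return written
```

Because pypdf may clone linked objects along with the visible pages, a test built on this sketch would also want to compare chunk file sizes against the source, as noted in the research summary.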
WP10.2: Chunk-Aware Path Planning
Owner:
feature-generator-agent
Actions:
- Extend output planning so chunk outputs are deterministic and conflict-checked before conversion starts.
- Avoid collisions between original-output stems and chunk-output stems.
- Preserve output-root escape prevention.
- Respect --overwrite.
- Keep Korean and non-ASCII source stems working.
Expected output:
- A long PDF can produce multiple planned Markdown/metadata/report/assets outputs without overwriting another chunk.
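The pre-conversion checks listed above could look roughly like the following. This is a hedged sketch and not the existing plan_outputs(): check_planned_outputs is a hypothetical helper that reports output-root escapes, duplicate planned targets, and existing files when overwrite is off.

```python
from pathlib import Path
from typing import Iterable


def check_planned_outputs(
    output_root: Path, relative_paths: Iterable[str], overwrite: bool = False
) -> list[str]:
    """Return human-readable problems found before any conversion starts."""
    root = output_root.resolve()
    problems: list[str] = []
    seen: set[Path] = set()
    for rel in relative_paths:
        target = (root / rel).resolve()
        if not target.is_relative_to(root):
            # Covers '..' segments and absolute paths that leave the root.
            problems.append(f"escapes output root: {rel}")
            continue
        if target in seen:
            problems.append(f"planned twice: {rel}")
        seen.add(target)
        if target.exists() and not overwrite:
            problems.append(f"already exists: {rel}")
    return problems
```

Running the check over every chunk output of a source before the first chunk is converted means a conflict is reported up front rather than after several chunks have already been written.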
WP10.3: Conversion Orchestration
Owner:
feature-generator-agent, mineru-integration-agent
Actions:
- Add chunk mode to convert_pdf() and convert_input().
- When chunk mode is active, split before calling the MinerU adapter.
- Reuse the existing per-PDF conversion path for each chunk PDF rather than creating a second conversion pipeline.
- Continue conversion after a chunk-level failure and aggregate a batch-like result for the source.
- Ensure temporary chunk directories are cleaned up after conversion completes, including when --keep-raw is enabled.
- Keep strict-local validation unchanged.
Expected output:
- Long PDF conversion yields separate Markdown/metadata/report/assets outputs per chunk.
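The orchestration above can be sketched as a sequential loop inside a temporary working directory. All names here (ChunkOutcome, convert_in_chunks, and the split and convert_one callables) are hypothetical stand-ins for the project's splitter and per-PDF conversion path; the real return type is BatchConversionResult, which is not modeled here.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass(frozen=True)
class ChunkOutcome:
    """Hypothetical per-chunk result; stands in for the project's result types."""

    chunk_name: str
    ok: bool
    error: Optional[str] = None


def convert_in_chunks(
    source: Path,
    split: Callable[[Path, Path], list[Path]],
    convert_one: Callable[[Path], None],
) -> list[ChunkOutcome]:
    """Split `source`, convert each chunk in order, and keep going after failures."""
    outcomes: list[ChunkOutcome] = []
    # The temporary directory, and every chunk PDF in it, is removed on exit.
    with tempfile.TemporaryDirectory(prefix="pdf2md-chunks-") as workdir:
        for chunk_pdf in split(source, Path(workdir)):
            try:
                convert_one(chunk_pdf)
            except Exception as exc:
                # Record the failure explicitly instead of skipping it silently.
                outcomes.append(
                    ChunkOutcome(chunk_pdf.name, ok=False, error=f"{type(exc).__name__}: {exc}")
                )
            else:
                outcomes.append(ChunkOutcome(chunk_pdf.name, ok=True))
    return outcomes
```

Keeping the loop strictly sequential matches the memory-pressure reasoning in the product behavior section, and recording every outcome lets the caller report a failed chunk while still attempting the later ones.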
WP10.4: Metadata And Report Chunk Provenance
Owner:
metadata-agent, obsidian-markdown-agent
Actions:
- Add chunk provenance fields without exposing raw pypdf objects.
- Keep existing required metadata fields valid.
- Keep original source provenance visible even though MinerU sees a chunk PDF as input.
- Ensure chunk reports are readable without opening JSON metadata.
Expected output:
- Users can map each output file back to original source pages.
WP10.5: CLI And API Surface
Owner:
feature-generator-agent, requirements-guard-agent
Actions:
- Add --chunk-pages [INTEGER] to pdf2md convert.
- Keep chunking disabled unless the option is present.
- Use 20 pages when --chunk-pages is present without an explicit value.
- Validate positive integer input.
- Keep --out, --metadata, --keep-raw, --recursive, --overwrite, --gpu, and strict-local behavior unchanged.
- Update README with the long-PDF workflow.
Expected output:
uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages 20
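The optional-value behavior of --chunk-pages can be illustrated with argparse. This is only a sketch: the contract does not say which CLI framework pdf2md uses, so argparse, the parser layout, and the positive_int helper are all assumptions.

```python
import argparse


def positive_int(value: str) -> int:
    """argparse type hook: accept only integers greater than zero."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {value!r}") from None
    if number <= 0:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {value!r}")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the 'pdf2md convert' argument parser."""
    parser = argparse.ArgumentParser(prog="pdf2md-convert-sketch")
    parser.add_argument("input")
    parser.add_argument("--out", required=True)
    parser.add_argument(
        "--chunk-pages",
        nargs="?",          # the value is optional
        type=positive_int,
        const=20,           # used when the flag is given without a value
        default=None,       # chunking stays off when the flag is absent
        metavar="INTEGER",
    )
    return parser
```

With this layout the option is None when absent, 20 when given bare, and the supplied integer otherwise. One caveat of an optional value in argparse is that a bare --chunk-pages placed directly before INPUT would try to consume INPUT as its value, so the bare form is safest at the end of the command line.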
WP10.6: Tests
Owner:
feature-generator-agent, evaluation-agent
Default tests must not require real MinerU or sample PDFs.
Required tests:
- Build small local PDF fixtures using pypdf blank pages or minimal test PDFs.
- Page count detection for 1, 20, 21, 40, and 41 pages.
- Chunk planning produces expected 1-based filenames and page ranges.
- Chunk writing produces PDFs with the expected page counts.
- Non-positive chunk size is rejected.
- Existing conversion without --chunk-pages is unchanged.
- Chunked conversion calls the fake adapter once per chunk.
- Chunked conversion writes separate Markdown, metadata JSON, report Markdown, and assets per chunk.
- --overwrite and conflict detection work for all planned chunk outputs.
- A failed chunk does not silently fall back and does not prevent later chunks from being attempted.
- Metadata/report contain original source PDF and source page range.
- CLI validates --chunk-pages and prints a useful summary.
Optional local validation:
- Run chunked conversion on a local samples/ PDF only by explicit user request or opt-in gate.
- Do not commit generated chunk PDFs or outputs.
Acceptance Criteria
- Sprint 10 implementation can split a PDF into 20-page chunk PDFs before MinerU conversion.
- Chunk PDFs are converted one by one using the existing direct local MinerU CLI adapter.
- Markdown outputs are separate and not merged.
- Metadata/report files show chunk index and original page range.
- Default test suite passes without real MinerU, GPU, CUDA, model files, network, Obsidian, LaTeX tooling, or samples/.
- Strict-local policy remains unchanged.
- Existing non-chunked conversion behavior remains backward-compatible.
Hard Failure Criteria
- Chunking uses a remote PDF service or uploads document content.
- Chunking introduces an alternate Markdown conversion engine.
- Default tests require real MinerU, GPU, CUDA, model files, network, or local samples.
- Chunk outputs overwrite each other or overwrite non-chunk outputs without --overwrite.
- Chunk metadata loses original source page provenance.
- The implementation merges Markdown despite this contract's non-merge requirement.
- The implementation silently skips failed chunks without warnings.
Resolved Decisions
- Activation mode: opt-in with --chunk-pages; the option defaults to 20 pages when no value is supplied.
- Chunk PDF retention: temporary chunk PDFs only; they are deleted after conversion completes.
- API return type: convert_pdf() returns a BatchConversionResult when chunk mode is active.
Verification Commands For Implementation
uv sync
uv run pytest tests/test_pdf_splitter.py tests/test_conversion.py tests/test_cli.py tests/test_paths.py tests/test_metadata.py
uv run pytest
git diff --check
git status --short --untracked-files=all
Optional local command after implementation and explicit user approval:
uv run pdf2md convert samples/MITC공부.pdf --out outputs --chunk-pages 20 --overwrite
Handoff Requirements
After implementation:
- Update PROGRESS.md with files changed, commands run, tests passed, optional sample status, known failures, residual risks, and next action.
- Do not mark the sprint implemented until independent evaluation or equivalent focused review verifies the acceptance criteria.
- Commit the completed change without including samples/ or generated outputs.
Implementation Handoff
- Files changed: pyproject.toml, uv.lock, src/pdf2md/pdf_splitter.py, src/pdf2md/conversion.py, src/pdf2md/cli.py, src/pdf2md/__init__.py, src/pdf2md/report.py, tests, README, docs/V1IMPLEMENTATIONPLAN.md, PLAN.md, and PROGRESS.md.
- Verification status: the targeted unit test run passed 42 tests; the full local test suite passed 163 tests with 1 optional skip; git diff --check passed with line-ending warnings only.
- Optional local sample conversion remains out of scope unless explicitly requested.