# Sprint 10 Contract: Pre-Conversion PDF Page Chunking
Status: Implemented
Last updated: 2026-05-08
## Objective
Add an opt-in pre-conversion workflow for long PDFs:
1. Split each source PDF into fixed-size chunk PDFs of 20 pages.
2. Convert each chunk PDF independently through the existing MinerU conversion pipeline.
3. Do not merge the generated Markdown files.
The feature is intended to reduce long-document memory/runtime pressure and make partial progress usable when one chunk fails. It must preserve strict-local execution, keep MinerU 3.1.0 as the only conversion engine, and keep default tests independent of real MinerU, GPU, CUDA, model files, network access, Obsidian, LaTeX tooling, and `samples/`.
## Research Summary
Sources checked on 2026-05-08:
- [pypdf PyPI](https://pypi.org/project/pypdf/): current release observed as `6.10.2`, uploaded 2026-04-15; metadata lists `BSD-3-Clause`, Python `>=3.9`, Python 3.12 support, and describes pypdf as a pure-Python PDF library capable of splitting, merging, cropping, and transforming PDF pages.
- [pypdf merging docs](https://pypdf.readthedocs.io/en/stable/user/merging-pdfs.html): `PdfWriter.append()` can append a complete or partial source PDF; examples use zero-based page ranges such as `(0, 10)`. The docs recommend `append` or `merge` over low-level `add_page` / `insert_page`.
- [pypdf streaming docs](https://pypdf.readthedocs.io/en/latest/user/streaming-data.html): `PdfReader` and `PdfWriter` support file-like objects, but the project should write chunk PDFs to local disk because MinerU accepts local file paths.
- [pypdf PdfWriter docs](https://pypdf.readthedocs.io/en/stable/modules/PdfWriter.html): writer operations clone/copy PDF objects into the destination. The docs warn that cloning linked objects can copy more than just the visible page object in some cases, so chunk output size must be checked in tests.
- [MinerU CLI tools docs](https://opendatalab.github.io/MinerU/usage/cli_tools/): the direct `mineru` CLI accepts `-p/--path`, `-o/--output`, `-s/--start`, and `-e/--end`; without `--api-url`, it launches a temporary local `mineru-api`.
- [PyMuPDF PyPI](https://pypi.org/project/PyMuPDF/): PyMuPDF is fast and local, but PyPI lists dual licensing under GNU AGPL v3 or an Artifex commercial license.
- [pikepdf page assembly docs](https://pikepdf.readthedocs.io/en/latest/topics/pages.html): pikepdf can split pages and transfer page-associated data; it is a capable fallback candidate but adds a QPDF-backed dependency and is not needed for a first implementation.
## Recommended Package Decision
Use `pypdf` for Sprint 10.
Rationale:
- It is pure Python and fits the current Python 3.12 + `uv` workflow.
- It has permissive `BSD-3-Clause` metadata on PyPI.
- It directly supports page-level PDF assembly with `PdfReader` / `PdfWriter`.
- It avoids adding PyMuPDF's AGPL/commercial licensing considerations for a simple split-only feature.
- It avoids adding pikepdf/QPDF native dependency complexity before there is evidence that pypdf cannot handle the project samples.
Recommended dependency range for implementation:
```toml
dependencies = [
"pypdf>=6.10.2,<7",
]
```
The implementation adds this dependency to `pyproject.toml` and `uv.lock`.
## Current Precondition
- `pdf2md convert` already converts one PDF or a directory of PDFs.
- Existing conversion output per input includes:
  - Markdown
  - optional metadata JSON, enabled by default
  - `<stem>.report.md`
  - assets directory
  - optional raw MinerU output
- `plan_outputs()` already enforces overwrite and output-root safety.
- `convert_input()` already handles directory batches and continues after per-file failures.
- The MinerU adapter accepts one PDF path at a time and runs direct local `mineru` CLI.
- `samples/` is local and untracked; do not commit sample PDFs or generated outputs.
## Touched Surfaces
Allowed during implementation:
- `pyproject.toml`
- `uv.lock`
- `src/pdf2md/pdf_splitter.py`
- `src/pdf2md/paths.py`
- `src/pdf2md/conversion.py`
- `src/pdf2md/cli.py`
- `src/pdf2md/ir.py` only if new warning codes or chunk provenance records are required
- `src/pdf2md/metadata.py` only for chunk provenance fields
- `src/pdf2md/report.py` only to expose chunk provenance in reports
- `tests/test_pdf_splitter.py`
- `tests/test_conversion.py`
- `tests/test_cli.py`
- `tests/test_paths.py`
- `tests/test_metadata.py`
- `tests/integration/` for mocked chunk workflow coverage
- `README.md`
- `docs/V1IMPLEMENTATIONPLAN.md`
- `docs/Sprints/SPRINT10CONTRACT.md`
- `PLAN.md`
- `PROGRESS.md`
Not allowed:
- Runtime engine selection or alternate conversion engines.
- Use of cloud OCR, remote LLM/VLM, hosted renderers, hosted document parsers, `--api-url`, router mode, HTTP client backends, remote APIs, or remote OpenAI-compatible backends.
- Mandatory default tests requiring real MinerU, GPU, CUDA, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- Committed files under `samples/`.
- Committed generated conversion outputs.
- Automatic model or package downloads triggered by import time, `doctor`, `convert`, or tests.
- Markdown merge behavior for chunk outputs.
- Claims that chunking improves formula correctness; it is only a processing-control feature.
## Product Behavior
Activation:
- Chunking is opt-in and existing conversion behavior is unchanged when `chunk_pages` is unset.
- CLI: `pdf2md convert INPUT --out OUTPUT_DIR --chunk-pages` uses the default chunk size of 20 pages.
- CLI: `pdf2md convert INPUT --out OUTPUT_DIR --chunk-pages 20` uses an explicit positive chunk size.
- Python API: `convert_pdf(..., chunk_pages=20)` and `convert_input(..., chunk_pages=20)`.
- `convert_pdf()` returns `ConversionResult` without chunking and `BatchConversionResult` when chunk mode is active.
- `chunk_pages` must be `None` or a positive integer.
Chunking behavior:
- If `chunk_pages` is unset, current behavior remains unchanged.
- If `chunk_pages=20` and a PDF has 20 or fewer pages, conversion may either:
  - convert the original PDF directly, or
  - create one chunk PDF and convert that chunk.
- Recommended: convert the original directly when `total_pages <= chunk_pages` to avoid unnecessary intermediate files.
- If a PDF has more than 20 pages, split it into chunk PDFs with ranges:
  - chunk 1: source pages 1-20
  - chunk 2: source pages 21-40
  - chunk N: remaining pages
- Convert chunk PDFs sequentially, not in parallel. Memory pressure on the GTX 1070 Ti (8 GB) makes sequential conversion the safer default.
- If one chunk conversion fails, continue with later chunks and report the failed chunk clearly.
- Do not merge Markdown outputs.
Recommended chunk output naming:
```text
<stem>.part-001.pages-001-020.md
<stem>.part-001.pages-001-020.metadata.json
<stem>.part-001.pages-001-020.report.md
<stem>.part-001.pages-001-020.assets/
<stem>.part-002.pages-021-040.md
...
```
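The naming scheme above can be sketched as a small helper. This is a minimal illustration, not the project's code: `chunk_output_stem` and `chunk_output_names` are hypothetical names, and the three-digit zero padding is inferred from the example filenames.
```python
def chunk_output_stem(stem: str, index: int, first_page: int, last_page: int) -> str:
    """Build the shared stem for one chunk's outputs.

    index, first_page and last_page are 1-based; the page range is inclusive.
    """
    if index < 1 or first_page < 1 or last_page < first_page:
        raise ValueError("index and pages must be 1-based with first_page <= last_page")
    # :03d pads to at least three digits; larger numbers simply widen the field.
    return f"{stem}.part-{index:03d}.pages-{first_page:03d}-{last_page:03d}"


def chunk_output_names(stem: str, index: int, first_page: int, last_page: int) -> dict[str, str]:
    """Return the four per-chunk output names (hypothetical helper)."""
    base = chunk_output_stem(stem, index, first_page, last_page)
    return {
        "markdown": f"{base}.md",
        "metadata": f"{base}.metadata.json",
        "report": f"{base}.report.md",
        "assets": f"{base}.assets",
    }
```
Deriving all four names from one stem keeps a chunk's Markdown, metadata, report, and assets together when the output directory is sorted by name.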
Recommended chunk PDF staging:
- Use a temporary working directory.
- Delete temporary chunk PDFs after conversion completes, including when `--keep-raw` is enabled.
- Do not add a separate `--keep-chunks` flag in Sprint 10.
## Provenance Requirements
Each chunk conversion must preserve original-source context.
Required chunk fields in metadata or engine options:
- original source PDF path
- original source SHA-256
- chunk PDF path when retained, or chunk PDF filename when temporary
- chunk index, 1-based
- total chunk count
- source page start, 1-based inclusive
- source page end, 1-based inclusive
- chunk page count
Page provenance must distinguish:
- chunk-local page index, starting at 0 for MinerU output
- original source page number, starting at 1 for user-facing reports
The report should include a short chunk context line, for example:
```text
- Chunk: 2/5, source pages: 21-40
```
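The two page numbering schemes can be kept apart with a small record type. The sketch below is illustrative only: `ChunkProvenance` and its field names are assumptions, not the fields the project actually stores.
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkProvenance:
    """Hypothetical provenance record; field names are illustrative only."""

    chunk_index: int        # 1-based
    chunk_count: int
    source_page_start: int  # 1-based inclusive
    source_page_end: int    # 1-based inclusive

    @property
    def page_count(self) -> int:
        return self.source_page_end - self.source_page_start + 1

    def source_page(self, local_index: int) -> int:
        """Map a 0-based chunk-local page index to a 1-based source page number."""
        if not 0 <= local_index < self.page_count:
            raise IndexError(f"local page index {local_index} is outside this chunk")
        return self.source_page_start + local_index

    def report_line(self) -> str:
        """Format the chunk context line used in the report example."""
        return (
            f"- Chunk: {self.chunk_index}/{self.chunk_count}, "
            f"source pages: {self.source_page_start}-{self.source_page_end}"
        )
```
For chunk 2 of 5 covering source pages 21-40, chunk-local index 0 maps to source page 21 and index 19 maps to source page 40.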
## Architecture Plan
### WP10.1: PDF Splitter Module
Owner:
- `feature-generator-agent`
- `mineru-integration-agent`
Actions:
- Add `src/pdf2md/pdf_splitter.py`.
- Define project-owned `PdfChunkPlan`.
- Implement page counting with `pypdf.PdfReader`.
- Implement chunk planning without writing files.
- Implement chunk writing with `pypdf.PdfWriter.append(source, (start, end))` or an equivalent tested `PdfReader`/`PdfWriter` path.
- Use zero-based half-open page ranges internally and one-based inclusive ranges for filenames and reports.
- Reject invalid chunk sizes with clear `ValueError`.
- Fail clearly on encrypted/password-protected PDFs unless a later sprint adds password handling.
Expected output:
- Deterministic chunk plans and local chunk PDFs suitable for the existing MinerU adapter.
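A minimal sketch of the planning and writing steps, under stated assumptions: the `PdfChunkPlan` fields, `plan_chunks`, and `write_chunk` are hypothetical and may differ from `src/pdf2md/pdf_splitter.py`. Only `PdfReader`, `PdfReader.is_encrypted`, `PdfWriter.append(..., pages=...)`, and `PdfWriter.write` are pypdf calls.
```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PdfChunkPlan:
    """One planned chunk; the fields shown here are assumptions."""

    index: int  # 1-based chunk index
    start: int  # 0-based, inclusive
    stop: int   # 0-based, exclusive

    @property
    def first_page(self) -> int:
        return self.start + 1  # 1-based inclusive

    @property
    def last_page(self) -> int:
        return self.stop  # 1-based inclusive


def plan_chunks(total_pages: int, chunk_pages: int = 20) -> list[PdfChunkPlan]:
    """Plan chunks from a page count alone, without touching the filesystem."""
    if isinstance(chunk_pages, bool) or not isinstance(chunk_pages, int) or chunk_pages < 1:
        raise ValueError(f"chunk_pages must be a positive integer, got {chunk_pages!r}")
    if total_pages < 0:
        raise ValueError(f"total_pages must not be negative, got {total_pages!r}")
    starts = range(0, total_pages, chunk_pages)
    return [
        PdfChunkPlan(index, start, min(start + chunk_pages, total_pages))
        for index, start in enumerate(starts, start=1)
    ]


def write_chunk(source: Path, plan: PdfChunkPlan, destination: Path) -> None:
    """Write one chunk PDF to local disk with pypdf."""
    # Imported lazily so planning stays usable where pypdf is not installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(source)
    if reader.is_encrypted:
        raise ValueError(f"encrypted PDF is not supported: {source}")
    writer = PdfWriter()
    # pypdf page ranges are zero-based and half-open, matching the plan fields.
    writer.append(reader, pages=(plan.start, plan.stop))
    with destination.open("wb") as handle:
        writer.write(handle)
    writer.close()
```
In this sketch a 41-page source yields three plans covering pages 1-20, 21-40, and 41-41. A source with `total_pages <= chunk_pages` yields a single plan, so the caller can still choose to convert the original directly.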
### WP10.2: Chunk-Aware Path Planning
Owner:
- `feature-generator-agent`
Actions:
- Extend output planning so chunk outputs are deterministic and conflict-checked before conversion starts.
- Avoid collisions between original-output stems and chunk-output stems.
- Preserve output-root escape prevention.
- Respect `--overwrite`.
- Keep Korean and non-ASCII source stems working.
Expected output:
- A long PDF can produce multiple planned Markdown/metadata/report/assets outputs without overwriting another chunk.
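The pre-conversion checks might look like the following sketch. `check_planned_outputs` is a hypothetical stand-in and does not reflect the real `plan_outputs()` signature; it only illustrates the three checks named above (root escape, collisions, overwrite).
```python
from pathlib import Path


def check_planned_outputs(
    output_root: Path, relative_names: list[str], overwrite: bool = False
) -> list[Path]:
    """Resolve planned outputs and fail before conversion if any is unsafe."""
    root = output_root.resolve()
    resolved: list[Path] = []
    seen: set[Path] = set()
    for name in relative_names:
        target = (root / name).resolve()
        # Reject names such as "../x.md" that resolve outside the output root.
        if not target.is_relative_to(root):
            raise ValueError(f"planned output escapes the output root: {name}")
        # Reject two planned outputs that resolve to the same path.
        if target in seen:
            raise ValueError(f"planned outputs collide: {name}")
        # Existing files are only acceptable when overwrite was requested.
        if target.exists() and not overwrite:
            raise FileExistsError(f"output exists and overwrite is off: {target}")
        seen.add(target)
        resolved.append(target)
    return resolved
```
Running every check before the first chunk is converted means a conflict in the last chunk is reported up front rather than after earlier chunks have already been written.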
### WP10.3: Conversion Orchestration
Owner:
- `feature-generator-agent`
- `mineru-integration-agent`
Actions:
- Add chunk mode to `convert_pdf()` and `convert_input()`.
- When chunk mode is active, split before calling the MinerU adapter.
- Reuse existing per-PDF conversion path for each chunk PDF rather than creating a second conversion pipeline.
- Continue conversion after a chunk-level failure and aggregate a batch-like result for the source.
- Ensure temporary chunk directories are cleaned up after conversion completes, including when raw retention (`--keep-raw`) is requested, as required by the chunk PDF staging rules.
- Keep strict-local validation unchanged.
Expected output:
- Long PDF conversion yields separate Markdown/metadata/report/assets outputs per chunk.
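The continue-after-failure loop can be sketched with injected callables. This is an assumption-laden illustration: `convert_chunks`, `write_chunk`, and `convert_one` are hypothetical names, and the real orchestration returns a `BatchConversionResult` rather than two dictionaries.
```python
import tempfile
from pathlib import Path
from typing import Callable


def convert_chunks(
    chunk_names: list[str],
    write_chunk: Callable[[str, Path], None],
    convert_one: Callable[[Path], str],
) -> tuple[dict[str, str], dict[str, str]]:
    """Convert chunks one at a time; return (successes, failures) keyed by chunk name."""
    successes: dict[str, str] = {}
    failures: dict[str, str] = {}
    # The temporary directory is removed on exit, even when a chunk fails.
    with tempfile.TemporaryDirectory(prefix="pdf2md-chunks-") as workdir:
        for name in chunk_names:
            chunk_pdf = Path(workdir) / name
            try:
                write_chunk(name, chunk_pdf)
                successes[name] = convert_one(chunk_pdf)
            except Exception as exc:
                # Record the failure explicitly and move on to later chunks.
                failures[name] = f"{type(exc).__name__}: {exc}"
    return successes, failures
```
Passing the writer and converter in as callables lets default tests substitute a fake adapter, so the loop can be exercised without real MinerU.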
### WP10.4: Metadata And Report Chunk Provenance
Owner:
- `metadata-agent`
- `obsidian-markdown-agent`
Actions:
- Add chunk provenance fields without exposing raw pypdf objects.
- Keep existing required metadata fields valid.
- Keep original source provenance visible even though MinerU sees a chunk PDF as input.
- Ensure chunk reports are readable without opening JSON metadata.
Expected output:
- Users can map each output file back to original source pages.
### WP10.5: CLI And API Surface
Owner:
- `feature-generator-agent`
- `requirements-guard-agent`
Actions:
- Add `--chunk-pages [INTEGER]` to `pdf2md convert`.
- Keep chunking disabled unless the option is present.
- Use 20 pages when `--chunk-pages` is present without an explicit value.
- Validate positive integer input.
- Keep `--out`, `--metadata`, `--keep-raw`, `--recursive`, `--overwrite`, `--gpu`, and strict-local behavior unchanged.
- Update README with the long-PDF workflow.
Expected output:
```powershell
uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages 20
```
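An option with an optional value can be expressed in `argparse` with `nargs="?"` and `const`. The sketch below assumes `argparse` purely for illustration; this contract does not state which CLI framework `src/pdf2md/cli.py` uses, and `positive_int` and `build_parser` are hypothetical names.
```python
import argparse


def positive_int(text: str) -> int:
    """argparse type callable: accept only integers >= 1."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {text!r}") from None
    if value < 1:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {text!r}")
    return value


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser with only the options relevant to chunking."""
    parser = argparse.ArgumentParser(prog="pdf2md convert")
    parser.add_argument("input")
    parser.add_argument("--out", required=True)
    parser.add_argument(
        "--chunk-pages",
        nargs="?",          # the value is optional
        type=positive_int,
        const=20,           # used when the flag is given without a value
        default=None,       # chunking stays disabled when the flag is absent
        metavar="INTEGER",
    )
    return parser
```
With this shape, an absent flag leaves `chunk_pages` as `None`, a bare `--chunk-pages` yields 20, and `--chunk-pages 35` yields 35. One caveat of `nargs="?"` in `argparse` is that a bare flag placed directly before the positional input would try to consume it as the value, so the bare form is safest at the end of the command line.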
### WP10.6: Tests
Owner:
- `feature-generator-agent`
- `evaluation-agent`
Default tests must not require real MinerU or sample PDFs.
Required tests:
- Build small local PDF fixtures using `pypdf` blank pages or minimal test PDFs.
- Page count detection for 1, 20, 21, 40, and 41 pages.
- Chunk planning produces expected 1-based filenames and page ranges.
- Chunk writing produces PDFs with the expected page counts.
- Non-positive chunk size is rejected.
- Existing conversion without `--chunk-pages` is unchanged.
- Chunked conversion calls the fake adapter once per chunk.
- Chunked conversion writes separate Markdown, metadata JSON, report Markdown, and assets per chunk.
- `--overwrite` and conflict detection work for all planned chunk outputs.
- A failed chunk does not silently fall back and does not prevent later chunks from being attempted.
- Metadata/report contain original source PDF and source page range.
- CLI validates `--chunk-pages` and prints a useful summary.
Optional local validation:
- Run chunked conversion on a local `samples/` PDF only by explicit user request or opt-in gate.
- Do not commit generated chunk PDFs or outputs.
## Acceptance Criteria
- Sprint 10 implementation can split a PDF into 20-page chunk PDFs before MinerU conversion.
- Chunk PDFs are converted one by one using the existing direct local MinerU CLI adapter.
- Markdown outputs are separate and not merged.
- Metadata/report files show chunk index and original page range.
- Default test suite passes without real MinerU, GPU, CUDA, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- Strict-local policy remains unchanged.
- Existing non-chunked conversion behavior remains backward-compatible.
## Hard Failure Criteria
- Chunking uses a remote PDF service or uploads document content.
- Chunking introduces an alternate Markdown conversion engine.
- Default tests require real MinerU, GPU, CUDA, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- Chunk outputs overwrite each other or overwrite non-chunk outputs without `--overwrite`.
- Chunk metadata loses original source page provenance.
- The implementation merges Markdown despite this contract's non-merge requirement.
- The implementation silently skips failed chunks without warnings.
## Resolved Decisions
- Activation mode: opt-in with `--chunk-pages`; the option defaults to 20 pages when no value is supplied.
- Chunk PDF retention: temporary chunk PDFs only; they are deleted after conversion completes.
- API return type: `convert_pdf()` returns a `BatchConversionResult` when chunk mode is active.
## Verification Commands For Implementation
```powershell
uv sync
uv run pytest tests/test_pdf_splitter.py tests/test_conversion.py tests/test_cli.py tests/test_paths.py tests/test_metadata.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```
Optional local command after implementation and explicit user approval:
```powershell
uv run pdf2md convert samples/MITC공부.pdf --out outputs --chunk-pages 20 --overwrite
```
## Handoff Requirements
After implementation:
- Update `PROGRESS.md` with files changed, commands run, tests passed, optional sample status, known failures, residual risks, and next action.
- Do not mark the sprint implemented until independent evaluation or equivalent focused review verifies the acceptance criteria.
- Commit the completed change without including `samples/` or generated outputs.
## Implementation Handoff
- Files changed: `pyproject.toml`, `uv.lock`, `src/pdf2md/pdf_splitter.py`, `src/pdf2md/conversion.py`, `src/pdf2md/cli.py`, `src/pdf2md/__init__.py`, `src/pdf2md/report.py`, tests, README, `docs/V1IMPLEMENTATIONPLAN.md`, `PLAN.md`, and `PROGRESS.md`.
- Verification status: the targeted unit test run passed (42 tests); the full local test suite passed (163 tests, 1 optional skip); `git diff --check` passed with line-ending warnings only.
- Optional local sample conversion remains out of scope unless explicitly requested.