# Sprint 14 Contract: Single-Page Conversion With Grouped Outputs
Status: Implemented
Last updated: 2026-05-11
## Objective
Replace the current fixed-size pre-conversion chunking behavior with a safer long-PDF workflow:
1. When chunk mode is active, split the source PDF into one-page temporary PDFs.
2. Convert each one-page PDF sequentially through the existing local MinerU CLI adapter.
3. Merge the Markdown from successfully converted pages into grouped output files, one file per group of pages as set by the configured output group size.
4. Keep the default output group size at 20 pages when `--chunk-pages` is supplied without a value.
This sprint is motivated by local evidence from `samples/2007쉘구조물의유한요소해석에대하여.pdf`: a 5-page MinerU input chunk stalled on a GTX 1070 Ti 8GB GPU, while one-page conversion completed all 13 pages.
## Current Precondition
- MinerU 3.1.0 remains the only conversion engine.
- Conversion runs through direct local `mineru` CLI execution only.
- Strict-local mode allows only the direct CLI and the temporary local `mineru-api` that the MinerU CLI starts internally; remote API/backend paths remain prohibited.
- `pypdf` is already available and used for local PDF chunk planning and temporary chunk PDF writing.
- `pdf2md convert` currently supports `--chunk-pages [PAGES]`.
- Existing chunk mode currently treats `chunk_pages` as the MinerU input PDF page count and writes one final Markdown file per input chunk.
- `convert_pdf(..., chunk_pages=N)` currently returns `BatchConversionResult` in chunk mode.
- Sprint 13 text fidelity diagnostics are most accurate when each MinerU Markdown output maps to exactly one source page.
## Contract Assumptions
- Keep chunk mode opt-in for this sprint. If `chunk_pages` is `None`, the existing non-chunked full-PDF conversion path remains unchanged.
- Keep the public option name `--chunk-pages` for CLI/API compatibility, but redefine its behavior in chunk mode as the output group size, not the MinerU input size.
- If `--chunk-pages` is present without a value, use `DEFAULT_CHUNK_PAGES == 20` as the output group size.
- In chunk mode, even a PDF with fewer than `chunk_pages` pages is converted internally one page at a time and emitted as one grouped output file.
- Final grouped outputs are the public conversion results. Temporary per-page Markdown, metadata, reports, assets, and one-page PDFs are not retained unless a later sprint explicitly adds debug retention.
## Touched Surfaces
Allowed during implementation:
- `src/pdf2md/pdf_splitter.py`
- `src/pdf2md/conversion.py`
- `src/pdf2md/paths.py`
- `src/pdf2md/metadata.py`
- `src/pdf2md/report.py`
- `src/pdf2md/cli.py`
- `src/pdf2md_ui/app.py`
- `src/pdf2md_ui/runner.py`
- `tests/test_pdf_splitter.py`
- `tests/test_conversion.py`
- `tests/test_cli.py`
- `tests/test_paths.py`
- `tests/test_metadata.py`
- `tests/test_report.py`
- `tests/test_ui_runner.py`
- `README.md`
- `ARCHITECTURE.md`
- `docs/V1IMPLEMENTATIONPLAN.md`
- `PLAN.md`
- `PROGRESS.md`
- `docs/WORKARCHIVE.md` after implementation
Allowed if a focused helper boundary keeps `conversion.py` simpler:
- Create `src/pdf2md/page_grouping.py`
- Create `tests/test_page_grouping.py`
Not allowed:
- Adding another conversion engine or runtime engine selector.
- Running page conversions in parallel by default. GTX 1070 Ti 8GB memory pressure makes sequential conversion the safe default.
- Adding cloud OCR, hosted LLM/VLM, remote document parsing, `--api-url`, router mode, HTTP client backends, or remote OpenAI-compatible endpoints.
- Making default tests depend on real MinerU, GPU, model files, network, Obsidian, MathJax, or `samples/`.
- Committing sample PDFs, generated `outputs/`, retained temporary page outputs, or `dist/pdf2md-ui.exe`.
## Product Behavior
### Activation
Existing non-chunked conversion remains unchanged:
```powershell
uv run pdf2md convert paper.pdf --out outputs
```
Grouped page conversion is enabled by `--chunk-pages`:
```powershell
uv run pdf2md convert paper.pdf --out outputs --chunk-pages
uv run pdf2md convert paper.pdf --out outputs --chunk-pages 20
uv run pdf2md convert paper.pdf --out outputs --chunk-pages 1
```
Behavior:
- `--chunk-pages` means output group size.
- `--chunk-pages 20` converts pages 1, 2, 3, ... as independent one-page MinerU jobs, then emits grouped outputs covering pages 1-20, 21-40, and so on.
- `--chunk-pages 1` emits one final output file per source page.
- `convert_pdf(..., chunk_pages=N)` still returns `BatchConversionResult`; each `ConversionResult` represents one final grouped output file, not each internal one-page MinerU run.
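The optional-value behavior of `--chunk-pages` can be sketched with the standard library's `argparse`. This is only an illustration: the contract does not say which CLI framework `pdf2md` uses, and `build_parser` and `positive_int` are hypothetical names. Only the option names and `DEFAULT_CHUNK_PAGES` come from this contract.

```python
import argparse

DEFAULT_CHUNK_PAGES = 20  # default output group size named in this contract


def positive_int(value: str) -> int:
    """Parse a strictly positive integer (hypothetical helper)."""
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError("chunk pages must be a positive integer")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a minimal parser that mirrors the documented convert options."""
    parser = argparse.ArgumentParser(prog="pdf2md-sketch")
    parser.add_argument("pdf")
    parser.add_argument("--out", default="outputs")
    parser.add_argument(
        "--chunk-pages",
        nargs="?",                  # the value after the flag is optional
        const=DEFAULT_CHUNK_PAGES,  # used when the flag is given without a value
        default=None,               # flag absent: non-chunked conversion path
        type=positive_int,
        metavar="PAGES",
    )
    return parser
```

With this shape, `chunk_pages` is `None` when the flag is absent, `20` when the flag is bare, and the parsed integer otherwise, which matches the three activation cases above.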
### Output Naming
Use the existing part/page-range naming shape for grouped outputs:
```text
<stem>.part-001.pages-001-020.md
<stem>.part-001.pages-001-020.metadata.json
<stem>.part-001.pages-001-020.report.md
<stem>.part-001.pages-001-020.assets/
<stem>.part-002.pages-021-040.md
...
```
If a 13-page PDF is converted with `--chunk-pages 20`, it emits:
```text
<stem>.part-001.pages-001-013.md
<stem>.part-001.pages-001-013.metadata.json
<stem>.part-001.pages-001-013.report.md
<stem>.part-001.pages-001-013.assets/
```
This is an intentional behavior change from Sprint 10: when `--chunk-pages` is supplied, short PDFs no longer bypass chunk mode and no longer write `<stem>.md`.
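As a minimal sketch of the naming shape above, the grouped output stem can be built from the part index and the source page range. The function name `grouped_output_stem` is hypothetical; the three-digit zero padding is taken from the examples in this section.

```python
def grouped_output_stem(stem: str, part_index: int, page_start: int, page_end: int) -> str:
    """Return '<stem>.part-NNN.pages-SSS-EEE' for one grouped output.

    Page numbers are 1-based source page numbers. Values above 999 simply
    widen the field; the contract does not specify that case.
    """
    if part_index < 1 or page_start < 1 or page_end < page_start:
        raise ValueError("part index and page range must be positive and ordered")
    return f"{stem}.part-{part_index:03d}.pages-{page_start:03d}-{page_end:03d}"


# The suffixes listed above are appended to this stem.
stem = grouped_output_stem("paper", 1, 1, 13)
markdown_name = f"{stem}.md"
metadata_name = f"{stem}.metadata.json"
report_name = f"{stem}.report.md"
assets_dir_name = f"{stem}.assets"
```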
### Internal Page Conversion
For every source page in chunk mode:
- Write a one-page temporary PDF with pypdf.
- Run the existing local MinerU adapter against that one-page PDF.
- Normalize Markdown, copy page assets into a temporary page assets directory, run MathJax checks/repair, and run Sprint 13 text fidelity diagnostics against the original source page.
- Delete the one-page temporary PDF and temporary per-page final files after grouped output generation.
The implementation should reuse existing conversion primitives where practical, but it must avoid writing final public files for every page before grouping.
### Markdown Grouping
For each output group:
- Concatenate successful page Markdown in source page order.
- Separate pages with blank lines and an HTML comment that is invisible in Obsidian preview:
```markdown
<!-- source-page: 7 -->
```
- Do not add visible page headings or instructional text.
- If a page conversion fails, do not invent Markdown for that page. Add an invisible comment at the page boundary:
```markdown
<!-- source-page: 7 conversion failed; see report -->
```
- Preserve Obsidian-friendly math delimiters and display math spacing after concatenation.
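The grouping rules above can be sketched as a small pure function. This is an illustration under assumptions, not the project's implementation: `merge_page_markdown` is a hypothetical name, failed pages are represented as `None`, and the marker is placed before each page's content (the contract only requires a comment at the page boundary).

```python
from typing import Mapping, Optional


def merge_page_markdown(pages: Mapping[int, Optional[str]]) -> str:
    """Join per-page Markdown in source page order.

    `pages` maps a 1-based source page number to that page's Markdown,
    or to None when the page conversion failed.
    """
    blocks = []
    for page_number in sorted(pages):
        markdown = pages[page_number]
        if markdown is None:
            # No Markdown is invented for a failed page, only an invisible marker.
            blocks.append(f"<!-- source-page: {page_number} conversion failed; see report -->")
            continue
        blocks.append(f"<!-- source-page: {page_number} -->")
        body = markdown.strip("\n")
        if body:
            blocks.append(body)
    # Blank lines between blocks keep each comment separate from page content.
    return "\n\n".join(blocks) + "\n"
```

Math delimiter and display-math spacing checks are not shown here; they would run on the merged text afterwards.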
### Asset Grouping
Assets from temporary per-page outputs must be copied into the grouped assets directory with collision-proof names.
Recommended destination layout:
```text
<stem>.part-001.pages-001-020.assets/page-001/<asset-name>
<stem>.part-001.pages-001-020.assets/page-002/<asset-name>
```
Markdown image links must be rewritten to the grouped assets directory. This keeps repeated MinerU asset filenames from different pages from overwriting each other.
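One way to rewrite the links, sketched under assumptions: per-page Markdown uses relative `![alt](path)` image links, and only the asset's file name needs to be kept. `rewrite_image_links` and `IMAGE_LINK` are hypothetical names, and the regular expression does not handle link titles or targets that contain spaces.

```python
import re
from pathlib import PurePosixPath

# Matches ![alt](target) where the target has no spaces or closing parenthesis.
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def rewrite_image_links(markdown: str, group_assets_dir: str, source_page: int) -> str:
    """Point relative image links at <group_assets_dir>/page-NNN/<asset-name>."""
    page_dir = f"page-{source_page:03d}"

    def replace(match):
        alt_text, target = match.group(1), match.group(2)
        if "://" in target or target.startswith(("/", "#", "data:")):
            return match.group(0)  # leave absolute and non-file links untouched
        asset_name = PurePosixPath(target).name
        return f"![{alt_text}]({group_assets_dir}/{page_dir}/{asset_name})"

    return IMAGE_LINK.sub(replace, markdown)
```

Because every page gets its own `page-NNN/` subdirectory, two pages that both produce an asset with the same file name end up at different destinations.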
### Metadata And Report Grouping
Grouped metadata must be derived from per-page conversion records plus group-level checks.
Required metadata behavior:
- `source_pdf` remains the original source PDF path.
- `source_sha256` remains the original source PDF hash.
- `pages` contains one page record per source page in the group.
- Page indexes in grouped metadata are group-local zero-based indexes.
- Original source page numbers remain visible in chunk/page conversion provenance.
- Warnings from per-page conversions are preserved with adjusted group-local page indexes.
- Warnings for failed page conversions are added with original source page context.
- `text_fidelity` records are carried from one-page checks and keep exact `source_page_number` values.
- Summary counts are aggregated from the grouped metadata and grouped Markdown.
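The index adjustment for warnings can be sketched as follows. The warning record shape is an assumption made for illustration: the `page_index` and `source_page_number` keys on a warning dict and the helper name `regroup_warning` are hypothetical and may not match the project's metadata schema.

```python
from typing import Any, Dict


def regroup_warning(
    warning: Dict[str, Any], source_page_number: int, group_page_start: int
) -> Dict[str, Any]:
    """Copy a per-page warning and re-index it for its output group.

    A one-page conversion always reports page-local index 0. In the group,
    the zero-based index is the page's offset from the group's first page.
    """
    if source_page_number < group_page_start:
        raise ValueError("source page lies before the start of the group")
    adjusted = dict(warning)  # do not mutate the per-page record
    adjusted["page_index"] = source_page_number - group_page_start
    adjusted["source_page_number"] = source_page_number  # keep provenance
    return adjusted
```

For example, a warning from source page 27 in the group covering pages 21-40 gets group-local index 6 while still recording source page 27.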
Required `engine_options` shape:
```json
{
  "chunk": {
    "original_source_pdf": "...",
    "chunk_index": 1,
    "total_chunks": 3,
    "source_page_start": 1,
    "source_page_end": 20,
    "chunk_page_count": 20
  },
  "page_conversion": {
    "mode": "single_page",
    "mineru_input_page_count": 1,
    "output_group_page_count": 20,
    "failed_source_pages": []
  }
}
```
Report Markdown must continue to include the existing chunk context line and should add a concise page-conversion line, for example:
```text
- Page conversion mode: single-page MinerU inputs, grouped output size: 20
```
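A minimal sketch of assembling the required `engine_options` shape and the report line follows. The function names are hypothetical, and one point is an assumption: `output_group_page_count` is filled with the configured group size, while `chunk_page_count` is the actual number of pages in the group. The contract's example does not distinguish the two because both are 20.

```python
from typing import Any, Dict, List


def build_engine_options(
    original_source_pdf: str,
    chunk_index: int,
    total_chunks: int,
    source_page_start: int,
    source_page_end: int,
    output_group_size: int,
    failed_source_pages: List[int],
) -> Dict[str, Any]:
    """Return the `chunk` and `page_conversion` sections for one group."""
    return {
        "chunk": {
            "original_source_pdf": original_source_pdf,
            "chunk_index": chunk_index,
            "total_chunks": total_chunks,
            "source_page_start": source_page_start,
            "source_page_end": source_page_end,
            "chunk_page_count": source_page_end - source_page_start + 1,
        },
        "page_conversion": {
            "mode": "single_page",
            "mineru_input_page_count": 1,  # MinerU always sees one page
            "output_group_page_count": output_group_size,  # assumed: configured size
            "failed_source_pages": sorted(failed_source_pages),
        },
    }


def page_conversion_report_line(engine_options: Dict[str, Any]) -> str:
    """Format the concise report line shown above."""
    size = engine_options["page_conversion"]["output_group_page_count"]
    return f"- Page conversion mode: single-page MinerU inputs, grouped output size: {size}"
```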
## Failure Policy
- Convert pages sequentially.
- If a page fails, continue with later pages.
- If some pages in a group fail but at least one succeeds, write the grouped Markdown/metadata/report and mark the final status `partial`.
- If every page in a group fails, return a failed `ConversionResult` for that grouped output and do not write Markdown for that group.
- Failed pages must be visible in metadata/report warnings.
- There is no silent fallback and no retry loop in this sprint.
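The policy above can be sketched as a sequential loop that records failures and derives a group status. This is an illustration only: `GroupOutcome`, `convert_group`, and the `convert_page` callable are hypothetical, and the status strings `"success"` and `"failed"` are assumed labels (the contract names only `partial`).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class GroupOutcome:
    status: str
    page_markdown: Dict[int, str] = field(default_factory=dict)
    failed_source_pages: List[int] = field(default_factory=list)


def convert_group(
    source_pages: Iterable[int], convert_page: Callable[[int], str]
) -> GroupOutcome:
    """Convert pages one at a time, in order, with no retry and no fallback."""
    converted: Dict[int, str] = {}
    failed: List[int] = []
    for page_number in source_pages:
        try:
            converted[page_number] = convert_page(page_number)
        except Exception:
            failed.append(page_number)  # record the failure and continue
    if not converted:
        status = "failed"   # every page failed: no Markdown is written
    elif failed:
        status = "partial"  # some pages failed, the group is still written
    else:
        status = "success"  # assumed label for a fully converted group
    return GroupOutcome(status, converted, failed)
```

The caller would then surface `failed_source_pages` in the grouped metadata and report warnings.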
## Architecture Plan
### WP14.1: Page And Group Planning
Actions:
- Extend `pdf_splitter.py` or add `page_grouping.py` with project-owned records for:
  - one-page MinerU input plans,
  - final output group plans,
  - original source page ranges,
  - deterministic output stems.
- Keep pypdf page extraction local and temporary.
- Validate output group size as a positive integer.
- Plan output groups before conversion starts so overwrite/conflict behavior remains deterministic.
Expected output:
- A 41-page PDF with group size 20 plans 41 one-page MinerU inputs and 3 final grouped outputs.
- A 13-page PDF with group size 20 plans 13 one-page MinerU inputs and 1 final grouped output.
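The expected planning results can be reproduced with a short sketch. `GroupPlan` and `plan_groups` are hypothetical names, not the project's records; page numbers are 1-based source page numbers, and each page in a group corresponds to one one-page MinerU input.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GroupPlan:
    part_index: int  # 1-based index of the grouped output
    page_start: int  # first source page in the group
    page_end: int    # last source page in the group, inclusive

    @property
    def source_pages(self) -> range:
        """Source pages that are converted as one-page MinerU inputs."""
        return range(self.page_start, self.page_end + 1)


def plan_groups(total_pages: int, group_size: int) -> List[GroupPlan]:
    """Split pages 1..total_pages into consecutive groups of group_size."""
    if group_size < 1:
        raise ValueError("output group size must be a positive integer")
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    plans = []
    for part_index, start in enumerate(range(1, total_pages + 1, group_size), start=1):
        end = min(start + group_size - 1, total_pages)
        plans.append(GroupPlan(part_index, start, end))
    return plans
```

Planning all groups up front, before any conversion starts, is what lets overwrite and conflict checks run against the full list of output stems.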
### WP14.2: Conversion Orchestration
Actions:
- Rework chunk-mode `convert_pdf()` and `convert_input()` orchestration so `chunk_pages` creates grouped output tasks.
- Run one-page MinerU inputs in source-page order.
- Keep temporary page PDFs and intermediate page outputs under local temporary directories.
- Keep `BatchConversionResult` at the grouped-output level.
- Keep strict-local validation unchanged.
Expected output:
- The public API keeps returning multiple grouped results in chunk mode while the adapter is called once per source page internally.
### WP14.3: Markdown And Asset Group Assembly
Actions:
- Build a focused helper to merge page Markdown and page assets into a grouped output.
- Insert invisible `<!-- source-page: N -->` boundaries.
- Rewrite per-page asset links to `page-NNN/` asset subdirectories.
- Run final group-level local quality checks after asset rewriting.
Expected output:
- Grouped Markdown renders in Obsidian and assets do not collide across pages.
### WP14.4: Metadata, Warnings, And Report Assembly
Actions:
- Aggregate per-page metadata into grouped metadata.
- Adjust page indexes from page-local `0` to group-local indexes.
- Preserve original source page numbers in `engine_options` and text fidelity records.
- Add `page_conversion` engine options.
- Add a report line for single-page conversion mode and grouped output size.
Expected output:
- Metadata/report can explain both facts: MinerU saw one page at a time, while the user received grouped Markdown files.
### WP14.5: CLI, UI, And Documentation
Actions:
- Update CLI help for `--chunk-pages` from "pre-conversion PDF chunking" to "group converted pages into output files of N pages; MinerU runs one page at a time."
- Update README and architecture docs with the new behavior.
- Update the Windows UI label/help text so the field represents output group size.
- Keep runner command construction using `--chunk-pages N`.
Expected output:
- Users do not confuse `--chunk-pages 20` with a 20-page MinerU input.
### WP14.6: Tests
Default fast tests:
- Generated blank local PDFs verify page count and group planning for 1, 13, 20, 21, 40, and 41 pages.
- `--chunk-pages` without a value still passes `20`.
- `convert_pdf(..., chunk_pages=20)` for 41 pages calls the fake adapter 41 times and returns 3 grouped `ConversionResult` objects.
- `convert_pdf(..., chunk_pages=20)` for 13 pages calls the fake adapter 13 times and returns 1 grouped output named `part-001.pages-001-013`.
- `convert_pdf(..., chunk_pages=1)` returns one grouped output per source page.
- Temporary one-page PDFs and temporary per-page outputs are deleted after conversion.
- A failed internal page conversion does not stop later pages and appears in grouped metadata/report.
- A group with only failed pages returns a failed result and writes no Markdown.
- Asset filenames from different pages do not collide in the grouped assets directory.
- Per-page warnings and text fidelity records are adjusted to group-local page indexes while preserving original source page numbers.
- Existing non-chunked conversion tests keep passing unchanged.
- UI runner tests continue to build fixed argument lists with `shell=False`.
Optional local validation:
```powershell
$env:MINERU_MODEL_SOURCE='local'
$pdf = (Get-ChildItem samples -Filter '2007*.pdf' | Select-Object -First 1).FullName
uv run pdf2md convert $pdf --out outputs\sprint14-2007-page-grouped --overwrite --chunk-pages
```
Expected optional validation:
- The 13-page Korean sample emits one grouped Markdown file for pages 1-13.
- Metadata/report show exact page-level text fidelity records.
- Generated outputs stay ignored and uncommitted.
## Acceptance Criteria
- Chunk mode runs MinerU on one-page temporary PDFs only.
- `chunk_pages` controls final grouped output page count.
- Default group size remains 20 when `--chunk-pages` is supplied without a value.
- Grouped Markdown, metadata JSON, report Markdown, and grouped assets directory are written.
- Grouped metadata preserves original source PDF, original source SHA-256, group page range, one-page conversion mode, page warnings, and text fidelity provenance.
- Failed page conversions are explicit, nonfatal to later pages, and visible in report/metadata.
- Default tests remain fast and local.
- Strict-local policy remains unchanged.
- Non-chunked conversion behavior remains backward-compatible.
## Hard Failure Criteria
- Chunk mode sends more than one source page to MinerU in a single temporary PDF.
- `--chunk-pages` continues to mean MinerU input chunk size after this sprint.
- Grouped outputs lose source page provenance or hide failed pages.
- Asset links collide or point outside the grouped assets directory.
- Default tests require real MinerU, GPU, model files, network, Obsidian, MathJax, or `samples/`.
- The implementation adds a remote API/backend path, alternate conversion engine, router mode, or OpenAI-compatible backend.
- Sample PDFs, generated outputs, retained temporary page outputs, or `dist/pdf2md-ui.exe` are committed.
## Verification Commands
```powershell
uv run pytest tests/test_pdf_splitter.py tests/test_conversion.py tests/test_cli.py tests/test_paths.py tests/test_metadata.py tests/test_report.py tests/test_ui_runner.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```
The optional local validation command is listed in WP14.6 and should be run only when a long GPU conversion is acceptable.
## Handoff Requirements
After implementation:
- Update `PROGRESS.md` with files changed, commands run, test outcomes, optional sample validation outcome, known failures, residual risks, and next action.
- Archive completed implementation details in `docs/WORKARCHIVE.md` after verification.
- Keep sample PDFs, generated outputs, retained temporary page outputs, and build artifacts out of the commit.
- Record whether the 2007 Korean sample was validated with grouped page conversion and how many grouped outputs were produced.
Implementation handoff on 2026-05-11:
- Implemented grouped page conversion in `src/pdf2md/conversion.py` with one-page temporary MinerU inputs and grouped public outputs.
- Added report output for `page_conversion` engine options.
- Updated CLI help, UI label text, README, architecture, implementation plan, and coordination/archive docs.
- Verification: targeted Sprint 14 tests passed, the 101-test related suite passed, and full `uv run pytest` passed 202 tests with 1 optional skip.
- Optional real MinerU validation on the 2007 Korean sample was not run during this implementation pass.
## Future Sprint Boundary
A later sprint may make grouped page conversion the default even without `--chunk-pages`, add resumable page caches, or add a debug option to retain intermediate per-page outputs. Those behaviors are intentionally out of Sprint 14 scope.