379 lines
16 KiB
Markdown
379 lines
16 KiB
Markdown
# Sprint 14 Contract: Single-Page Conversion With Grouped Outputs
|
|
|
|
Status: Implemented
|
|
Last updated: 2026-05-11
|
|
|
|
## Objective
|
|
|
|
Replace the current fixed-size pre-conversion chunking behavior with a safer long-PDF workflow:
|
|
|
|
1. When chunk mode is active, split the source PDF into one-page temporary PDFs.
|
|
2. Convert each one-page PDF sequentially through the existing local MinerU CLI adapter.
|
|
3. Merge successful converted page Markdown into grouped output files after every configured output group size.
|
|
4. Keep the default output group size at 20 pages when `--chunk-pages` is supplied without a value.
|
|
|
|
This sprint is motivated by local evidence from `samples/2007쉘구조물의유한요소해석에대하여.pdf`: a 5-page MinerU input chunk stalled on GTX 1070 Ti 8GB, while one-page conversion completed all 13 pages.
|
|
|
|
## Current Precondition
|
|
|
|
- MinerU 3.1.0 remains the only conversion engine.
|
|
- Conversion runs through direct local `mineru` CLI execution only.
|
|
- Strict-local allows only the direct CLI and MinerU CLI-internal temporary local `mineru-api`; remote API/backend paths remain prohibited.
|
|
- `pypdf` is already available and used for local PDF chunk planning and temporary chunk PDF writing.
|
|
- `pdf2md convert` currently supports `--chunk-pages [PAGES]`.
|
|
- Existing chunk mode currently treats `chunk_pages` as the MinerU input PDF page count and writes one final Markdown file per input chunk.
|
|
- `convert_pdf(..., chunk_pages=N)` currently returns `BatchConversionResult` in chunk mode.
|
|
- Sprint 13 text fidelity diagnostics are most accurate when each MinerU Markdown output maps to exactly one source page.
|
|
|
|
## Contract Assumptions
|
|
|
|
- Keep chunk mode opt-in for this sprint. If `chunk_pages` is `None`, the existing non-chunked full-PDF conversion path remains unchanged.
|
|
- Keep the public option name `--chunk-pages` for CLI/API compatibility, but redefine its behavior in chunk mode as the output group size, not the MinerU input size.
|
|
- If `--chunk-pages` is present without a value, use `DEFAULT_CHUNK_PAGES == 20` as the output group size.
|
|
- In chunk mode, even a PDF with fewer than `chunk_pages` pages is converted internally one page at a time and emitted as one grouped output file.
|
|
- Final grouped outputs are the public conversion results. Temporary per-page Markdown, metadata, reports, assets, and one-page PDFs are not retained unless a later sprint explicitly adds debug retention.
|
|
|
|
## Touched Surfaces
|
|
|
|
Allowed during implementation:
|
|
|
|
- `src/pdf2md/pdf_splitter.py`
|
|
- `src/pdf2md/conversion.py`
|
|
- `src/pdf2md/paths.py`
|
|
- `src/pdf2md/metadata.py`
|
|
- `src/pdf2md/report.py`
|
|
- `src/pdf2md/cli.py`
|
|
- `src/pdf2md_ui/app.py`
|
|
- `src/pdf2md_ui/runner.py`
|
|
- `tests/test_pdf_splitter.py`
|
|
- `tests/test_conversion.py`
|
|
- `tests/test_cli.py`
|
|
- `tests/test_paths.py`
|
|
- `tests/test_metadata.py`
|
|
- `tests/test_report.py`
|
|
- `tests/test_ui_runner.py`
|
|
- `README.md`
|
|
- `ARCHITECTURE.md`
|
|
- `docs/V1IMPLEMENTATIONPLAN.md`
|
|
- `PLAN.md`
|
|
- `PROGRESS.md`
|
|
- `docs/WORKARCHIVE.md` after implementation
|
|
|
|
Allowed if a focused helper boundary keeps `conversion.py` simpler:
|
|
|
|
- Create `src/pdf2md/page_grouping.py`
|
|
- Create `tests/test_page_grouping.py`
|
|
|
|
Not allowed:
|
|
|
|
- Adding another conversion engine or runtime engine selector.
|
|
- Running page conversions in parallel by default. GTX 1070 Ti 8GB memory pressure makes sequential conversion the safe default.
|
|
- Adding cloud OCR, hosted LLM/VLM, remote document parsing, `--api-url`, router mode, HTTP client backends, or remote OpenAI-compatible endpoints.
|
|
- Making default tests depend on real MinerU, GPU, model files, network, Obsidian, MathJax, or `samples/`.
|
|
- Committing sample PDFs, generated `outputs/`, retained temporary page outputs, or `dist/pdf2md-ui.exe`.
|
|
|
|
## Product Behavior
|
|
|
|
### Activation
|
|
|
|
Existing non-chunked conversion remains unchanged:
|
|
|
|
```powershell
|
|
uv run pdf2md convert paper.pdf --out outputs
|
|
```
|
|
|
|
Grouped page conversion is enabled by `--chunk-pages`:
|
|
|
|
```powershell
|
|
uv run pdf2md convert paper.pdf --out outputs --chunk-pages
|
|
uv run pdf2md convert paper.pdf --out outputs --chunk-pages 20
|
|
uv run pdf2md convert paper.pdf --out outputs --chunk-pages 1
|
|
```
|
|
|
|
Behavior:
|
|
|
|
- `--chunk-pages` means output group size.
|
|
- `--chunk-pages 20` converts pages 1, 2, 3, ... as independent one-page MinerU jobs, then emits grouped outputs covering pages 1-20, 21-40, and so on.
|
|
- `--chunk-pages 1` emits one final output file per source page.
|
|
- `convert_pdf(..., chunk_pages=N)` still returns `BatchConversionResult`; each `ConversionResult` represents one final grouped output file, not each internal one-page MinerU run.
|
|
|
|
### Output Naming
|
|
|
|
Use the existing part/page-range naming shape for grouped outputs:
|
|
|
|
```text
|
|
<stem>.part-001.pages-001-020.md
|
|
<stem>.part-001.pages-001-020.metadata.json
|
|
<stem>.part-001.pages-001-020.report.md
|
|
<stem>.part-001.pages-001-020.assets/
|
|
|
|
<stem>.part-002.pages-021-040.md
|
|
...
|
|
```
|
|
|
|
If a 13-page PDF is converted with `--chunk-pages 20`, it emits:
|
|
|
|
```text
|
|
<stem>.part-001.pages-001-013.md
|
|
<stem>.part-001.pages-001-013.metadata.json
|
|
<stem>.part-001.pages-001-013.report.md
|
|
<stem>.part-001.pages-001-013.assets/
|
|
```
|
|
|
|
This is an intentional behavior change from Sprint 10: short PDFs in chunk mode no longer bypass chunk mode and no longer write `<stem>.md`.
|
|
|
|
### Internal Page Conversion
|
|
|
|
For every source page in chunk mode:
|
|
|
|
- Write a one-page temporary PDF with pypdf.
|
|
- Run the existing local MinerU adapter against that one-page PDF.
|
|
- Normalize Markdown, copy page assets into a temporary page assets directory, run MathJax checks/repair, and run Sprint 13 text fidelity diagnostics against the original source page.
|
|
- Delete the one-page temporary PDF and temporary per-page final files after grouped output generation.
|
|
|
|
The implementation should reuse existing conversion primitives where practical, but it must avoid writing final public files for every page before grouping.
|
|
|
|
### Markdown Grouping
|
|
|
|
For each output group:
|
|
|
|
- Concatenate successful page Markdown in source page order.
|
|
- Separate pages with blank lines and an HTML comment that is invisible in Obsidian preview:
|
|
|
|
```markdown
|
|
<!-- source-page: 7 -->
|
|
```
|
|
|
|
- Do not add visible page headings or instructional text.
|
|
- If a page conversion fails, do not invent Markdown for that page. Add an invisible comment at the page boundary:
|
|
|
|
```markdown
|
|
<!-- source-page: 7 conversion failed; see report -->
|
|
```
|
|
|
|
- Preserve Obsidian-friendly math delimiters and display math spacing after concatenation.
|
|
|
|
### Asset Grouping
|
|
|
|
Assets from temporary per-page outputs must be copied into the grouped assets directory with collision-proof names.
|
|
|
|
Recommended destination layout:
|
|
|
|
```text
|
|
<stem>.part-001.pages-001-020.assets/page-001/<asset-name>
|
|
<stem>.part-001.pages-001-020.assets/page-002/<asset-name>
|
|
```
|
|
|
|
Markdown image links must be rewritten to the grouped assets directory. This keeps repeated MinerU asset filenames from different pages from overwriting each other.
|
|
|
|
### Metadata And Report Grouping
|
|
|
|
Grouped metadata must be derived from per-page conversion records plus group-level checks.
|
|
|
|
Required metadata behavior:
|
|
|
|
- `source_pdf` remains the original source PDF path.
|
|
- `source_sha256` remains the original source PDF hash.
|
|
- `pages` contains one page record per source page in the group.
|
|
- Page indexes in grouped metadata are group-local zero-based indexes.
|
|
- Original source page numbers remain visible in chunk/page conversion provenance.
|
|
- Warnings from per-page conversions are preserved with adjusted group-local page indexes.
|
|
- Warnings for failed page conversions are added with original source page context.
|
|
- `text_fidelity` records are carried from one-page checks and keep exact `source_page_number` values.
|
|
- Summary counts are aggregated from the grouped metadata and grouped Markdown.
|
|
|
|
Required `engine_options` shape:
|
|
|
|
```json
|
|
{
|
|
"chunk": {
|
|
"original_source_pdf": "...",
|
|
"chunk_index": 1,
|
|
"total_chunks": 3,
|
|
"source_page_start": 1,
|
|
"source_page_end": 20,
|
|
"chunk_page_count": 20
|
|
},
|
|
"page_conversion": {
|
|
"mode": "single_page",
|
|
"mineru_input_page_count": 1,
|
|
"output_group_page_count": 20,
|
|
"failed_source_pages": []
|
|
}
|
|
}
|
|
```
|
|
|
|
Report Markdown must continue to include the existing chunk context line and should add a concise page-conversion line, for example:
|
|
|
|
```text
|
|
- Page conversion mode: single-page MinerU inputs, grouped output size: 20
|
|
```
|
|
|
|
## Failure Policy
|
|
|
|
- Convert pages sequentially.
|
|
- If a page fails, continue with later pages.
|
|
- If at least one page in a group succeeds, write the grouped Markdown/metadata/report and mark final status `partial`.
|
|
- If every page in a group fails, return a failed `ConversionResult` for that grouped output and do not write Markdown for that group.
|
|
- Failed pages must be visible in metadata/report warnings.
|
|
- There is no silent fallback and no retry loop in this sprint.
|
|
|
|
## Architecture Plan
|
|
|
|
### WP14.1: Page And Group Planning
|
|
|
|
Actions:
|
|
|
|
- Extend `pdf_splitter.py` or add `page_grouping.py` with project-owned records for:
|
|
- one-page MinerU input plans,
|
|
- final output group plans,
|
|
- original source page ranges,
|
|
- deterministic output stems.
|
|
- Keep pypdf page extraction local and temporary.
|
|
- Validate output group size as a positive integer.
|
|
- Plan output groups before conversion starts so overwrite/conflict behavior remains deterministic.
|
|
|
|
Expected output:
|
|
|
|
- A 41-page PDF with group size 20 plans 41 one-page MinerU inputs and 3 final grouped outputs.
|
|
- A 13-page PDF with group size 20 plans 13 one-page MinerU inputs and 1 final grouped output.
|
|
|
|
### WP14.2: Conversion Orchestration
|
|
|
|
Actions:
|
|
|
|
- Rework chunk-mode `convert_pdf()` and `convert_input()` orchestration so `chunk_pages` creates grouped output tasks.
|
|
- Run one-page MinerU inputs in source-page order.
|
|
- Keep temporary page PDFs and intermediate page outputs under local temporary directories.
|
|
- Keep `BatchConversionResult` at the grouped-output level.
|
|
- Keep strict-local validation unchanged.
|
|
|
|
Expected output:
|
|
|
|
- The public API keeps returning multiple grouped results in chunk mode while the adapter is called once per source page internally.
|
|
|
|
### WP14.3: Markdown And Asset Group Assembly
|
|
|
|
Actions:
|
|
|
|
- Build a focused helper to merge page Markdown and page assets into a grouped output.
|
|
- Insert invisible `<!-- source-page: N -->` boundaries.
|
|
- Rewrite per-page asset links to `page-NNN/` asset subdirectories.
|
|
- Run final group-level local quality checks after asset rewriting.
|
|
|
|
Expected output:
|
|
|
|
- Grouped Markdown renders in Obsidian and assets do not collide across pages.
|
|
|
|
### WP14.4: Metadata, Warnings, And Report Assembly
|
|
|
|
Actions:
|
|
|
|
- Aggregate per-page metadata into grouped metadata.
|
|
- Adjust page indexes from page-local `0` to group-local indexes.
|
|
- Preserve original source page numbers in `engine_options` and text fidelity records.
|
|
- Add `page_conversion` engine options.
|
|
- Add a report line for single-page conversion mode and grouped output size.
|
|
|
|
Expected output:
|
|
|
|
- Metadata/report can explain both facts: MinerU saw one page at a time, while the user received grouped Markdown files.
|
|
|
|
### WP14.5: CLI, UI, And Documentation
|
|
|
|
Actions:
|
|
|
|
- Update CLI help for `--chunk-pages` from "pre-conversion PDF chunking" to "group converted pages into output files of N pages; MinerU runs one page at a time."
|
|
- Update README and architecture docs with the new behavior.
|
|
- Update the Windows UI label/help text so the field represents output group size.
|
|
- Keep runner command construction using `--chunk-pages N`.
|
|
|
|
Expected output:
|
|
|
|
- Users do not confuse `--chunk-pages 20` with a 20-page MinerU input.
|
|
|
|
### WP14.6: Tests
|
|
|
|
Default fast tests:
|
|
|
|
- Generated blank local PDFs verify page count and group planning for 1, 13, 20, 21, 40, and 41 pages.
|
|
- `--chunk-pages` without a value still passes `20`.
|
|
- `convert_pdf(..., chunk_pages=20)` for 41 pages calls the fake adapter 41 times and returns 3 grouped `ConversionResult` objects.
|
|
- `convert_pdf(..., chunk_pages=20)` for 13 pages calls the fake adapter 13 times and returns 1 grouped output named `part-001.pages-001-013`.
|
|
- `convert_pdf(..., chunk_pages=1)` returns one grouped output per source page.
|
|
- Temporary one-page PDFs and temporary per-page outputs are deleted after conversion.
|
|
- A failed internal page conversion does not stop later pages and appears in grouped metadata/report.
|
|
- A group with only failed pages returns a failed result and writes no Markdown.
|
|
- Asset filenames from different pages do not collide in the grouped assets directory.
|
|
- Per-page warnings and text fidelity records are adjusted to group-local page indexes while preserving original source page numbers.
|
|
- Existing non-chunked conversion tests keep passing unchanged.
|
|
- UI runner tests continue to build fixed argument lists with `shell=False`.
|
|
|
|
Optional local validation:
|
|
|
|
```powershell
|
|
$env:MINERU_MODEL_SOURCE='local'
|
|
$pdf = (Get-ChildItem samples -Filter '2007*.pdf' | Select-Object -First 1).FullName
|
|
uv run pdf2md convert $pdf --out outputs\sprint14-2007-page-grouped --overwrite --chunk-pages
|
|
```
|
|
|
|
Expected optional validation:
|
|
|
|
- The 13-page Korean sample emits one grouped Markdown file for pages 1-13.
|
|
- Metadata/report show exact page-level text fidelity records.
|
|
- Generated outputs stay ignored and uncommitted.
|
|
|
|
## Acceptance Criteria
|
|
|
|
- Chunk mode runs MinerU on one-page temporary PDFs only.
|
|
- `chunk_pages` controls final grouped output page count.
|
|
- Default group size remains 20 when `--chunk-pages` is supplied without a value.
|
|
- Grouped Markdown, metadata JSON, report Markdown, and grouped assets directory are written.
|
|
- Grouped metadata preserves original source PDF, original source SHA-256, group page range, one-page conversion mode, page warnings, and text fidelity provenance.
|
|
- Failed page conversions are explicit, nonfatal to later pages, and visible in report/metadata.
|
|
- Default tests remain fast and local.
|
|
- Strict-local policy remains unchanged.
|
|
- Non-chunked conversion behavior remains backward-compatible.
|
|
|
|
## Hard Failure Criteria
|
|
|
|
- Chunk mode sends more than one source page to MinerU in a single temporary PDF.
|
|
- `--chunk-pages` continues to mean MinerU input chunk size after this sprint.
|
|
- Grouped outputs lose source page provenance or hide failed pages.
|
|
- Asset links collide or point outside the grouped assets directory.
|
|
- Default tests require real MinerU, GPU, model files, network, Obsidian, MathJax, or `samples/`.
|
|
- The implementation adds a remote API/backend path, alternate conversion engine, router mode, or OpenAI-compatible backend.
|
|
- Sample PDFs, generated outputs, retained temporary page outputs, or `dist/pdf2md-ui.exe` are committed.
|
|
|
|
## Verification Commands
|
|
|
|
```powershell
|
|
uv run pytest tests/test_pdf_splitter.py tests/test_conversion.py tests/test_cli.py tests/test_paths.py tests/test_metadata.py tests/test_report.py tests/test_ui_runner.py
|
|
uv run pytest
|
|
git diff --check
|
|
git status --short --untracked-files=all
|
|
```
|
|
|
|
Optional local validation command is listed in WP14.6 and should be run only when a long GPU conversion is acceptable.
|
|
|
|
## Handoff Requirements
|
|
|
|
After implementation:
|
|
|
|
- Update `PROGRESS.md` with files changed, commands run, test outcomes, optional sample validation outcome, known failures, residual risks, and next action.
|
|
- Archive completed implementation details in `docs/WORKARCHIVE.md` after verification.
|
|
- Keep sample PDFs, generated outputs, retained temporary page outputs, and build artifacts out of the commit.
|
|
- Record whether the 2007 Korean sample was validated with grouped page conversion and how many grouped outputs were produced.
|
|
|
|
Implementation handoff on 2026-05-11:
|
|
|
|
- Implemented grouped page conversion in `src/pdf2md/conversion.py` with one-page temporary MinerU inputs and grouped public outputs.
|
|
- Added report output for `page_conversion` engine options.
|
|
- Updated CLI help, UI label text, README, architecture, implementation plan, and coordination/archive docs.
|
|
- Verification: targeted Sprint 14 tests passed, the 101-test related suite passed, and full `uv run pytest` passed 202 tests with 1 optional skip.
|
|
- Optional real MinerU validation on the 2007 Korean sample was not run during this implementation pass.
|
|
|
|
## Future Sprint Boundary
|
|
|
|
A later sprint may make grouped page conversion the default even without `--chunk-pages`, add resumable page caches, or add a debug option to retain intermediate per-page outputs. Those behaviors are intentionally out of Sprint 14 scope.
|