baram2584/PDFToMD

김경종 dc11880140 modify pdftomd

2026-05-14 10:16:59 +09:00

Sprint 14 Contract: Single-Page Conversion With Grouped Outputs

Status: Implemented
Last updated: 2026-05-11

Objective

Replace the current fixed-size pre-conversion chunking behavior with a safer long-PDF workflow:

When chunk mode is active, split the source PDF into one-page temporary PDFs.
Convert each one-page PDF sequentially through the existing local MinerU CLI adapter.
Merge successful converted page Markdown into grouped output files after every configured output group size.
Keep the default output group size at 20 pages when --chunk-pages is supplied without a value.

This sprint is motivated by local evidence from samples/2007쉘구조물의유한요소해석에대하여.pdf: a 5-page MinerU input chunk stalled on a GTX 1070 Ti (8 GB), while one-page conversion completed all 13 pages.

Current Precondition

MinerU 3.1.0 remains the only conversion engine.
Conversion runs through direct local mineru CLI execution only.
Strict-local mode allows only the direct CLI and the temporary local mineru-api that the MinerU CLI starts internally; remote API/backend paths remain prohibited.
pypdf is already available and used for local PDF chunk planning and temporary chunk PDF writing.
pdf2md convert currently supports --chunk-pages [PAGES].
Existing chunk mode currently treats chunk_pages as the MinerU input PDF page count and writes one final Markdown file per input chunk.
convert_pdf(..., chunk_pages=N) currently returns BatchConversionResult in chunk mode.
Sprint 13 text fidelity diagnostics are most accurate when each MinerU Markdown output maps to exactly one source page.

Contract Assumptions

Keep chunk mode opt-in for this sprint. If chunk_pages is None, the existing non-chunked full-PDF conversion path remains unchanged.
Keep the public option name --chunk-pages for CLI/API compatibility, but redefine its behavior in chunk mode as the output group size, not the MinerU input size.
If --chunk-pages is present without a value, use DEFAULT_CHUNK_PAGES == 20 as the output group size.
In chunk mode, even a PDF with fewer than chunk_pages pages is converted internally one page at a time and emitted as one grouped output file.
Final grouped outputs are the public conversion results. Temporary per-page Markdown, metadata, reports, assets, and one-page PDFs are not retained unless a later sprint explicitly adds debug retention.
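
The optional-value behavior of --chunk-pages can be sketched with argparse. This is a minimal sketch under an assumption: the contract does not say which CLI library pdf2md uses, so build_parser and positive_int are hypothetical names. Only --chunk-pages and DEFAULT_CHUNK_PAGES come from the contract.

```python
import argparse

DEFAULT_CHUNK_PAGES = 20  # value stated in the contract


def positive_int(text: str) -> int:
    """Parse a group size and reject zero or negative values."""
    value = int(text)
    if value < 1:
        raise argparse.ArgumentTypeError("group size must be a positive integer")
    return value


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real pdf2md CLI may be built differently.
    parser = argparse.ArgumentParser(prog="pdf2md-sketch")
    parser.add_argument("pdf")
    parser.add_argument(
        "--chunk-pages",
        nargs="?",                  # the value after the flag is optional
        const=DEFAULT_CHUNK_PAGES,  # used when the flag is given without a value
        default=None,               # flag absent: non-chunked path stays unchanged
        type=positive_int,
    )
    return parser
```

With this shape, an absent flag yields None, a bare flag yields 20, and an explicit value is validated as a positive integer.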

Touched Surfaces

Allowed during implementation:

src/pdf2md/pdf_splitter.py
src/pdf2md/conversion.py
src/pdf2md/paths.py
src/pdf2md/metadata.py
src/pdf2md/report.py
src/pdf2md/cli.py
src/pdf2md_ui/app.py
src/pdf2md_ui/runner.py
tests/test_pdf_splitter.py
tests/test_conversion.py
tests/test_cli.py
tests/test_paths.py
tests/test_metadata.py
tests/test_report.py
tests/test_ui_runner.py
README.md
ARCHITECTURE.md
docs/V1IMPLEMENTATIONPLAN.md
PLAN.md
PROGRESS.md
docs/WORKARCHIVE.md after implementation

Allowed if a focused helper boundary keeps conversion.py simpler:

Create src/pdf2md/page_grouping.py
Create tests/test_page_grouping.py

Not allowed:

Adding another conversion engine or runtime engine selector.
Running page conversions in parallel by default. GTX 1070 Ti 8GB memory pressure makes sequential conversion the safe default.
Adding cloud OCR, hosted LLM/VLM, remote document parsing, --api-url, router mode, HTTP client backends, or remote OpenAI-compatible endpoints.
Making default tests depend on real MinerU, GPU, model files, network, Obsidian, MathJax, or samples/.
Committing sample PDFs, generated outputs/, retained temporary page outputs, or dist/pdf2md-ui.exe.

Product Behavior

Activation

Existing non-chunked conversion remains unchanged:

uv run pdf2md convert paper.pdf --out outputs

Grouped page conversion is enabled by --chunk-pages:

uv run pdf2md convert paper.pdf --out outputs --chunk-pages
uv run pdf2md convert paper.pdf --out outputs --chunk-pages 20
uv run pdf2md convert paper.pdf --out outputs --chunk-pages 1

Behavior:

--chunk-pages means output group size.
--chunk-pages 20 converts pages 1, 2, 3, ... as independent one-page MinerU jobs, then emits grouped outputs covering pages 1-20, 21-40, and so on.
--chunk-pages 1 emits one final output file per source page.
convert_pdf(..., chunk_pages=N) still returns BatchConversionResult; each ConversionResult represents one final grouped output file, not each internal one-page MinerU run.

Output Naming

Use the existing part/page-range naming shape for grouped outputs:

<stem>.part-001.pages-001-020.md
<stem>.part-001.pages-001-020.metadata.json
<stem>.part-001.pages-001-020.report.md
<stem>.part-001.pages-001-020.assets/

<stem>.part-002.pages-021-040.md
...

If a 13-page PDF is converted with --chunk-pages 20, it emits:

<stem>.part-001.pages-001-013.md
<stem>.part-001.pages-001-013.metadata.json
<stem>.part-001.pages-001-013.report.md
<stem>.part-001.pages-001-013.assets/

This is an intentional behavior change from Sprint 10: short PDFs in chunk mode no longer bypass chunk mode and no longer write <stem>.md.
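
The naming shape above can be expressed as a small helper. This is an illustrative sketch; grouped_output_stem and grouped_output_names are hypothetical names, and the three-digit zero padding is inferred from the examples in this section.

```python
def grouped_output_stem(stem: str, part_index: int, first_page: int, last_page: int) -> str:
    """Return e.g. 'paper.part-001.pages-001-013' (1-based, inclusive page range)."""
    return f"{stem}.part-{part_index:03d}.pages-{first_page:03d}-{last_page:03d}"


def grouped_output_names(stem: str, part_index: int, first_page: int, last_page: int) -> dict[str, str]:
    """Return the four public names written for one grouped output."""
    base = grouped_output_stem(stem, part_index, first_page, last_page)
    return {
        "markdown": f"{base}.md",
        "metadata": f"{base}.metadata.json",
        "report": f"{base}.report.md",
        "assets": f"{base}.assets",  # directory name, written without a trailing slash
    }
```

The :03d format pads to at least three digits and widens automatically for page numbers above 999.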

Internal Page Conversion

For every source page in chunk mode:

Write a one-page temporary PDF with pypdf.
Run the existing local MinerU adapter against that one-page PDF.
Normalize Markdown, copy page assets into a temporary page assets directory, run MathJax checks/repair, and run Sprint 13 text fidelity diagnostics against the original source page.
Delete the one-page temporary PDF and temporary per-page final files after grouped output generation.

The implementation should reuse existing conversion primitives where practical, but it must avoid writing final public files for every page before grouping.
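
The first step, writing one-page temporary PDFs with pypdf, might look like the sketch below. The function names and the temporary file naming are assumptions for illustration, not the project's actual pdf_splitter.py API. The pypdf calls used are PdfReader, PdfWriter, add_page, and write.

```python
from pathlib import Path


def one_page_pdf_name(stem: str, page_number: int) -> str:
    """Hypothetical temporary file name for a 1-based source page."""
    return f"{stem}.page-{page_number:03d}.pdf"


def write_one_page_pdfs(source_pdf: Path, temp_dir: Path) -> list[Path]:
    """Write each source page to its own temporary PDF, in source page order."""
    # Imported here so the naming helper stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(source_pdf))
    written: list[Path] = []
    for page_number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)  # copy exactly one page into the new document
        target = temp_dir / one_page_pdf_name(source_pdf.stem, page_number)
        with target.open("wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Calling write_one_page_pdfs inside a tempfile.TemporaryDirectory() block is one way to satisfy the cleanup step, since the directory and its one-page PDFs are removed when the block exits.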

Markdown Grouping

For each output group:

Concatenate successful page Markdown in source page order.
Separate pages with blank lines and an HTML comment that is invisible in Obsidian preview:

<!-- source-page: 7 -->

Do not add visible page headings or instructional text.
If a page conversion fails, do not invent Markdown for that page. Add an invisible comment at the page boundary:

<!-- source-page: 7 conversion failed; see report -->

Preserve Obsidian-friendly math delimiters and display math spacing after concatenation.
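
The grouping rules above can be sketched as a pure function. This sketch assumes the boundary comment is placed before each page's content, including the first page, and that a failed page is represented by None. Both are assumptions, and merge_page_markdown is a hypothetical name.

```python
from typing import Optional


def merge_page_markdown(pages: list[tuple[int, Optional[str]]]) -> str:
    """Join page Markdown in source page order; None marks a failed page."""
    blocks: list[str] = []
    for page_number, markdown in sorted(pages, key=lambda item: item[0]):
        if markdown is None:
            # No Markdown is invented for a failed page, only an invisible marker.
            blocks.append(f"<!-- source-page: {page_number} conversion failed; see report -->")
        else:
            blocks.append(f"<!-- source-page: {page_number} -->\n\n{markdown.strip()}")
    # Blank lines between blocks keep display math and paragraphs separated.
    return "\n\n".join(blocks) + "\n"
```

Math delimiter normalization is left to the existing MathJax checks. This sketch only handles ordering and page boundaries.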

Asset Grouping

Assets from temporary per-page outputs must be copied into the grouped assets directory with collision-proof names.

Recommended destination layout:

<stem>.part-001.pages-001-020.assets/page-001/<asset-name>
<stem>.part-001.pages-001-020.assets/page-002/<asset-name>

Markdown image links must be rewritten to the grouped assets directory. This keeps repeated MinerU asset filenames from different pages from overwriting each other.
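
A minimal link-rewriting sketch follows. It assumes per-page Markdown uses plain relative image links such as ![alt](images/name.jpg) without titles; the actual link shape emitted by MinerU is not specified in this contract. rewrite_asset_links is a hypothetical helper name.

```python
import re
from pathlib import PurePosixPath

# Matches ![alt](target) where target contains no spaces or closing parenthesis.
IMAGE_LINK = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<target>[^)\s]+)\)")


def rewrite_asset_links(markdown: str, assets_dir_name: str, page_number: int) -> str:
    """Point relative image links at <assets_dir_name>/page-NNN/<asset-name>."""
    page_dir = f"page-{page_number:03d}"

    def replace(match: re.Match) -> str:
        target = match.group("target")
        if "://" in target:  # leave absolute URLs untouched
            return match.group(0)
        name = PurePosixPath(target).name  # keep only the file name
        return f"![{match.group('alt')}]({assets_dir_name}/{page_dir}/{name})"

    return IMAGE_LINK.sub(replace, markdown)
```

Because each page gets its own page-NNN/ subdirectory, two pages that both produce an asset with the same file name resolve to different destinations.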

Metadata And Report Grouping

Grouped metadata must be derived from per-page conversion records plus group-level checks.

Required metadata behavior:

source_pdf remains the original source PDF path.
source_sha256 remains the original source PDF hash.
pages contains one page record per source page in the group.
Page indexes in grouped metadata are group-local zero-based indexes.
Original source page numbers remain visible in chunk/page conversion provenance.
Warnings from per-page conversions are preserved with adjusted group-local page indexes.
Warnings for failed page conversions are added with original source page context.
text_fidelity records are carried from one-page checks and keep exact source_page_number values.
Summary counts are aggregated from the grouped metadata and grouped Markdown.

Required engine_options shape:

{
  "chunk": {
    "original_source_pdf": "...",
    "chunk_index": 1,
    "total_chunks": 3,
    "source_page_start": 1,
    "source_page_end": 20,
    "chunk_page_count": 20
  },
  "page_conversion": {
    "mode": "single_page",
    "mineru_input_page_count": 1,
    "output_group_page_count": 20,
    "failed_source_pages": []
  }
}
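
The shape above could be assembled by a helper like the following. It is a sketch under two assumptions that the contract does not state explicitly: chunk_page_count is the number of source pages actually in the group, and output_group_page_count is the configured group size. build_engine_options is a hypothetical name.

```python
def build_engine_options(
    original_source_pdf: str,
    chunk_index: int,
    total_chunks: int,
    source_page_start: int,
    source_page_end: int,
    output_group_page_count: int,
    failed_source_pages: list[int],
) -> dict:
    """Build the engine_options mapping for one grouped output."""
    return {
        "chunk": {
            "original_source_pdf": original_source_pdf,
            "chunk_index": chunk_index,
            "total_chunks": total_chunks,
            "source_page_start": source_page_start,
            "source_page_end": source_page_end,
            # Assumption: actual page count of this group (inclusive range).
            "chunk_page_count": source_page_end - source_page_start + 1,
        },
        "page_conversion": {
            "mode": "single_page",
            "mineru_input_page_count": 1,  # MinerU always sees one page
            # Assumption: the configured group size, not the actual count.
            "output_group_page_count": output_group_page_count,
            "failed_source_pages": sorted(failed_source_pages),
        },
    }
```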

Report Markdown must continue to include the existing chunk context line and should add a concise page-conversion line, for example:

- Page conversion mode: single-page MinerU inputs, grouped output size: 20

Failure Policy

Convert pages sequentially.
If a page fails, continue with later pages.
If at least one page in a group succeeds, write the grouped Markdown/metadata/report and mark final status partial.
If every page in a group fails, return a failed ConversionResult for that grouped output and do not write Markdown for that group.
Failed pages must be visible in metadata/report warnings.
There is no silent fallback and no retry loop in this sprint.
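
The status rule in this policy reduces to a three-way decision. In the sketch below, "partial" and "failed" are the status names used in this contract, while "success" is an assumed label for the all-pages-succeeded case. group_status is a hypothetical helper.

```python
def group_status(total_pages: int, failed_pages: int) -> str:
    """Return the final status of one grouped output."""
    if total_pages < 1 or not 0 <= failed_pages <= total_pages:
        raise ValueError("invalid page counts")
    if failed_pages == 0:
        return "success"  # assumed label; the contract does not name this status
    if failed_pages == total_pages:
        return "failed"   # no Markdown is written for this group
    return "partial"      # at least one page succeeded, outputs are written
```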

Architecture Plan

WP14.1: Page And Group Planning

Actions:

Extend pdf_splitter.py or add page_grouping.py with project-owned records for:
- one-page MinerU input plans,
- final output group plans,
- original source page ranges,
- deterministic output stems.
Keep pypdf page extraction local and temporary.
Validate output group size as a positive integer.
Plan output groups before conversion starts so overwrite/conflict behavior remains deterministic.

Expected output:

A 41-page PDF with group size 20 plans 41 one-page MinerU inputs and 3 final grouped outputs.
A 13-page PDF with group size 20 plans 13 one-page MinerU inputs and 1 final grouped output.
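
The expected planning output can be checked with a small pure-Python sketch. GroupPlan and plan_groups are hypothetical names, not the project's actual records. Page numbers are 1-based and ranges are inclusive, matching the output naming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupPlan:
    part_index: int  # 1-based output part number
    first_page: int  # 1-based, inclusive
    last_page: int   # 1-based, inclusive

    @property
    def page_numbers(self) -> range:
        """Source pages converted as independent one-page inputs."""
        return range(self.first_page, self.last_page + 1)


def plan_groups(page_count: int, group_size: int) -> list[GroupPlan]:
    """Plan all grouped outputs before any conversion starts."""
    if page_count < 1:
        raise ValueError("PDF must have at least one page")
    if group_size < 1:
        raise ValueError("group size must be a positive integer")
    plans: list[GroupPlan] = []
    for part_index, first in enumerate(range(1, page_count + 1, group_size), start=1):
        last = min(first + group_size - 1, page_count)  # final group may be short
        plans.append(GroupPlan(part_index, first, last))
    return plans
```

For 41 pages and a group size of 20, this yields ranges 1-20, 21-40, and 41-41, which together cover 41 one-page inputs.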

WP14.2: Conversion Orchestration

Actions:

Rework chunk-mode convert_pdf() and convert_input() orchestration so chunk_pages creates grouped output tasks.
Run one-page MinerU inputs in source-page order.
Keep temporary page PDFs and intermediate page outputs under local temporary directories.
Keep BatchConversionResult at the grouped-output level.
Keep strict-local validation unchanged.

Expected output:

The public API keeps returning multiple grouped results in chunk mode while the adapter is called once per source page internally.
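
The orchestration can be sketched with an injected page converter, which is also how a fake adapter could be counted in tests. convert_in_groups and its result dictionaries are hypothetical and much simpler than BatchConversionResult. Catching a broad Exception is a simplification for the sketch; real code would catch the adapter's specific error type.

```python
from typing import Callable, Optional


def convert_in_groups(
    page_count: int,
    group_size: int,
    convert_page: Callable[[int], str],
) -> list[dict]:
    """Convert pages one at a time and collect results per output group."""
    results: list[dict] = []
    for first in range(1, page_count + 1, group_size):
        last = min(first + group_size - 1, page_count)
        pages: dict[int, Optional[str]] = {}
        for page_number in range(first, last + 1):  # strictly sequential
            try:
                pages[page_number] = convert_page(page_number)
            except Exception:
                pages[page_number] = None  # record the failure and keep going
        failed = [number for number, markdown in pages.items() if markdown is None]
        results.append(
            {
                "first_page": first,
                "last_page": last,
                "pages": pages,
                "failed_source_pages": failed,
            }
        )
    return results
```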

WP14.3: Markdown And Asset Group Assembly

Actions:

Build a focused helper to merge page Markdown and page assets into a grouped output.
Insert invisible <!-- source-page: N --> boundaries.
Rewrite per-page asset links to page-NNN/ asset subdirectories.
Run final group-level local quality checks after asset rewriting.

Expected output:

Grouped Markdown renders in Obsidian and assets do not collide across pages.

WP14.4: Metadata, Warnings, And Report Assembly

Actions:

Aggregate per-page metadata into grouped metadata.
Adjust page indexes from the page-local index 0 of each one-page run to group-local zero-based indexes.
Preserve original source page numbers in engine_options and text fidelity records.
Add page_conversion engine options.
Add a report line for single-page conversion mode and grouped output size.

Expected output:

Metadata/report can explain both facts: MinerU saw one page at a time, while the user received grouped Markdown files.
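
The index adjustment amounts to simple arithmetic on the source page number. In this sketch, page_index is an assumed field name for the warning record, while source_page_number follows the name used for text fidelity records in this contract. to_group_local is a hypothetical helper.

```python
def to_group_local(warning: dict, source_page_number: int, group_first_page: int) -> dict:
    """Copy a one-page warning and re-index it for the grouped output."""
    adjusted = dict(warning)  # do not mutate the per-page record
    # Every one-page run reports index 0; shift it to the zero-based group index.
    adjusted["page_index"] = source_page_number - group_first_page
    # Keep the 1-based page number from the original PDF for provenance.
    adjusted["source_page_number"] = source_page_number
    return adjusted
```

For example, a warning from source page 27 in the group covering pages 21-40 gets group-local index 6 and keeps source page number 27.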

WP14.5: CLI, UI, And Documentation

Actions:

Update CLI help for --chunk-pages from "pre-conversion PDF chunking" to "group converted pages into output files of N pages; MinerU runs one page at a time."
Update README and architecture docs with the new behavior.
Update the Windows UI label/help text so the field represents output group size.
Keep runner command construction using --chunk-pages N.

Expected output:

Users do not confuse --chunk-pages 20 with a 20-page MinerU input.

WP14.6: Tests

Default fast tests:

Generated blank local PDFs verify page count and group planning for 1, 13, 20, 21, 40, and 41 pages.
--chunk-pages without a value still passes 20.
convert_pdf(..., chunk_pages=20) for 41 pages calls the fake adapter 41 times and returns 3 grouped ConversionResult objects.
convert_pdf(..., chunk_pages=20) for 13 pages calls the fake adapter 13 times and returns 1 grouped output named part-001.pages-001-013.
convert_pdf(..., chunk_pages=1) returns one grouped output per source page.
Temporary one-page PDFs and temporary per-page outputs are deleted after conversion.
A failed internal page conversion does not stop later pages and appears in grouped metadata/report.
A group with only failed pages returns a failed result and writes no Markdown.
Asset filenames from different pages do not collide in the grouped assets directory.
Per-page warnings and text fidelity records are adjusted to group-local page indexes while preserving original source page numbers.
Existing non-chunked conversion tests keep passing unchanged.
UI runner tests continue to build fixed argument lists with shell=False.

Optional local validation:

$env:MINERU_MODEL_SOURCE='local'
$pdf = (Get-ChildItem samples -Filter '2007*.pdf' | Select-Object -First 1).FullName
uv run pdf2md convert $pdf --out outputs\sprint14-2007-page-grouped --overwrite --chunk-pages

Expected optional validation:

The 13-page Korean sample emits one grouped Markdown file for pages 1-13.
Metadata/report show exact page-level text fidelity records.
Generated outputs stay ignored and uncommitted.

Acceptance Criteria

Chunk mode runs MinerU on one-page temporary PDFs only.
chunk_pages controls final grouped output page count.
Default group size remains 20 when --chunk-pages is supplied without a value.
Grouped Markdown, metadata JSON, report Markdown, and grouped assets directory are written.
Grouped metadata preserves original source PDF, original source SHA-256, group page range, one-page conversion mode, page warnings, and text fidelity provenance.
Failed page conversions are explicit, nonfatal to later pages, and visible in report/metadata.
Default tests remain fast and local.
Strict-local policy remains unchanged.
Non-chunked conversion behavior remains backward-compatible.

Hard Failure Criteria

Chunk mode sends more than one source page to MinerU in a single temporary PDF.
--chunk-pages continues to mean MinerU input chunk size after this sprint.
Grouped outputs lose source page provenance or hide failed pages.
Asset links collide or point outside the grouped assets directory.
Default tests require real MinerU, GPU, model files, network, Obsidian, MathJax, or samples/.
The implementation adds a remote API/backend path, alternate conversion engine, router mode, or OpenAI-compatible backend.
Sample PDFs, generated outputs, retained temporary page outputs, or dist/pdf2md-ui.exe are committed.

Verification Commands

uv run pytest tests/test_pdf_splitter.py tests/test_conversion.py tests/test_cli.py tests/test_paths.py tests/test_metadata.py tests/test_report.py tests/test_ui_runner.py
uv run pytest
git diff --check
git status --short --untracked-files=all

The optional local validation command is listed in WP14.6; run it only when a long GPU conversion is acceptable.

Handoff Requirements

After implementation:

Update PROGRESS.md with files changed, commands run, test outcomes, optional sample validation outcome, known failures, residual risks, and next action.
Archive completed implementation details in docs/WORKARCHIVE.md after verification.
Keep sample PDFs, generated outputs, retained temporary page outputs, and build artifacts out of the commit.
Record whether the 2007 Korean sample was validated with grouped page conversion and how many grouped outputs were produced.

Implementation handoff on 2026-05-11:

Implemented grouped page conversion in src/pdf2md/conversion.py with one-page temporary MinerU inputs and grouped public outputs.
Added report output for page_conversion engine options.
Updated CLI help, UI label text, README, architecture, implementation plan, and coordination/archive docs.
Verification: targeted Sprint 14 tests passed, the 101-test related suite passed, and full uv run pytest passed 202 tests with 1 optional skip.
Optional real MinerU validation on the 2007 Korean sample was not run during this implementation pass.

Future Sprint Boundary

A later sprint may make grouped page conversion the default even without --chunk-pages, add resumable page caches, or add a debug option to retain intermediate per-page outputs. Those behaviors are intentionally out of Sprint 14 scope.
