Sprint 16 Contract: Simplified Output Layout

Status: Implemented

Last updated: 2026-05-12

Objective

Simplify conversion outputs so each input PDF gets one predictable output folder named after the PDF stem, all images live under one images folder, Markdown parts use _001, _002 numbering, one human-readable report is written per PDF, and no metadata JSON file is persisted.

This sprint changes the public output contract. It supersedes the older v1 output layout that wrote sibling <stem>.md, <stem>.assets, <stem>.metadata.json, and <stem>.report.md files.

Product Output Contract

For an input PDF:

paper.pdf

and output root:

outputs/

write:

outputs/
  paper/
    paper_001.md
    paper_002.md
    paper_report.md
    images/
      ...

Rules:

  • paper is the PDF stem, meaning the original filename without .pdf.
  • A one-part conversion still writes paper_001.md.
  • A multi-part conversion writes paper_001.md, paper_002.md, and so on.
  • Part numbers are zero-padded to at least three digits; the width grows only when the part count exceeds 999.
  • All generated image and media assets for the PDF live under paper/images/.
  • Markdown links must point to images/<asset-name>.
  • The report is a single file at paper/paper_report.md.
  • No <stem>.metadata.json, part metadata JSON, or sidecar metadata JSON is written.
  • Internal metadata records may still be built in memory to produce reports, warnings, counts, and ConversionResult fields.
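
The rules above can be sketched as a small path helper. This is a minimal illustration, not the code in src/pdf2md/paths.py: the names part_filename and document_layout are hypothetical, and the padding rule is one reading of the contract (every part widens once the part count passes 999).

```python
from pathlib import Path


def part_filename(stem: str, index: int, part_count: int) -> str:
    """Name one Markdown part, zero-padded to at least three digits."""
    if not 1 <= index <= part_count:
        raise ValueError("index must be between 1 and part_count")
    # Assumption: the width widens for every part once part_count passes 999.
    width = max(3, len(str(part_count)))
    return f"{stem}_{index:0{width}d}.md"


def document_layout(out_root: Path, pdf_path: Path, part_count: int) -> dict:
    """Plan the per-PDF folder, part files, images folder, and report."""
    stem = pdf_path.stem  # original filename without the .pdf suffix
    folder = out_root / stem
    return {
        "folder": folder,
        "parts": [
            folder / part_filename(stem, i, part_count)
            for i in range(1, part_count + 1)
        ],
        "images": folder / "images",
        "report": folder / f"{stem}_report.md",
    }


layout = document_layout(Path("outputs"), Path("paper.pdf"), part_count=2)
print([p.name for p in layout["parts"]])  # ['paper_001.md', 'paper_002.md']
```

Because the folder and file prefix both come from Path.stem, a non-ASCII stem such as a Korean filename passes through unchanged.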

Contract Assumptions

  • The user request "metadata is not needed" means metadata JSON should not be written as a user-facing output file. It does not mean removing internal metadata objects needed for report generation and warning aggregation.
  • Keep --chunk-pages semantics from Sprint 14: when enabled, MinerU receives one source page per run and final Markdown files are grouped by chunk_pages.
  • If --chunk-pages is absent, the whole PDF is still converted in one MinerU run and written as <stem>_001.md.
  • Keep --chunk-pages without a value as the default grouping size of 20.
  • Keep --metadata accepted as a backward-compatible no-op for one sprint, but update help text to say metadata JSON output is disabled in the simplified layout.
  • pdf2md recheck remains supported only for legacy outputs that still have adjacent metadata JSON. New simplified outputs should fail recheck clearly until a later sprint designs metadata-free recheck.
  • Recursive directory conversion should preserve the discovered relative parent before the PDF stem folder: outputs/<relative-parent>/<stem>/<stem>_001.md.
  • If two inputs would map to the same output folder and overwrite is false, fail during preflight. Do not invent automatic suffixes.
  • --keep-raw should place raw MinerU diagnostics under paper/raw/ so raw outputs do not clutter the main folder.
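
The --chunk-pages grouping described above can be illustrated with a short sketch. It assumes 1-based inclusive page ranges (matching the older pages-001-020 names) and a hypothetical helper name, group_pages; it only computes which source pages land in which Markdown part and does not model the one-page-per-run MinerU calls.

```python
from typing import List, Optional, Tuple


def group_pages(page_count: int, chunk_pages: Optional[int]) -> List[Tuple[int, int]]:
    """Return inclusive (first_page, last_page) ranges, one per Markdown part."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if chunk_pages is None:
        # No --chunk-pages: the whole PDF becomes the single part <stem>_001.md.
        return [(1, page_count)]
    if chunk_pages < 1:
        raise ValueError("chunk_pages must be at least 1")
    return [
        (start, min(start + chunk_pages - 1, page_count))
        for start in range(1, page_count + 1, chunk_pages)
    ]


print(group_pages(45, 20))  # [(1, 20), (21, 40), (41, 45)]
```

The position of each range in the returned list plus one gives the part index, so the third range here would map to <stem>_003.md.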

Touched Surfaces

Allowed during implementation:

  • Modify src/pdf2md/paths.py.
  • Modify src/pdf2md/pdf_splitter.py only if part naming needs helper support.
  • Modify src/pdf2md/conversion.py.
  • Modify src/pdf2md/report.py or add a focused aggregate report helper if one report needs multiple part summaries.
  • Modify src/pdf2md/cli.py.
  • Modify src/pdf2md_ui/runner.py and src/pdf2md_ui/app.py only if UI text or expected output descriptions mention metadata/report paths.
  • Modify tests/test_paths.py.
  • Modify tests/test_conversion.py.
  • Modify tests/test_cli.py.
  • Modify tests/test_report.py.
  • Modify tests/test_ui_runner.py only if UI command/output assumptions change.
  • Modify tests/integration/test_v1_fast_release_gate.py.
  • Modify tests/integration/test_optional_mineru_fixtures.py.
  • Modify README.md.
  • Modify PRD.md.
  • Modify ARCHITECTURE.md.
  • Modify docs/V1IMPLEMENTATIONPLAN.md.
  • Modify PLAN.md.
  • Modify PROGRESS.md.
  • Modify docs/WORKARCHIVE.md after implementation.

Not allowed:

  • Do not change MinerU 3.1.0 as the fixed engine.
  • Do not add another conversion engine.
  • Do not add remote/API/backend paths.
  • Do not change --gpu, --mineru-profile, or strict-local behavior except where report text reflects the new layout.
  • Do not make default tests depend on real MinerU, GPU, CUDA, model files, network, Obsidian, MathJax, or samples/.
  • Do not commit generated outputs/, sample PDFs, local model files, or dist/pdf2md-ui.exe.

Architecture Plan

WP16.1: Document-Level Output Layout

Add or reshape path planning so final outputs are planned per source PDF folder instead of as sibling files.

Expected final paths for a single PDF:

<out>/<stem>/<stem>_001.md
<out>/<stem>/images/
<out>/<stem>/<stem>_report.md

Expected final paths for recursive input:

<out>/<relative-parent>/<stem>/<stem>_001.md
<out>/<relative-parent>/<stem>/images/
<out>/<relative-parent>/<stem>/<stem>_report.md

Implementation guidance:

  • Keep DiscoveredPdf.relative_parent behavior.
  • Add a focused part-planning helper rather than encoding final output names through fake temporary PDF filenames.
  • Keep PlannedOutput if the existing conversion code can use it cleanly, but allow multiple Markdown parts to share the same assets_dir and report_path.
  • Duplicate-path detection must reject duplicate Markdown files and raw directories, but it must allow shared images/ and shared report paths for parts belonging to the same source PDF.
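
A rough sketch of the duplicate-path rule in the last bullet follows. PlannedPart is a hypothetical stand-in for the project's PlannedOutput, whose real fields are not shown in this contract, and find_conflicts is an illustrative name. Markdown files and raw directories must be unique across the plan; images/ and report paths may repeat only among parts of the same source PDF.

```python
from collections import Counter
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional


@dataclass(frozen=True)
class PlannedPart:
    """Hypothetical planned output for one Markdown part."""
    source_pdf: Path
    markdown_path: Path
    assets_dir: Path
    report_path: Path
    raw_dir: Optional[Path] = None


def find_conflicts(parts: List[PlannedPart]) -> List[str]:
    """Return sorted conflict messages; an empty list means the plan is safe."""
    conflicts = set()
    # Markdown files and raw directories must never repeat.
    unique_counts = Counter()
    for part in parts:
        unique_counts[part.markdown_path] += 1
        if part.raw_dir is not None:
            unique_counts[part.raw_dir] += 1
    for path, count in unique_counts.items():
        if count > 1:
            conflicts.add(f"duplicate output path: {path}")
    # images/ and the report may be shared, but only within one source PDF.
    owners: Dict[Path, Path] = {}
    for part in parts:
        for shared in (part.assets_dir, part.report_path):
            owner = owners.setdefault(shared, part.source_pdf)
            if owner != part.source_pdf:
                conflicts.add(f"shared path claimed by two PDFs: {shared}")
    return sorted(conflicts)
```

Running this check during preflight, before any conversion starts, matches the contract's requirement to fail early instead of inventing automatic suffixes.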

WP16.2: Markdown Part Numbering

Replace public part names:

paper.part-001.pages-001-020.md
paper.part-002.pages-021-040.md

with:

paper_001.md
paper_002.md

Rules:

  • Part index is based on final output group order, not source page number.
  • The report must still record source page ranges for each part.
  • Failed groups should not create a Markdown file, but the report must mention the failed part and source page range.

WP16.3: Shared Images Folder

Replace per-output asset directories:

paper.part-001.pages-001-020.assets/
paper.part-002.pages-021-040.assets/

with:

paper/images/

Implementation guidance:

  • Copy all assets for one source PDF into the shared images/ folder.
  • Rewrite Markdown links to images/<asset-name>.
  • Use deterministic collision-safe filenames. Recommended pattern:
    • page-known assets: page-001_<original-name>, appending a numeric suffix starting at -002 when two assets would otherwise share a name.
    • page-unknown assets: asset-001<suffix>, where <suffix> is the original file extension when available.
  • Keep asset-link validation pointed at the shared images/ directory.
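
The naming and link-rewriting guidance above might look roughly like the following. Both function names are hypothetical, and the placement of the collision suffix before the file extension is an assumption, since the contract only gives the page-001_<original-name> pattern. The link regex covers plain ![alt](target) image links and leaves anything else untouched.

```python
import re
from pathlib import PurePosixPath
from typing import Dict, Optional, Set

_IMAGE_LINK = re.compile(r"(!\[[^\]]*\]\()([^)\s]+)(\))")


def shared_asset_name(original_name: str, page: Optional[int], taken: Set[str]) -> str:
    """Pick a unique filename for images/ and record it in `taken`."""
    original = PurePosixPath(original_name)
    suffix = original.suffix  # e.g. ".png", or "" when there is none
    if page is not None:
        base = f"page-{page:03d}_{original.stem}"
        candidate = f"{base}{suffix}"
        n = 2
        while candidate in taken:
            # Assumption: the collision counter goes before the extension.
            candidate = f"{base}-{n:03d}{suffix}"
            n += 1
    else:
        n = 1
        candidate = f"asset-{n:03d}{suffix}"
        while candidate in taken:
            n += 1
            candidate = f"asset-{n:03d}{suffix}"
    taken.add(candidate)
    return candidate


def rewrite_image_links(markdown: str, renamed: Dict[str, str]) -> str:
    """Point known image targets at images/<asset-name>."""
    def swap(match):
        target = match.group(2)
        if target not in renamed:
            return match.group(0)
        return f"{match.group(1)}images/{renamed[target]}{match.group(3)}"
    return _IMAGE_LINK.sub(swap, markdown)
```

Names stay deterministic as long as assets are processed in a stable order, for example by source page and then by original name.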

WP16.4: One Report, No Metadata JSON

Stop writing metadata JSON as a user-facing output file.

Implementation guidance:

  • Continue building internal metadata dictionaries or records for each part so report generation and ConversionResult summaries stay traceable.
  • Add an aggregate report path at <stem>/<stem>_report.md.
  • The report must include:
    • source PDF path,
    • output folder path,
    • Markdown part list with page ranges,
    • engine and engine options,
    • final status,
    • warning count,
    • asset count,
    • missing/invalid asset link counts,
    • inline/display formula counts,
    • MathJax render error count,
    • text fidelity summary when available,
    • failed source pages or failed parts when any exist,
    • warnings grouped by page or part.
  • ConversionResult.metadata_path should be None for simplified outputs.
  • ConversionResult.report_path should point to the shared report path.

WP16.5: CLI, UI, And Documentation

Update user-facing docs and tests to remove metadata JSON as an expected output.

Implementation guidance:

  • pdf2md convert summary may keep printing Markdown paths and warning counts.
  • Update CLI help for --metadata to say metadata JSON output is disabled or deprecated in the simplified layout.
  • Update README examples to show the new folder layout.
  • Update PRD and ARCHITECTURE so they no longer claim metadata JSON is required as a public artifact.
  • Keep internal provenance wording clear: warnings and report are still derived from internal metadata-like records.
  • Update optional fixture documentation so generated metadata JSON is not required for sample validation.
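
The flag behavior for --metadata and --chunk-pages can be sketched with argparse. This assumes an argparse-style parser purely for illustration; the contract does not say which CLI library src/pdf2md/cli.py uses, and build_parser and the help strings are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser that mirrors the flag semantics in this contract."""
    parser = argparse.ArgumentParser(prog="pdf2md-sketch")
    parser.add_argument("pdf")
    parser.add_argument("--out", required=True)
    parser.add_argument(
        "--chunk-pages",
        nargs="?",      # the value is optional
        type=int,
        const=20,       # bare --chunk-pages means a grouping size of 20
        default=None,   # flag absent: convert the whole PDF in one run
        help="Group Markdown output every N source pages (default N: 20).",
    )
    parser.add_argument(
        "--metadata",
        action="store_true",
        help="Accepted for backward compatibility; metadata JSON output "
             "is disabled in the simplified layout.",
    )
    return parser


args = build_parser().parse_args(["paper.pdf", "--out", "out", "--chunk-pages"])
print(args.chunk_pages)  # 20
```

The parsed metadata value is deliberately never read afterwards, which is what makes the flag a no-op.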

Implementation Task Plan

Task 1: Path Planning For Simplified Layout

Files:

  • Modify src/pdf2md/paths.py.
  • Modify tests/test_paths.py.

Steps:

  • Add failing tests showing plan_outputs() maps paper.pdf to out/paper/paper_001.md, out/paper/images, no metadata path, and out/paper/paper_report.md.
  • Add a failing test for Korean filenames, using the PDF stem exactly as the output folder and file prefix.
  • Add a failing test for recursive input preserving relative_parent.
  • Add a failing test that duplicate source stems in the same relative parent conflict before conversion.
  • Implement the minimal path planning changes.
  • Run uv run pytest tests/test_paths.py.
  • Commit path planning changes.

Task 2: Single-Output Conversion Writes Simplified Files

Files:

  • Modify src/pdf2md/conversion.py.
  • Modify tests/test_conversion.py.
  • Modify tests/test_cli.py.

Steps:

  • Add failing conversion tests showing a non-chunked fake-adapter conversion writes out/paper/paper_001.md, out/paper/images, and out/paper/paper_report.md.
  • Add failing assertions that no .metadata.json file is written and result.metadata_path is None.
  • Add failing CLI test showing pdf2md convert paper.pdf --out out creates the simplified folder.
  • Implement the minimal conversion changes for non-chunked output.
  • Run uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_paths.py.
  • Commit single-output conversion changes.

Task 3: Grouped Output Parts And Shared Images

Files:

  • Modify src/pdf2md/conversion.py.
  • Modify src/pdf2md/pdf_splitter.py only if a small helper is needed.
  • Modify tests/test_conversion.py.
  • Modify tests/test_cli.py.

Steps:

  • Add failing tests for chunk_pages=20 showing final Markdown names are paper_001.md, paper_002.md, not paper.part-...md.
  • Add failing tests proving all grouped assets are copied into paper/images/ and Markdown links use images/....
  • Add failing tests proving asset collisions across pages get deterministic unique filenames.
  • Add failing tests proving failed page conversions are represented in the shared report while later pages still convert.
  • Implement grouped output naming and shared image handling.
  • Run uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_pdf_splitter.py.
  • Commit grouped output changes.

Task 4: Aggregate Report Without Metadata JSON

Files:

  • Modify src/pdf2md/report.py or add a focused aggregate report helper.
  • Modify src/pdf2md/conversion.py.
  • Modify tests/test_report.py.
  • Modify tests/test_conversion.py.

Steps:

  • Add failing report tests for a one-file report listing multiple Markdown parts and source page ranges.
  • Add failing conversion tests proving only one report exists for a chunked PDF.
  • Add failing tests proving report summary totals combine all output parts.
  • Add failing tests proving all-failed conversions write a report but no Markdown part.
  • Implement aggregate report rendering from internal metadata records.
  • Run uv run pytest tests/test_report.py tests/test_conversion.py.
  • Commit report changes.

Task 5: Recheck, CLI Compatibility, UI Text, And Docs

Files:

  • Modify src/pdf2md/cli.py.
  • Modify src/pdf2md/conversion.py.
  • Modify src/pdf2md_ui/runner.py and src/pdf2md_ui/app.py only if text/output assumptions change.
  • Modify README.md.
  • Modify PRD.md.
  • Modify ARCHITECTURE.md.
  • Modify docs/V1IMPLEMENTATIONPLAN.md.
  • Modify tests/test_cli.py.
  • Modify tests/test_ui_runner.py only if UI behavior changes.
  • Modify tests/integration/test_v1_fast_release_gate.py.
  • Modify tests/integration/test_optional_mineru_fixtures.py.

Steps:

  • Add failing CLI tests proving --metadata remains accepted but no metadata JSON is written.
  • Add failing recheck test proving simplified outputs without metadata fail with a clear legacy-metadata message.
  • Update integration tests to require Markdown part files, one report, and image links, not metadata JSON.
  • Update README, PRD, ARCHITECTURE, and release-gate wording for the simplified layout.
  • Implement CLI/recheck/doc changes.
  • Run uv run pytest tests/test_cli.py tests/test_ui_runner.py tests/integration/test_v1_fast_release_gate.py.
  • Commit CLI, UI, integration, and documentation changes.
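
One possible shape for the legacy-only recheck guard is sketched below. It assumes the legacy layout kept <stem>.metadata.json next to <stem>.md, as described in the Objective; the exception and function names are hypothetical and the message wording is only an example of a clear failure.

```python
from pathlib import Path


class LegacyMetadataMissing(Exception):
    """Raised when recheck is asked to run without adjacent metadata JSON."""


def require_legacy_metadata(markdown_path: Path) -> Path:
    """Return the adjacent legacy metadata path or fail with a clear message."""
    metadata_path = markdown_path.with_name(f"{markdown_path.stem}.metadata.json")
    if not metadata_path.is_file():
        raise LegacyMetadataMissing(
            f"recheck needs legacy metadata JSON next to {markdown_path.name} "
            f"({metadata_path.name} was not found). Simplified outputs do not "
            "write metadata JSON, so recheck is not supported for them yet."
        )
    return metadata_path
```

Failing before any other work starts keeps the error specific to the missing legacy file instead of surfacing later as a generic read error.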

Task 6: Final Verification And Handoff

Files:

  • Modify PLAN.md.
  • Modify PROGRESS.md.
  • Modify docs/WORKARCHIVE.md after implementation.
  • Modify docs/Sprints/SPRINT16CONTRACT.md status and handoff fields.

Steps:

  • Run focused Sprint 16 verification:
    uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
  • Run full default verification:
    uv run pytest
  • Run diff check:
    git diff --check
  • Update PROGRESS.md with files changed, checks run, residual risks, and next actions.
  • Archive completed implementation evidence in docs/WORKARCHIVE.md.
  • Commit final coordination updates.

Verification Commands

uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
uv run pytest
git diff --check
git status --short --untracked-files=all

Optional local fixture validation after implementation:

$env:MINERU_MODEL_SOURCE='local'
uv run pdf2md convert samples\SolidElement.pdf --out outputs\SolidElement_sprint16_layout --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local

Expected optional validation:

  • Output folder is the provided output root plus SolidElement\, which for the command above is outputs\SolidElement_sprint16_layout\SolidElement\.
  • Markdown part is SolidElement_001.md, because the 6-page sample fits within the default grouping size of 20.
  • Report is SolidElement_report.md.
  • Images are under images\.
  • No metadata JSON exists.

Acceptance Criteria

  • Each input PDF writes into an output folder named after the PDF stem.
  • Markdown outputs are named <stem>_001.md, <stem>_002.md, and so on.
  • All image/media assets for one PDF live under <stem>/images/.
  • Markdown links point to images/....
  • Exactly one report file is written per input PDF at <stem>/<stem>_report.md.
  • No metadata JSON file is written for new conversions.
  • Internal warning, provenance, formula count, asset count, and text fidelity information remains available in the report.
  • Chunk mode still converts one source page per MinerU run and groups Markdown by chunk_pages.
  • Strict-local and MinerU-only constraints remain unchanged.
  • Default tests stay fast and local.

Hard Failure Criteria

  • Any new conversion writes .metadata.json as a public output.
  • Output files keep old part-001.pages-... names.
  • Assets are split into per-part .assets folders.
  • More than one report is written for one input PDF.
  • Markdown links point outside the PDF output folder.
  • Chunk mode stops using one source page per MinerU run.
  • Strict-local enforcement is weakened.
  • Default tests require real MinerU, GPU, model files, network, Obsidian, MathJax, or samples/.
  • Sample PDFs, generated outputs, local model files, or dist/pdf2md-ui.exe are committed.
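
Several of the file-level failure criteria above can be checked mechanically against a finished output folder. The following is a minimal sketch with a hypothetical function name, layout_violations; it covers only the metadata, naming, assets, report, and link criteria, and it treats any image link that does not start with images/ as a violation.

```python
import re
from pathlib import Path
from typing import List

_IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def layout_violations(folder: Path) -> List[str]:
    """List hard-failure findings for one <out>/<stem>/ folder."""
    stem = folder.name
    problems = []
    for entry in folder.rglob("*"):
        if entry.is_file() and entry.name.endswith(".metadata.json"):
            problems.append(f"metadata JSON written: {entry.name}")
        if ".part-" in entry.name:
            problems.append(f"old part naming: {entry.name}")
        if entry.is_dir() and entry.name.endswith(".assets"):
            problems.append(f"per-part assets folder: {entry.name}")
    reports = [e.name for e in folder.iterdir() if e.name.endswith("_report.md")]
    if reports != [f"{stem}_report.md"]:
        problems.append("expected exactly one report named <stem>_report.md")
    part_name = re.compile(re.escape(stem) + r"_\d{3,}\.md")
    for part in sorted(folder.iterdir()):
        if not part_name.fullmatch(part.name):
            continue
        for target in _IMAGE_LINK.findall(part.read_text(encoding="utf-8")):
            if not target.startswith("images/") or ".." in target.split("/"):
                problems.append(f"link outside images/: {target} in {part.name}")
    return problems
```

An empty list means none of the checked criteria were triggered; the behavioral criteria (chunk mode, strict-local, test dependencies) still need their own tests.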

Open Questions

  • Should metadata-free pdf2md recheck be restored in a later sprint by deriving enough state from the report and Markdown, or is rerunning conversion acceptable for simplified outputs?
  • Should raw MinerU outputs under --keep-raw be flattened into raw/ or kept per part under raw/<stem>_001/? This contract recommends per-part raw folders to avoid collisions.

Handoff Requirements

After implementation:

  • Update this contract status to Implemented.
  • Record final file layout examples in README.md.
  • Record verification commands and outcomes in PROGRESS.md.
  • Archive implementation and optional sample validation results in docs/WORKARCHIVE.md.
  • Keep generated outputs and sample PDFs uncommitted.

Implementation Handoff

  • Files changed: src/pdf2md/paths.py, src/pdf2md/conversion.py, src/pdf2md/report.py, src/pdf2md/cli.py, src/pdf2md_ui/runner.py, focused tests, and current docs.
  • Output layout implemented: <out>/<stem>/<stem>_001.md, additional numbered parts when grouped, <out>/<stem>/images/, and <out>/<stem>/<stem>_report.md.
  • Metadata JSON behavior: new conversions do not write public .metadata.json; ConversionResult.metadata_path is None; internal metadata-like records still feed reports and tests.
  • Recheck behavior: pdf2md recheck remains legacy-only and requires adjacent metadata JSON.
  • Verification recorded in PROGRESS.md: focused Sprint 16 tests passed, full uv run pytest passed 227 tests with 1 optional skip, and git diff --check passed with line-ending warnings only.