Sprint 16 Contract: Simplified Output Layout
Status: Implemented
Last updated: 2026-05-12
Objective
Simplify conversion outputs so each input PDF gets one predictable output folder named after the PDF stem, all images live under one images folder, Markdown parts use _001, _002 numbering, one human-readable report is written per PDF, and no metadata JSON file is persisted.
This sprint changes the public output contract. It supersedes the older v1 output layout that wrote sibling <stem>.md, <stem>.assets, <stem>.metadata.json, and <stem>.report.md files.
Product Output Contract
For an input PDF:
paper.pdf
and output root:
outputs/
write:
outputs/
  paper/
    paper_001.md
    paper_002.md
    paper_report.md
    images/
      ...
Rules:
- paper is the PDF stem, meaning the original filename without .pdf.
- A one-part conversion still writes paper_001.md.
- A multi-part conversion writes paper_001.md, paper_002.md, and so on.
- Part numbering uses at least three digits and grows only when the part count exceeds 999.
- All generated image and media assets for the PDF live under paper/images/.
- Markdown links must point to images/<asset-name>.
- The report is a single file at paper/paper_report.md.
- No <stem>.metadata.json, part metadata JSON, or sidecar metadata JSON is written.
- Internal metadata records may still be built in memory to produce reports, warnings, counts, and ConversionResult fields.
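The numbering rule above can be illustrated with a minimal sketch. The helper name part_filename is hypothetical and not part of the codebase. The sketch also assumes that once the part count exceeds 999, every part of that PDF is padded to the wider width, which the rules do not state explicitly.

```python
def part_filename(stem: str, index: int, part_count: int) -> str:
    """Return '<stem>_NNN.md' for a 1-based part index (hypothetical helper)."""
    if not 1 <= index <= part_count:
        raise ValueError(f"part index {index} outside 1..{part_count}")
    # At least three digits; wider only when the part count exceeds 999.
    # Assumption: all parts of one PDF share the same width.
    width = max(3, len(str(part_count)))
    return f"{stem}_{index:0{width}d}.md"


print(part_filename("paper", 1, 1))     # paper_001.md
print(part_filename("paper", 2, 12))    # paper_002.md
print(part_filename("paper", 7, 1000))  # paper_0007.md
```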
Contract Assumptions
- The user request "metadata is not needed" means metadata JSON should not be written as a user-facing output file. It does not mean removing internal metadata objects needed for report generation and warning aggregation.
- Keep --chunk-pages semantics from Sprint 14: when enabled, MinerU receives one source page per run and final Markdown files are grouped by chunk_pages.
- If --chunk-pages is absent, the whole PDF is still converted in one MinerU run and written as <stem>_001.md.
- Keep --chunk-pages without a value as the default grouping size of 20.
- Keep --metadata accepted as a backward-compatible no-op for one sprint, but update help text to say metadata JSON output is disabled in the simplified layout.
- pdf2md recheck remains supported only for legacy outputs that still have adjacent metadata JSON. New simplified outputs should fail recheck clearly until a later sprint designs metadata-free recheck.
- Recursive directory conversion should preserve the discovered relative parent before the PDF stem folder: outputs/<relative-parent>/<stem>/<stem>_001.md.
- If two inputs would map to the same output folder and overwrite is false, fail during preflight. Do not invent automatic suffixes.
- --keep-raw should place raw MinerU diagnostics under paper/raw/ so raw outputs do not clutter the main folder.
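The three --chunk-pages states and the --metadata no-op can be sketched with argparse from the standard library. This is an illustration only: the contract does not say which CLI framework src/pdf2md/cli.py uses, the prog name is made up, and treating --metadata as a boolean flag is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real CLI may be built differently.
    parser = argparse.ArgumentParser(prog="pdf2md-sketch")
    # Absent -> None (one MinerU run), bare flag -> 20, explicit value -> that value.
    parser.add_argument("--chunk-pages", nargs="?", type=int, const=20, default=None)
    # Assumption: --metadata is a boolean flag. It is parsed and then ignored.
    parser.add_argument(
        "--metadata",
        action="store_true",
        help="Deprecated no-op: metadata JSON output is disabled in the simplified layout.",
    )
    return parser


parser = build_parser()
print(parser.parse_args([]).chunk_pages)                      # None
print(parser.parse_args(["--chunk-pages"]).chunk_pages)       # 20
print(parser.parse_args(["--chunk-pages", "5"]).chunk_pages)  # 5
```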
Touched Surfaces
Allowed during implementation:
- Modify src/pdf2md/paths.py.
- Modify src/pdf2md/pdf_splitter.py only if part naming needs helper support.
- Modify src/pdf2md/conversion.py.
- Modify src/pdf2md/report.py or add a focused aggregate report helper if one report needs multiple part summaries.
- Modify src/pdf2md/cli.py.
- Modify src/pdf2md_ui/runner.py and src/pdf2md_ui/app.py only if UI text or expected output descriptions mention metadata/report paths.
- Modify tests/test_paths.py.
- Modify tests/test_conversion.py.
- Modify tests/test_cli.py.
- Modify tests/test_report.py.
- Modify tests/test_ui_runner.py only if UI command/output assumptions change.
- Modify tests/integration/test_v1_fast_release_gate.py.
- Modify tests/integration/test_optional_mineru_fixtures.py.
- Modify README.md.
- Modify PRD.md.
- Modify ARCHITECTURE.md.
- Modify docs/V1IMPLEMENTATIONPLAN.md.
- Modify PLAN.md.
- Modify PROGRESS.md.
- Modify docs/WORKARCHIVE.md after implementation.
Not allowed:
- Do not change MinerU 3.1.0 as the fixed engine.
- Do not add another conversion engine.
- Do not add remote/API/backend paths.
- Do not change --gpu, --mineru-profile, or strict-local behavior except where report text reflects the new layout.
- Do not make default tests depend on real MinerU, GPU, CUDA, model files, network, Obsidian, MathJax, or samples/.
- Do not commit generated outputs/, sample PDFs, local model files, or dist/pdf2md-ui.exe.
Architecture Plan
WP16.1: Document-Level Output Layout
Add or reshape path planning so final outputs are planned per source PDF folder instead of as sibling files.
Expected final paths for a single PDF:
<out>/<stem>/<stem>_001.md
<out>/<stem>/images/
<out>/<stem>/<stem>_report.md
Expected final paths for recursive input:
<out>/<relative-parent>/<stem>/<stem>_001.md
<out>/<relative-parent>/<stem>/images/
<out>/<relative-parent>/<stem>/<stem>_report.md
Implementation guidance:
- Keep DiscoveredPdf.relative_parent behavior.
- Add a focused part-planning helper rather than encoding final output names through fake temporary PDF filenames.
- Keep PlannedOutput if the existing conversion code can use it cleanly, but allow multiple Markdown parts to share the same assets_dir and report_path.
- Duplicate-path detection must reject duplicate Markdown files and raw directories, but it must allow shared images/ and shared report paths for parts belonging to the same source PDF.
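A rough sketch of the planning and duplicate-detection guidance above follows. PartPlan, plan_parts, and find_conflicts are hypothetical stand-ins, not the project's PlannedOutput or plan_outputs(). Raw directories are left out for brevity. The sketch treats a shared images/ or report path as a conflict only when two different source PDFs claim it.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class PartPlan:  # hypothetical stand-in, not the project's PlannedOutput
    source: str
    markdown_path: PurePosixPath
    images_dir: PurePosixPath
    report_path: PurePosixPath


def plan_parts(out_root: str, relative_parent: str, source: str, part_count: int) -> list:
    """Plan every part of one source PDF inside <out>/<relative-parent>/<stem>/."""
    stem = PurePosixPath(source).stem
    folder = PurePosixPath(out_root) / relative_parent / stem
    width = max(3, len(str(part_count)))
    return [
        PartPlan(
            source=source,
            markdown_path=folder / f"{stem}_{index:0{width}d}.md",
            images_dir=folder / "images",        # shared by all parts
            report_path=folder / f"{stem}_report.md",  # shared by all parts
        )
        for index in range(1, part_count + 1)
    ]


def find_conflicts(plans: list) -> list:
    """Return paths reused as Markdown, or shared across different source PDFs."""
    markdown_seen = set()
    owners = {}
    conflicts = set()
    for plan in plans:
        if plan.markdown_path in markdown_seen:
            conflicts.add(plan.markdown_path)
        markdown_seen.add(plan.markdown_path)
        for shared in (plan.images_dir, plan.report_path):
            # Sharing is fine within one source PDF, a conflict across two.
            if owners.setdefault(shared, plan.source) != plan.source:
                conflicts.add(shared)
    return sorted(conflicts)


plans = plan_parts("out", "docs", "docs/paper.pdf", 2)
print(plans[0].markdown_path)  # out/docs/paper/paper_001.md
print(find_conflicts(plans))   # []
```

Two inputs such as docs/paper.pdf and docs/paper.PDF share the stem paper, so planning both yields conflicts on the Markdown, images, and report paths, which matches the preflight failure the contract asks for.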
WP16.2: Markdown Part Numbering
Replace public part names:
paper.part-001.pages-001-020.md
paper.part-002.pages-021-040.md
with:
paper_001.md
paper_002.md
Rules:
- Part index is based on final output group order, not source page number.
- The report must still record source page ranges for each part.
- Failed groups should not create a Markdown file, but the report must mention the failed part and source page range.
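As an illustration of the numbering rule, the sketch below groups source pages by chunk_pages and numbers the groups in output order while keeping each group's source page range for the report. PartRange and group_pages are hypothetical names. How a failed group affects the numbering of later parts is not specified here and is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartRange:  # hypothetical record for one output group
    index: int       # 1-based position in output group order
    first_page: int  # first source page, 1-based
    last_page: int   # last source page, inclusive


def group_pages(page_count: int, chunk_pages: int) -> list[PartRange]:
    """Split pages 1..page_count into consecutive groups of chunk_pages."""
    if page_count < 1 or chunk_pages < 1:
        raise ValueError("page_count and chunk_pages must be positive")
    parts = []
    for index, first in enumerate(range(1, page_count + 1, chunk_pages), start=1):
        last = min(first + chunk_pages - 1, page_count)
        parts.append(PartRange(index, first, last))
    return parts


for part in group_pages(45, 20):
    print(f"paper_{part.index:03d}.md <- source pages {part.first_page}-{part.last_page}")
# paper_001.md <- source pages 1-20
# paper_002.md <- source pages 21-40
# paper_003.md <- source pages 41-45
```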
WP16.3: Shared Images Folder
Replace per-output asset directories:
paper.part-001.pages-001-020.assets/
paper.part-002.pages-021-040.assets/
with:
paper/images/
Implementation guidance:
- Copy all assets for one source PDF into the shared images/ folder.
- Rewrite Markdown links to images/<asset-name>.
- Use deterministic collision-safe filenames. Recommended pattern:
  - page-known assets: page-001_<original-name>, with -002 suffixes when needed.
  - page-unknown assets: asset-001<suffix>, preserving the original suffix when available.
- Keep asset-link validation pointed at the shared images/ directory.
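One possible reading of the recommended naming pattern is sketched below. shared_asset_name is a hypothetical helper. Placing the -002 counter before the file extension is an assumption, since the contract only says suffixes are added when needed. Names are deterministic as long as assets are processed in a stable order.

```python
from pathlib import PurePosixPath
from typing import Optional


def shared_asset_name(original_name: str, page: Optional[int], taken: set) -> str:
    """Pick a unique filename for the shared images/ folder and record it in taken."""
    original = PurePosixPath(original_name)
    suffix = original.suffix  # empty string when the original has no suffix
    counter = 1
    if page is None:
        # Page-unknown: asset-001<suffix>, asset-002<suffix>, ...
        candidate = f"asset-{counter:03d}{suffix}"
        while candidate in taken:
            counter += 1
            candidate = f"asset-{counter:03d}{suffix}"
    else:
        # Page-known: page-001_<original-name>, then -002, -003 on collision.
        # Assumption: the counter goes before the file extension.
        base = f"page-{page:03d}_{original.stem}"
        candidate = f"{base}{suffix}"
        while candidate in taken:
            counter += 1
            candidate = f"{base}-{counter:03d}{suffix}"
    taken.add(candidate)
    return candidate


taken = set()
print(shared_asset_name("fig.png", 1, taken))      # page-001_fig.png
print(shared_asset_name("fig.png", 1, taken))      # page-001_fig-002.png
print(shared_asset_name("fig.png", 2, taken))      # page-002_fig.png
print(shared_asset_name("blob.jpg", None, taken))  # asset-001.jpg
```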
WP16.4: One Report, No Metadata JSON
Stop writing metadata JSON as a user-facing output file.
Implementation guidance:
- Continue building internal metadata dictionaries or records for each part so report generation and ConversionResult summaries stay traceable.
- Add an aggregate report path at <stem>/<stem>_report.md.
- The report must include:
  - source PDF path,
  - output folder path,
  - Markdown part list with page ranges,
  - engine and engine options,
  - final status,
  - warning count,
  - asset count,
  - missing/invalid asset link counts,
  - inline/display formula counts,
  - MathJax render error count,
  - text fidelity summary when available,
  - failed source pages or failed parts when any exist,
  - warnings grouped by page or part.
- ConversionResult.metadata_path should be None for simplified outputs.
- ConversionResult.report_path should point to the shared report path.
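A minimal sketch of aggregate rendering from in-memory records follows. It covers only a few of the required fields. The record keys, the status values ok, partial, and failed, and the report headings are all assumptions made for illustration, not the format used by src/pdf2md/report.py.

```python
def render_report(source_pdf: str, output_dir: str, parts: list) -> str:
    """Render one Markdown report from per-part records (plain dicts here)."""
    written = [p for p in parts if p["status"] == "ok"]
    failed = [p for p in parts if p["status"] != "ok"]
    if not failed:
        status = "ok"
    elif written:
        status = "partial"
    else:
        status = "failed"
    lines = [
        f"# Conversion report: {source_pdf}",
        "",
        f"- Output folder: {output_dir}",
        f"- Final status: {status}",
        # Totals combine every part, including failed ones.
        f"- Warning count: {sum(len(p['warnings']) for p in parts)}",
        f"- Asset count: {sum(p['asset_count'] for p in parts)}",
        "",
        "## Markdown parts",
    ]
    for p in written:
        lines.append(f"- {p['markdown']}: pages {p['first_page']}-{p['last_page']}")
    if failed:
        # Failed parts have no Markdown file but still appear with their page range.
        lines += ["", "## Failed parts"]
        for p in failed:
            lines.append(f"- part {p['index']}: pages {p['first_page']}-{p['last_page']}")
    return "\n".join(lines) + "\n"


parts = [
    {"index": 1, "markdown": "paper_001.md", "first_page": 1, "last_page": 20,
     "status": "ok", "warnings": ["w1", "w2"], "asset_count": 3},
    {"index": 2, "markdown": None, "first_page": 21, "last_page": 40,
     "status": "failed", "warnings": ["w3"], "asset_count": 0},
]
report = render_report("paper.pdf", "outputs/paper", parts)
print(report)
```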
WP16.5: CLI, UI, And Documentation
Update user-facing docs and tests to remove metadata JSON as an expected output.
Implementation guidance:
- pdf2md convert summary may keep printing Markdown paths and warning counts.
- Update CLI help for --metadata to say metadata JSON output is disabled or deprecated in the simplified layout.
- Update README examples to show the new folder layout.
- Update PRD and ARCHITECTURE so they no longer claim metadata JSON is required as a public artifact.
- Keep internal provenance wording clear: warnings and report are still derived from internal metadata-like records.
- Update optional fixture documentation so generated metadata JSON is not required for sample validation.
Implementation Task Plan
Task 1: Path Planning For Simplified Layout
Files:
- Modify src/pdf2md/paths.py.
- Modify tests/test_paths.py.
Steps:
- Add failing tests showing plan_outputs() maps paper.pdf to out/paper/paper_001.md, out/paper/images, no metadata path, and out/paper/paper_report.md.
- Add a failing test for Korean filenames, using the PDF stem exactly as the output folder and file prefix.
- Add a failing test for recursive input preserving relative_parent.
- Add a failing test that duplicate source stems in the same relative parent conflict before conversion.
- Implement the minimal path planning changes.
- Run uv run pytest tests/test_paths.py.
- Commit path planning changes.
Task 2: Single-Output Conversion Writes Simplified Files
Files:
- Modify src/pdf2md/conversion.py.
- Modify tests/test_conversion.py.
- Modify tests/test_cli.py.
Steps:
- Add failing conversion tests showing a non-chunked fake-adapter conversion writes out/paper/paper_001.md, out/paper/images, and out/paper/paper_report.md.
- Add failing assertions that no .metadata.json file is written and result.metadata_path is None.
- Add a failing CLI test showing pdf2md convert paper.pdf --out out creates the simplified folder.
- Implement the minimal conversion changes for non-chunked output.
- Run uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_paths.py.
- Commit single-output conversion changes.
Task 3: Grouped Output Parts And Shared Images
Files:
- Modify src/pdf2md/conversion.py.
- Modify src/pdf2md/pdf_splitter.py only if a small helper is needed.
- Modify tests/test_conversion.py.
- Modify tests/test_cli.py.
Steps:
- Add failing tests for chunk_pages=20 showing final Markdown names are paper_001.md, paper_002.md, not paper.part-...md.
- Add failing tests proving all grouped assets are copied into paper/images/ and Markdown links use images/....
- Add failing tests proving asset collisions across pages get deterministic unique filenames.
- Add failing tests proving failed page conversions are represented in the shared report while later pages still convert.
- Implement grouped output naming and shared image handling.
- Run uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_pdf_splitter.py.
- Commit grouped output changes.
Task 4: Aggregate Report Without Metadata JSON
Files:
- Modify src/pdf2md/report.py or add a focused aggregate report helper.
- Modify src/pdf2md/conversion.py.
- Modify tests/test_report.py.
- Modify tests/test_conversion.py.
Steps:
- Add failing report tests for a one-file report listing multiple Markdown parts and source page ranges.
- Add failing conversion tests proving only one report exists for a chunked PDF.
- Add failing tests proving report summary totals combine all output parts.
- Add failing tests proving all-failed conversions write a report but no Markdown part.
- Implement aggregate report rendering from internal metadata records.
- Run uv run pytest tests/test_report.py tests/test_conversion.py.
- Commit report changes.
Task 5: Recheck, CLI Compatibility, UI Text, And Docs
Files:
- Modify src/pdf2md/cli.py.
- Modify src/pdf2md/conversion.py.
- Modify src/pdf2md_ui/runner.py and src/pdf2md_ui/app.py only if text/output assumptions change.
- Modify README.md.
- Modify PRD.md.
- Modify ARCHITECTURE.md.
- Modify docs/V1IMPLEMENTATIONPLAN.md.
- Modify tests/test_cli.py.
- Modify tests/test_ui_runner.py only if UI behavior changes.
- Modify tests/integration/test_v1_fast_release_gate.py.
- Modify tests/integration/test_optional_mineru_fixtures.py.
Steps:
- Add failing CLI tests proving --metadata remains accepted but no metadata JSON is written.
- Add a failing recheck test proving simplified outputs without metadata fail with a clear legacy-metadata message.
- Update integration tests to require Markdown part files, one report, and image links, not metadata JSON.
- Update README, PRD, ARCHITECTURE, and release-gate wording for the simplified layout.
- Implement CLI/recheck/doc changes.
- Run uv run pytest tests/test_cli.py tests/test_ui_runner.py tests/integration/test_v1_fast_release_gate.py.
- Commit CLI, UI, integration, and documentation changes.
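The legacy-only recheck rule in the steps above might be guarded along these lines. LegacyMetadataRequired and require_legacy_metadata are hypothetical names, and the error wording is illustrative. The sketch assumes legacy metadata sits next to the Markdown file as <stem>.metadata.json, as in the v1 layout this contract supersedes.

```python
import tempfile
from pathlib import Path


class LegacyMetadataRequired(Exception):
    """Hypothetical error raised when recheck has no adjacent metadata JSON."""


def require_legacy_metadata(markdown_path: Path) -> Path:
    """Return the adjacent <stem>.metadata.json or fail with a clear message."""
    metadata_path = markdown_path.with_name(markdown_path.stem + ".metadata.json")
    if not metadata_path.is_file():
        raise LegacyMetadataRequired(
            f"recheck needs legacy metadata JSON next to {markdown_path.name} "
            f"(expected {metadata_path.name}); simplified outputs do not write it."
        )
    return metadata_path


message = ""
with tempfile.TemporaryDirectory() as tmp:
    # A simplified-layout part has no metadata JSON beside it.
    simplified = Path(tmp) / "paper_001.md"
    simplified.write_text("# paper\n", encoding="utf-8")
    try:
        require_legacy_metadata(simplified)
    except LegacyMetadataRequired as exc:
        message = str(exc)
print(message)
```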
Task 6: Final Verification And Handoff
Files:
- Modify PLAN.md.
- Modify PROGRESS.md.
- Modify docs/WORKARCHIVE.md after implementation.
- Modify docs/Sprints/SPRINT16CONTRACT.md status and handoff fields.
Steps:
- Run focused Sprint 16 verification:
uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
- Run full default verification:
uv run pytest
- Run diff check:
git diff --check
- Update PROGRESS.md with files changed, checks run, residual risks, and next actions.
- Archive completed implementation evidence in docs/WORKARCHIVE.md.
- Commit final coordination updates.
Verification Commands
uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
uv run pytest
git diff --check
git status --short --untracked-files=all
Optional local fixture validation after implementation:
$env:MINERU_MODEL_SOURCE='local'
uv run pdf2md convert samples\SolidElement.pdf --out outputs\SolidElement_sprint16_layout --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local
Expected optional validation:
- Output folder is outputs\SolidElement\ or the explicitly provided output root plus SolidElement\, depending on the command.
- Markdown part is SolidElement_001.md for the 6-page sample.
- Report is SolidElement_report.md.
- Images are under images\.
- No metadata JSON exists.
Acceptance Criteria
- Each input PDF writes into an output folder named after the PDF stem.
- Markdown outputs are named <stem>_001.md, <stem>_002.md, and so on.
- All image/media assets for one PDF live under <stem>/images/.
- Markdown links point to images/....
- Exactly one report file is written per input PDF at <stem>/<stem>_report.md.
- No metadata JSON file is written for new conversions.
- Internal warning, provenance, formula count, asset count, and text fidelity information remains available in the report.
- Chunk mode still converts one source page per MinerU run and groups Markdown by chunk_pages.
- Strict-local and MinerU-only constraints remain unchanged.
- Default tests stay fast and local.
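The link criterion above can be checked with a small scan of Markdown image links. This is a sketch only: links_outside_images is a hypothetical helper, the regular expression handles just the inline ![alt](target) form, and it does not check that the target file exists.

```python
import re
from pathlib import PurePosixPath

# Inline image syntax only: ![alt](target) or ![alt](target "title").
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


def links_outside_images(markdown: str) -> list:
    """Return image link targets that do not stay inside images/."""
    bad = []
    for target in IMAGE_LINK.findall(markdown):
        parts = PurePosixPath(target).parts
        # Valid targets look like images/<asset-name> with no parent traversal.
        if len(parts) < 2 or parts[0] != "images" or ".." in parts:
            bad.append(target)
    return bad


sample = (
    "![ok](images/page-001_fig.png)\n"
    "![escape](../other/x.png)\n"
    "![legacy](paper.assets/y.png)\n"
)
print(links_outside_images(sample))  # ['../other/x.png', 'paper.assets/y.png']
```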
Hard Failure Criteria
- Any new conversion writes .metadata.json as a public output.
- Output files keep old part-001.pages-... names.
- Assets are split into per-part .assets folders.
- More than one report is written for one input PDF.
- Markdown links point outside the PDF output folder.
- Chunk mode stops using one source page per MinerU run.
- Strict-local enforcement is weakened.
- Default tests require real MinerU, GPU, model files, network, Obsidian, MathJax, or samples/.
- Sample PDFs, generated outputs, local model files, or dist/pdf2md-ui.exe are committed.
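Several of the layout-related failure criteria can be checked mechanically on one PDF output folder. The sketch below is a hypothetical checker, not part of the test suite. It assumes the folder name equals the PDF stem and skips raw/, since the contents of --keep-raw diagnostics are not defined here.

```python
import tempfile
from pathlib import Path


def layout_violations(pdf_folder: Path) -> list:
    """List layout problems found in one <out>/<stem>/ folder."""
    stem = pdf_folder.name  # assumption: folder is named after the PDF stem
    problems = []
    for path in sorted(pdf_folder.rglob("*")):
        rel = path.relative_to(pdf_folder).as_posix()
        if rel.split("/", 1)[0] == "raw":
            continue  # raw MinerU diagnostics are out of scope for this check
        if path.is_file() and path.name.endswith(".metadata.json"):
            problems.append(f"metadata JSON written: {rel}")
        if ".part-" in path.name and ".pages-" in path.name:
            problems.append(f"legacy part name: {rel}")
        if path.is_dir() and path.name.endswith(".assets"):
            problems.append(f"per-part assets folder: {rel}")
    reports = sorted(p.name for p in pdf_folder.glob("*_report.md"))
    if reports != [f"{stem}_report.md"]:
        problems.append(f"expected exactly one report, found {reports}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "paper"
    (folder / "images").mkdir(parents=True)
    (folder / "paper_001.md").write_text("# part 1\n", encoding="utf-8")
    (folder / "paper_report.md").write_text("# report\n", encoding="utf-8")
    clean = layout_violations(folder)
    (folder / "paper.metadata.json").write_text("{}", encoding="utf-8")
    dirty = layout_violations(folder)
print(clean)  # []
print(dirty)  # ['metadata JSON written: paper.metadata.json']
```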
Open Questions
- Should metadata-free pdf2md recheck be restored in a later sprint by deriving enough state from the report and Markdown, or is rerunning conversion acceptable for simplified outputs?
- Should raw MinerU outputs under --keep-raw be flattened into raw/ or kept per part under raw/<stem>_001/? This contract recommends per-part raw folders to avoid collisions.
Handoff Requirements
After implementation:
- Update this contract status to Implemented.
- Record final file layout examples in README.md.
- Record verification commands and outcomes in PROGRESS.md.
- Archive implementation and optional sample validation results in docs/WORKARCHIVE.md.
- Keep generated outputs and sample PDFs uncommitted.
Implementation Handoff
- Files changed: src/pdf2md/paths.py, src/pdf2md/conversion.py, src/pdf2md/report.py, src/pdf2md/cli.py, src/pdf2md_ui/runner.py, focused tests, and current docs.
- Output layout implemented: <out>/<stem>/<stem>_001.md, additional numbered parts when grouped, <out>/<stem>/images/, and <out>/<stem>/<stem>_report.md.
- Metadata JSON behavior: new conversions do not write public .metadata.json; ConversionResult.metadata_path is None; internal metadata-like records still feed reports and tests.
- Recheck behavior: pdf2md recheck remains legacy-only and requires adjacent metadata JSON.
- Verification recorded in PROGRESS.md: focused Sprint 16 tests passed, full uv run pytest passed 227 tests with 1 optional skip, and git diff --check passed with line-ending warnings only.