# Sprint 16 Contract: Simplified Output Layout

Status: Implemented

Last updated: 2026-05-12

## Objective

Simplify conversion outputs so each input PDF gets one predictable output folder named after the PDF stem, all images live under one `images` folder, Markdown parts use `_001`, `_002` numbering, one human-readable report is written per PDF, and no metadata JSON file is persisted.

This sprint changes the public output contract. It supersedes the older v1 output layout that wrote sibling `.md`, `.assets`, `.metadata.json`, and `.report.md` files.

## Product Output Contract

For an input PDF:

```text
paper.pdf
```

and output root:

```text
outputs/
```

write:

```text
outputs/
  paper/
    paper_001.md
    paper_002.md
    paper_report.md
    images/
      ...
```

Rules:

- `paper` is the PDF stem, meaning the original filename without `.pdf`.
- A one-part conversion still writes `paper_001.md`.
- A multi-part conversion writes `paper_001.md`, `paper_002.md`, and so on.
- Part numbering uses at least three digits and grows only when the part count exceeds 999.
- All generated image and media assets for the PDF live under `paper/images/`.
- Markdown links must point to `images/`.
- The report is a single file at `paper/paper_report.md`.
- No `.metadata.json`, part metadata JSON, or sidecar metadata JSON is written.
- Internal metadata records may still be built in memory to produce reports, warnings, counts, and `ConversionResult` fields.

## Contract Assumptions

- The user request "metadata is not needed" means metadata JSON should not be written as a user-facing output file. It does not mean removing internal metadata objects needed for report generation and warning aggregation.
- Keep `--chunk-pages` semantics from Sprint 14: when enabled, MinerU receives one source page per run and final Markdown files are grouped by `chunk_pages`.
- If `--chunk-pages` is absent, the whole PDF is still converted in one MinerU run and written as `_001.md`.
- Keep `--chunk-pages` without a value as the default grouping size of 20.
- Keep `--metadata` accepted as a backward-compatible no-op for one sprint, but update help text to say metadata JSON output is disabled in the simplified layout.
- `pdf2md recheck` remains supported only for legacy outputs that still have adjacent metadata JSON. New simplified outputs should fail recheck clearly until a later sprint designs metadata-free recheck.
- Recursive directory conversion should preserve the discovered relative parent before the PDF stem folder: `outputs///_001.md`.
- If two inputs would map to the same output folder and overwrite is false, fail during preflight. Do not invent automatic suffixes.
- `--keep-raw` should place raw MinerU diagnostics under `paper/raw/` so raw outputs do not clutter the main folder.

## Touched Surfaces

Allowed during implementation:

- Modify `src/pdf2md/paths.py`.
- Modify `src/pdf2md/pdf_splitter.py` only if part naming needs helper support.
- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md/report.py` or add a focused aggregate report helper if one report needs multiple part summaries.
- Modify `src/pdf2md/cli.py`.
- Modify `src/pdf2md_ui/runner.py` and `src/pdf2md_ui/app.py` only if UI text or expected output descriptions mention metadata/report paths.
- Modify `tests/test_paths.py`.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.
- Modify `tests/test_report.py`.
- Modify `tests/test_ui_runner.py` only if UI command/output assumptions change.
- Modify `tests/integration/test_v1_fast_release_gate.py`.
- Modify `tests/integration/test_optional_mineru_fixtures.py`.
- Modify `README.md`.
- Modify `PRD.md`.
- Modify `ARCHITECTURE.md`.
- Modify `docs/V1IMPLEMENTATIONPLAN.md`.
- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.

Not allowed:

- Do not change MinerU 3.1.0 as the fixed engine.
- Do not add another conversion engine.
- Do not add remote/API/backend paths.
- Do not change `--gpu`, `--mineru-profile`, or strict-local behavior except where report text reflects the new layout.
- Do not make default tests depend on real MinerU, GPU, CUDA, model files, network, Obsidian, MathJax, or `samples/`.
- Do not commit generated `outputs/`, sample PDFs, local model files, or `dist/pdf2md-ui.exe`.

## Architecture Plan

### WP16.1: Document-Level Output Layout

Add or reshape path planning so final outputs are planned per source PDF folder instead of as sibling files.

Expected final paths for a single PDF:

```text
//_001.md
//images/
//_report.md
```

Expected final paths for recursive input:

```text
///_001.md
///images/
///_report.md
```

Implementation guidance:

- Keep `DiscoveredPdf.relative_parent` behavior.
- Add a focused part-planning helper rather than encoding final output names through fake temporary PDF filenames.
- Keep `PlannedOutput` if the existing conversion code can use it cleanly, but allow multiple Markdown parts to share the same `assets_dir` and `report_path`.
- Duplicate-path detection must reject duplicate Markdown files and raw directories, but it must allow shared `images/` and shared report paths for parts belonging to the same source PDF.

### WP16.2: Markdown Part Numbering

Replace public part names:

```text
paper.part-001.pages-001-020.md
paper.part-002.pages-021-040.md
```

with:

```text
paper_001.md
paper_002.md
```

Rules:

- Part index is based on final output group order, not source page number.
- The report must still record source page ranges for each part.
- Failed groups should not create a Markdown file, but the report must mention the failed part and source page range.

### WP16.3: Shared Images Folder

Replace per-output asset directories:

```text
paper.part-001.pages-001-020.assets/
paper.part-002.pages-021-040.assets/
```

with:

```text
paper/images/
```

Implementation guidance:

- Copy all assets for one source PDF into the shared `images/` folder.
- Rewrite Markdown links to `images/`.
- Use deterministic collision-safe filenames. Recommended pattern:
  - page-known assets: `page-001_`, with `-002` suffixes when needed.
  - page-unknown assets: `asset-001`, preserving the original suffix when available.
- Keep asset-link validation pointed at the shared `images/` directory.

### WP16.4: One Report, No Metadata JSON

Stop writing metadata JSON as a user-facing output file.

Implementation guidance:

- Continue building internal metadata dictionaries or records for each part so report generation and `ConversionResult` summaries stay traceable.
- Add an aggregate report path at `/_report.md`.
- The report must include:
  - source PDF path,
  - output folder path,
  - Markdown part list with page ranges,
  - engine and engine options,
  - final status,
  - warning count,
  - asset count,
  - missing/invalid asset link counts,
  - inline/display formula counts,
  - MathJax render error count,
  - text fidelity summary when available,
  - failed source pages or failed parts when any exist,
  - warnings grouped by page or part.
- `ConversionResult.metadata_path` should be `None` for simplified outputs.
- `ConversionResult.report_path` should point to the shared report path.

### WP16.5: CLI, UI, And Documentation

Update user-facing docs and tests to remove metadata JSON as an expected output.

Implementation guidance:

- `pdf2md convert` summary may keep printing Markdown paths and warning counts.
- Update CLI help for `--metadata` to say metadata JSON output is disabled or deprecated in the simplified layout.
- Update README examples to show the new folder layout.
- Update PRD and ARCHITECTURE so they no longer claim metadata JSON is required as a public artifact.
- Keep internal provenance wording clear: warnings and report are still derived from internal metadata-like records.
- Update optional fixture documentation so generated metadata JSON is not required for sample validation.
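
The layout rules in this Architecture Plan can be illustrated with a small planning helper. This is a minimal sketch under stated assumptions, not the project's implementation: `PartPlan`, `part_suffix`, `plan_parts`, and `find_folder_conflicts` are hypothetical names, and the real code in `src/pdf2md/paths.py` may be shaped differently, for example around `PlannedOutput`.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PartPlan:
    """Hypothetical record for one Markdown part (not the project's PlannedOutput)."""

    markdown_path: Path
    images_dir: Path   # shared by every part of the same source PDF
    report_path: Path  # shared by every part of the same source PDF


def part_suffix(index: int, total: int) -> str:
    """Zero-pad to at least three digits; widen only when the part count exceeds 999."""
    width = max(3, len(str(total)))
    return f"_{index:0{width}d}"


def plan_parts(output_root: Path, relative_parent: Path, stem: str, part_count: int) -> list[PartPlan]:
    """Plan all parts for one source PDF under <root>/<relative parent>/<stem>/."""
    folder = output_root / relative_parent / stem
    images_dir = folder / "images"
    report_path = folder / f"{stem}_report.md"
    total = max(1, part_count)  # a one-part conversion still gets _001
    return [
        PartPlan(folder / f"{stem}{part_suffix(i, total)}.md", images_dir, report_path)
        for i in range(1, total + 1)
    ]


def find_folder_conflicts(folders: list[Path]) -> set[Path]:
    """Return output folders claimed by more than one input, for a preflight failure."""
    seen: set[Path] = set()
    conflicts: set[Path] = set()
    for folder in folders:
        if folder in seen:
            conflicts.add(folder)
        seen.add(folder)
    return conflicts


if __name__ == "__main__":
    for plan in plan_parts(Path("outputs"), Path("."), "paper", 2):
        print(plan.markdown_path.as_posix())
```

Sharing `images_dir` and `report_path` across parts is deliberate in this sketch: duplicate detection would compare Markdown paths and output folders only, so the shared paths never register as collisions within one source PDF.
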
## Implementation Task Plan

### Task 1: Path Planning For Simplified Layout

Files:

- Modify `src/pdf2md/paths.py`.
- Modify `tests/test_paths.py`.

Steps:

- [ ] Add failing tests showing `plan_outputs()` maps `paper.pdf` to `out/paper/paper_001.md`, `out/paper/images`, no metadata path, and `out/paper/paper_report.md`.
- [ ] Add a failing test for Korean filenames, using the PDF stem exactly as the output folder and file prefix.
- [ ] Add a failing test for recursive input preserving `relative_parent`.
- [ ] Add a failing test that duplicate source stems in the same relative parent conflict before conversion.
- [ ] Implement the minimal path planning changes.
- [ ] Run `uv run pytest tests/test_paths.py`.
- [ ] Commit path planning changes.

### Task 2: Single-Output Conversion Writes Simplified Files

Files:

- Modify `src/pdf2md/conversion.py`.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.

Steps:

- [ ] Add failing conversion tests showing a non-chunked fake-adapter conversion writes `out/paper/paper_001.md`, `out/paper/images`, and `out/paper/paper_report.md`.
- [ ] Add failing assertions that no `.metadata.json` file is written and `result.metadata_path is None`.
- [ ] Add failing CLI test showing `pdf2md convert paper.pdf --out out` creates the simplified folder.
- [ ] Implement the minimal conversion changes for non-chunked output.
- [ ] Run `uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_paths.py`.
- [ ] Commit single-output conversion changes.

### Task 3: Grouped Output Parts And Shared Images

Files:

- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md/pdf_splitter.py` only if a small helper is needed.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.

Steps:

- [ ] Add failing tests for `chunk_pages=20` showing final Markdown names are `paper_001.md`, `paper_002.md`, not `paper.part-...md`.
- [ ] Add failing tests proving all grouped assets are copied into `paper/images/` and Markdown links use `images/...`.
- [ ] Add failing tests proving asset collisions across pages get deterministic unique filenames.
- [ ] Add failing tests proving failed page conversions are represented in the shared report while later pages still convert.
- [ ] Implement grouped output naming and shared image handling.
- [ ] Run `uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_pdf_splitter.py`.
- [ ] Commit grouped output changes.

### Task 4: Aggregate Report Without Metadata JSON

Files:

- Modify `src/pdf2md/report.py` or add a focused aggregate report helper.
- Modify `src/pdf2md/conversion.py`.
- Modify `tests/test_report.py`.
- Modify `tests/test_conversion.py`.

Steps:

- [ ] Add failing report tests for a one-file report listing multiple Markdown parts and source page ranges.
- [ ] Add failing conversion tests proving only one report exists for a chunked PDF.
- [ ] Add failing tests proving report summary totals combine all output parts.
- [ ] Add failing tests proving all-failed conversions write a report but no Markdown part.
- [ ] Implement aggregate report rendering from internal metadata records.
- [ ] Run `uv run pytest tests/test_report.py tests/test_conversion.py`.
- [ ] Commit report changes.

### Task 5: Recheck, CLI Compatibility, UI Text, And Docs

Files:

- Modify `src/pdf2md/cli.py`.
- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md_ui/runner.py` and `src/pdf2md_ui/app.py` only if text/output assumptions change.
- Modify `README.md`.
- Modify `PRD.md`.
- Modify `ARCHITECTURE.md`.
- Modify `docs/V1IMPLEMENTATIONPLAN.md`.
- Modify `tests/test_cli.py`.
- Modify `tests/test_ui_runner.py` only if UI behavior changes.
- Modify `tests/integration/test_v1_fast_release_gate.py`.
- Modify `tests/integration/test_optional_mineru_fixtures.py`.

Steps:

- [ ] Add failing CLI tests proving `--metadata` remains accepted but no metadata JSON is written.
- [ ] Add failing recheck test proving simplified outputs without metadata fail with a clear legacy-metadata message.
- [ ] Update integration tests to require Markdown part files, one report, and image links, not metadata JSON.
- [ ] Update README, PRD, ARCHITECTURE, and release-gate wording for the simplified layout.
- [ ] Implement CLI/recheck/doc changes.
- [ ] Run `uv run pytest tests/test_cli.py tests/test_ui_runner.py tests/integration/test_v1_fast_release_gate.py`.
- [ ] Commit CLI, UI, integration, and documentation changes.

### Task 6: Final Verification And Handoff

Files:

- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.
- Modify `docs/Sprints/SPRINT16CONTRACT.md` status and handoff fields.

Steps:

- [ ] Run focused Sprint 16 verification:

  ```powershell
  uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
  ```

- [ ] Run full default verification:

  ```powershell
  uv run pytest
  ```

- [ ] Run diff check:

  ```powershell
  git diff --check
  ```

- [ ] Update `PROGRESS.md` with files changed, checks run, residual risks, and next actions.
- [ ] Archive completed implementation evidence in `docs/WORKARCHIVE.md`.
- [ ] Commit final coordination updates.

## Verification Commands

```powershell
uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```

Optional local fixture validation after implementation:

```powershell
$env:MINERU_MODEL_SOURCE='local'
uv run pdf2md convert samples\SolidElement.pdf --out outputs\SolidElement_sprint16_layout --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local
```

Expected optional validation:

- Output folder is `outputs\SolidElement\` or the explicitly provided output root plus `SolidElement\`, depending on the command.
- Markdown part is `SolidElement_001.md` for the 6-page sample.
- Report is `SolidElement_report.md`.
- Images are under `images\`.
- No metadata JSON exists.

## Acceptance Criteria

- Each input PDF writes into an output folder named after the PDF stem.
- Markdown outputs are named `_001.md`, `_002.md`, and so on.
- All image/media assets for one PDF live under `/images/`.
- Markdown links point to `images/...`.
- Exactly one report file is written per input PDF at `/_report.md`.
- No metadata JSON file is written for new conversions.
- Internal warning, provenance, formula count, asset count, and text fidelity information remains available in the report.
- Chunk mode still converts one source page per MinerU run and groups Markdown by `chunk_pages`.
- Strict-local and MinerU-only constraints remain unchanged.
- Default tests stay fast and local.

## Hard Failure Criteria

- Any new conversion writes `.metadata.json` as a public output.
- Output files keep old `part-001.pages-...` names.
- Assets are split into per-part `.assets` folders.
- More than one report is written for one input PDF.
- Markdown links point outside the PDF output folder.
- Chunk mode stops using one source page per MinerU run.
- Strict-local enforcement is weakened.
- Default tests require real MinerU, GPU, model files, network, Obsidian, MathJax, or `samples/`.
- Sample PDFs, generated outputs, local model files, or `dist/pdf2md-ui.exe` are committed.

## Open Questions

- Should metadata-free `pdf2md recheck` be restored in a later sprint by deriving enough state from the report and Markdown, or is rerunning conversion acceptable for simplified outputs?
- Should raw MinerU outputs under `--keep-raw` be flattened into `raw/` or kept per part under `raw/_001/`? This contract recommends per-part raw folders to avoid collisions.

## Handoff Requirements

After implementation:

- Update this contract status to `Implemented`.
- Record final file layout examples in `README.md`.
- Record verification commands and outcomes in `PROGRESS.md`.
- Archive implementation and optional sample validation results in `docs/WORKARCHIVE.md`.
- Keep generated outputs and sample PDFs uncommitted.

## Implementation Handoff

- Files changed: `src/pdf2md/paths.py`, `src/pdf2md/conversion.py`, `src/pdf2md/report.py`, `src/pdf2md/cli.py`, `src/pdf2md_ui/runner.py`, focused tests, and current docs.
- Output layout implemented: `//_001.md`, additional numbered parts when grouped, `//images/`, and `//_report.md`.
- Metadata JSON behavior: new conversions do not write public `.metadata.json`; `ConversionResult.metadata_path` is `None`; internal metadata-like records still feed reports and tests.
- Recheck behavior: `pdf2md recheck` remains legacy-only and requires adjacent metadata JSON.
- Verification recorded in `PROGRESS.md`: focused Sprint 16 tests passed, full `uv run pytest` passed 227 tests with 1 optional skip, and `git diff --check` passed with line-ending warnings only.
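
A generated output folder can be spot-checked against the layout recorded in this handoff with a short script. The following is a hedged sketch only: `check_simplified_layout` is a hypothetical helper, not part of `pdf2md`, and it assumes the folder name equals the PDF stem. It deliberately does not require a Markdown part, since an all-failed conversion writes a report but no part.

```python
import re
from pathlib import Path


def check_simplified_layout(folder: Path) -> list[str]:
    """Return human-readable problems; an empty list means no violation was found."""
    stem = folder.name  # assumption: the output folder is named after the PDF stem
    report_name = f"{stem}_report.md"
    part_pattern = re.compile(rf"{re.escape(stem)}_\d{{3,}}\.md")
    problems: list[str] = []

    if not (folder / report_name).is_file():
        problems.append(f"missing report: {report_name}")

    # Top-level Markdown files must be the report or numbered parts.
    for path in sorted(folder.glob("*.md")):
        if path.name != report_name and not part_pattern.fullmatch(path.name):
            problems.append(f"unexpected Markdown name: {path.name}")

    # Legacy artifacts must not appear anywhere under the folder.
    for path in sorted(folder.rglob("*")):
        if path.is_file() and path.name.endswith(".metadata.json"):
            problems.append(f"metadata JSON written: {path.relative_to(folder).as_posix()}")
        if path.is_dir() and path.name.endswith(".assets"):
            problems.append(f"legacy assets folder: {path.relative_to(folder).as_posix()}")

    return problems


if __name__ == "__main__":
    import sys

    for problem in check_simplified_layout(Path(sys.argv[1])):
        print(problem)
```

The check matches only names ending in `.metadata.json`, so other JSON files that raw MinerU diagnostics may leave under `raw/` with `--keep-raw` are not flagged.
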