PDFToMD/docs/Sprints/SPRINT16CONTRACT.md
2026-05-14 10:16:59 +09:00

# Sprint 16 Contract: Simplified Output Layout
Status: Implemented
Last updated: 2026-05-12
## Objective
Simplify conversion outputs so each input PDF gets one predictable output folder named after the PDF stem, all images live under one `images` folder, Markdown parts use `_001`, `_002` numbering, one human-readable report is written per PDF, and no metadata JSON file is persisted.
This sprint changes the public output contract. It supersedes the older v1 output layout that wrote sibling `<stem>.md`, `<stem>.assets`, `<stem>.metadata.json`, and `<stem>.report.md` files.
## Product Output Contract
For an input PDF:
```text
paper.pdf
```
and output root:
```text
outputs/
```
write:
```text
outputs/
  paper/
    paper_001.md
    paper_002.md
    paper_report.md
    images/
      ...
```
Rules:
- `paper` is the PDF stem, meaning the original filename without `.pdf`.
- A one-part conversion still writes `paper_001.md`.
- A multi-part conversion writes `paper_001.md`, `paper_002.md`, and so on.
- Part numbers are zero-padded to at least three digits; the width grows only when the part count exceeds 999.
- All generated image and media assets for the PDF live under `paper/images/`.
- Markdown links must point to `images/<asset-name>`.
- The report is a single file at `paper/paper_report.md`.
- No `<stem>.metadata.json`, part metadata JSON, or sidecar metadata JSON is written.
- Internal metadata records may still be built in memory to produce reports, warnings, counts, and `ConversionResult` fields.
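The rules above can be sketched as a small path-planning helper. This is a minimal illustration, not the project's actual `plan_outputs()` implementation: `DocumentLayout`, `part_filename`, and `plan_document_layout` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Tuple


@dataclass(frozen=True)
class DocumentLayout:
    """Hypothetical record of the final paths for one source PDF."""
    folder: Path
    images_dir: Path
    report_path: Path
    part_paths: Tuple[Path, ...]


def part_filename(stem: str, index: int) -> str:
    # Zero-pad to three digits; larger indexes simply use more digits.
    return f"{stem}_{index:03d}.md"


def plan_document_layout(out_root: Path, stem: str, part_count: int) -> DocumentLayout:
    # Every artifact for one PDF lives under <out_root>/<stem>/.
    folder = out_root / stem
    parts = tuple(folder / part_filename(stem, i) for i in range(1, part_count + 1))
    return DocumentLayout(
        folder=folder,
        images_dir=folder / "images",
        report_path=folder / f"{stem}_report.md",
        part_paths=parts,
    )
```

For recursive input, the same helper could be called with `out_root / relative_parent` as the root, which would give the `outputs/<relative-parent>/<stem>/` shape described under Contract Assumptions.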
## Contract Assumptions
- The user request "metadata is not needed" means metadata JSON should not be written as a user-facing output file. It does not mean removing internal metadata objects needed for report generation and warning aggregation.
- Keep `--chunk-pages` semantics from Sprint 14: when enabled, MinerU receives one source page per run and final Markdown files are grouped by `chunk_pages`.
- If `--chunk-pages` is absent, the whole PDF is still converted in one MinerU run and written as `<stem>_001.md`.
- Keep bare `--chunk-pages` (no value) meaning the default grouping size of 20.
- Keep `--metadata` accepted as a backward-compatible no-op for one sprint, but update help text to say metadata JSON output is disabled in the simplified layout.
- `pdf2md recheck` remains supported only for legacy outputs that still have adjacent metadata JSON. New simplified outputs should fail recheck clearly until a later sprint designs metadata-free recheck.
- Recursive directory conversion should preserve the discovered relative parent before the PDF stem folder: `outputs/<relative-parent>/<stem>/<stem>_001.md`.
- If two inputs would map to the same output folder and overwrite is false, fail during preflight. Do not invent automatic suffixes.
- `--keep-raw` should place raw MinerU diagnostics under `paper/raw/` so raw outputs do not clutter the main folder.
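The `--chunk-pages` and `--metadata` assumptions above can be illustrated with a standard-library `argparse` sketch. It assumes an argparse-style parser, which this contract does not confirm; `build_parser` and the `pdf2md-sketch` program name are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real pdf2md CLI may be built differently.
    parser = argparse.ArgumentParser(prog="pdf2md-sketch")
    parser.add_argument("pdf")
    parser.add_argument("--out", required=True)
    # nargs="?" with const=20: a bare --chunk-pages means 20,
    # an explicit value overrides it, and an absent flag leaves None.
    parser.add_argument("--chunk-pages", nargs="?", type=int, const=20, default=None)
    # Accepted for backward compatibility only; the value is never used.
    parser.add_argument(
        "--metadata",
        action="store_true",
        help="Deprecated no-op: metadata JSON output is disabled in the simplified layout.",
    )
    return parser
```

With this shape, `chunk_pages is None` selects the single whole-PDF run, and any integer selects one-page runs grouped by that size.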
## Touched Surfaces
Allowed during implementation:
- Modify `src/pdf2md/paths.py`.
- Modify `src/pdf2md/pdf_splitter.py` only if part naming needs helper support.
- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md/report.py` or add a focused aggregate report helper if one report needs multiple part summaries.
- Modify `src/pdf2md/cli.py`.
- Modify `src/pdf2md_ui/runner.py` and `src/pdf2md_ui/app.py` only if UI text or expected output descriptions mention metadata/report paths.
- Modify `tests/test_paths.py`.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.
- Modify `tests/test_report.py`.
- Modify `tests/test_ui_runner.py` only if UI command/output assumptions change.
- Modify `tests/integration/test_v1_fast_release_gate.py`.
- Modify `tests/integration/test_optional_mineru_fixtures.py`.
- Modify `README.md`.
- Modify `PRD.md`.
- Modify `ARCHITECTURE.md`.
- Modify `docs/V1IMPLEMENTATIONPLAN.md`.
- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.
Not allowed:
- Do not change MinerU 3.1.0 as the fixed engine.
- Do not add another conversion engine.
- Do not add remote/API/backend paths.
- Do not change `--gpu`, `--mineru-profile`, or strict-local behavior except where report text reflects the new layout.
- Do not make default tests depend on real MinerU, GPU, CUDA, model files, network, Obsidian, MathJax, or `samples/`.
- Do not commit generated `outputs/`, sample PDFs, local model files, or `dist/pdf2md-ui.exe`.
## Architecture Plan
### WP16.1: Document-Level Output Layout
Add or reshape path planning so final outputs are planned per source PDF folder instead of as sibling files.
Expected final paths for a single PDF:
```text
<out>/<stem>/<stem>_001.md
<out>/<stem>/images/
<out>/<stem>/<stem>_report.md
```
Expected final paths for recursive input:
```text
<out>/<relative-parent>/<stem>/<stem>_001.md
<out>/<relative-parent>/<stem>/images/
<out>/<relative-parent>/<stem>/<stem>_report.md
```
Implementation guidance:
- Keep `DiscoveredPdf.relative_parent` behavior.
- Add a focused part-planning helper rather than encoding final output names through fake temporary PDF filenames.
- Keep `PlannedOutput` if the existing conversion code can use it cleanly, but allow multiple Markdown parts to share the same `assets_dir` and `report_path`.
- Duplicate-path detection must reject duplicate Markdown files and raw directories, but it must allow shared `images/` and shared report paths for parts belonging to the same source PDF.
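A minimal sketch of the duplicate-path rule, assuming a simplified record per planned part. `PlannedPart` and `find_conflicts` are hypothetical names, not the existing `PlannedOutput` type, and raw directories are left out for brevity (they would be checked like Markdown paths).

```python
from collections import Counter
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Set


@dataclass(frozen=True)
class PlannedPart:
    """Hypothetical planned output for one Markdown part."""
    source_pdf: Path
    markdown_path: Path
    assets_dir: Path
    report_path: Path


def find_conflicts(parts: List[PlannedPart]) -> List[Path]:
    conflicts: Set[Path] = set()
    # A Markdown path may be planned only once, whatever the source.
    md_counts = Counter(p.markdown_path for p in parts)
    conflicts.update(path for path, n in md_counts.items() if n > 1)
    # images/ and the report may be shared, but only by parts of one source PDF.
    for attr in ("assets_dir", "report_path"):
        owners: Dict[Path, Set[Path]] = {}
        for p in parts:
            owners.setdefault(getattr(p, attr), set()).add(p.source_pdf)
        conflicts.update(path for path, sources in owners.items() if len(sources) > 1)
    return sorted(conflicts)
```

A preflight step could raise when this list is non-empty and overwrite is false, which matches the "fail during preflight, no automatic suffixes" assumption.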
### WP16.2: Markdown Part Numbering
Replace public part names:
```text
paper.part-001.pages-001-020.md
paper.part-002.pages-021-040.md
```
with:
```text
paper_001.md
paper_002.md
```
Rules:
- Part index is based on final output group order, not source page number.
- The report must still record source page ranges for each part.
- Failed groups should not create a Markdown file, but the report must mention the failed part and source page range.
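As an illustration of part indexing by group order, the sketch below groups source pages into parts and keeps the page range alongside each index so a report can still record it. `group_pages` is a hypothetical helper; how a failed group affects the numbering of later parts is not specified here and is not modeled.

```python
from typing import List, Tuple


def group_pages(page_count: int, chunk_pages: int) -> List[Tuple[int, int, int]]:
    """Return (part_index, first_page, last_page) tuples, 1-based and inclusive."""
    if page_count < 1 or chunk_pages < 1:
        raise ValueError("page_count and chunk_pages must be positive")
    groups = []
    # The part index counts groups in output order, independent of page numbers.
    for index, first in enumerate(range(1, page_count + 1, chunk_pages), start=1):
        last = min(first + chunk_pages - 1, page_count)
        groups.append((index, first, last))
    return groups
```

For example, a 45-page PDF with `chunk_pages=20` yields three groups covering pages 1-20, 21-40, and 41-45, which would map to `paper_001.md` through `paper_003.md`.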
### WP16.3: Shared Images Folder
Replace per-output asset directories:
```text
paper.part-001.pages-001-020.assets/
paper.part-002.pages-021-040.assets/
```
with:
```text
paper/images/
```
Implementation guidance:
- Copy all assets for one source PDF into the shared `images/` folder.
- Rewrite Markdown links to `images/<asset-name>`.
- Use deterministic collision-safe filenames. Recommended pattern:
  - page-known assets: `page-001_<original-name>`, with `-002` suffixes when needed.
  - page-unknown assets: `asset-001<suffix>`, preserving the original suffix when available.
- Keep asset-link validation pointed at the shared `images/` directory.
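One way to realize the recommended naming pattern is a small stateful namer, sketched below. `AssetNamer` and `markdown_link` are hypothetical, and placing the `-002` collision suffix before the file extension is an assumption about how the pattern should be read.

```python
from pathlib import PurePosixPath
from typing import Optional, Set


class AssetNamer:
    """Hypothetical helper that hands out unique names within one images/ folder."""

    def __init__(self) -> None:
        self._used: Set[str] = set()
        self._unknown_count = 0

    def name_for(self, original_name: str, page: Optional[int]) -> str:
        original = PurePosixPath(original_name)
        if page is None:
            # Page-unknown assets get a running counter and keep their suffix.
            self._unknown_count += 1
            candidate = f"asset-{self._unknown_count:03d}{original.suffix}"
        else:
            candidate = f"page-{page:03d}_{original.name}"
        base = PurePosixPath(candidate)
        name, counter = candidate, 2
        # On collision, append -002, -003, ... before the extension (assumed placement).
        while name in self._used:
            name = f"{base.stem}-{counter:03d}{base.suffix}"
            counter += 1
        self._used.add(name)
        return name


def markdown_link(asset_name: str) -> str:
    # Links are relative to the PDF output folder.
    return f"images/{asset_name}"
```

Because names depend only on the order in which assets are registered, processing pages in source order keeps the result deterministic across runs.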
### WP16.4: One Report, No Metadata JSON
Stop writing metadata JSON as a user-facing output file.
Implementation guidance:
- Continue building internal metadata dictionaries or records for each part so report generation and `ConversionResult` summaries stay traceable.
- Add an aggregate report path at `<stem>/<stem>_report.md`.
- The report must include:
  - source PDF path,
  - output folder path,
  - Markdown part list with page ranges,
  - engine and engine options,
  - final status,
  - warning count,
  - asset count,
  - missing/invalid asset link counts,
  - inline/display formula counts,
  - MathJax render error count,
  - text fidelity summary when available,
  - failed source pages or failed parts when any exist,
  - warnings grouped by page or part.
- `ConversionResult.metadata_path` should be `None` for simplified outputs.
- `ConversionResult.report_path` should point to the shared report path.
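A reduced sketch of aggregate report rendering from in-memory part records follows. It covers only a few of the required fields; `PartRecord`, `render_report`, and the `ok` / `partial` / `failed` status labels are hypothetical and not taken from `src/pdf2md/report.py`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PartRecord:
    """Hypothetical in-memory record for one output part."""
    index: int
    first_page: int
    last_page: int
    markdown_name: Optional[str]  # None when the group failed
    asset_count: int = 0
    warnings: List[str] = field(default_factory=list)


def render_report(source_pdf: str, output_folder: str, parts: List[PartRecord]) -> str:
    failed = [p for p in parts if p.markdown_name is None]
    if not failed:
        status = "ok"
    elif len(failed) == len(parts):
        status = "failed"
    else:
        status = "partial"
    lines = [
        f"# Conversion report: {source_pdf}",
        "",
        f"- Output folder: {output_folder}",
        f"- Final status: {status}",
        # Totals combine all parts of the source PDF.
        f"- Warning count: {sum(len(p.warnings) for p in parts)}",
        f"- Asset count: {sum(p.asset_count for p in parts)}",
        "",
        "## Parts",
    ]
    for p in parts:
        name = p.markdown_name or "(failed, no Markdown written)"
        lines.append(f"- Part {p.index:03d}: {name}, pages {p.first_page}-{p.last_page}")
        lines.extend(f"  - warning: {w}" for w in p.warnings)
    return "\n".join(lines) + "\n"
```

Keeping the records in memory and rendering once at the end is what lets a single report cover every part without persisting any metadata JSON.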
### WP16.5: CLI, UI, And Documentation
Update user-facing docs and tests to remove metadata JSON as an expected output.
Implementation guidance:
- `pdf2md convert` summary may keep printing Markdown paths and warning counts.
- Update CLI help for `--metadata` to say metadata JSON output is disabled or deprecated in the simplified layout.
- Update README examples to show the new folder layout.
- Update PRD and ARCHITECTURE so they no longer claim metadata JSON is required as a public artifact.
- Keep internal provenance wording clear: warnings and report are still derived from internal metadata-like records.
- Update optional fixture documentation so generated metadata JSON is not required for sample validation.
## Implementation Task Plan
### Task 1: Path Planning For Simplified Layout
Files:
- Modify `src/pdf2md/paths.py`.
- Modify `tests/test_paths.py`.
Steps:
- [ ] Add failing tests showing `plan_outputs()` maps `paper.pdf` to `out/paper/paper_001.md`, `out/paper/images`, no metadata path, and `out/paper/paper_report.md`.
- [ ] Add a failing test for Korean filenames, using the PDF stem exactly as the output folder and file prefix.
- [ ] Add a failing test for recursive input preserving `relative_parent`.
- [ ] Add a failing test that duplicate source stems in the same relative parent conflict before conversion.
- [ ] Implement the minimal path planning changes.
- [ ] Run `uv run pytest tests/test_paths.py`.
- [ ] Commit path planning changes.
### Task 2: Single-Output Conversion Writes Simplified Files
Files:
- Modify `src/pdf2md/conversion.py`.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.
Steps:
- [ ] Add failing conversion tests showing a non-chunked fake-adapter conversion writes `out/paper/paper_001.md`, `out/paper/images`, and `out/paper/paper_report.md`.
- [ ] Add failing assertions that no `.metadata.json` file is written and `result.metadata_path is None`.
- [ ] Add failing CLI test showing `pdf2md convert paper.pdf --out out` creates the simplified folder.
- [ ] Implement the minimal conversion changes for non-chunked output.
- [ ] Run `uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_paths.py`.
- [ ] Commit single-output conversion changes.
### Task 3: Grouped Output Parts And Shared Images
Files:
- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md/pdf_splitter.py` only if a small helper is needed.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.
Steps:
- [ ] Add failing tests for `chunk_pages=20` showing final Markdown names are `paper_001.md`, `paper_002.md`, not `paper.part-...md`.
- [ ] Add failing tests proving all grouped assets are copied into `paper/images/` and Markdown links use `images/...`.
- [ ] Add failing tests proving asset collisions across pages get deterministic unique filenames.
- [ ] Add failing tests proving failed page conversions are represented in the shared report while later pages still convert.
- [ ] Implement grouped output naming and shared image handling.
- [ ] Run `uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_pdf_splitter.py`.
- [ ] Commit grouped output changes.
### Task 4: Aggregate Report Without Metadata JSON
Files:
- Modify `src/pdf2md/report.py` or add a focused aggregate report helper.
- Modify `src/pdf2md/conversion.py`.
- Modify `tests/test_report.py`.
- Modify `tests/test_conversion.py`.
Steps:
- [ ] Add failing report tests for a one-file report listing multiple Markdown parts and source page ranges.
- [ ] Add failing conversion tests proving only one report exists for a chunked PDF.
- [ ] Add failing tests proving report summary totals combine all output parts.
- [ ] Add failing tests proving all-failed conversions write a report but no Markdown part.
- [ ] Implement aggregate report rendering from internal metadata records.
- [ ] Run `uv run pytest tests/test_report.py tests/test_conversion.py`.
- [ ] Commit report changes.
### Task 5: Recheck, CLI Compatibility, UI Text, And Docs
Files:
- Modify `src/pdf2md/cli.py`.
- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md_ui/runner.py` and `src/pdf2md_ui/app.py` only if text/output assumptions change.
- Modify `README.md`.
- Modify `PRD.md`.
- Modify `ARCHITECTURE.md`.
- Modify `docs/V1IMPLEMENTATIONPLAN.md`.
- Modify `tests/test_cli.py`.
- Modify `tests/test_ui_runner.py` only if UI behavior changes.
- Modify `tests/integration/test_v1_fast_release_gate.py`.
- Modify `tests/integration/test_optional_mineru_fixtures.py`.
Steps:
- [ ] Add failing CLI tests proving `--metadata` remains accepted but no metadata JSON is written.
- [ ] Add failing recheck test proving simplified outputs without metadata fail with a clear legacy-metadata message.
- [ ] Update integration tests to require Markdown part files, one report, and image links, not metadata JSON.
- [ ] Update README, PRD, ARCHITECTURE, and release-gate wording for the simplified layout.
- [ ] Implement CLI/recheck/doc changes.
- [ ] Run `uv run pytest tests/test_cli.py tests/test_ui_runner.py tests/integration/test_v1_fast_release_gate.py`.
- [ ] Commit CLI, UI, integration, and documentation changes.
### Task 6: Final Verification And Handoff
Files:
- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.
- Modify `docs/Sprints/SPRINT16CONTRACT.md` status and handoff fields.
Steps:
- [ ] Run focused Sprint 16 verification:
```powershell
uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
```
- [ ] Run full default verification:
```powershell
uv run pytest
```
- [ ] Run diff check:
```powershell
git diff --check
```
- [ ] Update `PROGRESS.md` with files changed, checks run, residual risks, and next actions.
- [ ] Archive completed implementation evidence in `docs/WORKARCHIVE.md`.
- [ ] Commit final coordination updates.
## Verification Commands
```powershell
uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```
Optional local fixture validation after implementation:
```powershell
$env:MINERU_MODEL_SOURCE='local'
uv run pdf2md convert samples\SolidElement.pdf --out outputs\SolidElement_sprint16_layout --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local
```
Expected optional validation:
- Output folder is the `--out` root plus `SolidElement\`; for the command above that is `outputs\SolidElement_sprint16_layout\SolidElement\`.
- Markdown part is `SolidElement_001.md` for the 6-page sample.
- Report is `SolidElement_report.md`.
- Images are under `images\`.
- No metadata JSON exists.
## Acceptance Criteria
- Each input PDF writes into an output folder named after the PDF stem.
- Markdown outputs are named `<stem>_001.md`, `<stem>_002.md`, and so on.
- All image/media assets for one PDF live under `<stem>/images/`.
- Markdown links point to `images/...`.
- Exactly one report file is written per input PDF at `<stem>/<stem>_report.md`.
- No metadata JSON file is written for new conversions.
- Internal warning, provenance, formula count, asset count, and text fidelity information remains available in the report.
- Chunk mode still converts one source page per MinerU run and groups Markdown by `chunk_pages`.
- Strict-local and MinerU-only constraints remain unchanged.
- Default tests stay fast and local.
## Hard Failure Criteria
- Any new conversion writes `.metadata.json` as a public output.
- Output files keep old `part-001.pages-...` names.
- Assets are split into per-part `.assets` folders.
- More than one report is written for one input PDF.
- Markdown links point outside the PDF output folder.
- Chunk mode stops using one source page per MinerU run.
- Strict-local enforcement is weakened.
- Default tests require real MinerU, GPU, model files, network, Obsidian, MathJax, or `samples/`.
- Sample PDFs, generated outputs, local model files, or `dist/pdf2md-ui.exe` are committed.
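Some of the file-layout failures above can be checked mechanically. The sketch below scans the top level of one PDF output folder for legacy artifacts; `find_layout_violations` is a hypothetical helper, it does not inspect Markdown links or anything under `raw/`, and the regular expression is an assumption based on the legacy names shown in WP16.2.

```python
import re
from pathlib import Path
from typing import List

# Matches legacy names such as paper.part-001.pages-001-020.md
OLD_PART_NAME = re.compile(r"\.part-\d+\.pages-\d+-\d+\.")


def find_layout_violations(doc_folder: Path) -> List[str]:
    violations = []
    # Only direct children are checked; raw/ contents are deliberately ignored.
    for path in sorted(doc_folder.iterdir()):
        if path.is_file() and path.name.endswith(".metadata.json"):
            violations.append(f"metadata JSON written: {path.name}")
        if OLD_PART_NAME.search(path.name):
            violations.append(f"legacy part name: {path.name}")
        if path.is_dir() and path.name.endswith(".assets"):
            violations.append(f"per-part assets folder: {path.name}")
    reports = sorted(p.name for p in doc_folder.glob("*_report.md"))
    if len(reports) > 1:
        violations.append(f"multiple reports: {', '.join(reports)}")
    return violations
```

A check like this could back an integration-test assertion that a fake-adapter conversion produces an empty violation list.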
## Open Questions
- Should metadata-free `pdf2md recheck` be restored in a later sprint by deriving enough state from the report and Markdown, or is rerunning conversion acceptable for simplified outputs?
- Should raw MinerU outputs under `--keep-raw` be flattened into `raw/` or kept per part under `raw/<stem>_001/`? This contract recommends per-part raw folders to avoid collisions.
## Handoff Requirements
After implementation:
- Update this contract status to `Implemented`.
- Record final file layout examples in `README.md`.
- Record verification commands and outcomes in `PROGRESS.md`.
- Archive implementation and optional sample validation results in `docs/WORKARCHIVE.md`.
- Keep generated outputs and sample PDFs uncommitted.
## Implementation Handoff
- Files changed: `src/pdf2md/paths.py`, `src/pdf2md/conversion.py`, `src/pdf2md/report.py`, `src/pdf2md/cli.py`, `src/pdf2md_ui/runner.py`, focused tests, and current docs.
- Output layout implemented: `<out>/<stem>/<stem>_001.md`, additional numbered parts when grouped, `<out>/<stem>/images/`, and `<out>/<stem>/<stem>_report.md`.
- Metadata JSON behavior: new conversions do not write public `.metadata.json`; `ConversionResult.metadata_path` is `None`; internal metadata-like records still feed reports and tests.
- Recheck behavior: `pdf2md recheck` remains legacy-only and requires adjacent metadata JSON.
- Verification recorded in `PROGRESS.md`: focused Sprint 16 tests passed, full `uv run pytest` passed 227 tests with 1 optional skip, and `git diff --check` passed with line-ending warnings only.