# Sprint 16 Contract: Simplified Output Layout

Status: Implemented
Last updated: 2026-05-12

## Objective

Simplify conversion outputs so each input PDF gets one predictable output folder named after the PDF stem, all images live under one `images` folder, Markdown parts use `_001`, `_002` numbering, one human-readable report is written per PDF, and no metadata JSON file is persisted.

This sprint changes the public output contract. It supersedes the older v1 output layout that wrote sibling `<stem>.md`, `<stem>.assets`, `<stem>.metadata.json`, and `<stem>.report.md` files.

## Product Output Contract

For an input PDF:

```text
paper.pdf
```

and output root:

```text
outputs/
```

write:

```text
outputs/
  paper/
    paper_001.md
    paper_002.md
    paper_report.md
    images/
      ...
```

Rules:

- `paper` is the PDF stem, meaning the original filename without `.pdf`.
- A one-part conversion still writes `paper_001.md`.
- A multi-part conversion writes `paper_001.md`, `paper_002.md`, and so on.
- Part numbering uses at least three digits and grows only when the part count exceeds 999.
- All generated image and media assets for the PDF live under `paper/images/`.
- Markdown links must point to `images/<asset-name>`.
- The report is a single file at `paper/paper_report.md`.
- No `<stem>.metadata.json`, part metadata JSON, or sidecar metadata JSON is written.
- Internal metadata records may still be built in memory to produce reports, warnings, counts, and `ConversionResult` fields.

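The numbering rule can be illustrated with a minimal sketch. `part_filenames` is a hypothetical helper, not a function from the project. It also assumes that once the part count exceeds 999 every part is padded to the wider width so names keep sorting lexicographically; the contract only says that the width grows.

```python
def part_filenames(stem: str, part_count: int) -> list[str]:
    """Return Markdown part names such as paper_001.md (hypothetical helper)."""
    if part_count < 1:
        raise ValueError("part_count must be at least 1")
    # At least three digits; wider only when the part count exceeds 999.
    # Assumption: all parts share the same width.
    width = max(3, len(str(part_count)))
    return [f"{stem}_{index:0{width}d}.md" for index in range(1, part_count + 1)]


print(part_filenames("paper", 2))
```

A one-part conversion goes through the same path and yields only `paper_001.md`, which matches the rule that single-part outputs are still numbered.
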
## Contract Assumptions

- The user request "metadata is not needed" means metadata JSON should not be written as a user-facing output file. It does not mean removing internal metadata objects needed for report generation and warning aggregation.
- Keep `--chunk-pages` semantics from Sprint 14: when enabled, MinerU receives one source page per run and final Markdown files are grouped by `chunk_pages`.
- If `--chunk-pages` is absent, the whole PDF is still converted in one MinerU run and written as `<stem>_001.md`.
- Keep `--chunk-pages` without a value as the default grouping size of 20.
- Keep `--metadata` accepted as a backward-compatible no-op for one sprint, but update help text to say metadata JSON output is disabled in the simplified layout.
- `pdf2md recheck` remains supported only for legacy outputs that still have adjacent metadata JSON. New simplified outputs should fail recheck clearly until a later sprint designs metadata-free recheck.
- Recursive directory conversion should preserve the discovered relative parent before the PDF stem folder: `outputs/<relative-parent>/<stem>/<stem>_001.md`.
- If two inputs would map to the same output folder and overwrite is false, fail during preflight. Do not invent automatic suffixes.
- `--keep-raw` should place raw MinerU diagnostics under `paper/raw/` so raw outputs do not clutter the main folder.

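The grouping assumptions above can be sketched as a small function that maps a page count to the source page range of each Markdown part. `group_pages` is a hypothetical name used only for illustration. It models the output grouping only; it does not model the per-page MinerU runs that chunk mode performs.

```python
from typing import Optional


def group_pages(page_count: int, chunk_pages: Optional[int]) -> list[tuple[int, int]]:
    """Return inclusive 1-based (first_page, last_page) ranges, one per Markdown part."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if chunk_pages is None:
        # --chunk-pages absent: the whole PDF becomes a single part.
        return [(1, page_count)]
    if chunk_pages < 1:
        raise ValueError("chunk_pages must be at least 1")
    return [
        (first, min(first + chunk_pages - 1, page_count))
        for first in range(1, page_count + 1, chunk_pages)
    ]


print(group_pages(45, 20))  # [(1, 20), (21, 40), (41, 45)]
```

With the default grouping size of 20, a 6-page PDF yields the single range `(1, 6)` and therefore a single `<stem>_001.md`.
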
## Touched Surfaces

Allowed during implementation:

- Modify `src/pdf2md/paths.py`.
- Modify `src/pdf2md/pdf_splitter.py` only if part naming needs helper support.
- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md/report.py` or add a focused aggregate report helper if one report needs multiple part summaries.
- Modify `src/pdf2md/cli.py`.
- Modify `src/pdf2md_ui/runner.py` and `src/pdf2md_ui/app.py` only if UI text or expected output descriptions mention metadata/report paths.
- Modify `tests/test_paths.py`.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.
- Modify `tests/test_report.py`.
- Modify `tests/test_ui_runner.py` only if UI command/output assumptions change.
- Modify `tests/integration/test_v1_fast_release_gate.py`.
- Modify `tests/integration/test_optional_mineru_fixtures.py`.
- Modify `README.md`.
- Modify `PRD.md`.
- Modify `ARCHITECTURE.md`.
- Modify `docs/V1IMPLEMENTATIONPLAN.md`.
- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.

Not allowed:

- Do not change MinerU 3.1.0 as the fixed engine.
- Do not add another conversion engine.
- Do not add remote/API/backend paths.
- Do not change `--gpu`, `--mineru-profile`, or strict-local behavior except where report text reflects the new layout.
- Do not make default tests depend on real MinerU, GPU, CUDA, model files, network, Obsidian, MathJax, or `samples/`.
- Do not commit generated `outputs/`, sample PDFs, local model files, or `dist/pdf2md-ui.exe`.

## Architecture Plan

### WP16.1: Document-Level Output Layout

Add or reshape path planning so final outputs are planned per source PDF folder instead of as sibling files.

Expected final paths for a single PDF:

```text
<out>/<stem>/<stem>_001.md
<out>/<stem>/images/
<out>/<stem>/<stem>_report.md
```

Expected final paths for recursive input:

```text
<out>/<relative-parent>/<stem>/<stem>_001.md
<out>/<relative-parent>/<stem>/images/
<out>/<relative-parent>/<stem>/<stem>_report.md
```

Implementation guidance:

- Keep `DiscoveredPdf.relative_parent` behavior.
- Add a focused part-planning helper rather than encoding final output names through fake temporary PDF filenames.
- Keep `PlannedOutput` if the existing conversion code can use it cleanly, but allow multiple Markdown parts to share the same `assets_dir` and `report_path`.
- Duplicate-path detection must reject duplicate Markdown files and raw directories, but it must allow shared `images/` and shared report paths for parts belonging to the same source PDF.

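One way to picture the document-level layout is the sketch below. It is an illustration under stated assumptions, not the project's `paths.py`: `DocumentLayout`, `plan_document`, and `find_duplicate_markdown` are hypothetical names, and `PurePosixPath` is used only so the example prints the same paths on every platform. Raw directories are left out for brevity.

```python
from collections import Counter
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class DocumentLayout:
    """Planned output paths for one source PDF (hypothetical shape)."""

    folder: PurePosixPath
    markdown_parts: tuple[PurePosixPath, ...]
    images_dir: PurePosixPath  # shared by every part of this PDF
    report_path: PurePosixPath  # shared by every part of this PDF


def plan_document(out_root: str, relative_parent: str, stem: str, part_count: int = 1) -> DocumentLayout:
    """Plan <out>/<relative-parent>/<stem>/ and the files inside it."""
    folder = PurePosixPath(out_root) / relative_parent / stem
    parts = tuple(folder / f"{stem}_{i:03d}.md" for i in range(1, part_count + 1))
    return DocumentLayout(folder, parts, folder / "images", folder / f"{stem}_report.md")


def find_duplicate_markdown(layouts: list[DocumentLayout]) -> list[PurePosixPath]:
    """Return Markdown paths planned more than once across all source PDFs."""
    # images_dir and report_path are never compared here, so parts of the
    # same PDF may share them without being flagged.
    counts = Counter(path for layout in layouts for path in layout.markdown_parts)
    return sorted(path for path, seen in counts.items() if seen > 1)


layout = plan_document("out", "sub/dir", "paper", part_count=2)
print(layout.markdown_parts[1])  # out/sub/dir/paper/paper_002.md
```

A preflight step could call `find_duplicate_markdown` on every planned layout and fail before conversion when the result is non-empty and overwrite is false.
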
### WP16.2: Markdown Part Numbering

Replace public part names:

```text
paper.part-001.pages-001-020.md
paper.part-002.pages-021-040.md
```

with:

```text
paper_001.md
paper_002.md
```

Rules:

- Part index is based on final output group order, not source page number.
- The report must still record source page ranges for each part.
- Failed groups should not create a Markdown file, but the report must mention the failed part and source page range.

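The sketch below shows one reading of these rules. `PartRecord` and `build_part_records` are hypothetical names. The sketch assumes that a failed group keeps its position in the numbering, so later parts are not renumbered; the contract does not state this explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartRecord:
    """What the report needs to know about one output group (hypothetical)."""

    index: int
    first_page: int
    last_page: int
    markdown_name: Optional[str]  # None when the group failed


def build_part_records(stem: str, groups: list[tuple[int, int, bool]]) -> list[PartRecord]:
    """groups holds (first_page, last_page, succeeded) in final output order."""
    records = []
    for index, (first, last, succeeded) in enumerate(groups, start=1):
        # The index comes from group order, never from the page numbers.
        name = f"{stem}_{index:03d}.md" if succeeded else None
        records.append(PartRecord(index, first, last, name))
    return records


for record in build_part_records("paper", [(1, 20, True), (21, 40, False), (41, 45, True)]):
    print(record)
```

Every record keeps its source page range, so the report can still list the failed group with its pages even though no Markdown file exists for it.
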
### WP16.3: Shared Images Folder

Replace per-output asset directories:

```text
paper.part-001.pages-001-020.assets/
paper.part-002.pages-021-040.assets/
```

with:

```text
paper/images/
```

Implementation guidance:

- Copy all assets for one source PDF into the shared `images/` folder.
- Rewrite Markdown links to `images/<asset-name>`.
- Use deterministic collision-safe filenames. Recommended pattern:
  - page-known assets: `page-001_<original-name>`, with `-002` suffixes when needed.
  - page-unknown assets: `asset-001<suffix>`, preserving the original suffix when available.
- Keep asset-link validation pointed at the shared `images/` directory.

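The recommended naming pattern could look like the following sketch. `unique_asset_name` and `image_link_target` are hypothetical helpers. Placing the `-002` collision suffix before the file extension is an assumption, since the contract does not say where the suffix goes.

```python
from pathlib import PurePosixPath
from typing import Optional


def unique_asset_name(original_name: str, page: Optional[int], used: set[str]) -> str:
    """Pick a deterministic name for the shared images/ folder and record it in `used`."""
    original = PurePosixPath(original_name)
    suffix = original.suffix  # empty string when the asset has no extension
    if page is None:
        # Page-unknown assets: asset-001<suffix>, asset-002<suffix>, ...
        counter = 1
        while (candidate := f"asset-{counter:03d}{suffix}") in used:
            counter += 1
    else:
        # Page-known assets: page-001_<original-name>.
        base = f"page-{page:03d}_{original.stem}"
        candidate = f"{base}{suffix}"
        counter = 2
        while candidate in used:
            # Assumed placement: -002, -003, ... before the extension.
            candidate = f"{base}-{counter:03d}{suffix}"
            counter += 1
    used.add(candidate)
    return candidate


def image_link_target(asset_name: str) -> str:
    """Relative link target used inside every Markdown part of the PDF."""
    return f"images/{asset_name}"


used: set[str] = set()
print(unique_asset_name("fig.png", 1, used))  # page-001_fig.png
print(unique_asset_name("fig.png", 1, used))  # page-001_fig-002.png
```

Names depend only on the order in which assets are processed, so processing pages in source order keeps the result reproducible between runs.
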
### WP16.4: One Report, No Metadata JSON

Stop writing metadata JSON as a user-facing output file.

Implementation guidance:

- Continue building internal metadata dictionaries or records for each part so report generation and `ConversionResult` summaries stay traceable.
- Add an aggregate report path at `<stem>/<stem>_report.md`.
- The report must include:
  - source PDF path,
  - output folder path,
  - Markdown part list with page ranges,
  - engine and engine options,
  - final status,
  - warning count,
  - asset count,
  - missing/invalid asset link counts,
  - inline/display formula counts,
  - MathJax render error count,
  - text fidelity summary when available,
  - failed source pages or failed parts when any exist,
  - warnings grouped by page or part.
- `ConversionResult.metadata_path` should be `None` for simplified outputs.
- `ConversionResult.report_path` should point to the shared report path.

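As a rough illustration of aggregating per-part records into one report, consider the sketch below. `PartSummary`, its fields, `render_report`, and the status labels `success`, `partial`, and `failed` are placeholders chosen for this example. The sketch covers only a few of the required report fields.

```python
from dataclasses import dataclass, field


@dataclass
class PartSummary:
    """In-memory summary for one Markdown part (hypothetical fields)."""

    label: str
    first_page: int
    last_page: int
    failed: bool = False
    asset_count: int = 0
    warnings: list[str] = field(default_factory=list)


def render_report(source_pdf: str, parts: list[PartSummary]) -> str:
    """Render one Markdown report whose totals combine every part."""
    failed = [part for part in parts if part.failed]
    if not failed:
        status = "success"
    elif len(failed) == len(parts):
        status = "failed"
    else:
        status = "partial"
    lines = [
        f"# Conversion Report: {source_pdf}",
        "",
        f"- Final status: {status}",
        f"- Warning count: {sum(len(part.warnings) for part in parts)}",
        f"- Asset count: {sum(part.asset_count for part in parts)}",
        "",
        "## Parts",
        "",
    ]
    for part in parts:
        state = "failed" if part.failed else "ok"
        lines.append(f"- {part.label}: pages {part.first_page}-{part.last_page} ({state})")
    return "\n".join(lines) + "\n"


print(render_report("paper.pdf", [PartSummary("paper_001.md", 1, 20, asset_count=2)]))
```

Because the summaries stay in memory and only the rendered text is written, nothing in this flow needs a metadata JSON file on disk.
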
### WP16.5: CLI, UI, And Documentation

Update user-facing docs and tests to remove metadata JSON as an expected output.

Implementation guidance:

- `pdf2md convert` summary may keep printing Markdown paths and warning counts.
- Update CLI help for `--metadata` to say metadata JSON output is disabled or deprecated in the simplified layout.
- Update README examples to show the new folder layout.
- Update PRD and ARCHITECTURE so they no longer claim metadata JSON is required as a public artifact.
- Keep internal provenance wording clear: warnings and report are still derived from internal metadata-like records.
- Update optional fixture documentation so generated metadata JSON is not required for sample validation.

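The flag behavior described in this contract can be sketched with the standard library's `argparse`. This is an assumption made for illustration: the contract does not say which CLI framework `src/pdf2md/cli.py` uses, `build_parser` is a hypothetical name, and only the options discussed here are modeled.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Model --chunk-pages and --metadata only (illustrative, not the real CLI)."""
    parser = argparse.ArgumentParser(prog="pdf2md convert")
    parser.add_argument("pdf")
    parser.add_argument("--out")
    parser.add_argument(
        "--chunk-pages",
        nargs="?",
        type=int,
        const=20,  # bare --chunk-pages: default grouping size
        default=None,  # flag absent: whole PDF in one part
        help="Group output Markdown by this many source pages (default 20 when given without a value).",
    )
    parser.add_argument(
        "--metadata",
        action="store_true",
        help="Deprecated no-op: metadata JSON output is disabled in the simplified layout.",
    )
    return parser


args = build_parser().parse_args(["paper.pdf", "--out", "out", "--chunk-pages"])
print(args.chunk_pages)  # 20
```

Keeping `--metadata` as a parsed but unused flag means existing scripts keep running while the help text tells users that nothing is written.
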
## Implementation Task Plan

### Task 1: Path Planning For Simplified Layout

Files:

- Modify `src/pdf2md/paths.py`.
- Modify `tests/test_paths.py`.

Steps:

- [ ] Add failing tests showing `plan_outputs()` maps `paper.pdf` to `out/paper/paper_001.md`, `out/paper/images`, no metadata path, and `out/paper/paper_report.md`.
- [ ] Add a failing test for Korean filenames, using the PDF stem exactly as the output folder and file prefix.
- [ ] Add a failing test for recursive input preserving `relative_parent`.
- [ ] Add a failing test that duplicate source stems in the same relative parent conflict before conversion.
- [ ] Implement the minimal path planning changes.
- [ ] Run `uv run pytest tests/test_paths.py`.
- [ ] Commit path planning changes.

### Task 2: Single-Output Conversion Writes Simplified Files

Files:

- Modify `src/pdf2md/conversion.py`.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.

Steps:

- [ ] Add failing conversion tests showing a non-chunked fake-adapter conversion writes `out/paper/paper_001.md`, `out/paper/images`, and `out/paper/paper_report.md`.
- [ ] Add failing assertions that no `.metadata.json` file is written and `result.metadata_path is None`.
- [ ] Add failing CLI test showing `pdf2md convert paper.pdf --out out` creates the simplified folder.
- [ ] Implement the minimal conversion changes for non-chunked output.
- [ ] Run `uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_paths.py`.
- [ ] Commit single-output conversion changes.

### Task 3: Grouped Output Parts And Shared Images

Files:

- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md/pdf_splitter.py` only if a small helper is needed.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.

Steps:

- [ ] Add failing tests for `chunk_pages=20` showing final Markdown names are `paper_001.md`, `paper_002.md`, not `paper.part-...md`.
- [ ] Add failing tests proving all grouped assets are copied into `paper/images/` and Markdown links use `images/...`.
- [ ] Add failing tests proving asset collisions across pages get deterministic unique filenames.
- [ ] Add failing tests proving failed page conversions are represented in the shared report while later pages still convert.
- [ ] Implement grouped output naming and shared image handling.
- [ ] Run `uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_pdf_splitter.py`.
- [ ] Commit grouped output changes.

### Task 4: Aggregate Report Without Metadata JSON

Files:

- Modify `src/pdf2md/report.py` or add a focused aggregate report helper.
- Modify `src/pdf2md/conversion.py`.
- Modify `tests/test_report.py`.
- Modify `tests/test_conversion.py`.

Steps:

- [ ] Add failing report tests for a one-file report listing multiple Markdown parts and source page ranges.
- [ ] Add failing conversion tests proving only one report exists for a chunked PDF.
- [ ] Add failing tests proving report summary totals combine all output parts.
- [ ] Add failing tests proving all-failed conversions write a report but no Markdown part.
- [ ] Implement aggregate report rendering from internal metadata records.
- [ ] Run `uv run pytest tests/test_report.py tests/test_conversion.py`.
- [ ] Commit report changes.

### Task 5: Recheck, CLI Compatibility, UI Text, And Docs

Files:

- Modify `src/pdf2md/cli.py`.
- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md_ui/runner.py` and `src/pdf2md_ui/app.py` only if text/output assumptions change.
- Modify `README.md`.
- Modify `PRD.md`.
- Modify `ARCHITECTURE.md`.
- Modify `docs/V1IMPLEMENTATIONPLAN.md`.
- Modify `tests/test_cli.py`.
- Modify `tests/test_ui_runner.py` only if UI behavior changes.
- Modify `tests/integration/test_v1_fast_release_gate.py`.
- Modify `tests/integration/test_optional_mineru_fixtures.py`.

Steps:

- [ ] Add failing CLI tests proving `--metadata` remains accepted but no metadata JSON is written.
- [ ] Add failing recheck test proving simplified outputs without metadata fail with a clear legacy-metadata message.
- [ ] Update integration tests to require Markdown part files, one report, and image links, not metadata JSON.
- [ ] Update README, PRD, ARCHITECTURE, and release-gate wording for the simplified layout.
- [ ] Implement CLI/recheck/doc changes.
- [ ] Run `uv run pytest tests/test_cli.py tests/test_ui_runner.py tests/integration/test_v1_fast_release_gate.py`.
- [ ] Commit CLI, UI, integration, and documentation changes.

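The legacy-only recheck behavior might be approximated as below. This is a sketch under assumptions: `RecheckUnsupportedError` and `find_legacy_metadata` are hypothetical names, the error wording is illustrative, and the sibling naming follows the older `<stem>.md` plus `<stem>.metadata.json` layout described in the Objective.

```python
from pathlib import Path


class RecheckUnsupportedError(Exception):
    """Raised when a Markdown file has no adjacent legacy metadata JSON."""


def find_legacy_metadata(markdown_path: Path) -> Path:
    """Return the sibling <stem>.metadata.json or fail with a clear message."""
    candidate = markdown_path.with_name(f"{markdown_path.stem}.metadata.json")
    if not candidate.is_file():
        raise RecheckUnsupportedError(
            f"{markdown_path} has no adjacent metadata JSON ({candidate.name}); "
            "recheck supports legacy outputs only."
        )
    return candidate
```

A simplified output such as `paper/paper_001.md` never has that sibling file, so it takes the error path and the user sees why recheck cannot proceed.
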
### Task 6: Final Verification And Handoff

Files:

- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.
- Modify `docs/Sprints/SPRINT16CONTRACT.md` status and handoff fields.

Steps:

- [ ] Run focused Sprint 16 verification:

```powershell
uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
```

- [ ] Run full default verification:

```powershell
uv run pytest
```

- [ ] Run diff check:

```powershell
git diff --check
```

- [ ] Update `PROGRESS.md` with files changed, checks run, residual risks, and next actions.
- [ ] Archive completed implementation evidence in `docs/WORKARCHIVE.md`.
- [ ] Commit final coordination updates.

## Verification Commands

```powershell
uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```

Optional local fixture validation after implementation:

```powershell
$env:MINERU_MODEL_SOURCE='local'
uv run pdf2md convert samples\SolidElement.pdf --out outputs\SolidElement_sprint16_layout --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local
```

Expected optional validation:

- Output folder is the output root plus `SolidElement\`: `outputs\SolidElement_sprint16_layout\SolidElement\` for the command above, or `outputs\SolidElement\` when the output root is `outputs`.
- Markdown part is `SolidElement_001.md` for the 6-page sample.
- Report is `SolidElement_report.md`.
- Images are under `images\`.
- No metadata JSON exists.

## Acceptance Criteria

- Each input PDF writes into an output folder named after the PDF stem.
- Markdown outputs are named `<stem>_001.md`, `<stem>_002.md`, and so on.
- All image/media assets for one PDF live under `<stem>/images/`.
- Markdown links point to `images/...`.
- Exactly one report file is written per input PDF at `<stem>/<stem>_report.md`.
- No metadata JSON file is written for new conversions.
- Internal warning, provenance, formula count, asset count, and text fidelity information remains available in the report.
- Chunk mode still converts one source page per MinerU run and groups Markdown by `chunk_pages`.
- Strict-local and MinerU-only constraints remain unchanged.
- Default tests stay fast and local.

## Hard Failure Criteria

- Any new conversion writes `.metadata.json` as a public output.
- Output files keep old `part-001.pages-...` names.
- Assets are split into per-part `.assets` folders.
- More than one report is written for one input PDF.
- Markdown links point outside the PDF output folder.
- Chunk mode stops using one source page per MinerU run.
- Strict-local enforcement is weakened.
- Default tests require real MinerU, GPU, model files, network, Obsidian, MathJax, or `samples/`.
- Sample PDFs, generated outputs, local model files, or `dist/pdf2md-ui.exe` are committed.

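Several of these failure criteria are visible on disk, so a test or release gate could check them with something like the sketch below. `find_layout_violations` is a hypothetical helper. It inspects only the top level of one `<stem>/` folder and does not validate Markdown link targets, chunk-mode behavior, or strict-local enforcement.

```python
import re
from pathlib import Path

# Matches old public names such as paper.part-001.pages-001-020.md
LEGACY_PART = re.compile(r"\.part-\d+\.pages-")


def find_layout_violations(pdf_folder: Path) -> list[str]:
    """Return human-readable violations for one <stem>/ output folder."""
    stem = pdf_folder.name
    problems = []
    for path in sorted(pdf_folder.iterdir()):
        if path.name.endswith(".metadata.json"):
            problems.append(f"metadata JSON written: {path.name}")
        if LEGACY_PART.search(path.name):
            problems.append(f"legacy part name: {path.name}")
        if path.is_dir() and path.name.endswith(".assets"):
            problems.append(f"per-part assets folder: {path.name}")
    reports = sorted(p.name for p in pdf_folder.glob("*_report.md"))
    if reports != [f"{stem}_report.md"]:
        problems.append(f"expected exactly one report, found: {reports}")
    return problems
```

An empty list means none of the modeled criteria were hit; it does not prove the remaining criteria are satisfied.
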
## Open Questions

- Should metadata-free `pdf2md recheck` be restored in a later sprint by deriving enough state from the report and Markdown, or is rerunning conversion acceptable for simplified outputs?
- Should raw MinerU outputs under `--keep-raw` be flattened into `raw/` or kept per part under `raw/<stem>_001/`? This contract recommends per-part raw folders to avoid collisions.

## Handoff Requirements

After implementation:

- Update this contract status to `Implemented`.
- Record final file layout examples in `README.md`.
- Record verification commands and outcomes in `PROGRESS.md`.
- Archive implementation and optional sample validation results in `docs/WORKARCHIVE.md`.
- Keep generated outputs and sample PDFs uncommitted.

## Implementation Handoff

- Files changed: `src/pdf2md/paths.py`, `src/pdf2md/conversion.py`, `src/pdf2md/report.py`, `src/pdf2md/cli.py`, `src/pdf2md_ui/runner.py`, focused tests, and current docs.
- Output layout implemented: `<out>/<stem>/<stem>_001.md`, additional numbered parts when grouped, `<out>/<stem>/images/`, and `<out>/<stem>/<stem>_report.md`.
- Metadata JSON behavior: new conversions do not write public `.metadata.json`; `ConversionResult.metadata_path` is `None`; internal metadata-like records still feed reports and tests.
- Recheck behavior: `pdf2md recheck` remains legacy-only and requires adjacent metadata JSON.
- Verification recorded in `PROGRESS.md`: focused Sprint 16 tests passed, full `uv run pytest` passed 227 tests with 1 optional skip, and `git diff --check` passed with line-ending warnings only.