Files
2026-05-08 16:42:19 +09:00

361 lines
15 KiB
Markdown

# Sprint 7 Contract: Conversion Orchestrator, CLI, And Python API
Status: Implemented
Last updated: 2026-05-08
## Objective
Connect the existing project-owned boundaries into a working conversion orchestration layer, public Python API, and `pdf2md convert` CLI path.
Sprint 7 must establish:
- A public `convert_pdf` API for one local PDF.
- A batch conversion API or helper for directory inputs.
- A `pdf2md convert INPUT --out OUTPUT_DIR` command.
- Product behavior that writes Markdown, optional metadata JSON, and `<stem>.report.md`.
- Local asset materialization for adapter-provided asset files.
- CLI summaries that surface success, failure, and warning counts.
- Fast tests that use fake adapter outputs and do not require real MinerU, model files, GPU, sample PDFs, network, Obsidian, or LaTeX tooling.
Sprint 7 is an orchestration sprint. It may call the real `MinerUAdapter` in the normal production path, but the default test suite must use injected fake adapters and must not execute MinerU.
## Current Precondition
Sprint 6 is complete:
- `src/pdf2md/paths.py` owns input discovery and output path planning.
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
- `src/pdf2md/mineru_adapter.py` owns the mocked direct local MinerU CLI adapter boundary.
- `src/pdf2md/markdown.py` owns Obsidian Markdown normalization, asset link warnings, and table fallback warnings.
- `src/pdf2md/quality.py` owns local quality checks over normalized Markdown and asset context.
- `src/pdf2md/report.py` owns report content rendering and final status calculation.
- `uv run pytest` passed 103 tests.
Sprint 7 may compute source SHA-256, create conversion output directories, copy local adapter-provided asset files, write final Markdown, write metadata JSON when requested, and write `<stem>.report.md`. It must keep public return types project-owned and must not require raw MinerU-specific Python objects from callers.
## Touched Surfaces
Allowed:
- `src/pdf2md/conversion.py`
- `src/pdf2md/cli.py`
- `src/pdf2md/__init__.py`
- `src/pdf2md/paths.py` only for narrowly required path/output helper compatibility
- `src/pdf2md/mineru_adapter.py` only for narrowly required adapter protocol or result compatibility
- `src/pdf2md/metadata.py` only for narrowly required output-location or summary compatibility
- `src/pdf2md/markdown.py` only for narrowly required orchestration compatibility
- `src/pdf2md/quality.py` only for narrowly required orchestration compatibility
- `src/pdf2md/report.py` only for narrowly required orchestration compatibility
- `tests/test_conversion.py`
- `tests/test_cli.py`
- Existing focused unit tests only if a touched module requires compatibility updates
- `README.md` only if a short usage note is needed for `pdf2md convert`
- `PLAN.md`
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md`
- `docs/Sprints/SPRINT7CONTRACT.md`
Not allowed:
- `src/pdf2md/doctor.py`
- Working `pdf2md doctor` behavior
- `scripts/`
- MinerU/model installation or download scripts
- Real MinerU invocation in default tests
- Real GPU/CUDA checks
- Real PDF content parsing outside adapter output handling
- Runtime engine selection or alternate engine support
- Cloud OCR, remote LLM/VLM, hosted renderer, remote document parser, remote asset fetching, HTTP client backend, router mode, `--api-url`, remote APIs, or remote OpenAI-compatible backend support
- A CLI flag that disables strict-local policy
- Committed files under `samples/`
## Expected Outputs
Sprint 7 should produce:
1. Public conversion records and API
- A project-owned conversion result type containing at least:
- source PDF path
- Markdown output path
- metadata JSON path when written
- report path
- assets directory
- raw output directory when kept
- engine name and version
- final status
- warning count
- warnings
- `convert_pdf(input_path, output_dir, metadata=True, keep_raw=False, overwrite=False, gpu=None, strict_local=True, adapter=None, clock=None)` or an equivalently small API.
- The default API path uses the direct local MinerU adapter.
- Tests can inject a fake adapter and deterministic clock.
- The public return type must not expose raw MinerU-specific Python objects as required fields.
2. Single-PDF orchestration
- Discover and plan the single PDF using existing path helpers.
- Create required output directories only after preflight path checks pass.
- Run the adapter into a planned temporary or raw work directory.
- Stop the individual conversion on adapter hard failure and return explicit warnings/status.
- Normalize adapter Markdown into Obsidian-friendly Markdown.
- Copy local adapter-provided asset files into the planned assets directory when needed.
- Compute source SHA-256 with local file reads.
- Build metadata from project-owned records.
- Run local quality checks over normalized Markdown and asset context.
- Render report Markdown from metadata and quality results.
- Write final Markdown, optional metadata JSON, and report Markdown.
3. Output writing and overwrite behavior
- Never write outside the planned output root.
- Respect existing-output conflicts unless `overwrite=True`.
- Keep writes deterministic and UTF-8 encoded for text outputs.
- Preserve `--metadata` behavior: metadata JSON is written when enabled and omitted when disabled.
- Always write `<stem>.report.md`.
- `--keep-raw` preserves raw MinerU output in the planned raw directory.
- Without `--keep-raw`, temporary raw work must be cleaned up when the conversion completes, while preserving enough failure context in metadata/report/warnings.
4. Asset materialization
- Copy only local files returned by the adapter.
- Do not fetch remote assets.
- Do not follow asset paths that escape the adapter work directory or the planned output root.
- Handle missing or invalid adapter asset paths with project-owned warnings.
- Normalize final Markdown asset links to stable relative paths.
5. CLI convert command
- `pdf2md convert INPUT --out OUTPUT_DIR`.
- Options:
- `--metadata`
- `--keep-raw`
- `--recursive`
- `--overwrite`
- `--gpu GPU_DEVICE`
- `--strict-local`
- Strict-local must remain enabled in v1; the CLI must not add a supported way to disable it.
- Single PDF conversion returns exit code `0` on success or partial success, and non-zero when a hard error prevents conversion.
- Directory conversion handles multiple PDFs deterministically and prints a summary.
- Batch conversion should continue to the next file when one PDF fails after planning, then return a non-zero exit code if any PDF failed.
- CLI output must include converted count, failed count, and warning count.
6. Batch conversion
- Directory inputs use existing non-recursive discovery by default.
- Recursive discovery occurs only with `--recursive`.
- Output paths preserve relative subdirectories from the input root.
- Duplicate planned outputs and overwrite conflicts fail before conversion starts.
- Results are deterministic and ordered by discovery/path planning order.
7. Failure and warning behavior
- MinerU failure must be clear and must not trigger fallback to any other engine.
- Strict-local violations must be hard failures.
- Per-file failures must include project-owned warnings.
- CLI summaries must not suppress warning counts.
- Metadata/report content must reflect warnings emitted during adapter, normalization, asset, and quality steps.
8. Tests
- API test for one successful conversion with a fake adapter.
- API test for adapter failure with no fallback.
- API test for output conflict and overwrite behavior.
- API test for metadata disabled behavior.
- API test for local asset copying and relative Markdown links.
- API test for `keep_raw` behavior.
- CLI test for single PDF conversion with a fake adapter.
- CLI test for directory conversion with deterministic summary output.
- CLI test for recursive behavior.
- CLI test for failure summary and non-zero exit code.
- Tests proving default checks do not require real MinerU, GPU, models, network, `samples/`, Obsidian, or LaTeX tooling.
9. Handoff
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
## Non-Goals
- Do not implement `pdf2md doctor`.
- Do not implement environment diagnostics.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not probe real MinerU output with local sample PDFs.
- Do not add setup scripts.
- Do not implement runtime engine selection.
- Do not add alternate engines.
- Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, hosted renderer, or remote asset-fetching support.
- Do not add a CLI flag or API option that disables strict-local policy.
- Do not require real MinerU, CUDA, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/` in default tests.
- Do not implement real math rendering; use the Sprint 6 local checker boundary.
- Do not commit generated conversion outputs or sample PDFs.
## Work Packages
### WP7.1: Public API And Result Records
Owner:
- `feature-generator-agent`
- `requirements-guard-agent`
Actions:
- Add `conversion.py`.
- Define project-owned conversion result records.
- Expose `convert_pdf` from the library surface.
- Support fake adapter and deterministic clock injection for tests.
Output:
- Callers can run one conversion without depending on raw MinerU objects.
### WP7.2: Single-PDF Orchestration And Output Writing
Owner:
- `feature-generator-agent`
- `mineru-integration-agent`
- `metadata-agent`
- `obsidian-markdown-agent`
Actions:
- Connect path planning, adapter execution, Markdown normalization, metadata building, quality checks, and report rendering.
- Write final Markdown, optional metadata JSON, and report Markdown.
- Compute source SHA-256.
- Preserve strict-local behavior and no-fallback behavior.
Output:
- One PDF can be converted through mocked adapter outputs in tests and through the real adapter in normal use.
### WP7.3: Asset And Raw Output Handling
Owner:
- `feature-generator-agent`
- `obsidian-markdown-agent`
- `metadata-agent`
Actions:
- Copy local adapter-provided assets into the planned assets directory.
- Normalize Markdown links relative to the final Markdown file.
- Preserve raw output only when requested.
- Clean temporary work when raw output is not requested.
Output:
- Markdown, assets, metadata, and report paths are stable and local-only.
### WP7.4: CLI Convert And Batch Summary
Owner:
- `feature-generator-agent`
- `requirements-guard-agent`
Actions:
- Replace the placeholder CLI with `convert` while keeping `--version`.
- Add only the agreed v1 options.
- Print deterministic summaries with converted, failed, and warning counts.
- Return non-zero exit code when hard failures occur.
Output:
- Users can run `pdf2md convert` for one PDF or a directory.
### WP7.5: Independent Evaluation
Owner:
- `evaluation-agent`
Actions:
- Review completed orchestration behavior against this contract.
- Verify no default test executes real MinerU, uses GPU, downloads models, uses network, or requires `samples/`.
- Verify no runtime remote/API path or alternate engine is introduced.
- Verify `samples/` remains untracked and unstaged.
Output:
- PASS/FAIL notes with any missing acceptance criteria.
## Verification Checks
Required:
- `git status --short --untracked-files=all` before staging confirms `samples/` remains untracked and unstaged.
- `uv --version` is run and result is recorded.
- `uv sync` passes.
- `uv run pytest tests/test_conversion.py tests/test_cli.py` passes.
- `uv run pytest` passes.
- `git diff --check` passes.
- Default tests do not require real MinerU, CUDA, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No alternate engine or runtime engine selection is added.
- No CLI/API option disables strict-local policy.
- No `--api-url`, router mode, HTTP client backend, remote API, or remote OpenAI-compatible backend support is added.
- Adapter failures produce explicit failed results and no fallback conversion.
- Output files are written only after path preflight succeeds.
- Existing outputs are protected unless overwrite is enabled.
- CLI summaries include warning counts.
- Metadata/report paths and counts match the files written.
Recommended:
- Keep conversion orchestration small and dependency-injected.
- Prefer local temporary directories from the standard library for raw work when `keep_raw` is disabled.
- Keep batch conversion a thin loop over single-file conversion.
- Keep CLI formatting simple and stable enough for tests.
- Use fake adapter records in tests rather than monkeypatching subprocess behavior at the CLI layer.
## Hard Failure Criteria
Sprint 7 fails and must stop for a user decision if any of these are true:
- Public API requires or exposes raw MinerU-specific Python objects as required return fields.
- The implementation silently falls back to another engine after MinerU failure.
- A CLI/API option disables strict-local policy.
- The implementation adds or permits `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- Default tests execute real MinerU, require GPU/CUDA, download models, use network, require Obsidian/LaTeX tooling, or require `samples/`.
- Output writing can escape the planned output root.
- Existing files are overwritten without explicit overwrite intent.
- CLI writes final outputs after a preflight hard failure.
- CLI summaries suppress warning counts or failed counts.
- Metadata/report content omits warnings emitted during adapter, normalization, asset, or quality steps.
- `samples/` is staged or committed.
## Acceptance Criteria
Sprint 7 is complete when:
- `src/pdf2md/conversion.py` exists and owns conversion orchestration.
- `convert_pdf` is available from the public Python package.
- `pdf2md convert INPUT --out OUTPUT_DIR` exists.
- Single-PDF conversion is tested with a fake adapter.
- Directory and recursive conversion behavior is tested with fake adapters.
- Output conflict and overwrite behavior is tested.
- Adapter failure produces a clear failed result and no fallback.
- Final Markdown, metadata JSON when enabled, and report Markdown are written by product behavior.
- Local adapter-provided assets are copied or warned about deterministically.
- `--keep-raw` behavior is tested.
- CLI summaries include converted, failed, and warning counts.
- Default tests do not require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- `uv sync` passes.
- Targeted conversion/CLI tests pass.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.
## Handoff Fields
Use these fields when Sprint 7 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 8:
- Next action: