# Sprint 2 Contract: Paths, Input Discovery, And Overwrite Planning

Status: Completed
Last updated: 2026-05-07

## Objective

Implement deterministic input discovery and output path planning before any PDF conversion logic exists.

Sprint 2 must establish:

- A project-owned path planning module for local PDF inputs.
- Deterministic discovery for a single PDF, a directory, and optional recursive directory traversal.
- Deterministic planned output paths for Markdown, assets, metadata JSON, quality report, and optional raw MinerU output.
- Preflight overwrite conflict detection that prevents accidental replacement unless overwrite is explicitly allowed.
- Fast unit tests using generated temporary files, including non-ASCII filenames.

Sprint 2 is path planning only. It must not run MinerU, parse PDFs, write conversion outputs, normalize Markdown, create metadata content, or add the real `convert` command behavior.

## Current Precondition

Sprint 1 is complete:

- `uv` is installed per-user at `C:\Users\user\.local\bin`.
- `pyproject.toml`, `uv.lock`, the `pdf2md` package, CLI placeholder, and fast pytest loop exist.
- `uv sync` and `uv run pytest` passed.

If a new shell cannot find `uv`, prepend `C:\Users\user\.local\bin` to PATH for verification commands and record that in `PROGRESS.md`.

## Touched Surfaces

Allowed:

- `src/pdf2md/paths.py`
- `src/pdf2md/conversion.py` only for a minimal type boundary if path planning cannot be tested cleanly without it
- `src/pdf2md/cli.py` only if a minimal parser hook is needed for path-planning tests; do not expose working conversion behavior
- `tests/test_paths.py` or `tests/unit/test_paths.py`
- `tests/test_cli.py` only for path-planning parser coverage if `cli.py` changes
- `README.md` only if setup/test instructions need a small update
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
- `docs/Sprints/SPRINT2CONTRACT.md`

Not allowed:

- `src/pdf2md/mineru_adapter.py`
- `src/pdf2md/ir.py`
- `src/pdf2md/markdown.py`
- `src/pdf2md/metadata.py`
- `src/pdf2md/quality.py`
- `src/pdf2md/report.py`
- `src/pdf2md/doctor.py`
- `scripts/`
- Any real MinerU invocation
- Any model download or install script
- Any file parsing beyond local filesystem path and extension checks
- Any conversion output writing beyond temporary files created by tests
- Any committed file under `samples/`

## Expected Outputs

Sprint 2 should produce:

1. Input discovery
   - Accept a local path that is either a PDF file or a directory.
   - Treat `.pdf` extension matching as case-insensitive.
   - Reject a non-existent path with a clear project-owned error.
   - Reject a non-PDF file with a clear project-owned error.
   - Reject a directory with no discovered PDFs with a clear project-owned error.
   - Discover only direct child PDFs for directory input unless recursive traversal is requested.
   - Discover nested PDFs only when recursive traversal is requested.
   - Return discovered PDFs in a deterministic order.

2. Output path plan
   - For each discovered PDF, plan:
     - Markdown path: `<output-root>/<relative-parent>/<stem>.md`.
     - Assets directory: `<output-root>/<relative-parent>/<stem>.assets`.
     - Metadata path when metadata is enabled: `<output-root>/<relative-parent>/<stem>.metadata.json`.
     - Quality report path: `<output-root>/<relative-parent>/<stem>.report.md`.
     - Raw MinerU directory when raw output is kept: `<output-root>/<relative-parent>/<stem>.raw`.
   - For a single PDF input, `relative-parent` is empty unless the implementation has a tested reason to preserve more context.
   - For recursive directory input, preserve the source-relative subdirectory under the output root to avoid filename collisions.
   - Keep planned paths local filesystem paths. Do not introduce URI, URL, cloud, or remote storage handling.

3. Overwrite preflight
   - Detect existing planned file or directory outputs before conversion writes occur.
   - Report all detected conflicts in one project-owned error instead of failing on the first conflict.
   - Allow conflicts only when overwrite is explicitly enabled.
   - Do not delete or replace files in Sprint 2.

4. Tests
   - Unit tests for single PDF discovery.
   - Unit tests for non-recursive directory discovery.
   - Unit tests for recursive directory discovery.
   - Unit tests for deterministic ordering.
   - Unit tests for non-ASCII filenames, including Korean filenames, using temporary files.
   - Unit tests for invalid input errors.
   - Unit tests for planned Markdown, assets, metadata, report, and raw output paths.
   - Unit tests for overwrite conflict detection.

5. Handoff
   - `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
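
The test expectations in item 4 might look like the following. This is a minimal sketch rather than the project's actual tests: `find_pdfs` is a hypothetical stand-in for whatever discovery function `src/pdf2md/paths.py` exposes, and the filenames are invented for illustration. `tmp_path` is pytest's built-in temporary-directory fixture.

```python
from __future__ import annotations

from pathlib import Path


def find_pdfs(directory: Path) -> list[Path]:
    """Hypothetical stand-in: direct-child PDFs in a deterministic order."""
    return sorted(
        (p for p in directory.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name,
    )


def test_discovers_korean_and_uppercase_names(tmp_path: Path) -> None:
    # Empty placeholder files are enough: Sprint 2 never reads PDF contents.
    (tmp_path / "보고서.pdf").write_bytes(b"")
    (tmp_path / "UPPER.PDF").write_bytes(b"")
    (tmp_path / "notes.txt").write_text("not a pdf", encoding="utf-8")

    names = [p.name for p in find_pdfs(tmp_path)]

    # Exact list equality checks filtering and ordering in one assertion.
    assert names == ["UPPER.PDF", "보고서.pdf"]
```

Because the files are generated inside the fixture directory, a test like this needs nothing from `samples/`. On filesystems that normalize Unicode filenames, the comparison may need to normalize both sides first.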

## Non-Goals

- Do not implement PDF conversion.
- Do not implement conversion orchestration.
- Do not implement the MinerU adapter.
- Do not run MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not parse PDF contents.
- Do not compute source SHA-256.
- Do not implement Markdown normalization.
- Do not implement metadata JSON content.
- Do not implement `.report.md` content.
- Do not implement `pdf2md convert` as a working command.
- Do not implement `pdf2md doctor`.
- Do not add runtime engine selection.
- Do not add alternate conversion engines.
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.

## Work Packages

### WP2.1: Path Planning Types And Errors

Owner:

- `feature-generator-agent`

Actions:

- Add the smallest project-owned types needed to represent discovered inputs, planned outputs, and overwrite conflicts.
- Add clear project-owned exceptions or error result types for invalid inputs and conflicts.
- Avoid public API promises beyond what Sprint 2 tests verify.

Output:

- Path planning can be tested without converter execution.
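
One possible shape for these types is sketched below. Every name here (`PathPlanningError`, `InvalidInputError`, `OverwriteConflictError`, `PlannedOutput` and its fields) is a hypothetical choice made for illustration; the contract only requires that the types be small and project-owned.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Optional


class PathPlanningError(Exception):
    """Base class for project-owned path planning errors (hypothetical name)."""


class InvalidInputError(PathPlanningError):
    """Missing path, non-PDF file, or directory with no PDFs."""


class OverwriteConflictError(PathPlanningError):
    """Carries every conflicting path so they can be reported together."""

    def __init__(self, conflicts: list[Path]) -> None:
        self.conflicts = list(conflicts)
        listing = ", ".join(str(p) for p in self.conflicts)
        super().__init__(
            f"{len(self.conflicts)} planned output(s) already exist: {listing}"
        )


@dataclass(frozen=True)
class PlannedOutput:
    """Planned locations for one discovered PDF; constructing it writes nothing."""

    source_pdf: Path
    markdown_path: Path
    assets_dir: Path
    report_path: Path
    metadata_path: Optional[Path] = None  # set only when metadata is enabled
    raw_dir: Optional[Path] = None        # set only when raw output is kept
```

A frozen dataclass keeps a plan immutable once computed, which fits the goal of letting later code consume the plan without recomputing naming rules.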

### WP2.2: Input Discovery

Owner:

- `feature-generator-agent`

Actions:

- Implement single PDF and directory discovery.
- Require explicit recursive mode for subdirectory traversal.
- Sort results deterministically.
- Preserve local `Path` objects rather than converting to strings early.

Output:

- Discovery behavior matches PRD directory and recursive requirements.
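
The discovery rules above could be implemented roughly as follows. This is a sketch under assumptions, not the shipped module: `discover_pdfs`, its `recursive` parameter, and `InvalidInputError` are hypothetical names, and sorting by POSIX-style relative path is one of several orderings that would satisfy "deterministic".

```python
from __future__ import annotations

from pathlib import Path


class InvalidInputError(Exception):
    """Hypothetical project-owned error for rejected inputs."""


def _is_pdf(path: Path) -> bool:
    # Case-insensitive extension check only; file contents are never read.
    return path.is_file() and path.suffix.lower() == ".pdf"


def discover_pdfs(input_path: Path, recursive: bool = False) -> list[Path]:
    if not input_path.exists():
        raise InvalidInputError(f"Input path does not exist: {input_path}")
    if input_path.is_file():
        if not _is_pdf(input_path):
            raise InvalidInputError(f"Input file is not a PDF: {input_path}")
        return [input_path]

    # Descend into subdirectories only when the caller asks for it.
    candidates = input_path.rglob("*") if recursive else input_path.iterdir()
    # Sort on the relative POSIX form so the order does not depend on the
    # platform separator or on filesystem enumeration order.
    found = sorted(
        (p for p in candidates if _is_pdf(p)),
        key=lambda p: p.relative_to(input_path).as_posix(),
    )
    if not found:
        raise InvalidInputError(f"No PDF files found in directory: {input_path}")
    return found
```

The function returns `Path` objects throughout and converts to a string only inside the sort key.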

### WP2.3: Output Planning

Owner:

- `feature-generator-agent`

Actions:

- Plan Markdown, assets, metadata, report, and optional raw output paths.
- Preserve relative subdirectories for recursive directory input.
- Keep all planned outputs under the requested output root.

Output:

- Later conversion code can write outputs without rediscovering naming rules.
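
To make the naming rules concrete, the sketch below derives each planned path from a PDF, the root it was discovered under, and the output root. The function name `plan_output_paths`, its parameters, and the dictionary keys are assumptions made for illustration; only the path patterns come from this contract.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional


def plan_output_paths(
    pdf: Path,
    input_root: Path,
    output_root: Path,
    metadata: bool = True,
    keep_raw: bool = False,
) -> dict[str, Optional[Path]]:
    """Compute planned paths only; nothing is created on disk."""
    # For a single-file input, pass the PDF's parent as input_root so that
    # the relative parent is empty.
    relative_parent = pdf.parent.relative_to(input_root)
    base = output_root / relative_parent
    stem = pdf.stem
    return {
        "markdown": base / f"{stem}.md",
        "assets": base / f"{stem}.assets",
        "metadata": (base / f"{stem}.metadata.json") if metadata else None,
        "report": base / f"{stem}.report.md",
        "raw": (base / f"{stem}.raw") if keep_raw else None,
    }


# Usage: a nested PDF keeps its source-relative subdirectory.
plan = plan_output_paths(Path("in/sub/report.pdf"), Path("in"), Path("out"))
assert plan["markdown"] == Path("out/sub/report.md")
```

Preserving the relative parent means two PDFs with the same stem in different subdirectories cannot collide in the output tree.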

### WP2.4: Overwrite Conflict Detection

Owner:

- `feature-generator-agent`

Actions:

- Check whether any planned output already exists.
- Return or raise a structured conflict list when overwrite is not enabled.
- Permit the plan when overwrite is enabled without deleting anything.

Output:

- Existing user files are protected before conversion starts.
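
A preflight check along these lines would collect every conflict before reporting. It is a sketch only: `check_overwrite` and `OverwriteConflictError` are hypothetical names, and raising (rather than returning an error result) is just one of the two options the actions allow.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterable


class OverwriteConflictError(Exception):
    """Hypothetical error that lists all conflicting planned outputs."""

    def __init__(self, conflicts: list[Path]) -> None:
        self.conflicts = conflicts
        super().__init__(
            "Planned outputs already exist: " + ", ".join(str(p) for p in conflicts)
        )


def check_overwrite(planned: Iterable[Path], overwrite: bool = False) -> list[Path]:
    """Return existing planned paths; raise if any exist and overwrite is off.

    Read-only: relies on Path.exists() and never deletes or replaces anything.
    """
    # Gather all conflicts first so the caller sees them in one error.
    conflicts = sorted(p for p in planned if p.exists())
    if conflicts and not overwrite:
        raise OverwriteConflictError(conflicts)
    return conflicts
```

`Path.exists()` is true for both files and directories, so the same check covers `.md` files and `.assets` or `.raw` directories.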

### WP2.5: Independent Evaluation

Owner:

- `evaluation-agent`

Actions:

- Review the completed path planning implementation against this contract.
- Verify no conversion behavior, MinerU execution, remote runtime path, or alternate engine was added.
- Verify `samples/` remains untracked and unstaged.
- Verify tests use temporary files, not committed sample PDFs.

Output:

- PASS/FAIL notes with any missing acceptance criteria.

## Verification Checks

Required:

- `git status --short` before staging confirms `samples/` remains untracked.
- `uv --version` is run and the result is recorded.
- `uv sync` passes.
- `uv run pytest` passes.
- Targeted path planning tests pass.
- Tests do not require MinerU, CUDA, GPU, model files, `samples/`, or network.
- No real MinerU dependency is required for default tests.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion behavior is implemented.
- No output files are written outside temporary test directories.
- `git diff --check` passes.

Recommended:

- Use temporary directories for all filesystem tests.
- Include Windows-relevant path behavior without hard-coding Windows-only separators in assertions.
- Use `requirements-guard-agent` if path planning reveals a contradiction in PRD or architecture wording.
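
The recommendation about separators can be met by asserting on `Path` objects, `parts`, or `as_posix()` instead of raw strings. The snippet below is illustrative only; the paths in it are made up and are not taken from the project's tests.

```python
from pathlib import Path, PureWindowsPath

planned = Path("out") / "sub" / "report.md"

# Compare Path objects or their parts, not strings containing "\\" or "/".
assert planned == Path("out", "sub", "report.md")
assert planned.parts[-2:] == ("sub", "report.md")
assert planned.as_posix() == "out/sub/report.md"

# Pure path classes let a test reason about Windows-style paths on any OS.
win = PureWindowsPath(r"C:\data\in\Sub\Report.PDF")
assert win.suffix.lower() == ".pdf"
assert win.relative_to(r"C:\data\in").parts == ("Sub", "Report.PDF")
```

Assertions written this way pass unchanged on Windows and POSIX systems, since none of them depends on the native separator.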

## Hard Failure Criteria

Sprint 2 fails and must stop for a user decision if any of these are true:

- Directory conversion descends recursively without explicit recursive intent.
- Existing planned outputs can be overwritten without explicit overwrite intent.
- Planned output paths can escape the requested output root.
- Default tests require MinerU, CUDA, GPU, model files, network, or `samples/`.
- The implementation parses PDF contents or invokes conversion behavior.
- The implementation introduces alternate engines or runtime engine selection.
- The implementation introduces `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- `samples/` is staged or committed.
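
The "escape the requested output root" criterion can be guarded with a containment check such as the one below. `is_within_root` is a hypothetical helper, not something this contract names; it is shown only to make the failure condition testable.

```python
from pathlib import Path


def is_within_root(candidate: Path, output_root: Path) -> bool:
    """True if candidate resolves to output_root or a location beneath it."""
    # resolve() collapses ".." segments and follows symlinks; it does not
    # require the paths to exist, so it is safe to call on planned outputs.
    root = output_root.resolve()
    target = candidate.resolve()
    return target == root or root in target.parents


# Usage: a planned path containing ".." is flagged before anything is written.
root = Path("out")
assert is_within_root(root / "sub" / "a.md", root)
assert not is_within_root(root / ".." / "elsewhere" / "a.md", root)
```

Resolving both sides before comparing means a path that merely starts with the same text as the root, such as `out-old/a.md`, is not mistaken for a child of `out`.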

## Acceptance Criteria

Sprint 2 is complete when:

- `src/pdf2md/paths.py` exists and owns input discovery plus output path planning.
- Single PDF discovery is tested.
- Non-recursive and recursive directory discovery are tested.
- Non-ASCII PDF filenames are tested with generated temporary files.
- Markdown, assets, metadata JSON, report Markdown, and optional raw output paths are tested.
- Existing-output conflict detection is tested with and without overwrite enabled.
- No conversion, MinerU, Markdown normalization, metadata content, report content, or doctor behavior is implemented.
- `uv sync` passes.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.

## Handoff Fields

Use these fields when Sprint 2 completes:

- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 3:
- Next action: