# Sprint 2 Contract: Paths, Input Discovery, And Overwrite Planning

Status: Completed

Last updated: 2026-05-07

## Objective

Implement deterministic input discovery and output path planning before any PDF conversion logic exists.

Sprint 2 must establish:

- A project-owned path planning module for local PDF inputs.
- Deterministic discovery for a single PDF, a directory, and optional recursive directory traversal.
- Deterministic planned output paths for Markdown, assets, metadata JSON, quality report, and optional raw MinerU output.
- Preflight overwrite conflict detection that prevents accidental replacement unless overwrite is explicitly allowed.
- Fast unit tests using generated temporary files, including non-ASCII filenames.

Sprint 2 is path planning only. It must not run MinerU, parse PDFs, write conversion outputs, normalize Markdown, create metadata content, or add the real `convert` command behavior.

## Current Precondition

Sprint 1 is complete:

- `uv` is installed per-user at `C:\Users\user\.local\bin`.
- `pyproject.toml`, `uv.lock`, the `pdf2md` package, CLI placeholder, and fast pytest loop exist.
- `uv sync` and `uv run pytest` passed.

If a new shell cannot find `uv`, prepend `C:\Users\user\.local\bin` to PATH for verification commands and record that in `PROGRESS.md`.

## Touched Surfaces

Allowed:

- `src/pdf2md/paths.py`
- `src/pdf2md/conversion.py` only for a minimal type boundary if path planning cannot be tested cleanly without it
- `src/pdf2md/cli.py` only if a minimal parser hook is needed for path-planning tests; do not expose working conversion behavior
- `tests/test_paths.py` or `tests/unit/test_paths.py`
- `tests/test_cli.py` only for path-planning parser coverage if `cli.py` changes
- `README.md` only if setup/test instructions need a small update
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
- `docs/Sprints/SPRINT2CONTRACT.md`

Not allowed:

- `src/pdf2md/mineru_adapter.py`
- `src/pdf2md/ir.py`
- `src/pdf2md/markdown.py`
- `src/pdf2md/metadata.py`
- `src/pdf2md/quality.py`
- `src/pdf2md/report.py`
- `src/pdf2md/doctor.py`
- `scripts/`
- Any real MinerU invocation
- Any model download or install script
- Any file parsing beyond local filesystem path and extension checks
- Any conversion output writing beyond temporary files created by tests
- Any committed file under `samples/`

## Expected Outputs

Sprint 2 should produce:

1. Input discovery
   - Accept a local path that is either a PDF file or a directory.
   - Treat `.pdf` extension matching as case-insensitive.
   - Reject a non-existent path with a clear project-owned error.
   - Reject a non-PDF file with a clear project-owned error.
   - Reject a directory with no discovered PDFs with a clear project-owned error.
   - Discover only direct child PDFs for directory input unless recursive traversal is requested.
   - Discover nested PDFs only when recursive traversal is requested.
   - Return discovered PDFs in a deterministic order.
2. Output path plan
   - For each discovered PDF, plan:
     - Markdown path: `//.md`.
     - Assets directory: `//.assets`.
     - Metadata path when metadata is enabled: `//.metadata.json`.
     - Quality report path: `//.report.md`.
     - Raw MinerU directory when raw output is kept: `//.raw`.
   - For a single PDF input, `relative-parent` is empty unless the implementation has a tested reason to preserve more context.
   - For recursive directory input, preserve the source-relative subdirectory under the output root to avoid filename collisions.
   - Keep planned paths local filesystem paths. Do not introduce URI, URL, cloud, or remote storage handling.
3. Overwrite preflight
   - Detect existing planned file or directory outputs before conversion writes occur.
   - Report all detected conflicts in one project-owned error instead of failing on the first conflict.
   - Allow conflicts only when overwrite is explicitly enabled.
   - Do not delete or replace files in Sprint 2.
4. Tests
   - Unit tests for single PDF discovery.
   - Unit tests for non-recursive directory discovery.
   - Unit tests for recursive directory discovery.
   - Unit tests for deterministic ordering.
   - Unit tests for non-ASCII filenames, including Korean filenames, using temporary files.
   - Unit tests for invalid input errors.
   - Unit tests for planned Markdown, assets, metadata, report, and raw output paths.
   - Unit tests for overwrite conflict detection.
5. Handoff
   - `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.

## Non-Goals

- Do not implement PDF conversion.
- Do not implement conversion orchestration.
- Do not implement the MinerU adapter.
- Do not run MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not parse PDF contents.
- Do not compute source SHA-256.
- Do not implement Markdown normalization.
- Do not implement metadata JSON content.
- Do not implement `.report.md` content.
- Do not implement `pdf2md convert` as a working command.
- Do not implement `pdf2md doctor`.
- Do not add runtime engine selection.
- Do not add alternate conversion engines.
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.

## Work Packages

### WP2.1: Path Planning Types And Errors

Owner:

- `feature-generator-agent`

Actions:

- Add the smallest project-owned types needed to represent discovered inputs, planned outputs, and overwrite conflicts.
- Add clear project-owned exceptions or error result types for invalid inputs and conflicts.
- Avoid public API promises beyond what Sprint 2 tests verify.

Output:

- Path planning can be tested without converter execution.

### WP2.2: Input Discovery

Owner:

- `feature-generator-agent`

Actions:

- Implement single PDF and directory discovery.
- Require explicit recursive mode for subdirectory traversal.
- Sort results deterministically.
- Preserve local `Path` objects rather than converting to strings early.

Output:

- Discovery behavior matches PRD directory and recursive requirements.

### WP2.3: Output Planning

Owner:

- `feature-generator-agent`

Actions:

- Plan Markdown, assets, metadata, report, and optional raw output paths.
- Preserve relative subdirectories for recursive directory input.
- Keep all planned outputs under the requested output root.

Output:

- Later conversion code can write outputs without rediscovering naming rules.

### WP2.4: Overwrite Conflict Detection

Owner:

- `feature-generator-agent`

Actions:

- Check whether any planned output already exists.
- Return or raise a structured conflict list when overwrite is not enabled.
- Permit the plan when overwrite is enabled without deleting anything.

Output:

- Existing user files are protected before conversion starts.

### WP2.5: Independent Evaluation

Owner:

- `evaluation-agent`

Actions:

- Review the completed path planning implementation against this contract.
- Verify no conversion behavior, MinerU execution, remote runtime path, or alternate engine was added.
- Verify `samples/` remains untracked and unstaged.
- Verify tests use temporary files, not committed sample PDFs.

Output:

- PASS/FAIL notes with any missing acceptance criteria.

## Verification Checks

Required:

- `git status --short` before staging confirms `samples/` remains untracked.
- `uv --version` is run and result is recorded.
- `uv sync` passes.
- `uv run pytest` passes.
- Targeted path planning tests pass.
- Tests do not require MinerU, CUDA, GPU, model files, `samples/`, or network.
- No real MinerU dependency is required for default tests.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion behavior is implemented.
- No output files are written outside temporary test directories.
- `git diff --check` passes.

Recommended:

- Use temporary directories for all filesystem tests.
- Include Windows-relevant path behavior without hard-coding Windows-only separators in assertions.
- Use `requirements-guard-agent` if path planning reveals a contradiction in PRD or architecture wording.

## Hard Failure Criteria

Sprint 2 fails and must stop for a user decision if any of these are true:

- Directory conversion descends recursively without explicit recursive intent.
- Existing planned outputs can be overwritten without explicit overwrite intent.
- Planned output paths can escape the requested output root.
- Default tests require MinerU, CUDA, GPU, model files, network, or `samples/`.
- The implementation parses PDF contents or invokes conversion behavior.
- The implementation introduces alternate engines or runtime engine selection.
- The implementation introduces `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- `samples/` is staged or committed.

## Acceptance Criteria

Sprint 2 is complete when:

- `src/pdf2md/paths.py` exists and owns input discovery plus output path planning.
- Single PDF discovery is tested.
- Non-recursive and recursive directory discovery are tested.
- Non-ASCII PDF filenames are tested with generated temporary files.
- Markdown, assets, metadata JSON, report Markdown, and optional raw output paths are tested.
- Existing-output conflict detection is tested with and without overwrite enabled.
- No conversion, MinerU, Markdown normalization, metadata content, report content, or doctor behavior is implemented.
- `uv sync` passes.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.

## Handoff Fields

Use these fields when Sprint 2 completes:

- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 3:
- Next action:
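
For orientation, the discovery, output planning, and overwrite preflight rules in this contract could be sketched roughly as below. This is an illustrative sketch, not the contents of `src/pdf2md/paths.py`. Every name in it (`PathPlanningError`, `OverwriteConflictError`, `PlannedOutput`, `discover_pdfs`, `plan_outputs`, `check_overwrite`) is hypothetical. It also assumes that outputs are named after the source PDF's stem and placed under the output root plus the source-relative parent directory, which is one reading of the path templates under Expected Outputs.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


class PathPlanningError(Exception):
    """Hypothetical project-owned error for invalid inputs."""


class OverwriteConflictError(PathPlanningError):
    """Hypothetical error that carries every conflicting planned path."""

    def __init__(self, conflicts):
        self.conflicts = list(conflicts)
        joined = ", ".join(str(p) for p in self.conflicts)
        super().__init__(f"planned outputs already exist: {joined}")


@dataclass(frozen=True)
class PlannedOutput:
    source: Path
    markdown: Path
    assets: Path
    report: Path
    metadata: Optional[Path]  # None when metadata is disabled
    raw: Optional[Path]       # None unless raw MinerU output is kept


def _is_pdf(path: Path) -> bool:
    # Extension check only; PDF contents are never opened.
    return path.is_file() and path.suffix.lower() == ".pdf"


def discover_pdfs(input_path: Path, recursive: bool = False) -> list:
    if not input_path.exists():
        raise PathPlanningError(f"input path does not exist: {input_path}")
    if input_path.is_file():
        if not _is_pdf(input_path):
            raise PathPlanningError(f"not a PDF file: {input_path}")
        return [input_path]
    # Nested directories are visited only on explicit request.
    candidates = input_path.rglob("*") if recursive else input_path.iterdir()
    # Sort by source-relative POSIX path so the order is reproducible.
    pdfs = sorted(
        (p for p in candidates if _is_pdf(p)),
        key=lambda p: p.relative_to(input_path).as_posix(),
    )
    if not pdfs:
        raise PathPlanningError(f"no PDFs found in directory: {input_path}")
    return pdfs


def plan_outputs(input_path, pdfs, output_root, metadata=True, keep_raw=False):
    plans = []
    for pdf in pdfs:
        # Single-file input has an empty relative parent (assumed reading).
        rel = pdf.parent.relative_to(input_path) if input_path.is_dir() else Path()
        base = output_root / rel
        stem = pdf.stem  # assumption: outputs are named after the PDF stem
        plans.append(PlannedOutput(
            source=pdf,
            markdown=base / f"{stem}.md",
            assets=base / f"{stem}.assets",
            report=base / f"{stem}.report.md",
            metadata=base / f"{stem}.metadata.json" if metadata else None,
            raw=base / f"{stem}.raw" if keep_raw else None,
        ))
    return plans


def check_overwrite(plans, overwrite=False):
    # Collect every existing planned file or directory before reporting.
    conflicts = [
        path
        for plan in plans
        for path in (plan.markdown, plan.assets, plan.report, plan.metadata, plan.raw)
        if path is not None and path.exists()
    ]
    if conflicts and not overwrite:
        raise OverwriteConflictError(conflicts)
    return conflicts  # nothing is deleted or replaced here
```

Because the relative parent is always derived from a path found under the input directory, planned outputs in this sketch stay under the requested output root. The sketch does not handle two PDFs in one directory whose names differ only by extension case, which would plan identical outputs on a case-sensitive filesystem.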