PDFToMD/docs/Sprints/SPRINT2CONTRACT.md
2026-05-08 16:42:19 +09:00


Sprint 2 Contract: Paths, Input Discovery, And Overwrite Planning

Status: Completed
Last updated: 2026-05-07

Objective

Implement deterministic input discovery and output path planning before any PDF conversion logic exists.

Sprint 2 must establish:

  • A project-owned path planning module for local PDF inputs.
  • Deterministic discovery for a single PDF, a directory, and optional recursive directory traversal.
  • Deterministic planned output paths for Markdown, assets, metadata JSON, quality report, and optional raw MinerU output.
  • Preflight overwrite conflict detection that prevents accidental replacement unless overwrite is explicitly allowed.
  • Fast unit tests using generated temporary files, including non-ASCII filenames.

Sprint 2 is path planning only. It must not run MinerU, parse PDFs, write conversion outputs, normalize Markdown, create metadata content, or add the real convert command behavior.

Current Precondition

Sprint 1 is complete:

  • uv is installed per-user at C:\Users\user\.local\bin.
  • pyproject.toml, uv.lock, the pdf2md package, CLI placeholder, and fast pytest loop exist.
  • uv sync and uv run pytest passed.

If a new shell cannot find uv, prepend C:\Users\user\.local\bin to PATH for verification commands and record that in PROGRESS.md.

Touched Surfaces

Allowed:

  • src/pdf2md/paths.py
  • src/pdf2md/conversion.py only for a minimal type boundary if path planning cannot be tested cleanly without it
  • src/pdf2md/cli.py only if a minimal parser hook is needed for path-planning tests; do not expose working conversion behavior
  • tests/test_paths.py or tests/unit/test_paths.py
  • tests/test_cli.py only for path-planning parser coverage if cli.py changes
  • README.md only if setup/test instructions need a small update
  • PLAN.md only for current-goal coordination updates required by the shared agent workflow
  • PROGRESS.md
  • docs/V1IMPLEMENTATIONPLAN.md only if sequencing or constraints need adjustment
  • docs/Sprints/SPRINT2CONTRACT.md

Not allowed:

  • src/pdf2md/mineru_adapter.py
  • src/pdf2md/ir.py
  • src/pdf2md/markdown.py
  • src/pdf2md/metadata.py
  • src/pdf2md/quality.py
  • src/pdf2md/report.py
  • src/pdf2md/doctor.py
  • scripts/
  • Any real MinerU invocation
  • Any model download or install script
  • Any file parsing beyond local filesystem path and extension checks
  • Any conversion output writing beyond temporary files created by tests
  • Any committed file under samples/

Expected Outputs

Sprint 2 should produce:

  1. Input discovery

    • Accept a local path that is either a PDF file or a directory.
    • Treat .pdf extension matching as case-insensitive.
    • Reject a non-existent path with a clear project-owned error.
    • Reject a non-PDF file with a clear project-owned error.
    • Reject a directory with no discovered PDFs with a clear project-owned error.
    • Discover only direct child PDFs for directory input unless recursive traversal is requested.
    • Discover nested PDFs only when recursive traversal is requested.
    • Return discovered PDFs in a deterministic order.
  2. Output path plan

    • For each discovered PDF, plan:
      • Markdown path: <output-root>/<relative-parent>/<stem>.md.
      • Assets directory: <output-root>/<relative-parent>/<stem>.assets.
      • Metadata path when metadata is enabled: <output-root>/<relative-parent>/<stem>.metadata.json.
      • Quality report path: <output-root>/<relative-parent>/<stem>.report.md.
      • Raw MinerU directory when raw output is kept: <output-root>/<relative-parent>/<stem>.raw.
    • For a single PDF input, relative-parent is empty unless the implementation has a tested reason to preserve more context.
    • For recursive directory input, preserve the source-relative subdirectory under the output root to avoid filename collisions.
    • Keep planned paths local filesystem paths. Do not introduce URI, URL, cloud, or remote storage handling.
  3. Overwrite preflight

    • Detect existing planned file or directory outputs before conversion writes occur.
    • Report all detected conflicts in one project-owned error instead of failing on the first conflict.
    • Allow conflicts only when overwrite is explicitly enabled.
    • Do not delete or replace files in Sprint 2.
  4. Tests

    • Unit tests for single PDF discovery.
    • Unit tests for non-recursive directory discovery.
    • Unit tests for recursive directory discovery.
    • Unit tests for deterministic ordering.
    • Unit tests for non-ASCII filenames, including Korean filenames, using temporary files.
    • Unit tests for invalid input errors.
    • Unit tests for planned Markdown, assets, metadata, report, and raw output paths.
    • Unit tests for overwrite conflict detection.
  5. Handoff

    • PROGRESS.md records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
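
The test expectations in item 4 can be illustrated with a small pytest-style sketch. It shows one possible way to generate placeholder PDFs with a Korean filename inside a temporary directory, without touching samples/. The helper make_placeholder_pdf and the stand-in list_direct_pdfs are hypothetical names used only for illustration; the real function under test would live in src/pdf2md/paths.py. Names are compared after NFC normalization on the assumption that some filesystems may return decomposed Hangul.

```python
import unicodedata
from pathlib import Path


def make_placeholder_pdf(root: Path, relative_name: str) -> Path:
    """Create a tiny placeholder file; its bytes are never parsed in Sprint 2."""
    path = root / relative_name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(b"%PDF-1.4\n")
    return path


def list_direct_pdfs(directory: Path) -> list[Path]:
    """Hypothetical stand-in for the discovery function under test."""
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def test_korean_filename_is_discovered(tmp_path: Path) -> None:
    # tmp_path is pytest's built-in temporary directory fixture.
    make_placeholder_pdf(tmp_path, "보고서.pdf")
    make_placeholder_pdf(tmp_path, "nested/ignored.pdf")  # must not be found

    found = list_direct_pdfs(tmp_path)

    # Compare names, not full path strings, so the assertion does not depend
    # on the platform's path separator.
    names = [unicodedata.normalize("NFC", p.name) for p in found]
    assert names == ["보고서.pdf"]
```

Under pytest the function would be collected automatically; tmp_path keeps every generated file inside a per-test temporary directory.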

Non-Goals

  • Do not implement PDF conversion.
  • Do not implement conversion orchestration.
  • Do not implement the MinerU adapter.
  • Do not run MinerU.
  • Do not install MinerU 3.1.0.
  • Do not download MinerU models.
  • Do not parse PDF contents.
  • Do not compute source SHA-256.
  • Do not implement Markdown normalization.
  • Do not implement metadata JSON content.
  • Do not implement .report.md content.
  • Do not implement pdf2md convert as a working command.
  • Do not implement pdf2md doctor.
  • Do not add runtime engine selection.
  • Do not add alternate conversion engines.
  • Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.

Work Packages

WP2.1: Path Planning Types And Errors

Owner:

  • feature-generator-agent

Actions:

  • Add the smallest project-owned types needed to represent discovered inputs, planned outputs, and overwrite conflicts.
  • Add clear project-owned exceptions or error result types for invalid inputs and conflicts.
  • Avoid public API promises beyond what Sprint 2 tests verify.

Output:

  • Path planning can be tested without converter execution.
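
A minimal sketch of what such types could look like, assuming frozen dataclasses and a small exception hierarchy. Every name here (PathPlanningError, InvalidInputError, OverwriteConflictError, DiscoveredPdf, PlannedOutput) is hypothetical and not taken from the actual implementation.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


class PathPlanningError(Exception):
    """Hypothetical base class for project-owned path planning errors."""


class InvalidInputError(PathPlanningError):
    """Input path is missing, is not a PDF, or is a directory with no PDFs."""


class OverwriteConflictError(PathPlanningError):
    """Carries every conflicting path so they can be reported together."""

    def __init__(self, conflicts: list[Path]) -> None:
        self.conflicts = list(conflicts)
        listing = ", ".join(str(p) for p in self.conflicts)
        super().__init__(
            f"{len(self.conflicts)} planned output(s) already exist: {listing}"
        )


@dataclass(frozen=True)
class DiscoveredPdf:
    source: Path           # the PDF as found on the local filesystem
    relative_parent: Path  # subdirectory under the input root; Path() if none


@dataclass(frozen=True)
class PlannedOutput:
    source: Path
    markdown: Path
    assets_dir: Path
    report: Path
    metadata: Optional[Path] = None  # None when metadata is disabled
    raw_dir: Optional[Path] = None   # None unless raw MinerU output is kept
```

Frozen dataclasses keep a plan immutable once built, which may make it easier for later sprints to pass plans around without re-validating them.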

WP2.2: Input Discovery

Owner:

  • feature-generator-agent

Actions:

  • Implement single PDF and directory discovery.
  • Require explicit recursive mode for subdirectory traversal.
  • Sort results deterministically.
  • Preserve local Path objects rather than converting to strings early.

Output:

  • Discovery behavior matches PRD directory and recursive requirements.
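
The discovery rules above could be sketched roughly as follows. The function name discover_pdfs, its recursive parameter, and the InvalidInputError class are assumptions for illustration; sorting by the POSIX form of the source-relative path is one possible deterministic order, not necessarily the one the project chose.

```python
from pathlib import Path


class InvalidInputError(Exception):
    """Hypothetical project-owned error for unusable inputs."""


def _is_pdf(path: Path) -> bool:
    # Extension check only; file contents are never opened or parsed.
    return path.is_file() and path.suffix.lower() == ".pdf"


def discover_pdfs(input_path: Path, recursive: bool = False) -> list[Path]:
    """Return PDFs under input_path in a deterministic order."""
    if not input_path.exists():
        raise InvalidInputError(f"Input path does not exist: {input_path}")

    if input_path.is_file():
        if not _is_pdf(input_path):
            raise InvalidInputError(f"Input file is not a PDF: {input_path}")
        return [input_path]

    # Directory input: descend only when recursion is explicitly requested.
    candidates = input_path.rglob("*") if recursive else input_path.iterdir()
    found = [p for p in candidates if _is_pdf(p)]
    if not found:
        raise InvalidInputError(f"No PDF files found in: {input_path}")

    # Sort on the relative POSIX form so the order does not depend on the
    # platform separator or on filesystem enumeration order.
    return sorted(found, key=lambda p: p.relative_to(input_path).as_posix())
```

Results stay as Path objects throughout; strings are used only as sort keys.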

WP2.3: Output Planning

Owner:

  • feature-generator-agent

Actions:

  • Plan Markdown, assets, metadata, report, and optional raw output paths.
  • Preserve relative subdirectories for recursive directory input.
  • Keep all planned outputs under the requested output root.

Output:

  • Later conversion code can write outputs without rediscovering naming rules.
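
As an illustration of the naming rules, the sketch below derives every planned path from the source stem and the source-relative parent directory. The names plan_output, PlannedOutput, metadata, and keep_raw are hypothetical, and enabling metadata by default is an assumption; the contract only fixes the resulting path shapes.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class PlannedOutput:
    markdown: Path
    assets_dir: Path
    report: Path
    metadata: Optional[Path]
    raw_dir: Optional[Path]


def plan_output(
    source: Path,
    input_root: Path,
    output_root: Path,
    *,
    metadata: bool = True,   # assumed default, not stated in the contract
    keep_raw: bool = False,
) -> PlannedOutput:
    """Plan output paths for one PDF without touching the filesystem."""
    # Single-file input: the source is the input root, so no subdirectory.
    # Directory input: keep the source-relative parent to avoid collisions.
    if source == input_root:
        relative_parent = Path()
    else:
        relative_parent = source.parent.relative_to(input_root)

    base = output_root / relative_parent
    stem = source.stem  # "report.v2.pdf" -> "report.v2"

    return PlannedOutput(
        markdown=base / f"{stem}.md",
        assets_dir=base / f"{stem}.assets",
        report=base / f"{stem}.report.md",
        metadata=base / f"{stem}.metadata.json" if metadata else None,
        raw_dir=base / f"{stem}.raw" if keep_raw else None,
    )
```

Building names with f-strings rather than Path.with_suffix avoids truncating stems that themselves contain dots.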

WP2.4: Overwrite Conflict Detection

Owner:

  • feature-generator-agent

Actions:

  • Check whether any planned output already exists.
  • Return or raise a structured conflict list when overwrite is not enabled.
  • Permit the plan when overwrite is enabled without deleting anything.

Output:

  • Existing user files are protected before conversion starts.
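
One way the preflight could behave, sketched under the assumption that conflicts are raised as a single exception carrying the full list. check_overwrite and OverwriteConflictError are hypothetical names; the function only inspects the filesystem and never deletes or writes anything.

```python
from pathlib import Path
from typing import Iterable


class OverwriteConflictError(Exception):
    """Hypothetical error that reports every conflicting path at once."""

    def __init__(self, conflicts: list[Path]) -> None:
        self.conflicts = list(conflicts)
        listing = ", ".join(str(p) for p in self.conflicts)
        super().__init__(f"Planned outputs already exist: {listing}")


def check_overwrite(
    planned_paths: Iterable[Path], overwrite: bool = False
) -> list[Path]:
    """Return existing planned paths; raise if any exist and overwrite is off."""
    # Collect all conflicts first instead of stopping at the first one.
    # Path.exists() is true for both files and directories.
    conflicts = sorted(p for p in planned_paths if p.exists())
    if conflicts and not overwrite:
        raise OverwriteConflictError(conflicts)
    # With overwrite enabled the conflicts are only reported, never removed.
    return conflicts
```

A caller could pass the Markdown, assets, metadata, report, and raw paths of every planned document in one call, so the user sees the complete conflict list before any conversion starts.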

WP2.5: Independent Evaluation

Owner:

  • evaluation-agent

Actions:

  • Review the completed path planning implementation against this contract.
  • Verify no conversion behavior, MinerU execution, remote runtime path, or alternate engine was added.
  • Verify samples/ remains untracked and unstaged.
  • Verify tests use temporary files, not committed sample PDFs.

Output:

  • PASS/FAIL notes with any missing acceptance criteria.

Verification Checks

Required:

  • git status --short before staging confirms samples/ remains untracked.
  • uv --version is run and the result is recorded.
  • uv sync passes.
  • uv run pytest passes.
  • Targeted path planning tests pass.
  • Tests do not require MinerU, CUDA, GPU, model files, samples/, or network.
  • No real MinerU dependency is required for default tests.
  • No model downloads occur.
  • No network calls are required.
  • No candidate engine comparison is reintroduced.
  • No conversion behavior is implemented.
  • No output files are written outside temporary test directories.
  • git diff --check passes.

Recommended:

  • Use temporary directories for all filesystem tests.
  • Include Windows-relevant path behavior without hard-coding Windows-only separators in assertions.
  • Use requirements-guard-agent if path planning reveals a contradiction in PRD or architecture wording.

Hard Failure Criteria

Sprint 2 fails and must stop for a user decision if any of these are true:

  • Directory discovery descends recursively without explicit recursive intent.
  • Existing planned outputs can be overwritten without explicit overwrite intent.
  • Planned output paths can escape the requested output root.
  • Default tests require MinerU, CUDA, GPU, model files, network, or samples/.
  • The implementation parses PDF contents or invokes conversion behavior.
  • The implementation introduces alternate engines or runtime engine selection.
  • The implementation introduces --api-url, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
  • samples/ is staged or committed.
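
The output-root criterion above could be guarded by a containment check along these lines. This is only a sketch: is_within_root is a hypothetical helper, and resolving both paths before comparing is one possible approach to catching ".." segments.

```python
from pathlib import Path


def is_within_root(candidate: Path, output_root: Path) -> bool:
    """Return True if candidate resolves to output_root or a path below it."""
    # resolve() makes both paths absolute and collapses ".." segments; with
    # the default strict=False it also accepts paths that do not exist yet.
    root = output_root.resolve()
    target = candidate.resolve()
    return target == root or root in target.parents
```

For example, is_within_root(Path("out/sub/doc.md"), Path("out")) holds, while a planned path such as Path("out/../secret.md") would be rejected.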

Acceptance Criteria

Sprint 2 is complete when:

  • src/pdf2md/paths.py exists and owns input discovery plus output path planning.
  • Single PDF discovery is tested.
  • Non-recursive and recursive directory discovery are tested.
  • Non-ASCII PDF filenames are tested with generated temporary files.
  • Markdown, assets, metadata JSON, report Markdown, and optional raw output paths are tested.
  • Existing-output conflict detection is tested with and without overwrite enabled.
  • No conversion, MinerU, Markdown normalization, metadata content, report content, or doctor behavior is implemented.
  • uv sync passes.
  • uv run pytest passes.
  • PROGRESS.md records checks performed and residual risks.
  • Independent evaluation is complete.
  • The completed change is committed.

Handoff Fields

Use these fields when Sprint 2 completes:

  • Files changed:
  • Commands run:
  • Tests passed:
  • Tests blocked:
  • Known failures:
  • Residual risks:
  • User decisions needed:
  • Go/no-go recommendation for Sprint 3:
  • Next action: