Sprint 2 Contract: Paths, Input Discovery, And Overwrite Planning
Status: Completed
Last updated: 2026-05-07
Objective
Implement deterministic input discovery and output path planning before any PDF conversion logic exists.
Sprint 2 must establish:
- A project-owned path planning module for local PDF inputs.
- Deterministic discovery for a single PDF, a directory, and optional recursive directory traversal.
- Deterministic planned output paths for Markdown, assets, metadata JSON, quality report, and optional raw MinerU output.
- Preflight overwrite conflict detection that prevents accidental replacement unless overwrite is explicitly allowed.
- Fast unit tests using generated temporary files, including non-ASCII filenames.
Sprint 2 is path planning only. It must not run MinerU, parse PDFs, write conversion outputs, normalize Markdown, create metadata content, or add the real convert command behavior.
Current Precondition
Sprint 1 is complete:
- uv is installed per-user at C:\Users\user\.local\bin.
- pyproject.toml, uv.lock, the pdf2md package, CLI placeholder, and fast pytest loop exist.
- uv sync and uv run pytest passed.
If a new shell cannot find uv, prepend C:\Users\user\.local\bin to PATH for verification commands and record that in PROGRESS.md.
Touched Surfaces
Allowed:
- src/pdf2md/paths.py
- src/pdf2md/conversion.py only for a minimal type boundary if path planning cannot be tested cleanly without it
- src/pdf2md/cli.py only if a minimal parser hook is needed for path-planning tests; do not expose working conversion behavior
- tests/test_paths.py or tests/unit/test_paths.py
- tests/test_cli.py only for path-planning parser coverage if cli.py changes
- README.md only if setup/test instructions need a small update
- PLAN.md only for current-goal coordination updates required by the shared agent workflow
- PROGRESS.md
- docs/V1IMPLEMENTATIONPLAN.md only if sequencing or constraints need adjustment
- docs/Sprints/SPRINT2CONTRACT.md
Not allowed:
- src/pdf2md/mineru_adapter.py
- src/pdf2md/ir.py
- src/pdf2md/markdown.py
- src/pdf2md/metadata.py
- src/pdf2md/quality.py
- src/pdf2md/report.py
- src/pdf2md/doctor.py
- scripts/
- Any real MinerU invocation
- Any model download or install script
- Any file parsing beyond local filesystem path and extension checks
- Any conversion output writing beyond temporary files created by tests
- Any committed file under samples/
Expected Outputs
Sprint 2 should produce:
- Input discovery
  - Accept a local path that is either a PDF file or a directory.
  - Treat .pdf extension matching as case-insensitive.
  - Reject a non-existent path with a clear project-owned error.
  - Reject a non-PDF file with a clear project-owned error.
  - Reject a directory with no discovered PDFs with a clear project-owned error.
  - Discover only direct child PDFs for directory input unless recursive traversal is requested.
  - Discover nested PDFs only when recursive traversal is requested.
  - Return discovered PDFs in a deterministic order.
- Output path plan
  - For each discovered PDF, plan:
    - Markdown path: <output-root>/<relative-parent>/<stem>.md.
    - Assets directory: <output-root>/<relative-parent>/<stem>.assets.
    - Metadata path when metadata is enabled: <output-root>/<relative-parent>/<stem>.metadata.json.
    - Quality report path: <output-root>/<relative-parent>/<stem>.report.md.
    - Raw MinerU directory when raw output is kept: <output-root>/<relative-parent>/<stem>.raw.
  - For a single PDF input, relative-parent is empty unless the implementation has a tested reason to preserve more context.
  - For recursive directory input, preserve the source-relative subdirectory under the output root to avoid filename collisions.
  - Keep planned paths local filesystem paths. Do not introduce URI, URL, cloud, or remote storage handling.
- Overwrite preflight
  - Detect existing planned file or directory outputs before conversion writes occur.
  - Report all detected conflicts in one project-owned error instead of failing on the first conflict.
  - Allow conflicts only when overwrite is explicitly enabled.
  - Do not delete or replace files in Sprint 2.
- Tests
  - Unit tests for single PDF discovery.
  - Unit tests for non-recursive directory discovery.
  - Unit tests for recursive directory discovery.
  - Unit tests for deterministic ordering.
  - Unit tests for non-ASCII filenames, including Korean filenames, using temporary files.
  - Unit tests for invalid input errors.
  - Unit tests for planned Markdown, assets, metadata, report, and raw output paths.
  - Unit tests for overwrite conflict detection.
- Handoff
  - PROGRESS.md records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
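The test expectations above rely on generated temporary files rather than anything under samples/. A minimal sketch of that pattern follows, using only the standard library. The helper name make_pdf and the test name are hypothetical and not taken from the project; the placeholder files are empty because path planning never reads PDF contents.

```python
import tempfile
from pathlib import Path


def make_pdf(directory: Path, name: str) -> Path:
    """Create an empty placeholder file (hypothetical helper).

    Only the path and extension matter for path planning tests.
    """
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / name
    path.write_bytes(b"")
    return path


def test_korean_filename_is_planned_by_stem() -> None:
    # Everything lives in a temporary directory that is removed afterwards.
    with tempfile.TemporaryDirectory() as tmp:
        pdf = make_pdf(Path(tmp) / "입력", "보고서.PDF")
        assert pdf.exists()
        # Extension matching is case-insensitive.
        assert pdf.suffix.lower() == ".pdf"
        # The stem keeps the Korean characters intact.
        assert pdf.stem == "보고서"
        assert pdf.with_suffix(".md").name == "보고서.md"


test_korean_filename_is_planned_by_stem()
```

Under pytest, the tmp_path fixture could replace the explicit tempfile usage.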
Non-Goals
- Do not implement PDF conversion.
- Do not implement conversion orchestration.
- Do not implement the MinerU adapter.
- Do not run MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not parse PDF contents.
- Do not compute source SHA-256.
- Do not implement Markdown normalization.
- Do not implement metadata JSON content.
- Do not implement .report.md content.
- Do not implement pdf2md convert as a working command.
- Do not implement pdf2md doctor.
- Do not add runtime engine selection.
- Do not add alternate conversion engines.
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.
Work Packages
WP2.1: Path Planning Types And Errors
Owner: feature-generator-agent
Actions:
- Add the smallest project-owned types needed to represent discovered inputs, planned outputs, and overwrite conflicts.
- Add clear project-owned exceptions or error result types for invalid inputs and conflicts.
- Avoid public API promises beyond what Sprint 2 tests verify.
Output:
- Path planning can be tested without converter execution.
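One possible shape for these types, as a sketch only. Every class, field, and method name here (PathPlanningError, InvalidInputError, PlannedOutput, OverwriteConflictError, targets) is an assumption for illustration; the contract does not fix any names.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Optional


class PathPlanningError(Exception):
    """Base class for project-owned path planning errors (hypothetical name)."""


class InvalidInputError(PathPlanningError):
    """Missing path, non-PDF file, or directory without PDFs."""


@dataclass(frozen=True)
class PlannedOutput:
    """Planned output locations for one discovered PDF."""

    source: Path
    markdown: Path
    assets_dir: Path
    report: Path
    metadata: Optional[Path] = None  # None when metadata is disabled
    raw_dir: Optional[Path] = None  # None when raw output is not kept

    def targets(self) -> list[Path]:
        """Return every planned path that later conversion code would write."""
        optional = [p for p in (self.metadata, self.raw_dir) if p is not None]
        return [self.markdown, self.assets_dir, self.report, *optional]


class OverwriteConflictError(PathPlanningError):
    """Carries every conflicting path so they can be reported together."""

    def __init__(self, conflicts: list[Path]) -> None:
        self.conflicts = list(conflicts)
        joined = ", ".join(str(p) for p in self.conflicts)
        super().__init__(
            f"{len(self.conflicts)} planned output(s) already exist: {joined}"
        )
```

A frozen dataclass keeps a plan immutable once built, which suits a planning step that must not write anything.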
WP2.2: Input Discovery
Owner: feature-generator-agent
Actions:
- Implement single PDF and directory discovery.
- Require explicit recursive mode for subdirectory traversal.
- Sort results deterministically.
- Preserve local Path objects rather than converting to strings early.
Output:
- Discovery behavior matches PRD directory and recursive requirements.
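The discovery rules could be sketched roughly as follows. The function name discover_pdfs, its recursive parameter, and the InvalidInputError class are assumptions for illustration, not the project's actual API. The sort key is one reasonable choice for deterministic ordering; the contract does not prescribe a specific order.

```python
from __future__ import annotations

from pathlib import Path


class InvalidInputError(Exception):
    """Hypothetical project-owned error for unusable inputs."""


def _is_pdf(path: Path) -> bool:
    # Case-insensitive extension check; file contents are never read.
    return path.is_file() and path.suffix.lower() == ".pdf"


def discover_pdfs(input_path: Path, recursive: bool = False) -> list[Path]:
    """Return discovered PDFs as Path objects in a stable order."""
    if not input_path.exists():
        raise InvalidInputError(f"Input path does not exist: {input_path}")
    if input_path.is_file():
        if not _is_pdf(input_path):
            raise InvalidInputError(f"Input file is not a PDF: {input_path}")
        return [input_path]
    # rglob descends into subdirectories; glob("*") stays at the top level.
    candidates = input_path.rglob("*") if recursive else input_path.glob("*")
    # Sort on relative path parts so the result does not depend on the
    # order in which the operating system lists directory entries.
    found = sorted(
        (p for p in candidates if _is_pdf(p)),
        key=lambda p: p.relative_to(input_path).parts,
    )
    if not found:
        raise InvalidInputError(f"No PDF files found in directory: {input_path}")
    return found
```

Recursion is opt-in through the recursive flag, so a plain directory input never descends into subdirectories.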
WP2.3: Output Planning
Owner: feature-generator-agent
Actions:
- Plan Markdown, assets, metadata, report, and optional raw output paths.
- Preserve relative subdirectories for recursive directory input.
- Keep all planned outputs under the requested output root.
Output:
- Later conversion code can write outputs without rediscovering naming rules.
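The naming rules can be illustrated with a short sketch. The function plan_output_paths, its parameters, and the dictionary keys are hypothetical; only the suffixes (.md, .assets, .metadata.json, .report.md, .raw) come from this contract.

```python
from __future__ import annotations

from pathlib import Path


def plan_output_paths(
    pdf: Path,
    input_root: Path,
    output_root: Path,
    metadata: bool = True,
    keep_raw: bool = False,
) -> dict[str, Path]:
    """Map one discovered PDF to its planned output paths."""
    # For a single-file input, pass the file's parent as input_root so the
    # relative parent is empty; for directory input, pass the directory.
    relative_parent = pdf.parent.relative_to(input_root)
    base = output_root / relative_parent
    stem = pdf.stem
    planned = {
        "markdown": base / f"{stem}.md",
        "assets": base / f"{stem}.assets",
        "report": base / f"{stem}.report.md",
    }
    if metadata:
        planned["metadata"] = base / f"{stem}.metadata.json"
    if keep_raw:
        planned["raw"] = base / f"{stem}.raw"
    return planned
```

For example, planning in/sub/doc.pdf with input root in and output root out yields out/sub/doc.md, which keeps same-named PDFs in different subdirectories from colliding.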
WP2.4: Overwrite Conflict Detection
Owner: feature-generator-agent
Actions:
- Check whether any planned output already exists.
- Return or raise a structured conflict list when overwrite is not enabled.
- Permit the plan when overwrite is enabled without deleting anything.
Output:
- Existing user files are protected before conversion starts.
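A minimal sketch of the preflight check, assuming hypothetical names (find_conflicts, check_overwrite, OverwriteConflictError). It collects every existing planned path before deciding, and it never deletes or replaces anything.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterable


class OverwriteConflictError(Exception):
    """Hypothetical project-owned error listing all conflicting paths."""

    def __init__(self, conflicts: list[Path]) -> None:
        self.conflicts = conflicts
        joined = ", ".join(str(p) for p in conflicts)
        super().__init__(f"Planned outputs already exist: {joined}")


def find_conflicts(planned: Iterable[Path]) -> list[Path]:
    """Return every planned file or directory that already exists, sorted."""
    return sorted(p for p in planned if p.exists())


def check_overwrite(planned: Iterable[Path], overwrite: bool = False) -> list[Path]:
    """Raise with the full conflict list unless overwrite is enabled.

    Nothing is deleted or replaced here; the check is read-only.
    """
    conflicts = find_conflicts(planned)
    if conflicts and not overwrite:
        raise OverwriteConflictError(conflicts)
    return conflicts
```

Returning the conflict list even when overwrite is enabled lets a later sprint decide how to report or handle the replaced paths.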
WP2.5: Independent Evaluation
Owner: evaluation-agent
Actions:
- Review the completed path planning implementation against this contract.
- Verify no conversion behavior, MinerU execution, remote runtime path, or alternate engine was added.
- Verify samples/ remains untracked and unstaged.
- Verify tests use temporary files, not committed sample PDFs.
Output:
- PASS/FAIL notes with any missing acceptance criteria.
Verification Checks
Required:
- git status --short before staging confirms samples/ remains untracked.
- uv --version is run and result is recorded.
- uv sync passes.
- uv run pytest passes.
- Targeted path planning tests pass.
- Tests do not require MinerU, CUDA, GPU, model files, samples/, or network.
- No real MinerU dependency is required for default tests.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion behavior is implemented.
- No output files are written outside temporary test directories.
- git diff --check passes.
Recommended:
- Use temporary directories for all filesystem tests.
- Include Windows-relevant path behavior without hard-coding Windows-only separators in assertions.
- Use requirements-guard-agent if path planning reveals a contradiction in PRD or architecture wording.
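The recommendation about Windows-relevant path behavior can be met by asserting on path objects or their parts instead of on strings containing a separator. The sketch below uses pathlib's pure path classes so both flavours can be checked on any platform. The helper planned_markdown is hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath


def planned_markdown(output_root, relative_parent: str, stem: str):
    """Build the Markdown path with the / operator, not string concatenation."""
    return output_root / relative_parent / f"{stem}.md"


for flavour in (PurePosixPath, PureWindowsPath):
    planned = planned_markdown(flavour("out"), "nested", "report")
    # Compare path objects or their parts; never compare against a string
    # that hard-codes "/" or "\\".
    assert planned == flavour("out", "nested", "report.md")
    assert planned.parts == ("out", "nested", "report.md")
```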
Hard Failure Criteria
Sprint 2 fails and must stop for a user decision if any of these are true:
- Directory conversion descends recursively without explicit recursive intent.
- Existing planned outputs can be overwritten without explicit overwrite intent.
- Planned output paths can escape the requested output root.
- Default tests require MinerU, CUDA, GPU, model files, network, or samples/.
- The implementation parses PDF contents or invokes conversion behavior.
- The implementation introduces alternate engines or runtime engine selection.
- The implementation introduces --api-url, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- samples/ is staged or committed.
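The criterion that planned paths must not escape the output root can be guarded with a containment check. This is a sketch under the assumption that resolving both paths is acceptable; the function name is_within_root is hypothetical.

```python
from pathlib import Path


def is_within_root(candidate: Path, output_root: Path) -> bool:
    """Return True when candidate resolves to output_root or a path below it."""
    # resolve() collapses ".." segments and works for paths that do not
    # exist yet, which is the normal case for planned outputs.
    resolved_root = output_root.resolve()
    resolved = candidate.resolve()
    return resolved == resolved_root or resolved_root in resolved.parents
```

A planner could run this check on every planned path and raise a project-owned error on the first False result.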
Acceptance Criteria
Sprint 2 is complete when:
- src/pdf2md/paths.py exists and owns input discovery plus output path planning.
- Single PDF discovery is tested.
- Non-recursive and recursive directory discovery are tested.
- Non-ASCII PDF filenames are tested with generated temporary files.
- Markdown, assets, metadata JSON, report Markdown, and optional raw output paths are tested.
- Existing-output conflict detection is tested with and without overwrite enabled.
- No conversion, MinerU, Markdown normalization, metadata content, report content, or doctor behavior is implemented.
- uv sync passes.
- uv run pytest passes.
- PROGRESS.md records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.
Handoff Fields
Use these fields when Sprint 2 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 3:
- Next action: