baram2584/PDFToMD


김경종 88d6b92283 add pdftomd

2026-05-08 16:42:19 +09:00


Sprint 2 Contract: Paths, Input Discovery, And Overwrite Planning

Status: Completed

Last updated: 2026-05-07

Objective

Implement deterministic input discovery and output path planning before any PDF conversion logic exists.

Sprint 2 must establish:

- A project-owned path planning module for local PDF inputs.
- Deterministic discovery for a single PDF, a directory, and optional recursive directory traversal.
- Deterministic planned output paths for Markdown, assets, metadata JSON, quality report, and optional raw MinerU output.
- Preflight overwrite conflict detection that prevents accidental replacement unless overwrite is explicitly allowed.
- Fast unit tests using generated temporary files, including non-ASCII filenames.

Sprint 2 is path planning only. It must not run MinerU, parse PDFs, write conversion outputs, normalize Markdown, create metadata content, or add the real convert command behavior.

Current Precondition

Sprint 1 is complete:

- uv is installed per-user at C:\Users\user\.local\bin.
- pyproject.toml, uv.lock, the pdf2md package, CLI placeholder, and fast pytest loop exist.
- uv sync and uv run pytest passed.

If a new shell cannot find uv, prepend C:\Users\user\.local\bin to PATH for verification commands and record that in PROGRESS.md.

Touched Surfaces

Allowed:

- src/pdf2md/paths.py
- src/pdf2md/conversion.py, only for a minimal type boundary if path planning cannot be tested cleanly without it
- src/pdf2md/cli.py, only if a minimal parser hook is needed for path-planning tests; do not expose working conversion behavior
- tests/test_paths.py or tests/unit/test_paths.py
- tests/test_cli.py, only for path-planning parser coverage if cli.py changes
- README.md, only if setup/test instructions need a small update
- PLAN.md, only for current-goal coordination updates required by the shared agent workflow
- PROGRESS.md
- docs/V1IMPLEMENTATIONPLAN.md, only if sequencing or constraints need adjustment
- docs/Sprints/SPRINT2CONTRACT.md

Not allowed:

- src/pdf2md/mineru_adapter.py
- src/pdf2md/ir.py
- src/pdf2md/markdown.py
- src/pdf2md/metadata.py
- src/pdf2md/quality.py
- src/pdf2md/report.py
- src/pdf2md/doctor.py
- scripts/
- Any real MinerU invocation
- Any model download or install script
- Any file parsing beyond local filesystem path and extension checks
- Any conversion output writing beyond temporary files created by tests
- Any committed file under samples/

Expected Outputs

Sprint 2 should produce:

1. Input discovery
   - Accept a local path that is either a PDF file or a directory.
   - Treat .pdf extension matching as case-insensitive.
   - Reject a non-existent path with a clear project-owned error.
   - Reject a non-PDF file with a clear project-owned error.
   - Reject a directory with no discovered PDFs with a clear project-owned error.
   - Discover only direct child PDFs for directory input unless recursive traversal is requested.
   - Discover nested PDFs only when recursive traversal is requested.
   - Return discovered PDFs in a deterministic order.
2. Output path plan
   - For each discovered PDF, plan:
     - Markdown path: <output-root>/<relative-parent>/<stem>.md.
     - Assets directory: <output-root>/<relative-parent>/<stem>.assets.
     - Metadata path when metadata is enabled: <output-root>/<relative-parent>/<stem>.metadata.json.
     - Quality report path: <output-root>/<relative-parent>/<stem>.report.md.
     - Raw MinerU directory when raw output is kept: <output-root>/<relative-parent>/<stem>.raw.
   - For a single PDF input, relative-parent is empty unless the implementation has a tested reason to preserve more context.
   - For recursive directory input, preserve the source-relative subdirectory under the output root to avoid filename collisions.
   - Keep planned paths local filesystem paths. Do not introduce URI, URL, cloud, or remote storage handling.
3. Overwrite preflight
   - Detect existing planned file or directory outputs before conversion writes occur.
   - Report all detected conflicts in one project-owned error instead of failing on the first conflict.
   - Allow conflicts only when overwrite is explicitly enabled.
   - Do not delete or replace files in Sprint 2.
4. Tests
   - Unit tests for single PDF discovery.
   - Unit tests for non-recursive directory discovery.
   - Unit tests for recursive directory discovery.
   - Unit tests for deterministic ordering.
   - Unit tests for non-ASCII filenames, including Korean filenames, using temporary files.
   - Unit tests for invalid input errors.
   - Unit tests for planned Markdown, assets, metadata, report, and raw output paths.
   - Unit tests for overwrite conflict detection.
5. Handoff
   - PROGRESS.md records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.

Non-Goals

- Do not implement PDF conversion.
- Do not implement conversion orchestration.
- Do not implement the MinerU adapter.
- Do not run MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not parse PDF contents.
- Do not compute source SHA-256.
- Do not implement Markdown normalization.
- Do not implement metadata JSON content.
- Do not implement .report.md content.
- Do not implement pdf2md convert as a working command.
- Do not implement pdf2md doctor.
- Do not add runtime engine selection.
- Do not add alternate conversion engines.
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.

Work Packages

WP2.1: Path Planning Types And Errors

Owner:

feature-generator-agent

Actions:

- Add the smallest project-owned types needed to represent discovered inputs, planned outputs, and overwrite conflicts.
- Add clear project-owned exceptions or error result types for invalid inputs and conflicts.
- Avoid public API promises beyond what Sprint 2 tests verify.

Output:

Path planning can be tested without converter execution.
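
For orientation, the types and errors WP2.1 asks for might look roughly like the sketch below. Every name here (`PathPlanningError`, `InputDiscoveryError`, `DiscoveredInput`, `OutputPlan`, `OverwriteConflictError`) is a hypothetical placeholder chosen for illustration. The contract does not fix any of them, and the real module may use error result types instead of exceptions.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Tuple


class PathPlanningError(Exception):
    """Hypothetical base class for project-owned path planning errors."""


class InputDiscoveryError(PathPlanningError):
    """Missing path, non-PDF file, or a directory with no PDFs."""


@dataclass(frozen=True)
class DiscoveredInput:
    source: Path           # the PDF on disk
    relative_parent: Path  # Path() for a single-file input


@dataclass(frozen=True)
class OutputPlan:
    markdown: Path
    assets_dir: Path
    report: Path
    metadata: Optional[Path] = None  # None when metadata is disabled
    raw_dir: Optional[Path] = None   # None when raw output is not kept

    def targets(self) -> Tuple[Path, ...]:
        """Every path this plan may write, skipping disabled outputs."""
        candidates = (self.markdown, self.assets_dir, self.metadata,
                      self.report, self.raw_dir)
        return tuple(p for p in candidates if p is not None)


class OverwriteConflictError(PathPlanningError):
    """Carries every conflicting path so they can be reported together."""

    def __init__(self, conflicts: Tuple[Path, ...]) -> None:
        self.conflicts = conflicts
        listing = ", ".join(str(p) for p in conflicts)
        super().__init__(
            f"{len(conflicts)} planned output(s) already exist: {listing}"
        )
```

Frozen dataclasses keep a plan immutable once computed, which suits a preflight step that only describes what later code will write.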

WP2.2: Input Discovery

Owner:

feature-generator-agent

Actions:

- Implement single PDF and directory discovery.
- Require explicit recursive mode for subdirectory traversal.
- Sort results deterministically.
- Preserve local Path objects rather than converting to strings early.

Output:

Discovery behavior matches PRD directory and recursive requirements.
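
A minimal sketch of the discovery rules, assuming a hypothetical `discover_pdfs` function and `InputDiscoveryError` exception (neither name comes from this contract). It checks only the filesystem path and extension, never file contents, and sorts by source-relative path components so the order does not depend on directory enumeration order or separator style.

```python
from pathlib import Path
from typing import List


class InputDiscoveryError(Exception):
    """Hypothetical project-owned error for invalid inputs."""


def is_pdf(path: Path) -> bool:
    # Extension check only; PDF contents are never opened.
    return path.is_file() and path.suffix.lower() == ".pdf"


def discover_pdfs(input_path: Path, recursive: bool = False) -> List[Path]:
    if not input_path.exists():
        raise InputDiscoveryError(f"Input path does not exist: {input_path}")
    if input_path.is_file():
        if not is_pdf(input_path):
            raise InputDiscoveryError(f"Input file is not a PDF: {input_path}")
        return [input_path]
    # Descend into subdirectories only on explicit request.
    candidates = input_path.rglob("*") if recursive else input_path.iterdir()
    found = sorted(
        (p for p in candidates if is_pdf(p)),
        key=lambda p: p.relative_to(input_path).parts,
    )
    if not found:
        raise InputDiscoveryError(f"No PDF files found in: {input_path}")
    return found
```

The function returns `Path` objects throughout, in line with the action item about not converting to strings early.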

WP2.3: Output Planning

Owner:

feature-generator-agent

Actions:

- Plan Markdown, assets, metadata, report, and optional raw output paths.
- Preserve relative subdirectories for recursive directory input.
- Keep all planned outputs under the requested output root.

Output:

Later conversion code can write outputs without rediscovering naming rules.
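
The naming rules from Expected Outputs can be sketched as a single pure function. `plan_outputs`, its parameters, and the dictionary keys are assumptions made for this illustration. Only the `<stem>.md`, `<stem>.assets`, `<stem>.metadata.json`, `<stem>.report.md`, and `<stem>.raw` suffixes come from the contract.

```python
from pathlib import Path
from typing import Dict, Optional


def plan_outputs(
    pdf: Path,
    input_root: Optional[Path],
    output_root: Path,
    metadata: bool = True,
    keep_raw: bool = False,
) -> Dict[str, Optional[Path]]:
    """Compute planned paths for one PDF without touching the filesystem."""
    # Single-PDF input (input_root is None) gets an empty relative parent;
    # directory input keeps the source-relative subdirectory.
    if input_root is None:
        relative_parent = Path()
    else:
        relative_parent = pdf.parent.relative_to(input_root)
    base = output_root / relative_parent
    stem = pdf.stem
    return {
        "markdown": base / f"{stem}.md",
        "assets": base / f"{stem}.assets",
        "metadata": base / f"{stem}.metadata.json" if metadata else None,
        "report": base / f"{stem}.report.md",
        "raw": base / f"{stem}.raw" if keep_raw else None,
    }
```

For example, `plan_outputs(Path("in/sub/a.pdf"), Path("in"), Path("out"))["markdown"]` evaluates to `Path("out/sub/a.md")`, so two PDFs named `a.pdf` in different subdirectories do not collide.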

WP2.4: Overwrite Conflict Detection

Owner:

feature-generator-agent

Actions:

- Check whether any planned output already exists.
- Return or raise a structured conflict list when overwrite is not enabled.
- Permit the plan when overwrite is enabled without deleting anything.

Output:

Existing user files are protected before conversion starts.
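
One possible shape for the preflight check, assuming a hypothetical `check_overwrite` function and `OverwriteConflictError` exception. It collects every existing planned path before deciding, so a single error can list all conflicts, and it never deletes or modifies anything.

```python
from pathlib import Path
from typing import Iterable, List


class OverwriteConflictError(Exception):
    """Hypothetical project-owned error listing all conflicting paths."""

    def __init__(self, conflicts: List[Path]) -> None:
        self.conflicts = conflicts
        listing = "\n".join(f"  {p}" for p in conflicts)
        super().__init__(f"Planned outputs already exist:\n{listing}")


def check_overwrite(planned: Iterable[Path], overwrite: bool = False) -> List[Path]:
    """Return existing planned paths; raise unless overwrite is enabled."""
    # Path.exists() covers both files and directories.
    conflicts = sorted(p for p in planned if p.exists())
    if conflicts and not overwrite:
        raise OverwriteConflictError(conflicts)
    return conflicts
```

Returning the conflict list even when overwrite is enabled lets a later sprint decide how to replace those paths without rescanning.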

WP2.5: Independent Evaluation

Owner:

evaluation-agent

Actions:

- Review the completed path planning implementation against this contract.
- Verify no conversion behavior, MinerU execution, remote runtime path, or alternate engine was added.
- Verify samples/ remains untracked and unstaged.
- Verify tests use temporary files, not committed sample PDFs.

Output:

PASS/FAIL notes with any missing acceptance criteria.

Verification Checks

Required:

- git status --short, run before staging, confirms samples/ remains untracked.
- uv --version is run and the result is recorded.
- uv sync passes.
- uv run pytest passes.
- Targeted path planning tests pass.
- Tests do not require MinerU, CUDA, GPU, model files, samples/, or network.
- No real MinerU dependency is required for default tests.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion behavior is implemented.
- No output files are written outside temporary test directories.
- git diff --check passes.

Recommended:

- Use temporary directories for all filesystem tests.
- Include Windows-relevant path behavior without hard-coding Windows-only separators in assertions.
- Use requirements-guard-agent if path planning reveals a contradiction in PRD or architecture wording.
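
As an illustration of the separator recommendation, the sketch below exercises Windows-style paths on any operating system by using `pathlib.PureWindowsPath`, and asserts on path components rather than on strings containing `\`. The helper `markdown_target` and the test name are hypothetical. The pure path classes are standard library.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def markdown_target(output_root: PurePath, relative_pdf: PurePath) -> PurePath:
    """Hypothetical helper; both arguments must share one path flavour."""
    return output_root / relative_pdf.parent / f"{relative_pdf.stem}.md"


def test_markdown_target_ignores_separator_style() -> None:
    win = markdown_target(
        PureWindowsPath(r"C:\out"), PureWindowsPath(r"문서\보고서.pdf")
    )
    posix = markdown_target(
        PurePosixPath("/out"), PurePosixPath("문서/보고서.pdf")
    )
    # Compare components, not separator-dependent strings.
    assert win.parts[-2:] == ("문서", "보고서.md")
    assert posix.parts[-2:] == win.parts[-2:]
    # as_posix() gives a stable spelling when a string is needed.
    assert win.as_posix() == "C:/out/문서/보고서.md"
```

Pure paths never touch the filesystem, so this kind of test stays fast and needs no temporary directory.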

Hard Failure Criteria

Sprint 2 fails and must stop for a user decision if any of these are true:

- Directory conversion descends recursively without explicit recursive intent.
- Existing planned outputs can be overwritten without explicit overwrite intent.
- Planned output paths can escape the requested output root.
- Default tests require MinerU, CUDA, GPU, model files, network, or samples/.
- The implementation parses PDF contents or invokes conversion behavior.
- The implementation introduces alternate engines or runtime engine selection.
- The implementation introduces --api-url, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- samples/ is staged or committed.
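
The output-root criterion above could be guarded with a containment check along these lines. `ensure_within_root` and `OutputRootEscapeError` are hypothetical names, and resolving both paths before comparing is one possible approach, not something this contract prescribes.

```python
from pathlib import Path


class OutputRootEscapeError(Exception):
    """Hypothetical project-owned error for paths outside the output root."""


def ensure_within_root(planned: Path, output_root: Path) -> Path:
    """Return the resolved planned path, or raise if it leaves the root."""
    # resolve() collapses ".." segments and follows existing symlinks;
    # it is non-strict by default, so the target need not exist yet.
    root = output_root.resolve()
    resolved = planned.resolve()
    try:
        resolved.relative_to(root)
    except ValueError:
        raise OutputRootEscapeError(
            f"Planned path {resolved} is outside output root {root}"
        ) from None
    return resolved
```

A plan built only from `output_root / relative_parent / name` should already stay inside the root. The check is a backstop in case a relative parent ever contains `..` or a symlink redirects a directory.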

Acceptance Criteria

Sprint 2 is complete when:

- src/pdf2md/paths.py exists and owns input discovery plus output path planning.
- Single PDF discovery is tested.
- Non-recursive and recursive directory discovery are tested.
- Non-ASCII PDF filenames are tested with generated temporary files.
- Markdown, assets, metadata JSON, report Markdown, and optional raw output paths are tested.
- Existing-output conflict detection is tested with and without overwrite enabled.
- No conversion, MinerU, Markdown normalization, metadata content, report content, or doctor behavior is implemented.
- uv sync passes.
- uv run pytest passes.
- PROGRESS.md records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.

Handoff Fields

Use these fields when Sprint 2 completes:

Files changed:
Commands run:
Tests passed:
Tests blocked:
Known failures:
Residual risks:
User decisions needed:
Go/no-go recommendation for Sprint 3:
Next action:
