# Sprint 1 Contract: Project Scaffold And Fast Test Loop

Status: Completed
Last updated: 2026-05-07

## Objective

Create the minimal Python project scaffold and fast local test loop for the PDF-to-Markdown converter.

Sprint 1 must establish:

- A `uv`-managed Python 3.12 project.
- A source package importable as `pdf2md`.
- A reserved `pdf2md` CLI entry point that does not implement conversion yet.
- A fast test command that runs without MinerU, model downloads, GPU access, sample PDFs, or network access.

Sprint 1 is scaffolding only. It must not implement PDF conversion, MinerU execution, Markdown normalization, metadata generation, or report generation.

## Current Precondition

Sprint 0 found that `uv` was not available on PATH in the current local environment.

Sprint 1 resolved this by installing `uv` per-user at `C:\Users\user\.local\bin`.

Before Sprint 1 can be accepted, one of these must happen:

- `uv` is installed and `uv --version` succeeds.
- The user explicitly approves including `uv` bootstrap documentation or setup handling as part of Sprint 1, and the contract result records that `uv sync` could not be run locally.

Do not silently replace `uv` with another package manager.

## Touched Surfaces

Allowed:

- `pyproject.toml`
- `uv.lock`
- `.gitignore`
- `src/pdf2md/__init__.py`
- `src/pdf2md/cli.py` only for a minimal placeholder CLI if needed for entry point verification
- `tests/`
- `README.md` only for minimal setup/test instructions if needed
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
- `docs/Sprints/SPRINT1CONTRACT.md`

Not allowed:

- `src/pdf2md/conversion.py`
- `src/pdf2md/mineru_adapter.py`
- `src/pdf2md/paths.py`
- `src/pdf2md/ir.py`
- `src/pdf2md/markdown.py`
- `src/pdf2md/metadata.py`
- `src/pdf2md/quality.py`
- `src/pdf2md/report.py`
- `src/pdf2md/doctor.py`
- `scripts/`
- Any real MinerU invocation
- Any model download or install script
- Any committed file under `samples/`

## Expected Outputs

Sprint 1 should produce:

1. Project package scaffold
   - `pyproject.toml` with project metadata.
   - Python requirement constrained to Python 3.12.
   - Build configuration suitable for a `src/` layout.
   - `uv.lock` generated by `uv sync`.
   - `.gitignore` entries for local virtual environments, pytest cache, and Python bytecode.
   - Minimal test dependency configuration.
   - CLI entry point name reserved as `pdf2md`.

2. Minimal source package
   - `src/pdf2md/__init__.py`.
   - A stable package import surface.
   - Optional minimal `src/pdf2md/cli.py` placeholder that exits clearly and does not imply conversion is implemented.

3. Fast test loop
   - A minimal test suite that verifies the package imports.
   - If a CLI placeholder is added, a smoke test that verifies the CLI entry point is wired without invoking conversion.
   - Tests must not require MinerU, CUDA, GPU, model files, `samples/`, or network.

4. Developer workflow
   - `uv sync` should work when `uv` is installed.
   - `uv run pytest` should work when `uv` is installed.
   - If `uv` is still missing locally, record the failure explicitly in `PROGRESS.md` and do not mark Sprint 1 complete.

5. Handoff
   - `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.

## Non-Goals

- Do not implement PDF discovery.
- Do not implement conversion orchestration.
- Do not implement the MinerU adapter.
- Do not run MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not implement Markdown normalization.
- Do not implement metadata JSON or `.report.md` output.
- Do not implement `pdf2md doctor`; a CLI placeholder may mention future commands, but it must not create a doctor module.
- Do not add runtime engine selection.
- Do not add alternate conversion engines.
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.

## Work Packages

### WP1.1: Scaffold Metadata

Owner:

- `feature-generator-agent`

Actions:

- Create the minimal `pyproject.toml`.
- Use Python 3.12 constraints.
- Configure a `src/` package layout.
- Configure pytest as the fast local test runner.
- Reserve the `pdf2md` console script.

Output:

- A minimal, maintainable scaffold without speculative dependencies.

### WP1.2: Package Import Surface

Owner:

- `feature-generator-agent`

Actions:

- Create `src/pdf2md/__init__.py`.
- Expose only a minimal version/import surface.
- Avoid public API promises beyond what Sprint 1 verifies.

Output:

- `import pdf2md` succeeds.

### WP1.3: CLI Placeholder

Owner:

- `feature-generator-agent`

Actions:

- If needed for console script verification, create `src/pdf2md/cli.py`.
- The placeholder may expose a help message or a clear "not implemented yet" command.
- It must not create conversion flags beyond the reserved command shape unless tests need them.

Output:

- `pdf2md` entry point is wired without implying conversion works.

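One way such a placeholder might look is sketched below, assuming `argparse` from the standard library. The function names `build_parser` and `main`, the message wording, and the exit code `1` are illustrative choices, not requirements of this contract.

```python
"""Placeholder CLI for pdf2md; conversion is intentionally not implemented."""

import argparse
import sys
from typing import Optional, Sequence

# Hypothetical choice: any non-zero code signals "nothing was converted".
NOT_IMPLEMENTED_EXIT_CODE = 1


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that only offers --help; no conversion flags exist yet."""
    return argparse.ArgumentParser(
        prog="pdf2md",
        description="PDF-to-Markdown converter (scaffold only; "
        "conversion is not implemented yet).",
    )


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse arguments, report that conversion is unavailable, return non-zero."""
    parser = build_parser()
    parser.parse_args(argv)  # handles --help and rejects unknown flags
    print("pdf2md: conversion is not implemented yet.", file=sys.stderr)
    return NOT_IMPLEMENTED_EXIT_CODE
```

Under these assumptions, the console script could be reserved in `pyproject.toml` under `[project.scripts]` with an entry such as `pdf2md = "pdf2md.cli:main"`. Returning a non-zero code keeps scripts from mistaking the placeholder for a successful conversion.
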
### WP1.4: Fast Tests

Owner:

- `feature-generator-agent`
- `evaluation-agent`

Actions:

- Add minimal tests for package import and optional CLI placeholder behavior.
- Ensure tests are local, fast, and independent of MinerU/model/GPU/network state.

Output:

- `uv run pytest` passes when `uv` is available.

### WP1.5: Independent Evaluation

Owner:

- `evaluation-agent`

Actions:

- Review the completed scaffold against this contract.
- Verify no converter implementation was added.
- Verify `samples/` remains untracked and unstaged.
- Verify no runtime remote path or alternate engine was introduced.

Output:

- PASS/FAIL notes with any missing acceptance criteria.

## Verification Checks

Required:

- `git status --short` before staging confirms `samples/` remains untracked.
- `uv --version` is run and result is recorded.
- `uv sync` passes if `uv` is available.
- `uv run pytest` passes if `uv` is available.
- If `uv` is unavailable, Sprint 1 is marked blocked rather than complete.
- Import test passes through the configured test command.
- No real MinerU dependency is required for default tests.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion behavior is implemented.
- `git diff --check` passes.

Recommended:

- Keep `pyproject.toml` dependency list minimal.
- Avoid adding README content beyond setup/test instructions needed for the scaffold.
- Use `requirements-guard-agent` to check document consistency if the scaffold reveals a sequencing issue.

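The `samples/` check in the required list can be done by eye, but it could also be automated. The sketch below is one possible helper that scans the text printed by `git status --short` for staged paths under `samples/`. The function name `staged_sample_paths` is hypothetical, and the parsing is deliberately simplified: it relies on the two status characters plus a space that precede each path, and it strips surrounding quotes without decoding escaped characters.

```python
def staged_sample_paths(status_output: str, prefix: str = "samples/") -> list[str]:
    """Return paths under `prefix` that have a staged change.

    `status_output` is the text printed by `git status --short`, where each
    line is two status characters, a space, then the path.
    """
    staged = []
    for line in status_output.splitlines():
        if len(line) < 4:
            continue  # too short to hold "XY path"
        index_status, path = line[0], line[3:]
        # Renames are printed as "old -> new"; keep the new path.
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        path = path.strip('"')  # simplification: no unescaping of quoted paths
        # " " = unchanged in index, "?" = untracked, "!" = ignored.
        if index_status not in (" ", "?", "!") and path.startswith(prefix):
            staged.append(path)
    return staged


if __name__ == "__main__":
    example = "?? samples/\nA  samples/doc.pdf\n M README.md\n"
    print(staged_sample_paths(example))
```

An empty result means nothing under `samples/` is staged. It does not prove that nothing there was committed earlier; `git ls-files samples/` would be one way to check for already-tracked files.
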
## Hard Failure Criteria

Sprint 1 fails and must stop for a user decision if any of these are true:

- `uv` remains unavailable and the user has not approved bootstrap handling.
- The project cannot be installed as a Python 3.12 package.
- The package cannot be imported as `pdf2md`.
- Default tests require MinerU, model downloads, GPU access, sample PDFs, or network access.
- The scaffold introduces conversion logic outside Sprint 1 scope.
- The scaffold introduces alternate engines or runtime engine selection.
- The scaffold introduces `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- `samples/` is staged or committed.

## Acceptance Criteria

Sprint 1 is complete when:

- `pyproject.toml` exists and defines a minimal Python 3.12 `uv` project.
- `src/pdf2md/__init__.py` exists and `import pdf2md` works through the project environment.
- `uv sync` passes.
- `uv run pytest` passes.
- The `pdf2md` CLI entry point is reserved and does not imply conversion is implemented.
- No converter implementation code beyond the allowed placeholder exists.
- No default test depends on MinerU, GPU, model files, network, or `samples/`.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.

## Handoff Fields

Use these fields when Sprint 1 completes:

- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 2:
- Next action: