add pdftomd
This commit is contained in:
@@ -0,0 +1,316 @@
|
||||
# Sprint 4 Contract: MinerU Adapter With Mocked Contract
|
||||
|
||||
Status: Implemented
|
||||
Last updated: 2026-05-07
|
||||
|
||||
## Objective
|
||||
|
||||
Build the direct local MinerU 3.1.0 adapter boundary using mocked subprocess results and fake output directories first.
|
||||
|
||||
Sprint 4 must establish:
|
||||
|
||||
- A project-owned adapter module that is the only boundary for MinerU CLI interaction.
|
||||
- Deterministic command construction for direct local MinerU CLI execution.
|
||||
- Strict-local validation that rejects prohibited remote/API/router/backend options.
|
||||
- Subprocess execution wrapping that captures stdout, stderr, exit code, command, and generated paths.
|
||||
- Optional-file parsing for mocked MinerU output directories.
|
||||
- Clear adapter result and warning records for missing MinerU, failed CLI execution, missing output, and strict-local violations.
|
||||
- Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, or network.
|
||||
|
||||
Sprint 4 is an adapter contract sprint. It must not connect the adapter to real conversion orchestration, Markdown normalization, metadata writing, report generation, or a working `pdf2md convert` command.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
Sprint 3 is complete:
|
||||
|
||||
- `src/pdf2md/paths.py` owns input discovery and output path planning.
|
||||
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
|
||||
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
|
||||
- `uv run pytest` passed 46 tests.
|
||||
|
||||
Sprint 4 may import project-owned warning records from `ir.py`, but it must not require raw MinerU objects as public or required return fields.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed:
|
||||
|
||||
- `src/pdf2md/mineru_adapter.py`
|
||||
- `src/pdf2md/doctor.py` only for minimal adapter availability/version check types if needed; do not implement full `pdf2md doctor`
|
||||
- `src/pdf2md/ir.py` only for narrowly required warning/result fields discovered while implementing the adapter contract
|
||||
- `tests/test_mineru_adapter.py` or `tests/unit/test_mineru_adapter.py`
|
||||
- `tests/test_doctor.py` only if `doctor.py` is touched for adapter availability/version checks
|
||||
- `README.md` only if a small note is needed to clarify mocked adapter tests versus real MinerU setup
|
||||
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
|
||||
- `PROGRESS.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
|
||||
- `docs/Sprints/SPRINT4CONTRACT.md`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- `src/pdf2md/conversion.py`
|
||||
- `src/pdf2md/markdown.py`
|
||||
- `src/pdf2md/quality.py`
|
||||
- `src/pdf2md/report.py`
|
||||
- Working `pdf2md convert` behavior
|
||||
- Full `pdf2md doctor` behavior
|
||||
- `scripts/`
|
||||
- Any real MinerU invocation in default tests
|
||||
- Any MinerU/model installation or download script
|
||||
- Any PDF content parsing
|
||||
- Any Markdown normalization behavior
|
||||
- Any metadata JSON file writing
|
||||
- Any `.report.md` content generation
|
||||
- Any runtime engine selection or alternate engine support
|
||||
- Any committed file under `samples/`
|
||||
|
||||
## Expected Outputs
|
||||
|
||||
Sprint 4 should produce:
|
||||
|
||||
1. Adapter records and options
|
||||
- A small adapter options record for v1-local MinerU execution.
|
||||
- A result record containing at least:
|
||||
- command arguments
|
||||
- input PDF path
|
||||
- work/output directory path
|
||||
- raw Markdown when found
|
||||
- raw structured data when found
|
||||
- asset paths when found
|
||||
- warnings
|
||||
- engine name fixed to MinerU
|
||||
- engine version when known
|
||||
- engine options
|
||||
- exit code
|
||||
- stdout
|
||||
- stderr
|
||||
- No public or required field should expose raw MinerU-specific Python objects.
|
||||
|
||||
2. Availability and version checks
|
||||
- Check whether `mineru` is available using a mockable local mechanism such as `shutil.which`.
|
||||
- Check MinerU version using a mockable subprocess call.
|
||||
- Missing MinerU should produce a clear `ENGINE_MISSING` warning or project-owned adapter error.
|
||||
- Version command failure should be explicit and testable.
|
||||
|
||||
3. Direct local command construction
|
||||
- Baseline conversion command shape:
|
||||
|
||||
```text
|
||||
mineru -p <input-pdf> -o <work-dir>
|
||||
```
|
||||
|
||||
- The command must not include `--api-url`.
|
||||
- The command must not include router mode, HTTP client backend flags, remote API URLs, remote OpenAI-compatible backend settings, or runtime engine selection.
|
||||
- GPU/device options may be represented only if they are strict-local and do not introduce remote/backend choices. If the exact MinerU 3.1.0 flag is uncertain, store the requested option without passing a speculative CLI flag until a later source-verified sprint.
|
||||
|
||||
4. Strict-local validation
|
||||
- Reject prohibited CLI args and options before subprocess execution.
|
||||
- Reject any value that looks like a remote URL in user-controlled adapter options.
|
||||
- Allow only direct local `mineru` CLI execution.
|
||||
- Allow the CLI-internal temporary local `mineru-api` process that MinerU 3.1.0 may start when the CLI runs without `--api-url`.
|
||||
- Do not implement or call an HTTP client backend.
|
||||
|
||||
5. Subprocess wrapper
|
||||
- Use dependency injection or a small runner boundary so tests can fake subprocess behavior.
|
||||
- Capture stdout, stderr, and exit code.
|
||||
- Convert non-zero exit into an adapter result or project-owned error with a `MINERU_CLI_FAILED` warning.
|
||||
- Do not silently fallback to another engine.
|
||||
|
||||
6. Mocked output parsing
|
||||
- Parse fake output directories using optional-file behavior.
|
||||
- Raw Markdown is optional and should be read only from mocked local files created by tests.
|
||||
- Raw structured output is optional and may be represented as a JSON-serializable object or raw text, depending on the fake file extension used in tests.
|
||||
- Assets are optional and should be collected as local relative or absolute paths according to the adapter result design.
|
||||
- Missing expected output should produce a clear warning or failed result instead of being silently ignored.
|
||||
- The adapter must not assume real MinerU output layout is fully known until a later local probe.
|
||||
|
||||
7. Tests
|
||||
- Unit tests for availability check success and missing MinerU.
|
||||
- Unit tests for version check success and version command failure.
|
||||
- Unit tests for command construction.
|
||||
- Unit tests that prohibited `--api-url`, remote URLs, router mode, HTTP backend, and OpenAI-compatible backend options are rejected.
|
||||
- Unit tests for mocked successful MinerU output.
|
||||
- Unit tests for mocked non-zero exit.
|
||||
- Unit tests for mocked missing output.
|
||||
- Unit tests proving no real MinerU binary, model files, GPU, `samples/`, or network are required by default.
|
||||
|
||||
8. Handoff
|
||||
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- Do not implement conversion orchestration.
|
||||
- Do not implement `convert_pdf`.
|
||||
- Do not implement `pdf2md convert`.
|
||||
- Do not implement full `pdf2md doctor`.
|
||||
- Do not install MinerU 3.1.0.
|
||||
- Do not download MinerU models.
|
||||
- Do not run real MinerU in default tests.
|
||||
- Do not parse real PDFs.
|
||||
- Do not normalize Markdown.
|
||||
- Do not write final Markdown, metadata JSON, assets, or report files as product behavior.
|
||||
- Do not compute source SHA-256.
|
||||
- Do not implement asset link checking.
|
||||
- Do not implement math renderability checking.
|
||||
- Do not implement alternate engines or runtime engine selection.
|
||||
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.
|
||||
|
||||
## Work Packages
|
||||
|
||||
### WP4.1: Adapter Types And Strict-Local Options
|
||||
|
||||
Owner:
|
||||
|
||||
- `mineru-integration-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Define minimal adapter options and result records.
|
||||
- Encode strict-local defaults.
|
||||
- Reject prohibited remote/API/backend options before command execution.
|
||||
|
||||
Output:
|
||||
|
||||
- The adapter boundary can be tested without invoking MinerU.
|
||||
|
||||
### WP4.2: Availability, Version, And Command Builder
|
||||
|
||||
Owner:
|
||||
|
||||
- `mineru-integration-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Implement mockable `is_available` and `version` checks.
|
||||
- Build the direct local command shape `mineru -p <input-pdf> -o <work-dir>`.
|
||||
- Keep command construction deterministic and easy to inspect in tests.
|
||||
|
||||
Output:
|
||||
|
||||
- Later orchestration can call the adapter without knowing MinerU CLI details.
|
||||
|
||||
### WP4.3: Subprocess Runner Boundary
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add a small runner interface or callable boundary for subprocess execution.
|
||||
- Capture command, stdout, stderr, exit code, and return values.
|
||||
- Map non-zero exits to structured adapter warnings/errors.
|
||||
|
||||
Output:
|
||||
|
||||
- Default tests can fake all MinerU behavior.
|
||||
|
||||
### WP4.4: Mocked Output Parser
|
||||
|
||||
Owner:
|
||||
|
||||
- `mineru-integration-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Parse fake Markdown, JSON/structured, asset, and diagnostic outputs from test-created directories.
|
||||
- Treat all raw MinerU output files as optional until real local output is probed.
|
||||
- Emit clear warnings for missing usable output.
|
||||
|
||||
Output:
|
||||
|
||||
- Adapter result objects can carry raw output into later IR/normalization sprints without binding to a guessed full MinerU layout.
|
||||
|
||||
### WP4.5: Independent Evaluation
|
||||
|
||||
Owner:
|
||||
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Review the completed adapter against this contract.
|
||||
- Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, Markdown normalization, metadata writing, report generation, or working CLI command was added.
|
||||
- Verify `samples/` remains untracked and unstaged.
|
||||
|
||||
Output:
|
||||
|
||||
- PASS/FAIL notes with any missing acceptance criteria.
|
||||
|
||||
## Verification Checks
|
||||
|
||||
Required:
|
||||
|
||||
- `git status --short` before staging confirms `samples/` remains untracked.
|
||||
- `uv --version` is run and result is recorded.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- Targeted MinerU adapter tests pass.
|
||||
- Tests do not require real MinerU, CUDA, GPU, model files, `samples/`, or network.
|
||||
- No model downloads occur.
|
||||
- No network calls are required.
|
||||
- No candidate engine comparison is reintroduced.
|
||||
- No conversion orchestration is implemented.
|
||||
- No Markdown normalization behavior is implemented.
|
||||
- No metadata JSON writing or full report generation is implemented.
|
||||
- No working `pdf2md convert` or full `pdf2md doctor` behavior is implemented.
|
||||
- Strict-local rejection tests cover `--api-url`, remote URL values, router mode, HTTP backend, and remote OpenAI-compatible backend options.
|
||||
- `git diff --check` passes.
|
||||
|
||||
Recommended:
|
||||
|
||||
- Use fake runner classes or functions rather than monkeypatching global subprocess calls everywhere.
|
||||
- Keep adapter result data JSON-friendly where practical, but do not force metadata schema generation in Sprint 4.
|
||||
- Include a test that no prohibited remote/API flag appears in the successful command args.
|
||||
- Use `requirements-guard-agent` if command flags or strict-local wording conflict across documents.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
Sprint 4 fails and must stop for a user decision if any of these are true:
|
||||
|
||||
- The adapter passes `--api-url`.
|
||||
- The adapter uses router mode.
|
||||
- The adapter uses an HTTP client backend.
|
||||
- The adapter accepts a remote API URL or remote OpenAI-compatible backend for runtime conversion.
|
||||
- The adapter falls back to another engine after MinerU failure.
|
||||
- Default tests require real MinerU, CUDA, GPU, model files, network, or `samples/`.
|
||||
- The implementation installs MinerU or downloads models.
|
||||
- The implementation connects the adapter to a working conversion CLI/API.
|
||||
- The implementation adds Markdown normalization, metadata file writing, full report generation, or quality checks.
|
||||
- The implementation assumes real MinerU output layout is fully known without a later local probe.
|
||||
- `samples/` is staged or committed.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
Sprint 4 is complete when:
|
||||
|
||||
- `src/pdf2md/mineru_adapter.py` exists and owns direct local MinerU CLI adapter behavior.
|
||||
- Availability and version checks are mock-tested.
|
||||
- Conversion command construction is mock-tested and uses `mineru -p <input-pdf> -o <work-dir>`.
|
||||
- Strict-local validation rejects prohibited remote/API/router/backend options.
|
||||
- Mocked successful MinerU output produces an adapter result with raw Markdown, raw structured data when available, assets when available, engine info, command args, stdout, stderr, and exit code.
|
||||
- Mocked non-zero exit produces a clear failure result or project-owned error with a `MINERU_CLI_FAILED` warning.
|
||||
- Mocked missing MinerU produces a clear `ENGINE_MISSING` warning or project-owned adapter error.
|
||||
- Default tests do not require MinerU, GPU, model files, network, or `samples/`.
|
||||
- No conversion orchestration, Markdown normalization, metadata file writing, full report generation, or working CLI behavior is implemented.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- `PROGRESS.md` records checks performed and residual risks.
|
||||
- Independent evaluation is complete.
|
||||
- The completed change is committed.
|
||||
|
||||
## Handoff Fields
|
||||
|
||||
Use these fields when Sprint 4 completes:
|
||||
|
||||
- Files changed:
|
||||
- Commands run:
|
||||
- Tests passed:
|
||||
- Tests blocked:
|
||||
- Known failures:
|
||||
- Residual risks:
|
||||
- User decisions needed:
|
||||
- Go/no-go recommendation for Sprint 5:
|
||||
- Next action:
|
||||
Reference in New Issue
Block a user