add pdftomd

김경종
2026-05-08 16:42:19 +09:00
parent 551ab50735
commit 88d6b92283
99 changed files with 47332 additions and 0 deletions
@@ -0,0 +1,316 @@
# Sprint 4 Contract: MinerU Adapter With Mocked Contract
Status: Implemented
Last updated: 2026-05-07
## Objective
Build the direct local MinerU 3.1.0 adapter boundary using mocked subprocess results and fake output directories first.
Sprint 4 must establish:
- A project-owned adapter module that is the only boundary for MinerU CLI interaction.
- Deterministic command construction for direct local MinerU CLI execution.
- Strict-local validation that rejects prohibited remote/API/router/backend options.
- Subprocess execution wrapping that captures stdout, stderr, exit code, command, and generated paths.
- Optional-file parsing for mocked MinerU output directories.
- Clear adapter result and warning records for missing MinerU, failed CLI execution, missing output, and strict-local violations.
- Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, or network.
Sprint 4 is an adapter contract sprint. It must not connect the adapter to real conversion orchestration, Markdown normalization, metadata writing, report generation, or a working `pdf2md convert` command.
## Current Precondition
Sprint 3 is complete:
- `src/pdf2md/paths.py` owns input discovery and output path planning.
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
- `uv run pytest` passed 46 tests.
Sprint 4 may import project-owned warning records from `ir.py`, but it must not require raw MinerU objects as public or required return fields.
## Touched Surfaces
Allowed:
- `src/pdf2md/mineru_adapter.py`
- `src/pdf2md/doctor.py` only for minimal adapter availability/version check types if needed; do not implement full `pdf2md doctor`
- `src/pdf2md/ir.py` only for narrowly required warning/result fields discovered while implementing the adapter contract
- `tests/test_mineru_adapter.py` or `tests/unit/test_mineru_adapter.py`
- `tests/test_doctor.py` only if `doctor.py` is touched for adapter availability/version checks
- `README.md` only if a small note is needed to clarify mocked adapter tests versus real MinerU setup
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
- `docs/Sprints/SPRINT4CONTRACT.md`
Not allowed:
- `src/pdf2md/conversion.py`
- `src/pdf2md/markdown.py`
- `src/pdf2md/quality.py`
- `src/pdf2md/report.py`
- Working `pdf2md convert` behavior
- Full `pdf2md doctor` behavior
- `scripts/`
- Any real MinerU invocation in default tests
- Any MinerU/model installation or download script
- Any PDF content parsing
- Any Markdown normalization behavior
- Any metadata JSON file writing
- Any `.report.md` content generation
- Any runtime engine selection or alternate engine support
- Any committed file under `samples/`
## Expected Outputs
Sprint 4 should produce:
1. Adapter records and options
- A small adapter options record for v1-local MinerU execution.
- A result record containing at least:
- command arguments
- input PDF path
- work/output directory path
- raw Markdown when found
- raw structured data when found
- asset paths when found
- warnings
- engine name fixed to MinerU
- engine version when known
- engine options
- exit code
- stdout
- stderr
- No public or required field should expose raw MinerU-specific Python objects.
2. Availability and version checks
- Check whether `mineru` is available using a mockable local mechanism such as `shutil.which`.
- Check MinerU version using a mockable subprocess call.
- Missing MinerU should produce a clear `ENGINE_MISSING` warning or project-owned adapter error.
- Version command failure should be explicit and testable.
3. Direct local command construction
- Baseline conversion command shape:
```text
mineru -p <input-pdf> -o <work-dir>
```
- The command must not include `--api-url`.
- The command must not include router mode, HTTP client backend flags, remote API URLs, remote OpenAI-compatible backend settings, or runtime engine selection.
- GPU/device options may be represented only if they are strict-local and do not introduce remote/backend choices. If the exact MinerU 3.1.0 flag is uncertain, store the requested option without passing a speculative CLI flag until a later source-verified sprint.
4. Strict-local validation
- Reject prohibited CLI args and options before subprocess execution.
- Reject any value that looks like a remote URL in user-controlled adapter options.
- Allow only direct local `mineru` CLI execution.
- Allow the CLI-internal temporary local `mineru-api` process that MinerU 3.1.0 may start when the CLI runs without `--api-url`.
- Do not implement or call an HTTP client backend.
5. Subprocess wrapper
- Use dependency injection or a small runner boundary so tests can fake subprocess behavior.
- Capture stdout, stderr, and exit code.
- Convert non-zero exit into an adapter result or project-owned error with a `MINERU_CLI_FAILED` warning.
- Do not silently fall back to another engine.
6. Mocked output parsing
- Parse fake output directories using optional-file behavior.
- Raw Markdown is optional and should be read only from mocked local files created by tests.
- Raw structured output is optional and may be represented as a JSON-serializable object or raw text, depending on the fake file extension used in tests.
- Assets are optional and should be collected as local relative or absolute paths according to the adapter result design.
- Missing expected output should produce a clear warning or failed result instead of being silently ignored.
- The adapter must not assume real MinerU output layout is fully known until a later local probe.
7. Tests
- Unit tests for availability check success and missing MinerU.
- Unit tests for version check success and version command failure.
- Unit tests for command construction.
- Unit tests that prohibited `--api-url`, remote URLs, router mode, HTTP backend, and OpenAI-compatible backend options are rejected.
- Unit tests for mocked successful MinerU output.
- Unit tests for mocked non-zero exit.
- Unit tests for mocked missing output.
- Unit tests proving no real MinerU binary, model files, GPU, `samples/`, or network are required by default.
8. Handoff
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
## Non-Goals
- Do not implement conversion orchestration.
- Do not implement `convert_pdf`.
- Do not implement `pdf2md convert`.
- Do not implement full `pdf2md doctor`.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not run real MinerU in default tests.
- Do not parse real PDFs.
- Do not normalize Markdown.
- Do not write final Markdown, metadata JSON, assets, or report files as product behavior.
- Do not compute source SHA-256.
- Do not implement asset link checking.
- Do not implement math renderability checking.
- Do not implement alternate engines or runtime engine selection.
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.
## Work Packages
### WP4.1: Adapter Types And Strict-Local Options
Owner:
- `mineru-integration-agent`
- `feature-generator-agent`
Actions:
- Define minimal adapter options and result records.
- Encode strict-local defaults.
- Reject prohibited remote/API/backend options before command execution.
Output:
- The adapter boundary can be tested without invoking MinerU.
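A minimal sketch of what the WP4.1 records and the strict-local check might look like. All names here (`MinerUOptions`, `AdapterResult`, `StrictLocalViolation`, `validate_strict_local`) are hypothetical, warnings are shown as plain strings instead of the project-owned records in `ir.py`, and `--api-url` is the only prohibited flag spelled out because it is the only one this contract names.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

# Assumption: only `--api-url` is named by the contract; other prohibited
# flags would be added once they are source-verified.
PROHIBITED_FLAGS = ("--api-url",)
REMOTE_PREFIXES = ("http://", "https://")


class StrictLocalViolation(ValueError):
    """Raised before any subprocess call when an option is not strict-local."""


@dataclass(frozen=True)
class MinerUOptions:
    extra_args: tuple[str, ...] = ()
    device: str | None = None  # stored only; not passed as a speculative CLI flag


@dataclass
class AdapterResult:
    command: list[str]
    input_pdf: Path
    work_dir: Path
    raw_markdown: str | None = None
    raw_structured: Any = None  # JSON-serializable object or raw text
    asset_paths: list[Path] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)  # placeholder for ir.py records
    engine_name: str = "MinerU"
    engine_version: str | None = None
    engine_options: dict[str, Any] = field(default_factory=dict)
    exit_code: int | None = None
    stdout: str = ""
    stderr: str = ""


def validate_strict_local(options: MinerUOptions) -> None:
    """Reject prohibited flags and remote-looking values in user-controlled options."""
    values = list(options.extra_args)
    if options.device is not None:
        values.append(options.device)
    for value in values:
        lowered = value.lower()
        if any(lowered == flag or lowered.startswith(flag + "=") for flag in PROHIBITED_FLAGS):
            raise StrictLocalViolation(f"prohibited flag: {value}")
        if any(prefix in lowered for prefix in REMOTE_PREFIXES):
            raise StrictLocalViolation(f"remote URL not allowed: {value}")
```

Because validation raises before any command is built or run, a unit test can assert rejection without a runner or a MinerU binary being present.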
### WP4.2: Availability, Version, And Command Builder
Owner:
- `mineru-integration-agent`
- `feature-generator-agent`
Actions:
- Implement mockable `is_available` and `version` checks.
- Build the direct local command shape `mineru -p <input-pdf> -o <work-dir>`.
- Keep command construction deterministic and easy to inspect in tests.
Output:
- Later orchestration can call the adapter without knowing MinerU CLI details.
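The WP4.2 checks and command builder could be sketched roughly as follows. The function names and the `VersionCheck` record are hypothetical, and the `mineru --version` invocation is an assumption that is kept overridable through `args` until the real flag is source-verified for MinerU 3.1.0.

```python
from __future__ import annotations

import shutil
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class VersionCheck:
    ok: bool
    version: str | None
    detail: str  # human-readable reason when ok is False


def is_available(which=shutil.which) -> bool:
    """True when a `mineru` executable is found on PATH (mockable via `which`)."""
    return which("mineru") is not None


def check_version(run=subprocess.run, args=("mineru", "--version")) -> VersionCheck:
    """Run the version command; `args` is an assumption, not a verified flag."""
    try:
        completed = run(list(args), capture_output=True, text=True, check=False)
    except OSError as exc:
        return VersionCheck(False, None, f"could not start mineru: {exc}")
    if completed.returncode != 0:
        detail = (completed.stderr or "").strip() or f"exit code {completed.returncode}"
        return VersionCheck(False, None, detail)
    text = (completed.stdout or "").strip()
    return VersionCheck(bool(text), text or None, "" if text else "empty version output")


def build_command(input_pdf: Path, work_dir: Path) -> list[str]:
    """Deterministic direct-local command shape from this contract."""
    return ["mineru", "-p", str(input_pdf), "-o", str(work_dir)]
```

Returning a record instead of raising keeps version command failure explicit and easy to assert on in tests; a project-owned error would work equally well.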
### WP4.3: Subprocess Runner Boundary
Owner:
- `feature-generator-agent`
Actions:
- Add a small runner interface or callable boundary for subprocess execution.
- Capture command, stdout, stderr, exit code, and return values.
- Map non-zero exits to structured adapter warnings/errors.
Output:
- Default tests can fake all MinerU behavior.
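One possible shape for the WP4.3 runner boundary, shown as a sketch. `RunOutcome`, `subprocess_runner`, `run_mineru`, and `fake_runner` are hypothetical names, and the warning is a plain string standing in for a project-owned warning record.

```python
from __future__ import annotations

import subprocess
from dataclasses import dataclass, field
from typing import Sequence


@dataclass
class RunOutcome:
    command: list[str]
    exit_code: int
    stdout: str
    stderr: str
    warnings: list[str] = field(default_factory=list)


def subprocess_runner(command: Sequence[str]) -> subprocess.CompletedProcess:
    """Real runner: executes the command locally and captures its output."""
    return subprocess.run(list(command), capture_output=True, text=True, check=False)


def run_mineru(command: Sequence[str], runner=subprocess_runner) -> RunOutcome:
    """Run through the injected runner and map a non-zero exit to a warning."""
    completed = runner(command)
    outcome = RunOutcome(
        command=list(command),
        exit_code=completed.returncode,
        stdout=completed.stdout or "",
        stderr=completed.stderr or "",
    )
    if completed.returncode != 0:
        outcome.warnings.append("MINERU_CLI_FAILED")
    return outcome


def fake_runner(exit_code: int = 0, stdout: str = "", stderr: str = ""):
    """Test double: returns a canned CompletedProcess without running anything."""
    def _run(command: Sequence[str]) -> subprocess.CompletedProcess:
        return subprocess.CompletedProcess(list(command), exit_code, stdout, stderr)
    return _run
```

Passing `runner=fake_runner(1, "", "err")` in a test exercises the non-zero exit path without touching the global `subprocess` module.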
### WP4.4: Mocked Output Parser
Owner:
- `mineru-integration-agent`
- `feature-generator-agent`
Actions:
- Parse fake Markdown, JSON/structured, asset, and diagnostic outputs from test-created directories.
- Treat all raw MinerU output files as optional until real local output is probed.
- Emit clear warnings for missing usable output.
Output:
- Adapter result objects can carry raw output into later IR/normalization sprints without binding to a guessed full MinerU layout.
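The optional-file parsing in WP4.4 might be sketched as below. The file extensions reflect what a test fixture could create, not a verified MinerU output layout, and `ParsedOutput`, `parse_output_dir`, and the `MINERU_OUTPUT_MISSING` code are hypothetical names that do not appear in this contract.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any

# Assumption: suffixes chosen by test fixtures, not MinerU's real layout.
ASSET_SUFFIXES = {".png", ".jpg", ".jpeg"}


@dataclass
class ParsedOutput:
    raw_markdown: str | None = None
    raw_structured: Any = None
    asset_paths: list[Path] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)


def parse_output_dir(work_dir: Path) -> ParsedOutput:
    """Collect whatever optional outputs exist; warn when nothing usable is found."""
    parsed = ParsedOutput()
    if not work_dir.is_dir():
        parsed.warnings.append("MINERU_OUTPUT_MISSING")  # hypothetical warning code
        return parsed
    # Sorted traversal keeps "first match wins" deterministic across platforms.
    for path in sorted(p for p in work_dir.rglob("*") if p.is_file()):
        suffix = path.suffix.lower()
        if suffix == ".md" and parsed.raw_markdown is None:
            parsed.raw_markdown = path.read_text(encoding="utf-8")
        elif suffix == ".json" and parsed.raw_structured is None:
            text = path.read_text(encoding="utf-8")
            try:
                parsed.raw_structured = json.loads(text)
            except json.JSONDecodeError:
                parsed.raw_structured = text  # keep raw text when not valid JSON
        elif suffix in ASSET_SUFFIXES:
            parsed.asset_paths.append(path.relative_to(work_dir))
    if parsed.raw_markdown is None and parsed.raw_structured is None:
        parsed.warnings.append("MINERU_OUTPUT_MISSING")
    return parsed
```

Asset paths are returned relative to the work directory here; the contract leaves the choice between relative and absolute paths to the adapter result design.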
### WP4.5: Independent Evaluation
Owner:
- `evaluation-agent`
Actions:
- Review the completed adapter against this contract.
- Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, Markdown normalization, metadata writing, report generation, or working CLI command was added.
- Verify `samples/` remains untracked and unstaged.
Output:
- PASS/FAIL notes with any missing acceptance criteria.
## Verification Checks
Required:
- `git status --short` before staging confirms `samples/` remains untracked.
- `uv --version` is run and the result is recorded.
- `uv sync` passes.
- `uv run pytest` passes.
- Targeted MinerU adapter tests pass.
- Tests do not require real MinerU, CUDA, GPU, model files, `samples/`, or network.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion orchestration is implemented.
- No Markdown normalization behavior is implemented.
- No metadata JSON writing or full report generation is implemented.
- No working `pdf2md convert` or full `pdf2md doctor` behavior is implemented.
- Strict-local rejection tests cover `--api-url`, remote URL values, router mode, HTTP backend, and remote OpenAI-compatible backend options.
- `git diff --check` passes.
Recommended:
- Use fake runner classes or functions rather than monkeypatching global subprocess calls everywhere.
- Keep adapter result data JSON-friendly where practical, but do not force metadata schema generation in Sprint 4.
- Include a test that no prohibited remote/API flag appears in the successful command args.
- Use `requirements-guard-agent` if command flags or strict-local wording conflict across documents.
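As an illustration of the first and third recommendations, a pytest-style test could pair a recording fake runner with a check on the successful command args. The `build_command` stand-in, `RecordingRunner`, and the marker list are hypothetical; only `--api-url` and the command shape come from this contract.

```python
from __future__ import annotations

from pathlib import Path

# Illustrative markers; `--api-url` is the only flag this contract names.
PROHIBITED_MARKERS = ("--api-url", "http://", "https://")


def build_command(input_pdf: Path, work_dir: Path) -> list[str]:
    """Stand-in for the adapter's command builder (hypothetical name)."""
    return ["mineru", "-p", str(input_pdf), "-o", str(work_dir)]


class RecordingRunner:
    """Fake runner that records commands instead of executing them."""

    def __init__(self) -> None:
        self.calls: list[list[str]] = []

    def __call__(self, command: list[str]) -> tuple[int, str, str]:
        self.calls.append(list(command))
        return 0, "", ""  # exit code, stdout, stderr


def test_successful_command_has_no_remote_flags() -> None:
    runner = RecordingRunner()
    exit_code, _stdout, _stderr = runner(build_command(Path("in.pdf"), Path("work")))
    assert exit_code == 0
    assert runner.calls == [["mineru", "-p", "in.pdf", "-o", "work"]]
    for arg in runner.calls[0]:
        assert not any(marker in arg.lower() for marker in PROHIBITED_MARKERS)
```

A callable class keeps the recorded calls on the instance, so no global `subprocess` patching is needed.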
## Hard Failure Criteria
Sprint 4 fails and must stop for a user decision if any of these are true:
- The adapter passes `--api-url`.
- The adapter uses router mode.
- The adapter uses an HTTP client backend.
- The adapter accepts a remote API URL or remote OpenAI-compatible backend for runtime conversion.
- The adapter falls back to another engine after MinerU failure.
- Default tests require real MinerU, CUDA, GPU, model files, network, or `samples/`.
- The implementation installs MinerU or downloads models.
- The implementation connects the adapter to a working conversion CLI/API.
- The implementation adds Markdown normalization, metadata file writing, full report generation, or quality checks.
- The implementation assumes real MinerU output layout is fully known without a later local probe.
- `samples/` is staged or committed.
## Acceptance Criteria
Sprint 4 is complete when:
- `src/pdf2md/mineru_adapter.py` exists and owns direct local MinerU CLI adapter behavior.
- Availability and version checks are mock-tested.
- Conversion command construction is mock-tested and uses `mineru -p <input-pdf> -o <work-dir>`.
- Strict-local validation rejects prohibited remote/API/router/backend options.
- Mocked successful MinerU output produces an adapter result with raw Markdown, raw structured data when available, assets when available, engine info, command args, stdout, stderr, and exit code.
- Mocked non-zero exit produces a clear failure result or project-owned error with a `MINERU_CLI_FAILED` warning.
- Mocked missing MinerU produces a clear `ENGINE_MISSING` warning or project-owned adapter error.
- Default tests do not require MinerU, GPU, model files, network, or `samples/`.
- No conversion orchestration, Markdown normalization, metadata file writing, full report generation, or working CLI behavior is implemented.
- `uv sync` passes.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.
## Handoff Fields
Use these fields when Sprint 4 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 5:
- Next action: