baram2584/PDFToMD

Fork 0

Files

T

김경종 88d6b92283 add pdftomd

2026-05-08 16:42:19 +09:00

13 KiB

Raw Blame History

Sprint 4 Contract: MinerU Adapter With Mocked Contract

Status: Implemented Last updated: 2026-05-07

Objective

Build the direct local MinerU 3.1.0 adapter boundary using mocked subprocess results and fake output directories first.

Sprint 4 must establish:

A project-owned adapter module that is the only boundary for MinerU CLI interaction.
Deterministic command construction for direct local MinerU CLI execution.
Strict-local validation that rejects prohibited remote/API/router/backend options.
Subprocess execution wrapping that captures stdout, stderr, exit code, command, and generated paths.
Optional-file parsing for mocked MinerU output directories.
Clear adapter result and warning records for missing MinerU, failed CLI execution, missing output, and strict-local violations.
Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, or network.

Sprint 4 is an adapter contract sprint. It must not connect the adapter to real conversion orchestration, Markdown normalization, metadata writing, report generation, or a working pdf2md convert command.

Current Precondition

Sprint 3 is complete:

src/pdf2md/paths.py owns input discovery and output path planning.
src/pdf2md/ir.py owns project records, block types, warning codes, and warning severities.
src/pdf2md/metadata.py builds JSON-serializable metadata and summary counts from project-owned records.
uv run pytest passed 46 tests.

Sprint 4 may import project-owned warning records from ir.py, but it must not require raw MinerU objects as public or required return fields.

Touched Surfaces

Allowed:

src/pdf2md/mineru_adapter.py
src/pdf2md/doctor.py only for minimal adapter availability/version check types if needed; do not implement full pdf2md doctor
src/pdf2md/ir.py only for narrowly required warning/result fields discovered while implementing the adapter contract
tests/test_mineru_adapter.py or tests/unit/test_mineru_adapter.py
tests/test_doctor.py only if doctor.py is touched for adapter availability/version checks
README.md only if a small note is needed to clarify mocked adapter tests versus real MinerU setup
PLAN.md only for current-goal coordination updates required by the shared agent workflow
PROGRESS.md
docs/V1IMPLEMENTATIONPLAN.md only if sequencing or constraints need adjustment
docs/Sprints/SPRINT4CONTRACT.md

Not allowed:

src/pdf2md/conversion.py
src/pdf2md/markdown.py
src/pdf2md/quality.py
src/pdf2md/report.py
Working pdf2md convert behavior
Full pdf2md doctor behavior
scripts/
Any real MinerU invocation in default tests
Any MinerU/model installation or download script
Any PDF content parsing
Any Markdown normalization behavior
Any metadata JSON file writing
Any .report.md content generation
Any runtime engine selection or alternate engine support
Any committed file under samples/

Expected Outputs

Sprint 4 should produce:

Adapter records and options
- A small adapter options record for v1-local MinerU execution.
- A result record containing at least:
  - command arguments
  - input PDF path
  - work/output directory path
  - raw Markdown when found
  - raw structured data when found
  - asset paths when found
  - warnings
  - engine name fixed to MinerU
  - engine version when known
  - engine options
  - exit code
  - stdout
  - stderr
- No public or required field should expose raw MinerU-specific Python objects.
Availability and version checks
- Check whether mineru is available using a mockable local mechanism such as shutil.which.
- Check MinerU version using a mockable subprocess call.
- Missing MinerU should produce a clear ENGINE_MISSING warning or project-owned adapter error.
- Version command failure should be explicit and testable.
Direct local command construction
- Baseline conversion command shape:
```
mineru -p <input-pdf> -o <work-dir>
```
- The command must not include --api-url.
- The command must not include router mode, HTTP client backend flags, remote API URLs, remote OpenAI-compatible backend settings, or runtime engine selection.
- GPU/device options may be represented only if they are strict-local and do not introduce remote/backend choices. If the exact MinerU 3.1.0 flag is uncertain, store the requested option without passing a speculative CLI flag until a later source-verified sprint.
Strict-local validation
- Reject prohibited CLI args and options before subprocess execution.
- Reject any value that looks like a remote URL in user-controlled adapter options.
- Allow only direct local mineru CLI execution.
- Allow the CLI-internal temporary local mineru-api process that MinerU 3.1.0 may start when the CLI runs without --api-url.
- Do not implement or call an HTTP client backend.
Subprocess wrapper
- Use dependency injection or a small runner boundary so tests can fake subprocess behavior.
- Capture stdout, stderr, and exit code.
- Convert non-zero exit into an adapter result or project-owned error with a MINERU_CLI_FAILED warning.
- Do not silently fallback to another engine.
Mocked output parsing
- Parse fake output directories using optional-file behavior.
- Raw Markdown is optional and should be read only from mocked local files created by tests.
- Raw structured output is optional and may be represented as a JSON-serializable object or raw text, depending on the fake file extension used in tests.
- Assets are optional and should be collected as local relative or absolute paths according to the adapter result design.
- Missing expected output should produce a clear warning or failed result instead of being silently ignored.
- The adapter must not assume real MinerU output layout is fully known until a later local probe.
Tests
- Unit tests for availability check success and missing MinerU.
- Unit tests for version check success and version command failure.
- Unit tests for command construction.
- Unit tests that prohibited --api-url, remote URLs, router mode, HTTP backend, and OpenAI-compatible backend options are rejected.
- Unit tests for mocked successful MinerU output.
- Unit tests for mocked non-zero exit.
- Unit tests for mocked missing output.
- Unit tests proving no real MinerU binary, model files, GPU, samples/, or network are required by default.
Handoff
- PROGRESS.md records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.

Non-Goals

Do not implement conversion orchestration.
Do not implement convert_pdf.
Do not implement pdf2md convert.
Do not implement full pdf2md doctor.
Do not install MinerU 3.1.0.
Do not download MinerU models.
Do not run real MinerU in default tests.
Do not parse real PDFs.
Do not normalize Markdown.
Do not write final Markdown, metadata JSON, assets, or report files as product behavior.
Do not compute source SHA-256.
Do not implement asset link checking.
Do not implement math renderability checking.
Do not implement alternate engines or runtime engine selection.
Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.

Work Packages

WP4.1: Adapter Types And Strict-Local Options

Owner:

mineru-integration-agent
feature-generator-agent

Actions:

Define minimal adapter options and result records.
Encode strict-local defaults.
Reject prohibited remote/API/backend options before command execution.

Output:

The adapter boundary can be tested without invoking MinerU.

WP4.2: Availability, Version, And Command Builder

Owner:

mineru-integration-agent
feature-generator-agent

Actions:

Implement mockable is_available and version checks.
Build the direct local command shape mineru -p <input-pdf> -o <work-dir>.
Keep command construction deterministic and easy to inspect in tests.

Output:

Later orchestration can call the adapter without knowing MinerU CLI details.

WP4.3: Subprocess Runner Boundary

Owner:

feature-generator-agent

Actions:

Add a small runner interface or callable boundary for subprocess execution.
Capture command, stdout, stderr, exit code, and return values.
Map non-zero exits to structured adapter warnings/errors.

Output:

Default tests can fake all MinerU behavior.

WP4.4: Mocked Output Parser

Owner:

mineru-integration-agent
feature-generator-agent

Actions:

Parse fake Markdown, JSON/structured, asset, and diagnostic outputs from test-created directories.
Treat all raw MinerU output files as optional until real local output is probed.
Emit clear warnings for missing usable output.

Output:

Adapter result objects can carry raw output into later IR/normalization sprints without binding to a guessed full MinerU layout.

WP4.5: Independent Evaluation

Owner:

evaluation-agent

Actions:

Review the completed adapter against this contract.
Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, Markdown normalization, metadata writing, report generation, or working CLI command was added.
Verify samples/ remains untracked and unstaged.

Output:

PASS/FAIL notes with any missing acceptance criteria.

Verification Checks

Required:

git status --short before staging confirms samples/ remains untracked.
uv --version is run and result is recorded.
uv sync passes.
uv run pytest passes.
Targeted MinerU adapter tests pass.
Tests do not require real MinerU, CUDA, GPU, model files, samples/, or network.
No model downloads occur.
No network calls are required.
No candidate engine comparison is reintroduced.
No conversion orchestration is implemented.
No Markdown normalization behavior is implemented.
No metadata JSON writing or full report generation is implemented.
No working pdf2md convert or full pdf2md doctor behavior is implemented.
Strict-local rejection tests cover --api-url, remote URL values, router mode, HTTP backend, and remote OpenAI-compatible backend options.
git diff --check passes.

Recommended:

Use fake runner classes or functions rather than monkeypatching global subprocess calls everywhere.
Keep adapter result data JSON-friendly where practical, but do not force metadata schema generation in Sprint 4.
Include a test that no prohibited remote/API flag appears in the successful command args.
Use requirements-guard-agent if command flags or strict-local wording conflict across documents.

Hard Failure Criteria

Sprint 4 fails and must stop for a user decision if any of these are true:

The adapter passes --api-url.
The adapter uses router mode.
The adapter uses an HTTP client backend.
The adapter accepts a remote API URL or remote OpenAI-compatible backend for runtime conversion.
The adapter falls back to another engine after MinerU failure.
Default tests require real MinerU, CUDA, GPU, model files, network, or samples/.
The implementation installs MinerU or downloads models.
The implementation connects the adapter to a working conversion CLI/API.
The implementation adds Markdown normalization, metadata file writing, full report generation, or quality checks.
The implementation assumes real MinerU output layout is fully known without a later local probe.
samples/ is staged or committed.

Acceptance Criteria

Sprint 4 is complete when:

src/pdf2md/mineru_adapter.py exists and owns direct local MinerU CLI adapter behavior.
Availability and version checks are mock-tested.
Conversion command construction is mock-tested and uses mineru -p <input-pdf> -o <work-dir>.
Strict-local validation rejects prohibited remote/API/router/backend options.
Mocked successful MinerU output produces an adapter result with raw Markdown, raw structured data when available, assets when available, engine info, command args, stdout, stderr, and exit code.
Mocked non-zero exit produces a clear failure result or project-owned error with a MINERU_CLI_FAILED warning.
Mocked missing MinerU produces a clear ENGINE_MISSING warning or project-owned adapter error.
Default tests do not require MinerU, GPU, model files, network, or samples/.
No conversion orchestration, Markdown normalization, metadata file writing, full report generation, or working CLI behavior is implemented.
uv sync passes.
uv run pytest passes.
PROGRESS.md records checks performed and residual risks.
Independent evaluation is complete.
The completed change is committed.

Handoff Fields

Use these fields when Sprint 4 completes:

Files changed:
Commands run:
Tests passed:
Tests blocked:
Known failures:
Residual risks:
User decisions needed:
Go/no-go recommendation for Sprint 5:
Next action:

13 KiB Raw Blame History

Sprint 4 Contract: MinerU Adapter With Mocked Contract

Objective

Current Precondition

Touched Surfaces

Expected Outputs

Non-Goals

Work Packages

WP4.1: Adapter Types And Strict-Local Options

WP4.2: Availability, Version, And Command Builder

WP4.3: Subprocess Runner Boundary

WP4.4: Mocked Output Parser

WP4.5: Independent Evaluation

Verification Checks

Hard Failure Criteria

Acceptance Criteria

Handoff Fields

13 KiB

Raw Blame History