Files
2026-05-08 16:42:19 +09:00

13 KiB

Sprint 4 Contract: MinerU Adapter With Mocked Contract

Status: Implemented Last updated: 2026-05-07

Objective

Build the direct local MinerU 3.1.0 adapter boundary using mocked subprocess results and fake output directories first.

Sprint 4 must establish:

  • A project-owned adapter module that is the only boundary for MinerU CLI interaction.
  • Deterministic command construction for direct local MinerU CLI execution.
  • Strict-local validation that rejects prohibited remote/API/router/backend options.
  • Subprocess execution wrapping that captures stdout, stderr, exit code, command, and generated paths.
  • Optional-file parsing for mocked MinerU output directories.
  • Clear adapter result and warning records for missing MinerU, failed CLI execution, missing output, and strict-local violations.
  • Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, or network.

Sprint 4 is an adapter contract sprint. It must not connect the adapter to real conversion orchestration, Markdown normalization, metadata writing, report generation, or a working pdf2md convert command.

Current Precondition

Sprint 3 is complete:

  • src/pdf2md/paths.py owns input discovery and output path planning.
  • src/pdf2md/ir.py owns project records, block types, warning codes, and warning severities.
  • src/pdf2md/metadata.py builds JSON-serializable metadata and summary counts from project-owned records.
  • uv run pytest passed 46 tests.

Sprint 4 may import project-owned warning records from ir.py, but it must not require raw MinerU objects as public or required return fields.

Touched Surfaces

Allowed:

  • src/pdf2md/mineru_adapter.py
  • src/pdf2md/doctor.py only for minimal adapter availability/version check types if needed; do not implement full pdf2md doctor
  • src/pdf2md/ir.py only for narrowly required warning/result fields discovered while implementing the adapter contract
  • tests/test_mineru_adapter.py or tests/unit/test_mineru_adapter.py
  • tests/test_doctor.py only if doctor.py is touched for adapter availability/version checks
  • README.md only if a small note is needed to clarify mocked adapter tests versus real MinerU setup
  • PLAN.md only for current-goal coordination updates required by the shared agent workflow
  • PROGRESS.md
  • docs/V1IMPLEMENTATIONPLAN.md only if sequencing or constraints need adjustment
  • docs/Sprints/SPRINT4CONTRACT.md

Not allowed:

  • src/pdf2md/conversion.py
  • src/pdf2md/markdown.py
  • src/pdf2md/quality.py
  • src/pdf2md/report.py
  • Working pdf2md convert behavior
  • Full pdf2md doctor behavior
  • scripts/
  • Any real MinerU invocation in default tests
  • Any MinerU/model installation or download script
  • Any PDF content parsing
  • Any Markdown normalization behavior
  • Any metadata JSON file writing
  • Any .report.md content generation
  • Any runtime engine selection or alternate engine support
  • Any committed file under samples/

Expected Outputs

Sprint 4 should produce:

  1. Adapter records and options

    • A small adapter options record for v1-local MinerU execution.
    • A result record containing at least:
      • command arguments
      • input PDF path
      • work/output directory path
      • raw Markdown when found
      • raw structured data when found
      • asset paths when found
      • warnings
      • engine name fixed to MinerU
      • engine version when known
      • engine options
      • exit code
      • stdout
      • stderr
    • No public or required field should expose raw MinerU-specific Python objects.
  2. Availability and version checks

    • Check whether mineru is available using a mockable local mechanism such as shutil.which.
    • Check MinerU version using a mockable subprocess call.
    • Missing MinerU should produce a clear ENGINE_MISSING warning or project-owned adapter error.
    • Version command failure should be explicit and testable.
  3. Direct local command construction

    • Baseline conversion command shape:

      mineru -p <input-pdf> -o <work-dir>
      
    • The command must not include --api-url.

    • The command must not include router mode, HTTP client backend flags, remote API URLs, remote OpenAI-compatible backend settings, or runtime engine selection.

    • GPU/device options may be represented only if they are strict-local and do not introduce remote/backend choices. If the exact MinerU 3.1.0 flag is uncertain, store the requested option without passing a speculative CLI flag until a later source-verified sprint.

  4. Strict-local validation

    • Reject prohibited CLI args and options before subprocess execution.
    • Reject any value that looks like a remote URL in user-controlled adapter options.
    • Allow only direct local mineru CLI execution.
    • Allow the CLI-internal temporary local mineru-api process that MinerU 3.1.0 may start when the CLI runs without --api-url.
    • Do not implement or call an HTTP client backend.
  5. Subprocess wrapper

    • Use dependency injection or a small runner boundary so tests can fake subprocess behavior.
    • Capture stdout, stderr, and exit code.
    • Convert non-zero exit into an adapter result or project-owned error with a MINERU_CLI_FAILED warning.
    • Do not silently fallback to another engine.
  6. Mocked output parsing

    • Parse fake output directories using optional-file behavior.
    • Raw Markdown is optional and should be read only from mocked local files created by tests.
    • Raw structured output is optional and may be represented as a JSON-serializable object or raw text, depending on the fake file extension used in tests.
    • Assets are optional and should be collected as local relative or absolute paths according to the adapter result design.
    • Missing expected output should produce a clear warning or failed result instead of being silently ignored.
    • The adapter must not assume real MinerU output layout is fully known until a later local probe.
  7. Tests

    • Unit tests for availability check success and missing MinerU.
    • Unit tests for version check success and version command failure.
    • Unit tests for command construction.
    • Unit tests that prohibited --api-url, remote URLs, router mode, HTTP backend, and OpenAI-compatible backend options are rejected.
    • Unit tests for mocked successful MinerU output.
    • Unit tests for mocked non-zero exit.
    • Unit tests for mocked missing output.
    • Unit tests proving no real MinerU binary, model files, GPU, samples/, or network are required by default.
  8. Handoff

    • PROGRESS.md records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.

Non-Goals

  • Do not implement conversion orchestration.
  • Do not implement convert_pdf.
  • Do not implement pdf2md convert.
  • Do not implement full pdf2md doctor.
  • Do not install MinerU 3.1.0.
  • Do not download MinerU models.
  • Do not run real MinerU in default tests.
  • Do not parse real PDFs.
  • Do not normalize Markdown.
  • Do not write final Markdown, metadata JSON, assets, or report files as product behavior.
  • Do not compute source SHA-256.
  • Do not implement asset link checking.
  • Do not implement math renderability checking.
  • Do not implement alternate engines or runtime engine selection.
  • Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.

Work Packages

WP4.1: Adapter Types And Strict-Local Options

Owner:

  • mineru-integration-agent
  • feature-generator-agent

Actions:

  • Define minimal adapter options and result records.
  • Encode strict-local defaults.
  • Reject prohibited remote/API/backend options before command execution.

Output:

  • The adapter boundary can be tested without invoking MinerU.

WP4.2: Availability, Version, And Command Builder

Owner:

  • mineru-integration-agent
  • feature-generator-agent

Actions:

  • Implement mockable is_available and version checks.
  • Build the direct local command shape mineru -p <input-pdf> -o <work-dir>.
  • Keep command construction deterministic and easy to inspect in tests.

Output:

  • Later orchestration can call the adapter without knowing MinerU CLI details.

WP4.3: Subprocess Runner Boundary

Owner:

  • feature-generator-agent

Actions:

  • Add a small runner interface or callable boundary for subprocess execution.
  • Capture command, stdout, stderr, exit code, and return values.
  • Map non-zero exits to structured adapter warnings/errors.

Output:

  • Default tests can fake all MinerU behavior.

WP4.4: Mocked Output Parser

Owner:

  • mineru-integration-agent
  • feature-generator-agent

Actions:

  • Parse fake Markdown, JSON/structured, asset, and diagnostic outputs from test-created directories.
  • Treat all raw MinerU output files as optional until real local output is probed.
  • Emit clear warnings for missing usable output.

Output:

  • Adapter result objects can carry raw output into later IR/normalization sprints without binding to a guessed full MinerU layout.

WP4.5: Independent Evaluation

Owner:

  • evaluation-agent

Actions:

  • Review the completed adapter against this contract.
  • Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, Markdown normalization, metadata writing, report generation, or working CLI command was added.
  • Verify samples/ remains untracked and unstaged.

Output:

  • PASS/FAIL notes with any missing acceptance criteria.

Verification Checks

Required:

  • git status --short before staging confirms samples/ remains untracked.
  • uv --version is run and result is recorded.
  • uv sync passes.
  • uv run pytest passes.
  • Targeted MinerU adapter tests pass.
  • Tests do not require real MinerU, CUDA, GPU, model files, samples/, or network.
  • No model downloads occur.
  • No network calls are required.
  • No candidate engine comparison is reintroduced.
  • No conversion orchestration is implemented.
  • No Markdown normalization behavior is implemented.
  • No metadata JSON writing or full report generation is implemented.
  • No working pdf2md convert or full pdf2md doctor behavior is implemented.
  • Strict-local rejection tests cover --api-url, remote URL values, router mode, HTTP backend, and remote OpenAI-compatible backend options.
  • git diff --check passes.

Recommended:

  • Use fake runner classes or functions rather than monkeypatching global subprocess calls everywhere.
  • Keep adapter result data JSON-friendly where practical, but do not force metadata schema generation in Sprint 4.
  • Include a test that no prohibited remote/API flag appears in the successful command args.
  • Use requirements-guard-agent if command flags or strict-local wording conflict across documents.

Hard Failure Criteria

Sprint 4 fails and must stop for a user decision if any of these are true:

  • The adapter passes --api-url.
  • The adapter uses router mode.
  • The adapter uses an HTTP client backend.
  • The adapter accepts a remote API URL or remote OpenAI-compatible backend for runtime conversion.
  • The adapter falls back to another engine after MinerU failure.
  • Default tests require real MinerU, CUDA, GPU, model files, network, or samples/.
  • The implementation installs MinerU or downloads models.
  • The implementation connects the adapter to a working conversion CLI/API.
  • The implementation adds Markdown normalization, metadata file writing, full report generation, or quality checks.
  • The implementation assumes real MinerU output layout is fully known without a later local probe.
  • samples/ is staged or committed.

Acceptance Criteria

Sprint 4 is complete when:

  • src/pdf2md/mineru_adapter.py exists and owns direct local MinerU CLI adapter behavior.
  • Availability and version checks are mock-tested.
  • Conversion command construction is mock-tested and uses mineru -p <input-pdf> -o <work-dir>.
  • Strict-local validation rejects prohibited remote/API/router/backend options.
  • Mocked successful MinerU output produces an adapter result with raw Markdown, raw structured data when available, assets when available, engine info, command args, stdout, stderr, and exit code.
  • Mocked non-zero exit produces a clear failure result or project-owned error with a MINERU_CLI_FAILED warning.
  • Mocked missing MinerU produces a clear ENGINE_MISSING warning or project-owned adapter error.
  • Default tests do not require MinerU, GPU, model files, network, or samples/.
  • No conversion orchestration, Markdown normalization, metadata file writing, full report generation, or working CLI behavior is implemented.
  • uv sync passes.
  • uv run pytest passes.
  • PROGRESS.md records checks performed and residual risks.
  • Independent evaluation is complete.
  • The completed change is committed.

Handoff Fields

Use these fields when Sprint 4 completes:

  • Files changed:
  • Commands run:
  • Tests passed:
  • Tests blocked:
  • Known failures:
  • Residual risks:
  • User decisions needed:
  • Go/no-go recommendation for Sprint 5:
  • Next action: