# Sprint 4 Contract: MinerU Adapter With Mocked Contract Status: Implemented Last updated: 2026-05-07 ## Objective Build the direct local MinerU 3.1.0 adapter boundary using mocked subprocess results and fake output directories first. Sprint 4 must establish: - A project-owned adapter module that is the only boundary for MinerU CLI interaction. - Deterministic command construction for direct local MinerU CLI execution. - Strict-local validation that rejects prohibited remote/API/router/backend options. - Subprocess execution wrapping that captures stdout, stderr, exit code, command, and generated paths. - Optional-file parsing for mocked MinerU output directories. - Clear adapter result and warning records for missing MinerU, failed CLI execution, missing output, and strict-local violations. - Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, or network. Sprint 4 is an adapter contract sprint. It must not connect the adapter to real conversion orchestration, Markdown normalization, metadata writing, report generation, or a working `pdf2md convert` command. ## Current Precondition Sprint 3 is complete: - `src/pdf2md/paths.py` owns input discovery and output path planning. - `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities. - `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records. - `uv run pytest` passed 46 tests. Sprint 4 may import project-owned warning records from `ir.py`, but it must not require raw MinerU objects as public or required return fields. ## Touched Surfaces Allowed: - `src/pdf2md/mineru_adapter.py` - `src/pdf2md/doctor.py` only for minimal adapter availability/version check types if needed; do not implement full `pdf2md doctor` - `src/pdf2md/ir.py` only for narrowly required warning/result fields discovered while implementing the adapter contract - `tests/test_mineru_adapter.py` or `tests/unit/test_mineru_adapter.py` - `tests/test_doctor.py` only if `doctor.py` is touched for adapter availability/version checks - `README.md` only if a small note is needed to clarify mocked adapter tests versus real MinerU setup - `PLAN.md` only for current-goal coordination updates required by the shared agent workflow - `PROGRESS.md` - `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment - `docs/Sprints/SPRINT4CONTRACT.md` Not allowed: - `src/pdf2md/conversion.py` - `src/pdf2md/markdown.py` - `src/pdf2md/quality.py` - `src/pdf2md/report.py` - Working `pdf2md convert` behavior - Full `pdf2md doctor` behavior - `scripts/` - Any real MinerU invocation in default tests - Any MinerU/model installation or download script - Any PDF content parsing - Any Markdown normalization behavior - Any metadata JSON file writing - Any `.report.md` content generation - Any runtime engine selection or alternate engine support - Any committed file under `samples/` ## Expected Outputs Sprint 4 should produce: 1. Adapter records and options - A small adapter options record for v1-local MinerU execution. - A result record containing at least: - command arguments - input PDF path - work/output directory path - raw Markdown when found - raw structured data when found - asset paths when found - warnings - engine name fixed to MinerU - engine version when known - engine options - exit code - stdout - stderr - No public or required field should expose raw MinerU-specific Python objects. 2. Availability and version checks - Check whether `mineru` is available using a mockable local mechanism such as `shutil.which`. - Check MinerU version using a mockable subprocess call. - Missing MinerU should produce a clear `ENGINE_MISSING` warning or project-owned adapter error. - Version command failure should be explicit and testable. 3. Direct local command construction - Baseline conversion command shape: ```text mineru -p -o ``` - The command must not include `--api-url`. - The command must not include router mode, HTTP client backend flags, remote API URLs, remote OpenAI-compatible backend settings, or runtime engine selection. - GPU/device options may be represented only if they are strict-local and do not introduce remote/backend choices. If the exact MinerU 3.1.0 flag is uncertain, store the requested option without passing a speculative CLI flag until a later source-verified sprint. 4. Strict-local validation - Reject prohibited CLI args and options before subprocess execution. - Reject any value that looks like a remote URL in user-controlled adapter options. - Allow only direct local `mineru` CLI execution. - Allow the CLI-internal temporary local `mineru-api` process that MinerU 3.1.0 may start when the CLI runs without `--api-url`. - Do not implement or call an HTTP client backend. 5. Subprocess wrapper - Use dependency injection or a small runner boundary so tests can fake subprocess behavior. - Capture stdout, stderr, and exit code. - Convert non-zero exit into an adapter result or project-owned error with a `MINERU_CLI_FAILED` warning. - Do not silently fallback to another engine. 6. Mocked output parsing - Parse fake output directories using optional-file behavior. - Raw Markdown is optional and should be read only from mocked local files created by tests. - Raw structured output is optional and may be represented as a JSON-serializable object or raw text, depending on the fake file extension used in tests. - Assets are optional and should be collected as local relative or absolute paths according to the adapter result design. - Missing expected output should produce a clear warning or failed result instead of being silently ignored. - The adapter must not assume real MinerU output layout is fully known until a later local probe. 7. Tests - Unit tests for availability check success and missing MinerU. - Unit tests for version check success and version command failure. - Unit tests for command construction. - Unit tests that prohibited `--api-url`, remote URLs, router mode, HTTP backend, and OpenAI-compatible backend options are rejected. - Unit tests for mocked successful MinerU output. - Unit tests for mocked non-zero exit. - Unit tests for mocked missing output. - Unit tests proving no real MinerU binary, model files, GPU, `samples/`, or network are required by default. 8. Handoff - `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action. ## Non-Goals - Do not implement conversion orchestration. - Do not implement `convert_pdf`. - Do not implement `pdf2md convert`. - Do not implement full `pdf2md doctor`. - Do not install MinerU 3.1.0. - Do not download MinerU models. - Do not run real MinerU in default tests. - Do not parse real PDFs. - Do not normalize Markdown. - Do not write final Markdown, metadata JSON, assets, or report files as product behavior. - Do not compute source SHA-256. - Do not implement asset link checking. - Do not implement math renderability checking. - Do not implement alternate engines or runtime engine selection. - Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support. ## Work Packages ### WP4.1: Adapter Types And Strict-Local Options Owner: - `mineru-integration-agent` - `feature-generator-agent` Actions: - Define minimal adapter options and result records. - Encode strict-local defaults. - Reject prohibited remote/API/backend options before command execution. Output: - The adapter boundary can be tested without invoking MinerU. ### WP4.2: Availability, Version, And Command Builder Owner: - `mineru-integration-agent` - `feature-generator-agent` Actions: - Implement mockable `is_available` and `version` checks. - Build the direct local command shape `mineru -p -o `. - Keep command construction deterministic and easy to inspect in tests. Output: - Later orchestration can call the adapter without knowing MinerU CLI details. ### WP4.3: Subprocess Runner Boundary Owner: - `feature-generator-agent` Actions: - Add a small runner interface or callable boundary for subprocess execution. - Capture command, stdout, stderr, exit code, and return values. - Map non-zero exits to structured adapter warnings/errors. Output: - Default tests can fake all MinerU behavior. ### WP4.4: Mocked Output Parser Owner: - `mineru-integration-agent` - `feature-generator-agent` Actions: - Parse fake Markdown, JSON/structured, asset, and diagnostic outputs from test-created directories. - Treat all raw MinerU output files as optional until real local output is probed. - Emit clear warnings for missing usable output. Output: - Adapter result objects can carry raw output into later IR/normalization sprints without binding to a guessed full MinerU layout. ### WP4.5: Independent Evaluation Owner: - `evaluation-agent` Actions: - Review the completed adapter against this contract. - Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, Markdown normalization, metadata writing, report generation, or working CLI command was added. - Verify `samples/` remains untracked and unstaged. Output: - PASS/FAIL notes with any missing acceptance criteria. ## Verification Checks Required: - `git status --short` before staging confirms `samples/` remains untracked. - `uv --version` is run and result is recorded. - `uv sync` passes. - `uv run pytest` passes. - Targeted MinerU adapter tests pass. - Tests do not require real MinerU, CUDA, GPU, model files, `samples/`, or network. - No model downloads occur. - No network calls are required. - No candidate engine comparison is reintroduced. - No conversion orchestration is implemented. - No Markdown normalization behavior is implemented. - No metadata JSON writing or full report generation is implemented. - No working `pdf2md convert` or full `pdf2md doctor` behavior is implemented. - Strict-local rejection tests cover `--api-url`, remote URL values, router mode, HTTP backend, and remote OpenAI-compatible backend options. - `git diff --check` passes. Recommended: - Use fake runner classes or functions rather than monkeypatching global subprocess calls everywhere. - Keep adapter result data JSON-friendly where practical, but do not force metadata schema generation in Sprint 4. - Include a test that no prohibited remote/API flag appears in the successful command args. - Use `requirements-guard-agent` if command flags or strict-local wording conflict across documents. ## Hard Failure Criteria Sprint 4 fails and must stop for a user decision if any of these are true: - The adapter passes `--api-url`. - The adapter uses router mode. - The adapter uses an HTTP client backend. - The adapter accepts a remote API URL or remote OpenAI-compatible backend for runtime conversion. - The adapter falls back to another engine after MinerU failure. - Default tests require real MinerU, CUDA, GPU, model files, network, or `samples/`. - The implementation installs MinerU or downloads models. - The implementation connects the adapter to a working conversion CLI/API. - The implementation adds Markdown normalization, metadata file writing, full report generation, or quality checks. - The implementation assumes real MinerU output layout is fully known without a later local probe. - `samples/` is staged or committed. ## Acceptance Criteria Sprint 4 is complete when: - `src/pdf2md/mineru_adapter.py` exists and owns direct local MinerU CLI adapter behavior. - Availability and version checks are mock-tested. - Conversion command construction is mock-tested and uses `mineru -p -o `. - Strict-local validation rejects prohibited remote/API/router/backend options. - Mocked successful MinerU output produces an adapter result with raw Markdown, raw structured data when available, assets when available, engine info, command args, stdout, stderr, and exit code. - Mocked non-zero exit produces a clear failure result or project-owned error with a `MINERU_CLI_FAILED` warning. - Mocked missing MinerU produces a clear `ENGINE_MISSING` warning or project-owned adapter error. - Default tests do not require MinerU, GPU, model files, network, or `samples/`. - No conversion orchestration, Markdown normalization, metadata file writing, full report generation, or working CLI behavior is implemented. - `uv sync` passes. - `uv run pytest` passes. - `PROGRESS.md` records checks performed and residual risks. - Independent evaluation is complete. - The completed change is committed. ## Handoff Fields Use these fields when Sprint 4 completes: - Files changed: - Commands run: - Tests passed: - Tests blocked: - Known failures: - Residual risks: - User decisions needed: - Go/no-go recommendation for Sprint 5: - Next action: