Files
2026-05-08 16:42:19 +09:00

11 KiB

Sprint 3 Contract: Domain Records, Metadata, And Warning Model

Status: Completed Last updated: 2026-05-07

Objective

Define project-owned domain records, warning records, and metadata JSON construction before binding the system to MinerU output.

Sprint 3 must establish:

  • Internal records for documents, pages, blocks, assets, warnings, and conversion outputs.
  • Stable warning code and severity definitions aligned with ARCHITECTURE.md.
  • A metadata builder that produces the required v1 top-level and summary fields.
  • Warning aggregation behavior that later report generation can consume.
  • Fast unit tests that do not require MinerU, model files, GPU, sample PDFs, or network.

Sprint 3 is schema and metadata modeling only. It must not run MinerU, parse PDFs, normalize Markdown, generate final report Markdown content, expose a working convert command, or add remote/runtime engine behavior.

Current Precondition

Sprint 2 is complete:

  • src/pdf2md/paths.py owns input discovery and output path planning.
  • tests/test_paths.py verifies directory recursion, non-ASCII filenames, overwrite conflict detection, duplicate planned outputs, and output-root escape prevention.
  • uv run pytest passed 21 tests.

Sprint 3 may use path planning records as context, but it should not depend on actual conversion output.

Touched Surfaces

Allowed:

  • src/pdf2md/ir.py
  • src/pdf2md/metadata.py
  • src/pdf2md/report.py only for a minimal type boundary if metadata/report handoff cannot be expressed cleanly without it
  • src/pdf2md/__init__.py only if exporting a minimal stable type is necessary and tested
  • tests/test_ir.py or tests/unit/test_ir.py
  • tests/test_metadata.py or tests/unit/test_metadata.py
  • PLAN.md only for current-goal coordination updates required by the shared agent workflow
  • PROGRESS.md
  • docs/V1IMPLEMENTATIONPLAN.md only if sequencing or constraints need adjustment
  • docs/Sprints/SPRINT3CONTRACT.md

Not allowed:

  • src/pdf2md/mineru_adapter.py
  • src/pdf2md/markdown.py
  • src/pdf2md/quality.py
  • src/pdf2md/doctor.py
  • scripts/
  • Any real MinerU invocation
  • Any model download or install script
  • Any PDF content parsing
  • Any Markdown normalization behavior
  • Any .report.md content generation beyond a minimal handoff type if absolutely needed
  • Any working pdf2md convert or pdf2md doctor behavior
  • Any committed file under samples/

Expected Outputs

Sprint 3 should produce:

  1. Domain records

    • DocumentRecord or equivalent project-owned record.
    • PageRecord or equivalent with page index and optional page dimensions.
    • BlockRecord or equivalent with block type, optional page index, optional bbox, optional confidence, and optional Markdown character span.
    • AssetRecord or equivalent with stable relative path and optional source page/provenance.
    • WarningRecord or equivalent with code, severity, message, optional page index, and optional bbox.
    • ConversionOutputRecord or equivalent only if useful for connecting metadata to later orchestration; it must not invoke conversion.
  2. Stable enums or constants

    • Block types aligned with ARCHITECTURE.md: heading, paragraph, inline_formula, display_formula, table, figure, caption, footnote, reference, and unknown.
    • Warning codes aligned with ARCHITECTURE.md, including at least:
      • ENGINE_MISSING
      • GPU_UNAVAILABLE
      • LOW_CONFIDENCE_FORMULA
      • MATH_RENDER_FAILED
      • ASSET_LINK_MISSING
      • READING_ORDER_UNCERTAIN
      • STRICT_LOCAL_VIOLATION
      • MINERU_CLI_FAILED
    • Warning severity values sufficient for v1 metadata and report summaries, such as info, warning, and error.
  3. Metadata builder

    • Build a JSON-serializable metadata object with required top-level fields:
      • source_pdf
      • source_sha256
      • created_at
      • engine
      • engine_version
      • engine_options
      • pages
      • assets
      • warnings
      • summary
    • Build required summary fields:
      • pages_processed
      • warning_count
      • asset_count
      • display_formula_count
      • inline_formula_count
      • math_render_error_count
    • Preserve optional fields such as bbox and confidence only when present.
    • Require source_sha256 as an input value. Sprint 3 should not compute hashes by reading PDFs unless the contract is explicitly amended.
    • Produce only plain Python data structures that json.dumps can serialize without custom encoders.
  4. Warning aggregation

    • Count warnings.
    • Count math render failures from MATH_RENDER_FAILED.
    • Preserve warning order unless there is a tested reason to sort.
    • Preserve page-level warning data when available.
  5. Tests

    • Unit tests for domain record serialization.
    • Unit tests for metadata schema creation with all required top-level fields.
    • Unit tests for summary counts.
    • Unit tests for warning aggregation.
    • Unit tests that optional bbox and confidence fields are preserved only when present.
    • Unit tests that metadata is JSON serializable.
    • Unit tests that metadata requires source PDF, source SHA-256, engine, engine version, and page records.
  6. Handoff

    • PROGRESS.md records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.

Non-Goals

  • Do not implement PDF conversion.
  • Do not implement conversion orchestration.
  • Do not implement the MinerU adapter.
  • Do not run MinerU.
  • Do not install MinerU 3.1.0.
  • Do not download MinerU models.
  • Do not parse PDF contents.
  • Do not compute source SHA-256 by reading files unless this contract is explicitly amended.
  • Do not implement Markdown normalization.
  • Do not implement asset link checking.
  • Do not implement math renderability checking.
  • Do not implement full .report.md content generation.
  • Do not implement pdf2md convert as a working command.
  • Do not implement pdf2md doctor.
  • Do not add runtime engine selection.
  • Do not add alternate conversion engines.
  • Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.

Work Packages

WP3.1: Domain Record Types

Owner:

  • metadata-agent
  • feature-generator-agent

Actions:

  • Define small project-owned records for document/page/block/asset/warning concepts.
  • Use simple, typed Python structures that are easy to serialize and test.
  • Keep MinerU-specific raw objects out of public and required fields.

Output:

  • ir.py contains the minimal domain model needed by metadata construction.

WP3.2: Warning Codes And Severities

Owner:

  • metadata-agent
  • feature-generator-agent

Actions:

  • Define stable warning codes from ARCHITECTURE.md.
  • Define severity values and validate warning records against them.
  • Avoid inventing speculative warning categories beyond the known v1 set unless needed by tests.

Output:

  • Warnings are structured, countable, and stable across later sprints.

WP3.3: Metadata Builder

Owner:

  • metadata-agent
  • feature-generator-agent

Actions:

  • Build required metadata JSON data from project-owned records.
  • Preserve optional provenance fields only when present.
  • Require source PDF path, source SHA-256, engine, engine version, pages, assets, warnings, and engine options as explicit inputs.

Output:

  • metadata.py produces the required v1 metadata object without MinerU execution.

WP3.4: Metadata And Warning Tests

Owner:

  • feature-generator-agent
  • evaluation-agent

Actions:

  • Add focused unit tests for schema, counts, optional fields, JSON serialization, and validation failures.
  • Use in-memory records and temporary paths only.

Output:

  • uv run pytest verifies metadata behavior without external dependencies.

WP3.5: Independent Evaluation

Owner:

  • evaluation-agent

Actions:

  • Review the completed records and metadata builder against this contract.
  • Verify no conversion behavior, MinerU execution, remote runtime path, alternate engine, Markdown normalization, quality checks, or report content generation was added.
  • Verify samples/ remains untracked and unstaged.

Output:

  • PASS/FAIL notes with any missing acceptance criteria.

Verification Checks

Required:

  • git status --short before staging confirms samples/ remains untracked.
  • uv --version is run and result is recorded.
  • uv sync passes.
  • uv run pytest passes.
  • Targeted IR/metadata tests pass.
  • Metadata output is JSON serializable through json.dumps.
  • Tests do not require MinerU, CUDA, GPU, model files, samples/, or network.
  • No real MinerU dependency is required for default tests.
  • No model downloads occur.
  • No network calls are required.
  • No candidate engine comparison is reintroduced.
  • No conversion behavior is implemented.
  • No Markdown normalization behavior is implemented.
  • No full .report.md content generation is implemented.
  • git diff --check passes.

Recommended:

  • Keep dataclass or enum APIs small and explicit.
  • Prefer one serialization function per record over ad hoc dict mutation in tests.
  • Include tests that fail if a required metadata top-level field is omitted.
  • Use requirements-guard-agent if metadata requirements conflict between PRD.md and ARCHITECTURE.md.

Hard Failure Criteria

Sprint 3 fails and must stop for a user decision if any of these are true:

  • Metadata omits source PDF, source SHA-256, engine, engine version, pages, warnings, assets, or summary.
  • Summary omits pages processed, warning count, asset count, display formula count, inline formula count, or math render error count.
  • Public or required metadata fields require raw MinerU objects.
  • Optional bbox, confidence, or page provenance is dropped when provided.
  • Optional bbox, confidence, or page provenance is invented when absent.
  • Default tests require MinerU, CUDA, GPU, model files, network, or samples/.
  • The implementation parses PDF contents, invokes conversion behavior, normalizes Markdown, or generates full report Markdown content.
  • The implementation introduces alternate engines or runtime engine selection.
  • The implementation introduces --api-url, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
  • samples/ is staged or committed.

Acceptance Criteria

Sprint 3 is complete when:

  • src/pdf2md/ir.py exists and owns project domain records.
  • src/pdf2md/metadata.py exists and builds required metadata JSON data from project-owned records.
  • Stable block types and warning codes are defined and tested.
  • Metadata top-level fields and summary fields are tested.
  • Warning aggregation is tested.
  • Optional bbox and confidence preservation is tested.
  • Metadata JSON serializability is tested.
  • No conversion, MinerU, Markdown normalization, quality check, full report generation, or doctor behavior is implemented.
  • uv sync passes.
  • uv run pytest passes.
  • PROGRESS.md records checks performed and residual risks.
  • Independent evaluation is complete.
  • The completed change is committed.

Handoff Fields

Use these fields when Sprint 3 completes:

  • Files changed:
  • Commands run:
  • Tests passed:
  • Tests blocked:
  • Known failures:
  • Residual risks:
  • User decisions needed:
  • Go/no-go recommendation for Sprint 4:
  • Next action: