# Sprint 3 Contract: Domain Records, Metadata, And Warning Model

Status: Completed

Last updated: 2026-05-07

## Objective

Define project-owned domain records, warning records, and metadata JSON construction before binding the system to MinerU output.

Sprint 3 must establish:

- Internal records for documents, pages, blocks, assets, warnings, and conversion outputs.
- Stable warning code and severity definitions aligned with `ARCHITECTURE.md`.
- A metadata builder that produces the required v1 top-level and summary fields.
- Warning aggregation behavior that later report generation can consume.
- Fast unit tests that do not require MinerU, model files, GPU, sample PDFs, or network.

Sprint 3 is schema and metadata modeling only. It must not run MinerU, parse PDFs, normalize Markdown, generate final report Markdown content, expose a working `convert` command, or add remote/runtime engine behavior.

## Current Precondition

Sprint 2 is complete:

- `src/pdf2md/paths.py` owns input discovery and output path planning.
- `tests/test_paths.py` verifies directory recursion, non-ASCII filenames, overwrite conflict detection, duplicate planned outputs, and output-root escape prevention.
- `uv run pytest` passed 21 tests.

Sprint 3 may use path planning records as context, but it should not depend on actual conversion output.

## Touched Surfaces

Allowed:

- `src/pdf2md/ir.py`
- `src/pdf2md/metadata.py`
- `src/pdf2md/report.py` only for a minimal type boundary if metadata/report handoff cannot be expressed cleanly without it
- `src/pdf2md/__init__.py` only if exporting a minimal stable type is necessary and tested
- `tests/test_ir.py` or `tests/unit/test_ir.py`
- `tests/test_metadata.py` or `tests/unit/test_metadata.py`
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
- `docs/Sprints/SPRINT3CONTRACT.md`

Not allowed:

- `src/pdf2md/mineru_adapter.py`
- `src/pdf2md/markdown.py`
- `src/pdf2md/quality.py`
- `src/pdf2md/doctor.py`
- `scripts/`
- Any real MinerU invocation
- Any model download or install script
- Any PDF content parsing
- Any Markdown normalization behavior
- Any `.report.md` content generation beyond a minimal handoff type if absolutely needed
- Any working `pdf2md convert` or `pdf2md doctor` behavior
- Any committed file under `samples/`

## Expected Outputs

Sprint 3 should produce:

1. Domain records
   - `DocumentRecord` or equivalent project-owned record.
   - `PageRecord` or equivalent with page index and optional page dimensions.
   - `BlockRecord` or equivalent with block type, optional page index, optional bbox, optional confidence, and optional Markdown character span.
   - `AssetRecord` or equivalent with stable relative path and optional source page/provenance.
   - `WarningRecord` or equivalent with code, severity, message, optional page index, and optional bbox.
   - `ConversionOutputRecord` or equivalent only if useful for connecting metadata to later orchestration; it must not invoke conversion.
2. Stable enums or constants
   - Block types aligned with `ARCHITECTURE.md`: `heading`, `paragraph`, `inline_formula`, `display_formula`, `table`, `figure`, `caption`, `footnote`, `reference`, and `unknown`.
   - Warning codes aligned with `ARCHITECTURE.md`, including at least:
     - `ENGINE_MISSING`
     - `GPU_UNAVAILABLE`
     - `LOW_CONFIDENCE_FORMULA`
     - `MATH_RENDER_FAILED`
     - `ASSET_LINK_MISSING`
     - `READING_ORDER_UNCERTAIN`
     - `STRICT_LOCAL_VIOLATION`
     - `MINERU_CLI_FAILED`
   - Warning severity values sufficient for v1 metadata and report summaries, such as `info`, `warning`, and `error`.
3. Metadata builder
   - Build a JSON-serializable metadata object with required top-level fields:
     - `source_pdf`
     - `source_sha256`
     - `created_at`
     - `engine`
     - `engine_version`
     - `engine_options`
     - `pages`
     - `assets`
     - `warnings`
     - `summary`
   - Build required summary fields:
     - `pages_processed`
     - `warning_count`
     - `asset_count`
     - `display_formula_count`
     - `inline_formula_count`
     - `math_render_error_count`
   - Preserve optional fields such as bbox and confidence only when present.
   - Require `source_sha256` as an input value. Sprint 3 should not compute hashes by reading PDFs unless the contract is explicitly amended.
   - Produce only plain Python data structures that `json.dumps` can serialize without custom encoders.
4. Warning aggregation
   - Count warnings.
   - Count math render failures from `MATH_RENDER_FAILED`.
   - Preserve warning order unless there is a tested reason to sort.
   - Preserve page-level warning data when available.
5. Tests
   - Unit tests for domain record serialization.
   - Unit tests for metadata schema creation with all required top-level fields.
   - Unit tests for summary counts.
   - Unit tests for warning aggregation.
   - Unit tests that optional bbox and confidence fields are preserved only when present.
   - Unit tests that metadata is JSON serializable.
   - Unit tests that metadata requires source PDF, source SHA-256, engine, engine version, and page records.
6. Handoff
   - `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.

## Non-Goals

- Do not implement PDF conversion.
- Do not implement conversion orchestration.
- Do not implement the MinerU adapter.
- Do not run MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not parse PDF contents.
- Do not compute source SHA-256 by reading files unless this contract is explicitly amended.
- Do not implement Markdown normalization.
- Do not implement asset link checking.
- Do not implement math renderability checking.
- Do not implement full `.report.md` content generation.
- Do not implement `pdf2md convert` as a working command.
- Do not implement `pdf2md doctor`.
- Do not add runtime engine selection.
- Do not add alternate conversion engines.
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.

## Work Packages

### WP3.1: Domain Record Types

Owner:

- `metadata-agent`
- `feature-generator-agent`

Actions:

- Define small project-owned records for document/page/block/asset/warning concepts.
- Use simple, typed Python structures that are easy to serialize and test.
- Keep MinerU-specific raw objects out of public and required fields.

Output:

- `ir.py` contains the minimal domain model needed by metadata construction.

### WP3.2: Warning Codes And Severities

Owner:

- `metadata-agent`
- `feature-generator-agent`

Actions:

- Define stable warning codes from `ARCHITECTURE.md`.
- Define severity values and validate warning records against them.
- Avoid inventing speculative warning categories beyond the known v1 set unless needed by tests.

Output:

- Warnings are structured, countable, and stable across later sprints.

### WP3.3: Metadata Builder

Owner:

- `metadata-agent`
- `feature-generator-agent`

Actions:

- Build required metadata JSON data from project-owned records.
- Preserve optional provenance fields only when present.
- Require source PDF path, source SHA-256, engine, engine version, pages, assets, warnings, and engine options as explicit inputs.

Output:

- `metadata.py` produces the required v1 metadata object without MinerU execution.

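
To make WP3.2 and WP3.3 concrete, the following is a minimal sketch of how the warning model and metadata builder could fit together. It is not the actual contents of `ir.py` or `metadata.py`. The names `WarningCode`, `Severity`, `build_metadata`, and the `"type"` and `"blocks"` dictionary keys are assumptions made for illustration; only the warning codes, severity values, and metadata field names come from this contract. Pages and assets are passed as already-serialized dictionaries to keep the sketch short.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Any


class WarningCode(str, Enum):
    """Warning codes listed in this contract (hypothetical enum name)."""
    ENGINE_MISSING = "ENGINE_MISSING"
    GPU_UNAVAILABLE = "GPU_UNAVAILABLE"
    LOW_CONFIDENCE_FORMULA = "LOW_CONFIDENCE_FORMULA"
    MATH_RENDER_FAILED = "MATH_RENDER_FAILED"
    ASSET_LINK_MISSING = "ASSET_LINK_MISSING"
    READING_ORDER_UNCERTAIN = "READING_ORDER_UNCERTAIN"
    STRICT_LOCAL_VIOLATION = "STRICT_LOCAL_VIOLATION"
    MINERU_CLI_FAILED = "MINERU_CLI_FAILED"


class Severity(str, Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"


@dataclass(frozen=True)
class WarningRecord:
    code: WarningCode
    severity: Severity
    message: str
    page_index: int | None = None

    def to_dict(self) -> dict[str, Any]:
        data: dict[str, Any] = {
            "code": self.code.value,
            "severity": self.severity.value,
            "message": self.message,
        }
        # Page provenance is emitted only when it was provided.
        if self.page_index is not None:
            data["page_index"] = self.page_index
        return data


def build_metadata(
    *,
    source_pdf: str,
    source_sha256: str,
    created_at: str,
    engine: str,
    engine_version: str,
    engine_options: dict[str, Any],
    pages: list[dict[str, Any]],
    assets: list[dict[str, Any]],
    warnings: list[WarningRecord],
) -> dict[str, Any]:
    """Build plain, JSON-serializable metadata from explicit inputs only."""
    required = {
        "source_pdf": source_pdf,
        "source_sha256": source_sha256,
        "engine": engine,
        "engine_version": engine_version,
    }
    for name, value in required.items():
        if not value:
            raise ValueError(f"{name} is required")

    # Assumption: each page dict may carry a "blocks" list of block dicts.
    blocks = [b for page in pages for b in page.get("blocks", [])]
    summary = {
        "pages_processed": len(pages),
        "warning_count": len(warnings),
        "asset_count": len(assets),
        "display_formula_count": sum(b.get("type") == "display_formula" for b in blocks),
        "inline_formula_count": sum(b.get("type") == "inline_formula" for b in blocks),
        "math_render_error_count": sum(
            w.code is WarningCode.MATH_RENDER_FAILED for w in warnings
        ),
    }
    return {
        "source_pdf": source_pdf,
        "source_sha256": source_sha256,
        "created_at": created_at,
        "engine": engine,
        "engine_version": engine_version,
        "engine_options": dict(engine_options),
        "pages": list(pages),
        "assets": list(assets),
        # Input order is preserved; nothing is sorted.
        "warnings": [w.to_dict() for w in warnings],
        "summary": summary,
    }
```

Keyword-only arguments make every input explicit at the call site, and the SHA-256 arrives as a value so the builder never has to open a PDF. Whether an empty `pages` list should be rejected is left open here, since the contract does not say.
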
### WP3.4: Metadata And Warning Tests

Owner:

- `feature-generator-agent`
- `evaluation-agent`

Actions:

- Add focused unit tests for schema, counts, optional fields, JSON serialization, and validation failures.
- Use in-memory records and temporary paths only.

Output:

- `uv run pytest` verifies metadata behavior without external dependencies.

### WP3.5: Independent Evaluation

Owner:

- `evaluation-agent`

Actions:

- Review the completed records and metadata builder against this contract.
- Verify no conversion behavior, MinerU execution, remote runtime path, alternate engine, Markdown normalization, quality checks, or report content generation was added.
- Verify `samples/` remains untracked and unstaged.

Output:

- PASS/FAIL notes with any missing acceptance criteria.

## Verification Checks

Required:

- `git status --short` before staging confirms `samples/` remains untracked.
- `uv --version` is run and result is recorded.
- `uv sync` passes.
- `uv run pytest` passes.
- Targeted IR/metadata tests pass.
- Metadata output is JSON serializable through `json.dumps`.
- Tests do not require MinerU, CUDA, GPU, model files, `samples/`, or network.
- No real MinerU dependency is required for default tests.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion behavior is implemented.
- No Markdown normalization behavior is implemented.
- No full `.report.md` content generation is implemented.
- `git diff --check` passes.

Recommended:

- Keep dataclass or enum APIs small and explicit.
- Prefer one serialization function per record over ad hoc dict mutation in tests.
- Include tests that fail if a required metadata top-level field is omitted.
- Use `requirements-guard-agent` if metadata requirements conflict between `PRD.md` and `ARCHITECTURE.md`.

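
The recommendation to include tests that fail when a required field is omitted could be supported by a small checking helper along these lines. This is a sketch only: `missing_fields` and `is_json_serializable` are hypothetical names, not functions known to exist in the project. The field lists are the ones this contract requires.

```python
import json
from typing import Any

REQUIRED_TOP_LEVEL = (
    "source_pdf", "source_sha256", "created_at", "engine", "engine_version",
    "engine_options", "pages", "assets", "warnings", "summary",
)
REQUIRED_SUMMARY = (
    "pages_processed", "warning_count", "asset_count",
    "display_formula_count", "inline_formula_count", "math_render_error_count",
)


def missing_fields(metadata: dict[str, Any]) -> list[str]:
    """Return the names of required fields that are absent, in contract order."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in metadata]
    summary = metadata.get("summary")
    # Summary keys are checked only when a summary mapping is present;
    # a missing summary is already reported as a top-level gap.
    if isinstance(summary, dict):
        missing += [f"summary.{key}" for key in REQUIRED_SUMMARY if key not in summary]
    return missing


def is_json_serializable(metadata: dict[str, Any]) -> bool:
    """True when json.dumps accepts the object without a custom encoder."""
    try:
        json.dumps(metadata)
    except (TypeError, ValueError):
        return False
    return True
```

A test can then build a complete metadata dictionary in memory, delete one key, and assert that `missing_fields` names exactly that key. No MinerU, sample PDF, or network access is involved.
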
## Hard Failure Criteria

Sprint 3 fails and must stop for a user decision if any of these are true:

- Metadata omits source PDF, source SHA-256, engine, engine version, pages, warnings, assets, or summary.
- Summary omits pages processed, warning count, asset count, display formula count, inline formula count, or math render error count.
- Public or required metadata fields require raw MinerU objects.
- Optional bbox, confidence, or page provenance is dropped when provided.
- Optional bbox, confidence, or page provenance is invented when absent.
- Default tests require MinerU, CUDA, GPU, model files, network, or `samples/`.
- The implementation parses PDF contents, invokes conversion behavior, normalizes Markdown, or generates full report Markdown content.
- The implementation introduces alternate engines or runtime engine selection.
- The implementation introduces `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- `samples/` is staged or committed.

## Acceptance Criteria

Sprint 3 is complete when:

- `src/pdf2md/ir.py` exists and owns project domain records.
- `src/pdf2md/metadata.py` exists and builds required metadata JSON data from project-owned records.
- Stable block types and warning codes are defined and tested.
- Metadata top-level fields and summary fields are tested.
- Warning aggregation is tested.
- Optional bbox and confidence preservation is tested.
- Metadata JSON serializability is tested.
- No conversion, MinerU, Markdown normalization, quality check, full report generation, or doctor behavior is implemented.
- `uv sync` passes.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.

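
The two hard failure criteria on optional bbox, confidence, and page provenance (never dropped when provided, never invented when absent) can be illustrated with a record that serializes optional fields only when they are set. This is a hedged sketch, not the actual `BlockRecord`. The field names, the `"type"` output key, and the four-number bbox shape are assumptions, and the optional Markdown character span is left out for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

# Block types listed in this contract.
BLOCK_TYPES = frozenset({
    "heading", "paragraph", "inline_formula", "display_formula", "table",
    "figure", "caption", "footnote", "reference", "unknown",
})


@dataclass(frozen=True)
class BlockRecord:
    block_type: str
    page_index: int | None = None
    bbox: tuple[float, float, float, float] | None = None  # assumed shape
    confidence: float | None = None

    def __post_init__(self) -> None:
        if self.block_type not in BLOCK_TYPES:
            raise ValueError(f"unknown block type: {self.block_type!r}")

    def to_dict(self) -> dict[str, Any]:
        data: dict[str, Any] = {"type": self.block_type}
        # Compare against None rather than truthiness, so a legitimate
        # page index of 0 or confidence of 0.0 is not silently dropped.
        if self.page_index is not None:
            data["page_index"] = self.page_index
        if self.bbox is not None:
            data["bbox"] = list(self.bbox)  # lists serialize cleanly to JSON
        if self.confidence is not None:
            data["confidence"] = self.confidence
        return data
```

Omitting absent keys, rather than emitting `null`, keeps the output from implying provenance that was never supplied. Either convention could satisfy the contract as long as tests pin it down.
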
## Handoff Fields

Use these fields when Sprint 3 completes:

- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 4:
- Next action: