# Sprint 3 Contract: Domain Records, Metadata, And Warning Model

Status: Completed
Last updated: 2026-05-07

## Objective

Define project-owned domain records, warning records, and metadata JSON construction before binding the system to MinerU output.

Sprint 3 must establish:

- Internal records for documents, pages, blocks, assets, warnings, and conversion outputs.
- Stable warning code and severity definitions aligned with `ARCHITECTURE.md`.
- A metadata builder that produces the required v1 top-level and summary fields.
- Warning aggregation behavior that later report generation can consume.
- Fast unit tests that do not require MinerU, model files, GPU, sample PDFs, or network.

Sprint 3 is schema and metadata modeling only. It must not run MinerU, parse PDFs, normalize Markdown, generate final report Markdown content, expose a working `convert` command, or add remote/runtime engine behavior.

## Current Precondition

Sprint 2 is complete:

- `src/pdf2md/paths.py` owns input discovery and output path planning.
- `tests/test_paths.py` verifies directory recursion, non-ASCII filenames, overwrite conflict detection, duplicate planned outputs, and output-root escape prevention.
- `uv run pytest` passed 21 tests.

Sprint 3 may use path planning records as context, but it should not depend on actual conversion output.

## Touched Surfaces

Allowed:

- `src/pdf2md/ir.py`
- `src/pdf2md/metadata.py`
- `src/pdf2md/report.py` only for a minimal type boundary if metadata/report handoff cannot be expressed cleanly without it
- `src/pdf2md/__init__.py` only if exporting a minimal stable type is necessary and tested
- `tests/test_ir.py` or `tests/unit/test_ir.py`
- `tests/test_metadata.py` or `tests/unit/test_metadata.py`
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
- `docs/Sprints/SPRINT3CONTRACT.md`

Not allowed:

- `src/pdf2md/mineru_adapter.py`
- `src/pdf2md/markdown.py`
- `src/pdf2md/quality.py`
- `src/pdf2md/doctor.py`
- `scripts/`
- Any real MinerU invocation
- Any model download or install script
- Any PDF content parsing
- Any Markdown normalization behavior
- Any `.report.md` content generation beyond a minimal handoff type if absolutely needed
- Any working `pdf2md convert` or `pdf2md doctor` behavior
- Any committed file under `samples/`

## Expected Outputs

Sprint 3 should produce:

1. Domain records
   - `DocumentRecord` or equivalent project-owned record.
   - `PageRecord` or equivalent with page index and optional page dimensions.
   - `BlockRecord` or equivalent with block type, optional page index, optional bbox, optional confidence, and optional Markdown character span.
   - `AssetRecord` or equivalent with stable relative path and optional source page/provenance.
   - `WarningRecord` or equivalent with code, severity, message, optional page index, and optional bbox.
   - `ConversionOutputRecord` or equivalent only if useful for connecting metadata to later orchestration; it must not invoke conversion.

2. Stable enums or constants
   - Block types aligned with `ARCHITECTURE.md`: `heading`, `paragraph`, `inline_formula`, `display_formula`, `table`, `figure`, `caption`, `footnote`, `reference`, and `unknown`.
   - Warning codes aligned with `ARCHITECTURE.md`, including at least:
     - `ENGINE_MISSING`
     - `GPU_UNAVAILABLE`
     - `LOW_CONFIDENCE_FORMULA`
     - `MATH_RENDER_FAILED`
     - `ASSET_LINK_MISSING`
     - `READING_ORDER_UNCERTAIN`
     - `STRICT_LOCAL_VIOLATION`
     - `MINERU_CLI_FAILED`
   - Warning severity values sufficient for v1 metadata and report summaries, such as `info`, `warning`, and `error`.

3. Metadata builder
   - Build a JSON-serializable metadata object with required top-level fields:
     - `source_pdf`
     - `source_sha256`
     - `created_at`
     - `engine`
     - `engine_version`
     - `engine_options`
     - `pages`
     - `assets`
     - `warnings`
     - `summary`
   - Build required summary fields:
     - `pages_processed`
     - `warning_count`
     - `asset_count`
     - `display_formula_count`
     - `inline_formula_count`
     - `math_render_error_count`
   - Preserve optional fields such as bbox and confidence only when present.
   - Require `source_sha256` as an input value. Sprint 3 should not compute hashes by reading PDFs unless the contract is explicitly amended.
   - Produce only plain Python data structures that `json.dumps` can serialize without custom encoders.

4. Warning aggregation
   - Count warnings.
   - Count math render failures from `MATH_RENDER_FAILED`.
   - Preserve warning order unless there is a tested reason to sort.
   - Preserve page-level warning data when available.

5. Tests
   - Unit tests for domain record serialization.
   - Unit tests for metadata schema creation with all required top-level fields.
   - Unit tests for summary counts.
   - Unit tests for warning aggregation.
   - Unit tests verifying that optional bbox and confidence fields are preserved only when present.
   - Unit tests verifying that metadata is JSON serializable.
   - Unit tests verifying that metadata requires source PDF, source SHA-256, engine, engine version, and page records.

6. Handoff
   - `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.

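The warning aggregation behavior listed under item 4 could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `aggregate_warnings` and the plain-dict warning shape are hypothetical names chosen for this example. Only the `MATH_RENDER_FAILED` and `GPU_UNAVAILABLE` codes and the `warning_count` and `math_render_error_count` fields come from this contract.

```python
from collections import Counter
from typing import Any, Mapping, Sequence


def aggregate_warnings(warnings: Sequence[Mapping[str, Any]]) -> dict[str, Any]:
    """Hypothetical helper: summarize warnings without reordering them."""
    by_code = Counter(w["code"] for w in warnings)
    return {
        # Copy each record in input order; optional keys such as
        # page_index are carried over only if the record has them.
        "warnings": [dict(w) for w in warnings],
        "warning_count": len(warnings),
        # Counter returns 0 for codes that never occurred.
        "math_render_error_count": by_code["MATH_RENDER_FAILED"],
    }


result = aggregate_warnings(
    [
        {"code": "MATH_RENDER_FAILED", "severity": "warning",
         "message": "formula 3", "page_index": 1},
        {"code": "GPU_UNAVAILABLE", "severity": "info",
         "message": "running on CPU"},
        {"code": "MATH_RENDER_FAILED", "severity": "warning",
         "message": "formula 9", "page_index": 4},
    ]
)
```

Copying records instead of rebuilding them is one way to keep page-level data when it exists without inventing it when it does not.
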
## Non-Goals

- Do not implement PDF conversion.
- Do not implement conversion orchestration.
- Do not implement the MinerU adapter.
- Do not run MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not parse PDF contents.
- Do not compute source SHA-256 by reading files unless this contract is explicitly amended.
- Do not implement Markdown normalization.
- Do not implement asset link checking.
- Do not implement math renderability checking.
- Do not implement full `.report.md` content generation.
- Do not implement `pdf2md convert` as a working command.
- Do not implement `pdf2md doctor`.
- Do not add runtime engine selection.
- Do not add alternate conversion engines.
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.

## Work Packages

### WP3.1: Domain Record Types

Owner:

- `metadata-agent`
- `feature-generator-agent`

Actions:

- Define small project-owned records for document/page/block/asset/warning concepts.
- Use simple, typed Python structures that are easy to serialize and test.
- Keep MinerU-specific raw objects out of public and required fields.

Output:

- `ir.py` contains the minimal domain model needed by metadata construction.

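As an illustration of the "simple, typed Python structures" this work package asks for, a record might look roughly like the sketch below. The field names, the `BBox` alias, and the `to_dict` method are assumptions made for this example; the contract only requires `BlockRecord`, `PageRecord`, or equivalents, and does not fix their layout.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple

# Hypothetical bbox layout: (x0, y0, x1, y1). The contract does not fix one.
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class BlockRecord:
    block_type: str
    page_index: Optional[int] = None
    bbox: Optional[BBox] = None
    confidence: Optional[float] = None

    def to_dict(self) -> dict[str, Any]:
        # Optional fields are emitted only when a value was provided,
        # so absent provenance is never invented in the output.
        data: dict[str, Any] = {"type": self.block_type}
        if self.page_index is not None:
            data["page_index"] = self.page_index
        if self.bbox is not None:
            data["bbox"] = list(self.bbox)
        if self.confidence is not None:
            data["confidence"] = self.confidence
        return data


@dataclass(frozen=True)
class PageRecord:
    page_index: int
    blocks: Tuple[BlockRecord, ...] = ()

    def to_dict(self) -> dict[str, Any]:
        return {
            "page_index": self.page_index,
            "blocks": [block.to_dict() for block in self.blocks],
        }


page = PageRecord(
    page_index=0,
    blocks=(
        BlockRecord("display_formula", page_index=0, confidence=0.5),
        BlockRecord("paragraph"),
    ),
)
```

Frozen dataclasses holding only built-in types keep MinerU objects out of the model by construction, and a single `to_dict` per record avoids ad hoc dict mutation in tests.
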
### WP3.2: Warning Codes And Severities

Owner:

- `metadata-agent`
- `feature-generator-agent`

Actions:

- Define stable warning codes from `ARCHITECTURE.md`.
- Define severity values and validate warning records against them.
- Avoid inventing speculative warning categories beyond the known v1 set unless needed by tests.

Output:

- Warnings are structured, countable, and stable across later sprints.

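One possible way to define the codes and validate records against them is sketched below. The class names `WarningCode` and `Severity`, the string-valued enums, and the coercion in `__post_init__` are assumptions for illustration; the code and severity values themselves are the ones listed in this contract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class WarningCode(str, Enum):
    ENGINE_MISSING = "ENGINE_MISSING"
    GPU_UNAVAILABLE = "GPU_UNAVAILABLE"
    LOW_CONFIDENCE_FORMULA = "LOW_CONFIDENCE_FORMULA"
    MATH_RENDER_FAILED = "MATH_RENDER_FAILED"
    ASSET_LINK_MISSING = "ASSET_LINK_MISSING"
    READING_ORDER_UNCERTAIN = "READING_ORDER_UNCERTAIN"
    STRICT_LOCAL_VIOLATION = "STRICT_LOCAL_VIOLATION"
    MINERU_CLI_FAILED = "MINERU_CLI_FAILED"


class Severity(str, Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"


@dataclass(frozen=True)
class WarningRecord:
    code: WarningCode
    severity: Severity
    message: str
    page_index: Optional[int] = None

    def __post_init__(self) -> None:
        # Enum lookup by value raises ValueError for unknown codes or
        # severities, which is the validation step in this sketch.
        object.__setattr__(self, "code", WarningCode(self.code))
        object.__setattr__(self, "severity", Severity(self.severity))

    def to_dict(self) -> dict[str, Any]:
        data: dict[str, Any] = {
            "code": self.code.value,
            "severity": self.severity.value,
            "message": self.message,
        }
        if self.page_index is not None:
            data["page_index"] = self.page_index
        return data


record = WarningRecord("MATH_RENDER_FAILED", "warning", "formula did not render", page_index=2)
```

Emitting `.value` strings in `to_dict` keeps the serialized form independent of the enum classes, so later sprints can count warnings by code from plain JSON.
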
### WP3.3: Metadata Builder

Owner:

- `metadata-agent`
- `feature-generator-agent`

Actions:

- Build required metadata JSON data from project-owned records.
- Preserve optional provenance fields only when present.
- Require source PDF path, source SHA-256, engine, engine version, pages, assets, warnings, and engine options as explicit inputs.

Output:

- `metadata.py` produces the required v1 metadata object without MinerU execution.

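A builder along these lines could satisfy the explicit-input requirement. This is a sketch under stated assumptions: the function name `build_metadata`, the keyword-only signature, the already-serialized dict inputs, the `created_at` default, and the sample values (`"paper.pdf"`, `"mineru"`) are all illustrative, not the project's actual API. The output field names are the ones this contract requires.

```python
import json
from datetime import datetime, timezone
from typing import Any, Mapping, Optional, Sequence


def build_metadata(
    *,
    source_pdf: str,
    source_sha256: str,
    engine: str,
    engine_version: str,
    engine_options: Mapping[str, Any],
    pages: Sequence[Mapping[str, Any]],
    assets: Sequence[Mapping[str, Any]],
    warnings: Sequence[Mapping[str, Any]],
    created_at: Optional[str] = None,
) -> dict[str, Any]:
    # The hash is taken as an input; nothing here reads the PDF.
    for name, value in (
        ("source_pdf", source_pdf),
        ("source_sha256", source_sha256),
        ("engine", engine),
        ("engine_version", engine_version),
    ):
        if not isinstance(value, str) or not value:
            raise ValueError(f"{name} is required")

    block_types = [b["type"] for p in pages for b in p.get("blocks", [])]
    return {
        "source_pdf": source_pdf,
        "source_sha256": source_sha256,
        "created_at": created_at or datetime.now(timezone.utc).isoformat(),
        "engine": engine,
        "engine_version": engine_version,
        "engine_options": dict(engine_options),
        "pages": [dict(p) for p in pages],
        "assets": [dict(a) for a in assets],
        "warnings": [dict(w) for w in warnings],
        "summary": {
            "pages_processed": len(pages),
            "warning_count": len(warnings),
            "asset_count": len(assets),
            "display_formula_count": block_types.count("display_formula"),
            "inline_formula_count": block_types.count("inline_formula"),
            "math_render_error_count": sum(
                1 for w in warnings if w.get("code") == "MATH_RENDER_FAILED"
            ),
        },
    }


metadata = build_metadata(
    source_pdf="paper.pdf",
    source_sha256="0" * 64,
    engine="mineru",
    engine_version="3.1.0",
    engine_options={},
    pages=[{"page_index": 0, "blocks": [{"type": "display_formula"}, {"type": "paragraph"}]}],
    assets=[],
    warnings=[{"code": "MATH_RENDER_FAILED", "severity": "warning", "message": "m"}],
    created_at="2026-05-07T00:00:00+00:00",
)
serialized = json.dumps(metadata)  # no custom encoder needed
```

Keyword-only parameters without defaults make a missing input fail at the call site with a `TypeError`, while the explicit check rejects empty strings.
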
### WP3.4: Metadata And Warning Tests

Owner:

- `feature-generator-agent`
- `evaluation-agent`

Actions:

- Add focused unit tests for schema, counts, optional fields, JSON serialization, and validation failures.
- Use in-memory records and temporary paths only.

Output:

- `uv run pytest` verifies metadata behavior without external dependencies.

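A schema test of the kind described above might be shaped like this sketch. The helper `missing_fields` and the test function names are hypothetical; the field lists are the required top-level and summary fields from this contract. The tests use only in-memory dicts and plain `assert`, so pytest can collect them without MinerU, sample PDFs, or network access.

```python
import json
from typing import Any

REQUIRED_TOP_LEVEL = (
    "source_pdf", "source_sha256", "created_at", "engine", "engine_version",
    "engine_options", "pages", "assets", "warnings", "summary",
)
REQUIRED_SUMMARY = (
    "pages_processed", "warning_count", "asset_count",
    "display_formula_count", "inline_formula_count", "math_render_error_count",
)


def missing_fields(metadata: dict[str, Any]) -> list[str]:
    """Hypothetical helper: dotted names of required fields that are absent."""
    missing = [name for name in REQUIRED_TOP_LEVEL if name not in metadata]
    summary = metadata.get("summary")
    if isinstance(summary, dict):
        missing += [f"summary.{name}" for name in REQUIRED_SUMMARY if name not in summary]
    return missing


def make_metadata() -> dict[str, Any]:
    # In-memory stand-in for builder output; values are placeholders.
    metadata: dict[str, Any] = {name: "" for name in REQUIRED_TOP_LEVEL}
    metadata.update(engine_options={}, pages=[], assets=[], warnings=[])
    metadata["summary"] = {name: 0 for name in REQUIRED_SUMMARY}
    return metadata


def test_complete_metadata_passes() -> None:
    metadata = make_metadata()
    assert missing_fields(metadata) == []
    # json.dumps raises TypeError if a custom encoder would be needed.
    assert json.loads(json.dumps(metadata)) == metadata


def test_omitted_fields_are_reported() -> None:
    metadata = make_metadata()
    del metadata["source_sha256"]
    del metadata["summary"]["asset_count"]
    assert missing_fields(metadata) == ["source_sha256", "summary.asset_count"]
```
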
### WP3.5: Independent Evaluation

Owner:

- `evaluation-agent`

Actions:

- Review the completed records and metadata builder against this contract.
- Verify no conversion behavior, MinerU execution, remote runtime path, alternate engine, Markdown normalization, quality checks, or report content generation was added.
- Verify `samples/` remains untracked and unstaged.

Output:

- PASS/FAIL notes with any missing acceptance criteria.

## Verification Checks

Required:

- `git status --short` before staging confirms `samples/` remains untracked.
- `uv --version` is run and the result is recorded.
- `uv sync` passes.
- `uv run pytest` passes.
- Targeted IR/metadata tests pass.
- Metadata output is JSON serializable through `json.dumps`.
- Tests do not require MinerU, CUDA, GPU, model files, `samples/`, or network.
- No real MinerU dependency is required for default tests.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion behavior is implemented.
- No Markdown normalization behavior is implemented.
- No full `.report.md` content generation is implemented.
- `git diff --check` passes.

Recommended:

- Keep dataclass or enum APIs small and explicit.
- Prefer one serialization function per record over ad hoc dict mutation in tests.
- Include tests that fail if a required metadata top-level field is omitted.
- Use `requirements-guard-agent` if metadata requirements conflict between `PRD.md` and `ARCHITECTURE.md`.

## Hard Failure Criteria

Sprint 3 fails and must stop for a user decision if any of these are true:

- Metadata omits source PDF, source SHA-256, engine, engine version, pages, warnings, assets, or summary.
- Summary omits pages processed, warning count, asset count, display formula count, inline formula count, or math render error count.
- Public or required metadata fields require raw MinerU objects.
- Optional bbox, confidence, or page provenance is dropped when provided.
- Optional bbox, confidence, or page provenance is invented when absent.
- Default tests require MinerU, CUDA, GPU, model files, network, or `samples/`.
- The implementation parses PDF contents, invokes conversion behavior, normalizes Markdown, or generates full report Markdown content.
- The implementation introduces alternate engines or runtime engine selection.
- The implementation introduces `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- `samples/` is staged or committed.

## Acceptance Criteria

Sprint 3 is complete when:

- `src/pdf2md/ir.py` exists and owns project domain records.
- `src/pdf2md/metadata.py` exists and builds required metadata JSON data from project-owned records.
- Stable block types and warning codes are defined and tested.
- Metadata top-level fields and summary fields are tested.
- Warning aggregation is tested.
- Optional bbox and confidence preservation is tested.
- Metadata JSON serializability is tested.
- No conversion, MinerU, Markdown normalization, quality check, full report generation, or doctor behavior is implemented.
- `uv sync` passes.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.

## Handoff Fields

Use these fields when Sprint 3 completes:

- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 4:
- Next action: