Sprint 3 Contract: Domain Records, Metadata, And Warning Model
Status: Completed
Last updated: 2026-05-07
Objective
Define project-owned domain records, warning records, and metadata JSON construction before binding the system to MinerU output.
Sprint 3 must establish:
- Internal records for documents, pages, blocks, assets, warnings, and conversion outputs.
- Stable warning code and severity definitions aligned with ARCHITECTURE.md.
- A metadata builder that produces the required v1 top-level and summary fields.
- Warning aggregation behavior that later report generation can consume.
- Fast unit tests that do not require MinerU, model files, GPU, sample PDFs, or network.
Sprint 3 is schema and metadata modeling only. It must not run MinerU, parse PDFs, normalize Markdown, generate final report Markdown content, expose a working convert command, or add remote/runtime engine behavior.
Current Precondition
Sprint 2 is complete:
- src/pdf2md/paths.py owns input discovery and output path planning.
- tests/test_paths.py verifies directory recursion, non-ASCII filenames, overwrite conflict detection, duplicate planned outputs, and output-root escape prevention.
- uv run pytest passed 21 tests.
Sprint 3 may use path planning records as context, but it should not depend on actual conversion output.
Touched Surfaces
Allowed:
- src/pdf2md/ir.py
- src/pdf2md/metadata.py
- src/pdf2md/report.py only for a minimal type boundary if metadata/report handoff cannot be expressed cleanly without it
- src/pdf2md/__init__.py only if exporting a minimal stable type is necessary and tested
- tests/test_ir.py or tests/unit/test_ir.py
- tests/test_metadata.py or tests/unit/test_metadata.py
- PLAN.md only for current-goal coordination updates required by the shared agent workflow
- PROGRESS.md
- docs/V1IMPLEMENTATIONPLAN.md only if sequencing or constraints need adjustment
- docs/Sprints/SPRINT3CONTRACT.md
Not allowed:
- src/pdf2md/mineru_adapter.py
- src/pdf2md/markdown.py
- src/pdf2md/quality.py
- src/pdf2md/doctor.py
- scripts/
- Any real MinerU invocation
- Any model download or install script
- Any PDF content parsing
- Any Markdown normalization behavior
- Any .report.md content generation beyond a minimal handoff type if absolutely needed
- Any working pdf2md convert or pdf2md doctor behavior
- Any committed file under samples/
Expected Outputs
Sprint 3 should produce:
- Domain records
  - DocumentRecord or equivalent project-owned record.
  - PageRecord or equivalent with page index and optional page dimensions.
  - BlockRecord or equivalent with block type, optional page index, optional bbox, optional confidence, and optional Markdown character span.
  - AssetRecord or equivalent with stable relative path and optional source page/provenance.
  - WarningRecord or equivalent with code, severity, message, optional page index, and optional bbox.
  - ConversionOutputRecord or equivalent only if useful for connecting metadata to later orchestration; it must not invoke conversion.
- Stable enums or constants
  - Block types aligned with ARCHITECTURE.md: heading, paragraph, inline_formula, display_formula, table, figure, caption, footnote, reference, and unknown.
  - Warning codes aligned with ARCHITECTURE.md, including at least:
    - ENGINE_MISSING
    - GPU_UNAVAILABLE
    - LOW_CONFIDENCE_FORMULA
    - MATH_RENDER_FAILED
    - ASSET_LINK_MISSING
    - READING_ORDER_UNCERTAIN
    - STRICT_LOCAL_VIOLATION
    - MINERU_CLI_FAILED
  - Warning severity values sufficient for v1 metadata and report summaries, such as info, warning, and error.
- Metadata builder
  - Build a JSON-serializable metadata object with required top-level fields:
    - source_pdf
    - source_sha256
    - created_at
    - engine
    - engine_version
    - engine_options
    - pages
    - assets
    - warnings
    - summary
  - Build required summary fields:
    - pages_processed
    - warning_count
    - asset_count
    - display_formula_count
    - inline_formula_count
    - math_render_error_count
  - Preserve optional fields such as bbox and confidence only when present.
  - Require source_sha256 as an input value. Sprint 3 should not compute hashes by reading PDFs unless the contract is explicitly amended.
  - Produce only plain Python data structures that json.dumps can serialize without custom encoders.
- Warning aggregation
  - Count warnings.
  - Count math render failures from MATH_RENDER_FAILED.
  - Preserve warning order unless there is a tested reason to sort.
  - Preserve page-level warning data when available.
- Tests
  - Unit tests for domain record serialization.
  - Unit tests for metadata schema creation with all required top-level fields.
  - Unit tests for summary counts.
  - Unit tests for warning aggregation.
  - Unit tests that optional bbox and confidence fields are preserved only when present.
  - Unit tests that metadata is JSON serializable.
  - Unit tests that metadata requires source PDF, source SHA-256, engine, engine version, and page records.
- Handoff
  - PROGRESS.md records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
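The warning aggregation rules listed under Expected Outputs can be illustrated with a minimal sketch. It is not the project's actual implementation: the WarningRecord shape, the aggregate_warnings name, and the page_index key are assumptions made for illustration. Only the MATH_RENDER_FAILED code and the warning_count and math_render_error_count field names come from this contract.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class WarningRecord:
    # Hypothetical record shape; the contract fixes the fields, not the type.
    code: str
    severity: str
    message: str
    page_index: Optional[int] = None


def aggregate_warnings(warnings: list[WarningRecord]) -> dict[str, Any]:
    """Serialize warnings in input order and derive the two warning counts."""
    serialized = []
    for warning in warnings:
        item: dict[str, Any] = {
            "code": warning.code,
            "severity": warning.severity,
            "message": warning.message,
        }
        # Page-level data is kept only when the caller supplied it.
        if warning.page_index is not None:
            item["page_index"] = warning.page_index
        serialized.append(item)
    return {
        "warnings": serialized,  # same order as the input list, never sorted
        "warning_count": len(warnings),
        "math_render_error_count": sum(
            1 for warning in warnings if warning.code == "MATH_RENDER_FAILED"
        ),
    }
```

Keeping the input order means a later report can list warnings in the order they were raised without re-sorting.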
Non-Goals
- Do not implement PDF conversion.
- Do not implement conversion orchestration.
- Do not implement the MinerU adapter.
- Do not run MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not parse PDF contents.
- Do not compute source SHA-256 by reading files unless this contract is explicitly amended.
- Do not implement Markdown normalization.
- Do not implement asset link checking.
- Do not implement math renderability checking.
- Do not implement full .report.md content generation.
- Do not implement pdf2md convert as a working command.
- Do not implement pdf2md doctor.
- Do not add runtime engine selection.
- Do not add alternate conversion engines.
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.
Work Packages
WP3.1: Domain Record Types
Owner:
- metadata-agent
- feature-generator-agent
Actions:
- Define small project-owned records for document/page/block/asset/warning concepts.
- Use simple, typed Python structures that are easy to serialize and test.
- Keep MinerU-specific raw objects out of public and required fields.
Output:
- ir.py contains the minimal domain model needed by metadata construction.
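As a rough illustration of WP3.1, one record could look like the sketch below. The block type names come from this contract. The field names, the to_dict method, and the dictionary keys are assumptions and may differ from what ir.py actually defines.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Block type names as listed in this contract.
BLOCK_TYPES = frozenset({
    "heading", "paragraph", "inline_formula", "display_formula", "table",
    "figure", "caption", "footnote", "reference", "unknown",
})


@dataclass(frozen=True)
class BlockRecord:
    # Hypothetical field names; only block_type is mandatory.
    block_type: str
    page_index: Optional[int] = None
    bbox: Optional[tuple[float, float, float, float]] = None
    confidence: Optional[float] = None
    markdown_span: Optional[tuple[int, int]] = None

    def __post_init__(self) -> None:
        if self.block_type not in BLOCK_TYPES:
            raise ValueError(f"unknown block type: {self.block_type!r}")

    def to_dict(self) -> dict[str, Any]:
        """Return plain data; optional keys appear only when a value was given."""
        data: dict[str, Any] = {"type": self.block_type}
        if self.page_index is not None:
            data["page_index"] = self.page_index
        if self.bbox is not None:
            data["bbox"] = list(self.bbox)  # lists survive a JSON round trip unchanged
        if self.confidence is not None:
            data["confidence"] = self.confidence
        if self.markdown_span is not None:
            data["markdown_span"] = list(self.markdown_span)
        return data
```

Comparing against None rather than testing truthiness keeps legitimate values such as page index 0 or confidence 0.0 from being dropped.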
WP3.2: Warning Codes And Severities
Owner:
- metadata-agent
- feature-generator-agent
Actions:
- Define stable warning codes from ARCHITECTURE.md.
- Define severity values and validate warning records against them.
- Avoid inventing speculative warning categories beyond the known v1 set unless needed by tests.
Output:
- Warnings are structured, countable, and stable across later sprints.
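One possible way to express WP3.2 is with string-valued enums, sketched below. The code and severity values are the ones named in this contract. The class names and the validate_warning helper are hypothetical, and ARCHITECTURE.md may define further codes not shown here.

```python
from enum import Enum


class WarningCode(str, Enum):
    # The minimum set named in this contract.
    ENGINE_MISSING = "ENGINE_MISSING"
    GPU_UNAVAILABLE = "GPU_UNAVAILABLE"
    LOW_CONFIDENCE_FORMULA = "LOW_CONFIDENCE_FORMULA"
    MATH_RENDER_FAILED = "MATH_RENDER_FAILED"
    ASSET_LINK_MISSING = "ASSET_LINK_MISSING"
    READING_ORDER_UNCERTAIN = "READING_ORDER_UNCERTAIN"
    STRICT_LOCAL_VIOLATION = "STRICT_LOCAL_VIOLATION"
    MINERU_CLI_FAILED = "MINERU_CLI_FAILED"


class Severity(str, Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"


def validate_warning(code: str, severity: str) -> tuple[WarningCode, Severity]:
    """Return enum members, raising ValueError for values outside the known sets."""
    # Enum lookup by value raises ValueError when no member matches.
    return WarningCode(code), Severity(severity)
```

Writing the members' .value into metadata keeps the output as plain strings, which fits the requirement that json.dumps needs no custom encoder.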
WP3.3: Metadata Builder
Owner:
- metadata-agent
- feature-generator-agent
Actions:
- Build required metadata JSON data from project-owned records.
- Preserve optional provenance fields only when present.
- Require source PDF path, source SHA-256, engine, engine version, pages, assets, warnings, and engine options as explicit inputs.
Output:
- metadata.py produces the required v1 metadata object without MinerU execution.
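The builder described in WP3.3 might look roughly like the following sketch. The top-level and summary field names are taken from this contract. The build_metadata signature, the use of already-serialized dicts as inputs, and the per-page "blocks" and "type" keys are assumptions, not the actual metadata.py API.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def build_metadata(
    *,
    source_pdf: str,
    source_sha256: str,  # supplied by the caller; never computed from the PDF here
    engine: str,
    engine_version: str,
    engine_options: dict[str, Any],
    pages: list[dict[str, Any]],
    assets: list[dict[str, Any]],
    warnings: list[dict[str, Any]],
    created_at: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble the v1 metadata object from already-serialized records."""
    required = {
        "source_pdf": source_pdf,
        "source_sha256": source_sha256,
        "engine": engine,
        "engine_version": engine_version,
    }
    for name, value in required.items():
        if not value:
            raise ValueError(f"{name} must be a non-empty string")

    # Assumed layout: each page dict may carry a "blocks" list of block dicts.
    blocks = [block for page in pages for block in page.get("blocks", [])]

    def count_blocks(block_type: str) -> int:
        return sum(1 for block in blocks if block.get("type") == block_type)

    return {
        **required,
        "created_at": created_at or datetime.now(timezone.utc).isoformat(),
        "engine_options": engine_options,
        "pages": pages,
        "assets": assets,
        "warnings": warnings,  # order preserved as given
        "summary": {
            "pages_processed": len(pages),
            "warning_count": len(warnings),
            "asset_count": len(assets),
            "display_formula_count": count_blocks("display_formula"),
            "inline_formula_count": count_blocks("inline_formula"),
            "math_render_error_count": sum(
                1 for w in warnings if w.get("code") == "MATH_RENDER_FAILED"
            ),
        },
    }
```

Keyword-only parameters without defaults make every required input explicit at the call site, so omitting one raises a TypeError before any metadata is built.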
WP3.4: Metadata And Warning Tests
Owner:
- feature-generator-agent
- evaluation-agent
Actions:
- Add focused unit tests for schema, counts, optional fields, JSON serialization, and validation failures.
- Use in-memory records and temporary paths only.
Output:
- uv run pytest verifies metadata behavior without external dependencies.
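For the JSON serialization tests in WP3.4, a small helper along these lines may be useful. The name assert_json_round_trip is hypothetical. The check is deliberately stricter than calling json.dumps alone: it also rejects tuples and non-string keys, which serialize without error but do not come back unchanged.

```python
import json
from typing import Any


def assert_json_round_trip(metadata: dict[str, Any]) -> None:
    """Fail unless metadata is plain data that survives json.dumps and json.loads."""
    # json.dumps raises TypeError for values such as Path or datetime objects.
    encoded = json.dumps(metadata)
    # Tuples come back as lists and int keys as strings, so equality fails for them.
    assert json.loads(encoded) == metadata, "metadata changed during JSON round trip"
```

A test can call the helper on metadata built from in-memory records, so no sample PDF, MinerU install, or network access is involved.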
WP3.5: Independent Evaluation
Owner:
evaluation-agent
Actions:
- Review the completed records and metadata builder against this contract.
- Verify no conversion behavior, MinerU execution, remote runtime path, alternate engine, Markdown normalization, quality checks, or report content generation was added.
- Verify samples/ remains untracked and unstaged.
Output:
- PASS/FAIL notes with any missing acceptance criteria.
Verification Checks
Required:
- git status --short before staging confirms samples/ remains untracked.
- uv --version is run and result is recorded.
- uv sync passes.
- uv run pytest passes.
- Targeted IR/metadata tests pass.
- Metadata output is JSON serializable through json.dumps.
- Tests do not require MinerU, CUDA, GPU, model files, samples/, or network.
- No real MinerU dependency is required for default tests.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion behavior is implemented.
- No Markdown normalization behavior is implemented.
- No full .report.md content generation is implemented.
- git diff --check passes.
Recommended:
- Keep dataclass or enum APIs small and explicit.
- Prefer one serialization function per record over ad hoc dict mutation in tests.
- Include tests that fail if a required metadata top-level field is omitted.
- Use requirements-guard-agent if metadata requirements conflict between PRD.md and ARCHITECTURE.md.
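The recommendation to include tests that fail when a required top-level field is omitted could be backed by a check like the sketch below. The field names are the ones this contract requires. The missing_fields helper and its dotted summary.* naming are hypothetical.

```python
from typing import Any

REQUIRED_TOP_LEVEL = (
    "source_pdf", "source_sha256", "created_at", "engine", "engine_version",
    "engine_options", "pages", "assets", "warnings", "summary",
)
REQUIRED_SUMMARY = (
    "pages_processed", "warning_count", "asset_count",
    "display_formula_count", "inline_formula_count", "math_render_error_count",
)


def missing_fields(metadata: dict[str, Any]) -> list[str]:
    """List required keys that are absent, in contract order."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in metadata]
    summary = metadata.get("summary")
    # Summary keys are checked only when a summary object is present at all.
    if isinstance(summary, dict):
        missing += [f"summary.{key}" for key in REQUIRED_SUMMARY if key not in summary]
    return missing
```

A test can delete one key at a time from a known-good metadata dict and assert that exactly that key is reported.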
Hard Failure Criteria
Sprint 3 fails and must stop for a user decision if any of these are true:
- Metadata omits source PDF, source SHA-256, engine, engine version, pages, warnings, assets, or summary.
- Summary omits pages processed, warning count, asset count, display formula count, inline formula count, or math render error count.
- Public or required metadata fields require raw MinerU objects.
- Optional bbox, confidence, or page provenance is dropped when provided.
- Optional bbox, confidence, or page provenance is invented when absent.
- Default tests require MinerU, CUDA, GPU, model files, network, or samples/.
- The implementation parses PDF contents, invokes conversion behavior, normalizes Markdown, or generates full report Markdown content.
- The implementation introduces alternate engines or runtime engine selection.
- The implementation introduces --api-url, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- samples/ is staged or committed.
Acceptance Criteria
Sprint 3 is complete when:
- src/pdf2md/ir.py exists and owns project domain records.
- src/pdf2md/metadata.py exists and builds required metadata JSON data from project-owned records.
- Stable block types and warning codes are defined and tested.
- Metadata top-level fields and summary fields are tested.
- Warning aggregation is tested.
- Optional bbox and confidence preservation is tested.
- Metadata JSON serializability is tested.
- No conversion, MinerU, Markdown normalization, quality check, full report generation, or doctor behavior is implemented.
- uv sync passes.
- uv run pytest passes.
- PROGRESS.md records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.
Handoff Fields
Use these fields when Sprint 3 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 4:
- Next action: