baram2584/PDFToMD

김경종 88d6b92283 add pdftomd

2026-05-08 16:42:19 +09:00

Sprint 3 Contract: Domain Records, Metadata, And Warning Model

Status: Completed
Last updated: 2026-05-07

Objective

Define project-owned domain records, warning records, and metadata JSON construction before binding the system to MinerU output.

Sprint 3 must establish:

Internal records for documents, pages, blocks, assets, warnings, and conversion outputs.
Stable warning code and severity definitions aligned with ARCHITECTURE.md.
A metadata builder that produces the required v1 top-level and summary fields.
Warning aggregation behavior that later report generation can consume.
Fast unit tests that do not require MinerU, model files, GPU, sample PDFs, or network.

Sprint 3 is schema and metadata modeling only. It must not run MinerU, parse PDFs, normalize Markdown, generate final report Markdown content, expose a working convert command, or add remote/runtime engine behavior.

Current Precondition

Sprint 2 is complete:

src/pdf2md/paths.py owns input discovery and output path planning.
tests/test_paths.py verifies directory recursion, non-ASCII filenames, overwrite conflict detection, duplicate planned outputs, and output-root escape prevention.
uv run pytest passed 21 tests.

Sprint 3 may use path planning records as context, but it should not depend on actual conversion output.

Touched Surfaces

Allowed:

src/pdf2md/ir.py
src/pdf2md/metadata.py
src/pdf2md/report.py only for a minimal type boundary if metadata/report handoff cannot be expressed cleanly without it
src/pdf2md/__init__.py only if exporting a minimal stable type is necessary and tested
tests/test_ir.py or tests/unit/test_ir.py
tests/test_metadata.py or tests/unit/test_metadata.py
PLAN.md only for current-goal coordination updates required by the shared agent workflow
PROGRESS.md
docs/V1IMPLEMENTATIONPLAN.md only if sequencing or constraints need adjustment
docs/Sprints/SPRINT3CONTRACT.md

Not allowed:

src/pdf2md/mineru_adapter.py
src/pdf2md/markdown.py
src/pdf2md/quality.py
src/pdf2md/doctor.py
scripts/
Any real MinerU invocation
Any model download or install script
Any PDF content parsing
Any Markdown normalization behavior
Any .report.md content generation beyond a minimal handoff type if absolutely needed
Any working pdf2md convert or pdf2md doctor behavior
Any committed file under samples/

Expected Outputs

Sprint 3 should produce:

Domain records
- DocumentRecord or equivalent project-owned record.
- PageRecord or equivalent with page index and optional page dimensions.
- BlockRecord or equivalent with block type, optional page index, optional bbox, optional confidence, and optional Markdown character span.
- AssetRecord or equivalent with stable relative path and optional source page/provenance.
- WarningRecord or equivalent with code, severity, message, optional page index, and optional bbox.
- ConversionOutputRecord or equivalent only if useful for connecting metadata to later orchestration; it must not invoke conversion.
Stable enums or constants
- Block types aligned with ARCHITECTURE.md: heading, paragraph, inline_formula, display_formula, table, figure, caption, footnote, reference, and unknown.
- Warning codes aligned with ARCHITECTURE.md, including at least:
  - ENGINE_MISSING
  - GPU_UNAVAILABLE
  - LOW_CONFIDENCE_FORMULA
  - MATH_RENDER_FAILED
  - ASSET_LINK_MISSING
  - READING_ORDER_UNCERTAIN
  - STRICT_LOCAL_VIOLATION
  - MINERU_CLI_FAILED
- Warning severity values sufficient for v1 metadata and report summaries, such as info, warning, and error.
Metadata builder
- Build a JSON-serializable metadata object with required top-level fields:
  - source_pdf
  - source_sha256
  - created_at
  - engine
  - engine_version
  - engine_options
  - pages
  - assets
  - warnings
  - summary
- Build required summary fields:
  - pages_processed
  - warning_count
  - asset_count
  - display_formula_count
  - inline_formula_count
  - math_render_error_count
- Preserve optional fields such as bbox and confidence only when present.
- Require source_sha256 as an input value. Sprint 3 should not compute hashes by reading PDFs unless the contract is explicitly amended.
- Produce only plain Python data structures that json.dumps can serialize without custom encoders.
Warning aggregation
- Count warnings.
- Count math render failures from MATH_RENDER_FAILED.
- Preserve warning order unless there is a tested reason to sort.
- Preserve page-level warning data when available.
Tests
- Unit tests for domain record serialization.
- Unit tests for metadata schema creation with all required top-level fields.
- Unit tests for summary counts.
- Unit tests for warning aggregation.
- Unit tests that optional bbox and confidence fields are preserved only when present.
- Unit tests that metadata is JSON serializable.
- Unit tests that metadata requires source PDF, source SHA-256, engine, engine version, and page records.
Handoff
- PROGRESS.md records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
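
The warning aggregation rules above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function name `summarize_warnings` and the dict-shaped warning records are assumptions, while the warning codes and the `warning_count` and `math_render_error_count` keys come from this contract.

```python
from collections import Counter
from typing import Any


def summarize_warnings(warnings: list[dict[str, Any]]) -> dict[str, Any]:
    """Aggregate warning dicts without reordering or altering them (hypothetical helper)."""
    counts = Counter(w["code"] for w in warnings)
    return {
        "warning_count": len(warnings),
        # Math render failures are counted from the MATH_RENDER_FAILED code only.
        "math_render_error_count": counts["MATH_RENDER_FAILED"],
        # Copy the list so the caller's order is preserved as given.
        "warnings": list(warnings),
    }


example_warnings = [
    {"code": "MATH_RENDER_FAILED", "severity": "error", "message": "render failed", "page_index": 2},
    {"code": "ASSET_LINK_MISSING", "severity": "warning", "message": "missing asset"},
    {"code": "MATH_RENDER_FAILED", "severity": "error", "message": "render failed", "page_index": 5},
]
warning_summary = summarize_warnings(example_warnings)
```

Page-level data such as `page_index` passes through untouched when present and is not invented when absent, which is the behavior the report generator is expected to rely on later.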

Non-Goals

Do not implement PDF conversion.
Do not implement conversion orchestration.
Do not implement the MinerU adapter.
Do not run MinerU.
Do not install MinerU 3.1.0.
Do not download MinerU models.
Do not parse PDF contents.
Do not compute source SHA-256 by reading files unless this contract is explicitly amended.
Do not implement Markdown normalization.
Do not implement asset link checking.
Do not implement math renderability checking.
Do not implement full .report.md content generation.
Do not implement pdf2md convert as a working command.
Do not implement pdf2md doctor.
Do not add runtime engine selection.
Do not add alternate conversion engines.
Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.

Work Packages

WP3.1: Domain Record Types

Owner:

metadata-agent
feature-generator-agent

Actions:

Define small project-owned records for document/page/block/asset/warning concepts.
Use simple, typed Python structures that are easy to serialize and test.
Keep MinerU-specific raw objects out of public and required fields.

Output:

ir.py contains the minimal domain model needed by metadata construction.
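
As a rough sketch of what WP3.1 asks for, the records could be frozen dataclasses with one serialization method each. The field names, the `to_dict` method, the bbox tuple layout, and the choice to nest blocks under pages are all assumptions made for illustration; the contract only fixes which concepts and optional attributes must exist.

```python
from dataclasses import dataclass
from typing import Any, Optional

# (x0, y0, x1, y1); the coordinate convention is an assumption.
BBox = tuple[float, float, float, float]


@dataclass(frozen=True)
class BlockRecord:
    block_type: str
    page_index: Optional[int] = None
    bbox: Optional[BBox] = None
    confidence: Optional[float] = None
    markdown_span: Optional[tuple[int, int]] = None

    def to_dict(self) -> dict[str, Any]:
        data: dict[str, Any] = {"type": self.block_type}
        # Emit optional keys only when a value was provided.
        if self.page_index is not None:
            data["page_index"] = self.page_index
        if self.bbox is not None:
            data["bbox"] = list(self.bbox)
        if self.confidence is not None:
            data["confidence"] = self.confidence
        if self.markdown_span is not None:
            data["markdown_span"] = list(self.markdown_span)
        return data


@dataclass(frozen=True)
class PageRecord:
    page_index: int
    width: Optional[float] = None
    height: Optional[float] = None
    blocks: tuple[BlockRecord, ...] = ()

    def to_dict(self) -> dict[str, Any]:
        data: dict[str, Any] = {
            "page_index": self.page_index,
            "blocks": [block.to_dict() for block in self.blocks],
        }
        # Page dimensions are optional and only emitted as a pair.
        if self.width is not None and self.height is not None:
            data["width"] = self.width
            data["height"] = self.height
        return data


page = PageRecord(
    page_index=0,
    blocks=(
        BlockRecord("heading", page_index=0),
        BlockRecord(
            "display_formula",
            page_index=0,
            bbox=(72.0, 100.0, 540.0, 140.0),
            confidence=0.91,
        ),
    ),
)
page_dict = page.to_dict()
```

Nothing here references MinerU types, so a later adapter would have to translate engine output into these records rather than leak raw objects into them.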

WP3.2: Warning Codes And Severities

Owner:

metadata-agent
feature-generator-agent

Actions:

Define stable warning codes from ARCHITECTURE.md.
Define severity values and validate warning records against them.
Avoid inventing speculative warning categories beyond the known v1 set unless needed by tests.

Output:

Warnings are structured, countable, and stable across later sprints.
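
One possible shape for the codes and severities is a pair of string-valued enums plus a record that validates on construction. The class names and the coercion in `__post_init__` are assumptions; the code and severity values are the ones this contract lists.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class WarningCode(str, Enum):
    ENGINE_MISSING = "ENGINE_MISSING"
    GPU_UNAVAILABLE = "GPU_UNAVAILABLE"
    LOW_CONFIDENCE_FORMULA = "LOW_CONFIDENCE_FORMULA"
    MATH_RENDER_FAILED = "MATH_RENDER_FAILED"
    ASSET_LINK_MISSING = "ASSET_LINK_MISSING"
    READING_ORDER_UNCERTAIN = "READING_ORDER_UNCERTAIN"
    STRICT_LOCAL_VIOLATION = "STRICT_LOCAL_VIOLATION"
    MINERU_CLI_FAILED = "MINERU_CLI_FAILED"


class Severity(str, Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"


@dataclass(frozen=True)
class WarningRecord:
    code: WarningCode
    severity: Severity
    message: str
    page_index: Optional[int] = None

    def __post_init__(self) -> None:
        # Enum lookup raises ValueError for unknown codes or severities,
        # which rejects values outside the known v1 set.
        object.__setattr__(self, "code", WarningCode(self.code))
        object.__setattr__(self, "severity", Severity(self.severity))

    def to_dict(self) -> dict[str, Any]:
        data: dict[str, Any] = {
            "code": self.code.value,
            "severity": self.severity.value,
            "message": self.message,
        }
        if self.page_index is not None:
            data["page_index"] = self.page_index
        return data


record = WarningRecord("MATH_RENDER_FAILED", "error", "formula did not render", page_index=3)

try:
    WarningRecord("NOT_A_REAL_CODE", "error", "unknown code")
    rejected = False
except ValueError:
    rejected = True
```

Storing `.value` strings in the serialized form keeps the JSON stable even if the enum classes are later renamed or moved.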

WP3.3: Metadata Builder

Owner:

metadata-agent
feature-generator-agent

Actions:

Build required metadata JSON data from project-owned records.
Preserve optional provenance fields only when present.
Require source PDF path, source SHA-256, engine, engine version, pages, assets, warnings, and engine options as explicit inputs.

Output:

metadata.py produces the required v1 metadata object without MinerU execution.
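
A builder along these lines would satisfy the explicit-input requirement. It is a sketch under stated assumptions: the name `build_metadata`, the keyword-only signature, the use of already-serialized dicts for pages, assets, and warnings, and the decision to treat empty strings or an empty page list as missing are all illustrative. The top-level and summary keys are the ones required by this contract, and the hash and timestamp in the example are placeholders.

```python
import json
from typing import Any


def build_metadata(
    *,
    source_pdf: str,
    source_sha256: str,
    created_at: str,
    engine: str,
    engine_version: str,
    engine_options: dict[str, Any],
    pages: list[dict[str, Any]],
    assets: list[dict[str, Any]],
    warnings: list[dict[str, Any]],
) -> dict[str, Any]:
    """Assemble v1 metadata from explicit inputs; nothing is read from disk."""
    required = {
        "source_pdf": source_pdf,
        "source_sha256": source_sha256,
        "engine": engine,
        "engine_version": engine_version,
        "pages": pages,
    }
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise ValueError(f"missing required metadata inputs: {', '.join(missing)}")

    block_types = [b["type"] for p in pages for b in p.get("blocks", [])]
    return {
        "source_pdf": source_pdf,
        "source_sha256": source_sha256,
        "created_at": created_at,
        "engine": engine,
        "engine_version": engine_version,
        "engine_options": dict(engine_options),
        "pages": list(pages),
        "assets": list(assets),
        "warnings": list(warnings),
        "summary": {
            "pages_processed": len(pages),
            "warning_count": len(warnings),
            "asset_count": len(assets),
            "display_formula_count": block_types.count("display_formula"),
            "inline_formula_count": block_types.count("inline_formula"),
            "math_render_error_count": sum(
                1 for w in warnings if w["code"] == "MATH_RENDER_FAILED"
            ),
        },
    }


metadata = build_metadata(
    source_pdf="papers/example.pdf",
    source_sha256="0" * 64,  # placeholder; the caller supplies the real hash
    created_at="2026-05-07T00:00:00+09:00",  # placeholder timestamp
    engine="mineru",
    engine_version="3.1.0",
    engine_options={},
    pages=[{"page_index": 0, "blocks": [{"type": "display_formula"}, {"type": "paragraph"}]}],
    assets=[{"path": "assets/fig1.png", "page_index": 0}],
    warnings=[{"code": "MATH_RENDER_FAILED", "severity": "error", "message": "render failed"}],
)
encoded = json.dumps(metadata)  # plain dicts and lists; no custom encoder needed
```

Taking `source_sha256` and `created_at` as arguments keeps the builder pure, so tests can run against in-memory records without touching any PDF.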

WP3.4: Metadata And Warning Tests

Owner:

feature-generator-agent
evaluation-agent

Actions:

Add focused unit tests for schema, counts, optional fields, JSON serialization, and validation failures.
Use in-memory records and temporary paths only.

Output:

uv run pytest verifies metadata behavior without external dependencies.

WP3.5: Independent Evaluation

Owner:

evaluation-agent

Actions:

Review the completed records and metadata builder against this contract.
Verify no conversion behavior, MinerU execution, remote runtime path, alternate engine, Markdown normalization, quality checks, or report content generation was added.
Verify samples/ remains untracked and unstaged.

Output:

PASS/FAIL notes with any missing acceptance criteria.

Verification Checks

Required:

git status --short before staging confirms samples/ remains untracked.
uv --version is run and its result is recorded.
uv sync passes.
uv run pytest passes.
Targeted IR/metadata tests pass.
Metadata output is JSON serializable through json.dumps.
Tests do not require MinerU, CUDA, GPU, model files, samples/, or network.
No real MinerU dependency is required for default tests.
No model downloads occur.
No network calls are required.
No candidate engine comparison is reintroduced.
No conversion behavior is implemented.
No Markdown normalization behavior is implemented.
No full .report.md content generation is implemented.
git diff --check passes.

Recommended:

Keep dataclass or enum APIs small and explicit.
Prefer one serialization function per record over ad hoc dict mutation in tests.
Include tests that fail if a required metadata top-level field is omitted.
Use requirements-guard-agent if metadata requirements conflict between PRD.md and ARCHITECTURE.md.
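
The recommendation to fail when a required field is omitted could be backed by a small checker like the one below. The helper name `missing_metadata_fields` and the dotted `summary.<field>` reporting format are assumptions; the field lists are taken from the Expected Outputs section of this contract.

```python
from typing import Any

REQUIRED_TOP_LEVEL = (
    "source_pdf", "source_sha256", "created_at", "engine", "engine_version",
    "engine_options", "pages", "assets", "warnings", "summary",
)
REQUIRED_SUMMARY = (
    "pages_processed", "warning_count", "asset_count",
    "display_formula_count", "inline_formula_count", "math_render_error_count",
)


def missing_metadata_fields(metadata: dict[str, Any]) -> list[str]:
    """Return the names of required fields absent from a metadata dict (hypothetical helper)."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in metadata]
    summary = metadata.get("summary")
    if isinstance(summary, dict):
        missing += [f"summary.{key}" for key in REQUIRED_SUMMARY if key not in summary]
    return missing


complete = {key: None for key in REQUIRED_TOP_LEVEL}
complete["summary"] = {key: 0 for key in REQUIRED_SUMMARY}

# Drop one top-level field and one summary field to show both are reported.
broken = {k: v for k, v in complete.items() if k != "source_sha256"}
broken["summary"] = {k: v for k, v in complete["summary"].items() if k != "asset_count"}

missing_in_complete = missing_metadata_fields(complete)
missing_in_broken = missing_metadata_fields(broken)
```

A unit test can then assert that the list is empty for builder output and non-empty for any dict with a key removed, which covers the first two hard failure criteria directly.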

Hard Failure Criteria

Sprint 3 fails and must stop for a user decision if any of these are true:

Metadata omits source PDF, source SHA-256, engine, engine version, pages, warnings, assets, or summary.
Summary omits pages processed, warning count, asset count, display formula count, inline formula count, or math render error count.
Public or required metadata fields require raw MinerU objects.
Optional bbox, confidence, or page provenance is dropped when provided.
Optional bbox, confidence, or page provenance is invented when absent.
Default tests require MinerU, CUDA, GPU, model files, network, or samples/.
The implementation parses PDF contents, invokes conversion behavior, normalizes Markdown, or generates full report Markdown content.
The implementation introduces alternate engines or runtime engine selection.
The implementation introduces --api-url, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
samples/ is staged or committed.

Acceptance Criteria

Sprint 3 is complete when:

src/pdf2md/ir.py exists and owns project domain records.
src/pdf2md/metadata.py exists and builds required metadata JSON data from project-owned records.
Stable block types and warning codes are defined and tested.
Metadata top-level fields and summary fields are tested.
Warning aggregation is tested.
Optional bbox and confidence preservation is tested.
Metadata JSON serializability is tested.
No conversion, MinerU, Markdown normalization, quality check, full report generation, or doctor behavior is implemented.
uv sync passes.
uv run pytest passes.
PROGRESS.md records checks performed and residual risks.
Independent evaluation is complete.
The completed change is committed.

Handoff Fields

Use these fields when Sprint 3 completes:

Files changed:
Commands run:
Tests passed:
Tests blocked:
Known failures:
Residual risks:
User decisions needed:
Go/no-go recommendation for Sprint 4:
Next action:
