baram2584/PDFToMD

김경종 88d6b92283 add pdftomd

2026-05-08 16:42:19 +09:00

Sprint 6 Contract: Quality Checks And Report Generation

Status: Implemented
Last updated: 2026-05-08

Objective

Build local quality-check and human-readable report generation boundaries from project-owned metadata and normalized Markdown, before they are connected to conversion orchestration.

Sprint 6 must establish:

A project-owned quality module for local asset-link and math-renderability signals.
A report module that renders <stem>.report.md content from metadata and quality results.
Deterministic final status calculation: success, partial, or failed.
Summary fields needed by reports, including missing asset links and math render failures.
Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, Obsidian, LaTeX tooling, network, or a working conversion CLI.

Sprint 6 is a quality/report contract sprint. It may generate report Markdown content as a string, but it must not connect to the CLI or conversion orchestration, invoke real MinerU, write output files, add setup scripts, or perform end-to-end conversion.

Current Precondition

Sprint 5 is complete:

src/pdf2md/paths.py owns input discovery and output path planning.
src/pdf2md/ir.py owns project records, block types, warning codes, and warning severities.
src/pdf2md/metadata.py builds JSON-serializable metadata and summary counts from project-owned records.
src/pdf2md/mineru_adapter.py owns the mocked direct local MinerU CLI adapter boundary.
src/pdf2md/markdown.py owns Obsidian Markdown normalization, asset link warnings, and table fallback warnings.
uv run pytest passed 89 tests.

Sprint 6 may use metadata dictionaries produced by build_metadata, project-owned WarningRecord values, and normalized Markdown text. It must not require raw MinerU-specific Python objects as public or required inputs.

Touched Surfaces

Allowed:

src/pdf2md/quality.py
src/pdf2md/report.py
src/pdf2md/metadata.py only for narrowly required summary fields or helper functions that keep metadata/report consistency
src/pdf2md/ir.py only for narrowly required warning codes discovered while implementing quality checks
tests/test_quality.py
tests/test_report.py
tests/test_metadata.py only if metadata.py changes
README.md only if a small note is needed to clarify mocked/local quality and report behavior
PLAN.md only for current-goal coordination updates required by the shared agent workflow
PROGRESS.md
docs/V1IMPLEMENTATIONPLAN.md only if sequencing or constraints need adjustment
docs/Sprints/SPRINT6CONTRACT.md

Not allowed:

src/pdf2md/conversion.py
src/pdf2md/cli.py
src/pdf2md/mineru_adapter.py
Working pdf2md convert behavior
Full pdf2md doctor behavior
scripts/
Any real MinerU invocation in default tests
Any MinerU/model installation or download script
Any PDF content parsing
Any final Markdown file writing
Any metadata JSON file writing
Any .report.md file writing as product behavior
Any asset copying or moving
Any runtime engine selection or alternate engine support
Any remote asset fetch, HTTP client, cloud/API integration, hosted renderer, or remote math-render service
Any committed file under samples/

Expected Outputs

Sprint 6 should produce:

Quality result records and API
- A small project-owned quality result type containing at least:
  - missing asset link count
  - invalid asset link count when available
  - math render error count
  - warnings produced by quality checks
- A local asset-link check function that accepts normalized Markdown and local asset context without writing files.
- A math renderability check interface that accepts a local checker callable or reports tool-unavailable behavior gracefully.
- No public or required field should expose raw MinerU-specific Python objects.

Asset-link quality checks
- Count missing local asset links in Markdown.
- Count invalid links that are absolute, parent-escaping, remote, or otherwise non-local according to project policy.
- Produce project-owned warnings for missing or invalid asset links.
- Keep all checks local and deterministic.
- Do not fetch remote URLs, copy assets, move assets, or write files.

Math renderability checks
- Provide a boundary for local math renderability checking.
- Default tests must use fake/local checker callables.
- Tool-unavailable behavior must be explicit and non-fatal.
- Render failures must produce MATH_RENDER_FAILED warnings and count toward the report.
- The checker must not call network services or require a LaTeX/Obsidian install in default tests.

Metadata summary consistency
- Preserve existing required metadata summary fields.
- Add or derive report-needed counts without breaking existing metadata tests:
  - missing asset link count
  - invalid asset link count
  - math render error count
- Warning order and warning counts must remain deterministic.
- Reports must be derived from metadata and quality results, not independently duplicated state.

Report Markdown generation
- Render a human-readable <stem>.report.md content string from metadata and quality results.
- Include at least:
  - source PDF path
  - output Markdown path when provided
  - metadata path when provided
  - report path when provided
  - MinerU engine/version and execution mode/options
  - pages processed
  - warning count
  - asset count
  - missing asset link count
  - inline formula count
  - display formula count
  - math render error count
  - pages with warnings
  - final status: success, partial, or failed
- The report must not invent facts that are absent from metadata; absent optional paths should be omitted or clearly shown as unavailable.
- The report generator must not write files in Sprint 6.

Final status policy
- failed: metadata or quality warnings contain at least one error-severity warning.
- partial: no error-severity warnings, but warnings or quality failures exist.
- success: no warnings and no quality failures.
- The status function must be unit-tested and reusable by later orchestration.
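
The policy above can be sketched as a small pure function. This is a minimal illustration, not the project's implementation: the name `final_status`, the `quality_failures` parameter, and the representation of a warning as a mapping with a `severity` key are all assumptions, since the contract does not specify the shape of `WarningRecord`.

```python
from typing import Iterable, Mapping


def final_status(warnings: Iterable[Mapping[str, str]], quality_failures: int = 0) -> str:
    """Map warnings and quality failures to 'failed', 'partial', or 'success'.

    Hypothetical signature: each warning is assumed to be a mapping with a
    'severity' key; the real WarningRecord type may look different.
    """
    collected = list(warnings)
    # Any error-severity warning fails the run outright.
    if any(w.get("severity") == "error" for w in collected):
        return "failed"
    # Non-error warnings or counted quality failures downgrade to partial.
    if collected or quality_failures > 0:
        return "partial"
    return "success"
```

Because the function takes plain values and has no side effects, later orchestration could call it on combined metadata and quality warnings without importing anything else.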

Tests
- Unit tests for missing asset link counting.
- Unit tests for invalid/remote/escaping asset link warnings.
- Unit tests for math render failure aggregation with a fake checker.
- Unit tests for math checker unavailable behavior.
- Unit tests for report content and required sections.
- Unit tests proving report content is derived from metadata and quality results.
- Unit tests for pages-with-warnings summary.
- Unit tests for final status calculation.
- Unit tests proving no real MinerU binary, model files, GPU, samples/, Obsidian, LaTeX install, or network are required by default.

Handoff
- PROGRESS.md records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.

Non-Goals

Do not implement conversion orchestration.
Do not implement convert_pdf.
Do not implement pdf2md convert.
Do not implement full pdf2md doctor.
Do not invoke MinerU.
Do not install MinerU 3.1.0.
Do not download MinerU models.
Do not parse real PDFs.
Do not write final Markdown files.
Do not copy or move assets.
Do not write metadata JSON files.
Do not write .report.md files as product behavior.
Do not compute source SHA-256.
Do not implement real LaTeX, KaTeX, MathJax, or Obsidian rendering in default tests.
Do not add setup scripts.
Do not implement full local environment diagnostics.
Do not implement alternate engines or runtime engine selection.
Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, hosted renderer, or remote asset-fetching support.

Work Packages

WP6.1: Quality Result Types And Asset Checks

Owner:

metadata-agent
feature-generator-agent

Actions:

Define a small project-owned quality result type.
Add deterministic local asset link checks over normalized Markdown.
Count missing, invalid, escaping, absolute, and remote asset references.
Return project-owned warnings without writing files.

Output:

Later orchestration can add local quality results to metadata/report flow without duplicating asset-link logic.
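
One possible shape for the WP6.1 result type and asset-link check is sketched below. Every identifier here (`QualityWarning`, `AssetLinkResult`, `check_asset_links`, the `known_assets` parameter, and the warning codes `ASSET_LINK_MISSING` and `ASSET_LINK_INVALID`) is hypothetical; the contract names none of them. The sketch assumes the "local asset context" is a set of relative POSIX paths, so the check never touches the file system.

```python
import re
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass(frozen=True)
class QualityWarning:  # hypothetical stand-in for the project's WarningRecord
    code: str
    severity: str
    message: str


@dataclass
class AssetLinkResult:  # hypothetical quality result type
    missing_asset_links: int = 0
    invalid_asset_links: int = 0
    warnings: list[QualityWarning] = field(default_factory=list)


# Matches Markdown images ![alt](target "title") and Obsidian embeds ![[target|size]].
_LINK_RE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)|!\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")
_SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def _is_invalid(target: str) -> bool:
    """True for remote, scheme-qualified, absolute, or parent-escaping targets."""
    if _SCHEME_RE.match(target):
        return True  # http:, https:, file:, or a Windows drive such as C:
    if target.startswith(("/", "\\")):
        return True  # absolute path
    return ".." in PurePosixPath(target.replace("\\", "/")).parts


def check_asset_links(markdown: str, known_assets: set[str]) -> AssetLinkResult:
    """Count missing and invalid asset links without reading or writing files."""
    result = AssetLinkResult()
    for match in _LINK_RE.finditer(markdown):  # document order keeps warnings deterministic
        target = match.group(1) or match.group(2)
        if _is_invalid(target):
            result.invalid_asset_links += 1
            result.warnings.append(
                QualityWarning("ASSET_LINK_INVALID", "warning", f"non-local asset link: {target}")
            )
        elif target not in known_assets:
            result.missing_asset_links += 1
            result.warnings.append(
                QualityWarning("ASSET_LINK_MISSING", "warning", f"missing asset: {target}")
            )
    return result
```

Passing the known assets as a set keeps the check pure, so the same Markdown and asset context always produce the same counts and warning order.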

WP6.2: Math Renderability Boundary

Owner:

obsidian-markdown-agent
metadata-agent
feature-generator-agent

Actions:

Define a local math render checker interface.
Support fake checkers in tests.
Treat an unavailable checker as an explicit, non-fatal warning or info record, as the implementation design dictates.
Treat render failures as MATH_RENDER_FAILED warnings and count them.

Output:

Math renderability is represented as a local, testable boundary without external dependencies.
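
A minimal sketch of such a boundary follows, assuming the checker is a callable that returns `None` on success or an error message on failure. Only the `MATH_RENDER_FAILED` code comes from this contract; `MathChecker`, `MathCheckResult`, `check_math`, the `MATH_CHECKER_UNAVAILABLE` code, the chosen severities, and the dictionary-shaped warnings are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Sequence

# Hypothetical checker protocol: return None if the formula renders, else an error message.
MathChecker = Callable[[str], Optional[str]]


@dataclass
class MathCheckResult:  # hypothetical result type
    math_render_errors: int = 0
    checker_available: bool = True
    warnings: list[dict] = field(default_factory=list)


def check_math(formulas: Sequence[str], checker: Optional[MathChecker]) -> MathCheckResult:
    """Run a local checker over formulas; a missing checker is reported, not raised."""
    result = MathCheckResult()
    if checker is None:
        # Tool unavailable: explicit and non-fatal (code and severity are assumptions).
        result.checker_available = False
        result.warnings.append(
            {"code": "MATH_CHECKER_UNAVAILABLE", "severity": "info",
             "message": "no local math checker configured"}
        )
        return result
    for index, formula in enumerate(formulas):
        error = checker(formula)
        if error is not None:
            result.math_render_errors += 1
            result.warnings.append(
                {"code": "MATH_RENDER_FAILED", "severity": "warning",
                 "message": f"formula {index}: {error}"}
            )
    return result


def fake_checker(formula: str) -> Optional[str]:
    """Test double: flags unbalanced braces, needs no LaTeX, Obsidian, or network."""
    return "unbalanced braces" if formula.count("{") != formula.count("}") else None
```

In tests, `check_math([r"\frac{a}{b}", r"\sqrt{x"], fake_checker)` would record one failure, while `check_math(formulas, None)` exercises the tool-unavailable path.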

WP6.3: Metadata Summary Extensions

Owner:

metadata-agent
feature-generator-agent

Actions:

Preserve existing required metadata summary fields.
Add or derive counts needed by reports in a backward-compatible way.
Keep metadata JSON serializable and deterministic.

Output:

Metadata remains the source of truth for report counts and warning summaries.
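
A backward-compatible extension could look like the sketch below. The actual structure returned by `build_metadata` is not shown in this contract, so the `summary` key, the existing field names in the example, and the three new key names are assumptions for illustration only.

```python
import json
from copy import deepcopy
from typing import Any


def with_quality_counts(
    metadata: dict[str, Any],
    *,
    missing_asset_links: int,
    invalid_asset_links: int,
    math_render_errors: int,
) -> dict[str, Any]:
    """Return a copy of metadata whose summary carries the report-needed counts."""
    updated = deepcopy(metadata)  # never mutate the caller's metadata
    summary = updated.setdefault("summary", {})
    # Hypothetical key names; existing summary fields are left untouched.
    summary["missing_asset_link_count"] = missing_asset_links
    summary["invalid_asset_link_count"] = invalid_asset_links
    summary["math_render_error_count"] = math_render_errors
    return updated


base = {"summary": {"warning_count": 1, "asset_count": 2}}  # assumed shape
extended = with_quality_counts(
    base, missing_asset_links=1, invalid_asset_links=0, math_render_errors=3
)
# Sorted keys keep the serialized form deterministic across runs.
serialized = json.dumps(extended, sort_keys=True)
```

Adding keys rather than renaming or removing them is what keeps existing metadata tests valid.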

WP6.4: Report Markdown Rendering

Owner:

metadata-agent
feature-generator-agent

Actions:

Implement report content rendering from metadata plus quality results.
Include required report sections and final status.
Generate content only; do not write files.

Output:

Later orchestration can write <stem>.report.md by using the tested report renderer.
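
As a rough sketch of content-only rendering, the function below builds the report as a string and performs no I/O. It covers only a subset of the required fields, and the metadata keys it reads (`source_pdf`, `paths`, `summary`, `warnings`, `page`), the `quality` mapping, and the function names are hypothetical stand-ins for whatever `build_metadata` and the quality module actually produce.

```python
from typing import Any, Iterable, Mapping


def pages_with_warnings(warnings: Iterable[Mapping[str, Any]]) -> list[int]:
    """Sorted, de-duplicated page numbers that carry at least one warning."""
    return sorted({w["page"] for w in warnings if w.get("page") is not None})


def render_report(metadata: Mapping[str, Any], quality: Mapping[str, int], status: str) -> str:
    """Render report Markdown as a string; nothing is written to disk."""
    summary = metadata.get("summary", {})
    paths = metadata.get("paths", {})
    pages = pages_with_warnings(metadata.get("warnings", []))
    lines = [
        "# Conversion Report",
        "",
        f"- Source PDF: {metadata.get('source_pdf', 'unavailable')}",
    ]
    # Optional paths are shown as unavailable rather than invented.
    for label, key in (("Output Markdown", "markdown"), ("Metadata", "metadata"), ("Report", "report")):
        lines.append(f"- {label}: {paths.get(key) or 'unavailable'}")
    lines += [
        f"- Pages processed: {summary.get('pages_processed', 'unavailable')}",
        f"- Warning count: {summary.get('warning_count', 'unavailable')}",
        f"- Missing asset links: {quality.get('missing_asset_links', 0)}",
        f"- Math render errors: {quality.get('math_render_errors', 0)}",
        f"- Pages with warnings: {', '.join(map(str, pages)) or 'none'}",
        f"- Final status: {status}",
    ]
    return "\n".join(lines) + "\n"
```

Fixed line order and explicit placeholders keep the output stable enough for snapshot-like assertions, and every count is read from the inputs rather than recomputed.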

WP6.5: Independent Evaluation

Owner:

evaluation-agent

Actions:

Review completed quality/report behavior against this contract.
Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, final output writing, CLI behavior, or sample dependency was added.
Verify samples/ remains untracked and unstaged.

Output:

PASS/FAIL notes with any missing acceptance criteria.

Verification Checks

Required:

git status --short before staging confirms samples/ remains untracked.
uv --version is run and the result is recorded.
uv sync passes.
uv run pytest passes.
Targeted quality/report tests pass.
Tests do not require real MinerU, CUDA, GPU, model files, Obsidian, LaTeX tooling, samples/, or network.
No model downloads occur.
No network calls are required.
No candidate engine comparison is reintroduced.
No conversion orchestration is implemented.
No working pdf2md convert or full pdf2md doctor behavior is implemented.
No final Markdown, metadata JSON, or .report.md files are written as product behavior.
No remote asset fetching is implemented.
No real math renderer dependency is required by default tests.
Report counts match metadata and quality results.
Report generation does not re-run MinerU.
git diff --check passes.

Recommended:

Keep quality helpers pure and deterministic.
Use fake checkers for math renderability tests.
Keep report rendering stable enough for snapshot-like unit assertions.
Use requirements-guard-agent if warning codes, summary fields, or report wording conflict across documents.

Hard Failure Criteria

Sprint 6 fails and must stop for a user decision if any of these are true:

Report content diverges from metadata or quality result counts.
Math render failures are silently ignored.
Quality checks require network access.
The implementation fetches remote assets or adds any HTTP/network client path.
The implementation requires a real LaTeX/Obsidian/MathJax/KaTeX install in default tests.
The implementation connects quality/report behavior to a working conversion CLI/API.
The implementation writes final Markdown, metadata JSON, .report.md, or copied assets as product behavior.
The implementation invokes MinerU, downloads models, adds setup scripts, or parses real PDFs.
Default tests require real MinerU, CUDA, GPU, model files, network, Obsidian, LaTeX tooling, or samples/.
samples/ is staged or committed.

Acceptance Criteria

Sprint 6 is complete when:

src/pdf2md/quality.py exists and owns local quality-check behavior.
src/pdf2md/report.py exists and owns human-readable report content rendering.
Missing asset link counting is unit-tested.
Invalid, escaping, absolute, or remote asset link warning behavior is unit-tested.
Math render failure aggregation is unit-tested with fake checkers.
Math checker unavailable behavior is unit-tested and non-fatal.
Report content includes the required sections and counts.
Pages-with-warnings summary is unit-tested.
Final status calculation is unit-tested.
Report generation is proven not to write files or re-run MinerU.
Default tests do not require MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or samples/.
No conversion orchestration, final output file writing, working CLI behavior, real MinerU execution, or setup script is implemented.
uv sync passes.
uv run pytest passes.
PROGRESS.md records checks performed and residual risks.
Independent evaluation is complete.
The completed change is committed.

Handoff Fields

Use these fields when Sprint 6 completes:

Files changed:
Commands run:
Tests passed:
Tests blocked:
Known failures:
Residual risks:
User decisions needed:
Go/no-go recommendation for Sprint 7:
Next action:
