Files
PDFToMD/docs/Sprints/SPRINT5CONTRACT.md
T
2026-05-08 16:42:19 +09:00

13 KiB

Sprint 5 Contract: Obsidian Markdown Normalization And Asset Links

Status: Implemented Last updated: 2026-05-07

Objective

Build the project-owned Markdown normalization boundary for Obsidian output, using deterministic unit tests before it is connected to conversion orchestration.

Sprint 5 must establish:

  • A small Markdown normalization module that accepts local raw Markdown-like text and returns normalized Markdown plus project-owned warnings.
  • Obsidian math delimiter normalization for inline and display math.
  • Stable relative asset link normalization without copying files or writing final outputs.
  • Limited local asset link validation where useful for warnings.
  • Table preservation and clear warning behavior when a table cannot be safely simplified.
  • Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, network, or Obsidian itself.

Sprint 5 is a normalization contract sprint. It must not connect normalization to the CLI, conversion orchestration, metadata writing, report generation, real MinerU execution, or end-to-end output writing.

Current Precondition

Sprint 4 is complete:

  • src/pdf2md/paths.py owns input discovery and output path planning.
  • src/pdf2md/ir.py owns project records, block types, warning codes, and warning severities.
  • src/pdf2md/metadata.py builds JSON-serializable metadata and summary counts from project-owned records.
  • src/pdf2md/mineru_adapter.py owns the mocked direct local MinerU CLI adapter boundary.
  • uv run pytest passed 72 tests.

Sprint 5 may use WarningRecord, WarningCode, and WarningSeverity from ir.py, but it must not require raw MinerU-specific Python objects as public or required inputs.

Touched Surfaces

Allowed:

  • src/pdf2md/markdown.py
  • src/pdf2md/quality.py only for minimal local asset link check helpers if that boundary is cleaner than placing them in markdown.py
  • src/pdf2md/ir.py only for narrowly required warning codes discovered while implementing table or asset fallback warnings
  • tests/test_markdown.py or tests/unit/test_markdown.py
  • tests/test_quality.py only if quality.py is touched for asset link checks
  • README.md only if a small note is needed to clarify that normalization tests are mocked/local and not full conversion behavior
  • PLAN.md only for current-goal coordination updates required by the shared agent workflow
  • PROGRESS.md
  • docs/V1IMPLEMENTATIONPLAN.md only if sequencing or constraints need adjustment
  • docs/Sprints/SPRINT5CONTRACT.md

Not allowed:

  • src/pdf2md/conversion.py
  • src/pdf2md/cli.py
  • src/pdf2md/mineru_adapter.py
  • src/pdf2md/report.py
  • Working pdf2md convert behavior
  • Full pdf2md doctor behavior
  • scripts/
  • Any real MinerU invocation in default tests
  • Any MinerU/model installation or download script
  • Any PDF content parsing
  • Any metadata JSON file writing
  • Any .report.md content generation
  • Any runtime engine selection or alternate engine support
  • Any remote asset fetch, HTTP client, or cloud/API integration
  • Any committed file under samples/

Expected Outputs

Sprint 5 should produce:

  1. Normalization records and API

    • A small result record or equivalent project-owned return type containing at least:
      • normalized Markdown
      • warnings
      • asset links discovered or normalized when available
    • A normalization function with a narrow input surface, such as raw Markdown text plus optional output/assets context.
    • No public or required field should expose raw MinerU-specific Python objects.
    • The API should be usable by later orchestration without knowing how MinerU represented the original Markdown.
  2. Inline math delimiter normalization

    • Normalize safe inline math forms to $...$.
    • Preserve already valid $...$ inline math.
    • Preserve the exact LaTeX body inside inline math except delimiter changes.
    • Do not escape or rewrite underscores, carets, braces, or backslashes inside math.
    • Do not normalize math delimiters inside fenced code blocks or inline code spans.
    • Avoid converting ambiguous dollar signs that look like currency or prose punctuation.
  3. Display math delimiter normalization

    • Normalize safe display math forms to $$...$$.
    • Ensure display math delimiters sit on their own lines.
    • Keep a blank line around display math blocks.
    • Preserve the exact LaTeX body inside display math except delimiter and surrounding whitespace normalization.
    • Preserve LaTeX environments such as equation, align, or gather rather than rewriting their semantics.
    • Make normalization idempotent: running the normalizer twice should produce the same Markdown.
  4. Asset link normalization

    • Normalize local image/asset links to stable relative POSIX-style Markdown paths.
    • Keep relative links relative; do not turn them into absolute paths.
    • Reject or warn on absolute asset links that cannot be represented relative to the planned output/assets context.
    • Reject or warn on links that escape the output/assets directory with ...
    • Do not fetch remote URLs, copy assets, or write files.
    • Preserve alt text when rewriting Markdown image links.
  5. Table preservation and fallback warnings

    • Preserve simple Markdown pipe tables without destructive formatting changes.
    • Preserve HTML tables when Markdown would lose row spans, column spans, nested content, or other complex structure.
    • Emit a project-owned warning when complex table fallback behavior is detected or when table simplification is intentionally skipped.
    • Do not attempt broad table reflow or OCR-style table reconstruction in Sprint 5.
  6. Tests

    • Unit tests for inline math delimiter normalization.
    • Unit tests for display math delimiter normalization and blank-line spacing.
    • Unit tests proving underscores and carets inside math are preserved.
    • Unit tests proving fenced code blocks and inline code are not normalized.
    • Unit tests for idempotency.
    • Unit tests for relative asset link normalization.
    • Unit tests for missing or escaping asset link warnings when asset checking is implemented.
    • Unit tests for simple table preservation.
    • Unit tests for complex table fallback warning behavior.
    • Unit tests proving no real MinerU binary, model files, GPU, samples/, Obsidian installation, or network are required by default.
  7. Handoff

    • PROGRESS.md records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.

Non-Goals

  • Do not implement conversion orchestration.
  • Do not implement convert_pdf.
  • Do not implement pdf2md convert.
  • Do not implement full pdf2md doctor.
  • Do not invoke MinerU.
  • Do not install MinerU 3.1.0.
  • Do not download MinerU models.
  • Do not parse real PDFs.
  • Do not write final Markdown files as product behavior.
  • Do not copy or move assets as product behavior.
  • Do not write metadata JSON.
  • Do not generate .report.md.
  • Do not compute source SHA-256.
  • Do not implement math renderability checks beyond a future-facing warning interface if needed.
  • Do not implement full quality report checks.
  • Do not implement alternate engines or runtime engine selection.
  • Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, or remote asset-fetching support.

Work Packages

WP5.1: Normalization Types And Safe Boundaries

Owner:

  • obsidian-markdown-agent
  • feature-generator-agent

Actions:

  • Define a small Markdown normalization result type.
  • Define a focused normalization function.
  • Keep warnings project-owned through WarningRecord.
  • Keep the API independent of raw MinerU objects.

Output:

  • Later orchestration can normalize adapter Markdown without knowing MinerU internals.

WP5.2: Math Delimiter Normalization

Owner:

  • obsidian-markdown-agent
  • feature-generator-agent

Actions:

  • Normalize safe inline math delimiters to $...$.
  • Normalize safe display math delimiters to $$...$$ with stable surrounding blank lines.
  • Preserve LaTeX bodies exactly.
  • Protect code fences and inline code spans.
  • Add idempotency tests.

Output:

  • Obsidian-friendly math delimiter behavior is deterministic and covered by unit tests.

Owner:

  • obsidian-markdown-agent
  • feature-generator-agent

Actions:

  • Normalize local image/asset links to stable relative POSIX-style paths.
  • Preserve alt text.
  • Warn on missing, absolute, escaping, or non-local asset links when the helper has enough local context to judge them.
  • Do not fetch or copy assets.

Output:

  • Later conversion can produce Markdown links that are stable relative to planned output paths.

WP5.4: Table Preservation And Fallback Warning

Owner:

  • obsidian-markdown-agent
  • feature-generator-agent

Actions:

  • Preserve simple Markdown pipe tables.
  • Preserve complex HTML tables without simplifying them destructively.
  • Emit a project-owned warning for complex table fallback behavior.

Output:

  • Table handling is conservative and traceable instead of silently lossy.

WP5.5: Independent Evaluation

Owner:

  • evaluation-agent

Actions:

  • Review the completed normalizer against this contract.
  • Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, metadata writing, report generation, file-copying behavior, or working CLI command was added.
  • Verify samples/ remains untracked and unstaged.

Output:

  • PASS/FAIL notes with any missing acceptance criteria.

Verification Checks

Required:

  • git status --short before staging confirms samples/ remains untracked.
  • uv --version is run and result is recorded.
  • uv sync passes.
  • uv run pytest passes.
  • Targeted Markdown normalization tests pass.
  • Tests do not require real MinerU, CUDA, GPU, model files, Obsidian, samples/, or network.
  • No model downloads occur.
  • No network calls are required.
  • No candidate engine comparison is reintroduced.
  • No conversion orchestration is implemented.
  • No metadata JSON writing or full report generation is implemented.
  • No working pdf2md convert or full pdf2md doctor behavior is implemented.
  • No final output files are written as product behavior.
  • No remote asset fetching is implemented.
  • Math delimiter normalization is idempotent.
  • Asset paths in normalized Markdown are relative when they are rewritten.
  • git diff --check passes.

Recommended:

  • Prefer a small tokenizer or state-machine approach over broad regular-expression rewrites for math/code boundary handling.
  • Keep normalization helpers pure and deterministic.
  • Treat complex tables conservatively: preserve content and warn rather than flattening structure.
  • Use requirements-guard-agent if warning codes or output behavior conflict across documents.

Hard Failure Criteria

Sprint 5 fails and must stop for a user decision if any of these are true:

  • The normalizer rewrites LaTeX math bodies beyond delimiter and whitespace normalization without deterministic tests.
  • The normalizer changes underscores, carets, braces, or backslashes inside math content.
  • The normalizer rewrites code fences or inline code spans as math.
  • The normalizer produces absolute asset links where relative links are required.
  • The normalizer accepts asset links that escape the output/assets context without warning.
  • The implementation fetches remote assets or adds any HTTP/network client path.
  • The implementation connects normalization to a working conversion CLI/API.
  • The implementation adds metadata file writing, full report generation, real MinerU execution, model downloads, or setup scripts.
  • Default tests require real MinerU, CUDA, GPU, model files, network, Obsidian, or samples/.
  • samples/ is staged or committed.

Acceptance Criteria

Sprint 5 is complete when:

  • src/pdf2md/markdown.py exists and owns Obsidian Markdown normalization behavior.
  • Inline math delimiter normalization is unit-tested.
  • Display math delimiter normalization and blank-line spacing are unit-tested.
  • Tests prove underscores and carets inside math are preserved.
  • Tests prove fenced code blocks and inline code are not normalized.
  • Normalization idempotency is unit-tested.
  • Relative asset link normalization is unit-tested.
  • Asset warning behavior is unit-tested when missing, absolute, escaping, or non-local links are in scope.
  • Simple table preservation and complex table fallback warning behavior are unit-tested.
  • Default tests do not require MinerU, GPU, model files, network, Obsidian, or samples/.
  • No conversion orchestration, metadata file writing, full report generation, file-copying behavior, or working CLI behavior is implemented.
  • uv sync passes.
  • uv run pytest passes.
  • PROGRESS.md records checks performed and residual risks.
  • Independent evaluation is complete.
  • The completed change is committed.

Handoff Fields

Use these fields when Sprint 5 completes:

  • Files changed:
  • Commands run:
  • Tests passed:
  • Tests blocked:
  • Known failures:
  • Residual risks:
  • User decisions needed:
  • Go/no-go recommendation for Sprint 6:
  • Next action: