# Sprint 5 Contract: Obsidian Markdown Normalization And Asset Links Status: Implemented Last updated: 2026-05-07 ## Objective Build the project-owned Markdown normalization boundary for Obsidian output, using deterministic unit tests before it is connected to conversion orchestration. Sprint 5 must establish: - A small Markdown normalization module that accepts local raw Markdown-like text and returns normalized Markdown plus project-owned warnings. - Obsidian math delimiter normalization for inline and display math. - Stable relative asset link normalization without copying files or writing final outputs. - Limited local asset link validation where useful for warnings. - Table preservation and clear warning behavior when a table cannot be safely simplified. - Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, network, or Obsidian itself. Sprint 5 is a normalization contract sprint. It must not connect normalization to the CLI, conversion orchestration, metadata writing, report generation, real MinerU execution, or end-to-end output writing. ## Current Precondition Sprint 4 is complete: - `src/pdf2md/paths.py` owns input discovery and output path planning. - `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities. - `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records. - `src/pdf2md/mineru_adapter.py` owns the mocked direct local MinerU CLI adapter boundary. - `uv run pytest` passed 72 tests. Sprint 5 may use `WarningRecord`, `WarningCode`, and `WarningSeverity` from `ir.py`, but it must not require raw MinerU-specific Python objects as public or required inputs. ## Touched Surfaces Allowed: - `src/pdf2md/markdown.py` - `src/pdf2md/quality.py` only for minimal local asset link check helpers if that boundary is cleaner than placing them in `markdown.py` - `src/pdf2md/ir.py` only for narrowly required warning codes discovered while implementing table or asset fallback warnings - `tests/test_markdown.py` or `tests/unit/test_markdown.py` - `tests/test_quality.py` only if `quality.py` is touched for asset link checks - `README.md` only if a small note is needed to clarify that normalization tests are mocked/local and not full conversion behavior - `PLAN.md` only for current-goal coordination updates required by the shared agent workflow - `PROGRESS.md` - `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment - `docs/Sprints/SPRINT5CONTRACT.md` Not allowed: - `src/pdf2md/conversion.py` - `src/pdf2md/cli.py` - `src/pdf2md/mineru_adapter.py` - `src/pdf2md/report.py` - Working `pdf2md convert` behavior - Full `pdf2md doctor` behavior - `scripts/` - Any real MinerU invocation in default tests - Any MinerU/model installation or download script - Any PDF content parsing - Any metadata JSON file writing - Any `.report.md` content generation - Any runtime engine selection or alternate engine support - Any remote asset fetch, HTTP client, or cloud/API integration - Any committed file under `samples/` ## Expected Outputs Sprint 5 should produce: 1. Normalization records and API - A small result record or equivalent project-owned return type containing at least: - normalized Markdown - warnings - asset links discovered or normalized when available - A normalization function with a narrow input surface, such as raw Markdown text plus optional output/assets context. - No public or required field should expose raw MinerU-specific Python objects. - The API should be usable by later orchestration without knowing how MinerU represented the original Markdown. 2. Inline math delimiter normalization - Normalize safe inline math forms to `$...$`. - Preserve already valid `$...$` inline math. - Preserve the exact LaTeX body inside inline math except delimiter changes. - Do not escape or rewrite underscores, carets, braces, or backslashes inside math. - Do not normalize math delimiters inside fenced code blocks or inline code spans. - Avoid converting ambiguous dollar signs that look like currency or prose punctuation. 3. Display math delimiter normalization - Normalize safe display math forms to `$$...$$`. - Ensure display math delimiters sit on their own lines. - Keep a blank line around display math blocks. - Preserve the exact LaTeX body inside display math except delimiter and surrounding whitespace normalization. - Preserve LaTeX environments such as `equation`, `align`, or `gather` rather than rewriting their semantics. - Make normalization idempotent: running the normalizer twice should produce the same Markdown. 4. Asset link normalization - Normalize local image/asset links to stable relative POSIX-style Markdown paths. - Keep relative links relative; do not turn them into absolute paths. - Reject or warn on absolute asset links that cannot be represented relative to the planned output/assets context. - Reject or warn on links that escape the output/assets directory with `..`. - Do not fetch remote URLs, copy assets, or write files. - Preserve alt text when rewriting Markdown image links. 5. Table preservation and fallback warnings - Preserve simple Markdown pipe tables without destructive formatting changes. - Preserve HTML tables when Markdown would lose row spans, column spans, nested content, or other complex structure. - Emit a project-owned warning when complex table fallback behavior is detected or when table simplification is intentionally skipped. - Do not attempt broad table reflow or OCR-style table reconstruction in Sprint 5. 6. Tests - Unit tests for inline math delimiter normalization. - Unit tests for display math delimiter normalization and blank-line spacing. - Unit tests proving underscores and carets inside math are preserved. - Unit tests proving fenced code blocks and inline code are not normalized. - Unit tests for idempotency. - Unit tests for relative asset link normalization. - Unit tests for missing or escaping asset link warnings when asset checking is implemented. - Unit tests for simple table preservation. - Unit tests for complex table fallback warning behavior. - Unit tests proving no real MinerU binary, model files, GPU, `samples/`, Obsidian installation, or network are required by default. 7. Handoff - `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action. ## Non-Goals - Do not implement conversion orchestration. - Do not implement `convert_pdf`. - Do not implement `pdf2md convert`. - Do not implement full `pdf2md doctor`. - Do not invoke MinerU. - Do not install MinerU 3.1.0. - Do not download MinerU models. - Do not parse real PDFs. - Do not write final Markdown files as product behavior. - Do not copy or move assets as product behavior. - Do not write metadata JSON. - Do not generate `.report.md`. - Do not compute source SHA-256. - Do not implement math renderability checks beyond a future-facing warning interface if needed. - Do not implement full quality report checks. - Do not implement alternate engines or runtime engine selection. - Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, or remote asset-fetching support. ## Work Packages ### WP5.1: Normalization Types And Safe Boundaries Owner: - `obsidian-markdown-agent` - `feature-generator-agent` Actions: - Define a small Markdown normalization result type. - Define a focused normalization function. - Keep warnings project-owned through `WarningRecord`. - Keep the API independent of raw MinerU objects. Output: - Later orchestration can normalize adapter Markdown without knowing MinerU internals. ### WP5.2: Math Delimiter Normalization Owner: - `obsidian-markdown-agent` - `feature-generator-agent` Actions: - Normalize safe inline math delimiters to `$...$`. - Normalize safe display math delimiters to `$$...$$` with stable surrounding blank lines. - Preserve LaTeX bodies exactly. - Protect code fences and inline code spans. - Add idempotency tests. Output: - Obsidian-friendly math delimiter behavior is deterministic and covered by unit tests. ### WP5.3: Asset Link Normalization Owner: - `obsidian-markdown-agent` - `feature-generator-agent` Actions: - Normalize local image/asset links to stable relative POSIX-style paths. - Preserve alt text. - Warn on missing, absolute, escaping, or non-local asset links when the helper has enough local context to judge them. - Do not fetch or copy assets. Output: - Later conversion can produce Markdown links that are stable relative to planned output paths. ### WP5.4: Table Preservation And Fallback Warning Owner: - `obsidian-markdown-agent` - `feature-generator-agent` Actions: - Preserve simple Markdown pipe tables. - Preserve complex HTML tables without simplifying them destructively. - Emit a project-owned warning for complex table fallback behavior. Output: - Table handling is conservative and traceable instead of silently lossy. ### WP5.5: Independent Evaluation Owner: - `evaluation-agent` Actions: - Review the completed normalizer against this contract. - Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, metadata writing, report generation, file-copying behavior, or working CLI command was added. - Verify `samples/` remains untracked and unstaged. Output: - PASS/FAIL notes with any missing acceptance criteria. ## Verification Checks Required: - `git status --short` before staging confirms `samples/` remains untracked. - `uv --version` is run and result is recorded. - `uv sync` passes. - `uv run pytest` passes. - Targeted Markdown normalization tests pass. - Tests do not require real MinerU, CUDA, GPU, model files, Obsidian, `samples/`, or network. - No model downloads occur. - No network calls are required. - No candidate engine comparison is reintroduced. - No conversion orchestration is implemented. - No metadata JSON writing or full report generation is implemented. - No working `pdf2md convert` or full `pdf2md doctor` behavior is implemented. - No final output files are written as product behavior. - No remote asset fetching is implemented. - Math delimiter normalization is idempotent. - Asset paths in normalized Markdown are relative when they are rewritten. - `git diff --check` passes. Recommended: - Prefer a small tokenizer or state-machine approach over broad regular-expression rewrites for math/code boundary handling. - Keep normalization helpers pure and deterministic. - Treat complex tables conservatively: preserve content and warn rather than flattening structure. - Use `requirements-guard-agent` if warning codes or output behavior conflict across documents. ## Hard Failure Criteria Sprint 5 fails and must stop for a user decision if any of these are true: - The normalizer rewrites LaTeX math bodies beyond delimiter and whitespace normalization without deterministic tests. - The normalizer changes underscores, carets, braces, or backslashes inside math content. - The normalizer rewrites code fences or inline code spans as math. - The normalizer produces absolute asset links where relative links are required. - The normalizer accepts asset links that escape the output/assets context without warning. - The implementation fetches remote assets or adds any HTTP/network client path. - The implementation connects normalization to a working conversion CLI/API. - The implementation adds metadata file writing, full report generation, real MinerU execution, model downloads, or setup scripts. - Default tests require real MinerU, CUDA, GPU, model files, network, Obsidian, or `samples/`. - `samples/` is staged or committed. ## Acceptance Criteria Sprint 5 is complete when: - `src/pdf2md/markdown.py` exists and owns Obsidian Markdown normalization behavior. - Inline math delimiter normalization is unit-tested. - Display math delimiter normalization and blank-line spacing are unit-tested. - Tests prove underscores and carets inside math are preserved. - Tests prove fenced code blocks and inline code are not normalized. - Normalization idempotency is unit-tested. - Relative asset link normalization is unit-tested. - Asset warning behavior is unit-tested when missing, absolute, escaping, or non-local links are in scope. - Simple table preservation and complex table fallback warning behavior are unit-tested. - Default tests do not require MinerU, GPU, model files, network, Obsidian, or `samples/`. - No conversion orchestration, metadata file writing, full report generation, file-copying behavior, or working CLI behavior is implemented. - `uv sync` passes. - `uv run pytest` passes. - `PROGRESS.md` records checks performed and residual risks. - Independent evaluation is complete. - The completed change is committed. ## Handoff Fields Use these fields when Sprint 5 completes: - Files changed: - Commands run: - Tests passed: - Tests blocked: - Known failures: - Residual risks: - User decisions needed: - Go/no-go recommendation for Sprint 6: - Next action: