Sprint 5 Contract: Obsidian Markdown Normalization And Asset Links
Status: Implemented
Last updated: 2026-05-07
Objective
Build the project-owned Markdown normalization boundary for Obsidian output, using deterministic unit tests before it is connected to conversion orchestration.
Sprint 5 must establish:
- A small Markdown normalization module that accepts local raw Markdown-like text and returns normalized Markdown plus project-owned warnings.
- Obsidian math delimiter normalization for inline and display math.
- Stable relative asset link normalization without copying files or writing final outputs.
- Limited local asset link validation where useful for warnings.
- Table preservation and clear warning behavior when a table cannot be safely simplified.
- Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, network, or Obsidian itself.
Sprint 5 is a normalization contract sprint. It must not connect normalization to the CLI, conversion orchestration, metadata writing, report generation, real MinerU execution, or end-to-end output writing.
Current Precondition
Sprint 4 is complete:
- src/pdf2md/paths.py owns input discovery and output path planning.
- src/pdf2md/ir.py owns project records, block types, warning codes, and warning severities.
- src/pdf2md/metadata.py builds JSON-serializable metadata and summary counts from project-owned records.
- src/pdf2md/mineru_adapter.py owns the mocked direct local MinerU CLI adapter boundary.
- uv run pytest passed 72 tests.
Sprint 5 may use WarningRecord, WarningCode, and WarningSeverity from ir.py, but it must not require raw MinerU-specific Python objects as public or required inputs.
Touched Surfaces
Allowed:
- src/pdf2md/markdown.py
- src/pdf2md/quality.py only for minimal local asset link check helpers if that boundary is cleaner than placing them in markdown.py
- src/pdf2md/ir.py only for narrowly required warning codes discovered while implementing table or asset fallback warnings
- tests/test_markdown.py or tests/unit/test_markdown.py
- tests/test_quality.py only if quality.py is touched for asset link checks
- README.md only if a small note is needed to clarify that normalization tests are mocked/local and not full conversion behavior
- PLAN.md only for current-goal coordination updates required by the shared agent workflow
- PROGRESS.md
- docs/V1IMPLEMENTATIONPLAN.md only if sequencing or constraints need adjustment
- docs/Sprints/SPRINT5CONTRACT.md
Not allowed:
- src/pdf2md/conversion.py
- src/pdf2md/cli.py
- src/pdf2md/mineru_adapter.py
- src/pdf2md/report.py
- Working pdf2md convert behavior
- Full pdf2md doctor behavior
- scripts/
- Any real MinerU invocation in default tests
- Any MinerU/model installation or download script
- Any PDF content parsing
- Any metadata JSON file writing
- Any .report.md content generation
- Any runtime engine selection or alternate engine support
- Any remote asset fetch, HTTP client, or cloud/API integration
- Any committed file under samples/
Expected Outputs
Sprint 5 should produce:
- Normalization records and API
  - A small result record or equivalent project-owned return type containing at least:
    - normalized Markdown
    - warnings
    - asset links discovered or normalized when available
  - A normalization function with a narrow input surface, such as raw Markdown text plus optional output/assets context.
  - No public or required field should expose raw MinerU-specific Python objects.
  - The API should be usable by later orchestration without knowing how MinerU represented the original Markdown.
- Inline math delimiter normalization
  - Normalize safe inline math forms to $...$.
  - Preserve already valid $...$ inline math.
  - Preserve the exact LaTeX body inside inline math except delimiter changes.
  - Do not escape or rewrite underscores, carets, braces, or backslashes inside math.
  - Do not normalize math delimiters inside fenced code blocks or inline code spans.
  - Avoid converting ambiguous dollar signs that look like currency or prose punctuation.
- Display math delimiter normalization
  - Normalize safe display math forms to $$...$$.
  - Ensure display math delimiters sit on their own lines.
  - Keep a blank line around display math blocks.
  - Preserve the exact LaTeX body inside display math except delimiter and surrounding whitespace normalization.
  - Preserve LaTeX environments such as equation, align, or gather rather than rewriting their semantics.
  - Make normalization idempotent: running the normalizer twice should produce the same Markdown.
- Asset link normalization
  - Normalize local image/asset links to stable relative POSIX-style Markdown paths.
  - Keep relative links relative; do not turn them into absolute paths.
  - Reject or warn on absolute asset links that cannot be represented relative to the planned output/assets context.
  - Reject or warn on links that escape the output/assets directory with "..".
  - Do not fetch remote URLs, copy assets, or write files.
  - Preserve alt text when rewriting Markdown image links.
- Table preservation and fallback warnings
  - Preserve simple Markdown pipe tables without destructive formatting changes.
  - Preserve HTML tables when Markdown would lose row spans, column spans, nested content, or other complex structure.
  - Emit a project-owned warning when complex table fallback behavior is detected or when table simplification is intentionally skipped.
  - Do not attempt broad table reflow or OCR-style table reconstruction in Sprint 5.
- Tests
  - Unit tests for inline math delimiter normalization.
  - Unit tests for display math delimiter normalization and blank-line spacing.
  - Unit tests proving underscores and carets inside math are preserved.
  - Unit tests proving fenced code blocks and inline code are not normalized.
  - Unit tests for idempotency.
  - Unit tests for relative asset link normalization.
  - Unit tests for missing or escaping asset link warnings when asset checking is implemented.
  - Unit tests for simple table preservation.
  - Unit tests for complex table fallback warning behavior.
  - Unit tests proving no real MinerU binary, model files, GPU, samples/, Obsidian installation, or network are required by default.
- Handoff
  - PROGRESS.md records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
Non-Goals
- Do not implement conversion orchestration.
- Do not implement convert_pdf.
- Do not implement pdf2md convert.
- Do not implement full pdf2md doctor.
- Do not invoke MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not parse real PDFs.
- Do not write final Markdown files as product behavior.
- Do not copy or move assets as product behavior.
- Do not write metadata JSON.
- Do not generate .report.md.
- Do not compute source SHA-256.
- Do not implement math renderability checks beyond a future-facing warning interface if needed.
- Do not implement full quality report checks.
- Do not implement alternate engines or runtime engine selection.
- Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, or remote asset-fetching support.
Work Packages
WP5.1: Normalization Types And Safe Boundaries
Owner:
- obsidian-markdown-agent
- feature-generator-agent
Actions:
- Define a small Markdown normalization result type.
- Define a focused normalization function.
- Keep warnings project-owned through WarningRecord.
- Keep the API independent of raw MinerU objects.
Output:
- Later orchestration can normalize adapter Markdown without knowing MinerU internals.
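As an illustration of the boundary WP5.1 describes, the following is a minimal sketch of a result type and a narrow entry point. The names `NormalizationResult`, `normalize_markdown`, and `assets_dir` are hypothetical, and the `WarningRecord` and `WarningSeverity` classes shown here are simplified stand-ins; the real records in ir.py may have different fields.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


# Stand-ins for the project-owned records in ir.py (assumed shape, not the real one).
class WarningSeverity(Enum):
    INFO = "info"
    WARNING = "warning"


@dataclass(frozen=True)
class WarningRecord:
    code: str
    message: str
    severity: WarningSeverity = WarningSeverity.WARNING


@dataclass(frozen=True)
class NormalizationResult:
    """Hypothetical return type: only strings and project-owned records."""

    markdown: str
    warnings: tuple[WarningRecord, ...] = ()
    asset_links: tuple[str, ...] = ()


def normalize_markdown(raw: str, assets_dir: str | None = None) -> NormalizationResult:
    """Placeholder pipeline that only unifies line endings.

    Math, asset, and table passes would be chained here; assets_dir is the
    optional output/assets context and is unused in this sketch.
    """
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    return NormalizationResult(markdown=text)
```

Because the input is plain text and the output is a frozen record, a caller in later orchestration would not need to import anything MinerU-specific to use it.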
WP5.2: Math Delimiter Normalization
Owner:
- obsidian-markdown-agent
- feature-generator-agent
Actions:
- Normalize safe inline math delimiters to $...$.
- Normalize safe display math delimiters to $$...$$ with stable surrounding blank lines.
- Preserve LaTeX bodies exactly.
- Protect code fences and inline code spans.
- Add idempotency tests.
Output:
- Obsidian-friendly math delimiter behavior is deterministic and covered by unit tests.
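The actions in WP5.2 can be sketched as follows. This is a simplified illustration, not the project's implementation: it assumes the raw Markdown uses `\(...\)` and `\[...\]` delimiters, it uses regular expressions for brevity rather than a tokenizer or state machine, and the function name `normalize_math` is hypothetical. It never touches existing `$` characters, so currency amounts are left alone. One simplification to note: runs of three or more newlines in prose are collapsed to a single blank line.

```python
import re

# Fenced blocks and inline code spans are split out first and never rewritten.
_CODE = re.compile(r"(```.*?```|~~~.*?~~~|`[^`\n]+`)", re.DOTALL)
# One pattern, one pass: a converted display body is never rescanned for inline math.
_MATH = re.compile(
    r"\\\[(?P<display>.+?)\\\]|\\\((?P<inline>[^\n]+?)\\\)", re.DOTALL
)


def _convert(match: re.Match) -> str:
    display = match.group("display")
    if display is not None:
        # Delimiters on their own lines, blank line on each side, body kept as-is
        # apart from surrounding whitespace.
        return "\n\n$$\n" + display.strip() + "\n$$\n\n"
    # Inline body is copied verbatim; only the delimiters change.
    return "$" + match.group("inline") + "$"


def _normalize_prose(segment: str) -> str:
    converted = _MATH.sub(_convert, segment)
    # Collapse the extra newlines introduced around display blocks.
    return re.sub(r"\n{3,}", "\n\n", converted)


def normalize_math(text: str) -> str:
    # re.split with a single capturing group places code matches at odd indexes.
    parts = _CODE.split(text)
    return "".join(
        part if index % 2 else _normalize_prose(part)
        for index, part in enumerate(parts)
    )
```

Using a replacement function instead of a replacement string means backslashes, underscores, carets, and braces in the LaTeX body are copied without reinterpretation. After one pass no `\(` or `\[` delimiters remain outside code, so a second pass leaves the text unchanged.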
WP5.3: Asset Link Normalization
Owner:
- obsidian-markdown-agent
- feature-generator-agent
Actions:
- Normalize local image/asset links to stable relative POSIX-style paths.
- Preserve alt text.
- Warn on missing, absolute, escaping, or non-local asset links when the helper has enough local context to judge them.
- Do not fetch or copy assets.
Output:
- Later conversion can produce Markdown links that are stable relative to planned output paths.
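A possible shape for the WP5.3 link rewrite is sketched below. It is a pure string transformation: nothing is fetched, copied, or checked on disk. The function name `normalize_asset_links` is hypothetical, warnings are plain strings here for brevity (the real module would presumably emit WarningRecord values), and the image pattern is deliberately simple, so link titles and targets containing spaces are not handled.

```python
from __future__ import annotations

import posixpath
import re

# Simplified Markdown image pattern: ![alt](target) with no title or spaces.
_IMAGE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<target>[^)\s]+)\)")
_URL_SCHEME = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://")
_WINDOWS_DRIVE = re.compile(r"[a-zA-Z]:/")


def normalize_asset_links(text: str) -> tuple[str, list[str]]:
    warnings: list[str] = []

    def fix(match: re.Match) -> str:
        target = match.group("target")
        if _URL_SCHEME.match(target):
            warnings.append(f"non-local asset link: {target}")
            return match.group(0)  # left untouched, never fetched
        posix = target.replace("\\", "/")
        if posix.startswith("/") or _WINDOWS_DRIVE.match(posix):
            warnings.append(f"absolute asset link: {target}")
            return match.group(0)
        clean = posixpath.normpath(posix)
        if clean == ".." or clean.startswith("../"):
            warnings.append(f"asset link escapes output directory: {target}")
            return match.group(0)
        # Alt text is preserved; only the path is rewritten.
        return f"![{match.group('alt')}]({clean})"

    return _IMAGE.sub(fix, text), warnings
```

`posixpath.normpath` collapses `./` and inner `..` segments without touching the filesystem, which keeps the helper deterministic on any platform.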
WP5.4: Table Preservation And Fallback Warning
Owner:
- obsidian-markdown-agent
- feature-generator-agent
Actions:
- Preserve simple Markdown pipe tables.
- Preserve complex HTML tables without simplifying them destructively.
- Emit a project-owned warning for complex table fallback behavior.
Output:
- Table handling is conservative and traceable instead of silently lossy.
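One conservative way to realize WP5.4 is to leave the text untouched and only report which HTML tables look complex. The sketch below assumes that "complex" means row or column spans or a nested table; the function name `table_warnings` and the message wording are hypothetical, and plain strings stand in for WarningRecord values.

```python
from __future__ import annotations

import re

_HTML_TABLE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)
_SPAN_ATTR = re.compile(r"\b(?:rowspan|colspan)\s*=", re.IGNORECASE)


def table_warnings(markdown: str) -> list[str]:
    """Return one warning per complex HTML table; the Markdown is never modified."""
    warnings: list[str] = []
    for match in _HTML_TABLE.finditer(markdown):
        table = match.group(0)
        reasons = []
        if _SPAN_ATTR.search(table):
            reasons.append("row or column spans")
        # The lazy match stops at the first closing tag, so a second opening
        # tag inside the matched text indicates nesting.
        if table.lower().count("<table") > 1:
            reasons.append("nested table")
        if reasons:
            warnings.append("complex HTML table preserved: " + ", ".join(reasons))
    return warnings
```

Pipe tables produce no warnings and no rewrites in this sketch, which matches the preserve-first stance of the contract.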
WP5.5: Independent Evaluation
Owner:
- evaluation-agent
Actions:
- Review the completed normalizer against this contract.
- Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, metadata writing, report generation, file-copying behavior, or working CLI command was added.
- Verify samples/ remains untracked and unstaged.
Output:
- PASS/FAIL notes with any missing acceptance criteria.
Verification Checks
Required:
- git status --short before staging confirms samples/ remains untracked.
- uv --version is run and result is recorded.
- uv sync passes.
- uv run pytest passes.
- Targeted Markdown normalization tests pass.
- Tests do not require real MinerU, CUDA, GPU, model files, Obsidian, samples/, or network.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion orchestration is implemented.
- No metadata JSON writing or full report generation is implemented.
- No working pdf2md convert or full pdf2md doctor behavior is implemented.
- No final output files are written as product behavior.
- No remote asset fetching is implemented.
- Math delimiter normalization is idempotent.
- Asset paths in normalized Markdown are relative when they are rewritten.
- git diff --check passes.
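The "no network calls" check above can be backed by a guard in the test suite. The following is one possible sketch using only the standard library; the helper name `network_blocked` is hypothetical, and the contract does not prescribe this mechanism. It replaces `socket.socket` for the duration of a block so that any attempt to open a socket fails loudly.

```python
import socket
from contextlib import contextmanager
from unittest import mock


@contextmanager
def network_blocked():
    """Make any attempt to create a socket raise while the block is active."""

    def _refuse(*args, **kwargs):
        raise RuntimeError("network access is disabled in default tests")

    # mock.patch.object restores the original attribute when the block exits.
    with mock.patch.object(socket, "socket", _refuse):
        yield
```

A test would wrap the normalizer call in `with network_blocked():`; in a pytest suite the same idea could be applied to every test through an autouse fixture.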
Recommended:
- Prefer a small tokenizer or state-machine approach over broad regular-expression rewrites for math/code boundary handling.
- Keep normalization helpers pure and deterministic.
- Treat complex tables conservatively: preserve content and warn rather than flattening structure.
- Use requirements-guard-agent if warning codes or output behavior conflict across documents.
Hard Failure Criteria
Sprint 5 fails and must stop for a user decision if any of these are true:
- The normalizer rewrites LaTeX math bodies beyond delimiter and whitespace normalization without deterministic tests.
- The normalizer changes underscores, carets, braces, or backslashes inside math content.
- The normalizer rewrites code fences or inline code spans as math.
- The normalizer produces absolute asset links where relative links are required.
- The normalizer accepts asset links that escape the output/assets context without warning.
- The implementation fetches remote assets or adds any HTTP/network client path.
- The implementation connects normalization to a working conversion CLI/API.
- The implementation adds metadata file writing, full report generation, real MinerU execution, model downloads, or setup scripts.
- Default tests require real MinerU, CUDA, GPU, model files, network, Obsidian, or samples/.
- samples/ is staged or committed.
Acceptance Criteria
Sprint 5 is complete when:
- src/pdf2md/markdown.py exists and owns Obsidian Markdown normalization behavior.
- Inline math delimiter normalization is unit-tested.
- Display math delimiter normalization and blank-line spacing are unit-tested.
- Tests prove underscores and carets inside math are preserved.
- Tests prove fenced code blocks and inline code are not normalized.
- Normalization idempotency is unit-tested.
- Relative asset link normalization is unit-tested.
- Asset warning behavior is unit-tested when missing, absolute, escaping, or non-local links are in scope.
- Simple table preservation and complex table fallback warning behavior are unit-tested.
- Default tests do not require MinerU, GPU, model files, network, Obsidian, or samples/.
- No conversion orchestration, metadata file writing, full report generation, file-copying behavior, or working CLI behavior is implemented.
- uv sync passes.
- uv run pytest passes.
- PROGRESS.md records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.
Handoff Fields
Use these fields when Sprint 5 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 6:
- Next action: