Files
PDFToMD/docs/Sprints/SPRINT5CONTRACT.md
T
2026-05-08 16:42:19 +09:00

312 lines
13 KiB
Markdown

# Sprint 5 Contract: Obsidian Markdown Normalization And Asset Links
Status: Implemented
Last updated: 2026-05-07
## Objective
Build the project-owned Markdown normalization boundary for Obsidian output, using deterministic unit tests before it is connected to conversion orchestration.
Sprint 5 must establish:
- A small Markdown normalization module that accepts local raw Markdown-like text and returns normalized Markdown plus project-owned warnings.
- Obsidian math delimiter normalization for inline and display math.
- Stable relative asset link normalization without copying files or writing final outputs.
- Limited local asset link validation where useful for warnings.
- Table preservation and clear warning behavior when a table cannot be safely simplified.
- Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, network, or Obsidian itself.
Sprint 5 is a normalization contract sprint. It must not connect normalization to the CLI, conversion orchestration, metadata writing, report generation, real MinerU execution, or end-to-end output writing.
## Current Precondition
Sprint 4 is complete:
- `src/pdf2md/paths.py` owns input discovery and output path planning.
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
- `src/pdf2md/mineru_adapter.py` owns the mocked direct local MinerU CLI adapter boundary.
- `uv run pytest` passed 72 tests.
Sprint 5 may use `WarningRecord`, `WarningCode`, and `WarningSeverity` from `ir.py`, but it must not require raw MinerU-specific Python objects as public or required inputs.
## Touched Surfaces
Allowed:
- `src/pdf2md/markdown.py`
- `src/pdf2md/quality.py` only for minimal local asset link check helpers if that boundary is cleaner than placing them in `markdown.py`
- `src/pdf2md/ir.py` only for narrowly required warning codes discovered while implementing table or asset fallback warnings
- `tests/test_markdown.py` or `tests/unit/test_markdown.py`
- `tests/test_quality.py` only if `quality.py` is touched for asset link checks
- `README.md` only if a small note is needed to clarify that normalization tests are mocked/local and not full conversion behavior
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
- `docs/Sprints/SPRINT5CONTRACT.md`
Not allowed:
- `src/pdf2md/conversion.py`
- `src/pdf2md/cli.py`
- `src/pdf2md/mineru_adapter.py`
- `src/pdf2md/report.py`
- Working `pdf2md convert` behavior
- Full `pdf2md doctor` behavior
- `scripts/`
- Any real MinerU invocation in default tests
- Any MinerU/model installation or download script
- Any PDF content parsing
- Any metadata JSON file writing
- Any `.report.md` content generation
- Any runtime engine selection or alternate engine support
- Any remote asset fetch, HTTP client, or cloud/API integration
- Any committed file under `samples/`
## Expected Outputs
Sprint 5 should produce:
1. Normalization records and API
- A small result record or equivalent project-owned return type containing at least:
- normalized Markdown
- warnings
- asset links discovered or normalized when available
- A normalization function with a narrow input surface, such as raw Markdown text plus optional output/assets context.
- No public or required field should expose raw MinerU-specific Python objects.
- The API should be usable by later orchestration without knowing how MinerU represented the original Markdown.
2. Inline math delimiter normalization
- Normalize safe inline math forms to `$...$`.
- Preserve already valid `$...$` inline math.
- Preserve the exact LaTeX body inside inline math except delimiter changes.
- Do not escape or rewrite underscores, carets, braces, or backslashes inside math.
- Do not normalize math delimiters inside fenced code blocks or inline code spans.
- Avoid converting ambiguous dollar signs that look like currency or prose punctuation.
3. Display math delimiter normalization
- Normalize safe display math forms to `$$...$$`.
- Ensure display math delimiters sit on their own lines.
- Keep a blank line around display math blocks.
- Preserve the exact LaTeX body inside display math except delimiter and surrounding whitespace normalization.
- Preserve LaTeX environments such as `equation`, `align`, or `gather` rather than rewriting their semantics.
- Make normalization idempotent: running the normalizer twice should produce the same Markdown.
4. Asset link normalization
- Normalize local image/asset links to stable relative POSIX-style Markdown paths.
- Keep relative links relative; do not turn them into absolute paths.
- Reject or warn on absolute asset links that cannot be represented relative to the planned output/assets context.
- Reject or warn on links that escape the output/assets directory with `..`.
- Do not fetch remote URLs, copy assets, or write files.
- Preserve alt text when rewriting Markdown image links.
5. Table preservation and fallback warnings
- Preserve simple Markdown pipe tables without destructive formatting changes.
- Preserve HTML tables when Markdown would lose row spans, column spans, nested content, or other complex structure.
- Emit a project-owned warning when complex table fallback behavior is detected or when table simplification is intentionally skipped.
- Do not attempt broad table reflow or OCR-style table reconstruction in Sprint 5.
6. Tests
- Unit tests for inline math delimiter normalization.
- Unit tests for display math delimiter normalization and blank-line spacing.
- Unit tests proving underscores and carets inside math are preserved.
- Unit tests proving fenced code blocks and inline code are not normalized.
- Unit tests for idempotency.
- Unit tests for relative asset link normalization.
- Unit tests for missing or escaping asset link warnings when asset checking is implemented.
- Unit tests for simple table preservation.
- Unit tests for complex table fallback warning behavior.
- Unit tests proving no real MinerU binary, model files, GPU, `samples/`, Obsidian installation, or network are required by default.
7. Handoff
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
## Non-Goals
- Do not implement conversion orchestration.
- Do not implement `convert_pdf`.
- Do not implement `pdf2md convert`.
- Do not implement full `pdf2md doctor`.
- Do not invoke MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not parse real PDFs.
- Do not write final Markdown files as product behavior.
- Do not copy or move assets as product behavior.
- Do not write metadata JSON.
- Do not generate `.report.md`.
- Do not compute source SHA-256.
- Do not implement math renderability checks beyond a future-facing warning interface if needed.
- Do not implement full quality report checks.
- Do not implement alternate engines or runtime engine selection.
- Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, or remote asset-fetching support.
## Work Packages
### WP5.1: Normalization Types And Safe Boundaries
Owner:
- `obsidian-markdown-agent`
- `feature-generator-agent`
Actions:
- Define a small Markdown normalization result type.
- Define a focused normalization function.
- Keep warnings project-owned through `WarningRecord`.
- Keep the API independent of raw MinerU objects.
Output:
- Later orchestration can normalize adapter Markdown without knowing MinerU internals.
### WP5.2: Math Delimiter Normalization
Owner:
- `obsidian-markdown-agent`
- `feature-generator-agent`
Actions:
- Normalize safe inline math delimiters to `$...$`.
- Normalize safe display math delimiters to `$$...$$` with stable surrounding blank lines.
- Preserve LaTeX bodies exactly.
- Protect code fences and inline code spans.
- Add idempotency tests.
Output:
- Obsidian-friendly math delimiter behavior is deterministic and covered by unit tests.
### WP5.3: Asset Link Normalization
Owner:
- `obsidian-markdown-agent`
- `feature-generator-agent`
Actions:
- Normalize local image/asset links to stable relative POSIX-style paths.
- Preserve alt text.
- Warn on missing, absolute, escaping, or non-local asset links when the helper has enough local context to judge them.
- Do not fetch or copy assets.
Output:
- Later conversion can produce Markdown links that are stable relative to planned output paths.
### WP5.4: Table Preservation And Fallback Warning
Owner:
- `obsidian-markdown-agent`
- `feature-generator-agent`
Actions:
- Preserve simple Markdown pipe tables.
- Preserve complex HTML tables without simplifying them destructively.
- Emit a project-owned warning for complex table fallback behavior.
Output:
- Table handling is conservative and traceable instead of silently lossy.
### WP5.5: Independent Evaluation
Owner:
- `evaluation-agent`
Actions:
- Review the completed normalizer against this contract.
- Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, metadata writing, report generation, file-copying behavior, or working CLI command was added.
- Verify `samples/` remains untracked and unstaged.
Output:
- PASS/FAIL notes with any missing acceptance criteria.
## Verification Checks
Required:
- `git status --short` before staging confirms `samples/` remains untracked.
- `uv --version` is run and result is recorded.
- `uv sync` passes.
- `uv run pytest` passes.
- Targeted Markdown normalization tests pass.
- Tests do not require real MinerU, CUDA, GPU, model files, Obsidian, `samples/`, or network.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion orchestration is implemented.
- No metadata JSON writing or full report generation is implemented.
- No working `pdf2md convert` or full `pdf2md doctor` behavior is implemented.
- No final output files are written as product behavior.
- No remote asset fetching is implemented.
- Math delimiter normalization is idempotent.
- Asset paths in normalized Markdown are relative when they are rewritten.
- `git diff --check` passes.
Recommended:
- Prefer a small tokenizer or state-machine approach over broad regular-expression rewrites for math/code boundary handling.
- Keep normalization helpers pure and deterministic.
- Treat complex tables conservatively: preserve content and warn rather than flattening structure.
- Use `requirements-guard-agent` if warning codes or output behavior conflict across documents.
## Hard Failure Criteria
Sprint 5 fails and must stop for a user decision if any of these are true:
- The normalizer rewrites LaTeX math bodies beyond delimiter and whitespace normalization without deterministic tests.
- The normalizer changes underscores, carets, braces, or backslashes inside math content.
- The normalizer rewrites code fences or inline code spans as math.
- The normalizer produces absolute asset links where relative links are required.
- The normalizer accepts asset links that escape the output/assets context without warning.
- The implementation fetches remote assets or adds any HTTP/network client path.
- The implementation connects normalization to a working conversion CLI/API.
- The implementation adds metadata file writing, full report generation, real MinerU execution, model downloads, or setup scripts.
- Default tests require real MinerU, CUDA, GPU, model files, network, Obsidian, or `samples/`.
- `samples/` is staged or committed.
## Acceptance Criteria
Sprint 5 is complete when:
- `src/pdf2md/markdown.py` exists and owns Obsidian Markdown normalization behavior.
- Inline math delimiter normalization is unit-tested.
- Display math delimiter normalization and blank-line spacing are unit-tested.
- Tests prove underscores and carets inside math are preserved.
- Tests prove fenced code blocks and inline code are not normalized.
- Normalization idempotency is unit-tested.
- Relative asset link normalization is unit-tested.
- Asset warning behavior is unit-tested when missing, absolute, escaping, or non-local links are in scope.
- Simple table preservation and complex table fallback warning behavior are unit-tested.
- Default tests do not require MinerU, GPU, model files, network, Obsidian, or `samples/`.
- No conversion orchestration, metadata file writing, full report generation, file-copying behavior, or working CLI behavior is implemented.
- `uv sync` passes.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.
## Handoff Fields
Use these fields when Sprint 5 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 6:
- Next action: