# Sprint 5 Contract: Obsidian Markdown Normalization And Asset Links

Status: Implemented
Last updated: 2026-05-07

## Objective

Build the project-owned Markdown normalization boundary for Obsidian output, using deterministic unit tests before it is connected to conversion orchestration.

Sprint 5 must establish:

- A small Markdown normalization module that accepts local raw Markdown-like text and returns normalized Markdown plus project-owned warnings.
- Obsidian math delimiter normalization for inline and display math.
- Stable relative asset link normalization without copying files or writing final outputs.
- Limited local asset link validation where useful for warnings.
- Table preservation and clear warning behavior when a table cannot be safely simplified.
- Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, network, or Obsidian itself.

Sprint 5 is a normalization contract sprint. It must not connect normalization to the CLI, conversion orchestration, metadata writing, report generation, real MinerU execution, or end-to-end output writing.

## Current Precondition

Sprint 4 is complete:

- `src/pdf2md/paths.py` owns input discovery and output path planning.
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
- `src/pdf2md/mineru_adapter.py` owns the mocked direct local MinerU CLI adapter boundary.
- `uv run pytest` passed 72 tests.

Sprint 5 may use `WarningRecord`, `WarningCode`, and `WarningSeverity` from `ir.py`, but it must not require raw MinerU-specific Python objects as public or required inputs.

## Touched Surfaces

Allowed:

- `src/pdf2md/markdown.py`
- `src/pdf2md/quality.py` only for minimal local asset link check helpers if that boundary is cleaner than placing them in `markdown.py`
- `src/pdf2md/ir.py` only for narrowly required warning codes discovered while implementing table or asset fallback warnings
- `tests/test_markdown.py` or `tests/unit/test_markdown.py`
- `tests/test_quality.py` only if `quality.py` is touched for asset link checks
- `README.md` only if a small note is needed to clarify that normalization tests are mocked/local and not full conversion behavior
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
- `docs/Sprints/SPRINT5CONTRACT.md`

Not allowed:

- `src/pdf2md/conversion.py`
- `src/pdf2md/cli.py`
- `src/pdf2md/mineru_adapter.py`
- `src/pdf2md/report.py`
- Working `pdf2md convert` behavior
- Full `pdf2md doctor` behavior
- `scripts/`
- Any real MinerU invocation in default tests
- Any MinerU/model installation or download script
- Any PDF content parsing
- Any metadata JSON file writing
- Any `.report.md` content generation
- Any runtime engine selection or alternate engine support
- Any remote asset fetch, HTTP client, or cloud/API integration
- Any committed file under `samples/`

## Expected Outputs

Sprint 5 should produce:

1. Normalization records and API
   - A small result record or equivalent project-owned return type containing at least:
     - normalized Markdown
     - warnings
     - asset links discovered or normalized when available
   - A normalization function with a narrow input surface, such as raw Markdown text plus optional output/assets context.
   - No public or required field should expose raw MinerU-specific Python objects.
   - The API should be usable by later orchestration without knowing how MinerU represented the original Markdown.

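One possible shape for the result record and entry point described in item 1 is sketched below. Every name here (`NormalizationResult`, `normalize_markdown`, `assets_dir`) is hypothetical, and the `WarningRecord` class is only a stand-in for the real record in `ir.py`, whose fields are not shown in this contract.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WarningRecord:
    """Stand-in for the project's WarningRecord; the real one lives in ir.py."""

    code: str
    message: str


@dataclass(frozen=True)
class NormalizationResult:
    """Hypothetical result record: normalized text, warnings, asset links."""

    markdown: str
    warnings: tuple[WarningRecord, ...] = ()
    asset_links: tuple[str, ...] = ()


def normalize_markdown(raw: str, assets_dir: str | None = None) -> NormalizationResult:
    """Hypothetical entry point taking plain text plus optional assets context.

    Only line endings are normalized in this sketch; math, asset link, and
    table rules would be applied here in a fuller implementation.
    """
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    return NormalizationResult(markdown=text)
```

Because the inputs are plain strings and the outputs are project-owned records, a later caller would not need to know how MinerU represented the original Markdown.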
2. Inline math delimiter normalization
   - Normalize safe inline math forms to `$...$`.
   - Preserve already valid `$...$` inline math.
   - Preserve the exact LaTeX body inside inline math except delimiter changes.
   - Do not escape or rewrite underscores, carets, braces, or backslashes inside math.
   - Do not normalize math delimiters inside fenced code blocks or inline code spans.
   - Avoid converting ambiguous dollar signs that look like currency or prose punctuation.

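The inline rules above could be approximated as follows. This is a minimal sketch, assuming the only non-`$` inline form that needs rewriting is `\(...\)` on a single line; the function name `normalize_inline_math` is hypothetical. Inline code spans are matched first so their contents pass through unchanged, and existing dollar signs are never touched, which sidesteps the currency ambiguity.

```python
import re

# First alternative: an inline code span (consumed so it is left untouched).
# Second alternative: a \( ... \) pair on a single line.
_INLINE = re.compile(r"(`+).+?\1|\\\((.+?)\\\)")


def normalize_inline_math(line: str) -> str:
    """Rewrite \\(...\\) to $...$ outside inline code, keeping the body as-is."""

    def repl(match: re.Match) -> str:
        body = match.group(2)
        if body is None:
            # The match was a code span: return it unchanged.
            return match.group(0)
        # Only the delimiters change; the LaTeX body is copied verbatim.
        return f"${body}$"

    return _INLINE.sub(repl, line)
```

Fenced code blocks are not handled here; a caller would need to exclude them before applying this function line by line.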
3. Display math delimiter normalization
   - Normalize safe display math forms to `$$...$$`.
   - Ensure display math delimiters sit on their own lines.
   - Keep a blank line around display math blocks.
   - Preserve the exact LaTeX body inside display math except delimiter and surrounding whitespace normalization.
   - Preserve LaTeX environments such as `equation`, `align`, or `gather` rather than rewriting their semantics.
   - Make normalization idempotent: running the normalizer twice should produce the same Markdown.

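A sketch of the display math rules, assuming `\[...\]` is the only form being rewritten and that code blocks have already been excluded by the caller. The name `normalize_display_math` is hypothetical. Whitespace around the block is consumed by the pattern so that exactly one blank line surrounds the result, and the output contains no `\[`, so a second run leaves it unchanged.

```python
import re

# Consume surrounding whitespace so the replacement controls blank lines.
_DISPLAY = re.compile(r"\s*\\\[(.*?)\\\]\s*", re.DOTALL)


def normalize_display_math(text: str) -> str:
    """Rewrite \\[...\\] to $$ on its own lines with one blank line around it."""

    def repl(match: re.Match) -> str:
        # Trim only the whitespace at the edges of the body; the LaTeX
        # itself (underscores, carets, braces, backslashes) is untouched.
        body = match.group(1).strip()
        return f"\n\n$$\n{body}\n$$\n\n"

    # Trim leading/trailing whitespace of the document and end with one newline.
    return _DISPLAY.sub(repl, text).strip() + "\n"
```

Note that `\[` can also appear in Markdown as an escaped bracket, so a real implementation would need an extra safety check before treating it as math.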
4. Asset link normalization
   - Normalize local image/asset links to stable relative POSIX-style Markdown paths.
   - Keep relative links relative; do not turn them into absolute paths.
   - Reject or warn on absolute asset links that cannot be represented relative to the planned output/assets context.
   - Reject or warn on links that escape the output/assets directory with `..`.
   - Do not fetch remote URLs, copy assets, or write files.
   - Preserve alt text when rewriting Markdown image links.

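The asset link rules can be illustrated with a pure string transformation. This sketch assumes links are relative to the directory of the output Markdown file, uses plain strings in place of `WarningRecord`, and uses the hypothetical name `normalize_asset_links`. It never touches the filesystem or the network.

```python
from __future__ import annotations

import posixpath
import re

# Markdown image link: ![alt](target) with no spaces in the target.
_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://")
_DRIVE = re.compile(r"[A-Za-z]:/")


def normalize_asset_links(markdown: str) -> tuple[str, list[str]]:
    """Return (rewritten Markdown, warning messages) for image links."""
    warnings: list[str] = []

    def repl(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        if _SCHEME.match(target):
            warnings.append(f"non-local asset link: {target}")
            return match.group(0)
        posix = target.replace("\\", "/")
        if posix.startswith("/") or _DRIVE.match(posix):
            warnings.append(f"absolute asset link: {target}")
            return match.group(0)
        clean = posixpath.normpath(posix)
        if clean == ".." or clean.startswith("../"):
            warnings.append(f"asset link escapes output directory: {target}")
            return match.group(0)
        # Alt text is preserved; only the path is rewritten.
        return f"![{alt}]({clean})"

    return _IMAGE.sub(repl, markdown), warnings
```

Links that trigger a warning are left exactly as written, so nothing is silently rewritten into a path the caller did not plan for.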
5. Table preservation and fallback warnings
   - Preserve simple Markdown pipe tables without destructive formatting changes.
   - Preserve HTML tables when Markdown would lose row spans, column spans, nested content, or other complex structure.
   - Emit a project-owned warning when complex table fallback behavior is detected or when table simplification is intentionally skipped.
   - Do not attempt broad table reflow or OCR-style table reconstruction in Sprint 5.

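A conservative way to meet the table rules is to leave all table text untouched and only report which HTML tables look complex. The sketch below assumes "complex" means a `rowspan`/`colspan` attribute or a nested `<table>`; the function name `find_complex_tables` and the message wording are hypothetical, and plain strings stand in for `WarningRecord`.

```python
import re

_TABLE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)
_SPAN = re.compile(r"\b(?:rowspan|colspan)\s*=", re.IGNORECASE)


def find_complex_tables(markdown: str) -> list[str]:
    """Return one warning message per HTML table that should not be simplified."""
    warnings = []
    for index, match in enumerate(_TABLE.finditer(markdown), start=1):
        html = match.group(0)
        # A second "<table" before the first closing tag means nesting.
        nested = html.lower().count("<table") > 1
        if _SPAN.search(html) or nested:
            warnings.append(f"table {index}: complex HTML table preserved as-is")
    return warnings
```

Pipe tables never match the pattern, so they pass through without warnings and without any reformatting.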
6. Tests
   - Unit tests for inline math delimiter normalization.
   - Unit tests for display math delimiter normalization and blank-line spacing.
   - Unit tests proving underscores and carets inside math are preserved.
   - Unit tests proving fenced code blocks and inline code are not normalized.
   - Unit tests for idempotency.
   - Unit tests for relative asset link normalization.
   - Unit tests for missing or escaping asset link warnings when asset checking is implemented.
   - Unit tests for simple table preservation.
   - Unit tests for complex table fallback warning behavior.
   - Unit tests proving no real MinerU binary, model files, GPU, `samples/`, Obsidian installation, or network are required by default.

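One way a test could demonstrate that no network is needed is to make socket creation fail for the duration of the test. The helper below is a sketch with a hypothetical name, `no_network`; in a pytest suite the same idea would more likely live in a fixture. It only guards code that looks up `socket.socket` at call time, so modules that imported the class directly beforehand are not covered.

```python
import contextlib
import socket


@contextlib.contextmanager
def no_network():
    """Temporarily replace socket.socket so any new socket raises."""
    original = socket.socket

    def _blocked(*args, **kwargs):
        raise RuntimeError("network access attempted during a default test")

    socket.socket = _blocked
    try:
        yield
    finally:
        # Always restore the real class, even if the test body fails.
        socket.socket = original
```

A test would wrap the normalizer call in `with no_network():` and pass only if no `RuntimeError` escapes.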
7. Handoff
   - `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.

## Non-Goals

- Do not implement conversion orchestration.
- Do not implement `convert_pdf`.
- Do not implement `pdf2md convert`.
- Do not implement full `pdf2md doctor`.
- Do not invoke MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not parse real PDFs.
- Do not write final Markdown files as product behavior.
- Do not copy or move assets as product behavior.
- Do not write metadata JSON.
- Do not generate `.report.md`.
- Do not compute source SHA-256.
- Do not implement math renderability checks beyond a future-facing warning interface if needed.
- Do not implement full quality report checks.
- Do not implement alternate engines or runtime engine selection.
- Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, or remote asset-fetching support.

## Work Packages

### WP5.1: Normalization Types And Safe Boundaries

Owner:

- `obsidian-markdown-agent`
- `feature-generator-agent`

Actions:

- Define a small Markdown normalization result type.
- Define a focused normalization function.
- Keep warnings project-owned through `WarningRecord`.
- Keep the API independent of raw MinerU objects.

Output:

- Later orchestration can normalize adapter Markdown without knowing MinerU internals.

### WP5.2: Math Delimiter Normalization

Owner:

- `obsidian-markdown-agent`
- `feature-generator-agent`

Actions:

- Normalize safe inline math delimiters to `$...$`.
- Normalize safe display math delimiters to `$$...$$` with stable surrounding blank lines.
- Preserve LaTeX bodies exactly.
- Protect code fences and inline code spans.
- Add idempotency tests.

Output:

- Obsidian-friendly math delimiter behavior is deterministic and covered by unit tests.

### WP5.3: Asset Link Normalization

Owner:

- `obsidian-markdown-agent`
- `feature-generator-agent`

Actions:

- Normalize local image/asset links to stable relative POSIX-style paths.
- Preserve alt text.
- Warn on missing, absolute, escaping, or non-local asset links when the helper has enough local context to judge them.
- Do not fetch or copy assets.

Output:

- Later conversion can produce Markdown links that are stable relative to planned output paths.

### WP5.4: Table Preservation And Fallback Warning

Owner:

- `obsidian-markdown-agent`
- `feature-generator-agent`

Actions:

- Preserve simple Markdown pipe tables.
- Preserve complex HTML tables without simplifying them destructively.
- Emit a project-owned warning for complex table fallback behavior.

Output:

- Table handling is conservative and traceable instead of silently lossy.

### WP5.5: Independent Evaluation

Owner:

- `evaluation-agent`

Actions:

- Review the completed normalizer against this contract.
- Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, metadata writing, report generation, file-copying behavior, or working CLI command was added.
- Verify `samples/` remains untracked and unstaged.

Output:

- PASS/FAIL notes with any missing acceptance criteria.

## Verification Checks

Required:

- `git status --short` before staging confirms `samples/` remains untracked.
- `uv --version` is run and the result is recorded.
- `uv sync` passes.
- `uv run pytest` passes.
- Targeted Markdown normalization tests pass.
- Tests do not require real MinerU, CUDA, GPU, model files, Obsidian, `samples/`, or network.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion orchestration is implemented.
- No metadata JSON writing or full report generation is implemented.
- No working `pdf2md convert` or full `pdf2md doctor` behavior is implemented.
- No final output files are written as product behavior.
- No remote asset fetching is implemented.
- Math delimiter normalization is idempotent.
- Asset paths in normalized Markdown are relative when they are rewritten.
- `git diff --check` passes.

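The `samples/` check in the list above can be automated by reading the output of `git status --short`, where each line is a two-character status, a space, and a path. The helper below is a sketch with a hypothetical name, `staged_paths_under`; it works on captured text rather than running git itself, and it does not handle renames (`old -> new`) or quoted paths.

```python
def staged_paths_under(status_output: str, prefix: str) -> list[str]:
    """Return paths under `prefix` whose index (first) status column is set."""
    staged = []
    for line in status_output.splitlines():
        if len(line) < 4:
            continue
        index_status, path = line[0], line[3:]
        # " " means unchanged in the index, "?" untracked, "!" ignored.
        if index_status in " ?!":
            continue
        if path.startswith(prefix):
            staged.append(path)
    return staged
```

An empty result for the prefix `samples/` is consistent with the requirement that the directory remains untracked and unstaged.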
Recommended:

- Prefer a small tokenizer or state-machine approach over broad regular-expression rewrites for math/code boundary handling.
- Keep normalization helpers pure and deterministic.
- Treat complex tables conservatively: preserve content and warn rather than flattening structure.
- Use `requirements-guard-agent` if warning codes or output behavior conflict across documents.

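The state-machine recommendation could look like the following sketch, which splits text into fenced-code and ordinary segments so that math rules are applied only to the latter. The name `split_code_segments` is hypothetical, and the fence detection is simplified: it ignores the CommonMark limit on fence indentation and treats any line made only of the opening fence character as a closing fence.

```python
from __future__ import annotations


def split_code_segments(text: str) -> list[tuple[str, str]]:
    """Split text into ("text", chunk) and ("code", chunk) segments."""
    segments: list[tuple[str, str]] = []
    buffer: list[str] = []
    fence = None  # opening fence marker while inside a fenced block

    def flush(kind: str) -> None:
        if buffer:
            segments.append((kind, "".join(buffer)))
            buffer.clear()

    for line in text.splitlines(keepends=True):
        stripped = line.strip()
        if fence is None and stripped[:3] in ("```", "~~~"):
            # Opening fence: close the current text segment first.
            flush("text")
            fence = stripped[:3]
            buffer.append(line)
        elif fence is not None and stripped.startswith(fence) and set(stripped) == {fence[0]}:
            # Closing fence: the fence line belongs to the code segment.
            buffer.append(line)
            flush("code")
            fence = None
        else:
            buffer.append(line)
    flush("code" if fence is not None else "text")
    return segments
```

Joining the chunks in order reproduces the input exactly, so a caller can transform only the `"text"` chunks and reassemble the document without disturbing code.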
## Hard Failure Criteria

Sprint 5 fails and must stop for a user decision if any of these are true:

- The normalizer rewrites LaTeX math bodies beyond delimiter and whitespace normalization without deterministic tests.
- The normalizer changes underscores, carets, braces, or backslashes inside math content.
- The normalizer rewrites code fences or inline code spans as math.
- The normalizer produces absolute asset links where relative links are required.
- The normalizer accepts asset links that escape the output/assets context without warning.
- The implementation fetches remote assets or adds any HTTP/network client path.
- The implementation connects normalization to a working conversion CLI/API.
- The implementation adds metadata file writing, full report generation, real MinerU execution, model downloads, or setup scripts.
- Default tests require real MinerU, CUDA, GPU, model files, network, Obsidian, or `samples/`.
- `samples/` is staged or committed.

## Acceptance Criteria

Sprint 5 is complete when:

- `src/pdf2md/markdown.py` exists and owns Obsidian Markdown normalization behavior.
- Inline math delimiter normalization is unit-tested.
- Display math delimiter normalization and blank-line spacing are unit-tested.
- Tests prove underscores and carets inside math are preserved.
- Tests prove fenced code blocks and inline code are not normalized.
- Normalization idempotency is unit-tested.
- Relative asset link normalization is unit-tested.
- Asset warning behavior is unit-tested when missing, absolute, escaping, or non-local links are in scope.
- Simple table preservation and complex table fallback warning behavior are unit-tested.
- Default tests do not require MinerU, GPU, model files, network, Obsidian, or `samples/`.
- No conversion orchestration, metadata file writing, full report generation, file-copying behavior, or working CLI behavior is implemented.
- `uv sync` passes.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.

## Handoff Fields

Use these fields when Sprint 5 completes:

- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 6:
- Next action: