add pdftomd

This commit is contained in:
김경종
2026-05-08 16:42:19 +09:00
parent 551ab50735
commit 88d6b92283
99 changed files with 47332 additions and 0 deletions
+282
View File
@@ -0,0 +1,282 @@
# Knowledge Base: Local PDF-to-Markdown Converter for Math-Heavy Documents
Last updated: 2026-05-07
## 1. Product Direction
This project will build a local-first PDF-to-Markdown converter for math-heavy academic PDFs and books. The v1 target is intentionally narrow:
- Processing policy: local-only. Do not send user PDFs to cloud OCR or external AI APIs.
- Primary interface: CLI plus Python library.
- Primary output: Obsidian-friendly Markdown.
- Main conversion engine: MinerU 3.1.0.
- Math output: inline math as `$...$`, display math as `$$...$$`.
- Hardware target: NVIDIA GPU.
- PDF scope: digital PDFs with an existing text layer first. Scanned books and poor-quality scans are out of scope for v1 optimization.
- Quality workflow: fully automatic conversion. Low-confidence regions should be logged and represented in metadata, but conversion should not stop.
- Install target: Python with `uv`; scripts should document local model download/setup.
- License posture: personal use. License terms are not a v1 blocker for this local project, but document MinerU and transitive model/package licenses before any redistribution.
The rest of this document records the research basis and implementation implications for these decisions.
This file is background research, not a second requirements source. Use `PRD.md` for product requirements and `ARCHITECTURE.md` for implementation structure.
## 2. Why Math PDF Conversion Is Hard
PDF is a visual/page-description format, not a semantic source format. In scientific PDFs, the displayed equation often survives visually while the source-level LaTeX structure is lost. The Nougat paper states the core issue clearly: scientific knowledge is frequently stored in PDFs, and the PDF format loses semantic information, especially for mathematical expressions. Source: [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418).
For this project, equation conversion must be treated as a document understanding and formula recognition problem, not just text extraction. A robust converter needs to recover:
- Reading order across multi-column layouts.
- Inline and display equation boundaries.
- LaTeX syntax that renders in common Markdown math renderers.
- Tables, captions, figures, references, footnotes, and image assets.
- Page-level provenance for debugging and downstream correction.
Digital PDFs make the problem easier because embedded text can be extracted directly, but formulas may still appear as glyph streams, vector paths, images, or fragmented positioned characters. A v1 product should therefore use the PDF text layer where reliable and use document parsing/OCR models for layout and math reconstruction.
## 3. MinerU 3.1.0 Engine Strategy
MinerU 3.1.0 is the fixed local parser for v1.
Relevant source facts:
- MinerU 3.1.0 is published on PyPI, requires Python `>=3.10,<3.14`, and describes support for PDF, images, DOCX, PPTX, and XLSX to Markdown and JSON. Source: [PyPI mineru 3.1.0](https://pypi.org/project/mineru/3.1.0/).
- MinerU 3.1.0 release notes describe a license move to the MinerU Open Source License, a VLM main model upgrade to `MinerU2.5-Pro-2604-1.2B`, improved image/chart parsing, truncated paragraph merging, cross-page table merging, table-internal image recognition, and native PPTX/XLSX parsing. Source: [MinerU releases](https://github.com/opendatalab/MinerU/releases).
- MinerU's current quick usage documents the CLI shape as `mineru -p <input_path> -o <output_path>` and states that without `--api-url`, the CLI launches a temporary local `mineru-api`. Source: [MinerU quick usage docs](https://opendatalab.github.io/MinerU/usage/quick_usage/).
- Starting with MinerU 3.0, the `mineru` command is an orchestration client on top of `mineru-api`. Source: [MinerU usage guide](https://opendatalab.github.io/MinerU/usage/).
- MinerU model source configuration uses `MINERU_MODEL_SOURCE`, and local models are enabled with `MINERU_MODEL_SOURCE=local`. Source: [MinerU model source docs](https://opendatalab.github.io/MinerU/usage/model_source/).
Implementation implications:
- Wrap MinerU behind a project-owned adapter rather than binding the codebase to its CLI output.
- Preserve both Markdown and structured JSON when available.
- Treat MinerU output as the first-pass parse, then normalize for Obsidian.
- Keep engine name/version, command/options, page ids, and warnings in metadata.
- Expect model downloads and GPU runtime setup to be heavy; document this clearly.
- Treat MinerU 3.1.0's CLI-internal temporary local `mineru-api` as allowed local orchestration.
- Reject `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends in strict-local mode.
- Do not implement runtime engine selection in v1.
Risks:
- MinerU version behavior may change quickly.
- GTX 1070 Ti is below the current documented GPU acceleration recommendation for some MinerU 3.x paths; `pipeline` CPU or limited GPU behavior must be validated locally.
- Some licenses are custom or changed over time; check the exact dependency license before redistribution if the project scope changes.
- Output Markdown may require post-processing to match Obsidian math and asset path expectations.
## 4. Output Standard: Obsidian-Friendly Markdown
The final Markdown should be optimized for Obsidian rendering and long-term note usage.
Rules:
- Inline math: `$...$`.
- Display math: `$$...$$` on separate lines.
- Store extracted images in a sibling assets directory, for example `paper.assets/page-003-figure-01.png`.
- Use relative links from the Markdown file to assets.
- Preserve page boundaries in metadata, not by noisy visible page markers in the main Markdown.
- Prefer normal Markdown tables for simple tables.
- Use HTML tables only when Markdown tables would destroy merged cells or mathematical structure.
- Do not silently drop formulas, figures, captions, or references. If exact conversion fails, keep a best-effort representation and log a warning.
Post-processing should normalize:
- Math delimiters: convert `\(...\)` to `$...$` and `\[...\]` to `$$...$$` where safe.
- Display math spacing: ensure blank lines around display equations.
- Escaping: avoid over-escaping underscores inside math.
- Asset links: convert absolute/generated paths to stable relative paths.
- Heading levels: avoid treating running headers, footers, and page numbers as section headings.
- Repeated hyphenation and line breaks from PDF extraction.
## 5. Metadata and Provenance
Metadata is necessary because fully automatic conversion can produce imperfect formulas and reading order. The converter must keep enough provenance to identify the source page, source region, engine version, warnings, and emitted assets for each result.
The metadata schema is defined in `ARCHITECTURE.md`.
## 6. Evaluation Strategy
Do not rely only on raw text edit distance. The project should evaluate multiple dimensions.
### Benchmarks to Learn From
OmniDocBench:
- Evaluates document parsing across text, formulas, tables, and reading order.
- End-to-end scoring combines text edit distance, table TEDS, and formula CDM.
- Provides formula recognition evaluation for display and inline formulas. Source: [OmniDocBench](https://github.com/opendatalab/OmniDocBench).
Unit-test-style document parsing checks:
- Uses simple, unambiguous, machine-checkable page facts instead of only soft edit-distance comparisons.
- Includes arXiv math, old scans math, tables, headers/footers, multi-column layouts, long/tiny text, and base cases.
- Explicitly notes that small equation symbol swaps can be critical even when edit distance is small.
ParseBench:
- Emphasizes semantic correctness for agentic document parsing: table structure, chart data, formatting, visual grounding, and content faithfulness. Source: [ParseBench](https://arxiv.org/abs/2604.08538).
### Project Acceptance Metrics
For v1, use a small project fixture suite rather than trying to run every public benchmark.
Required fixture categories:
- A short digital PDF with simple inline and display math.
- A math-heavy academic paper with multi-column layout.
- A PDF page containing a table with formulas.
- A PDF with figures, captions, references, and page numbers.
Required checks:
- Markdown file is generated.
- Metadata JSON is generated when requested.
- All pages have provenance records.
- Inline math and display math render in a KaTeX/MathJax-compatible check.
- Asset links resolve.
- No cloud network calls occur during conversion.
- MinerU failures produce warnings and a best-effort output, not a hard crash, unless the input file cannot be opened or no output can be produced.
Recommended quantitative checks:
- Count display math blocks detected.
- Count math render failures.
- Count missing asset links.
- Count pages with warnings.
- Compare selected known formulas against expected LaTeX or normalized render output.
## 7. Implementation Source of Truth
Do not implement directly from this research note.
- Use `PRD.md` for CLI, API, scope, tests, and release criteria.
- Use `ARCHITECTURE.md` for the conversion pipeline, MinerU boundary, intermediate representation, metadata schema, and strict-local enforcement.
## 8. Implementation Risks
- Formula correctness cannot be guaranteed purely by text extraction.
- MinerU behavior may change across versions; adapter tests should pin expected behavior.
- GPU dependencies can be difficult on Windows; `doctor` should detect CUDA, PyTorch, model paths, and engine availability.
- Obsidian math rendering may differ from GitHub or Pandoc.
- Licensing must be reviewed before packaging or redistributing models/tools.
- Fully automatic mode means users may receive imperfect formulas; warnings and metadata are essential.
## 9. Sprint 0 Verification And Engine Update (2026-05-07)
Sprint 0 originally verified MinerU 2.5.4 assumptions before implementation. After Sprint 0, the project owner changed the v1 engine target to MinerU 3.1.0 and redefined strict-local execution. The current implementation target is therefore MinerU 3.1.0, not the earlier 2.5.4 pin.
### 9.1 Recommendation
Recommendation: `go-with-risks` for personal-use v1.
The project can proceed to Sprint 1 after the `uv` workflow is available locally or Sprint 1 explicitly handles the bootstrap gap. Use `mineru[core]==3.1.0` or another explicitly reviewed 3.1.0 installation path until a later sprint contract changes it.
Current strict-local policy:
- Allowed: direct `mineru` CLI execution.
- Allowed: the temporary local `mineru-api` process that MinerU 3.1.0 starts internally when the CLI runs without `--api-url`.
- Prohibited: `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends.
- Setup may download models only when the user explicitly runs setup commands.
- Runtime conversion should use local model paths, for example with `MINERU_MODEL_SOURCE=local`.
### 9.2 Evidence Policy Applied
Sprint 0 claims use official or primary sources where available. Each claim below is marked as either:
- Direct fact: stated by a source or observed from an allowed local command.
- Project inference: a project decision derived from source facts.
Web research was allowed for documentation verification. Runtime converter design remains local-only.
### 9.3 MinerU 3.1.0 Facts
| Area | Confirmed fact | Evidence type | Source | Implementation implication |
| --- | --- | --- | --- | --- |
| Package pin | PyPI lists MinerU `3.1.0`, released 2026-04-17, with Python `>=3.10,<3.14` and extras including `core`, `pipeline`, `vlm`, `vllm`, `lmdeploy`, `mlx`, `gradio`, and `all`. | Direct fact | [PyPI mineru 3.1.0](https://pypi.org/project/mineru/3.1.0/) | Pin v1 setup to MinerU 3.1.0 until changed by a later contract. |
| Install path | Current MinerU docs show `uv pip install -U "mineru[all]"`; PyPI also supports extras including `core`. | Direct fact plus project inference | [PyPI mineru 3.1.0](https://pypi.org/project/mineru/3.1.0/) | Start with `mineru[core]==3.1.0` for the wrapper unless Sprint 1 or Sprint 8 proves that `all` is required. |
| CLI shape | Current docs show `mineru -p <input_path> -o <output_path>`. | Direct fact | [MinerU quick usage docs](https://opendatalab.github.io/MinerU/usage/quick_usage/) | The adapter still calls the `mineru` CLI directly. |
| Local temporary API | MinerU docs state that without `--api-url`, the CLI launches a temporary local `mineru-api`; with `--api-url`, the CLI connects to an existing local or remote FastAPI service. | Direct fact | [MinerU quick usage docs](https://opendatalab.github.io/MinerU/usage/quick_usage/), [MinerU CLI tools docs](https://opendatalab.github.io/MinerU/usage/cli_tools/) | Allow only the CLI-internal temporary local `mineru-api`; reject `--api-url` and user-managed API endpoints. |
| Backend risk | Current docs describe `pipeline`, `vlm`, and `hybrid` paths plus HTTP-client variants for OpenAI-compatible servers. | Direct fact | [PyPI mineru 3.1.0](https://pypi.org/project/mineru/3.1.0/), [MinerU CLI tools docs](https://opendatalab.github.io/MinerU/usage/cli_tools/) | Strict-local validation must reject HTTP client backends and remote OpenAI-compatible backends. |
| Output layout | MinerU output docs list Markdown plus visual debugging files and structured files such as `model.json`, `middle.json`, and `content_list.json`; the exact set depends on backend and input type. | Direct fact | [MinerU output files docs](https://opendatalab.github.io/MinerU/reference/output_files/) | Adapter parsing must tolerate optional files and backend-specific structured output. Keep raw output optional through `--keep-raw`. |
| Model/cache behavior | MinerU uses Hugging Face and ModelScope by default and switches source through `MINERU_MODEL_SOURCE`; local parsing uses `MINERU_MODEL_SOURCE=local` after models are downloaded. | Direct fact | [MinerU model source docs](https://opendatalab.github.io/MinerU/usage/model_source/) | Setup scripts may download models; runtime conversion should require local model paths under strict-local mode. |
| 3.1.0 capability update | Release notes say 3.1.0 upgrades the main VLM model to `MinerU2.5-Pro-2604-1.2B` and improves image/chart parsing, truncated paragraph merging, cross-page table merging, and image recognition inside tables. | Direct fact | [MinerU releases](https://github.com/opendatalab/MinerU/releases) | 3.1.0 is a better target for math-heavy and complex-layout documents, pending local hardware validation. |
Adapter-facing fields that must remain optional until a local MinerU output probe is run:
- Backend/parse method directory names.
- Presence of `content_list`, `middle`, `model`, `layout`, `span`, and `origin` files.
- Page/block confidence and bbox fields in structured output.
- Asset naming and relative paths.
- Exact stdout/stderr warning text.
### 9.4 Local Environment Facts
Allowed local commands run during Sprint 0:
```powershell
python --version
uv --version
nvidia-smi
```
Results:
| Area | Result | Evidence type | Source or command | Impact |
| --- | --- | --- | --- | --- |
| Python | Local Python is `Python 3.12.7`. Python 3.12 target is viable for MinerU 3.1.0's declared `>=3.10,<3.14` range. | Direct fact plus project inference | `python --version`; [PyPI mineru 3.1.0](https://pypi.org/project/mineru/3.1.0/) | Continue targeting Python 3.12. Prefer current 3.12 patch in setup docs, but do not require a patch bump for v1. |
| `uv` | `uv` is not on PATH. PowerShell reported that `uv` is not recognized. | Direct fact | `uv --version`; [uv installation docs](https://docs.astral.sh/uv/getting-started/installation/) | Sprint 1 cannot honestly claim `uv sync` works until `uv` is installed or the Sprint 1 contract includes bootstrap instructions. |
| GPU | `nvidia-smi` detects NVIDIA GeForce GTX 1070 Ti, driver `577.00`, WDDM, 8192 MiB VRAM, about 5034 MiB free at check time. | Direct fact | `nvidia-smi` | GPU is visible, but available VRAM is tight for model workloads and must be reported by `doctor`. |
| Compute capability | GTX 1070 Ti is Pascal and listed as CUDA compute capability `6.1`. | Direct fact | [NVIDIA legacy CUDA GPU table](https://developer.nvidia.com/cuda/gpus/legacy), [NVIDIA GTX 10-series specs](https://www.nvidia.com/en-us/geforce/10-series/10-series-specs/) | Warn on pre-Turing GPUs. Do not assume modern CUDA 12.8/12.9 PyTorch wheels work. |
| PyTorch/CUDA risk | PyTorch project discussion states CUDA 12.8/12.9 builds remove Maxwell/Pascal support. `nvidia-smi` CUDA version is a driver capability ceiling, not proof that PyTorch CUDA works. | Direct fact plus project inference | [PyTorch CUDA architecture support discussion](https://dev-discuss.pytorch.org/t/cuda-toolkit-version-and-architecture-support-update-maxwell-and-pascal-architecture-support-removed-in-cuda-12-8-and-12-9-builds/3128), [PyTorch install docs](https://pytorch.org/get-started/locally/) | `pdf2md doctor` must test actual PyTorch import, CUDA runtime, device name, compute capability, and free memory. |
Future `pdf2md doctor` checks:
- `python --version`, requiring `>=3.12,<3.13`.
- `uv --version`, failing clearly when unavailable.
- PowerShell version and edition on Windows.
- `nvidia-smi` GPU name, driver, WDDM/TCC mode, total/free VRAM.
- GPU compute capability warning for `<7.5`, especially Pascal `6.1`.
- `torch` import, `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, device name, compute capability, and free memory.
- MinerU CLI availability, installed package version, and model/cache configuration.
- Strict-local validation that runtime uses direct CLI, allows only CLI-internal temporary local `mineru-api`, uses local model paths, and rejects `--api-url`, router mode, HTTP client backends, remote APIs, and remote OpenAI-compatible backends.
### 9.5 License And Privacy Facts
| Area | Confirmed fact | Evidence type | Source | Implementation implication |
| --- | --- | --- | --- | --- |
| MinerU 3.1.0 license | PyPI identifies MinerU 3.1.0 as `LicenseRef-MinerU-Open-Source-License`; release notes state MinerU moved from AGPLv3 to a custom license based on Apache 2.0. | Direct fact | [PyPI mineru 3.1.0](https://pypi.org/project/mineru/3.1.0/), [MinerU releases](https://github.com/opendatalab/MinerU/releases) | License is not a blocker for personal v1. Redistribution still needs review. |
| Runtime privacy | MinerU docs expose remote-capable paths, including `--api-url`, router usage, HTTP client backends, and OpenAI-compatible backend URLs. | Direct fact | [MinerU usage guide](https://opendatalab.github.io/MinerU/usage/), [MinerU CLI tools docs](https://opendatalab.github.io/MinerU/usage/cli_tools/) | The wrapper must never upload PDFs, page images, extracted text, or intermediates. Setup downloads are separate from runtime conversion. |
| Model downloads | `mineru-models-download` must use a remote model source for real downloads, while runtime local parsing uses `MINERU_MODEL_SOURCE=local`. | Direct fact | [MinerU model source docs](https://opendatalab.github.io/MinerU/usage/model_source/) | Download scripts are setup-only. Conversion runtime must not download or call remote inference. |
### 9.6 Go-With-Risks Register
| Risk | Later sprint that must absorb it | Required handling |
| --- | --- | --- |
| `uv` is missing locally. | Sprint 1 and Sprint 8 | Install `uv` before claiming scaffold success, or make bootstrap documentation part of Sprint 1. `doctor` must report missing `uv`. |
| GTX 1070 Ti is Pascal compute capability 6.1 and is below current documented GPU acceleration recommendations for some MinerU 3.x paths. | Sprint 8 | `doctor` must verify actual PyTorch/CUDA/MinerU backend availability and warn on pre-Turing GPUs. Setup docs should avoid assuming CUDA 12.8/12.9 PyTorch works on this GPU. |
| MinerU 3.1.0 output layout is source-verified but not locally probed. | Sprint 4 and Sprint 9 | Adapter tests should mock optional outputs first. A later local MinerU probe should confirm real output paths before release. |
| CLI-internal local `mineru-api` is allowed, but user-specified API paths are prohibited. | Sprint 4 and Sprint 8 | Adapter command validation must allow `mineru -p ... -o ...` without `--api-url`, while rejecting `--api-url`, router mode, HTTP client backends, remote APIs, and remote OpenAI-compatible backends. |
| Runtime must be local-only while setup may download packages/models. | Sprint 4 and Sprint 8 | Separate setup/download commands from conversion runtime. Enforce `MINERU_MODEL_SOURCE=local` or equivalent local model configuration in strict-local mode. |
No hard failure criteria are currently met. Direct local MinerU 3.1.0 CLI execution appears suitable for v1 under the redefined strict-local policy, Python 3.12 is compatible with the package metadata, local-only design remains viable, and license posture does not block personal use.
## 10. Source Index
- [Nougat paper](https://arxiv.org/abs/2308.13418)
- [MinerU docs](https://opendatalab.github.io/MinerU/zh/)
- [MinerU GitHub](https://github.com/opendatalab/mineru)
- [PyPI mineru 3.1.0](https://pypi.org/project/mineru/3.1.0/)
- [MinerU releases](https://github.com/opendatalab/MinerU/releases)
- [MinerU usage guide](https://opendatalab.github.io/MinerU/usage/)
- [MinerU quick usage docs](https://opendatalab.github.io/MinerU/usage/quick_usage/)
- [MinerU CLI tools docs](https://opendatalab.github.io/MinerU/usage/cli_tools/)
- [MinerU output files docs](https://opendatalab.github.io/MinerU/reference/output_files/)
- [MinerU model source docs](https://opendatalab.github.io/MinerU/usage/model_source/)
- [uv installation docs](https://docs.astral.sh/uv/getting-started/installation/)
- [PyTorch install docs](https://pytorch.org/get-started/locally/)
- [PyTorch CUDA architecture support discussion](https://dev-discuss.pytorch.org/t/cuda-toolkit-version-and-architecture-support-update-maxwell-and-pascal-architecture-support-removed-in-cuda-12-8-and-12-9-builds/3128)
- [NVIDIA legacy CUDA GPU table](https://developer.nvidia.com/cuda/gpus/legacy)
- [NVIDIA GTX 10-series specs](https://www.nvidia.com/en-us/geforce/10-series/10-series-specs/)
- [OmniDocBench](https://github.com/opendatalab/OmniDocBench)
- [ParseBench](https://arxiv.org/abs/2604.08538)
+298
View File
@@ -0,0 +1,298 @@
# MathJax Local Render Checker Implementation Plan
## Purpose
Add a local MathJax-based render checker so the converter can validate whether extracted LaTeX formulas are likely to render in Obsidian. The checker must remain a quality signal only: failed formulas produce structured warnings, metadata counts, and report entries, but they do not stop conversion when Markdown output can still be produced.
This plan is implementation planning only. It does not add a second PDF conversion engine, cloud service, remote API, or manual review workflow.
Implementation status: implemented on 2026-05-08 with a local Node.js helper, Python MathJax wrapper, conversion integration, doctor diagnostics, setup documentation, and mocked default tests. Real checker execution uses the official local `mathjax` package and requires `npm install` to populate local `node_modules/`.
## Product Context
The project already normalizes inline math to `$...$` and display math to `$$...$$`. `src/pdf2md/quality.py` already has a math renderability boundary through `check_math_renderability()` and `MathCheckerUnavailable`, but current conversions record an info warning when no checker is injected.
The next implementation should replace that unavailable-checker path with a real local MathJax check when local Node.js and MathJax are available.
Relevant existing behavior:
- Conversion remains local-only.
- MinerU 3.1.0 remains the only PDF conversion engine.
- Quality warnings are non-fatal unless no usable output can be produced.
- Metadata and `.report.md` already include `math_render_error_count`.
- Default tests must not require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or sample PDFs.
## References
- Obsidian documents math expressions as MathJax/LaTeX notation: https://help.obsidian.md/advanced-syntax
- MathJax supports Node/server-side use through components: https://docs.mathjax.org/en/v4.1/server/components.html
- MathJax can convert TeX strings to SVG, including `tex2svgPromise()`: https://docs.mathjax.org/en/latest/web/convert.html
## Design Decisions
1. Use MathJax, not KaTeX, as the primary checker.
- Obsidian compatibility is the output standard.
- Obsidian uses MathJax for math rendering.
- KaTeX can remain a future fast preflight option, but it should not define v1 pass/fail behavior.
2. Run MathJax locally through Node.js.
- Do not use a CDN.
- Do not fetch packages at conversion time.
- Do not call remote render APIs.
3. Batch formulas in one Node process per conversion.
- Spawning one process per formula would be too slow for math-heavy papers.
- `samples/MITC공부.pdf` produced 126 math expressions, so batch checking is the practical baseline.
4. Treat unavailable tooling differently from invalid math.
- Missing Node.js, missing MathJax, bad helper path, timeout, or invalid helper JSON should produce an info-level unavailable-checker warning.
- A MathJax parse/render failure for a specific expression should produce a warning-level `MATH_RENDER_FAILED` record and increment `math_render_error_count`.
5. Preserve conversion continuity.
- Math render failures never remove formulas from the Markdown.
- Math render failures do not trigger fallback engines.
- The generated report remains derived from metadata and local checks.
## Proposed Touched Surfaces
- `src/pdf2md/quality.py`
- Replace body-only math iteration with expression records carrying body, display mode, index, and Markdown span.
- Keep code fence and inline code protection.
- Keep unavailable-checker behavior non-fatal.
- `src/pdf2md/math_render.py`
- Add the Python wrapper for local MathJax checking.
- Probe Node.js availability.
- Execute the Node helper with JSON stdin.
- Parse JSON stdout into project-owned check results.
- Convert helper failures into `MathCheckerUnavailable`.
- `tools/mathjax-checker/check.mjs`
- Add the Node helper.
- Load local MathJax components.
- Accept JSON input with expressions.
- Return JSON results only.
- `package.json` and lockfile, or equivalent local setup documentation
- Add a local MathJax Node dependency if the project chooses a committed Node package manifest.
- The implementation should not install npm dependencies during conversion.
- `src/pdf2md/conversion.py`
- Wire the default local checker when available, while preserving dependency injection for tests.
- Keep the public Python API compatible unless a later sprint explicitly changes it.
- `src/pdf2md/doctor.py`
- Add diagnostic checks for Node.js and local MathJax checker availability.
- Report missing MathJax as a warning, not a hard failure, unless the project later decides the checker is mandatory.
- `README.md` or setup documentation
- Document local MathJax checker setup.
- Explain that missing MathJax does not block conversion but leaves renderability unvalidated.
- `PROGRESS.md`
- Record the implementation and verification outcome after completion.
## Data Model Plan
Add a small expression record for quality checking:
```python
@dataclass(frozen=True)
class MathExpression:
index: int
body: str
display: bool
markdown_span: tuple[int, int]
```
The checker output should be project-owned and independent of MathJax internals:
```python
@dataclass(frozen=True)
class MathCheckResult:
ok: bool
message: str = ""
```
If per-expression metadata is later needed, extend warning messages first. Do not expose raw MathJax objects in metadata or public API return values.
## Node Helper Contract
Input over stdin:
```json
{
"expressions": [
{"index": 0, "body": "x^2", "display": false},
{"index": 1, "body": "\\frac{1}{2}", "display": true}
]
}
```
Output over stdout:
```json
{
"results": [
{"index": 0, "ok": true},
{"index": 1, "ok": false, "message": "MathJax error message"}
]
}
```
stderr is reserved for diagnostics only. The Python wrapper should not depend on stderr format.
## Python Wrapper Behavior
The wrapper should:
1. Locate `node`.
2. Locate the helper script.
3. Send all expressions through JSON stdin.
4. Set a deterministic timeout.
5. Require valid JSON stdout.
6. Map each result by expression index.
7. Return `MathCheckResult` values for expression failures.
8. Raise `MathCheckerUnavailable` for tool-level failures.
Recommended default timeout policy:
- Start with one conversion-level MathJax timeout.
- Use a conservative default such as 60 seconds for a document batch.
- Make the timeout test-injectable.
- Do not add CLI flags for timeout unless the user explicitly asks for configuration.
## Integration Plan
1. Refactor math extraction in `quality.py`.
- Add expression records.
- Preserve existing code-block exclusions.
- Preserve inline/display count behavior.
- Add tests for display mode and spans.
2. Add mocked MathJax wrapper tests.
- Fake successful Node JSON response.
- Fake per-expression failure.
- Fake missing `node`.
- Fake timeout.
- Fake invalid JSON.
- Fake mismatched expression indexes.
3. Add the Node helper.
- Keep stdout as JSON only.
- Ensure local package resolution.
- Avoid remote imports or CDN URLs.
4. Wire the checker into conversion.
- If a test injects `math_checker`, use the injected checker.
- Otherwise, build a default local MathJax checker when available.
- If unavailable, keep the current info warning behavior.
5. Extend doctor.
- Report Node.js availability.
- Report local MathJax package/helper availability.
- Keep missing MathJax as WARN.
6. Update setup docs.
- Explain how to install local MathJax dependencies.
- Explain expected report behavior with and without MathJax.
7. Run optional real fixture validation.
- Re-run `samples/MITC공부.pdf` only under an explicit local fixture gate or direct user request.
- Confirm `MATH_RENDER_FAILED` unavailable-checker warning disappears when MathJax is installed.
## Test Plan
Default fast tests, no real Node or MathJax required:
- `uv run pytest tests/test_quality.py`
- `uv run pytest tests/test_conversion.py`
- `uv run pytest tests/test_doctor.py tests/test_cli.py`
- `uv run pytest`
New required tests:
- Extract inline and display expressions with correct `display` values.
- Ignore math-like text inside fenced code and inline code.
- Count failures from injected checker results.
- Preserve conversion success when some formulas fail render checks.
- Preserve info warning when the checker is unavailable.
- Validate Python wrapper command construction and JSON stdin/stdout handling with a fake runner.
- Validate timeout and invalid JSON handling.
- Validate doctor warning output when Node.js or MathJax is missing.
Optional local tests:
- Run the Node helper against a small expression list.
- Run the converter on `samples/MITC공부.pdf`.
- Confirm report fields:
- `Math render error count` is actual failure count.
- Missing checker info warning is absent when MathJax is available.
- Asset link counts remain 0 missing and 0 invalid for the sample.
## Acceptance Criteria
- Default test suite passes without Node.js or MathJax.
- Local-only policy is preserved: no CDN, remote API, or document upload path.
- `pdf2md doctor` reports MathJax checker availability clearly.
- Conversion still succeeds when MathJax is unavailable, with an info warning.
- Conversion still succeeds when individual formulas fail, with warning records.
- `.metadata.json` and `.report.md` show actual math render failure counts when MathJax is available.
- The generated Markdown is not changed by the checker.
## Hard Failure Criteria
- The checker blocks conversion when Markdown output exists.
- The checker uses a remote service or CDN at runtime.
- Default tests require Node.js, MathJax, MinerU, GPU, network, Obsidian, or sample PDFs.
- Raw MathJax output objects become public API return types.
- The report records renderability as successful when the checker did not actually run.
## Open Decisions Before Implementation
1. Dependency packaging:
- Use committed `package.json` and lockfile for a local MathJax package, or document a manual local npm setup.
- Recommended: commit `package.json` and lockfile so setup is reproducible.
2. Default checker activation:
- Recommended: auto-attempt the local checker when available; otherwise emit the existing unavailable-checker info warning.
3. Timeout value:
- Recommended initial default: 60 seconds per document batch, test-injectable and documented.
4. Doctor severity:
- Recommended: missing MathJax checker is WARN, not FAIL, because conversion can still produce useful output.
5. Real fixture gate:
- Recommended: keep real sample conversion explicit and opt-in for tests, but allow direct user-requested runs.
## Suggested Implementation Contract
Objective:
- Implement a local MathJax render checker that validates normalized Markdown math expressions and records failures in metadata/report output without changing conversion continuity.
Expected outputs:
- Python MathJax checker wrapper.
- Node MathJax helper.
- Updated quality extraction for display/inline expression records.
- Doctor warning coverage for missing checker dependencies.
- Setup documentation.
- Fast mocked tests and optional real local checker validation.
Non-goals:
- No cloud rendering.
- No Obsidian app automation.
- No full LaTeX engine.
- No manual review queue.
- No runtime engine selection.
- No correction or rewriting of failed formulas.
Verification:
- `uv run pytest`
- `git diff --check`
- Optional local Node helper smoke test.
- Optional `samples/MITC공부.pdf` conversion after local MathJax setup.
+239
View File
@@ -0,0 +1,239 @@
# Sprint 0 Contract: Source And Environment Verification
Status: Completed
Last updated: 2026-05-07
Result: PASS, go-with-risks
## Objective
Verify the external facts and local environment assumptions needed before any converter implementation starts.
Current amendment: the project target changed after the original Sprint 0 pass from MinerU 2.5 to MinerU 3.1.0. Read MinerU version constraints in this contract as applying to MinerU 3.1.0 for future work.
Sprint 0 must answer whether the planned v1 implementation can proceed with:
- MinerU 3.1.0 through direct local CLI execution only.
- Python 3.12 and `uv`.
- Windows PowerShell on the current workspace.
- NVIDIA GTX 1070 Ti 8GB as the target GPU.
- Local-only processing with no cloud OCR, remote LLM/VLM, hosted parser, `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- The temporary local `mineru-api` process started internally by MinerU 3.1.0 CLI is allowed when CLI runs without `--api-url`.
## Touched Surfaces
Allowed:
- `docs/KNOWLEDGEBASE.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if the sprint sequence or constraints need adjustment
- `docs/Sprints/SPRINT0CONTRACT.md`
- `PROGRESS.md`
Not allowed:
- `pyproject.toml`
- `src/`
- `tests/`
- `scripts/`
- Any converter implementation code
- Any committed file under `samples/`
## Expected Outputs
Sprint 0 should produce a concise source-backed handoff in `docs/KNOWLEDGEBASE.md` under the heading `## 9. Sprint 0 Verification (2026-05-07)`, plus a short status and handoff entry in `PROGRESS.md`.
Evidence requirements:
- Prefer official or primary sources.
- Use non-official sources only when official documentation is incomplete and the source is directly relevant.
- For volatile implementation claims, record source URL, access date, and whether the claim is a direct fact or a project inference.
- Record failures from allowed local commands as environment facts. Do not fix them during Sprint 0.
- Web research is allowed for documentation verification. Runtime converter design must remain local-only.
The handoff must cover:
1. MinerU 3.1.0 local CLI facts
- Install command or supported install path.
- Version command or reliable version detection path.
- Direct local CLI invocation shape for PDF conversion.
- Supported output locations for Markdown, JSON/structured data, assets, logs, and raw diagnostics.
- Whether local execution can be kept free of router/API/HTTP endpoint modes.
2. Python and package workflow facts
- Python 3.12 compatibility status for the planned dependency stack.
- `uv` setup implications.
- Any packaging constraints that affect Sprint 1 scaffolding.
3. GPU and runtime facts
- CUDA/PyTorch expectations relevant to GTX 1070 Ti 8GB.
- Known GPU memory risks or CPU fallback/error-message implications.
- Commands that future `pdf2md doctor` should check.
4. License and privacy facts
- MinerU license status.
- Model-weight license status when identifiable.
- Transitive package/license risks that must be reviewed before redistribution.
- Confirmation that v1 runtime design remains local-only.
5. Implementation go/no-go recommendation
- `go`: proceed to Sprint 1 with listed assumptions.
- `go-with-risks`: proceed but carry specific risks into later sprint contracts. Each risk must name the later sprint that must absorb it.
- `blocked`: stop and ask the user for a requirement change.
## Non-Goals
- Do not scaffold the Python project.
- Do not install MinerU, CUDA, PyTorch, or model weights.
- Do not run full PDF conversion.
- Do not edit `samples/` or commit sample files.
- Do not introduce candidate engine comparisons.
- Do not decide a new conversion engine.
- Do not add implementation abstractions, config systems, or CLI flags.
## Work Packages
### WP0.1: MinerU Source Review
Owner:
- `research-agent`
- `mineru-research` skill
Actions:
- Review official MinerU documentation, MinerU GitHub, release notes/tags, and relevant license files.
- Verify current install, CLI, version, output, model/cache, and local execution facts.
- Record source URLs and access date for durable claims.
Output:
- Source-backed MinerU fact table in `docs/KNOWLEDGEBASE.md` under `## 9. Sprint 0 Verification (2026-05-07)`.
- Any adapter-impacting uncertainty listed in `PROGRESS.md`.
### WP0.2: Local Environment Review
Owner:
- `local-setup-agent`
Actions:
- Check non-invasive local facts, such as Python version, `uv` availability, and GPU visibility when available.
- Review official Python, `uv`, PyTorch/CUDA, and NVIDIA-relevant documentation for compatibility constraints.
- Identify future `pdf2md doctor` checks.
Output:
- Environment compatibility notes and doctor-check requirements.
- Explicit GTX 1070 Ti 8GB risks.
Allowed local commands:
```powershell
python --version
uv --version
nvidia-smi
```
Only run these commands if they are useful for the active research pass. Do not install or modify system packages.
### WP0.3: Output Layout Probe Plan
Owner:
- `mineru-integration-agent`
Actions:
- Define what must be observed from MinerU before implementing the adapter.
- If MinerU is already installed locally, a later user-approved probe may inspect command help or run against a disposable file. Sprint 0 does not require conversion execution.
Output:
- Adapter-facing output layout assumptions.
- List of fields that must remain optional until observed from real MinerU output.
### WP0.4: License And Privacy Review
Owner:
- `license-privacy-agent`
Actions:
- Review MinerU and model/package license sources.
- Distinguish personal/research use from redistribution.
- Check that no planned runtime path uploads PDFs, page images, extracted text, or model intermediates.
Output:
- License/privacy summary with unresolved obligations.
- Blockers if redistribution assumptions are unsafe.
### WP0.5: Contract Evaluation
Owner:
- `evaluation-agent`
Actions:
- Review this contract before Sprint 0 research starts.
- Require concrete evidence expectations and failure thresholds.
- After Sprint 0 research, independently verify that each expected output is present and source-backed.
Output:
- Pass/fail evaluation notes.
- Specific follow-up findings if the contract or results are incomplete.
## Verification Checks
Required:
- Every volatile implementation fact has an official or primary source URL.
- Source-backed claims distinguish direct fact from project inference.
- `docs/KNOWLEDGEBASE.md` remains consistent with `PRD.md` and `ARCHITECTURE.md`.
- No candidate engine comparison is reintroduced.
- No converter implementation code is created.
- `samples/` remains untracked.
- `git diff --check` passes for documentation changes.
Recommended:
- Check all newly added links manually or with a lightweight local link check if available.
- Have `evaluation-agent` review the completed Sprint 0 outputs before proceeding to Sprint 1.
## Hard Failure Criteria
Sprint 0 fails and must stop for a user decision if any of these are true:
- MinerU 3.1.0 cannot be invoked through a direct local CLI path suitable for v1.
- MinerU's required v1 path requires `--api-url`, router mode, HTTP client backends, remote APIs, or remote OpenAI-compatible backend behavior.
- Python 3.12 is incompatible with the required implementation stack.
- Local-only processing cannot be maintained.
- License terms clearly block the intended personal/research use.
- Required evidence cannot be verified from official or primary sources.
## Acceptance Criteria
Sprint 0 is complete when:
- `docs/KNOWLEDGEBASE.md` contains updated Sprint 0 facts with sources.
- `PROGRESS.md` records checks performed, unresolved risks, and go/no-go recommendation.
- Future Sprint 1 can proceed without guessing install, CLI, environment, or licensing assumptions.
- The evaluator review is complete.
- The completed documentation change is committed.
## Handoff Fields
Use these fields when Sprint 0 completes:
- Files changed:
- Sources checked:
- Local commands run:
- Facts confirmed:
- Inferences made:
- Known failures:
- Residual risks:
- Go/no-go recommendation:
- Next action:
+355
View File
@@ -0,0 +1,355 @@
# Sprint 10 Contract: Pre-Conversion PDF Page Chunking
Status: Implemented
Last updated: 2026-05-08
## Objective
Add an opt-in pre-conversion workflow for long PDFs:
1. Split each source PDF into fixed-size chunk PDFs of 20 pages.
2. Convert each chunk PDF independently through the existing MinerU conversion pipeline.
3. Do not merge the generated Markdown files.
The feature is intended to reduce long-document memory/runtime pressure and make partial progress usable when one chunk fails. It must preserve strict-local execution, keep MinerU 3.1.0 as the only conversion engine, and keep default tests independent of real MinerU, GPU, CUDA, model files, network access, Obsidian, LaTeX tooling, and `samples/`.
## Research Summary
Sources checked on 2026-05-08:
- [pypdf PyPI](https://pypi.org/project/pypdf/): current release observed as `6.10.2`, uploaded 2026-04-15; metadata lists `BSD-3-Clause`, Python `>=3.9`, Python 3.12 support, and describes pypdf as a pure-Python PDF library capable of splitting, merging, cropping, and transforming PDF pages.
- [pypdf merging docs](https://pypdf.readthedocs.io/en/stable/user/merging-pdfs.html): `PdfWriter.append()` can append a complete or partial source PDF; examples use zero-based page ranges such as `(0, 10)`. The docs recommend `append` or `merge` over low-level `add_page` / `insert_page`.
- [pypdf streaming docs](https://pypdf.readthedocs.io/en/latest/user/streaming-data.html): `PdfReader` and `PdfWriter` support file-like objects, but the project should write chunk PDFs to local disk because MinerU accepts local file paths.
- [pypdf PdfWriter docs](https://pypdf.readthedocs.io/en/stable/modules/PdfWriter.html): writer operations clone/copy PDF objects into the destination. The docs warn that cloning linked objects can copy more than just the visible page object in some cases, so chunk output size must be checked in tests.
- [MinerU CLI tools docs](https://opendatalab.github.io/MinerU/usage/cli_tools/): the direct `mineru` CLI accepts `-p/--path`, `-o/--output`, `-s/--start`, and `-e/--end`; without `--api-url`, it launches a temporary local `mineru-api`.
- [PyMuPDF PyPI](https://pypi.org/project/PyMuPDF/): PyMuPDF is fast and local, but PyPI lists dual licensing under GNU AGPL v3 or an Artifex commercial license.
- [pikepdf page assembly docs](https://pikepdf.readthedocs.io/en/latest/topics/pages.html): pikepdf can split pages and transfer page-associated data; it is a capable fallback candidate but adds a QPDF-backed dependency and is not needed for a first implementation.
## Recommended Package Decision
Use `pypdf` for Sprint 10.
Rationale:
- It is pure Python and fits the current Python 3.12 + `uv` workflow.
- It has permissive `BSD-3-Clause` metadata on PyPI.
- It directly supports page-level PDF assembly with `PdfReader` / `PdfWriter`.
- It avoids adding PyMuPDF's AGPL/commercial licensing considerations for a simple split-only feature.
- It avoids adding pikepdf/QPDF native dependency complexity before there is evidence that pypdf cannot handle the project samples.
Recommended dependency range for implementation:
```toml
dependencies = [
"pypdf>=6.10.2,<7",
]
```
The implementation adds this dependency to `pyproject.toml` and `uv.lock`.
## Current Precondition
- `pdf2md convert` already converts one PDF or a directory of PDFs.
- Existing conversion output per input includes:
- Markdown
- optional metadata JSON, enabled by default
- `<stem>.report.md`
- assets directory
- optional raw MinerU output
- `plan_outputs()` already enforces overwrite and output-root safety.
- `convert_input()` already handles directory batches and continues after per-file failures.
- The MinerU adapter accepts one PDF path at a time and runs direct local `mineru` CLI.
- `samples/` is local and untracked; do not commit sample PDFs or generated outputs.
## Touched Surfaces
Allowed during implementation:
- `pyproject.toml`
- `uv.lock`
- `src/pdf2md/pdf_splitter.py`
- `src/pdf2md/paths.py`
- `src/pdf2md/conversion.py`
- `src/pdf2md/cli.py`
- `src/pdf2md/ir.py` only if new warning codes or chunk provenance records are required
- `src/pdf2md/metadata.py` only for chunk provenance fields
- `src/pdf2md/report.py` only to expose chunk provenance in reports
- `tests/test_pdf_splitter.py`
- `tests/test_conversion.py`
- `tests/test_cli.py`
- `tests/test_paths.py`
- `tests/test_metadata.py`
- `tests/integration/` for mocked chunk workflow coverage
- `README.md`
- `docs/V1IMPLEMENTATIONPLAN.md`
- `docs/Sprints/SPRINT10CONTRACT.md`
- `PLAN.md`
- `PROGRESS.md`
Not allowed:
- Runtime engine selection or alternate conversion engines.
- Use of cloud OCR, remote LLM/VLM, hosted renderers, hosted document parsers, `--api-url`, router mode, HTTP client backends, remote APIs, or remote OpenAI-compatible backends.
- Mandatory default tests requiring real MinerU, GPU, CUDA, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- Committed files under `samples/`.
- Committed generated conversion outputs.
- Automatic model or package downloads triggered by import time, `doctor`, `convert`, or tests.
- Markdown merge behavior for chunk outputs.
- Claims that chunking improves formula correctness; it is only a processing-control feature.
## Product Behavior
Activation:
- Chunking is opt-in and existing conversion behavior is unchanged when `chunk_pages` is unset.
- CLI: `pdf2md convert INPUT --out OUTPUT_DIR --chunk-pages` uses the default chunk size of 20 pages.
- CLI: `pdf2md convert INPUT --out OUTPUT_DIR --chunk-pages 20` uses an explicit positive chunk size.
- Python API: `convert_pdf(..., chunk_pages=20)` and `convert_input(..., chunk_pages=20)`.
- `convert_pdf()` returns `ConversionResult` without chunking and `BatchConversionResult` when chunk mode is active.
- `chunk_pages` must be `None` or a positive integer.
Chunking behavior:
- If `chunk_pages` is unset, current behavior remains unchanged.
- If `chunk_pages=20` and a PDF has 20 or fewer pages, conversion may either:
- convert the original PDF directly, or
- create one chunk PDF and convert that chunk.
- Recommended: convert the original directly when `total_pages <= chunk_pages` to avoid unnecessary intermediate files.
- If a PDF has more than 20 pages, split it into chunk PDFs with ranges:
- chunk 1: source pages 1-20
- chunk 2: source pages 21-40
- chunk N: remaining pages
- Convert chunk PDFs sequentially, not in parallel. GTX 1070 Ti 8GB memory pressure makes sequential conversion the safer default.
- If one chunk conversion fails, continue with later chunks and report the failed chunk clearly.
- Do not merge Markdown outputs.
Recommended chunk output naming:
```text
<stem>.part-001.pages-001-020.md
<stem>.part-001.pages-001-020.metadata.json
<stem>.part-001.pages-001-020.report.md
<stem>.part-001.pages-001-020.assets/
<stem>.part-002.pages-021-040.md
...
```
Recommended chunk PDF staging:
- Use a temporary working directory.
- Delete temporary chunk PDFs after conversion completes, including when `--keep-raw` is enabled.
- Do not add a separate `--keep-chunks` flag in Sprint 10.
## Provenance Requirements
Each chunk conversion must preserve original-source context.
Required chunk fields in metadata or engine options:
- original source PDF path
- original source SHA-256
- chunk PDF path when retained, or chunk PDF filename when temporary
- chunk index, 1-based
- total chunk count
- source page start, 1-based inclusive
- source page end, 1-based inclusive
- chunk page count
Page provenance must distinguish:
- chunk-local page index, starting at 0 for MinerU output
- original source page number, starting at 1 for user-facing reports
The report should include a short chunk context line, for example:
```text
- Chunk: 2/5, source pages: 21-40
```
## Architecture Plan
### WP10.1: PDF Splitter Module
Owner:
- `feature-generator-agent`
- `mineru-integration-agent`
Actions:
- Add `src/pdf2md/pdf_splitter.py`.
- Define project-owned `PdfChunkPlan`.
- Implement page counting with `pypdf.PdfReader`.
- Implement chunk planning without writing files.
- Implement chunk writing with `pypdf.PdfWriter.append(source, (start, end))` or an equivalent tested `PdfReader`/`PdfWriter` path.
- Use zero-based half-open page ranges internally and one-based inclusive ranges for filenames and reports.
- Reject invalid chunk sizes with clear `ValueError`.
- Fail clearly on encrypted/password-protected PDFs unless a later sprint adds password handling.
Expected output:
- Deterministic chunk plans and local chunk PDFs suitable for the existing MinerU adapter.
### WP10.2: Chunk-Aware Path Planning
Owner:
- `feature-generator-agent`
Actions:
- Extend output planning so chunk outputs are deterministic and conflict-checked before conversion starts.
- Avoid collisions between original-output stems and chunk-output stems.
- Preserve output-root escape prevention.
- Respect `--overwrite`.
- Keep Korean and non-ASCII source stems working.
Expected output:
- A long PDF can produce multiple planned Markdown/metadata/report/assets outputs without overwriting another chunk.
### WP10.3: Conversion Orchestration
Owner:
- `feature-generator-agent`
- `mineru-integration-agent`
Actions:
- Add chunk mode to `convert_pdf()` and `convert_input()`.
- When chunk mode is active, split before calling the MinerU adapter.
- Reuse existing per-PDF conversion path for each chunk PDF rather than creating a second conversion pipeline.
- Continue conversion after a chunk-level failure and aggregate a batch-like result for the source.
- Ensure temporary chunk directories are cleaned up unless raw retention is requested.
- Keep strict-local validation unchanged.
Expected output:
- Long PDF conversion yields separate Markdown/metadata/report/assets outputs per chunk.
### WP10.4: Metadata And Report Chunk Provenance
Owner:
- `metadata-agent`
- `obsidian-markdown-agent`
Actions:
- Add chunk provenance fields without exposing raw pypdf objects.
- Keep existing required metadata fields valid.
- Keep original source provenance visible even though MinerU sees a chunk PDF as input.
- Ensure chunk reports are readable without opening JSON metadata.
Expected output:
- Users can map each output file back to original source pages.
### WP10.5: CLI And API Surface
Owner:
- `feature-generator-agent`
- `requirements-guard-agent`
Actions:
- Add `--chunk-pages [INTEGER]` to `pdf2md convert`.
- Keep chunking disabled unless the option is present.
- Use 20 pages when `--chunk-pages` is present without an explicit value.
- Validate positive integer input.
- Keep `--out`, `--metadata`, `--keep-raw`, `--recursive`, `--overwrite`, `--gpu`, and strict-local behavior unchanged.
- Update README with the long-PDF workflow.
Expected output:
```powershell
uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages 20
```
### WP10.6: Tests
Owner:
- `feature-generator-agent`
- `evaluation-agent`
Default tests must not require real MinerU or sample PDFs.
Required tests:
- Build small local PDF fixtures using `pypdf` blank pages or minimal test PDFs.
- Page count detection for 1, 20, 21, 40, and 41 pages.
- Chunk planning produces expected 1-based filenames and page ranges.
- Chunk writing produces PDFs with the expected page counts.
- Non-positive chunk size is rejected.
- Existing conversion without `--chunk-pages` is unchanged.
- Chunked conversion calls the fake adapter once per chunk.
- Chunked conversion writes separate Markdown, metadata JSON, report Markdown, and assets per chunk.
- `--overwrite` and conflict detection work for all planned chunk outputs.
- A failed chunk does not silently fallback and does not prevent later chunks from being attempted.
- Metadata/report contain original source PDF and source page range.
- CLI validates `--chunk-pages` and prints a useful summary.
Optional local validation:
- Run chunked conversion on a local `samples/` PDF only by explicit user request or opt-in gate.
- Do not commit generated chunk PDFs or outputs.
## Acceptance Criteria
- Sprint 10 implementation can split a PDF into 20-page chunk PDFs before MinerU conversion.
- Chunk PDFs are converted one by one using the existing direct local MinerU CLI adapter.
- Markdown outputs are separate and not merged.
- Metadata/report files show chunk index and original page range.
- Default test suite passes without real MinerU, GPU, CUDA, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- Strict-local policy remains unchanged.
- Existing non-chunked conversion behavior remains backward-compatible.
## Hard Failure Criteria
- Chunking uses a remote PDF service or uploads document content.
- Chunking introduces an alternate Markdown conversion engine.
- Default tests require real MinerU, GPU, CUDA, model files, network, or local samples.
- Chunk outputs overwrite each other or overwrite non-chunk outputs without `--overwrite`.
- Chunk metadata loses original source page provenance.
- The implementation merges Markdown despite this contract's non-merge requirement.
- The implementation silently skips failed chunks without warnings.
## Resolved Decisions
- Activation mode: opt-in with `--chunk-pages`; the option defaults to 20 pages when no value is supplied.
- Chunk PDF retention: temporary chunk PDFs only; they are deleted after conversion completes.
- API return type: `convert_pdf()` returns a `BatchConversionResult` when chunk mode is active.
## Verification Commands For Implementation
```powershell
uv sync
uv run pytest tests/test_pdf_splitter.py tests/test_conversion.py tests/test_cli.py tests/test_paths.py tests/test_metadata.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```
Optional local command after implementation and explicit user approval:
```powershell
uv run pdf2md convert samples/MITC공부.pdf --out outputs --chunk-pages 20 --overwrite
```
## Handoff Requirements
After implementation:
- Update `PROGRESS.md` with files changed, commands run, tests passed, optional sample status, known failures, residual risks, and next action.
- Do not mark the sprint implemented until independent evaluation or equivalent focused review verifies the acceptance criteria.
- Commit the completed change without including `samples/` or generated outputs.
## Implementation Handoff
- Files changed: `pyproject.toml`, `uv.lock`, `src/pdf2md/pdf_splitter.py`, `src/pdf2md/conversion.py`, `src/pdf2md/cli.py`, `src/pdf2md/__init__.py`, `src/pdf2md/report.py`, tests, README, `docs/V1IMPLEMENTATIONPLAN.md`, `PLAN.md`, and `PROGRESS.md`.
- Verification status: targeted unit tests passed 42 tests; the full local test suite passed 163 tests with 1 optional skip; `git diff --check` passed with line-ending warnings only.
- Optional local sample conversion remains out of scope unless explicitly requested.
+258
View File
@@ -0,0 +1,258 @@
# Sprint 1 Contract: Project Scaffold And Fast Test Loop
Status: Completed
Last updated: 2026-05-07
## Objective
Create the minimal Python project scaffold and fast local test loop for the PDF-to-Markdown converter.
Sprint 1 must establish:
- A `uv`-managed Python 3.12 project.
- A source package importable as `pdf2md`.
- A reserved `pdf2md` CLI entry point that does not implement conversion yet.
- A fast test command that runs without MinerU, model downloads, GPU access, sample PDFs, or network access.
Sprint 1 is scaffolding only. It must not implement PDF conversion, MinerU execution, Markdown normalization, metadata generation, or report generation.
## Current Precondition
Sprint 0 found that `uv` was not available on PATH in the current local environment.
Sprint 1 resolved this by installing `uv` per-user at `C:\Users\user\.local\bin`.
Before Sprint 1 can be accepted, one of these must happen:
- `uv` is installed and `uv --version` succeeds.
- The user explicitly approves including `uv` bootstrap documentation or setup handling as part of Sprint 1, and the contract result records that `uv sync` could not be run locally.
Do not silently replace `uv` with another package manager.
## Touched Surfaces
Allowed:
- `pyproject.toml`
- `uv.lock`
- `.gitignore`
- `src/pdf2md/__init__.py`
- `src/pdf2md/cli.py` only for a minimal placeholder CLI if needed for entry point verification
- `tests/`
- `README.md` only for minimal setup/test instructions if needed
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
- `docs/Sprints/SPRINT1CONTRACT.md`
Not allowed:
- `src/pdf2md/conversion.py`
- `src/pdf2md/mineru_adapter.py`
- `src/pdf2md/paths.py`
- `src/pdf2md/ir.py`
- `src/pdf2md/markdown.py`
- `src/pdf2md/metadata.py`
- `src/pdf2md/quality.py`
- `src/pdf2md/report.py`
- `src/pdf2md/doctor.py`
- `scripts/`
- Any real MinerU invocation
- Any model download or install script
- Any committed file under `samples/`
## Expected Outputs
Sprint 1 should produce:
1. Project package scaffold
- `pyproject.toml` with project metadata.
- Python requirement constrained to Python 3.12.
- Build configuration suitable for a `src/` layout.
- `uv.lock` generated by `uv sync`.
- `.gitignore` entries for local virtual environments, pytest cache, and Python bytecode.
- Minimal test dependency configuration.
- CLI entry point name reserved as `pdf2md`.
2. Minimal source package
- `src/pdf2md/__init__.py`.
- A stable package import surface.
- Optional minimal `src/pdf2md/cli.py` placeholder that exits clearly and does not imply conversion is implemented.
3. Fast test loop
- A minimal test suite that verifies the package imports.
- If a CLI placeholder is added, a smoke test that verifies the CLI entry point is wired without invoking conversion.
- Tests must not require MinerU, CUDA, GPU, model files, `samples/`, or network.
4. Developer workflow
- `uv sync` should work when `uv` is installed.
- `uv run pytest` should work when `uv` is installed.
- If `uv` is still missing locally, record the failure explicitly in `PROGRESS.md` and do not mark Sprint 1 complete.
5. Handoff
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
## Non-Goals
- Do not implement PDF discovery.
- Do not implement conversion orchestration.
- Do not implement the MinerU adapter.
- Do not run MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not implement Markdown normalization.
- Do not implement metadata JSON or `.report.md` output.
- Do not implement `pdf2md doctor`; a CLI placeholder may mention future commands, but it must not create a doctor module.
- Do not add runtime engine selection.
- Do not add alternate conversion engines.
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.
## Work Packages
### WP1.1: Scaffold Metadata
Owner:
- `feature-generator-agent`
Actions:
- Create the minimal `pyproject.toml`.
- Use Python 3.12 constraints.
- Configure a `src/` package layout.
- Configure pytest as the fast local test runner.
- Reserve the `pdf2md` console script.
Output:
- A minimal, maintainable scaffold without speculative dependencies.
### WP1.2: Package Import Surface
Owner:
- `feature-generator-agent`
Actions:
- Create `src/pdf2md/__init__.py`.
- Expose only a minimal version/import surface.
- Avoid public API promises beyond what Sprint 1 verifies.
Output:
- `import pdf2md` succeeds.
### WP1.3: CLI Placeholder
Owner:
- `feature-generator-agent`
Actions:
- If needed for console script verification, create `src/pdf2md/cli.py`.
- The placeholder may expose a help message or a clear "not implemented yet" command.
- It must not create conversion flags beyond the reserved command shape unless tests need them.
Output:
- `pdf2md` entry point is wired without implying conversion works.
### WP1.4: Fast Tests
Owner:
- `feature-generator-agent`
- `evaluation-agent`
Actions:
- Add minimal tests for package import and optional CLI placeholder behavior.
- Ensure tests are local, fast, and independent of MinerU/model/GPU/network state.
Output:
- `uv run pytest` passes when `uv` is available.
### WP1.5: Independent Evaluation
Owner:
- `evaluation-agent`
Actions:
- Review the completed scaffold against this contract.
- Verify no converter implementation was added.
- Verify `samples/` remains untracked and unstaged.
- Verify no runtime remote path or alternate engine was introduced.
Output:
- PASS/FAIL notes with any missing acceptance criteria.
## Verification Checks
Required:
- `git status --short` before staging confirms `samples/` remains untracked.
- `uv --version` is run and result is recorded.
- `uv sync` passes if `uv` is available.
- `uv run pytest` passes if `uv` is available.
- If `uv` is unavailable, Sprint 1 is marked blocked rather than complete.
- Import test passes through the configured test command.
- No real MinerU dependency is required for default tests.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion behavior is implemented.
- `git diff --check` passes.
Recommended:
- Keep `pyproject.toml` dependency list minimal.
- Avoid adding README content beyond setup/test instructions needed for the scaffold.
- Use `requirements-guard-agent` to check document consistency if the scaffold reveals a sequencing issue.
## Hard Failure Criteria
Sprint 1 fails and must stop for a user decision if any of these are true:
- `uv` remains unavailable and the user has not approved bootstrap handling.
- The project cannot be installed as a Python 3.12 package.
- The package cannot be imported as `pdf2md`.
- Default tests require MinerU, model downloads, GPU access, sample PDFs, or network access.
- The scaffold introduces conversion logic outside Sprint 1 scope.
- The scaffold introduces alternate engines or runtime engine selection.
- The scaffold introduces `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- `samples/` is staged or committed.
## Acceptance Criteria
Sprint 1 is complete when:
- `pyproject.toml` exists and defines a minimal Python 3.12 `uv` project.
- `src/pdf2md/__init__.py` exists and `import pdf2md` works through the project environment.
- `uv sync` passes.
- `uv run pytest` passes.
- The `pdf2md` CLI entry point is reserved and does not imply conversion is implemented.
- No converter implementation code beyond the allowed placeholder exists.
- No default test depends on MinerU, GPU, model files, network, or `samples/`.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.
## Handoff Fields
Use these fields when Sprint 1 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 2:
- Next action:
+274
View File
@@ -0,0 +1,274 @@
# Sprint 2 Contract: Paths, Input Discovery, And Overwrite Planning
Status: Completed
Last updated: 2026-05-07
## Objective
Implement deterministic input discovery and output path planning before any PDF conversion logic exists.
Sprint 2 must establish:
- A project-owned path planning module for local PDF inputs.
- Deterministic discovery for a single PDF, a directory, and optional recursive directory traversal.
- Deterministic planned output paths for Markdown, assets, metadata JSON, quality report, and optional raw MinerU output.
- Preflight overwrite conflict detection that prevents accidental replacement unless overwrite is explicitly allowed.
- Fast unit tests using generated temporary files, including non-ASCII filenames.
Sprint 2 is path planning only. It must not run MinerU, parse PDFs, write conversion outputs, normalize Markdown, create metadata content, or add the real `convert` command behavior.
## Current Precondition
Sprint 1 is complete:
- `uv` is installed per-user at `C:\Users\user\.local\bin`.
- `pyproject.toml`, `uv.lock`, the `pdf2md` package, CLI placeholder, and fast pytest loop exist.
- `uv sync` and `uv run pytest` passed.
If a new shell cannot find `uv`, prepend `C:\Users\user\.local\bin` to PATH for verification commands and record that in `PROGRESS.md`.
## Touched Surfaces
Allowed:
- `src/pdf2md/paths.py`
- `src/pdf2md/conversion.py` only for a minimal type boundary if path planning cannot be tested cleanly without it
- `src/pdf2md/cli.py` only if a minimal parser hook is needed for path-planning tests; do not expose working conversion behavior
- `tests/test_paths.py` or `tests/unit/test_paths.py`
- `tests/test_cli.py` only for path-planning parser coverage if `cli.py` changes
- `README.md` only if setup/test instructions need a small update
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
- `docs/Sprints/SPRINT2CONTRACT.md`
Not allowed:
- `src/pdf2md/mineru_adapter.py`
- `src/pdf2md/ir.py`
- `src/pdf2md/markdown.py`
- `src/pdf2md/metadata.py`
- `src/pdf2md/quality.py`
- `src/pdf2md/report.py`
- `src/pdf2md/doctor.py`
- `scripts/`
- Any real MinerU invocation
- Any model download or install script
- Any file parsing beyond local filesystem path and extension checks
- Any conversion output writing beyond temporary files created by tests
- Any committed file under `samples/`
## Expected Outputs
Sprint 2 should produce:
1. Input discovery
- Accept a local path that is either a PDF file or a directory.
- Treat `.pdf` extension matching as case-insensitive.
- Reject a non-existent path with a clear project-owned error.
- Reject a non-PDF file with a clear project-owned error.
- Reject a directory with no discovered PDFs with a clear project-owned error.
- Discover only direct child PDFs for directory input unless recursive traversal is requested.
- Discover nested PDFs only when recursive traversal is requested.
- Return discovered PDFs in a deterministic order.
2. Output path plan
- For each discovered PDF, plan:
- Markdown path: `<output-root>/<relative-parent>/<stem>.md`.
- Assets directory: `<output-root>/<relative-parent>/<stem>.assets`.
- Metadata path when metadata is enabled: `<output-root>/<relative-parent>/<stem>.metadata.json`.
- Quality report path: `<output-root>/<relative-parent>/<stem>.report.md`.
- Raw MinerU directory when raw output is kept: `<output-root>/<relative-parent>/<stem>.raw`.
- For a single PDF input, `relative-parent` is empty unless the implementation has a tested reason to preserve more context.
- For recursive directory input, preserve the source-relative subdirectory under the output root to avoid filename collisions.
- Keep planned paths local filesystem paths. Do not introduce URI, URL, cloud, or remote storage handling.
3. Overwrite preflight
- Detect existing planned file or directory outputs before conversion writes occur.
- Report all detected conflicts in one project-owned error instead of failing on the first conflict.
- Allow conflicts only when overwrite is explicitly enabled.
- Do not delete or replace files in Sprint 2.
4. Tests
- Unit tests for single PDF discovery.
- Unit tests for non-recursive directory discovery.
- Unit tests for recursive directory discovery.
- Unit tests for deterministic ordering.
- Unit tests for non-ASCII filenames, including Korean filenames, using temporary files.
- Unit tests for invalid input errors.
- Unit tests for planned Markdown, assets, metadata, report, and raw output paths.
- Unit tests for overwrite conflict detection.
5. Handoff
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
## Non-Goals
- Do not implement PDF conversion.
- Do not implement conversion orchestration.
- Do not implement the MinerU adapter.
- Do not run MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not parse PDF contents.
- Do not compute source SHA-256.
- Do not implement Markdown normalization.
- Do not implement metadata JSON content.
- Do not implement `.report.md` content.
- Do not implement `pdf2md convert` as a working command.
- Do not implement `pdf2md doctor`.
- Do not add runtime engine selection.
- Do not add alternate conversion engines.
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.
## Work Packages
### WP2.1: Path Planning Types And Errors
Owner:
- `feature-generator-agent`
Actions:
- Add the smallest project-owned types needed to represent discovered inputs, planned outputs, and overwrite conflicts.
- Add clear project-owned exceptions or error result types for invalid inputs and conflicts.
- Avoid public API promises beyond what Sprint 2 tests verify.
Output:
- Path planning can be tested without converter execution.
### WP2.2: Input Discovery
Owner:
- `feature-generator-agent`
Actions:
- Implement single PDF and directory discovery.
- Require explicit recursive mode for subdirectory traversal.
- Sort results deterministically.
- Preserve local `Path` objects rather than converting to strings early.
Output:
- Discovery behavior matches PRD directory and recursive requirements.
### WP2.3: Output Planning
Owner:
- `feature-generator-agent`
Actions:
- Plan Markdown, assets, metadata, report, and optional raw output paths.
- Preserve relative subdirectories for recursive directory input.
- Keep all planned outputs under the requested output root.
Output:
- Later conversion code can write outputs without rediscovering naming rules.
### WP2.4: Overwrite Conflict Detection
Owner:
- `feature-generator-agent`
Actions:
- Check whether any planned output already exists.
- Return or raise a structured conflict list when overwrite is not enabled.
- Permit the plan when overwrite is enabled without deleting anything.
Output:
- Existing user files are protected before conversion starts.
### WP2.5: Independent Evaluation
Owner:
- `evaluation-agent`
Actions:
- Review the completed path planning implementation against this contract.
- Verify no conversion behavior, MinerU execution, remote runtime path, or alternate engine was added.
- Verify `samples/` remains untracked and unstaged.
- Verify tests use temporary files, not committed sample PDFs.
Output:
- PASS/FAIL notes with any missing acceptance criteria.
## Verification Checks
Required:
- `git status --short` before staging confirms `samples/` remains untracked.
- `uv --version` is run and result is recorded.
- `uv sync` passes.
- `uv run pytest` passes.
- Targeted path planning tests pass.
- Tests do not require MinerU, CUDA, GPU, model files, `samples/`, or network.
- No real MinerU dependency is required for default tests.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion behavior is implemented.
- No output files are written outside temporary test directories.
- `git diff --check` passes.
Recommended:
- Use temporary directories for all filesystem tests.
- Include Windows-relevant path behavior without hard-coding Windows-only separators in assertions.
- Use `requirements-guard-agent` if path planning reveals a contradiction in PRD or architecture wording.
## Hard Failure Criteria
Sprint 2 fails and must stop for a user decision if any of these are true:
- Directory conversion descends recursively without explicit recursive intent.
- Existing planned outputs can be overwritten without explicit overwrite intent.
- Planned output paths can escape the requested output root.
- Default tests require MinerU, CUDA, GPU, model files, network, or `samples/`.
- The implementation parses PDF contents or invokes conversion behavior.
- The implementation introduces alternate engines or runtime engine selection.
- The implementation introduces `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- `samples/` is staged or committed.
## Acceptance Criteria
Sprint 2 is complete when:
- `src/pdf2md/paths.py` exists and owns input discovery plus output path planning.
- Single PDF discovery is tested.
- Non-recursive and recursive directory discovery are tested.
- Non-ASCII PDF filenames are tested with generated temporary files.
- Markdown, assets, metadata JSON, report Markdown, and optional raw output paths are tested.
- Existing-output conflict detection is tested with and without overwrite enabled.
- No conversion, MinerU, Markdown normalization, metadata content, report content, or doctor behavior is implemented.
- `uv sync` passes.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.
## Handoff Fields
Use these fields when Sprint 2 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 3:
- Next action:
+303
View File
@@ -0,0 +1,303 @@
# Sprint 3 Contract: Domain Records, Metadata, And Warning Model
Status: Completed
Last updated: 2026-05-07
## Objective
Define project-owned domain records, warning records, and metadata JSON construction before binding the system to MinerU output.
Sprint 3 must establish:
- Internal records for documents, pages, blocks, assets, warnings, and conversion outputs.
- Stable warning code and severity definitions aligned with `ARCHITECTURE.md`.
- A metadata builder that produces the required v1 top-level and summary fields.
- Warning aggregation behavior that later report generation can consume.
- Fast unit tests that do not require MinerU, model files, GPU, sample PDFs, or network.
Sprint 3 is schema and metadata modeling only. It must not run MinerU, parse PDFs, normalize Markdown, generate final report Markdown content, expose a working `convert` command, or add remote/runtime engine behavior.
## Current Precondition
Sprint 2 is complete:
- `src/pdf2md/paths.py` owns input discovery and output path planning.
- `tests/test_paths.py` verifies directory recursion, non-ASCII filenames, overwrite conflict detection, duplicate planned outputs, and output-root escape prevention.
- `uv run pytest` passed 21 tests.
Sprint 3 may use path planning records as context, but it should not depend on actual conversion output.
## Touched Surfaces
Allowed:
- `src/pdf2md/ir.py`
- `src/pdf2md/metadata.py`
- `src/pdf2md/report.py` only for a minimal type boundary if metadata/report handoff cannot be expressed cleanly without it
- `src/pdf2md/__init__.py` only if exporting a minimal stable type is necessary and tested
- `tests/test_ir.py` or `tests/unit/test_ir.py`
- `tests/test_metadata.py` or `tests/unit/test_metadata.py`
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
- `docs/Sprints/SPRINT3CONTRACT.md`
Not allowed:
- `src/pdf2md/mineru_adapter.py`
- `src/pdf2md/markdown.py`
- `src/pdf2md/quality.py`
- `src/pdf2md/doctor.py`
- `scripts/`
- Any real MinerU invocation
- Any model download or install script
- Any PDF content parsing
- Any Markdown normalization behavior
- Any `.report.md` content generation beyond a minimal handoff type if absolutely needed
- Any working `pdf2md convert` or `pdf2md doctor` behavior
- Any committed file under `samples/`
## Expected Outputs
Sprint 3 should produce:
1. Domain records
- `DocumentRecord` or equivalent project-owned record.
- `PageRecord` or equivalent with page index and optional page dimensions.
- `BlockRecord` or equivalent with block type, optional page index, optional bbox, optional confidence, and optional Markdown character span.
- `AssetRecord` or equivalent with stable relative path and optional source page/provenance.
- `WarningRecord` or equivalent with code, severity, message, optional page index, and optional bbox.
- `ConversionOutputRecord` or equivalent only if useful for connecting metadata to later orchestration; it must not invoke conversion.
2. Stable enums or constants
- Block types aligned with `ARCHITECTURE.md`: `heading`, `paragraph`, `inline_formula`, `display_formula`, `table`, `figure`, `caption`, `footnote`, `reference`, and `unknown`.
- Warning codes aligned with `ARCHITECTURE.md`, including at least:
- `ENGINE_MISSING`
- `GPU_UNAVAILABLE`
- `LOW_CONFIDENCE_FORMULA`
- `MATH_RENDER_FAILED`
- `ASSET_LINK_MISSING`
- `READING_ORDER_UNCERTAIN`
- `STRICT_LOCAL_VIOLATION`
- `MINERU_CLI_FAILED`
- Warning severity values sufficient for v1 metadata and report summaries, such as `info`, `warning`, and `error`.
3. Metadata builder
- Build a JSON-serializable metadata object with required top-level fields:
- `source_pdf`
- `source_sha256`
- `created_at`
- `engine`
- `engine_version`
- `engine_options`
- `pages`
- `assets`
- `warnings`
- `summary`
- Build required summary fields:
- `pages_processed`
- `warning_count`
- `asset_count`
- `display_formula_count`
- `inline_formula_count`
- `math_render_error_count`
- Preserve optional fields such as bbox and confidence only when present.
- Require `source_sha256` as an input value. Sprint 3 should not compute hashes by reading PDFs unless the contract is explicitly amended.
- Produce only plain Python data structures that `json.dumps` can serialize without custom encoders.
4. Warning aggregation
- Count warnings.
- Count math render failures from `MATH_RENDER_FAILED`.
- Preserve warning order unless there is a tested reason to sort.
- Preserve page-level warning data when available.
5. Tests
- Unit tests for domain record serialization.
- Unit tests for metadata schema creation with all required top-level fields.
- Unit tests for summary counts.
- Unit tests for warning aggregation.
- Unit tests that optional bbox and confidence fields are preserved only when present.
- Unit tests that metadata is JSON serializable.
- Unit tests that metadata requires source PDF, source SHA-256, engine, engine version, and page records.
6. Handoff
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
## Non-Goals
- Do not implement PDF conversion.
- Do not implement conversion orchestration.
- Do not implement the MinerU adapter.
- Do not run MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not parse PDF contents.
- Do not compute source SHA-256 by reading files unless this contract is explicitly amended.
- Do not implement Markdown normalization.
- Do not implement asset link checking.
- Do not implement math renderability checking.
- Do not implement full `.report.md` content generation.
- Do not implement `pdf2md convert` as a working command.
- Do not implement `pdf2md doctor`.
- Do not add runtime engine selection.
- Do not add alternate conversion engines.
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.
## Work Packages
### WP3.1: Domain Record Types
Owner:
- `metadata-agent`
- `feature-generator-agent`
Actions:
- Define small project-owned records for document/page/block/asset/warning concepts.
- Use simple, typed Python structures that are easy to serialize and test.
- Keep MinerU-specific raw objects out of public and required fields.
Output:
- `ir.py` contains the minimal domain model needed by metadata construction.
### WP3.2: Warning Codes And Severities
Owner:
- `metadata-agent`
- `feature-generator-agent`
Actions:
- Define stable warning codes from `ARCHITECTURE.md`.
- Define severity values and validate warning records against them.
- Avoid inventing speculative warning categories beyond the known v1 set unless needed by tests.
Output:
- Warnings are structured, countable, and stable across later sprints.
### WP3.3: Metadata Builder
Owner:
- `metadata-agent`
- `feature-generator-agent`
Actions:
- Build required metadata JSON data from project-owned records.
- Preserve optional provenance fields only when present.
- Require source PDF path, source SHA-256, engine, engine version, pages, assets, warnings, and engine options as explicit inputs.
Output:
- `metadata.py` produces the required v1 metadata object without MinerU execution.
### WP3.4: Metadata And Warning Tests
Owner:
- `feature-generator-agent`
- `evaluation-agent`
Actions:
- Add focused unit tests for schema, counts, optional fields, JSON serialization, and validation failures.
- Use in-memory records and temporary paths only.
Output:
- `uv run pytest` verifies metadata behavior without external dependencies.
### WP3.5: Independent Evaluation
Owner:
- `evaluation-agent`
Actions:
- Review the completed records and metadata builder against this contract.
- Verify no conversion behavior, MinerU execution, remote runtime path, alternate engine, Markdown normalization, quality checks, or report content generation was added.
- Verify `samples/` remains untracked and unstaged.
Output:
- PASS/FAIL notes with any missing acceptance criteria.
## Verification Checks
Required:
- `git status --short` before staging confirms `samples/` remains untracked.
- `uv --version` is run and result is recorded.
- `uv sync` passes.
- `uv run pytest` passes.
- Targeted IR/metadata tests pass.
- Metadata output is JSON serializable through `json.dumps`.
- Tests do not require MinerU, CUDA, GPU, model files, `samples/`, or network.
- No real MinerU dependency is required for default tests.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion behavior is implemented.
- No Markdown normalization behavior is implemented.
- No full `.report.md` content generation is implemented.
- `git diff --check` passes.
Recommended:
- Keep dataclass or enum APIs small and explicit.
- Prefer one serialization function per record over ad hoc dict mutation in tests.
- Include tests that fail if a required metadata top-level field is omitted.
- Use `requirements-guard-agent` if metadata requirements conflict between `PRD.md` and `ARCHITECTURE.md`.
## Hard Failure Criteria
Sprint 3 fails and must stop for a user decision if any of these are true:
- Metadata omits source PDF, source SHA-256, engine, engine version, pages, warnings, assets, or summary.
- Summary omits pages processed, warning count, asset count, display formula count, inline formula count, or math render error count.
- Public or required metadata fields require raw MinerU objects.
- Optional bbox, confidence, or page provenance is dropped when provided.
- Optional bbox, confidence, or page provenance is invented when absent.
- Default tests require MinerU, CUDA, GPU, model files, network, or `samples/`.
- The implementation parses PDF contents, invokes conversion behavior, normalizes Markdown, or generates full report Markdown content.
- The implementation introduces alternate engines or runtime engine selection.
- The implementation introduces `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- `samples/` is staged or committed.
## Acceptance Criteria
Sprint 3 is complete when:
- `src/pdf2md/ir.py` exists and owns project domain records.
- `src/pdf2md/metadata.py` exists and builds required metadata JSON data from project-owned records.
- Stable block types and warning codes are defined and tested.
- Metadata top-level fields and summary fields are tested.
- Warning aggregation is tested.
- Optional bbox and confidence preservation is tested.
- Metadata JSON serializability is tested.
- No conversion, MinerU, Markdown normalization, quality check, full report generation, or doctor behavior is implemented.
- `uv sync` passes.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.
## Handoff Fields
Use these fields when Sprint 3 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 4:
- Next action:
+316
View File
@@ -0,0 +1,316 @@
# Sprint 4 Contract: MinerU Adapter With Mocked Contract
Status: Implemented
Last updated: 2026-05-07
## Objective
Build the direct local MinerU 3.1.0 adapter boundary using mocked subprocess results and fake output directories first.
Sprint 4 must establish:
- A project-owned adapter module that is the only boundary for MinerU CLI interaction.
- Deterministic command construction for direct local MinerU CLI execution.
- Strict-local validation that rejects prohibited remote/API/router/backend options.
- Subprocess execution wrapping that captures stdout, stderr, exit code, command, and generated paths.
- Optional-file parsing for mocked MinerU output directories.
- Clear adapter result and warning records for missing MinerU, failed CLI execution, missing output, and strict-local violations.
- Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, or network.
Sprint 4 is an adapter contract sprint. It must not connect the adapter to real conversion orchestration, Markdown normalization, metadata writing, report generation, or a working `pdf2md convert` command.
## Current Precondition
Sprint 3 is complete:
- `src/pdf2md/paths.py` owns input discovery and output path planning.
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
- `uv run pytest` passed 46 tests.
Sprint 4 may import project-owned warning records from `ir.py`, but it must not require raw MinerU objects as public or required return fields.
## Touched Surfaces
Allowed:
- `src/pdf2md/mineru_adapter.py`
- `src/pdf2md/doctor.py` only for minimal adapter availability/version check types if needed; do not implement full `pdf2md doctor`
- `src/pdf2md/ir.py` only for narrowly required warning/result fields discovered while implementing the adapter contract
- `tests/test_mineru_adapter.py` or `tests/unit/test_mineru_adapter.py`
- `tests/test_doctor.py` only if `doctor.py` is touched for adapter availability/version checks
- `README.md` only if a small note is needed to clarify mocked adapter tests versus real MinerU setup
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
- `docs/Sprints/SPRINT4CONTRACT.md`
Not allowed:
- `src/pdf2md/conversion.py`
- `src/pdf2md/markdown.py`
- `src/pdf2md/quality.py`
- `src/pdf2md/report.py`
- Working `pdf2md convert` behavior
- Full `pdf2md doctor` behavior
- `scripts/`
- Any real MinerU invocation in default tests
- Any MinerU/model installation or download script
- Any PDF content parsing
- Any Markdown normalization behavior
- Any metadata JSON file writing
- Any `.report.md` content generation
- Any runtime engine selection or alternate engine support
- Any committed file under `samples/`
## Expected Outputs
Sprint 4 should produce:
1. Adapter records and options
- A small adapter options record for v1-local MinerU execution.
- A result record containing at least:
- command arguments
- input PDF path
- work/output directory path
- raw Markdown when found
- raw structured data when found
- asset paths when found
- warnings
- engine name fixed to MinerU
- engine version when known
- engine options
- exit code
- stdout
- stderr
- No public or required field should expose raw MinerU-specific Python objects.
2. Availability and version checks
- Check whether `mineru` is available using a mockable local mechanism such as `shutil.which`.
- Check MinerU version using a mockable subprocess call.
- Missing MinerU should produce a clear `ENGINE_MISSING` warning or project-owned adapter error.
- Version command failure should be explicit and testable.
3. Direct local command construction
- Baseline conversion command shape:
```text
mineru -p <input-pdf> -o <work-dir>
```
- The command must not include `--api-url`.
- The command must not include router mode, HTTP client backend flags, remote API URLs, remote OpenAI-compatible backend settings, or runtime engine selection.
- GPU/device options may be represented only if they are strict-local and do not introduce remote/backend choices. If the exact MinerU 3.1.0 flag is uncertain, store the requested option without passing a speculative CLI flag until a later source-verified sprint.
4. Strict-local validation
- Reject prohibited CLI args and options before subprocess execution.
- Reject any value that looks like a remote URL in user-controlled adapter options.
- Allow only direct local `mineru` CLI execution.
- Allow the CLI-internal temporary local `mineru-api` process that MinerU 3.1.0 may start when the CLI runs without `--api-url`.
- Do not implement or call an HTTP client backend.
5. Subprocess wrapper
- Use dependency injection or a small runner boundary so tests can fake subprocess behavior.
- Capture stdout, stderr, and exit code.
- Convert non-zero exit into an adapter result or project-owned error with a `MINERU_CLI_FAILED` warning.
- Do not silently fallback to another engine.
6. Mocked output parsing
- Parse fake output directories using optional-file behavior.
- Raw Markdown is optional and should be read only from mocked local files created by tests.
- Raw structured output is optional and may be represented as a JSON-serializable object or raw text, depending on the fake file extension used in tests.
- Assets are optional and should be collected as local relative or absolute paths according to the adapter result design.
- Missing expected output should produce a clear warning or failed result instead of being silently ignored.
- The adapter must not assume real MinerU output layout is fully known until a later local probe.
7. Tests
- Unit tests for availability check success and missing MinerU.
- Unit tests for version check success and version command failure.
- Unit tests for command construction.
- Unit tests that prohibited `--api-url`, remote URLs, router mode, HTTP backend, and OpenAI-compatible backend options are rejected.
- Unit tests for mocked successful MinerU output.
- Unit tests for mocked non-zero exit.
- Unit tests for mocked missing output.
- Unit tests proving no real MinerU binary, model files, GPU, `samples/`, or network are required by default.
8. Handoff
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
## Non-Goals
- Do not implement conversion orchestration.
- Do not implement `convert_pdf`.
- Do not implement `pdf2md convert`.
- Do not implement full `pdf2md doctor`.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not run real MinerU in default tests.
- Do not parse real PDFs.
- Do not normalize Markdown.
- Do not write final Markdown, metadata JSON, assets, or report files as product behavior.
- Do not compute source SHA-256.
- Do not implement asset link checking.
- Do not implement math renderability checking.
- Do not implement alternate engines or runtime engine selection.
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.
## Work Packages
### WP4.1: Adapter Types And Strict-Local Options
Owner:
- `mineru-integration-agent`
- `feature-generator-agent`
Actions:
- Define minimal adapter options and result records.
- Encode strict-local defaults.
- Reject prohibited remote/API/backend options before command execution.
Output:
- The adapter boundary can be tested without invoking MinerU.
### WP4.2: Availability, Version, And Command Builder
Owner:
- `mineru-integration-agent`
- `feature-generator-agent`
Actions:
- Implement mockable `is_available` and `version` checks.
- Build the direct local command shape `mineru -p <input-pdf> -o <work-dir>`.
- Keep command construction deterministic and easy to inspect in tests.
Output:
- Later orchestration can call the adapter without knowing MinerU CLI details.
### WP4.3: Subprocess Runner Boundary
Owner:
- `feature-generator-agent`
Actions:
- Add a small runner interface or callable boundary for subprocess execution.
- Capture command, stdout, stderr, exit code, and return values.
- Map non-zero exits to structured adapter warnings/errors.
Output:
- Default tests can fake all MinerU behavior.
### WP4.4: Mocked Output Parser
Owner:
- `mineru-integration-agent`
- `feature-generator-agent`
Actions:
- Parse fake Markdown, JSON/structured, asset, and diagnostic outputs from test-created directories.
- Treat all raw MinerU output files as optional until real local output is probed.
- Emit clear warnings for missing usable output.
Output:
- Adapter result objects can carry raw output into later IR/normalization sprints without binding to a guessed full MinerU layout.
### WP4.5: Independent Evaluation
Owner:
- `evaluation-agent`
Actions:
- Review the completed adapter against this contract.
- Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, Markdown normalization, metadata writing, report generation, or working CLI command was added.
- Verify `samples/` remains untracked and unstaged.
Output:
- PASS/FAIL notes with any missing acceptance criteria.
## Verification Checks
Required:
- `git status --short` before staging confirms `samples/` remains untracked.
- `uv --version` is run and result is recorded.
- `uv sync` passes.
- `uv run pytest` passes.
- Targeted MinerU adapter tests pass.
- Tests do not require real MinerU, CUDA, GPU, model files, `samples/`, or network.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion orchestration is implemented.
- No Markdown normalization behavior is implemented.
- No metadata JSON writing or full report generation is implemented.
- No working `pdf2md convert` or full `pdf2md doctor` behavior is implemented.
- Strict-local rejection tests cover `--api-url`, remote URL values, router mode, HTTP backend, and remote OpenAI-compatible backend options.
- `git diff --check` passes.
Recommended:
- Use fake runner classes or functions rather than monkeypatching global subprocess calls everywhere.
- Keep adapter result data JSON-friendly where practical, but do not force metadata schema generation in Sprint 4.
- Include a test that no prohibited remote/API flag appears in the successful command args.
- Use `requirements-guard-agent` if command flags or strict-local wording conflict across documents.
## Hard Failure Criteria
Sprint 4 fails and must stop for a user decision if any of these are true:
- The adapter passes `--api-url`.
- The adapter uses router mode.
- The adapter uses an HTTP client backend.
- The adapter accepts a remote API URL or remote OpenAI-compatible backend for runtime conversion.
- The adapter falls back to another engine after MinerU failure.
- Default tests require real MinerU, CUDA, GPU, model files, network, or `samples/`.
- The implementation installs MinerU or downloads models.
- The implementation connects the adapter to a working conversion CLI/API.
- The implementation adds Markdown normalization, metadata file writing, full report generation, or quality checks.
- The implementation assumes real MinerU output layout is fully known without a later local probe.
- `samples/` is staged or committed.
## Acceptance Criteria
Sprint 4 is complete when:
- `src/pdf2md/mineru_adapter.py` exists and owns direct local MinerU CLI adapter behavior.
- Availability and version checks are mock-tested.
- Conversion command construction is mock-tested and uses `mineru -p <input-pdf> -o <work-dir>`.
- Strict-local validation rejects prohibited remote/API/router/backend options.
- Mocked successful MinerU output produces an adapter result with raw Markdown, raw structured data when available, assets when available, engine info, command args, stdout, stderr, and exit code.
- Mocked non-zero exit produces a clear failure result or project-owned error with a `MINERU_CLI_FAILED` warning.
- Mocked missing MinerU produces a clear `ENGINE_MISSING` warning or project-owned adapter error.
- Default tests do not require MinerU, GPU, model files, network, or `samples/`.
- No conversion orchestration, Markdown normalization, metadata file writing, full report generation, or working CLI behavior is implemented.
- `uv sync` passes.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.
## Handoff Fields
Use these fields when Sprint 4 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 5:
- Next action:
+311
View File
@@ -0,0 +1,311 @@
# Sprint 5 Contract: Obsidian Markdown Normalization And Asset Links
Status: Implemented
Last updated: 2026-05-07
## Objective
Build the project-owned Markdown normalization boundary for Obsidian output, using deterministic unit tests before it is connected to conversion orchestration.
Sprint 5 must establish:
- A small Markdown normalization module that accepts local raw Markdown-like text and returns normalized Markdown plus project-owned warnings.
- Obsidian math delimiter normalization for inline and display math.
- Stable relative asset link normalization without copying files or writing final outputs.
- Limited local asset link validation where useful for warnings.
- Table preservation and clear warning behavior when a table cannot be safely simplified.
- Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, network, or Obsidian itself.
Sprint 5 is a normalization contract sprint. It must not connect normalization to the CLI, conversion orchestration, metadata writing, report generation, real MinerU execution, or end-to-end output writing.
## Current Precondition
Sprint 4 is complete:
- `src/pdf2md/paths.py` owns input discovery and output path planning.
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
- `src/pdf2md/mineru_adapter.py` owns the mocked direct local MinerU CLI adapter boundary.
- `uv run pytest` passed 72 tests.
Sprint 5 may use `WarningRecord`, `WarningCode`, and `WarningSeverity` from `ir.py`, but it must not require raw MinerU-specific Python objects as public or required inputs.
## Touched Surfaces
Allowed:
- `src/pdf2md/markdown.py`
- `src/pdf2md/quality.py` only for minimal local asset link check helpers if that boundary is cleaner than placing them in `markdown.py`
- `src/pdf2md/ir.py` only for narrowly required warning codes discovered while implementing table or asset fallback warnings
- `tests/test_markdown.py` or `tests/unit/test_markdown.py`
- `tests/test_quality.py` only if `quality.py` is touched for asset link checks
- `README.md` only if a small note is needed to clarify that normalization tests are mocked/local and not full conversion behavior
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
- `docs/Sprints/SPRINT5CONTRACT.md`
Not allowed:
- `src/pdf2md/conversion.py`
- `src/pdf2md/cli.py`
- `src/pdf2md/mineru_adapter.py`
- `src/pdf2md/report.py`
- Working `pdf2md convert` behavior
- Full `pdf2md doctor` behavior
- `scripts/`
- Any real MinerU invocation in default tests
- Any MinerU/model installation or download script
- Any PDF content parsing
- Any metadata JSON file writing
- Any `.report.md` content generation
- Any runtime engine selection or alternate engine support
- Any remote asset fetch, HTTP client, or cloud/API integration
- Any committed file under `samples/`
## Expected Outputs
Sprint 5 should produce:
1. Normalization records and API
- A small result record or equivalent project-owned return type containing at least:
- normalized Markdown
- warnings
- asset links discovered or normalized when available
- A normalization function with a narrow input surface, such as raw Markdown text plus optional output/assets context.
- No public or required field should expose raw MinerU-specific Python objects.
- The API should be usable by later orchestration without knowing how MinerU represented the original Markdown.
2. Inline math delimiter normalization
- Normalize safe inline math forms to `$...$`.
- Preserve already valid `$...$` inline math.
- Preserve the exact LaTeX body inside inline math except delimiter changes.
- Do not escape or rewrite underscores, carets, braces, or backslashes inside math.
- Do not normalize math delimiters inside fenced code blocks or inline code spans.
- Avoid converting ambiguous dollar signs that look like currency or prose punctuation.
3. Display math delimiter normalization
- Normalize safe display math forms to `$$...$$`.
- Ensure display math delimiters sit on their own lines.
- Keep a blank line around display math blocks.
- Preserve the exact LaTeX body inside display math except delimiter and surrounding whitespace normalization.
- Preserve LaTeX environments such as `equation`, `align`, or `gather` rather than rewriting their semantics.
- Make normalization idempotent: running the normalizer twice should produce the same Markdown.
4. Asset link normalization
- Normalize local image/asset links to stable relative POSIX-style Markdown paths.
- Keep relative links relative; do not turn them into absolute paths.
- Reject or warn on absolute asset links that cannot be represented relative to the planned output/assets context.
- Reject or warn on links that escape the output/assets directory with `..`.
- Do not fetch remote URLs, copy assets, or write files.
- Preserve alt text when rewriting Markdown image links.
5. Table preservation and fallback warnings
- Preserve simple Markdown pipe tables without destructive formatting changes.
- Preserve HTML tables when Markdown would lose row spans, column spans, nested content, or other complex structure.
- Emit a project-owned warning when complex table fallback behavior is detected or when table simplification is intentionally skipped.
- Do not attempt broad table reflow or OCR-style table reconstruction in Sprint 5.
6. Tests
- Unit tests for inline math delimiter normalization.
- Unit tests for display math delimiter normalization and blank-line spacing.
- Unit tests proving underscores and carets inside math are preserved.
- Unit tests proving fenced code blocks and inline code are not normalized.
- Unit tests for idempotency.
- Unit tests for relative asset link normalization.
- Unit tests for missing or escaping asset link warnings when asset checking is implemented.
- Unit tests for simple table preservation.
- Unit tests for complex table fallback warning behavior.
- Unit tests proving no real MinerU binary, model files, GPU, `samples/`, Obsidian installation, or network are required by default.
7. Handoff
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
## Non-Goals
- Do not implement conversion orchestration.
- Do not implement `convert_pdf`.
- Do not implement `pdf2md convert`.
- Do not implement full `pdf2md doctor`.
- Do not invoke MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not parse real PDFs.
- Do not write final Markdown files as product behavior.
- Do not copy or move assets as product behavior.
- Do not write metadata JSON.
- Do not generate `.report.md`.
- Do not compute source SHA-256.
- Do not implement math renderability checks beyond a future-facing warning interface if needed.
- Do not implement full quality report checks.
- Do not implement alternate engines or runtime engine selection.
- Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, or remote asset-fetching support.
## Work Packages
### WP5.1: Normalization Types And Safe Boundaries
Owner:
- `obsidian-markdown-agent`
- `feature-generator-agent`
Actions:
- Define a small Markdown normalization result type.
- Define a focused normalization function.
- Keep warnings project-owned through `WarningRecord`.
- Keep the API independent of raw MinerU objects.
Output:
- Later orchestration can normalize adapter Markdown without knowing MinerU internals.
### WP5.2: Math Delimiter Normalization
Owner:
- `obsidian-markdown-agent`
- `feature-generator-agent`
Actions:
- Normalize safe inline math delimiters to `$...$`.
- Normalize safe display math delimiters to `$$...$$` with stable surrounding blank lines.
- Preserve LaTeX bodies exactly.
- Protect code fences and inline code spans.
- Add idempotency tests.
Output:
- Obsidian-friendly math delimiter behavior is deterministic and covered by unit tests.
### WP5.3: Asset Link Normalization
Owner:
- `obsidian-markdown-agent`
- `feature-generator-agent`
Actions:
- Normalize local image/asset links to stable relative POSIX-style paths.
- Preserve alt text.
- Warn on missing, absolute, escaping, or non-local asset links when the helper has enough local context to judge them.
- Do not fetch or copy assets.
Output:
- Later conversion can produce Markdown links that are stable relative to planned output paths.
### WP5.4: Table Preservation And Fallback Warning
Owner:
- `obsidian-markdown-agent`
- `feature-generator-agent`
Actions:
- Preserve simple Markdown pipe tables.
- Preserve complex HTML tables without simplifying them destructively.
- Emit a project-owned warning for complex table fallback behavior.
Output:
- Table handling is conservative and traceable instead of silently lossy.
### WP5.5: Independent Evaluation
Owner:
- `evaluation-agent`
Actions:
- Review the completed normalizer against this contract.
- Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, metadata writing, report generation, file-copying behavior, or working CLI command was added.
- Verify `samples/` remains untracked and unstaged.
Output:
- PASS/FAIL notes with any missing acceptance criteria.
## Verification Checks
Required:
- `git status --short` before staging confirms `samples/` remains untracked.
- `uv --version` is run and result is recorded.
- `uv sync` passes.
- `uv run pytest` passes.
- Targeted Markdown normalization tests pass.
- Tests do not require real MinerU, CUDA, GPU, model files, Obsidian, `samples/`, or network.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion orchestration is implemented.
- No metadata JSON writing or full report generation is implemented.
- No working `pdf2md convert` or full `pdf2md doctor` behavior is implemented.
- No final output files are written as product behavior.
- No remote asset fetching is implemented.
- Math delimiter normalization is idempotent.
- Asset paths in normalized Markdown are relative when they are rewritten.
- `git diff --check` passes.
Recommended:
- Prefer a small tokenizer or state-machine approach over broad regular-expression rewrites for math/code boundary handling.
- Keep normalization helpers pure and deterministic.
- Treat complex tables conservatively: preserve content and warn rather than flattening structure.
- Use `requirements-guard-agent` if warning codes or output behavior conflict across documents.
## Hard Failure Criteria
Sprint 5 fails and must stop for a user decision if any of these are true:
- The normalizer rewrites LaTeX math bodies beyond delimiter and whitespace normalization without deterministic tests.
- The normalizer changes underscores, carets, braces, or backslashes inside math content.
- The normalizer rewrites code fences or inline code spans as math.
- The normalizer produces absolute asset links where relative links are required.
- The normalizer accepts asset links that escape the output/assets context without warning.
- The implementation fetches remote assets or adds any HTTP/network client path.
- The implementation connects normalization to a working conversion CLI/API.
- The implementation adds metadata file writing, full report generation, real MinerU execution, model downloads, or setup scripts.
- Default tests require real MinerU, CUDA, GPU, model files, network, Obsidian, or `samples/`.
- `samples/` is staged or committed.
## Acceptance Criteria
Sprint 5 is complete when:
- `src/pdf2md/markdown.py` exists and owns Obsidian Markdown normalization behavior.
- Inline math delimiter normalization is unit-tested.
- Display math delimiter normalization and blank-line spacing are unit-tested.
- Tests prove underscores and carets inside math are preserved.
- Tests prove fenced code blocks and inline code are not normalized.
- Normalization idempotency is unit-tested.
- Relative asset link normalization is unit-tested.
- Asset warning behavior is unit-tested when missing, absolute, escaping, or non-local links are in scope.
- Simple table preservation and complex table fallback warning behavior are unit-tested.
- Default tests do not require MinerU, GPU, model files, network, Obsidian, or `samples/`.
- No conversion orchestration, metadata file writing, full report generation, file-copying behavior, or working CLI behavior is implemented.
- `uv sync` passes.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.
## Handoff Fields
Use these fields when Sprint 5 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 6:
- Next action:
+334
View File
@@ -0,0 +1,334 @@
# Sprint 6 Contract: Quality Checks And Report Generation
Status: Implemented
Last updated: 2026-05-08
## Objective
Build local quality-check and human-readable report generation boundaries from project-owned metadata and normalized Markdown, before they are connected to conversion orchestration.
Sprint 6 must establish:
- A project-owned quality module for local asset-link and math-renderability signals.
- A report module that renders `<stem>.report.md` content from metadata and quality results.
- Deterministic final status calculation: `success`, `partial`, or `failed`.
- Summary fields needed by reports, including missing asset links and math render failures.
- Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, Obsidian, LaTeX tooling, network, or a working conversion CLI.
Sprint 6 is a quality/report contract sprint. It may generate report Markdown content as a string, but it must not connect to the CLI, conversion orchestration, real MinerU execution, file output writing, setup scripts, or end-to-end conversion.
## Current Precondition
Sprint 5 is complete:
- `src/pdf2md/paths.py` owns input discovery and output path planning.
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
- `src/pdf2md/mineru_adapter.py` owns the mocked direct local MinerU CLI adapter boundary.
- `src/pdf2md/markdown.py` owns Obsidian Markdown normalization, asset link warnings, and table fallback warnings.
- `uv run pytest` passed 89 tests.
Sprint 6 may use metadata dictionaries produced by `build_metadata`, project-owned `WarningRecord` values, and normalized Markdown text. It must not require raw MinerU-specific Python objects as public or required inputs.
## Touched Surfaces
Allowed:
- `src/pdf2md/quality.py`
- `src/pdf2md/report.py`
- `src/pdf2md/metadata.py` only for narrowly required summary fields or helper functions that keep metadata/report consistency
- `src/pdf2md/ir.py` only for narrowly required warning codes discovered while implementing quality checks
- `tests/test_quality.py`
- `tests/test_report.py`
- `tests/test_metadata.py` only if `metadata.py` changes
- `README.md` only if a small note is needed to clarify mocked/local quality and report behavior
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
- `docs/Sprints/SPRINT6CONTRACT.md`
Not allowed:
- `src/pdf2md/conversion.py`
- `src/pdf2md/cli.py`
- `src/pdf2md/mineru_adapter.py`
- Working `pdf2md convert` behavior
- Full `pdf2md doctor` behavior
- `scripts/`
- Any real MinerU invocation in default tests
- Any MinerU/model installation or download script
- Any PDF content parsing
- Any final Markdown file writing
- Any metadata JSON file writing
- Any `.report.md` file writing as product behavior
- Any asset copying or moving
- Any runtime engine selection or alternate engine support
- Any remote asset fetch, HTTP client, cloud/API integration, hosted renderer, or remote math-render service
- Any committed file under `samples/`
## Expected Outputs
Sprint 6 should produce:
1. Quality result records and API
- A small project-owned quality result type containing at least:
- missing asset link count
- invalid asset link count when available
- math render error count
- warnings produced by quality checks
- A local asset-link check function that accepts normalized Markdown and local asset context without writing files.
- A math renderability check interface that accepts a local checker callable or reports tool-unavailable behavior gracefully.
- No public or required field should expose raw MinerU-specific Python objects.
2. Asset-link quality checks
- Count missing local asset links in Markdown.
- Count invalid links that are absolute, parent-escaping, remote, or otherwise non-local according to project policy.
- Produce project-owned warnings for missing or invalid asset links.
- Keep all checks local and deterministic.
- Do not fetch remote URLs, copy assets, move assets, or write files.
3. Math renderability checks
- Provide a boundary for local math renderability checking.
- Default tests must use fake/local checker callables.
- Tool-unavailable behavior must be explicit and non-fatal.
- Render failures must produce `MATH_RENDER_FAILED` warnings and count toward the report.
- The checker must not call network services or require a LaTeX/Obsidian install in default tests.
4. Metadata summary consistency
- Preserve existing required metadata summary fields.
- Add or derive report-needed counts without breaking existing metadata tests:
- missing asset link count
- invalid asset link count
- math render error count
- Warning order and warning counts must remain deterministic.
- Reports must be derived from metadata and quality results, not independently duplicated state.
5. Report Markdown generation
- Render a human-readable `<stem>.report.md` content string from metadata and quality results.
- Include at least:
- source PDF path
- output Markdown path when provided
- metadata path when provided
- report path when provided
- MinerU engine/version and execution mode/options
- pages processed
- warning count
- asset count
- missing asset link count
- inline formula count
- display formula count
- math render error count
- pages with warnings
- final status: `success`, `partial`, or `failed`
- The report must not invent facts that are absent from metadata; absent optional paths should be omitted or clearly shown as unavailable.
- The report generator must not write files in Sprint 6.
6. Final status policy
- `failed`: metadata or quality warnings contain at least one `error` severity warning.
- `partial`: no error severity warnings, but warnings or quality failures exist.
- `success`: no warnings and no quality failures.
- The status function must be unit-tested and reusable by later orchestration.
7. Tests
- Unit tests for missing asset link counting.
- Unit tests for invalid/remote/escaping asset link warnings.
- Unit tests for math render failure aggregation with a fake checker.
- Unit tests for math checker unavailable behavior.
- Unit tests for report content and required sections.
- Unit tests proving report content is derived from metadata and quality results.
- Unit tests for pages-with-warnings summary.
- Unit tests for final status calculation.
- Unit tests proving no real MinerU binary, model files, GPU, `samples/`, Obsidian, LaTeX install, or network are required by default.
8. Handoff
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
## Non-Goals
- Do not implement conversion orchestration.
- Do not implement `convert_pdf`.
- Do not implement `pdf2md convert`.
- Do not implement full `pdf2md doctor`.
- Do not invoke MinerU.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not parse real PDFs.
- Do not write final Markdown files.
- Do not copy or move assets.
- Do not write metadata JSON files.
- Do not write `.report.md` files as product behavior.
- Do not compute source SHA-256.
- Do not implement real LaTeX, KaTeX, MathJax, or Obsidian rendering in default tests.
- Do not add setup scripts.
- Do not implement full local environment diagnostics.
- Do not implement alternate engines or runtime engine selection.
- Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, hosted renderer, or remote asset-fetching support.
## Work Packages
### WP6.1: Quality Result Types And Asset Checks
Owner:
- `metadata-agent`
- `feature-generator-agent`
Actions:
- Define a small project-owned quality result type.
- Add deterministic local asset link checks over normalized Markdown.
- Count missing, invalid, escaping, absolute, and remote asset references.
- Return project-owned warnings without writing files.
Output:
- Later orchestration can add local quality results to metadata/report flow without duplicating asset-link logic.
### WP6.2: Math Renderability Boundary
Owner:
- `obsidian-markdown-agent`
- `metadata-agent`
- `feature-generator-agent`
Actions:
- Define a local math render checker interface.
- Support fake checkers in tests.
- Treat checker-unavailable as explicit non-fatal warning/info according to the implementation design.
- Treat render failures as `MATH_RENDER_FAILED` warnings and count them.
Output:
- Math renderability is represented as a local, testable boundary without external dependencies.
### WP6.3: Metadata Summary Extensions
Owner:
- `metadata-agent`
- `feature-generator-agent`
Actions:
- Preserve existing required metadata summary fields.
- Add or derive counts needed by reports in a backward-compatible way.
- Keep metadata JSON serializable and deterministic.
Output:
- Metadata remains the source of truth for report counts and warning summaries.
### WP6.4: Report Markdown Rendering
Owner:
- `metadata-agent`
- `feature-generator-agent`
Actions:
- Implement report content rendering from metadata plus quality results.
- Include required report sections and final status.
- Generate content only; do not write files.
Output:
- Later orchestration can write `<stem>.report.md` by using the tested report renderer.
### WP6.5: Independent Evaluation
Owner:
- `evaluation-agent`
Actions:
- Review completed quality/report behavior against this contract.
- Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, final output writing, CLI behavior, or sample dependency was added.
- Verify `samples/` remains untracked and unstaged.
Output:
- PASS/FAIL notes with any missing acceptance criteria.
## Verification Checks
Required:
- `git status --short` before staging confirms `samples/` remains untracked.
- `uv --version` is run and result is recorded.
- `uv sync` passes.
- `uv run pytest` passes.
- Targeted quality/report tests pass.
- Tests do not require real MinerU, CUDA, GPU, model files, Obsidian, LaTeX tooling, `samples/`, or network.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No conversion orchestration is implemented.
- No working `pdf2md convert` or full `pdf2md doctor` behavior is implemented.
- No final Markdown, metadata JSON, or `.report.md` files are written as product behavior.
- No remote asset fetching is implemented.
- No real math renderer dependency is required by default tests.
- Report counts match metadata and quality results.
- Report generation does not re-run MinerU.
- `git diff --check` passes.
Recommended:
- Keep quality helpers pure and deterministic.
- Use fake checkers for math renderability tests.
- Keep report rendering stable enough for snapshot-like unit assertions.
- Use `requirements-guard-agent` if warning codes, summary fields, or report wording conflict across documents.
## Hard Failure Criteria
Sprint 6 fails and must stop for a user decision if any of these are true:
- Report content diverges from metadata or quality result counts.
- Math render failures are silently ignored.
- Quality checks require network access.
- The implementation fetches remote assets or adds any HTTP/network client path.
- The implementation requires a real LaTeX/Obsidian/MathJax/KaTeX install in default tests.
- The implementation connects quality/report behavior to a working conversion CLI/API.
- The implementation writes final Markdown, metadata JSON, `.report.md`, or copied assets as product behavior.
- The implementation invokes MinerU, downloads models, adds setup scripts, or parses real PDFs.
- Default tests require real MinerU, CUDA, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- `samples/` is staged or committed.
## Acceptance Criteria
Sprint 6 is complete when:
- `src/pdf2md/quality.py` exists and owns local quality-check behavior.
- `src/pdf2md/report.py` exists and owns human-readable report content rendering.
- Missing asset link counting is unit-tested.
- Invalid, escaping, absolute, or remote asset link warning behavior is unit-tested.
- Math render failure aggregation is unit-tested with fake checkers.
- Math checker unavailable behavior is unit-tested and non-fatal.
- Report content includes the required sections and counts.
- Pages-with-warnings summary is unit-tested.
- Final status calculation is unit-tested.
- Report generation is proven not to write files or re-run MinerU.
- Default tests do not require MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- No conversion orchestration, final output file writing, working CLI behavior, real MinerU execution, or setup script is implemented.
- `uv sync` passes.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.
## Handoff Fields
Use these fields when Sprint 6 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 7:
- Next action:
+360
View File
@@ -0,0 +1,360 @@
# Sprint 7 Contract: Conversion Orchestrator, CLI, And Python API
Status: Implemented
Last updated: 2026-05-08
## Objective
Connect the existing project-owned boundaries into a working conversion orchestration layer, public Python API, and `pdf2md convert` CLI path.
Sprint 7 must establish:
- A public `convert_pdf` API for one local PDF.
- A batch conversion API or helper for directory inputs.
- A `pdf2md convert INPUT --out OUTPUT_DIR` command.
- Product behavior that writes Markdown, optional metadata JSON, and `<stem>.report.md`.
- Local asset materialization for adapter-provided asset files.
- CLI summaries that surface success, failure, and warning counts.
- Fast tests that use fake adapter outputs and do not require real MinerU, model files, GPU, sample PDFs, network, Obsidian, or LaTeX tooling.
Sprint 7 is an orchestration sprint. It may call the real `MinerUAdapter` in the normal production path, but the default test suite must use injected fake adapters and must not execute MinerU.
## Current Precondition
Sprint 6 is complete:
- `src/pdf2md/paths.py` owns input discovery and output path planning.
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
- `src/pdf2md/mineru_adapter.py` owns the mocked direct local MinerU CLI adapter boundary.
- `src/pdf2md/markdown.py` owns Obsidian Markdown normalization, asset link warnings, and table fallback warnings.
- `src/pdf2md/quality.py` owns local quality checks over normalized Markdown and asset context.
- `src/pdf2md/report.py` owns report content rendering and final status calculation.
- `uv run pytest` passed 103 tests.
Sprint 7 may compute source SHA-256, create conversion output directories, copy local adapter-provided asset files, write final Markdown, write metadata JSON when requested, and write `<stem>.report.md`. It must keep public return types project-owned and must not require raw MinerU-specific Python objects from callers.
## Touched Surfaces
Allowed:
- `src/pdf2md/conversion.py`
- `src/pdf2md/cli.py`
- `src/pdf2md/__init__.py`
- `src/pdf2md/paths.py` only for narrowly required path/output helper compatibility
- `src/pdf2md/mineru_adapter.py` only for narrowly required adapter protocol or result compatibility
- `src/pdf2md/metadata.py` only for narrowly required output-location or summary compatibility
- `src/pdf2md/markdown.py` only for narrowly required orchestration compatibility
- `src/pdf2md/quality.py` only for narrowly required orchestration compatibility
- `src/pdf2md/report.py` only for narrowly required orchestration compatibility
- `tests/test_conversion.py`
- `tests/test_cli.py`
- Existing focused unit tests only if a touched module requires compatibility updates
- `README.md` only if a short usage note is needed for `pdf2md convert`
- `PLAN.md`
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md`
- `docs/Sprints/SPRINT7CONTRACT.md`
Not allowed:
- `src/pdf2md/doctor.py`
- Working `pdf2md doctor` behavior
- `scripts/`
- MinerU/model installation or download scripts
- Real MinerU invocation in default tests
- Real GPU/CUDA checks
- Real PDF content parsing outside adapter output handling
- Runtime engine selection or alternate engine support
- Cloud OCR, remote LLM/VLM, hosted renderer, remote document parser, remote asset fetching, HTTP client backend, router mode, `--api-url`, remote APIs, or remote OpenAI-compatible backend support
- A CLI flag that disables strict-local policy
- Committed files under `samples/`
## Expected Outputs
Sprint 7 should produce:
1. Public conversion records and API
- A project-owned conversion result type containing at least:
- source PDF path
- Markdown output path
- metadata JSON path when written
- report path
- assets directory
- raw output directory when kept
- engine name and version
- final status
- warning count
- warnings
- `convert_pdf(input_path, output_dir, metadata=True, keep_raw=False, overwrite=False, gpu=None, strict_local=True, adapter=None, clock=None)` or an equivalently small API.
- The default API path uses the direct local MinerU adapter.
- Tests can inject a fake adapter and deterministic clock.
- The public return type must not expose raw MinerU-specific Python objects as required fields.
2. Single-PDF orchestration
- Discover and plan the single PDF using existing path helpers.
- Create required output directories only after preflight path checks pass.
- Run the adapter into a planned temporary or raw work directory.
- Stop the individual conversion on adapter hard failure and return explicit warnings/status.
- Normalize adapter Markdown into Obsidian-friendly Markdown.
- Copy local adapter-provided asset files into the planned assets directory when needed.
- Compute source SHA-256 with local file reads.
- Build metadata from project-owned records.
- Run local quality checks over normalized Markdown and asset context.
- Render report Markdown from metadata and quality results.
- Write final Markdown, optional metadata JSON, and report Markdown.
3. Output writing and overwrite behavior
- Never write outside the planned output root.
- Respect existing-output conflicts unless `overwrite=True`.
- Keep writes deterministic and UTF-8 encoded for text outputs.
- Preserve `--metadata` behavior: metadata JSON is written when enabled and omitted when disabled.
- Always write `<stem>.report.md`.
- `--keep-raw` preserves raw MinerU output in the planned raw directory.
- Without `--keep-raw`, temporary raw work must be cleaned up when the conversion completes, while preserving enough failure context in metadata/report/warnings.
4. Asset materialization
- Copy only local files returned by the adapter.
- Do not fetch remote assets.
- Do not follow asset paths that escape the adapter work directory or the planned output root.
- Handle missing or invalid adapter asset paths with project-owned warnings.
- Normalize final Markdown asset links to stable relative paths.
5. CLI convert command
- `pdf2md convert INPUT --out OUTPUT_DIR`.
- Options:
- `--metadata`
- `--keep-raw`
- `--recursive`
- `--overwrite`
- `--gpu GPU_DEVICE`
- `--strict-local`
- Strict-local must remain enabled in v1; the CLI must not add a supported way to disable it.
- Single PDF conversion returns exit code `0` on success or partial success, and non-zero when a hard error prevents conversion.
- Directory conversion handles multiple PDFs deterministically and prints a summary.
- Batch conversion should continue to the next file when one PDF fails after planning, then return a non-zero exit code if any PDF failed.
- CLI output must include converted count, failed count, and warning count.
6. Batch conversion
- Directory inputs use existing non-recursive discovery by default.
- Recursive discovery occurs only with `--recursive`.
- Output paths preserve relative subdirectories from the input root.
- Duplicate planned outputs and overwrite conflicts fail before conversion starts.
- Results are deterministic and ordered by discovery/path planning order.
7. Failure and warning behavior
- MinerU failure must be clear and must not trigger fallback to any other engine.
- Strict-local violations must be hard failures.
- Per-file failures must include project-owned warnings.
- CLI summaries must not suppress warning counts.
- Metadata/report content must reflect warnings emitted during adapter, normalization, asset, and quality steps.
8. Tests
- API test for one successful conversion with a fake adapter.
- API test for adapter failure with no fallback.
- API test for output conflict and overwrite behavior.
- API test for metadata disabled behavior.
- API test for local asset copying and relative Markdown links.
- API test for `keep_raw` behavior.
- CLI test for single PDF conversion with a fake adapter.
- CLI test for directory conversion with deterministic summary output.
- CLI test for recursive behavior.
- CLI test for failure summary and non-zero exit code.
- Tests proving default checks do not require real MinerU, GPU, models, network, `samples/`, Obsidian, or LaTeX tooling.
9. Handoff
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
## Non-Goals
- Do not implement `pdf2md doctor`.
- Do not implement environment diagnostics.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not probe real MinerU output with local sample PDFs.
- Do not add setup scripts.
- Do not implement runtime engine selection.
- Do not add alternate engines.
- Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, hosted renderer, or remote asset-fetching support.
- Do not add a CLI flag or API option that disables strict-local policy.
- Do not require real MinerU, CUDA, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/` in default tests.
- Do not implement real math rendering; use the Sprint 6 local checker boundary.
- Do not commit generated conversion outputs or sample PDFs.
## Work Packages
### WP7.1: Public API And Result Records
Owner:
- `feature-generator-agent`
- `requirements-guard-agent`
Actions:
- Add `conversion.py`.
- Define project-owned conversion result records.
- Expose `convert_pdf` from the library surface.
- Support fake adapter and deterministic clock injection for tests.
Output:
- Callers can run one conversion without depending on raw MinerU objects.
### WP7.2: Single-PDF Orchestration And Output Writing
Owner:
- `feature-generator-agent`
- `mineru-integration-agent`
- `metadata-agent`
- `obsidian-markdown-agent`
Actions:
- Connect path planning, adapter execution, Markdown normalization, metadata building, quality checks, and report rendering.
- Write final Markdown, optional metadata JSON, and report Markdown.
- Compute source SHA-256.
- Preserve strict-local behavior and no-fallback behavior.
Output:
- One PDF can be converted through mocked adapter outputs in tests and through the real adapter in normal use.
### WP7.3: Asset And Raw Output Handling
Owner:
- `feature-generator-agent`
- `obsidian-markdown-agent`
- `metadata-agent`
Actions:
- Copy local adapter-provided assets into the planned assets directory.
- Normalize Markdown links relative to the final Markdown file.
- Preserve raw output only when requested.
- Clean temporary work when raw output is not requested.
Output:
- Markdown, assets, metadata, and report paths are stable and local-only.
### WP7.4: CLI Convert And Batch Summary
Owner:
- `feature-generator-agent`
- `requirements-guard-agent`
Actions:
- Replace the placeholder CLI with `convert` while keeping `--version`.
- Add only the agreed v1 options.
- Print deterministic summaries with converted, failed, and warning counts.
- Return non-zero exit code when hard failures occur.
Output:
- Users can run `pdf2md convert` for one PDF or a directory.
### WP7.5: Independent Evaluation
Owner:
- `evaluation-agent`
Actions:
- Review completed orchestration behavior against this contract.
- Verify no default test executes real MinerU, uses GPU, downloads models, uses network, or requires `samples/`.
- Verify no runtime remote/API path or alternate engine is introduced.
- Verify `samples/` remains untracked and unstaged.
Output:
- PASS/FAIL notes with any missing acceptance criteria.
## Verification Checks
Required:
- `git status --short --untracked-files=all` before staging confirms `samples/` remains untracked and unstaged.
- `uv --version` is run and result is recorded.
- `uv sync` passes.
- `uv run pytest tests/test_conversion.py tests/test_cli.py` passes.
- `uv run pytest` passes.
- `git diff --check` passes.
- Default tests do not require real MinerU, CUDA, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- No model downloads occur.
- No network calls are required.
- No candidate engine comparison is reintroduced.
- No alternate engine or runtime engine selection is added.
- No CLI/API option disables strict-local policy.
- No `--api-url`, router mode, HTTP client backend, remote API, or remote OpenAI-compatible backend support is added.
- Adapter failures produce explicit failed results and no fallback conversion.
- Output files are written only after path preflight succeeds.
- Existing outputs are protected unless overwrite is enabled.
- CLI summaries include warning counts.
- Metadata/report paths and counts match the files written.
Recommended:
- Keep conversion orchestration small and dependency-injected.
- Prefer local temporary directories from the standard library for raw work when `keep_raw` is disabled.
- Keep batch conversion a thin loop over single-file conversion.
- Keep CLI formatting simple and stable enough for tests.
- Use fake adapter records in tests rather than monkeypatching subprocess behavior at the CLI layer.
## Hard Failure Criteria
Sprint 7 fails and must stop for a user decision if any of these are true:
- Public API requires or exposes raw MinerU-specific Python objects as required return fields.
- The implementation silently falls back to another engine after MinerU failure.
- A CLI/API option disables strict-local policy.
- The implementation adds or permits `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- Default tests execute real MinerU, require GPU/CUDA, download models, use network, require Obsidian/LaTeX tooling, or require `samples/`.
- Output writing can escape the planned output root.
- Existing files are overwritten without explicit overwrite intent.
- CLI writes final outputs after a preflight hard failure.
- CLI summaries suppress warning counts or failed counts.
- Metadata/report content omits warnings emitted during adapter, normalization, asset, or quality steps.
- `samples/` is staged or committed.
## Acceptance Criteria
Sprint 7 is complete when:
- `src/pdf2md/conversion.py` exists and owns conversion orchestration.
- `convert_pdf` is available from the public Python package.
- `pdf2md convert INPUT --out OUTPUT_DIR` exists.
- Single-PDF conversion is tested with a fake adapter.
- Directory and recursive conversion behavior is tested with fake adapters.
- Output conflict and overwrite behavior is tested.
- Adapter failure produces a clear failed result and no fallback.
- Final Markdown, metadata JSON when enabled, and report Markdown are written by product behavior.
- Local adapter-provided assets are copied or warned about deterministically.
- `--keep-raw` behavior is tested.
- CLI summaries include converted, failed, and warning counts.
- Default tests do not require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- `uv sync` passes.
- Targeted conversion/CLI tests pass.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.
## Handoff Fields
Use these fields when Sprint 7 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 8:
- Next action:
+339
View File
@@ -0,0 +1,339 @@
# Sprint 8 Contract: Doctor And Setup Documentation
Status: Implemented
Last updated: 2026-05-08
## Objective
Make local setup failures explicit before users run conversions by adding a mockable `pdf2md doctor` diagnostic path and setup documentation for Windows PowerShell, Python 3.12, `uv`, MinerU 3.1.0, local model/cache expectations, NVIDIA GPU/CUDA visibility, and strict-local runtime behavior.
Sprint 8 must establish:
- A project-owned doctor module for local environment diagnostics.
- A `pdf2md doctor` CLI command with deterministic exit codes.
- Clear reporting for Python, `uv`, MinerU CLI, MinerU version, GPU/CUDA/PyTorch visibility, and model/cache path detection.
- Documentation that explains setup steps and local-only runtime constraints without introducing cloud/API fallback paths.
- Fast tests that mock environment checks and do not require real MinerU, CUDA, GPU, model files, network, `samples/`, Obsidian, LaTeX tooling, or package/model downloads.
Sprint 8 is a diagnostics and setup-documentation sprint. It must not change conversion output behavior, run real conversions, probe local sample PDFs, or make runtime conversion depend on network access.
## Current Precondition
Sprint 7 is complete:
- `src/pdf2md/paths.py` owns input discovery and output path planning.
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
- `src/pdf2md/mineru_adapter.py` owns the mocked direct local MinerU CLI adapter boundary.
- `src/pdf2md/markdown.py` owns Obsidian Markdown normalization.
- `src/pdf2md/quality.py` owns local quality checks, including math checker unavailable behavior.
- `src/pdf2md/report.py` owns report content rendering and final status calculation.
- `src/pdf2md/conversion.py` owns conversion orchestration and output writing.
- `src/pdf2md/cli.py` owns `pdf2md convert`.
- `uv run pytest` passed 119 tests.
Sprint 8 may add doctor diagnostics and setup documentation. It must not require a real successful local MinerU/GPU/model setup in the default test loop.
## Touched Surfaces
Allowed:
- `src/pdf2md/doctor.py`
- `src/pdf2md/cli.py`
- `src/pdf2md/mineru_adapter.py` only for narrowly required availability/version helper compatibility
- `README.md`
- `scripts/install-mineru.ps1` only if implemented as an explicit user-invoked setup helper
- `scripts/install-models.py` only if implemented as an explicit user-invoked setup helper
- `tests/test_doctor.py`
- `tests/test_cli.py`
- `tests/test_mineru_adapter.py` only if adapter helper compatibility changes
- `PLAN.md`
- `PROGRESS.md`
- `docs/V1IMPLEMENTATIONPLAN.md`
- `docs/Sprints/SPRINT8CONTRACT.md`
Not allowed:
- Changes to `src/pdf2md/conversion.py` unless a doctor/CLI regression forces a narrow compatibility fix
- Changes to Markdown normalization, metadata schema, report rendering, or path planning unrelated to doctor behavior
- Real PDF conversion in default tests
- Real MinerU execution in default tests
- Real CUDA/GPU dependency in default tests
- Model downloads in default tests
- Any setup download triggered by `pdf2md doctor`, `pdf2md convert`, import time, or tests
- Runtime engine selection or alternate engine support
- Cloud OCR, remote LLM/VLM, hosted renderer, remote document parser, remote asset fetching, HTTP client backend, router mode, `--api-url`, remote APIs, or remote OpenAI-compatible backend support
- A CLI/API option that disables strict-local policy
- Committed files under `samples/`
- Generated conversion outputs committed to git
## Expected Outputs
Sprint 8 should produce:
1. Doctor result records and API
- A small project-owned doctor result type containing at least:
- check name
- status: `pass`, `warn`, or `fail`
- human-readable message
- optional details
- A doctor report type containing ordered checks and an overall status.
- Mockable checker dependencies for subprocess calls, executable discovery, Python version, imports, environment variables, and filesystem paths.
- No public or required field should expose raw subprocess or third-party objects.
2. Required checks
- Python version check:
- Pass on Python 3.12.
- Fail outside the supported project range.
- `uv` check:
- Detect executable availability.
- Report version text when available.
- Fail clearly when missing.
- MinerU check:
- Detect direct local `mineru` CLI availability through the existing adapter boundary where possible.
- Report MinerU version when available.
- Fail clearly when the CLI is missing.
- Warn or fail clearly when version detection fails or the detected version is not MinerU 3.1.0.
- GPU/CUDA/PyTorch visibility:
- Report whether an NVIDIA GPU is visible when detectable.
- Report CUDA/PyTorch visibility without requiring PyTorch in default tests.
- Warn clearly when GPU/CUDA/PyTorch acceleration is unavailable.
- Warn clearly when the detected GPU is GTX 1070 Ti or another Pascal/pre-Turing class GPU with likely CUDA/PyTorch compatibility risk.
- Model/cache paths:
- Report detectable local model/cache paths from documented environment variables or known local config locations.
- Warn when model/cache paths cannot be detected.
- Do not download, install, or validate large model files in default tests.
- Local-only policy:
- Report that runtime conversion allows only direct local `mineru` CLI execution and CLI-internal temporary local `mineru-api`.
- Report that `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends are prohibited.
3. CLI command
- `pdf2md doctor` exists.
- `pdf2md --version` remains unchanged.
- `pdf2md convert` behavior remains covered and unchanged except for parser integration.
- Exit code policy:
- `0` when all checks pass or only warnings exist.
- Non-zero when any required check fails.
- CLI output must be concise, deterministic, and testable.
- CLI output must distinguish warnings from failures.
4. Setup documentation
- README or setup section explains:
- Windows PowerShell workflow.
- Python 3.12 requirement.
- `uv` usage and PATH note for `C:\Users\user\.local\bin`.
- MinerU 3.1.0 local CLI expectation.
- Model/cache setup expectations and where doctor looks.
- NVIDIA GPU expectations and GTX 1070 Ti 8GB risk.
- Strict-local runtime policy.
- Difference between explicit setup downloads and runtime conversion, which must stay local-only.
- If setup helper scripts are added, they must be explicit user-invoked helpers, not imported by package code, not called by `doctor`, and not used by default tests.
5. Tests
- Unit tests for doctor success with mocked checks.
- Unit tests for missing Python version support.
- Unit tests for missing `uv`.
- Unit tests for missing MinerU.
- Unit tests for MinerU version detection failure or non-3.1.0 version.
- Unit tests for missing GPU/CUDA/PyTorch warning behavior.
- Unit tests for GTX 1070 Ti/Pascal risk warning behavior.
- Unit tests for missing model/cache warning behavior.
- CLI tests for `pdf2md doctor` success, warning-only success, and hard failure exit code.
- Regression tests proving `pdf2md convert` and `--version` still work.
- Tests proving default checks do not require real MinerU, GPU, CUDA, PyTorch, model files, network, `samples/`, Obsidian, or LaTeX tooling.
6. Handoff
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
## Non-Goals
- Do not run real MinerU in default tests.
- Do not install MinerU 3.1.0.
- Do not download MinerU models.
- Do not run real model setup during doctor or tests.
- Do not parse, convert, or inspect sample PDFs.
- Do not implement local fixture evaluation.
- Do not change conversion orchestration behavior except for unavoidable CLI parser integration.
- Do not add runtime engine selection.
- Do not add alternate engines.
- Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, hosted renderer, or remote asset-fetching support.
- Do not add a CLI/API option that disables strict-local policy.
- Do not require real CUDA, GPU, PyTorch, MinerU, model files, network, Obsidian, LaTeX tooling, or `samples/` in default tests.
- Do not claim that GTX 1070 Ti CUDA/PyTorch acceleration is guaranteed until local validation proves it.
- Do not claim perfect setup automation.
## Work Packages
### WP8.1: Doctor Records And Mockable Check Boundaries
Owner:
- `local-setup-agent`
- `feature-generator-agent`
Actions:
- Add `doctor.py`.
- Define doctor check/result records.
- Keep all external probes injectable or mockable.
- Implement deterministic aggregation and exit status policy.
Output:
- The CLI can report setup health without depending on real local tools in tests.
### WP8.2: Environment And MinerU Diagnostics
Owner:
- `local-setup-agent`
- `mineru-integration-agent`
- `feature-generator-agent`
Actions:
- Add Python, `uv`, MinerU availability, MinerU version, GPU/CUDA/PyTorch, and model/cache checks.
- Reuse the direct local MinerU adapter boundary where it fits.
- Keep MinerU 3.1.0 as the only accepted engine target.
Output:
- Users get clear local setup failures before conversion.
### WP8.3: CLI Doctor Command
Owner:
- `feature-generator-agent`
- `requirements-guard-agent`
Actions:
- Add `pdf2md doctor` without breaking `pdf2md convert` or `--version`.
- Print deterministic pass/warn/fail lines and an overall status.
- Return non-zero when required checks fail.
Output:
- The command-line workflow can diagnose setup state.
### WP8.4: Setup Documentation And Optional Explicit Helpers
Owner:
- `local-setup-agent`
- `license-privacy-agent`
- `requirements-guard-agent`
Actions:
- Update README setup docs for Windows PowerShell, Python 3.12, `uv`, MinerU 3.1.0, model/cache, GPU, and local-only runtime policy.
- Verify volatile install/setup claims against official docs before editing.
- If adding scripts, keep them explicit, local setup-only, and never called by doctor, convert, import time, or default tests.
Output:
- Setup instructions are clear without weakening strict-local runtime policy.
### WP8.5: Independent Evaluation
Owner:
- `evaluation-agent`
Actions:
- Review completed doctor behavior and docs against this contract.
- Verify no default test executes real MinerU, uses GPU/CUDA, downloads models, uses network, or requires `samples/`.
- Verify no runtime remote/API path or alternate engine is introduced.
- Verify `samples/` remains untracked and unstaged.
Output:
- PASS/FAIL notes with any missing acceptance criteria.
## Verification Checks
Required:
- `git status --short --untracked-files=all` before staging confirms `samples/` remains untracked and unstaged.
- `uv --version` is run and result is recorded.
- `uv sync` passes.
- `uv run pytest tests/test_doctor.py tests/test_cli.py` passes.
- `uv run pytest` passes.
- `git diff --check` passes.
- Default tests do not require real MinerU, CUDA, GPU, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- No model downloads occur.
- No setup downloads occur from doctor, convert, imports, or tests.
- No network calls are required in default tests.
- No candidate engine comparison is reintroduced.
- No alternate engine or runtime engine selection is added.
- No CLI/API option disables strict-local policy.
- No `--api-url`, router mode, HTTP client backend, remote API, or remote OpenAI-compatible backend support is added.
- Doctor fails clearly when required dependencies are missing.
- Doctor does not report the environment as healthy when MinerU is missing.
- Doctor warnings are clear for GPU/CUDA/PyTorch/model-cache risk.
- `pdf2md convert` tests still pass.
- `pdf2md --version` still works.
Recommended:
- Keep doctor checks small, ordered, and deterministic.
- Keep human-readable output stable enough for unit tests.
- Use dependency injection rather than monkeypatching global process state where possible.
- Treat real local GPU/MinerU/model probes as optional manual verification outside the default test suite.
- Use `requirements-guard-agent` if setup wording risks weakening strict-local policy.
- Use `research-agent` or `local-setup-agent` with live web verification before changing volatile installation commands or model-cache documentation.
## Hard Failure Criteria
Sprint 8 fails and must stop for a user decision if any of these are true:
- Doctor reports a healthy environment when MinerU is missing.
- Doctor says cloud/API fallback is supported.
- Doctor, import time, or default tests install packages, download models, call network services, or run model setup.
- Default tests require real MinerU, CUDA, GPU, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- The implementation adds or permits `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- The implementation adds runtime engine selection or alternate engines.
- `pdf2md doctor` breaks `pdf2md convert` or `pdf2md --version`.
- The README or scripts imply runtime conversion can upload PDFs, page images, extracted text, or intermediates to remote services.
- Setup helper scripts are invoked automatically by doctor, convert, import time, or tests.
- `samples/` is staged or committed.
## Acceptance Criteria
Sprint 8 is complete when:
- `src/pdf2md/doctor.py` exists and owns local setup diagnostics.
- `pdf2md doctor` exists.
- Doctor returns a project-owned report with ordered checks and overall status.
- Python 3.12, `uv`, MinerU availability/version, GPU/CUDA/PyTorch visibility, and model/cache path checks are implemented with mocked tests.
- Missing `uv` is tested.
- Missing MinerU is tested and produces a failure.
- Missing GPU/CUDA/PyTorch is tested and produces clear warning behavior.
- GTX 1070 Ti/Pascal risk warning behavior is tested.
- Missing model/cache path warning behavior is tested.
- `pdf2md doctor` exit code behavior is tested for success, warning-only success, and failure.
- `pdf2md convert` and `pdf2md --version` regression tests still pass.
- Setup docs explain local-only runtime behavior and do not imply cloud/API fallback.
- Default tests do not require real MinerU, GPU, CUDA, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- `uv sync` passes.
- Targeted doctor/CLI tests pass.
- `uv run pytest` passes.
- `PROGRESS.md` records checks performed and residual risks.
- Independent evaluation is complete.
- The completed change is committed.
## Handoff Fields
Use these fields when Sprint 8 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Known failures:
- Residual risks:
- User decisions needed:
- Go/no-go recommendation for Sprint 9:
- Next action:
+320
View File
@@ -0,0 +1,320 @@
# Sprint 9 Contract: Local Fixture Evaluation And V1 Release Gate
Status: Implemented
Last updated: 2026-05-08
## Objective
Validate the v1 converter against local fixture workflows without committing sample PDFs or making the default test loop depend on MinerU models, GPU, CUDA, network access, Obsidian, or LaTeX tooling.
Sprint 9 must establish:
- A fast mocked integration suite that exercises the public conversion path end to end.
- An optional, explicitly enabled local MinerU fixture evaluation path for `samples/`.
- A fixture coverage manifest or checklist that records which local PDFs cover math, tables, figures/assets, reading order, Korean filenames, and metadata/report risks.
- Release-gate documentation that distinguishes default automated checks from optional local MinerU/GPU checks.
- Clear `PROGRESS.md` notes for local fixture coverage, skipped/blocked optional checks, known quality risks, and the v1 go/no-go recommendation.
Sprint 9 is an evaluation and release-gate sprint. It may add tests, local-only evaluation helpers, fixture manifests, and narrow compatibility fixes only when needed to evaluate the current v1 behavior. It must not add alternate engines, cloud/API paths, runtime engine selection, or automatic model downloads.
## Current Precondition
Sprint 8 is complete:
- `pdf2md doctor` exists and reports Python, `uv`, MinerU CLI/version, GPU, PyTorch, model/cache, and strict-local policy status.
- Local `pdf2md doctor` currently fails because the `mineru` CLI is not installed on PATH.
- `pdf2md convert` exists and writes Markdown, metadata JSON, and `<stem>.report.md` with fake-adapter test coverage.
- Default tests pass without real MinerU, CUDA, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- `samples/` exists locally and is untracked. Observed local fixture files include:
- `samples/FourNodeQuadrilateralShellElementMITC4.pdf`
- `samples/MITC공부.pdf`
- `samples/2007쉘구조물의유한요소해석에대하여.pdf`
- `samples/유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf`
- `samples/metadata.json`
Sprint 9 must preserve the untracked status of `samples/` unless the user explicitly requests otherwise.
## Touched Surfaces
Allowed:
- `tests/integration/`
- `tests/test_conversion.py`
- `tests/test_cli.py`
- `tests/test_report.py`
- `tests/test_metadata.py`
- `tests/test_quality.py`
- `tests/conftest.py` only for markers or opt-in fixture controls
- `src/pdf2md/mineru_adapter.py` only for narrow compatibility fixes backed by mocked or optional local MinerU output evidence
- `src/pdf2md/conversion.py` only for narrow release-gate defects found by integration tests
- `src/pdf2md/quality.py` only for local quality metric defects found by integration tests
- `src/pdf2md/report.py` only for report defects found by integration tests
- `README.md`
- `docs/V1RELEASECHECKLIST.md`
- `docs/V1IMPLEMENTATIONPLAN.md`
- `docs/Sprints/SPRINT9CONTRACT.md`
- `PLAN.md`
- `PROGRESS.md`
Not allowed:
- Committed files under `samples/`
- Committed generated conversion outputs from local sample PDFs
- Mandatory tests that require real MinerU, GPU, CUDA, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`
- Automatic package installs or model downloads from tests, import time, doctor, convert, or helpers
- Runtime engine selection or alternate conversion engines
- Cloud OCR, remote LLM/VLM, hosted renderer, remote document parser, remote asset fetching, `--api-url`, router mode, HTTP client backends, remote APIs, or remote OpenAI-compatible backends
- CLI/API options that disable strict-local policy
- Claims that v1 perfectly reconstructs LaTeX, tables, or reading order
## Expected Outputs
1. Fast mocked integration suite
- Exercises `convert_pdf` and/or `pdf2md convert` with a fake MinerU adapter through the real orchestration path.
- Verifies Markdown, metadata JSON, and `<stem>.report.md` are all written.
- Verifies output paths, asset links, warning counts, and report status stay consistent.
- Verifies failures produce metadata/report warnings when possible and do not silently fallback.
- Runs as part of `uv run pytest` without real MinerU, models, GPU, network, Obsidian, LaTeX tooling, or `samples/`.
2. Optional local MinerU fixture evaluation
- Provides an explicit opt-in command or pytest marker/environment gate for real local MinerU sample evaluation.
- Skips or reports a clear local blocker when `pdf2md doctor` fails because MinerU, model/cache paths, or GPU/PyTorch acceleration are unavailable.
- Reads sample PDFs only from `samples/` or a user-provided local sample directory.
- Writes generated outputs to a temporary or ignored output directory, never to tracked fixture paths.
- Produces or records, for each attempted sample:
- source filename
- command run
- exit code
- generated Markdown path
- generated metadata JSON path
- generated `.report.md` path
- warning count
- math renderability or checker-unavailable count
- table fallback/degradation count when available
- missing or broken asset link count
- page coverage when available
- Does not mark optional evaluation as passed when MinerU is missing; it records the blocker.
3. Fixture coverage manifest or checklist
- Maps local sample files to risk categories:
- simple digital PDF
- math-heavy PDF
- multi-column or complex reading order
- table with formulas
- figure/caption/assets
- Korean filename/path handling
- May store only relative sample names, categories, and notes; it must not embed sample PDFs or generated outputs.
- Records coverage gaps that need additional user-provided samples.
4. V1 release checklist
- Defines default release gates:
- `uv sync`
- `uv run pytest`
- `uv run pdf2md --version`
- `uv run pdf2md doctor`
- `git diff --check`
- `git status --short --untracked-files=all`
- Defines optional local MinerU release gates separately from default gates.
- Requires Markdown, metadata JSON, and `.report.md` to exist before any sample conversion is considered successful.
- Requires warnings and residual risks to be recorded in `PROGRESS.md`.
- Makes local-only and no-sample-commit checks explicit.
5. Documentation
- README or release checklist explains how to run default checks and optional local fixture checks.
- Documentation states that optional fixture checks may be skipped or blocked until MinerU 3.1.0 and model/cache setup are available.
- Documentation does not instruct users to use `--api-url`, router mode, HTTP client backends, remote APIs, or remote OpenAI-compatible backends.
6. Handoff
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, local fixture status, generated output location if any, known failures, residual risks, and next action.
## Non-Goals
- Do not install MinerU.
- Do not download MinerU models.
- Do not run model setup automatically.
- Do not require the local GTX 1070 Ti to pass CUDA/PyTorch checks in the default test loop.
- Do not improve OCR/model accuracy.
- Do not introduce a manual review UI or web UI.
- Do not add alternate conversion engines or fallback engines.
- Do not benchmark against cloud OCR/API services.
- Do not commit sample PDFs, sample-derived outputs, or large binary fixtures.
- Do not make text edit distance the only quality criterion.
- Do not claim v1 is release-ready if metadata JSON or `.report.md` generation is missing.
## Work Packages
### WP9.1: Fast Mocked Integration Checks
Owner:
- `feature-generator-agent`
- `evaluation-agent`
Actions:
- Add integration-level tests that use fake adapter output but run the public conversion orchestration and CLI paths.
- Assert generated Markdown, metadata JSON, `.report.md`, assets, warnings, and summaries are mutually consistent.
- Keep tests deterministic and independent of real samples.
Output:
- `uv run pytest` covers v1 file-output behavior without model or GPU dependencies.
### WP9.2: Optional MinerU Sample Evaluation Harness
Owner:
- `mineru-integration-agent`
- `local-setup-agent`
- `evaluation-agent`
Actions:
- Add an explicit opt-in local fixture command/test path.
- Gate real MinerU execution behind an environment variable, marker, or explicit command documented in README/checklist.
- Run `pdf2md doctor` or equivalent preflight before optional local MinerU evaluation.
- Use temporary or ignored output directories.
- Record blocked status clearly when MinerU/model/cache setup is missing.
Output:
- Local users can run real sample evaluation when setup is ready, while default tests stay fast and local.
### WP9.3: Fixture Coverage And Metrics
Owner:
- `evaluation-agent`
- `obsidian-markdown-agent`
- `metadata-agent`
Actions:
- Define fixture categories and expected risk coverage.
- Track math delimiter/renderability, tables, reading order, assets, page coverage, metadata fields, warning counts, and report usefulness.
- Avoid scoring quality only by plain-text edit distance.
Output:
- Fixture coverage is explicit and gaps are visible.
### WP9.4: V1 Release Gate Documentation
Owner:
- `requirements-guard-agent`
- `evaluation-agent`
Actions:
- Add or update release checklist documentation.
- Separate default release gates from optional local MinerU/GPU gates.
- Keep strict-local wording consistent with `ARCHITECTURE.md`, `PRD.md`, and `README.md`.
- Update `PLAN.md` and `PROGRESS.md` with the next action and release readiness state.
Output:
- A future agent can determine whether v1 is blocked, partial, or ready without relying on conversation history.
### WP9.5: Independent Evaluation
Owner:
- `evaluation-agent`
Actions:
- Review completed Sprint 9 work against this contract.
- Verify default tests do not require real MinerU, GPU, CUDA, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- Verify optional local MinerU evaluation is clearly gated.
- Verify generated sample outputs and sample PDFs are not staged.
- Verify release checklist cannot pass without Markdown, metadata JSON, and `.report.md`.
Output:
- PASS/FAIL notes with actionable findings and residual risk.
## Verification Checks
Required:
- `git status --short --untracked-files=all` before staging confirms `samples/` remains untracked and unstaged.
- `uv --version` is run and result is recorded.
- `uv sync` passes.
- `uv run pytest` passes.
- Targeted integration tests pass.
- `uv run pdf2md --version` passes.
- `uv run pdf2md doctor` is run and its result is recorded as pass, warn, or blocked/fail.
- `git diff --check` passes.
- Default tests do not require real MinerU, CUDA, GPU, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- No model downloads occur.
- No setup downloads occur from tests, import time, doctor, convert, or helper scripts.
- No network calls are required in default tests.
- No candidate engine comparison is reintroduced.
- No alternate engine or runtime engine selection is added.
- No CLI/API option disables strict-local policy.
- No `--api-url`, router mode, HTTP client backend, remote API, or remote OpenAI-compatible backend support is added.
- Optional local MinerU checks are skipped or blocked clearly when setup is unavailable.
- Sample PDFs and generated sample outputs are not staged or committed.
- `PROGRESS.md` records local fixture coverage status and release readiness.
Recommended:
- Add a pytest marker or environment variable for optional local MinerU tests.
- Keep optional output under a temporary directory or an ignored local output root.
- Include at least one Korean filename/path check in fast mocked tests.
- Include one fake output with math, one with a table warning, and one with an asset link.
- Record source-to-output paths in release checklist examples.
- Treat local doctor failure as a release blocker for real MinerU validation but not for the default fast test loop.
## Hard Failure Criteria
Sprint 9 fails and must stop for a user decision if any of these are true:
- Default tests require real MinerU, GPU, CUDA, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- Sample PDFs or generated sample outputs are staged or committed.
- Optional real MinerU evaluation runs without an explicit opt-in gate.
- Optional real MinerU evaluation writes generated output into tracked fixture paths.
- V1 release checklist can pass without generated Markdown, metadata JSON, and `.report.md`.
- Release status is marked ready when `pdf2md doctor` has a hard failure and no explicit user waiver is recorded.
- The implementation adds runtime engine selection or alternate engines.
- The implementation adds or permits `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- The implementation uses cloud/API fallback for any fixture evaluation.
- The implementation hides MinerU failure or silently falls back to another engine.
- Quality criteria ignore math, tables, reading order, assets, metadata, or report quality.
## Acceptance Criteria
Sprint 9 is complete when:
- `docs/Sprints/SPRINT9CONTRACT.md` exists and is referenced by relevant agents.
- Fast mocked integration tests exist and pass under `uv run pytest`.
- Optional local MinerU fixture evaluation is documented and explicitly gated.
- Local fixture coverage categories and gaps are recorded.
- Release checklist documentation exists or is updated.
- `PROGRESS.md` records optional local MinerU status, including skipped/blocked reasons when applicable.
- Default tests do not require real MinerU, GPU, CUDA, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`.
- No sample PDF or generated sample output is staged or committed.
- `uv sync` passes.
- `uv run pytest` passes.
- `git diff --check` passes.
- Independent evaluation is complete.
- The completed change is committed.
## Handoff Fields
Use these fields when Sprint 9 completes:
- Files changed:
- Commands run:
- Tests passed:
- Tests blocked:
- Optional local MinerU status:
- Fixture coverage:
- Generated output locations:
- Known failures:
- Residual risks:
- User decisions needed:
- V1 release recommendation:
- Go/no-go recommendation for next sprint:
- Next action:
+661
View File
@@ -0,0 +1,661 @@
# V1 Implementation Plan: Local PDF-to-Markdown Converter
Last updated: 2026-05-08
This document is the implementation plan for v1. It does not replace `PRD.md` or `ARCHITECTURE.md`; use those files as the source of product requirements and system design. This plan explains the order of work, sprint contracts, verification gates, and agent ownership for implementing the converter.
Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents.
## 1. V1 Outcome
v1 is complete when a local user can run:
```bash
uv run pdf2md doctor
uv run pdf2md convert paper.pdf --out out --metadata
uv run pdf2md convert pdfs --out out --recursive --metadata
```
and receive, for each PDF:
- Obsidian-friendly Markdown.
- A stable sibling assets directory when assets exist.
- `<stem>.metadata.json`.
- `<stem>.report.md`.
- Clear warnings when math, tables, assets, reading order, GPU availability, or MinerU execution are uncertain.
Long PDFs can be chunked explicitly:
```bash
uv run pdf2md convert paper.pdf --out out --chunk-pages
uv run pdf2md convert paper.pdf --out out --chunk-pages 20
```
Chunked conversion writes separate outputs per chunk and does not merge Markdown files.
The converter must use MinerU 3.1.0 through direct local CLI execution only. It must not silently fallback to another engine.
## 2. Non-Negotiable Constraints
- Python 3.12 and `uv`.
- MinerU 3.1.0 is the only conversion engine.
- Direct local MinerU CLI execution only.
- MinerU 3.1.0 may launch a temporary local `mineru-api` internally when CLI runs without `--api-url`.
- No cloud OCR, hosted LLM/VLM, remote document parser, `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- Target hardware: NVIDIA GTX 1070 Ti 8GB.
- Digital PDFs with text layers are the v1 priority.
- `samples/` is local fixture context and must not be committed unless explicitly requested.
- Every substantial implementation chunk needs a sprint contract and independent evaluation.
## 3. Harness Operating Model
Use the project long-running harness only for substantial implementation work.
1. `harness-planner-agent` turns the next user request into a sprint contract.
2. `evaluation-agent` reviews the contract before code changes start.
3. `feature-generator-agent` implements one approved contract at a time.
4. `feature-generator-agent` runs self-checks and records residual risks.
5. `evaluation-agent` independently verifies the result against the contract.
6. The parent agent updates `PROGRESS.md`, commits the completed change, and leaves a handoff.
Each sprint contract must include:
- Objective.
- Touched surfaces.
- Expected outputs.
- Non-goals.
- Verification checks.
- Hard failure criteria.
- Handoff fields.
## 4. Proposed Repository Layout
Create this layout incrementally; do not scaffold unused modules before a sprint needs them.
```text
pyproject.toml
README.md
src/
pdf2md/
__init__.py
cli.py
conversion.py
pdf_splitter.py
paths.py
mineru_adapter.py
ir.py
markdown.py
metadata.py
quality.py
report.py
doctor.py
tests/
unit/
integration/
fixtures/
scripts/
install-mineru.ps1
install-models.py
```
Planned module responsibilities:
- `cli.py`: command parsing, CLI summaries, exit codes.
- `conversion.py`: orchestration for one PDF and batch input.
- `paths.py`: input discovery, output path planning, overwrite checks.
- `mineru_adapter.py`: direct local MinerU CLI boundary.
- `ir.py`: project-owned document/page/block/asset/warning records.
- `markdown.py`: Obsidian Markdown normalization.
- `metadata.py`: metadata schema creation and warning aggregation.
- `quality.py`: local checks for assets, math renderability, and output sanity.
- `report.py`: `<stem>.report.md` generation from metadata.
- `doctor.py`: environment, dependency, CUDA/GPU, MinerU, and cache diagnostics.
## 5. Sprint Sequence
### Sprint 0: Source And Environment Verification
Active contract:
- `docs/Sprints/SPRINT0CONTRACT.md`
Objective:
- Verify the facts needed before implementation starts.
Touched surfaces:
- `docs/KNOWLEDGEBASE.md`
- `docs/V1IMPLEMENTATIONPLAN.md` if sequencing changes
- `PROGRESS.md`
Expected outputs:
- Confirmed MinerU 3.1.0 install command, CLI invocation shape, version command, output paths, and local execution behavior.
- Confirmed Python 3.12, `uv`, CUDA/PyTorch, and GTX 1070 Ti 8GB risks.
- Confirmed license notes needed before redistribution.
Verification checks:
- All volatile facts cite official MinerU, Python, uv, PyTorch/CUDA, or license sources.
- No candidate engine comparison is reintroduced.
- No implementation code is created.
Hard failure criteria:
- MinerU 3.1.0 cannot be reasonably invoked through a direct local CLI on the target environment.
- Python 3.12 compatibility is not viable without changing project requirements.
Primary agents:
- `research-agent`
- `local-setup-agent`
- `license-privacy-agent`
### Sprint 1: Project Scaffold And Fast Test Loop
Active contract:
- `docs/Sprints/SPRINT1CONTRACT.md`
Objective:
- Create the minimal Python project structure and a fast local test loop.
Touched surfaces:
- `pyproject.toml`
- `src/pdf2md/__init__.py`
- `tests/`
- Development documentation if needed
Expected outputs:
- `uv sync` works.
- `uv run pytest` works.
- Project package imports as `pdf2md`.
- CLI entry point name `pdf2md` is reserved but may initially expose only `doctor` or a clear placeholder until later sprints.
- If `uv` is still unavailable locally, Sprint 1 records that blocker and is not marked complete.
Verification checks:
- Import test passes.
- Empty test suite or initial scaffold tests pass.
- No runtime network dependency is introduced.
Hard failure criteria:
- Project cannot be installed with `uv`.
- Scaffolding adds speculative config systems, extra engines, or unused abstractions.
Primary agents:
- `harness-planner-agent`
- `feature-generator-agent`
- `evaluation-agent`
### Sprint 2: Paths, Input Discovery, And Overwrite Planning
Active contract:
- `docs/Sprints/SPRINT2CONTRACT.md`
Objective:
- Implement deterministic input and output planning before conversion logic exists.
Touched surfaces:
- `paths.py`
- `conversion.py` skeleton if needed
- CLI path handling tests
Expected outputs:
- Single PDF discovery.
- Directory PDF discovery.
- Recursive traversal only when requested.
- Deterministic output paths for Markdown, assets, metadata JSON, report, and optional raw MinerU output.
- Existing-output protection unless `--overwrite` is passed.
Verification checks:
- Unit tests for single PDF path planning.
- Unit tests for directory and recursive discovery.
- Unit tests for overwrite behavior.
- Tests include Korean or non-ASCII filename handling using generated temporary files, not committed sample PDFs.
Hard failure criteria:
- Output planning can overwrite user files without explicit overwrite intent.
- Directory conversion descends recursively without `--recursive`.
Primary agents:
- `feature-generator-agent`
- `evaluation-agent`
### Sprint 3: Domain Records, Metadata, And Warning Model
Active contract:
- `docs/Sprints/SPRINT3CONTRACT.md`
Objective:
- Define project-owned records before binding to MinerU output.
Touched surfaces:
- `ir.py`
- `metadata.py`
- `report.py` skeleton if needed
- Unit tests
Expected outputs:
- Document, page, block, asset, and warning records.
- Stable warning codes from `ARCHITECTURE.md`.
- Metadata JSON builder with required top-level and summary fields.
- Warning aggregation logic.
Verification checks:
- Unit tests for metadata schema creation.
- Unit tests for warning aggregation.
- Unit tests for optional fields such as bbox and confidence being preserved only when present.
Hard failure criteria:
- Public API requires raw MinerU objects.
- Metadata omits source PDF, SHA-256, engine, pages, warnings, assets, or summary.
Primary agents:
- `metadata-agent`
- `feature-generator-agent`
- `evaluation-agent`
### Sprint 4: MinerU Adapter With Mocked Contract
Active contract:
- `docs/Sprints/SPRINT4CONTRACT.md`
Objective:
- Build the direct local MinerU adapter boundary with mocked outputs first.
Touched surfaces:
- `mineru_adapter.py`
- `doctor.py` partial checks
- Adapter tests with fake subprocess results and fake output directories
Expected outputs:
- Adapter availability check.
- Version check.
- Direct CLI command construction.
- Strict-local command validation.
- Subprocess execution wrapper capturing stdout, stderr, exit code, and paths.
- Parsed adapter result object with raw Markdown, raw structured data when available, assets, warnings, engine, engine version, options, exit code, and stderr.
- Baseline command shape based on MinerU 3.1.0 direct local CLI: `mineru -p <input> -o <output>`.
- Strict-local validation allows CLI-internal temporary local `mineru-api` orchestration, while rejecting `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends.
Verification checks:
- Mocked successful MinerU output test.
- Mocked missing MinerU test.
- Mocked non-zero exit test.
- Test that prohibited remote/API flags cannot be introduced.
- No real MinerU/model dependency in default tests.
Hard failure criteria:
- Adapter passes `--api-url`, uses router mode, uses an HTTP client backend, or connects to a remote API or remote OpenAI-compatible backend.
- Adapter falls back to another engine after MinerU failure.
- Tests require model downloads by default.
Primary agents:
- `mineru-integration-agent`
- `feature-generator-agent`
- `evaluation-agent`
### Sprint 5: Obsidian Markdown Normalization And Assets
Active contract:
- `docs/Sprints/SPRINT5CONTRACT.md`
Objective:
- Normalize MinerU/project IR output into Obsidian-friendly Markdown.
Touched surfaces:
- `markdown.py`
- `quality.py` partial asset link checks
- Unit tests
Expected outputs:
- Inline math delimiter normalization to `$...$`.
- Display math delimiter normalization to `$$...$$`.
- Blank-line normalization around display math.
- Relative asset link normalization.
- Simple table preservation and complex table fallback warnings.
- No visible page markers by default.
Verification checks:
- Unit tests for inline math.
- Unit tests for display math spacing.
- Unit tests for underscores/carets inside math.
- Unit tests for relative asset links.
- Unit tests for table fallback warning behavior.
Hard failure criteria:
- Normalization rewrites LaTeX semantics without deterministic tests.
- Generated links are absolute when relative links are required.
- Page provenance is only visible in Markdown and missing from metadata.
Primary agents:
- `obsidian-markdown-agent`
- `feature-generator-agent`
- `evaluation-agent`
### Sprint 6: Quality Checks And Report Generation
Active contract:
- `docs/Sprints/SPRINT6CONTRACT.md`
Objective:
- Produce local quality signals and human-readable reports from metadata.
Touched surfaces:
- `quality.py`
- `report.py`
- `metadata.py`
- Unit tests
Expected outputs:
- Missing asset link count.
- Math renderability check interface with graceful unavailable-tool handling.
- Pages-with-warnings summary.
- `<stem>.report.md` generated from metadata.
- Final status: `success`, `partial`, or `failed`.
Verification checks:
- Unit tests for report content.
- Unit tests for missing asset link count.
- Unit tests for math render failure aggregation.
- Report generation does not re-run MinerU.
Hard failure criteria:
- Report diverges from JSON metadata.
- Math render failures are silently ignored.
- Quality checks require network access.
Primary agents:
- `metadata-agent`
- `evaluation-agent`
- `feature-generator-agent`
### Sprint 7: Conversion Orchestrator, CLI, And Python API
Active contract:
- `docs/Sprints/SPRINT7CONTRACT.md`
Objective:
- Connect path planning, MinerU adapter, normalization, metadata, report, and summaries.
Touched surfaces:
- `conversion.py`
- `cli.py`
- `__init__.py`
- CLI and API tests
Expected outputs:
- `convert_pdf(input_path, output_dir, metadata=True)` public API.
- `pdf2md convert INPUT --out OUTPUT_DIR`.
- `--metadata`, `--keep-raw`, `--recursive`, `--overwrite`, `--gpu`, and `--strict-local` behavior.
- Batch conversion for directories.
- CLI summary with warning counts.
Verification checks:
- API test with mocked MinerU adapter.
- CLI single PDF test with mocked MinerU adapter.
- CLI directory test with mocked MinerU adapter.
- Existing output test.
- Failure summary test.
Hard failure criteria:
- Public API exposes raw MinerU objects as required return fields.
- CLI writes outputs after a hard failure that should stop conversion.
- CLI suppresses warning counts.
Primary agents:
- `feature-generator-agent`
- `requirements-guard-agent`
- `evaluation-agent`
### Sprint 8: Doctor And Setup Documentation
Active contract:
- `docs/Sprints/SPRINT8CONTRACT.md`
Status:
- Implemented.
Objective:
- Make local setup failures explicit before users run conversions.
Touched surfaces:
- `doctor.py`
- `cli.py`
- `README.md`
- `scripts/install-mineru.ps1`
- `scripts/install-models.py`
- Tests for mocked environment checks
Expected outputs:
- `pdf2md doctor` reports Python version, `uv`, CUDA/PyTorch GPU visibility, MinerU availability, MinerU version, and detectable model/cache paths.
- GPU unavailable warning is clear.
- Missing `uv` is reported clearly.
- Pre-Turing/Pascal GPU risk is reported clearly for GTX 1070 Ti compute capability 6.1.
- Missing required dependency causes doctor failure.
- Setup docs explain Windows PowerShell, Python 3.12, `uv`, MinerU, models, GPU expectations, and local-only behavior.
Verification checks:
- Mocked doctor tests for success, missing MinerU, missing GPU, and missing dependency.
- Documentation review for no cloud/API runtime path.
Hard failure criteria:
- Doctor says the environment is healthy when MinerU is missing.
- Doctor implies cloud/API fallback is supported.
Primary agents:
- `local-setup-agent`
- `license-privacy-agent`
- `evaluation-agent`
### Sprint 9: Local Fixture Evaluation And V1 Release Gate
Active contract:
- `docs/Sprints/SPRINT9CONTRACT.md`
Status:
- Implemented.
Objective:
- Validate the end-to-end v1 behavior against local samples without committing samples.
Touched surfaces:
- `tests/integration/`
- Optional local-only fixture manifest that does not include sample PDFs
- `README.md`
- `PROGRESS.md`
Expected outputs:
- Fast mocked integration suite.
- Optional MinerU-dependent local test command.
- Local sample coverage notes in `PROGRESS.md`.
- V1 release checklist status.
Verification checks:
- `uv run pytest` passes without model downloads.
- Optional MinerU test is clearly marked and skipped unless explicitly enabled.
- Representative sample produces Markdown, metadata JSON, report Markdown, and asset paths.
- Obsidian math delimiter expectations are met.
- No sample PDFs are staged.
Hard failure criteria:
- Default tests require GPU, MinerU models, or network access.
- Sample files are added to git unintentionally.
- V1 release checklist passes without metadata/report generation.
Primary agents and skills:
- `evaluation-agent`
- `requirements-guard-agent`
- `fixture-evaluation` skill
### Sprint 10: Pre-Conversion PDF Page Chunking
Active contract:
- `docs/Sprints/SPRINT10CONTRACT.md`
Status:
- Implemented.
Objective:
- Split long PDFs into temporary fixed-size page chunks before MinerU conversion.
Touched surfaces:
- `pdf_splitter.py`
- `conversion.py`
- `cli.py`
- `report.py`
- README and Sprint 10 documentation
- Unit tests for splitter, conversion, CLI, and report behavior
Expected outputs:
- `pdf2md convert INPUT --out OUTPUT --chunk-pages` enables 20-page chunks.
- `pdf2md convert INPUT --out OUTPUT --chunk-pages N` enables custom positive chunk size.
- `convert_pdf(..., chunk_pages=N)` returns a `BatchConversionResult` in chunk mode.
- Temporary chunk PDFs are deleted after conversion completes.
- Chunk Markdown files are separate and named with original page ranges.
- Metadata and report content expose original source path and chunk page ranges.
Verification checks:
- pypdf-based local blank PDF tests cover page counts, chunk ranges, and written chunk page counts.
- Mocked conversion tests verify one adapter call per chunk, failed-chunk continuation, chunk metadata/report context, and temporary chunk cleanup.
- CLI tests verify `--chunk-pages` without a value uses 20 pages.
Hard failure criteria:
- Chunking uploads document content or uses another conversion engine.
- Chunk outputs are merged.
- Default tests require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
## 6. Cross-Cutting Acceptance Criteria
Every implementation sprint must preserve these acceptance criteria:
- No runtime remote document processing path exists.
- MinerU is the only conversion engine.
- Failures are explicit and traceable.
- Warnings are structured and countable.
- Markdown and metadata can be traced back to source pages where available.
- Reports are generated from metadata.
- Default tests are fast and local.
- `samples/` remains untracked unless explicitly requested.
## 7. First Implementation Request Contract Template
Use this template when implementation begins.
```markdown
## Sprint Contract
Objective:
Touched surfaces:
Expected outputs:
Non-goals:
Verification checks:
Hard failure criteria:
Handoff fields:
- Files changed:
- Commands run:
- Tests passed:
- Known failures:
- Residual risks:
- Next action:
```
## 8. Open Risks
- MinerU 3.1.0 install and CLI behavior are source-verified, but real local output still needs a later local probe before release.
- GTX 1070 Ti 8GB is visible locally, but it is Pascal compute capability 6.1; `doctor` and setup docs must make CUDA/PyTorch limits clear.
- `uv` is installed per-user at `C:\Users\user\.local\bin`, but a new shell may need PATH refresh before `uv` is visible.
- Formula renderability checks need a local tool choice; the implementation should start with an interface and graceful unavailable-tool warning if needed.
- Some PDFs will have tables or formulas that cannot be faithfully represented in Markdown; metadata and `.report.md` must surface this instead of hiding it.
- Redistribution license obligations must be reviewed before packaging, redistribution, or bundling model weights.
## 9. Recommended Next Step
Run optional real local MinerU validation on a long sample only when requested. Default verification should continue to use mocked adapters and generated temporary PDFs so it remains independent of MinerU, GPU, model files, network access, and `samples/`.
Facts carried forward from Sprint 0:
- MinerU is fixed to version 3.1.0.
- Direct local CLI command shape is `mineru -p <input> -o <output>`.
- MinerU output layout should be treated as optional-file based until locally probed.
- Python 3.12 is compatible with the pinned MinerU package range.
- GTX 1070 Ti CUDA/PyTorch support needs explicit doctor validation.
- MinerU/model license posture is acceptable for personal local use. Redistribution remains out of scope until reviewed.
+146
View File
@@ -0,0 +1,146 @@
# V1 Release Checklist
Use this checklist from the repository root when deciding whether v1 is ready for local use. It separates the default fast gates from optional MinerU/GPU/sample validation so the normal loop stays independent of real models, GPU, network access, Obsidian, LaTeX tooling, and `samples/`.
## Release Status Rules
- Default fast gates passing means the repository is healthy under mocked/local checks.
- Optional local MinerU fixture gates passing means real local sample conversion has also been validated.
- If `pdf2md doctor` reports a hard failure, v1 release status is blocked unless the user records an explicit waiver with the exact reason and scope.
- Do not mark v1 fully validated when optional MinerU/sample checks are skipped or blocked. Record the blocked reason and release recommendation in `PROGRESS.md`.
- Do not claim perfect LaTeX, table, or reading-order reconstruction. The v1 guarantee is best-effort local conversion with warnings, metadata, and a human-readable report.
## Default Fast Gates
These gates should be runnable without real MinerU execution, sample PDFs, model files, GPU acceleration, network access, Obsidian, or LaTeX tooling. They must not install packages or download models during imports, tests, `doctor`, or `convert`.
| Gate | Pass condition | Release handling |
| --- | --- | --- |
| `uv sync` | Exits 0. | Blocks if project dependencies cannot sync. |
| `uv run pytest` | Exits 0 using mocked/local tests only. | Blocks if default tests require real MinerU, GPU, CUDA, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`. |
| `uv run pdf2md --version` | Exits 0 and prints the installed CLI version. | Blocks if the CLI entry point is unavailable. |
| `uv run pdf2md doctor` | Runs to completion and reports pass, warning-only, or hard-fail status. | A hard failure blocks release readiness unless explicitly waived by the user and recorded. Warning-only status can continue, but warnings must be recorded. |
| `git diff --check` | Exits 0. | Blocks on whitespace or patch formatting errors. |
| `git status --short --untracked-files=all` | Shows no staged sample PDFs or generated sample outputs. | Blocks if any `samples/` file or sample-derived output is staged or committed unintentionally. |
## Strict-Local Gate
Before calling v1 ready, verify the release candidate still follows the strict-local policy:
- Allowed runtime path: direct local `mineru` CLI execution.
- Allowed MinerU internal behavior: MinerU 3.1.0 may start a temporary local `mineru-api` when the CLI runs without `--api-url`.
- Prohibited runtime paths: `--api-url`, remote APIs, router mode, HTTP client backends, remote OpenAI-compatible backends, hosted renderers, cloud OCR, remote LLM/VLM calls, remote document parsers, alternate engines, runtime engine selection, and silent fallback after MinerU failure.
- Setup downloads must be explicit user actions and remain separate from runtime conversion. Default tests, imports, `doctor`, `convert`, and helper checks must not download packages or models.
## Doctor Hard-Failure Handling
Treat a non-zero `pdf2md doctor` result as a release blocker for real v1 readiness. Common hard failures include missing required local dependencies, missing `mineru` on PATH, or a strict-local policy failure. Warning-only doctor results can continue, but the warnings must be recorded.
When `doctor` hard-fails:
- Do not run optional sample conversion as a passing release gate.
- Do not mark optional MinerU/GPU/sample validation as skipped-pass. Mark it blocked.
- Do not use a cloud/API fallback or alternate converter to bypass the failure.
- Record the failing check, exit code, and next action in `PROGRESS.md`.
- If the user chooses to proceed anyway, record the waiver and report the release as waived or risk-accepted, not fully validated.
## Optional Local MinerU, GPU, And Sample Gates
Run these only by explicit opt-in after the default gates. They are intended for a local workstation with MinerU 3.1.0, model/cache setup, and GPU/PyTorch expectations already checked by `doctor`.
Preconditions:
- `uv run pdf2md doctor` has no hard failures, or the user has recorded an explicit waiver.
- Source PDFs come from local `samples/` or another user-provided local directory.
- Generated outputs go to a temporary directory or another ignored local output root, never to tracked fixture paths.
- Sample PDFs, `samples/metadata.json`, and generated sample outputs remain unstaged and uncommitted.
Manual single-sample shape:
```powershell
$sample = "samples\FourNodeQuadrilateralShellElementMITC4.pdf"
$out = Join-Path $env:TEMP ("pdf2md-fixture-" + [guid]::NewGuid().ToString())
uv run pdf2md convert $sample --out $out --metadata --overwrite
```
Optional pytest fixture evaluation:
```powershell
$env:PDF2MD_RUN_MINERU_FIXTURES = "1"
uv run pytest tests/integration/test_optional_mineru_fixtures.py
Remove-Item Env:PDF2MD_RUN_MINERU_FIXTURES
```
This optional pytest path runs `pdf2md doctor` first. If doctor has a hard failure, the fixture test is skipped with a blocker message instead of being counted as a passing real-MinerU validation.
A sample conversion is successful only when all of these are true:
- The command exits 0.
- The planned Markdown file exists: `<output>\<stem>.md`.
- The planned metadata JSON exists: `<output>\<stem>.metadata.json`.
- The planned quality report exists: `<output>\<stem>.report.md`.
- Metadata and report warning counts are consistent enough to explain math, table, reading-order, asset, MinerU, and checker-unavailable risks.
- Any Markdown image links resolve relative to the Markdown file, or missing/broken links are reported as warnings.
Missing Markdown, metadata JSON, or `.report.md` means the sample failed or is blocked. Do not count it as a partial success for release gating.
For each attempted sample, record at least:
- Source filename.
- Command run.
- Exit code.
- Generated Markdown path.
- Generated metadata JSON path.
- Generated `.report.md` path.
- Warning count and final status.
- Math renderability failures or checker-unavailable count.
- Table fallback or degradation count when available.
- Missing or broken asset link count.
- Page coverage when available.
- Doctor status and any GPU/PyTorch/model/cache warnings.
## Fixture Coverage Notes
Local fixture coverage should include these risk categories where samples are available:
- Simple digital PDF with a text layer.
- Math-heavy PDF.
- Multi-column or complex reading order.
- Table with formulas.
- Figure, caption, or extracted asset links.
- Korean or non-ASCII filename/path handling.
Observed local fixture map as of 2026-05-08:
| Local sample | Fixture risks covered | Notes |
| --- | --- | --- |
| `samples/FourNodeQuadrilateralShellElementMITC4.pdf` | simple digital PDF, math-heavy engineering content, table/formula risk | Small sample suitable for first optional MinerU smoke validation. |
| `samples/MITC공부.pdf` | Korean filename/path handling, math-heavy notes, table/formula risk | Useful for non-ASCII path and shell-element notation checks. |
| `samples/2007쉘구조물의유한요소해석에대하여.pdf` | Korean filename/path handling, longer academic layout, math-heavy content, reading-order risk | Larger sample; use after the small smoke sample passes. |
| `samples/유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf` | Korean filename/path handling, math-heavy content, figures/assets, reading-order risk | Larger sample for report and warning quality review. |
| `samples/metadata.json` | fixture notes only | Local untracked context; do not treat as generated v1 metadata unless manually confirmed. |
Coverage gaps to keep visible:
- A deliberately simple one-page digital PDF fixture is still useful for release smoke checks.
- A table-dominant sample with known formula cells would make table degradation easier to judge.
- A figure-heavy sample with expected extracted assets would make asset link validation easier to judge.
Do not score fixture quality only by plain-text edit distance. Include math delimiter/renderability behavior, tables, reading order, assets, metadata fields, warning usefulness, and `.report.md` usefulness.
## No-Sample-Commit Check
Before staging or committing any release-gate work:
```powershell
git status --short --untracked-files=all
```
Confirm:
- `samples/` files are not staged.
- Generated sample outputs are not staged.
- No sample PDF or sample-derived binary was copied into tracked tests or docs.
- Any temporary fixture output inside the repository was removed unless the user explicitly approved keeping it and it is ignored.
`samples/` may appear as untracked local context. That is expected; do not add it unless the user explicitly requests it.