add pdftomd
This commit is contained in:
@@ -0,0 +1,239 @@
|
||||
# Sprint 0 Contract: Source And Environment Verification
|
||||
|
||||
Status: Completed
|
||||
Last updated: 2026-05-07
|
||||
Result: PASS, go-with-risks
|
||||
|
||||
## Objective
|
||||
|
||||
Verify the external facts and local environment assumptions needed before any converter implementation starts.
|
||||
|
||||
Current amendment: the project target changed after the original Sprint 0 pass from MinerU 2.5 to MinerU 3.1.0. Read MinerU version constraints in this contract as applying to MinerU 3.1.0 for future work.
|
||||
|
||||
Sprint 0 must answer whether the planned v1 implementation can proceed with:
|
||||
|
||||
- MinerU 3.1.0 through direct local CLI execution only.
|
||||
- Python 3.12 and `uv`.
|
||||
- Windows PowerShell on the current workspace.
|
||||
- NVIDIA GTX 1070 Ti 8GB as the target GPU.
|
||||
- Local-only processing with no cloud OCR, remote LLM/VLM, hosted parser, `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
|
||||
- The temporary local `mineru-api` process started internally by MinerU 3.1.0 CLI is allowed when CLI runs without `--api-url`.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed:
|
||||
|
||||
- `docs/KNOWLEDGEBASE.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md` only if the sprint sequence or constraints need adjustment
|
||||
- `docs/Sprints/SPRINT0CONTRACT.md`
|
||||
- `PROGRESS.md`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- `pyproject.toml`
|
||||
- `src/`
|
||||
- `tests/`
|
||||
- `scripts/`
|
||||
- Any converter implementation code
|
||||
- Any committed file under `samples/`
|
||||
|
||||
## Expected Outputs
|
||||
|
||||
Sprint 0 should produce a concise source-backed handoff in `docs/KNOWLEDGEBASE.md` under the heading `## 9. Sprint 0 Verification (2026-05-07)`, plus a short status and handoff entry in `PROGRESS.md`.
|
||||
|
||||
Evidence requirements:
|
||||
|
||||
- Prefer official or primary sources.
|
||||
- Use non-official sources only when official documentation is incomplete and the source is directly relevant.
|
||||
- For volatile implementation claims, record source URL, access date, and whether the claim is a direct fact or a project inference.
|
||||
- Record failures from allowed local commands as environment facts. Do not fix them during Sprint 0.
|
||||
- Web research is allowed for documentation verification. Runtime converter design must remain local-only.
|
||||
|
||||
The handoff must cover:
|
||||
|
||||
1. MinerU 3.1.0 local CLI facts
|
||||
- Install command or supported install path.
|
||||
- Version command or reliable version detection path.
|
||||
- Direct local CLI invocation shape for PDF conversion.
|
||||
- Supported output locations for Markdown, JSON/structured data, assets, logs, and raw diagnostics.
|
||||
- Whether local execution can be kept free of router/API/HTTP endpoint modes.
|
||||
|
||||
2. Python and package workflow facts
|
||||
- Python 3.12 compatibility status for the planned dependency stack.
|
||||
- `uv` setup implications.
|
||||
- Any packaging constraints that affect Sprint 1 scaffolding.
|
||||
|
||||
3. GPU and runtime facts
|
||||
- CUDA/PyTorch expectations relevant to GTX 1070 Ti 8GB.
|
||||
- Known GPU memory risks or CPU fallback/error-message implications.
|
||||
- Commands that future `pdf2md doctor` should check.
|
||||
|
||||
4. License and privacy facts
|
||||
- MinerU license status.
|
||||
- Model-weight license status when identifiable.
|
||||
- Transitive package/license risks that must be reviewed before redistribution.
|
||||
- Confirmation that v1 runtime design remains local-only.
|
||||
|
||||
5. Implementation go/no-go recommendation
|
||||
- `go`: proceed to Sprint 1 with listed assumptions.
|
||||
- `go-with-risks`: proceed but carry specific risks into later sprint contracts. Each risk must name the later sprint that must absorb it.
|
||||
- `blocked`: stop and ask the user for a requirement change.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- Do not scaffold the Python project.
|
||||
- Do not install MinerU, CUDA, PyTorch, or model weights.
|
||||
- Do not run full PDF conversion.
|
||||
- Do not edit `samples/` or commit sample files.
|
||||
- Do not introduce candidate engine comparisons.
|
||||
- Do not decide a new conversion engine.
|
||||
- Do not add implementation abstractions, config systems, or CLI flags.
|
||||
|
||||
## Work Packages
|
||||
|
||||
### WP0.1: MinerU Source Review
|
||||
|
||||
Owner:
|
||||
|
||||
- `research-agent`
|
||||
- `mineru-research` skill
|
||||
|
||||
Actions:
|
||||
|
||||
- Review official MinerU documentation, MinerU GitHub, release notes/tags, and relevant license files.
|
||||
- Verify current install, CLI, version, output, model/cache, and local execution facts.
|
||||
- Record source URLs and access date for durable claims.
|
||||
|
||||
Output:
|
||||
|
||||
- Source-backed MinerU fact table in `docs/KNOWLEDGEBASE.md` under `## 9. Sprint 0 Verification (2026-05-07)`.
|
||||
- Any adapter-impacting uncertainty listed in `PROGRESS.md`.
|
||||
|
||||
### WP0.2: Local Environment Review
|
||||
|
||||
Owner:
|
||||
|
||||
- `local-setup-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Check non-invasive local facts, such as Python version, `uv` availability, and GPU visibility when available.
|
||||
- Review official Python, `uv`, PyTorch/CUDA, and NVIDIA-relevant documentation for compatibility constraints.
|
||||
- Identify future `pdf2md doctor` checks.
|
||||
|
||||
Output:
|
||||
|
||||
- Environment compatibility notes and doctor-check requirements.
|
||||
- Explicit GTX 1070 Ti 8GB risks.
|
||||
|
||||
Allowed local commands:
|
||||
|
||||
```powershell
|
||||
python --version
|
||||
uv --version
|
||||
nvidia-smi
|
||||
```
|
||||
|
||||
Only run these commands if they are useful for the active research pass. Do not install or modify system packages.
|
||||
|
||||
### WP0.3: Output Layout Probe Plan
|
||||
|
||||
Owner:
|
||||
|
||||
- `mineru-integration-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Define what must be observed from MinerU before implementing the adapter.
|
||||
- If MinerU is already installed locally, a later user-approved probe may inspect command help or run against a disposable file. Sprint 0 does not require conversion execution.
|
||||
|
||||
Output:
|
||||
|
||||
- Adapter-facing output layout assumptions.
|
||||
- List of fields that must remain optional until observed from real MinerU output.
|
||||
|
||||
### WP0.4: License And Privacy Review
|
||||
|
||||
Owner:
|
||||
|
||||
- `license-privacy-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Review MinerU and model/package license sources.
|
||||
- Distinguish personal/research use from redistribution.
|
||||
- Check that no planned runtime path uploads PDFs, page images, extracted text, or model intermediates.
|
||||
|
||||
Output:
|
||||
|
||||
- License/privacy summary with unresolved obligations.
|
||||
- Blockers if redistribution assumptions are unsafe.
|
||||
|
||||
### WP0.5: Contract Evaluation
|
||||
|
||||
Owner:
|
||||
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Review this contract before Sprint 0 research starts.
|
||||
- Require concrete evidence expectations and failure thresholds.
|
||||
- After Sprint 0 research, independently verify that each expected output is present and source-backed.
|
||||
|
||||
Output:
|
||||
|
||||
- Pass/fail evaluation notes.
|
||||
- Specific follow-up findings if the contract or results are incomplete.
|
||||
|
||||
## Verification Checks
|
||||
|
||||
Required:
|
||||
|
||||
- Every volatile implementation fact has an official or primary source URL.
|
||||
- Source-backed claims distinguish direct fact from project inference.
|
||||
- `docs/KNOWLEDGEBASE.md` remains consistent with `PRD.md` and `ARCHITECTURE.md`.
|
||||
- No candidate engine comparison is reintroduced.
|
||||
- No converter implementation code is created.
|
||||
- `samples/` remains untracked.
|
||||
- `git diff --check` passes for documentation changes.
|
||||
|
||||
Recommended:
|
||||
|
||||
- Check all newly added links manually or with a lightweight local link check if available.
|
||||
- Have `evaluation-agent` review the completed Sprint 0 outputs before proceeding to Sprint 1.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
Sprint 0 fails and must stop for a user decision if any of these are true:
|
||||
|
||||
- MinerU 3.1.0 cannot be invoked through a direct local CLI path suitable for v1.
|
||||
- MinerU's required v1 path requires `--api-url`, router mode, HTTP client backends, remote APIs, or remote OpenAI-compatible backend behavior.
|
||||
- Python 3.12 is incompatible with the required implementation stack.
|
||||
- Local-only processing cannot be maintained.
|
||||
- License terms clearly block the intended personal/research use.
|
||||
- Required evidence cannot be verified from official or primary sources.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
Sprint 0 is complete when:
|
||||
|
||||
- `docs/KNOWLEDGEBASE.md` contains updated Sprint 0 facts with sources.
|
||||
- `PROGRESS.md` records checks performed, unresolved risks, and go/no-go recommendation.
|
||||
- Future Sprint 1 can proceed without guessing install, CLI, environment, or licensing assumptions.
|
||||
- The evaluator review is complete.
|
||||
- The completed documentation change is committed.
|
||||
|
||||
## Handoff Fields
|
||||
|
||||
Use these fields when Sprint 0 completes:
|
||||
|
||||
- Files changed:
|
||||
- Sources checked:
|
||||
- Local commands run:
|
||||
- Facts confirmed:
|
||||
- Inferences made:
|
||||
- Known failures:
|
||||
- Residual risks:
|
||||
- Go/no-go recommendation:
|
||||
- Next action:
|
||||
@@ -0,0 +1,355 @@
|
||||
# Sprint 10 Contract: Pre-Conversion PDF Page Chunking
|
||||
|
||||
Status: Implemented
|
||||
Last updated: 2026-05-08
|
||||
|
||||
## Objective
|
||||
|
||||
Add an opt-in pre-conversion workflow for long PDFs:
|
||||
|
||||
1. Split each source PDF into fixed-size chunk PDFs of 20 pages.
|
||||
2. Convert each chunk PDF independently through the existing MinerU conversion pipeline.
|
||||
3. Do not merge the generated Markdown files.
|
||||
|
||||
The feature is intended to reduce long-document memory/runtime pressure and make partial progress usable when one chunk fails. It must preserve strict-local execution, keep MinerU 3.1.0 as the only conversion engine, and keep default tests independent of real MinerU, GPU, CUDA, model files, network access, Obsidian, LaTeX tooling, and `samples/`.
|
||||
|
||||
## Research Summary
|
||||
|
||||
Sources checked on 2026-05-08:
|
||||
|
||||
- [pypdf PyPI](https://pypi.org/project/pypdf/): current release observed as `6.10.2`, uploaded 2026-04-15; metadata lists `BSD-3-Clause`, Python `>=3.9`, Python 3.12 support, and describes pypdf as a pure-Python PDF library capable of splitting, merging, cropping, and transforming PDF pages.
|
||||
- [pypdf merging docs](https://pypdf.readthedocs.io/en/stable/user/merging-pdfs.html): `PdfWriter.append()` can append a complete or partial source PDF; examples use zero-based page ranges such as `(0, 10)`. The docs recommend `append` or `merge` over low-level `add_page` / `insert_page`.
|
||||
- [pypdf streaming docs](https://pypdf.readthedocs.io/en/latest/user/streaming-data.html): `PdfReader` and `PdfWriter` support file-like objects, but the project should write chunk PDFs to local disk because MinerU accepts local file paths.
|
||||
- [pypdf PdfWriter docs](https://pypdf.readthedocs.io/en/stable/modules/PdfWriter.html): writer operations clone/copy PDF objects into the destination. The docs warn that cloning linked objects can copy more than just the visible page object in some cases, so chunk output size must be checked in tests.
|
||||
- [MinerU CLI tools docs](https://opendatalab.github.io/MinerU/usage/cli_tools/): the direct `mineru` CLI accepts `-p/--path`, `-o/--output`, `-s/--start`, and `-e/--end`; without `--api-url`, it launches a temporary local `mineru-api`.
|
||||
- [PyMuPDF PyPI](https://pypi.org/project/PyMuPDF/): PyMuPDF is fast and local, but PyPI lists dual licensing under GNU AGPL v3 or an Artifex commercial license.
|
||||
- [pikepdf page assembly docs](https://pikepdf.readthedocs.io/en/latest/topics/pages.html): pikepdf can split pages and transfer page-associated data; it is a capable fallback candidate but adds a QPDF-backed dependency and is not needed for a first implementation.
|
||||
|
||||
## Recommended Package Decision
|
||||
|
||||
Use `pypdf` for Sprint 10.
|
||||
|
||||
Rationale:
|
||||
|
||||
- It is pure Python and fits the current Python 3.12 + `uv` workflow.
|
||||
- It has permissive `BSD-3-Clause` metadata on PyPI.
|
||||
- It directly supports page-level PDF assembly with `PdfReader` / `PdfWriter`.
|
||||
- It avoids adding PyMuPDF's AGPL/commercial licensing considerations for a simple split-only feature.
|
||||
- It avoids adding pikepdf/QPDF native dependency complexity before there is evidence that pypdf cannot handle the project samples.
|
||||
|
||||
Recommended dependency range for implementation:
|
||||
|
||||
```toml
|
||||
dependencies = [
|
||||
"pypdf>=6.10.2,<7",
|
||||
]
|
||||
```
|
||||
|
||||
The implementation adds this dependency to `pyproject.toml` and `uv.lock`.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
- `pdf2md convert` already converts one PDF or a directory of PDFs.
|
||||
- Existing conversion output per input includes:
|
||||
- Markdown
|
||||
- optional metadata JSON, enabled by default
|
||||
- `<stem>.report.md`
|
||||
- assets directory
|
||||
- optional raw MinerU output
|
||||
- `plan_outputs()` already enforces overwrite and output-root safety.
|
||||
- `convert_input()` already handles directory batches and continues after per-file failures.
|
||||
- The MinerU adapter accepts one PDF path at a time and runs direct local `mineru` CLI.
|
||||
- `samples/` is local and untracked; do not commit sample PDFs or generated outputs.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed during implementation:
|
||||
|
||||
- `pyproject.toml`
|
||||
- `uv.lock`
|
||||
- `src/pdf2md/pdf_splitter.py`
|
||||
- `src/pdf2md/paths.py`
|
||||
- `src/pdf2md/conversion.py`
|
||||
- `src/pdf2md/cli.py`
|
||||
- `src/pdf2md/ir.py` only if new warning codes or chunk provenance records are required
|
||||
- `src/pdf2md/metadata.py` only for chunk provenance fields
|
||||
- `src/pdf2md/report.py` only to expose chunk provenance in reports
|
||||
- `tests/test_pdf_splitter.py`
|
||||
- `tests/test_conversion.py`
|
||||
- `tests/test_cli.py`
|
||||
- `tests/test_paths.py`
|
||||
- `tests/test_metadata.py`
|
||||
- `tests/integration/` for mocked chunk workflow coverage
|
||||
- `README.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md`
|
||||
- `docs/Sprints/SPRINT10CONTRACT.md`
|
||||
- `PLAN.md`
|
||||
- `PROGRESS.md`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- Runtime engine selection or alternate conversion engines.
|
||||
- Use of cloud OCR, remote LLM/VLM, hosted renderers, hosted document parsers, `--api-url`, router mode, HTTP client backends, remote APIs, or remote OpenAI-compatible backends.
|
||||
- Mandatory default tests requiring real MinerU, GPU, CUDA, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
- Committed files under `samples/`.
|
||||
- Committed generated conversion outputs.
|
||||
- Automatic model or package downloads triggered by import time, `doctor`, `convert`, or tests.
|
||||
- Markdown merge behavior for chunk outputs.
|
||||
- Claims that chunking improves formula correctness; it is only a processing-control feature.
|
||||
|
||||
## Product Behavior
|
||||
|
||||
Activation:
|
||||
|
||||
- Chunking is opt-in and existing conversion behavior is unchanged when `chunk_pages` is unset.
|
||||
- CLI: `pdf2md convert INPUT --out OUTPUT_DIR --chunk-pages` uses the default chunk size of 20 pages.
|
||||
- CLI: `pdf2md convert INPUT --out OUTPUT_DIR --chunk-pages 20` uses an explicit positive chunk size.
|
||||
- Python API: `convert_pdf(..., chunk_pages=20)` and `convert_input(..., chunk_pages=20)`.
|
||||
- `convert_pdf()` returns `ConversionResult` without chunking and `BatchConversionResult` when chunk mode is active.
|
||||
- `chunk_pages` must be `None` or a positive integer.
|
||||
|
||||
Chunking behavior:
|
||||
|
||||
- If `chunk_pages` is unset, current behavior remains unchanged.
|
||||
- If `chunk_pages=20` and a PDF has 20 or fewer pages, conversion may either:
|
||||
- convert the original PDF directly, or
|
||||
- create one chunk PDF and convert that chunk.
|
||||
- Recommended: convert the original directly when `total_pages <= chunk_pages` to avoid unnecessary intermediate files.
|
||||
- If a PDF has more than 20 pages, split it into chunk PDFs with ranges:
|
||||
- chunk 1: source pages 1-20
|
||||
- chunk 2: source pages 21-40
|
||||
- chunk N: remaining pages
|
||||
- Convert chunk PDFs sequentially, not in parallel. GTX 1070 Ti 8GB memory pressure makes sequential conversion the safer default.
|
||||
- If one chunk conversion fails, continue with later chunks and report the failed chunk clearly.
|
||||
- Do not merge Markdown outputs.
|
||||
|
||||
Recommended chunk output naming:
|
||||
|
||||
```text
|
||||
<stem>.part-001.pages-001-020.md
|
||||
<stem>.part-001.pages-001-020.metadata.json
|
||||
<stem>.part-001.pages-001-020.report.md
|
||||
<stem>.part-001.pages-001-020.assets/
|
||||
|
||||
<stem>.part-002.pages-021-040.md
|
||||
...
|
||||
```
|
||||
|
||||
Recommended chunk PDF staging:
|
||||
|
||||
- Use a temporary working directory.
|
||||
- Delete temporary chunk PDFs after conversion completes, including when `--keep-raw` is enabled.
|
||||
- Do not add a separate `--keep-chunks` flag in Sprint 10.
|
||||
|
||||
## Provenance Requirements
|
||||
|
||||
Each chunk conversion must preserve original-source context.
|
||||
|
||||
Required chunk fields in metadata or engine options:
|
||||
|
||||
- original source PDF path
|
||||
- original source SHA-256
|
||||
- chunk PDF path when retained, or chunk PDF filename when temporary
|
||||
- chunk index, 1-based
|
||||
- total chunk count
|
||||
- source page start, 1-based inclusive
|
||||
- source page end, 1-based inclusive
|
||||
- chunk page count
|
||||
|
||||
Page provenance must distinguish:
|
||||
|
||||
- chunk-local page index, starting at 0 for MinerU output
|
||||
- original source page number, starting at 1 for user-facing reports
|
||||
|
||||
The report should include a short chunk context line, for example:
|
||||
|
||||
```text
|
||||
- Chunk: 2/5, source pages: 21-40
|
||||
```
|
||||
|
||||
## Architecture Plan
|
||||
|
||||
### WP10.1: PDF Splitter Module
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
- `mineru-integration-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add `src/pdf2md/pdf_splitter.py`.
|
||||
- Define project-owned `PdfChunkPlan`.
|
||||
- Implement page counting with `pypdf.PdfReader`.
|
||||
- Implement chunk planning without writing files.
|
||||
- Implement chunk writing with `pypdf.PdfWriter.append(source, (start, end))` or an equivalent tested `PdfReader`/`PdfWriter` path.
|
||||
- Use zero-based half-open page ranges internally and one-based inclusive ranges for filenames and reports.
|
||||
- Reject invalid chunk sizes with clear `ValueError`.
|
||||
- Fail clearly on encrypted/password-protected PDFs unless a later sprint adds password handling.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Deterministic chunk plans and local chunk PDFs suitable for the existing MinerU adapter.
|
||||
|
||||
### WP10.2: Chunk-Aware Path Planning
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Extend output planning so chunk outputs are deterministic and conflict-checked before conversion starts.
|
||||
- Avoid collisions between original-output stems and chunk-output stems.
|
||||
- Preserve output-root escape prevention.
|
||||
- Respect `--overwrite`.
|
||||
- Keep Korean and non-ASCII source stems working.
|
||||
|
||||
Expected output:
|
||||
|
||||
- A long PDF can produce multiple planned Markdown/metadata/report/assets outputs without overwriting another chunk.
|
||||
|
||||
### WP10.3: Conversion Orchestration
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
- `mineru-integration-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add chunk mode to `convert_pdf()` and `convert_input()`.
|
||||
- When chunk mode is active, split before calling the MinerU adapter.
|
||||
- Reuse existing per-PDF conversion path for each chunk PDF rather than creating a second conversion pipeline.
|
||||
- Continue conversion after a chunk-level failure and aggregate a batch-like result for the source.
|
||||
- Ensure temporary chunk directories are cleaned up unless raw retention is requested.
|
||||
- Keep strict-local validation unchanged.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Long PDF conversion yields separate Markdown/metadata/report/assets outputs per chunk.
|
||||
|
||||
### WP10.4: Metadata And Report Chunk Provenance
|
||||
|
||||
Owner:
|
||||
|
||||
- `metadata-agent`
|
||||
- `obsidian-markdown-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add chunk provenance fields without exposing raw pypdf objects.
|
||||
- Keep existing required metadata fields valid.
|
||||
- Keep original source provenance visible even though MinerU sees a chunk PDF as input.
|
||||
- Ensure chunk reports are readable without opening JSON metadata.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Users can map each output file back to original source pages.
|
||||
|
||||
### WP10.5: CLI And API Surface
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
- `requirements-guard-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add `--chunk-pages [INTEGER]` to `pdf2md convert`.
|
||||
- Keep chunking disabled unless the option is present.
|
||||
- Use 20 pages when `--chunk-pages` is present without an explicit value.
|
||||
- Validate positive integer input.
|
||||
- Keep `--out`, `--metadata`, `--keep-raw`, `--recursive`, `--overwrite`, `--gpu`, and strict-local behavior unchanged.
|
||||
- Update README with the long-PDF workflow.
|
||||
|
||||
Expected output:
|
||||
|
||||
```powershell
|
||||
uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages 20
|
||||
```
|
||||
|
||||
### WP10.6: Tests
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
- `evaluation-agent`
|
||||
|
||||
Default tests must not require real MinerU or sample PDFs.
|
||||
|
||||
Required tests:
|
||||
|
||||
- Build small local PDF fixtures using `pypdf` blank pages or minimal test PDFs.
|
||||
- Page count detection for 1, 20, 21, 40, and 41 pages.
|
||||
- Chunk planning produces expected 1-based filenames and page ranges.
|
||||
- Chunk writing produces PDFs with the expected page counts.
|
||||
- Non-positive chunk size is rejected.
|
||||
- Existing conversion without `--chunk-pages` is unchanged.
|
||||
- Chunked conversion calls the fake adapter once per chunk.
|
||||
- Chunked conversion writes separate Markdown, metadata JSON, report Markdown, and assets per chunk.
|
||||
- `--overwrite` and conflict detection work for all planned chunk outputs.
|
||||
- A failed chunk does not silently fallback and does not prevent later chunks from being attempted.
|
||||
- Metadata/report contain original source PDF and source page range.
|
||||
- CLI validates `--chunk-pages` and prints a useful summary.
|
||||
|
||||
Optional local validation:
|
||||
|
||||
- Run chunked conversion on a local `samples/` PDF only by explicit user request or opt-in gate.
|
||||
- Do not commit generated chunk PDFs or outputs.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- Sprint 10 implementation can split a PDF into 20-page chunk PDFs before MinerU conversion.
|
||||
- Chunk PDFs are converted one by one using the existing direct local MinerU CLI adapter.
|
||||
- Markdown outputs are separate and not merged.
|
||||
- Metadata/report files show chunk index and original page range.
|
||||
- Default test suite passes without real MinerU, GPU, CUDA, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
- Strict-local policy remains unchanged.
|
||||
- Existing non-chunked conversion behavior remains backward-compatible.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
- Chunking uses a remote PDF service or uploads document content.
|
||||
- Chunking introduces an alternate Markdown conversion engine.
|
||||
- Default tests require real MinerU, GPU, CUDA, model files, network, or local samples.
|
||||
- Chunk outputs overwrite each other or overwrite non-chunk outputs without `--overwrite`.
|
||||
- Chunk metadata loses original source page provenance.
|
||||
- The implementation merges Markdown despite this contract's non-merge requirement.
|
||||
- The implementation silently skips failed chunks without warnings.
|
||||
|
||||
## Resolved Decisions
|
||||
|
||||
- Activation mode: opt-in with `--chunk-pages`; the option defaults to 20 pages when no value is supplied.
|
||||
- Chunk PDF retention: temporary chunk PDFs only; they are deleted after conversion completes.
|
||||
- API return type: `convert_pdf()` returns a `BatchConversionResult` when chunk mode is active.
|
||||
|
||||
## Verification Commands For Implementation
|
||||
|
||||
```powershell
|
||||
uv sync
|
||||
uv run pytest tests/test_pdf_splitter.py tests/test_conversion.py tests/test_cli.py tests/test_paths.py tests/test_metadata.py
|
||||
uv run pytest
|
||||
git diff --check
|
||||
git status --short --untracked-files=all
|
||||
```
|
||||
|
||||
Optional local command after implementation and explicit user approval:
|
||||
|
||||
```powershell
|
||||
uv run pdf2md convert samples/MITC공부.pdf --out outputs --chunk-pages 20 --overwrite
|
||||
```
|
||||
|
||||
## Handoff Requirements
|
||||
|
||||
After implementation:
|
||||
|
||||
- Update `PROGRESS.md` with files changed, commands run, tests passed, optional sample status, known failures, residual risks, and next action.
|
||||
- Do not mark the sprint implemented until independent evaluation or equivalent focused review verifies the acceptance criteria.
|
||||
- Commit the completed change without including `samples/` or generated outputs.
|
||||
|
||||
## Implementation Handoff
|
||||
|
||||
- Files changed: `pyproject.toml`, `uv.lock`, `src/pdf2md/pdf_splitter.py`, `src/pdf2md/conversion.py`, `src/pdf2md/cli.py`, `src/pdf2md/__init__.py`, `src/pdf2md/report.py`, tests, README, `docs/V1IMPLEMENTATIONPLAN.md`, `PLAN.md`, and `PROGRESS.md`.
|
||||
- Verification status: targeted unit tests passed 42 tests; the full local test suite passed 163 tests with 1 optional skip; `git diff --check` passed with line-ending warnings only.
|
||||
- Optional local sample conversion remains out of scope unless explicitly requested.
|
||||
@@ -0,0 +1,258 @@
|
||||
# Sprint 1 Contract: Project Scaffold And Fast Test Loop
|
||||
|
||||
Status: Completed
|
||||
Last updated: 2026-05-07
|
||||
|
||||
## Objective
|
||||
|
||||
Create the minimal Python project scaffold and fast local test loop for the PDF-to-Markdown converter.
|
||||
|
||||
Sprint 1 must establish:
|
||||
|
||||
- A `uv`-managed Python 3.12 project.
|
||||
- A source package importable as `pdf2md`.
|
||||
- A reserved `pdf2md` CLI entry point that does not implement conversion yet.
|
||||
- A fast test command that runs without MinerU, model downloads, GPU access, sample PDFs, or network access.
|
||||
|
||||
Sprint 1 is scaffolding only. It must not implement PDF conversion, MinerU execution, Markdown normalization, metadata generation, or report generation.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
Sprint 0 found that `uv` was not available on PATH in the current local environment.
|
||||
|
||||
Sprint 1 resolved this by installing `uv` per-user at `C:\Users\user\.local\bin`.
|
||||
|
||||
Before Sprint 1 can be accepted, one of these must happen:
|
||||
|
||||
- `uv` is installed and `uv --version` succeeds.
|
||||
- The user explicitly approves including `uv` bootstrap documentation or setup handling as part of Sprint 1, and the contract result records that `uv sync` could not be run locally.
|
||||
|
||||
Do not silently replace `uv` with another package manager.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed:
|
||||
|
||||
- `pyproject.toml`
|
||||
- `uv.lock`
|
||||
- `.gitignore`
|
||||
- `src/pdf2md/__init__.py`
|
||||
- `src/pdf2md/cli.py` only for a minimal placeholder CLI if needed for entry point verification
|
||||
- `tests/`
|
||||
- `README.md` only for minimal setup/test instructions if needed
|
||||
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
|
||||
- `PROGRESS.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
|
||||
- `docs/Sprints/SPRINT1CONTRACT.md`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- `src/pdf2md/conversion.py`
|
||||
- `src/pdf2md/mineru_adapter.py`
|
||||
- `src/pdf2md/paths.py`
|
||||
- `src/pdf2md/ir.py`
|
||||
- `src/pdf2md/markdown.py`
|
||||
- `src/pdf2md/metadata.py`
|
||||
- `src/pdf2md/quality.py`
|
||||
- `src/pdf2md/report.py`
|
||||
- `src/pdf2md/doctor.py`
|
||||
- `scripts/`
|
||||
- Any real MinerU invocation
|
||||
- Any model download or install script
|
||||
- Any committed file under `samples/`
|
||||
|
||||
## Expected Outputs
|
||||
|
||||
Sprint 1 should produce:
|
||||
|
||||
1. Project package scaffold
|
||||
- `pyproject.toml` with project metadata.
|
||||
- Python requirement constrained to Python 3.12.
|
||||
- Build configuration suitable for a `src/` layout.
|
||||
- `uv.lock` generated by `uv sync`.
|
||||
- `.gitignore` entries for local virtual environments, pytest cache, and Python bytecode.
|
||||
- Minimal test dependency configuration.
|
||||
- CLI entry point name reserved as `pdf2md`.
|
||||
|
||||
2. Minimal source package
|
||||
- `src/pdf2md/__init__.py`.
|
||||
- A stable package import surface.
|
||||
- Optional minimal `src/pdf2md/cli.py` placeholder that exits clearly and does not imply conversion is implemented.
|
||||
|
||||
3. Fast test loop
|
||||
- A minimal test suite that verifies the package imports.
|
||||
- If a CLI placeholder is added, a smoke test that verifies the CLI entry point is wired without invoking conversion.
|
||||
- Tests must not require MinerU, CUDA, GPU, model files, `samples/`, or network.
|
||||
|
||||
4. Developer workflow
|
||||
- `uv sync` should work when `uv` is installed.
|
||||
- `uv run pytest` should work when `uv` is installed.
|
||||
- If `uv` is still missing locally, record the failure explicitly in `PROGRESS.md` and do not mark Sprint 1 complete.
|
||||
|
||||
5. Handoff
|
||||
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- Do not implement PDF discovery.
|
||||
- Do not implement conversion orchestration.
|
||||
- Do not implement the MinerU adapter.
|
||||
- Do not run MinerU.
|
||||
- Do not install MinerU 3.1.0.
|
||||
- Do not download MinerU models.
|
||||
- Do not implement Markdown normalization.
|
||||
- Do not implement metadata JSON or `.report.md` output.
|
||||
- Do not implement `pdf2md doctor`; a CLI placeholder may mention future commands, but it must not create a doctor module.
|
||||
- Do not add runtime engine selection.
|
||||
- Do not add alternate conversion engines.
|
||||
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.
|
||||
|
||||
## Work Packages
|
||||
|
||||
### WP1.1: Scaffold Metadata
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Create the minimal `pyproject.toml`.
|
||||
- Use Python 3.12 constraints.
|
||||
- Configure a `src/` package layout.
|
||||
- Configure pytest as the fast local test runner.
|
||||
- Reserve the `pdf2md` console script.
|
||||
|
||||
Output:
|
||||
|
||||
- A minimal, maintainable scaffold without speculative dependencies.
|
||||
|
||||
### WP1.2: Package Import Surface
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Create `src/pdf2md/__init__.py`.
|
||||
- Expose only a minimal version/import surface.
|
||||
- Avoid public API promises beyond what Sprint 1 verifies.
|
||||
|
||||
Output:
|
||||
|
||||
- `import pdf2md` succeeds.
|
||||
|
||||
### WP1.3: CLI Placeholder
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- If needed for console script verification, create `src/pdf2md/cli.py`.
|
||||
- The placeholder may expose a help message or a clear "not implemented yet" command.
|
||||
- It must not create conversion flags beyond the reserved command shape unless tests need them.
|
||||
|
||||
Output:
|
||||
|
||||
- `pdf2md` entry point is wired without implying conversion works.
|
||||
|
||||
### WP1.4: Fast Tests
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add minimal tests for package import and optional CLI placeholder behavior.
|
||||
- Ensure tests are local, fast, and independent of MinerU/model/GPU/network state.
|
||||
|
||||
Output:
|
||||
|
||||
- `uv run pytest` passes when `uv` is available.
|
||||
|
||||
### WP1.5: Independent Evaluation
|
||||
|
||||
Owner:
|
||||
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Review the completed scaffold against this contract.
|
||||
- Verify no converter implementation was added.
|
||||
- Verify `samples/` remains untracked and unstaged.
|
||||
- Verify no runtime remote path or alternate engine was introduced.
|
||||
|
||||
Output:
|
||||
|
||||
- PASS/FAIL notes with any missing acceptance criteria.
|
||||
|
||||
## Verification Checks
|
||||
|
||||
Required:
|
||||
|
||||
- `git status --short` before staging confirms `samples/` remains untracked.
|
||||
- `uv --version` is run and result is recorded.
|
||||
- `uv sync` passes if `uv` is available.
|
||||
- `uv run pytest` passes if `uv` is available.
|
||||
- If `uv` is unavailable, Sprint 1 is marked blocked rather than complete.
|
||||
- Import test passes through the configured test command.
|
||||
- No real MinerU dependency is required for default tests.
|
||||
- No model downloads occur.
|
||||
- No network calls are required.
|
||||
- No candidate engine comparison is reintroduced.
|
||||
- No conversion behavior is implemented.
|
||||
- `git diff --check` passes.
|
||||
|
||||
Recommended:
|
||||
|
||||
- Keep `pyproject.toml` dependency list minimal.
|
||||
- Avoid adding README content beyond setup/test instructions needed for the scaffold.
|
||||
- Use `requirements-guard-agent` to check document consistency if the scaffold reveals a sequencing issue.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
Sprint 1 fails and must stop for a user decision if any of these are true:
|
||||
|
||||
- `uv` remains unavailable and the user has not approved bootstrap handling.
|
||||
- The project cannot be installed as a Python 3.12 package.
|
||||
- The package cannot be imported as `pdf2md`.
|
||||
- Default tests require MinerU, model downloads, GPU access, sample PDFs, or network access.
|
||||
- The scaffold introduces conversion logic outside Sprint 1 scope.
|
||||
- The scaffold introduces alternate engines or runtime engine selection.
|
||||
- The scaffold introduces `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
|
||||
- `samples/` is staged or committed.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
Sprint 1 is complete when:
|
||||
|
||||
- `pyproject.toml` exists and defines a minimal Python 3.12 `uv` project.
|
||||
- `src/pdf2md/__init__.py` exists and `import pdf2md` works through the project environment.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- The `pdf2md` CLI entry point is reserved and does not imply conversion is implemented.
|
||||
- No converter implementation code beyond the allowed placeholder exists.
|
||||
- No default test depends on MinerU, GPU, model files, network, or `samples/`.
|
||||
- `PROGRESS.md` records checks performed and residual risks.
|
||||
- Independent evaluation is complete.
|
||||
- The completed change is committed.
|
||||
|
||||
## Handoff Fields
|
||||
|
||||
Use these fields when Sprint 1 completes:
|
||||
|
||||
- Files changed:
|
||||
- Commands run:
|
||||
- Tests passed:
|
||||
- Tests blocked:
|
||||
- Known failures:
|
||||
- Residual risks:
|
||||
- User decisions needed:
|
||||
- Go/no-go recommendation for Sprint 2:
|
||||
- Next action:
|
||||
@@ -0,0 +1,274 @@
|
||||
# Sprint 2 Contract: Paths, Input Discovery, And Overwrite Planning
|
||||
|
||||
Status: Completed
|
||||
Last updated: 2026-05-07
|
||||
|
||||
## Objective
|
||||
|
||||
Implement deterministic input discovery and output path planning before any PDF conversion logic exists.
|
||||
|
||||
Sprint 2 must establish:
|
||||
|
||||
- A project-owned path planning module for local PDF inputs.
|
||||
- Deterministic discovery for a single PDF, a directory, and optional recursive directory traversal.
|
||||
- Deterministic planned output paths for Markdown, assets, metadata JSON, quality report, and optional raw MinerU output.
|
||||
- Preflight overwrite conflict detection that prevents accidental replacement unless overwrite is explicitly allowed.
|
||||
- Fast unit tests using generated temporary files, including non-ASCII filenames.
|
||||
|
||||
Sprint 2 is path planning only. It must not run MinerU, parse PDFs, write conversion outputs, normalize Markdown, create metadata content, or add the real `convert` command behavior.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
Sprint 1 is complete:
|
||||
|
||||
- `uv` is installed per-user at `C:\Users\user\.local\bin`.
|
||||
- `pyproject.toml`, `uv.lock`, the `pdf2md` package, CLI placeholder, and fast pytest loop exist.
|
||||
- `uv sync` and `uv run pytest` passed.
|
||||
|
||||
If a new shell cannot find `uv`, prepend `C:\Users\user\.local\bin` to PATH for verification commands and record that in `PROGRESS.md`.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed:
|
||||
|
||||
- `src/pdf2md/paths.py`
|
||||
- `src/pdf2md/conversion.py` only for a minimal type boundary if path planning cannot be tested cleanly without it
|
||||
- `src/pdf2md/cli.py` only if a minimal parser hook is needed for path-planning tests; do not expose working conversion behavior
|
||||
- `tests/test_paths.py` or `tests/unit/test_paths.py`
|
||||
- `tests/test_cli.py` only for path-planning parser coverage if `cli.py` changes
|
||||
- `README.md` only if setup/test instructions need a small update
|
||||
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
|
||||
- `PROGRESS.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
|
||||
- `docs/Sprints/SPRINT2CONTRACT.md`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- `src/pdf2md/mineru_adapter.py`
|
||||
- `src/pdf2md/ir.py`
|
||||
- `src/pdf2md/markdown.py`
|
||||
- `src/pdf2md/metadata.py`
|
||||
- `src/pdf2md/quality.py`
|
||||
- `src/pdf2md/report.py`
|
||||
- `src/pdf2md/doctor.py`
|
||||
- `scripts/`
|
||||
- Any real MinerU invocation
|
||||
- Any model download or install script
|
||||
- Any file parsing beyond local filesystem path and extension checks
|
||||
- Any conversion output writing beyond temporary files created by tests
|
||||
- Any committed file under `samples/`
|
||||
|
||||
## Expected Outputs
|
||||
|
||||
Sprint 2 should produce:
|
||||
|
||||
1. Input discovery
|
||||
- Accept a local path that is either a PDF file or a directory.
|
||||
- Treat `.pdf` extension matching as case-insensitive.
|
||||
- Reject a non-existent path with a clear project-owned error.
|
||||
- Reject a non-PDF file with a clear project-owned error.
|
||||
- Reject a directory with no discovered PDFs with a clear project-owned error.
|
||||
- Discover only direct child PDFs for directory input unless recursive traversal is requested.
|
||||
- Discover nested PDFs only when recursive traversal is requested.
|
||||
- Return discovered PDFs in a deterministic order.
|
||||
|
||||
2. Output path plan
|
||||
- For each discovered PDF, plan:
|
||||
- Markdown path: `<output-root>/<relative-parent>/<stem>.md`.
|
||||
- Assets directory: `<output-root>/<relative-parent>/<stem>.assets`.
|
||||
- Metadata path when metadata is enabled: `<output-root>/<relative-parent>/<stem>.metadata.json`.
|
||||
- Quality report path: `<output-root>/<relative-parent>/<stem>.report.md`.
|
||||
- Raw MinerU directory when raw output is kept: `<output-root>/<relative-parent>/<stem>.raw`.
|
||||
- For a single PDF input, `relative-parent` is empty unless the implementation has a tested reason to preserve more context.
|
||||
- For recursive directory input, preserve the source-relative subdirectory under the output root to avoid filename collisions.
|
||||
- Keep planned paths local filesystem paths. Do not introduce URI, URL, cloud, or remote storage handling.
|
||||
|
||||
3. Overwrite preflight
|
||||
- Detect existing planned file or directory outputs before conversion writes occur.
|
||||
- Report all detected conflicts in one project-owned error instead of failing on the first conflict.
|
||||
- Allow conflicts only when overwrite is explicitly enabled.
|
||||
- Do not delete or replace files in Sprint 2.
|
||||
|
||||
4. Tests
|
||||
- Unit tests for single PDF discovery.
|
||||
- Unit tests for non-recursive directory discovery.
|
||||
- Unit tests for recursive directory discovery.
|
||||
- Unit tests for deterministic ordering.
|
||||
- Unit tests for non-ASCII filenames, including Korean filenames, using temporary files.
|
||||
- Unit tests for invalid input errors.
|
||||
- Unit tests for planned Markdown, assets, metadata, report, and raw output paths.
|
||||
- Unit tests for overwrite conflict detection.
|
||||
|
||||
5. Handoff
|
||||
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- Do not implement PDF conversion.
|
||||
- Do not implement conversion orchestration.
|
||||
- Do not implement the MinerU adapter.
|
||||
- Do not run MinerU.
|
||||
- Do not install MinerU 3.1.0.
|
||||
- Do not download MinerU models.
|
||||
- Do not parse PDF contents.
|
||||
- Do not compute source SHA-256.
|
||||
- Do not implement Markdown normalization.
|
||||
- Do not implement metadata JSON content.
|
||||
- Do not implement `.report.md` content.
|
||||
- Do not implement `pdf2md convert` as a working command.
|
||||
- Do not implement `pdf2md doctor`.
|
||||
- Do not add runtime engine selection.
|
||||
- Do not add alternate conversion engines.
|
||||
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.
|
||||
|
||||
## Work Packages
|
||||
|
||||
### WP2.1: Path Planning Types And Errors
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add the smallest project-owned types needed to represent discovered inputs, planned outputs, and overwrite conflicts.
|
||||
- Add clear project-owned exceptions or error result types for invalid inputs and conflicts.
|
||||
- Avoid public API promises beyond what Sprint 2 tests verify.
|
||||
|
||||
Output:
|
||||
|
||||
- Path planning can be tested without converter execution.
|
||||
|
||||
### WP2.2: Input Discovery
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Implement single PDF and directory discovery.
|
||||
- Require explicit recursive mode for subdirectory traversal.
|
||||
- Sort results deterministically.
|
||||
- Preserve local `Path` objects rather than converting to strings early.
|
||||
|
||||
Output:
|
||||
|
||||
- Discovery behavior matches PRD directory and recursive requirements.
|
||||
|
||||
### WP2.3: Output Planning
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Plan Markdown, assets, metadata, report, and optional raw output paths.
|
||||
- Preserve relative subdirectories for recursive directory input.
|
||||
- Keep all planned outputs under the requested output root.
|
||||
|
||||
Output:
|
||||
|
||||
- Later conversion code can write outputs without rediscovering naming rules.
|
||||
|
||||
### WP2.4: Overwrite Conflict Detection
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Check whether any planned output already exists.
|
||||
- Return or raise a structured conflict list when overwrite is not enabled.
|
||||
- Permit the plan when overwrite is enabled without deleting anything.
|
||||
|
||||
Output:
|
||||
|
||||
- Existing user files are protected before conversion starts.
|
||||
|
||||
### WP2.5: Independent Evaluation
|
||||
|
||||
Owner:
|
||||
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Review the completed path planning implementation against this contract.
|
||||
- Verify no conversion behavior, MinerU execution, remote runtime path, or alternate engine was added.
|
||||
- Verify `samples/` remains untracked and unstaged.
|
||||
- Verify tests use temporary files, not committed sample PDFs.
|
||||
|
||||
Output:
|
||||
|
||||
- PASS/FAIL notes with any missing acceptance criteria.
|
||||
|
||||
## Verification Checks
|
||||
|
||||
Required:
|
||||
|
||||
- `git status --short` before staging confirms `samples/` remains untracked.
|
||||
- `uv --version` is run and result is recorded.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- Targeted path planning tests pass.
|
||||
- Tests do not require MinerU, CUDA, GPU, model files, `samples/`, or network.
|
||||
- No real MinerU dependency is required for default tests.
|
||||
- No model downloads occur.
|
||||
- No network calls are required.
|
||||
- No candidate engine comparison is reintroduced.
|
||||
- No conversion behavior is implemented.
|
||||
- No output files are written outside temporary test directories.
|
||||
- `git diff --check` passes.
|
||||
|
||||
Recommended:
|
||||
|
||||
- Use temporary directories for all filesystem tests.
|
||||
- Include Windows-relevant path behavior without hard-coding Windows-only separators in assertions.
|
||||
- Use `requirements-guard-agent` if path planning reveals a contradiction in PRD or architecture wording.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
Sprint 2 fails and must stop for a user decision if any of these are true:
|
||||
|
||||
- Directory conversion descends recursively without explicit recursive intent.
|
||||
- Existing planned outputs can be overwritten without explicit overwrite intent.
|
||||
- Planned output paths can escape the requested output root.
|
||||
- Default tests require MinerU, CUDA, GPU, model files, network, or `samples/`.
|
||||
- The implementation parses PDF contents or invokes conversion behavior.
|
||||
- The implementation introduces alternate engines or runtime engine selection.
|
||||
- The implementation introduces `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
|
||||
- `samples/` is staged or committed.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
Sprint 2 is complete when:
|
||||
|
||||
- `src/pdf2md/paths.py` exists and owns input discovery plus output path planning.
|
||||
- Single PDF discovery is tested.
|
||||
- Non-recursive and recursive directory discovery are tested.
|
||||
- Non-ASCII PDF filenames are tested with generated temporary files.
|
||||
- Markdown, assets, metadata JSON, report Markdown, and optional raw output paths are tested.
|
||||
- Existing-output conflict detection is tested with and without overwrite enabled.
|
||||
- No conversion, MinerU, Markdown normalization, metadata content, report content, or doctor behavior is implemented.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- `PROGRESS.md` records checks performed and residual risks.
|
||||
- Independent evaluation is complete.
|
||||
- The completed change is committed.
|
||||
|
||||
## Handoff Fields
|
||||
|
||||
Use these fields when Sprint 2 completes:
|
||||
|
||||
- Files changed:
|
||||
- Commands run:
|
||||
- Tests passed:
|
||||
- Tests blocked:
|
||||
- Known failures:
|
||||
- Residual risks:
|
||||
- User decisions needed:
|
||||
- Go/no-go recommendation for Sprint 3:
|
||||
- Next action:
|
||||
@@ -0,0 +1,303 @@
|
||||
# Sprint 3 Contract: Domain Records, Metadata, And Warning Model
|
||||
|
||||
Status: Completed
|
||||
Last updated: 2026-05-07
|
||||
|
||||
## Objective
|
||||
|
||||
Define project-owned domain records, warning records, and metadata JSON construction before binding the system to MinerU output.
|
||||
|
||||
Sprint 3 must establish:
|
||||
|
||||
- Internal records for documents, pages, blocks, assets, warnings, and conversion outputs.
|
||||
- Stable warning code and severity definitions aligned with `ARCHITECTURE.md`.
|
||||
- A metadata builder that produces the required v1 top-level and summary fields.
|
||||
- Warning aggregation behavior that later report generation can consume.
|
||||
- Fast unit tests that do not require MinerU, model files, GPU, sample PDFs, or network.
|
||||
|
||||
Sprint 3 is schema and metadata modeling only. It must not run MinerU, parse PDFs, normalize Markdown, generate final report Markdown content, expose a working `convert` command, or add remote/runtime engine behavior.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
Sprint 2 is complete:
|
||||
|
||||
- `src/pdf2md/paths.py` owns input discovery and output path planning.
|
||||
- `tests/test_paths.py` verifies directory recursion, non-ASCII filenames, overwrite conflict detection, duplicate planned outputs, and output-root escape prevention.
|
||||
- `uv run pytest` passed 21 tests.
|
||||
|
||||
Sprint 3 may use path planning records as context, but it should not depend on actual conversion output.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed:
|
||||
|
||||
- `src/pdf2md/ir.py`
|
||||
- `src/pdf2md/metadata.py`
|
||||
- `src/pdf2md/report.py` only for a minimal type boundary if metadata/report handoff cannot be expressed cleanly without it
|
||||
- `src/pdf2md/__init__.py` only if exporting a minimal stable type is necessary and tested
|
||||
- `tests/test_ir.py` or `tests/unit/test_ir.py`
|
||||
- `tests/test_metadata.py` or `tests/unit/test_metadata.py`
|
||||
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
|
||||
- `PROGRESS.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
|
||||
- `docs/Sprints/SPRINT3CONTRACT.md`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- `src/pdf2md/mineru_adapter.py`
|
||||
- `src/pdf2md/markdown.py`
|
||||
- `src/pdf2md/quality.py`
|
||||
- `src/pdf2md/doctor.py`
|
||||
- `scripts/`
|
||||
- Any real MinerU invocation
|
||||
- Any model download or install script
|
||||
- Any PDF content parsing
|
||||
- Any Markdown normalization behavior
|
||||
- Any `.report.md` content generation beyond a minimal handoff type if absolutely needed
|
||||
- Any working `pdf2md convert` or `pdf2md doctor` behavior
|
||||
- Any committed file under `samples/`
|
||||
|
||||
## Expected Outputs
|
||||
|
||||
Sprint 3 should produce:
|
||||
|
||||
1. Domain records
|
||||
- `DocumentRecord` or equivalent project-owned record.
|
||||
- `PageRecord` or equivalent with page index and optional page dimensions.
|
||||
- `BlockRecord` or equivalent with block type, optional page index, optional bbox, optional confidence, and optional Markdown character span.
|
||||
- `AssetRecord` or equivalent with stable relative path and optional source page/provenance.
|
||||
- `WarningRecord` or equivalent with code, severity, message, optional page index, and optional bbox.
|
||||
- `ConversionOutputRecord` or equivalent only if useful for connecting metadata to later orchestration; it must not invoke conversion.
|
||||
|
||||
2. Stable enums or constants
|
||||
- Block types aligned with `ARCHITECTURE.md`: `heading`, `paragraph`, `inline_formula`, `display_formula`, `table`, `figure`, `caption`, `footnote`, `reference`, and `unknown`.
|
||||
- Warning codes aligned with `ARCHITECTURE.md`, including at least:
|
||||
- `ENGINE_MISSING`
|
||||
- `GPU_UNAVAILABLE`
|
||||
- `LOW_CONFIDENCE_FORMULA`
|
||||
- `MATH_RENDER_FAILED`
|
||||
- `ASSET_LINK_MISSING`
|
||||
- `READING_ORDER_UNCERTAIN`
|
||||
- `STRICT_LOCAL_VIOLATION`
|
||||
- `MINERU_CLI_FAILED`
|
||||
- Warning severity values sufficient for v1 metadata and report summaries, such as `info`, `warning`, and `error`.
|
||||
|
||||
3. Metadata builder
|
||||
- Build a JSON-serializable metadata object with required top-level fields:
|
||||
- `source_pdf`
|
||||
- `source_sha256`
|
||||
- `created_at`
|
||||
- `engine`
|
||||
- `engine_version`
|
||||
- `engine_options`
|
||||
- `pages`
|
||||
- `assets`
|
||||
- `warnings`
|
||||
- `summary`
|
||||
- Build required summary fields:
|
||||
- `pages_processed`
|
||||
- `warning_count`
|
||||
- `asset_count`
|
||||
- `display_formula_count`
|
||||
- `inline_formula_count`
|
||||
- `math_render_error_count`
|
||||
- Preserve optional fields such as bbox and confidence only when present.
|
||||
- Require `source_sha256` as an input value. Sprint 3 should not compute hashes by reading PDFs unless the contract is explicitly amended.
|
||||
- Produce only plain Python data structures that `json.dumps` can serialize without custom encoders.
|
||||
|
||||
4. Warning aggregation
|
||||
- Count warnings.
|
||||
- Count math render failures from `MATH_RENDER_FAILED`.
|
||||
- Preserve warning order unless there is a tested reason to sort.
|
||||
- Preserve page-level warning data when available.
|
||||
|
||||
5. Tests
|
||||
- Unit tests for domain record serialization.
|
||||
- Unit tests for metadata schema creation with all required top-level fields.
|
||||
- Unit tests for summary counts.
|
||||
- Unit tests for warning aggregation.
|
||||
- Unit tests that optional bbox and confidence fields are preserved only when present.
|
||||
- Unit tests that metadata is JSON serializable.
|
||||
- Unit tests that metadata requires source PDF, source SHA-256, engine, engine version, and page records.
|
||||
|
||||
6. Handoff
|
||||
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- Do not implement PDF conversion.
|
||||
- Do not implement conversion orchestration.
|
||||
- Do not implement the MinerU adapter.
|
||||
- Do not run MinerU.
|
||||
- Do not install MinerU 3.1.0.
|
||||
- Do not download MinerU models.
|
||||
- Do not parse PDF contents.
|
||||
- Do not compute source SHA-256 by reading files unless this contract is explicitly amended.
|
||||
- Do not implement Markdown normalization.
|
||||
- Do not implement asset link checking.
|
||||
- Do not implement math renderability checking.
|
||||
- Do not implement full `.report.md` content generation.
|
||||
- Do not implement `pdf2md convert` as a working command.
|
||||
- Do not implement `pdf2md doctor`.
|
||||
- Do not add runtime engine selection.
|
||||
- Do not add alternate conversion engines.
|
||||
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.
|
||||
|
||||
## Work Packages
|
||||
|
||||
### WP3.1: Domain Record Types
|
||||
|
||||
Owner:
|
||||
|
||||
- `metadata-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Define small project-owned records for document/page/block/asset/warning concepts.
|
||||
- Use simple, typed Python structures that are easy to serialize and test.
|
||||
- Keep MinerU-specific raw objects out of public and required fields.
|
||||
|
||||
Output:
|
||||
|
||||
- `ir.py` contains the minimal domain model needed by metadata construction.
|
||||
|
||||
### WP3.2: Warning Codes And Severities
|
||||
|
||||
Owner:
|
||||
|
||||
- `metadata-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Define stable warning codes from `ARCHITECTURE.md`.
|
||||
- Define severity values and validate warning records against them.
|
||||
- Avoid inventing speculative warning categories beyond the known v1 set unless needed by tests.
|
||||
|
||||
Output:
|
||||
|
||||
- Warnings are structured, countable, and stable across later sprints.
|
||||
|
||||
### WP3.3: Metadata Builder
|
||||
|
||||
Owner:
|
||||
|
||||
- `metadata-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Build required metadata JSON data from project-owned records.
|
||||
- Preserve optional provenance fields only when present.
|
||||
- Require source PDF path, source SHA-256, engine, engine version, pages, assets, warnings, and engine options as explicit inputs.
|
||||
|
||||
Output:
|
||||
|
||||
- `metadata.py` produces the required v1 metadata object without MinerU execution.
|
||||
|
||||
### WP3.4: Metadata And Warning Tests
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add focused unit tests for schema, counts, optional fields, JSON serialization, and validation failures.
|
||||
- Use in-memory records and temporary paths only.
|
||||
|
||||
Output:
|
||||
|
||||
- `uv run pytest` verifies metadata behavior without external dependencies.
|
||||
|
||||
### WP3.5: Independent Evaluation
|
||||
|
||||
Owner:
|
||||
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Review the completed records and metadata builder against this contract.
|
||||
- Verify no conversion behavior, MinerU execution, remote runtime path, alternate engine, Markdown normalization, quality checks, or report content generation was added.
|
||||
- Verify `samples/` remains untracked and unstaged.
|
||||
|
||||
Output:
|
||||
|
||||
- PASS/FAIL notes with any missing acceptance criteria.
|
||||
|
||||
## Verification Checks
|
||||
|
||||
Required:
|
||||
|
||||
- `git status --short` before staging confirms `samples/` remains untracked.
|
||||
- `uv --version` is run and result is recorded.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- Targeted IR/metadata tests pass.
|
||||
- Metadata output is JSON serializable through `json.dumps`.
|
||||
- Tests do not require MinerU, CUDA, GPU, model files, `samples/`, or network.
|
||||
- No real MinerU dependency is required for default tests.
|
||||
- No model downloads occur.
|
||||
- No network calls are required.
|
||||
- No candidate engine comparison is reintroduced.
|
||||
- No conversion behavior is implemented.
|
||||
- No Markdown normalization behavior is implemented.
|
||||
- No full `.report.md` content generation is implemented.
|
||||
- `git diff --check` passes.
|
||||
|
||||
Recommended:
|
||||
|
||||
- Keep dataclass or enum APIs small and explicit.
|
||||
- Prefer one serialization function per record over ad hoc dict mutation in tests.
|
||||
- Include tests that fail if a required metadata top-level field is omitted.
|
||||
- Use `requirements-guard-agent` if metadata requirements conflict between `PRD.md` and `ARCHITECTURE.md`.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
Sprint 3 fails and must stop for a user decision if any of these are true:
|
||||
|
||||
- Metadata omits source PDF, source SHA-256, engine, engine version, pages, warnings, assets, or summary.
|
||||
- Summary omits pages processed, warning count, asset count, display formula count, inline formula count, or math render error count.
|
||||
- Public or required metadata fields require raw MinerU objects.
|
||||
- Optional bbox, confidence, or page provenance is dropped when provided.
|
||||
- Optional bbox, confidence, or page provenance is invented when absent.
|
||||
- Default tests require MinerU, CUDA, GPU, model files, network, or `samples/`.
|
||||
- The implementation parses PDF contents, invokes conversion behavior, normalizes Markdown, or generates full report Markdown content.
|
||||
- The implementation introduces alternate engines or runtime engine selection.
|
||||
- The implementation introduces `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
|
||||
- `samples/` is staged or committed.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
Sprint 3 is complete when:
|
||||
|
||||
- `src/pdf2md/ir.py` exists and owns project domain records.
|
||||
- `src/pdf2md/metadata.py` exists and builds required metadata JSON data from project-owned records.
|
||||
- Stable block types and warning codes are defined and tested.
|
||||
- Metadata top-level fields and summary fields are tested.
|
||||
- Warning aggregation is tested.
|
||||
- Optional bbox and confidence preservation is tested.
|
||||
- Metadata JSON serializability is tested.
|
||||
- No conversion, MinerU, Markdown normalization, quality check, full report generation, or doctor behavior is implemented.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- `PROGRESS.md` records checks performed and residual risks.
|
||||
- Independent evaluation is complete.
|
||||
- The completed change is committed.
|
||||
|
||||
## Handoff Fields
|
||||
|
||||
Use these fields when Sprint 3 completes:
|
||||
|
||||
- Files changed:
|
||||
- Commands run:
|
||||
- Tests passed:
|
||||
- Tests blocked:
|
||||
- Known failures:
|
||||
- Residual risks:
|
||||
- User decisions needed:
|
||||
- Go/no-go recommendation for Sprint 4:
|
||||
- Next action:
|
||||
@@ -0,0 +1,316 @@
|
||||
# Sprint 4 Contract: MinerU Adapter With Mocked Contract
|
||||
|
||||
Status: Implemented
|
||||
Last updated: 2026-05-07
|
||||
|
||||
## Objective
|
||||
|
||||
Build the direct local MinerU 3.1.0 adapter boundary using mocked subprocess results and fake output directories first.
|
||||
|
||||
Sprint 4 must establish:
|
||||
|
||||
- A project-owned adapter module that is the only boundary for MinerU CLI interaction.
|
||||
- Deterministic command construction for direct local MinerU CLI execution.
|
||||
- Strict-local validation that rejects prohibited remote/API/router/backend options.
|
||||
- Subprocess execution wrapping that captures stdout, stderr, exit code, command, and generated paths.
|
||||
- Optional-file parsing for mocked MinerU output directories.
|
||||
- Clear adapter result and warning records for missing MinerU, failed CLI execution, missing output, and strict-local violations.
|
||||
- Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, or network.
|
||||
|
||||
Sprint 4 is an adapter contract sprint. It must not connect the adapter to real conversion orchestration, Markdown normalization, metadata writing, report generation, or a working `pdf2md convert` command.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
Sprint 3 is complete:
|
||||
|
||||
- `src/pdf2md/paths.py` owns input discovery and output path planning.
|
||||
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
|
||||
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
|
||||
- `uv run pytest` passed 46 tests.
|
||||
|
||||
Sprint 4 may import project-owned warning records from `ir.py`, but it must not require raw MinerU objects as public or required return fields.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed:
|
||||
|
||||
- `src/pdf2md/mineru_adapter.py`
|
||||
- `src/pdf2md/doctor.py` only for minimal adapter availability/version check types if needed; do not implement full `pdf2md doctor`
|
||||
- `src/pdf2md/ir.py` only for narrowly required warning/result fields discovered while implementing the adapter contract
|
||||
- `tests/test_mineru_adapter.py` or `tests/unit/test_mineru_adapter.py`
|
||||
- `tests/test_doctor.py` only if `doctor.py` is touched for adapter availability/version checks
|
||||
- `README.md` only if a small note is needed to clarify mocked adapter tests versus real MinerU setup
|
||||
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
|
||||
- `PROGRESS.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
|
||||
- `docs/Sprints/SPRINT4CONTRACT.md`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- `src/pdf2md/conversion.py`
|
||||
- `src/pdf2md/markdown.py`
|
||||
- `src/pdf2md/quality.py`
|
||||
- `src/pdf2md/report.py`
|
||||
- Working `pdf2md convert` behavior
|
||||
- Full `pdf2md doctor` behavior
|
||||
- `scripts/`
|
||||
- Any real MinerU invocation in default tests
|
||||
- Any MinerU/model installation or download script
|
||||
- Any PDF content parsing
|
||||
- Any Markdown normalization behavior
|
||||
- Any metadata JSON file writing
|
||||
- Any `.report.md` content generation
|
||||
- Any runtime engine selection or alternate engine support
|
||||
- Any committed file under `samples/`
|
||||
|
||||
## Expected Outputs
|
||||
|
||||
Sprint 4 should produce:
|
||||
|
||||
1. Adapter records and options
|
||||
- A small adapter options record for v1-local MinerU execution.
|
||||
- A result record containing at least:
|
||||
- command arguments
|
||||
- input PDF path
|
||||
- work/output directory path
|
||||
- raw Markdown when found
|
||||
- raw structured data when found
|
||||
- asset paths when found
|
||||
- warnings
|
||||
- engine name fixed to MinerU
|
||||
- engine version when known
|
||||
- engine options
|
||||
- exit code
|
||||
- stdout
|
||||
- stderr
|
||||
- No public or required field should expose raw MinerU-specific Python objects.
|
||||
|
||||
2. Availability and version checks
|
||||
- Check whether `mineru` is available using a mockable local mechanism such as `shutil.which`.
|
||||
- Check MinerU version using a mockable subprocess call.
|
||||
- Missing MinerU should produce a clear `ENGINE_MISSING` warning or project-owned adapter error.
|
||||
- Version command failure should be explicit and testable.
|
||||
|
||||
3. Direct local command construction
|
||||
- Baseline conversion command shape:
|
||||
|
||||
```text
|
||||
mineru -p <input-pdf> -o <work-dir>
|
||||
```
|
||||
|
||||
- The command must not include `--api-url`.
|
||||
- The command must not include router mode, HTTP client backend flags, remote API URLs, remote OpenAI-compatible backend settings, or runtime engine selection.
|
||||
- GPU/device options may be represented only if they are strict-local and do not introduce remote/backend choices. If the exact MinerU 3.1.0 flag is uncertain, store the requested option without passing a speculative CLI flag until a later source-verified sprint.
|
||||
|
||||
4. Strict-local validation
|
||||
- Reject prohibited CLI args and options before subprocess execution.
|
||||
- Reject any value that looks like a remote URL in user-controlled adapter options.
|
||||
- Allow only direct local `mineru` CLI execution.
|
||||
- Allow the CLI-internal temporary local `mineru-api` process that MinerU 3.1.0 may start when the CLI runs without `--api-url`.
|
||||
- Do not implement or call an HTTP client backend.
|
||||
|
||||
5. Subprocess wrapper
|
||||
- Use dependency injection or a small runner boundary so tests can fake subprocess behavior.
|
||||
- Capture stdout, stderr, and exit code.
|
||||
- Convert non-zero exit into an adapter result or project-owned error with a `MINERU_CLI_FAILED` warning.
|
||||
- Do not silently fallback to another engine.
|
||||
|
||||
6. Mocked output parsing
|
||||
- Parse fake output directories using optional-file behavior.
|
||||
- Raw Markdown is optional and should be read only from mocked local files created by tests.
|
||||
- Raw structured output is optional and may be represented as a JSON-serializable object or raw text, depending on the fake file extension used in tests.
|
||||
- Assets are optional and should be collected as local relative or absolute paths according to the adapter result design.
|
||||
- Missing expected output should produce a clear warning or failed result instead of being silently ignored.
|
||||
- The adapter must not assume real MinerU output layout is fully known until a later local probe.
|
||||
|
||||
7. Tests
|
||||
- Unit tests for availability check success and missing MinerU.
|
||||
- Unit tests for version check success and version command failure.
|
||||
- Unit tests for command construction.
|
||||
- Unit tests that prohibited `--api-url`, remote URLs, router mode, HTTP backend, and OpenAI-compatible backend options are rejected.
|
||||
- Unit tests for mocked successful MinerU output.
|
||||
- Unit tests for mocked non-zero exit.
|
||||
- Unit tests for mocked missing output.
|
||||
- Unit tests proving no real MinerU binary, model files, GPU, `samples/`, or network are required by default.
|
||||
|
||||
8. Handoff
|
||||
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- Do not implement conversion orchestration.
|
||||
- Do not implement `convert_pdf`.
|
||||
- Do not implement `pdf2md convert`.
|
||||
- Do not implement full `pdf2md doctor`.
|
||||
- Do not install MinerU 3.1.0.
|
||||
- Do not download MinerU models.
|
||||
- Do not run real MinerU in default tests.
|
||||
- Do not parse real PDFs.
|
||||
- Do not normalize Markdown.
|
||||
- Do not write final Markdown, metadata JSON, assets, or report files as product behavior.
|
||||
- Do not compute source SHA-256.
|
||||
- Do not implement asset link checking.
|
||||
- Do not implement math renderability checking.
|
||||
- Do not implement alternate engines or runtime engine selection.
|
||||
- Do not add cloud, remote API, router, HTTP client backend, or remote OpenAI-compatible backend support.
|
||||
|
||||
## Work Packages
|
||||
|
||||
### WP4.1: Adapter Types And Strict-Local Options
|
||||
|
||||
Owner:
|
||||
|
||||
- `mineru-integration-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Define minimal adapter options and result records.
|
||||
- Encode strict-local defaults.
|
||||
- Reject prohibited remote/API/backend options before command execution.
|
||||
|
||||
Output:
|
||||
|
||||
- The adapter boundary can be tested without invoking MinerU.
|
||||
|
||||
### WP4.2: Availability, Version, And Command Builder
|
||||
|
||||
Owner:
|
||||
|
||||
- `mineru-integration-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Implement mockable `is_available` and `version` checks.
|
||||
- Build the direct local command shape `mineru -p <input-pdf> -o <work-dir>`.
|
||||
- Keep command construction deterministic and easy to inspect in tests.
|
||||
|
||||
Output:
|
||||
|
||||
- Later orchestration can call the adapter without knowing MinerU CLI details.
|
||||
|
||||
### WP4.3: Subprocess Runner Boundary
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add a small runner interface or callable boundary for subprocess execution.
|
||||
- Capture command, stdout, stderr, exit code, and return values.
|
||||
- Map non-zero exits to structured adapter warnings/errors.
|
||||
|
||||
Output:
|
||||
|
||||
- Default tests can fake all MinerU behavior.
|
||||
|
||||
### WP4.4: Mocked Output Parser
|
||||
|
||||
Owner:
|
||||
|
||||
- `mineru-integration-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Parse fake Markdown, JSON/structured, asset, and diagnostic outputs from test-created directories.
|
||||
- Treat all raw MinerU output files as optional until real local output is probed.
|
||||
- Emit clear warnings for missing usable output.
|
||||
|
||||
Output:
|
||||
|
||||
- Adapter result objects can carry raw output into later IR/normalization sprints without binding to a guessed full MinerU layout.
|
||||
|
||||
### WP4.5: Independent Evaluation
|
||||
|
||||
Owner:
|
||||
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Review the completed adapter against this contract.
|
||||
- Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, Markdown normalization, metadata writing, report generation, or working CLI command was added.
|
||||
- Verify `samples/` remains untracked and unstaged.
|
||||
|
||||
Output:
|
||||
|
||||
- PASS/FAIL notes with any missing acceptance criteria.
|
||||
|
||||
## Verification Checks
|
||||
|
||||
Required:
|
||||
|
||||
- `git status --short` before staging confirms `samples/` remains untracked.
|
||||
- `uv --version` is run and result is recorded.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- Targeted MinerU adapter tests pass.
|
||||
- Tests do not require real MinerU, CUDA, GPU, model files, `samples/`, or network.
|
||||
- No model downloads occur.
|
||||
- No network calls are required.
|
||||
- No candidate engine comparison is reintroduced.
|
||||
- No conversion orchestration is implemented.
|
||||
- No Markdown normalization behavior is implemented.
|
||||
- No metadata JSON writing or full report generation is implemented.
|
||||
- No working `pdf2md convert` or full `pdf2md doctor` behavior is implemented.
|
||||
- Strict-local rejection tests cover `--api-url`, remote URL values, router mode, HTTP backend, and remote OpenAI-compatible backend options.
|
||||
- `git diff --check` passes.
|
||||
|
||||
Recommended:
|
||||
|
||||
- Use fake runner classes or functions rather than monkeypatching global subprocess calls everywhere.
|
||||
- Keep adapter result data JSON-friendly where practical, but do not force metadata schema generation in Sprint 4.
|
||||
- Include a test that no prohibited remote/API flag appears in the successful command args.
|
||||
- Use `requirements-guard-agent` if command flags or strict-local wording conflict across documents.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
Sprint 4 fails and must stop for a user decision if any of these are true:
|
||||
|
||||
- The adapter passes `--api-url`.
|
||||
- The adapter uses router mode.
|
||||
- The adapter uses an HTTP client backend.
|
||||
- The adapter accepts a remote API URL or remote OpenAI-compatible backend for runtime conversion.
|
||||
- The adapter falls back to another engine after MinerU failure.
|
||||
- Default tests require real MinerU, CUDA, GPU, model files, network, or `samples/`.
|
||||
- The implementation installs MinerU or downloads models.
|
||||
- The implementation connects the adapter to a working conversion CLI/API.
|
||||
- The implementation adds Markdown normalization, metadata file writing, full report generation, or quality checks.
|
||||
- The implementation assumes real MinerU output layout is fully known without a later local probe.
|
||||
- `samples/` is staged or committed.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
Sprint 4 is complete when:
|
||||
|
||||
- `src/pdf2md/mineru_adapter.py` exists and owns direct local MinerU CLI adapter behavior.
|
||||
- Availability and version checks are mock-tested.
|
||||
- Conversion command construction is mock-tested and uses `mineru -p <input-pdf> -o <work-dir>`.
|
||||
- Strict-local validation rejects prohibited remote/API/router/backend options.
|
||||
- Mocked successful MinerU output produces an adapter result with raw Markdown, raw structured data when available, assets when available, engine info, command args, stdout, stderr, and exit code.
|
||||
- Mocked non-zero exit produces a clear failure result or project-owned error with a `MINERU_CLI_FAILED` warning.
|
||||
- Mocked missing MinerU produces a clear `ENGINE_MISSING` warning or project-owned adapter error.
|
||||
- Default tests do not require MinerU, GPU, model files, network, or `samples/`.
|
||||
- No conversion orchestration, Markdown normalization, metadata file writing, full report generation, or working CLI behavior is implemented.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- `PROGRESS.md` records checks performed and residual risks.
|
||||
- Independent evaluation is complete.
|
||||
- The completed change is committed.
|
||||
|
||||
## Handoff Fields
|
||||
|
||||
Use these fields when Sprint 4 completes:
|
||||
|
||||
- Files changed:
|
||||
- Commands run:
|
||||
- Tests passed:
|
||||
- Tests blocked:
|
||||
- Known failures:
|
||||
- Residual risks:
|
||||
- User decisions needed:
|
||||
- Go/no-go recommendation for Sprint 5:
|
||||
- Next action:
|
||||
@@ -0,0 +1,311 @@
|
||||
# Sprint 5 Contract: Obsidian Markdown Normalization And Asset Links
|
||||
|
||||
Status: Implemented
|
||||
Last updated: 2026-05-07
|
||||
|
||||
## Objective
|
||||
|
||||
Build the project-owned Markdown normalization boundary for Obsidian output, using deterministic unit tests before it is connected to conversion orchestration.
|
||||
|
||||
Sprint 5 must establish:
|
||||
|
||||
- A small Markdown normalization module that accepts local raw Markdown-like text and returns normalized Markdown plus project-owned warnings.
|
||||
- Obsidian math delimiter normalization for inline and display math.
|
||||
- Stable relative asset link normalization without copying files or writing final outputs.
|
||||
- Limited local asset link validation where useful for warnings.
|
||||
- Table preservation and clear warning behavior when a table cannot be safely simplified.
|
||||
- Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, network, or Obsidian itself.
|
||||
|
||||
Sprint 5 is a normalization contract sprint. It must not connect normalization to the CLI, conversion orchestration, metadata writing, report generation, real MinerU execution, or end-to-end output writing.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
Sprint 4 is complete:
|
||||
|
||||
- `src/pdf2md/paths.py` owns input discovery and output path planning.
|
||||
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
|
||||
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
|
||||
- `src/pdf2md/mineru_adapter.py` owns the mocked direct local MinerU CLI adapter boundary.
|
||||
- `uv run pytest` passed 72 tests.
|
||||
|
||||
Sprint 5 may use `WarningRecord`, `WarningCode`, and `WarningSeverity` from `ir.py`, but it must not require raw MinerU-specific Python objects as public or required inputs.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed:
|
||||
|
||||
- `src/pdf2md/markdown.py`
|
||||
- `src/pdf2md/quality.py` only for minimal local asset link check helpers if that boundary is cleaner than placing them in `markdown.py`
|
||||
- `src/pdf2md/ir.py` only for narrowly required warning codes discovered while implementing table or asset fallback warnings
|
||||
- `tests/test_markdown.py` or `tests/unit/test_markdown.py`
|
||||
- `tests/test_quality.py` only if `quality.py` is touched for asset link checks
|
||||
- `README.md` only if a small note is needed to clarify that normalization tests are mocked/local and not full conversion behavior
|
||||
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
|
||||
- `PROGRESS.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
|
||||
- `docs/Sprints/SPRINT5CONTRACT.md`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- `src/pdf2md/conversion.py`
|
||||
- `src/pdf2md/cli.py`
|
||||
- `src/pdf2md/mineru_adapter.py`
|
||||
- `src/pdf2md/report.py`
|
||||
- Working `pdf2md convert` behavior
|
||||
- Full `pdf2md doctor` behavior
|
||||
- `scripts/`
|
||||
- Any real MinerU invocation in default tests
|
||||
- Any MinerU/model installation or download script
|
||||
- Any PDF content parsing
|
||||
- Any metadata JSON file writing
|
||||
- Any `.report.md` content generation
|
||||
- Any runtime engine selection or alternate engine support
|
||||
- Any remote asset fetch, HTTP client, or cloud/API integration
|
||||
- Any committed file under `samples/`
|
||||
|
||||
## Expected Outputs
|
||||
|
||||
Sprint 5 should produce:
|
||||
|
||||
1. Normalization records and API
|
||||
- A small result record or equivalent project-owned return type containing at least:
|
||||
- normalized Markdown
|
||||
- warnings
|
||||
- asset links discovered or normalized when available
|
||||
- A normalization function with a narrow input surface, such as raw Markdown text plus optional output/assets context.
|
||||
- No public or required field should expose raw MinerU-specific Python objects.
|
||||
- The API should be usable by later orchestration without knowing how MinerU represented the original Markdown.
|
||||
|
||||
2. Inline math delimiter normalization
|
||||
- Normalize safe inline math forms to `$...$`.
|
||||
- Preserve already valid `$...$` inline math.
|
||||
- Preserve the exact LaTeX body inside inline math except delimiter changes.
|
||||
- Do not escape or rewrite underscores, carets, braces, or backslashes inside math.
|
||||
- Do not normalize math delimiters inside fenced code blocks or inline code spans.
|
||||
- Avoid converting ambiguous dollar signs that look like currency or prose punctuation.
|
||||
|
||||
3. Display math delimiter normalization
|
||||
- Normalize safe display math forms to `$$...$$`.
|
||||
- Ensure display math delimiters sit on their own lines.
|
||||
- Keep a blank line around display math blocks.
|
||||
- Preserve the exact LaTeX body inside display math except delimiter and surrounding whitespace normalization.
|
||||
- Preserve LaTeX environments such as `equation`, `align`, or `gather` rather than rewriting their semantics.
|
||||
- Make normalization idempotent: running the normalizer twice should produce the same Markdown.
|
||||
|
||||
4. Asset link normalization
|
||||
- Normalize local image/asset links to stable relative POSIX-style Markdown paths.
|
||||
- Keep relative links relative; do not turn them into absolute paths.
|
||||
- Reject or warn on absolute asset links that cannot be represented relative to the planned output/assets context.
|
||||
- Reject or warn on links that escape the output/assets directory with `..`.
|
||||
- Do not fetch remote URLs, copy assets, or write files.
|
||||
- Preserve alt text when rewriting Markdown image links.
|
||||
|
||||
5. Table preservation and fallback warnings
|
||||
- Preserve simple Markdown pipe tables without destructive formatting changes.
|
||||
- Preserve HTML tables when Markdown would lose row spans, column spans, nested content, or other complex structure.
|
||||
- Emit a project-owned warning when complex table fallback behavior is detected or when table simplification is intentionally skipped.
|
||||
- Do not attempt broad table reflow or OCR-style table reconstruction in Sprint 5.
|
||||
|
||||
6. Tests
|
||||
- Unit tests for inline math delimiter normalization.
|
||||
- Unit tests for display math delimiter normalization and blank-line spacing.
|
||||
- Unit tests proving underscores and carets inside math are preserved.
|
||||
- Unit tests proving fenced code blocks and inline code are not normalized.
|
||||
- Unit tests for idempotency.
|
||||
- Unit tests for relative asset link normalization.
|
||||
- Unit tests for missing or escaping asset link warnings when asset checking is implemented.
|
||||
- Unit tests for simple table preservation.
|
||||
- Unit tests for complex table fallback warning behavior.
|
||||
- Unit tests proving no real MinerU binary, model files, GPU, `samples/`, Obsidian installation, or network are required by default.
|
||||
|
||||
7. Handoff
|
||||
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- Do not implement conversion orchestration.
|
||||
- Do not implement `convert_pdf`.
|
||||
- Do not implement `pdf2md convert`.
|
||||
- Do not implement full `pdf2md doctor`.
|
||||
- Do not invoke MinerU.
|
||||
- Do not install MinerU 3.1.0.
|
||||
- Do not download MinerU models.
|
||||
- Do not parse real PDFs.
|
||||
- Do not write final Markdown files as product behavior.
|
||||
- Do not copy or move assets as product behavior.
|
||||
- Do not write metadata JSON.
|
||||
- Do not generate `.report.md`.
|
||||
- Do not compute source SHA-256.
|
||||
- Do not implement math renderability checks beyond a future-facing warning interface if needed.
|
||||
- Do not implement full quality report checks.
|
||||
- Do not implement alternate engines or runtime engine selection.
|
||||
- Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, or remote asset-fetching support.
|
||||
|
||||
## Work Packages
|
||||
|
||||
### WP5.1: Normalization Types And Safe Boundaries
|
||||
|
||||
Owner:
|
||||
|
||||
- `obsidian-markdown-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Define a small Markdown normalization result type.
|
||||
- Define a focused normalization function.
|
||||
- Keep warnings project-owned through `WarningRecord`.
|
||||
- Keep the API independent of raw MinerU objects.
|
||||
|
||||
Output:
|
||||
|
||||
- Later orchestration can normalize adapter Markdown without knowing MinerU internals.
|
||||
|
||||
### WP5.2: Math Delimiter Normalization
|
||||
|
||||
Owner:
|
||||
|
||||
- `obsidian-markdown-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Normalize safe inline math delimiters to `$...$`.
|
||||
- Normalize safe display math delimiters to `$$...$$` with stable surrounding blank lines.
|
||||
- Preserve LaTeX bodies exactly.
|
||||
- Protect code fences and inline code spans.
|
||||
- Add idempotency tests.
|
||||
|
||||
Output:
|
||||
|
||||
- Obsidian-friendly math delimiter behavior is deterministic and covered by unit tests.
|
||||
|
||||
### WP5.3: Asset Link Normalization
|
||||
|
||||
Owner:
|
||||
|
||||
- `obsidian-markdown-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Normalize local image/asset links to stable relative POSIX-style paths.
|
||||
- Preserve alt text.
|
||||
- Warn on missing, absolute, escaping, or non-local asset links when the helper has enough local context to judge them.
|
||||
- Do not fetch or copy assets.
|
||||
|
||||
Output:
|
||||
|
||||
- Later conversion can produce Markdown links that are stable relative to planned output paths.
|
||||
|
||||
### WP5.4: Table Preservation And Fallback Warning
|
||||
|
||||
Owner:
|
||||
|
||||
- `obsidian-markdown-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Preserve simple Markdown pipe tables.
|
||||
- Preserve complex HTML tables without simplifying them destructively.
|
||||
- Emit a project-owned warning for complex table fallback behavior.
|
||||
|
||||
Output:
|
||||
|
||||
- Table handling is conservative and traceable instead of silently lossy.
|
||||
|
||||
### WP5.5: Independent Evaluation
|
||||
|
||||
Owner:
|
||||
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Review the completed normalizer against this contract.
|
||||
- Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, metadata writing, report generation, file-copying behavior, or working CLI command was added.
|
||||
- Verify `samples/` remains untracked and unstaged.
|
||||
|
||||
Output:
|
||||
|
||||
- PASS/FAIL notes with any missing acceptance criteria.
|
||||
|
||||
## Verification Checks
|
||||
|
||||
Required:
|
||||
|
||||
- `git status --short` before staging confirms `samples/` remains untracked.
|
||||
- `uv --version` is run and result is recorded.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- Targeted Markdown normalization tests pass.
|
||||
- Tests do not require real MinerU, CUDA, GPU, model files, Obsidian, `samples/`, or network.
|
||||
- No model downloads occur.
|
||||
- No network calls are required.
|
||||
- No candidate engine comparison is reintroduced.
|
||||
- No conversion orchestration is implemented.
|
||||
- No metadata JSON writing or full report generation is implemented.
|
||||
- No working `pdf2md convert` or full `pdf2md doctor` behavior is implemented.
|
||||
- No final output files are written as product behavior.
|
||||
- No remote asset fetching is implemented.
|
||||
- Math delimiter normalization is idempotent.
|
||||
- Asset paths in normalized Markdown are relative when they are rewritten.
|
||||
- `git diff --check` passes.
|
||||
|
||||
Recommended:
|
||||
|
||||
- Prefer a small tokenizer or state-machine approach over broad regular-expression rewrites for math/code boundary handling.
|
||||
- Keep normalization helpers pure and deterministic.
|
||||
- Treat complex tables conservatively: preserve content and warn rather than flattening structure.
|
||||
- Use `requirements-guard-agent` if warning codes or output behavior conflict across documents.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
Sprint 5 fails and must stop for a user decision if any of these are true:
|
||||
|
||||
- The normalizer rewrites LaTeX math bodies beyond delimiter and whitespace normalization without deterministic tests.
|
||||
- The normalizer changes underscores, carets, braces, or backslashes inside math content.
|
||||
- The normalizer rewrites code fences or inline code spans as math.
|
||||
- The normalizer produces absolute asset links where relative links are required.
|
||||
- The normalizer accepts asset links that escape the output/assets context without warning.
|
||||
- The implementation fetches remote assets or adds any HTTP/network client path.
|
||||
- The implementation connects normalization to a working conversion CLI/API.
|
||||
- The implementation adds metadata file writing, full report generation, real MinerU execution, model downloads, or setup scripts.
|
||||
- Default tests require real MinerU, CUDA, GPU, model files, network, Obsidian, or `samples/`.
|
||||
- `samples/` is staged or committed.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
Sprint 5 is complete when:
|
||||
|
||||
- `src/pdf2md/markdown.py` exists and owns Obsidian Markdown normalization behavior.
|
||||
- Inline math delimiter normalization is unit-tested.
|
||||
- Display math delimiter normalization and blank-line spacing are unit-tested.
|
||||
- Tests prove underscores and carets inside math are preserved.
|
||||
- Tests prove fenced code blocks and inline code are not normalized.
|
||||
- Normalization idempotency is unit-tested.
|
||||
- Relative asset link normalization is unit-tested.
|
||||
- Asset warning behavior is unit-tested when missing, absolute, escaping, or non-local links are in scope.
|
||||
- Simple table preservation and complex table fallback warning behavior are unit-tested.
|
||||
- Default tests do not require MinerU, GPU, model files, network, Obsidian, or `samples/`.
|
||||
- No conversion orchestration, metadata file writing, full report generation, file-copying behavior, or working CLI behavior is implemented.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- `PROGRESS.md` records checks performed and residual risks.
|
||||
- Independent evaluation is complete.
|
||||
- The completed change is committed.
|
||||
|
||||
## Handoff Fields
|
||||
|
||||
Use these fields when Sprint 5 completes:
|
||||
|
||||
- Files changed:
|
||||
- Commands run:
|
||||
- Tests passed:
|
||||
- Tests blocked:
|
||||
- Known failures:
|
||||
- Residual risks:
|
||||
- User decisions needed:
|
||||
- Go/no-go recommendation for Sprint 6:
|
||||
- Next action:
|
||||
@@ -0,0 +1,334 @@
|
||||
# Sprint 6 Contract: Quality Checks And Report Generation
|
||||
|
||||
Status: Implemented
|
||||
Last updated: 2026-05-08
|
||||
|
||||
## Objective
|
||||
|
||||
Build local quality-check and human-readable report generation boundaries from project-owned metadata and normalized Markdown, before they are connected to conversion orchestration.
|
||||
|
||||
Sprint 6 must establish:
|
||||
|
||||
- A project-owned quality module for local asset-link and math-renderability signals.
|
||||
- A report module that renders `<stem>.report.md` content from metadata and quality results.
|
||||
- Deterministic final status calculation: `success`, `partial`, or `failed`.
|
||||
- Summary fields needed by reports, including missing asset links and math render failures.
|
||||
- Fast unit tests that do not require real MinerU, model files, GPU, sample PDFs, Obsidian, LaTeX tooling, network, or a working conversion CLI.
|
||||
|
||||
Sprint 6 is a quality/report contract sprint. It may generate report Markdown content as a string, but it must not connect to the CLI, conversion orchestration, real MinerU execution, file output writing, setup scripts, or end-to-end conversion.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
Sprint 5 is complete:
|
||||
|
||||
- `src/pdf2md/paths.py` owns input discovery and output path planning.
|
||||
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
|
||||
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
|
||||
- `src/pdf2md/mineru_adapter.py` owns the mocked direct local MinerU CLI adapter boundary.
|
||||
- `src/pdf2md/markdown.py` owns Obsidian Markdown normalization, asset link warnings, and table fallback warnings.
|
||||
- `uv run pytest` passed 89 tests.
|
||||
|
||||
Sprint 6 may use metadata dictionaries produced by `build_metadata`, project-owned `WarningRecord` values, and normalized Markdown text. It must not require raw MinerU-specific Python objects as public or required inputs.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed:
|
||||
|
||||
- `src/pdf2md/quality.py`
|
||||
- `src/pdf2md/report.py`
|
||||
- `src/pdf2md/metadata.py` only for narrowly required summary fields or helper functions that keep metadata/report consistency
|
||||
- `src/pdf2md/ir.py` only for narrowly required warning codes discovered while implementing quality checks
|
||||
- `tests/test_quality.py`
|
||||
- `tests/test_report.py`
|
||||
- `tests/test_metadata.py` only if `metadata.py` changes
|
||||
- `README.md` only if a small note is needed to clarify mocked/local quality and report behavior
|
||||
- `PLAN.md` only for current-goal coordination updates required by the shared agent workflow
|
||||
- `PROGRESS.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md` only if sequencing or constraints need adjustment
|
||||
- `docs/Sprints/SPRINT6CONTRACT.md`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- `src/pdf2md/conversion.py`
|
||||
- `src/pdf2md/cli.py`
|
||||
- `src/pdf2md/mineru_adapter.py`
|
||||
- Working `pdf2md convert` behavior
|
||||
- Full `pdf2md doctor` behavior
|
||||
- `scripts/`
|
||||
- Any real MinerU invocation in default tests
|
||||
- Any MinerU/model installation or download script
|
||||
- Any PDF content parsing
|
||||
- Any final Markdown file writing
|
||||
- Any metadata JSON file writing
|
||||
- Any `.report.md` file writing as product behavior
|
||||
- Any asset copying or moving
|
||||
- Any runtime engine selection or alternate engine support
|
||||
- Any remote asset fetch, HTTP client, cloud/API integration, hosted renderer, or remote math-render service
|
||||
- Any committed file under `samples/`
|
||||
|
||||
## Expected Outputs
|
||||
|
||||
Sprint 6 should produce:
|
||||
|
||||
1. Quality result records and API
|
||||
- A small project-owned quality result type containing at least:
|
||||
- missing asset link count
|
||||
- invalid asset link count when available
|
||||
- math render error count
|
||||
- warnings produced by quality checks
|
||||
- A local asset-link check function that accepts normalized Markdown and local asset context without writing files.
|
||||
- A math renderability check interface that accepts a local checker callable or reports tool-unavailable behavior gracefully.
|
||||
- No public or required field should expose raw MinerU-specific Python objects.
|
||||
|
||||
2. Asset-link quality checks
|
||||
- Count missing local asset links in Markdown.
|
||||
- Count invalid links that are absolute, parent-escaping, remote, or otherwise non-local according to project policy.
|
||||
- Produce project-owned warnings for missing or invalid asset links.
|
||||
- Keep all checks local and deterministic.
|
||||
- Do not fetch remote URLs, copy assets, move assets, or write files.
|
||||
|
||||
3. Math renderability checks
|
||||
- Provide a boundary for local math renderability checking.
|
||||
- Default tests must use fake/local checker callables.
|
||||
- Tool-unavailable behavior must be explicit and non-fatal.
|
||||
- Render failures must produce `MATH_RENDER_FAILED` warnings and count toward the report.
|
||||
- The checker must not call network services or require a LaTeX/Obsidian install in default tests.
|
||||
|
||||
4. Metadata summary consistency
|
||||
- Preserve existing required metadata summary fields.
|
||||
- Add or derive report-needed counts without breaking existing metadata tests:
|
||||
- missing asset link count
|
||||
- invalid asset link count
|
||||
- math render error count
|
||||
- Warning order and warning counts must remain deterministic.
|
||||
- Reports must be derived from metadata and quality results, not independently duplicated state.
|
||||
|
||||
5. Report Markdown generation
|
||||
- Render a human-readable `<stem>.report.md` content string from metadata and quality results.
|
||||
- Include at least:
|
||||
- source PDF path
|
||||
- output Markdown path when provided
|
||||
- metadata path when provided
|
||||
- report path when provided
|
||||
- MinerU engine/version and execution mode/options
|
||||
- pages processed
|
||||
- warning count
|
||||
- asset count
|
||||
- missing asset link count
|
||||
- inline formula count
|
||||
- display formula count
|
||||
- math render error count
|
||||
- pages with warnings
|
||||
- final status: `success`, `partial`, or `failed`
|
||||
- The report must not invent facts that are absent from metadata; absent optional paths should be omitted or clearly shown as unavailable.
|
||||
- The report generator must not write files in Sprint 6.
|
||||
|
||||
6. Final status policy
|
||||
- `failed`: metadata or quality warnings contain at least one `error` severity warning.
|
||||
- `partial`: no error severity warnings, but warnings or quality failures exist.
|
||||
- `success`: no warnings and no quality failures.
|
||||
- The status function must be unit-tested and reusable by later orchestration.
|
||||
|
||||
7. Tests
|
||||
- Unit tests for missing asset link counting.
|
||||
- Unit tests for invalid/remote/escaping asset link warnings.
|
||||
- Unit tests for math render failure aggregation with a fake checker.
|
||||
- Unit tests for math checker unavailable behavior.
|
||||
- Unit tests for report content and required sections.
|
||||
- Unit tests proving report content is derived from metadata and quality results.
|
||||
- Unit tests for pages-with-warnings summary.
|
||||
- Unit tests for final status calculation.
|
||||
- Unit tests proving no real MinerU binary, model files, GPU, `samples/`, Obsidian, LaTeX install, or network are required by default.
|
||||
|
||||
8. Handoff
|
||||
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- Do not implement conversion orchestration.
|
||||
- Do not implement `convert_pdf`.
|
||||
- Do not implement `pdf2md convert`.
|
||||
- Do not implement full `pdf2md doctor`.
|
||||
- Do not invoke MinerU.
|
||||
- Do not install MinerU 3.1.0.
|
||||
- Do not download MinerU models.
|
||||
- Do not parse real PDFs.
|
||||
- Do not write final Markdown files.
|
||||
- Do not copy or move assets.
|
||||
- Do not write metadata JSON files.
|
||||
- Do not write `.report.md` files as product behavior.
|
||||
- Do not compute source SHA-256.
|
||||
- Do not implement real LaTeX, KaTeX, MathJax, or Obsidian rendering in default tests.
|
||||
- Do not add setup scripts.
|
||||
- Do not implement full local environment diagnostics.
|
||||
- Do not implement alternate engines or runtime engine selection.
|
||||
- Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, hosted renderer, or remote asset-fetching support.
|
||||
|
||||
## Work Packages
|
||||
|
||||
### WP6.1: Quality Result Types And Asset Checks
|
||||
|
||||
Owner:
|
||||
|
||||
- `metadata-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Define a small project-owned quality result type.
|
||||
- Add deterministic local asset link checks over normalized Markdown.
|
||||
- Count missing, invalid, escaping, absolute, and remote asset references.
|
||||
- Return project-owned warnings without writing files.
|
||||
|
||||
Output:
|
||||
|
||||
- Later orchestration can add local quality results to metadata/report flow without duplicating asset-link logic.
|
||||
|
||||
### WP6.2: Math Renderability Boundary
|
||||
|
||||
Owner:
|
||||
|
||||
- `obsidian-markdown-agent`
|
||||
- `metadata-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Define a local math render checker interface.
|
||||
- Support fake checkers in tests.
|
||||
- Treat checker-unavailable as explicit non-fatal warning/info according to the implementation design.
|
||||
- Treat render failures as `MATH_RENDER_FAILED` warnings and count them.
|
||||
|
||||
Output:
|
||||
|
||||
- Math renderability is represented as a local, testable boundary without external dependencies.
|
||||
|
||||
### WP6.3: Metadata Summary Extensions
|
||||
|
||||
Owner:
|
||||
|
||||
- `metadata-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Preserve existing required metadata summary fields.
|
||||
- Add or derive counts needed by reports in a backward-compatible way.
|
||||
- Keep metadata JSON serializable and deterministic.
|
||||
|
||||
Output:
|
||||
|
||||
- Metadata remains the source of truth for report counts and warning summaries.
|
||||
|
||||
### WP6.4: Report Markdown Rendering
|
||||
|
||||
Owner:
|
||||
|
||||
- `metadata-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Implement report content rendering from metadata plus quality results.
|
||||
- Include required report sections and final status.
|
||||
- Generate content only; do not write files.
|
||||
|
||||
Output:
|
||||
|
||||
- Later orchestration can write `<stem>.report.md` by using the tested report renderer.
|
||||
|
||||
### WP6.5: Independent Evaluation
|
||||
|
||||
Owner:
|
||||
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Review completed quality/report behavior against this contract.
|
||||
- Verify no conversion orchestration, real MinerU dependency in default tests, remote runtime path, alternate engine, final output writing, CLI behavior, or sample dependency was added.
|
||||
- Verify `samples/` remains untracked and unstaged.
|
||||
|
||||
Output:
|
||||
|
||||
- PASS/FAIL notes with any missing acceptance criteria.
|
||||
|
||||
## Verification Checks
|
||||
|
||||
Required:
|
||||
|
||||
- `git status --short` before staging confirms `samples/` remains untracked.
|
||||
- `uv --version` is run and result is recorded.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- Targeted quality/report tests pass.
|
||||
- Tests do not require real MinerU, CUDA, GPU, model files, Obsidian, LaTeX tooling, `samples/`, or network.
|
||||
- No model downloads occur.
|
||||
- No network calls are required.
|
||||
- No candidate engine comparison is reintroduced.
|
||||
- No conversion orchestration is implemented.
|
||||
- No working `pdf2md convert` or full `pdf2md doctor` behavior is implemented.
|
||||
- No final Markdown, metadata JSON, or `.report.md` files are written as product behavior.
|
||||
- No remote asset fetching is implemented.
|
||||
- No real math renderer dependency is required by default tests.
|
||||
- Report counts match metadata and quality results.
|
||||
- Report generation does not re-run MinerU.
|
||||
- `git diff --check` passes.
|
||||
|
||||
Recommended:
|
||||
|
||||
- Keep quality helpers pure and deterministic.
|
||||
- Use fake checkers for math renderability tests.
|
||||
- Keep report rendering stable enough for snapshot-like unit assertions.
|
||||
- Use `requirements-guard-agent` if warning codes, summary fields, or report wording conflict across documents.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
Sprint 6 fails and must stop for a user decision if any of these are true:
|
||||
|
||||
- Report content diverges from metadata or quality result counts.
|
||||
- Math render failures are silently ignored.
|
||||
- Quality checks require network access.
|
||||
- The implementation fetches remote assets or adds any HTTP/network client path.
|
||||
- The implementation requires a real LaTeX/Obsidian/MathJax/KaTeX install in default tests.
|
||||
- The implementation connects quality/report behavior to a working conversion CLI/API.
|
||||
- The implementation writes final Markdown, metadata JSON, `.report.md`, or copied assets as product behavior.
|
||||
- The implementation invokes MinerU, downloads models, adds setup scripts, or parses real PDFs.
|
||||
- Default tests require real MinerU, CUDA, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
- `samples/` is staged or committed.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
Sprint 6 is complete when:
|
||||
|
||||
- `src/pdf2md/quality.py` exists and owns local quality-check behavior.
|
||||
- `src/pdf2md/report.py` exists and owns human-readable report content rendering.
|
||||
- Missing asset link counting is unit-tested.
|
||||
- Invalid, escaping, absolute, or remote asset link warning behavior is unit-tested.
|
||||
- Math render failure aggregation is unit-tested with fake checkers.
|
||||
- Math checker unavailable behavior is unit-tested and non-fatal.
|
||||
- Report content includes the required sections and counts.
|
||||
- Pages-with-warnings summary is unit-tested.
|
||||
- Final status calculation is unit-tested.
|
||||
- Report generation is proven not to write files or re-run MinerU.
|
||||
- Default tests do not require MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
- No conversion orchestration, final output file writing, working CLI behavior, real MinerU execution, or setup script is implemented.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- `PROGRESS.md` records checks performed and residual risks.
|
||||
- Independent evaluation is complete.
|
||||
- The completed change is committed.
|
||||
|
||||
## Handoff Fields
|
||||
|
||||
Use these fields when Sprint 6 completes:
|
||||
|
||||
- Files changed:
|
||||
- Commands run:
|
||||
- Tests passed:
|
||||
- Tests blocked:
|
||||
- Known failures:
|
||||
- Residual risks:
|
||||
- User decisions needed:
|
||||
- Go/no-go recommendation for Sprint 7:
|
||||
- Next action:
|
||||
@@ -0,0 +1,360 @@
|
||||
# Sprint 7 Contract: Conversion Orchestrator, CLI, And Python API
|
||||
|
||||
Status: Implemented
|
||||
Last updated: 2026-05-08
|
||||
|
||||
## Objective
|
||||
|
||||
Connect the existing project-owned boundaries into a working conversion orchestration layer, public Python API, and `pdf2md convert` CLI path.
|
||||
|
||||
Sprint 7 must establish:
|
||||
|
||||
- A public `convert_pdf` API for one local PDF.
|
||||
- A batch conversion API or helper for directory inputs.
|
||||
- A `pdf2md convert INPUT --out OUTPUT_DIR` command.
|
||||
- Product behavior that writes Markdown, optional metadata JSON, and `<stem>.report.md`.
|
||||
- Local asset materialization for adapter-provided asset files.
|
||||
- CLI summaries that surface success, failure, and warning counts.
|
||||
- Fast tests that use fake adapter outputs and do not require real MinerU, model files, GPU, sample PDFs, network, Obsidian, or LaTeX tooling.
|
||||
|
||||
Sprint 7 is an orchestration sprint. It may call the real `MinerUAdapter` in the normal production path, but the default test suite must use injected fake adapters and must not execute MinerU.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
Sprint 6 is complete:
|
||||
|
||||
- `src/pdf2md/paths.py` owns input discovery and output path planning.
|
||||
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
|
||||
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
|
||||
- `src/pdf2md/mineru_adapter.py` owns the mocked direct local MinerU CLI adapter boundary.
|
||||
- `src/pdf2md/markdown.py` owns Obsidian Markdown normalization, asset link warnings, and table fallback warnings.
|
||||
- `src/pdf2md/quality.py` owns local quality checks over normalized Markdown and asset context.
|
||||
- `src/pdf2md/report.py` owns report content rendering and final status calculation.
|
||||
- `uv run pytest` passed 103 tests.
|
||||
|
||||
Sprint 7 may compute source SHA-256, create conversion output directories, copy local adapter-provided asset files, write final Markdown, write metadata JSON when requested, and write `<stem>.report.md`. It must keep public return types project-owned and must not require raw MinerU-specific Python objects from callers.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed:
|
||||
|
||||
- `src/pdf2md/conversion.py`
|
||||
- `src/pdf2md/cli.py`
|
||||
- `src/pdf2md/__init__.py`
|
||||
- `src/pdf2md/paths.py` only for narrowly required path/output helper compatibility
|
||||
- `src/pdf2md/mineru_adapter.py` only for narrowly required adapter protocol or result compatibility
|
||||
- `src/pdf2md/metadata.py` only for narrowly required output-location or summary compatibility
|
||||
- `src/pdf2md/markdown.py` only for narrowly required orchestration compatibility
|
||||
- `src/pdf2md/quality.py` only for narrowly required orchestration compatibility
|
||||
- `src/pdf2md/report.py` only for narrowly required orchestration compatibility
|
||||
- `tests/test_conversion.py`
|
||||
- `tests/test_cli.py`
|
||||
- Existing focused unit tests only if a touched module requires compatibility updates
|
||||
- `README.md` only if a short usage note is needed for `pdf2md convert`
|
||||
- `PLAN.md`
|
||||
- `PROGRESS.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md`
|
||||
- `docs/Sprints/SPRINT7CONTRACT.md`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- `src/pdf2md/doctor.py`
|
||||
- Working `pdf2md doctor` behavior
|
||||
- `scripts/`
|
||||
- MinerU/model installation or download scripts
|
||||
- Real MinerU invocation in default tests
|
||||
- Real GPU/CUDA checks
|
||||
- Real PDF content parsing outside adapter output handling
|
||||
- Runtime engine selection or alternate engine support
|
||||
- Cloud OCR, remote LLM/VLM, hosted renderer, remote document parser, remote asset fetching, HTTP client backend, router mode, `--api-url`, remote APIs, or remote OpenAI-compatible backend support
|
||||
- A CLI flag that disables strict-local policy
|
||||
- Committed files under `samples/`
|
||||
|
||||
## Expected Outputs
|
||||
|
||||
Sprint 7 should produce:
|
||||
|
||||
1. Public conversion records and API
|
||||
- A project-owned conversion result type containing at least:
|
||||
- source PDF path
|
||||
- Markdown output path
|
||||
- metadata JSON path when written
|
||||
- report path
|
||||
- assets directory
|
||||
- raw output directory when kept
|
||||
- engine name and version
|
||||
- final status
|
||||
- warning count
|
||||
- warnings
|
||||
- `convert_pdf(input_path, output_dir, metadata=True, keep_raw=False, overwrite=False, gpu=None, strict_local=True, adapter=None, clock=None)` or an equivalently small API.
|
||||
- The default API path uses the direct local MinerU adapter.
|
||||
- Tests can inject a fake adapter and deterministic clock.
|
||||
- The public return type must not expose raw MinerU-specific Python objects as required fields.
|
||||
|
||||
2. Single-PDF orchestration
|
||||
- Discover and plan the single PDF using existing path helpers.
|
||||
- Create required output directories only after preflight path checks pass.
|
||||
- Run the adapter into a planned temporary or raw work directory.
|
||||
- Stop the individual conversion on adapter hard failure and return explicit warnings/status.
|
||||
- Normalize adapter Markdown into Obsidian-friendly Markdown.
|
||||
- Copy local adapter-provided asset files into the planned assets directory when needed.
|
||||
- Compute source SHA-256 with local file reads.
|
||||
- Build metadata from project-owned records.
|
||||
- Run local quality checks over normalized Markdown and asset context.
|
||||
- Render report Markdown from metadata and quality results.
|
||||
- Write final Markdown, optional metadata JSON, and report Markdown.
|
||||
|
||||
3. Output writing and overwrite behavior
|
||||
- Never write outside the planned output root.
|
||||
- Respect existing-output conflicts unless `overwrite=True`.
|
||||
- Keep writes deterministic and UTF-8 encoded for text outputs.
|
||||
- Preserve `--metadata` behavior: metadata JSON is written when enabled and omitted when disabled.
|
||||
- Always write `<stem>.report.md`.
|
||||
- `--keep-raw` preserves raw MinerU output in the planned raw directory.
|
||||
- Without `--keep-raw`, temporary raw work must be cleaned up when the conversion completes, while preserving enough failure context in metadata/report/warnings.
|
||||
|
||||
4. Asset materialization
|
||||
- Copy only local files returned by the adapter.
|
||||
- Do not fetch remote assets.
|
||||
- Do not follow asset paths that escape the adapter work directory or the planned output root.
|
||||
- Handle missing or invalid adapter asset paths with project-owned warnings.
|
||||
- Normalize final Markdown asset links to stable relative paths.
|
||||
|
||||
5. CLI convert command
|
||||
- `pdf2md convert INPUT --out OUTPUT_DIR`.
|
||||
- Options:
|
||||
- `--metadata`
|
||||
- `--keep-raw`
|
||||
- `--recursive`
|
||||
- `--overwrite`
|
||||
- `--gpu GPU_DEVICE`
|
||||
- `--strict-local`
|
||||
- Strict-local must remain enabled in v1; the CLI must not add a supported way to disable it.
|
||||
- Single PDF conversion returns exit code `0` on success or partial success, and non-zero when a hard error prevents conversion.
|
||||
- Directory conversion handles multiple PDFs deterministically and prints a summary.
|
||||
- Batch conversion should continue to the next file when one PDF fails after planning, then return a non-zero exit code if any PDF failed.
|
||||
- CLI output must include converted count, failed count, and warning count.
|
||||
|
||||
6. Batch conversion
|
||||
- Directory inputs use existing non-recursive discovery by default.
|
||||
- Recursive discovery occurs only with `--recursive`.
|
||||
- Output paths preserve relative subdirectories from the input root.
|
||||
- Duplicate planned outputs and overwrite conflicts fail before conversion starts.
|
||||
- Results are deterministic and ordered by discovery/path planning order.
|
||||
|
||||
7. Failure and warning behavior
|
||||
- MinerU failure must be clear and must not trigger fallback to any other engine.
|
||||
- Strict-local violations must be hard failures.
|
||||
- Per-file failures must include project-owned warnings.
|
||||
- CLI summaries must not suppress warning counts.
|
||||
- Metadata/report content must reflect warnings emitted during adapter, normalization, asset, and quality steps.
|
||||
|
||||
8. Tests
|
||||
- API test for one successful conversion with a fake adapter.
|
||||
- API test for adapter failure with no fallback.
|
||||
- API test for output conflict and overwrite behavior.
|
||||
- API test for metadata disabled behavior.
|
||||
- API test for local asset copying and relative Markdown links.
|
||||
- API test for `keep_raw` behavior.
|
||||
- CLI test for single PDF conversion with a fake adapter.
|
||||
- CLI test for directory conversion with deterministic summary output.
|
||||
- CLI test for recursive behavior.
|
||||
- CLI test for failure summary and non-zero exit code.
|
||||
- Tests proving default checks do not require real MinerU, GPU, models, network, `samples/`, Obsidian, or LaTeX tooling.
|
||||
|
||||
9. Handoff
|
||||
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- Do not implement `pdf2md doctor`.
|
||||
- Do not implement environment diagnostics.
|
||||
- Do not install MinerU 3.1.0.
|
||||
- Do not download MinerU models.
|
||||
- Do not probe real MinerU output with local sample PDFs.
|
||||
- Do not add setup scripts.
|
||||
- Do not implement runtime engine selection.
|
||||
- Do not add alternate engines.
|
||||
- Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, hosted renderer, or remote asset-fetching support.
|
||||
- Do not add a CLI flag or API option that disables strict-local policy.
|
||||
- Do not require real MinerU, CUDA, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/` in default tests.
|
||||
- Do not implement real math rendering; use the Sprint 6 local checker boundary.
|
||||
- Do not commit generated conversion outputs or sample PDFs.
|
||||
|
||||
## Work Packages
|
||||
|
||||
### WP7.1: Public API And Result Records
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
- `requirements-guard-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add `conversion.py`.
|
||||
- Define project-owned conversion result records.
|
||||
- Expose `convert_pdf` from the library surface.
|
||||
- Support fake adapter and deterministic clock injection for tests.
|
||||
|
||||
Output:
|
||||
|
||||
- Callers can run one conversion without depending on raw MinerU objects.
|
||||
|
||||
### WP7.2: Single-PDF Orchestration And Output Writing
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
- `mineru-integration-agent`
|
||||
- `metadata-agent`
|
||||
- `obsidian-markdown-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Connect path planning, adapter execution, Markdown normalization, metadata building, quality checks, and report rendering.
|
||||
- Write final Markdown, optional metadata JSON, and report Markdown.
|
||||
- Compute source SHA-256.
|
||||
- Preserve strict-local behavior and no-fallback behavior.
|
||||
|
||||
Output:
|
||||
|
||||
- One PDF can be converted through mocked adapter outputs in tests and through the real adapter in normal use.
|
||||
|
||||
### WP7.3: Asset And Raw Output Handling
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
- `obsidian-markdown-agent`
|
||||
- `metadata-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Copy local adapter-provided assets into the planned assets directory.
|
||||
- Normalize Markdown links relative to the final Markdown file.
|
||||
- Preserve raw output only when requested.
|
||||
- Clean temporary work when raw output is not requested.
|
||||
|
||||
Output:
|
||||
|
||||
- Markdown, assets, metadata, and report paths are stable and local-only.
|
||||
|
||||
### WP7.4: CLI Convert And Batch Summary
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
- `requirements-guard-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Replace the placeholder CLI with `convert` while keeping `--version`.
|
||||
- Add only the agreed v1 options.
|
||||
- Print deterministic summaries with converted, failed, and warning counts.
|
||||
- Return non-zero exit code when hard failures occur.
|
||||
|
||||
Output:
|
||||
|
||||
- Users can run `pdf2md convert` for one PDF or a directory.
|
||||
|
||||
### WP7.5: Independent Evaluation
|
||||
|
||||
Owner:
|
||||
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Review completed orchestration behavior against this contract.
|
||||
- Verify no default test executes real MinerU, uses GPU, downloads models, uses network, or requires `samples/`.
|
||||
- Verify no runtime remote/API path or alternate engine is introduced.
|
||||
- Verify `samples/` remains untracked and unstaged.
|
||||
|
||||
Output:
|
||||
|
||||
- PASS/FAIL notes with any missing acceptance criteria.
|
||||
|
||||
## Verification Checks
|
||||
|
||||
Required:
|
||||
|
||||
- `git status --short --untracked-files=all` before staging confirms `samples/` remains untracked and unstaged.
|
||||
- `uv --version` is run and result is recorded.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest tests/test_conversion.py tests/test_cli.py` passes.
|
||||
- `uv run pytest` passes.
|
||||
- `git diff --check` passes.
|
||||
- Default tests do not require real MinerU, CUDA, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
- No model downloads occur.
|
||||
- No network calls are required.
|
||||
- No candidate engine comparison is reintroduced.
|
||||
- No alternate engine or runtime engine selection is added.
|
||||
- No CLI/API option disables strict-local policy.
|
||||
- No `--api-url`, router mode, HTTP client backend, remote API, or remote OpenAI-compatible backend support is added.
|
||||
- Adapter failures produce explicit failed results and no fallback conversion.
|
||||
- Output files are written only after path preflight succeeds.
|
||||
- Existing outputs are protected unless overwrite is enabled.
|
||||
- CLI summaries include warning counts.
|
||||
- Metadata/report paths and counts match the files written.
|
||||
|
||||
Recommended:
|
||||
|
||||
- Keep conversion orchestration small and dependency-injected.
|
||||
- Prefer local temporary directories from the standard library for raw work when `keep_raw` is disabled.
|
||||
- Keep batch conversion a thin loop over single-file conversion.
|
||||
- Keep CLI formatting simple and stable enough for tests.
|
||||
- Use fake adapter records in tests rather than monkeypatching subprocess behavior at the CLI layer.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
Sprint 7 fails and must stop for a user decision if any of these are true:
|
||||
|
||||
- Public API requires or exposes raw MinerU-specific Python objects as required return fields.
|
||||
- The implementation silently falls back to another engine after MinerU failure.
|
||||
- A CLI/API option disables strict-local policy.
|
||||
- The implementation adds or permits `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
|
||||
- Default tests execute real MinerU, require GPU/CUDA, download models, use network, require Obsidian/LaTeX tooling, or require `samples/`.
|
||||
- Output writing can escape the planned output root.
|
||||
- Existing files are overwritten without explicit overwrite intent.
|
||||
- CLI writes final outputs after a preflight hard failure.
|
||||
- CLI summaries suppress warning counts or failed counts.
|
||||
- Metadata/report content omits warnings emitted during adapter, normalization, asset, or quality steps.
|
||||
- `samples/` is staged or committed.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
Sprint 7 is complete when:
|
||||
|
||||
- `src/pdf2md/conversion.py` exists and owns conversion orchestration.
|
||||
- `convert_pdf` is available from the public Python package.
|
||||
- `pdf2md convert INPUT --out OUTPUT_DIR` exists.
|
||||
- Single-PDF conversion is tested with a fake adapter.
|
||||
- Directory and recursive conversion behavior is tested with fake adapters.
|
||||
- Output conflict and overwrite behavior is tested.
|
||||
- Adapter failure produces a clear failed result and no fallback.
|
||||
- Final Markdown, metadata JSON when enabled, and report Markdown are written by product behavior.
|
||||
- Local adapter-provided assets are copied or warned about deterministically.
|
||||
- `--keep-raw` behavior is tested.
|
||||
- CLI summaries include converted, failed, and warning counts.
|
||||
- Default tests do not require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
- `uv sync` passes.
|
||||
- Targeted conversion/CLI tests pass.
|
||||
- `uv run pytest` passes.
|
||||
- `PROGRESS.md` records checks performed and residual risks.
|
||||
- Independent evaluation is complete.
|
||||
- The completed change is committed.
|
||||
|
||||
## Handoff Fields
|
||||
|
||||
Use these fields when Sprint 7 completes:
|
||||
|
||||
- Files changed:
|
||||
- Commands run:
|
||||
- Tests passed:
|
||||
- Tests blocked:
|
||||
- Known failures:
|
||||
- Residual risks:
|
||||
- User decisions needed:
|
||||
- Go/no-go recommendation for Sprint 8:
|
||||
- Next action:
|
||||
@@ -0,0 +1,339 @@
|
||||
# Sprint 8 Contract: Doctor And Setup Documentation
|
||||
|
||||
Status: Implemented
|
||||
Last updated: 2026-05-08
|
||||
|
||||
## Objective
|
||||
|
||||
Make local setup failures explicit before users run conversions by adding a mockable `pdf2md doctor` diagnostic path and setup documentation for Windows PowerShell, Python 3.12, `uv`, MinerU 3.1.0, local model/cache expectations, NVIDIA GPU/CUDA visibility, and strict-local runtime behavior.
|
||||
|
||||
Sprint 8 must establish:
|
||||
|
||||
- A project-owned doctor module for local environment diagnostics.
|
||||
- A `pdf2md doctor` CLI command with deterministic exit codes.
|
||||
- Clear reporting for Python, `uv`, MinerU CLI, MinerU version, GPU/CUDA/PyTorch visibility, and model/cache path detection.
|
||||
- Documentation that explains setup steps and local-only runtime constraints without introducing cloud/API fallback paths.
|
||||
- Fast tests that mock environment checks and do not require real MinerU, CUDA, GPU, model files, network, `samples/`, Obsidian, LaTeX tooling, or package/model downloads.
|
||||
|
||||
Sprint 8 is a diagnostics and setup-documentation sprint. It must not change conversion output behavior, run real conversions, probe local sample PDFs, or make runtime conversion depend on network access.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
Sprint 7 is complete:
|
||||
|
||||
- `src/pdf2md/paths.py` owns input discovery and output path planning.
|
||||
- `src/pdf2md/ir.py` owns project records, block types, warning codes, and warning severities.
|
||||
- `src/pdf2md/metadata.py` builds JSON-serializable metadata and summary counts from project-owned records.
|
||||
- `src/pdf2md/mineru_adapter.py` owns the mocked direct local MinerU CLI adapter boundary.
|
||||
- `src/pdf2md/markdown.py` owns Obsidian Markdown normalization.
|
||||
- `src/pdf2md/quality.py` owns local quality checks, including math checker unavailable behavior.
|
||||
- `src/pdf2md/report.py` owns report content rendering and final status calculation.
|
||||
- `src/pdf2md/conversion.py` owns conversion orchestration and output writing.
|
||||
- `src/pdf2md/cli.py` owns `pdf2md convert`.
|
||||
- `uv run pytest` passed 119 tests.
|
||||
|
||||
Sprint 8 may add doctor diagnostics and setup documentation. It must not require a real successful local MinerU/GPU/model setup in the default test loop.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed:
|
||||
|
||||
- `src/pdf2md/doctor.py`
|
||||
- `src/pdf2md/cli.py`
|
||||
- `src/pdf2md/mineru_adapter.py` only for narrowly required availability/version helper compatibility
|
||||
- `README.md`
|
||||
- `scripts/install-mineru.ps1` only if implemented as an explicit user-invoked setup helper
|
||||
- `scripts/install-models.py` only if implemented as an explicit user-invoked setup helper
|
||||
- `tests/test_doctor.py`
|
||||
- `tests/test_cli.py`
|
||||
- `tests/test_mineru_adapter.py` only if adapter helper compatibility changes
|
||||
- `PLAN.md`
|
||||
- `PROGRESS.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md`
|
||||
- `docs/Sprints/SPRINT8CONTRACT.md`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- Changes to `src/pdf2md/conversion.py` unless a doctor/CLI regression forces a narrow compatibility fix
|
||||
- Changes to Markdown normalization, metadata schema, report rendering, or path planning unrelated to doctor behavior
|
||||
- Real PDF conversion in default tests
|
||||
- Real MinerU execution in default tests
|
||||
- Real CUDA/GPU dependency in default tests
|
||||
- Model downloads in default tests
|
||||
- Any setup download triggered by `pdf2md doctor`, `pdf2md convert`, import time, or tests
|
||||
- Runtime engine selection or alternate engine support
|
||||
- Cloud OCR, remote LLM/VLM, hosted renderer, remote document parser, remote asset fetching, HTTP client backend, router mode, `--api-url`, remote APIs, or remote OpenAI-compatible backend support
|
||||
- A CLI/API option that disables strict-local policy
|
||||
- Committed files under `samples/`
|
||||
- Generated conversion outputs committed to git
|
||||
|
||||
## Expected Outputs
|
||||
|
||||
Sprint 8 should produce:
|
||||
|
||||
1. Doctor result records and API
|
||||
- A small project-owned doctor result type containing at least:
|
||||
- check name
|
||||
- status: `pass`, `warn`, or `fail`
|
||||
- human-readable message
|
||||
- optional details
|
||||
- A doctor report type containing ordered checks and an overall status.
|
||||
- Mockable checker dependencies for subprocess calls, executable discovery, Python version, imports, environment variables, and filesystem paths.
|
||||
- No public or required field should expose raw subprocess or third-party objects.
|
||||
|
||||
2. Required checks
|
||||
- Python version check:
|
||||
- Pass on Python 3.12.
|
||||
- Fail outside the supported project range.
|
||||
- `uv` check:
|
||||
- Detect executable availability.
|
||||
- Report version text when available.
|
||||
- Fail clearly when missing.
|
||||
- MinerU check:
|
||||
- Detect direct local `mineru` CLI availability through the existing adapter boundary where possible.
|
||||
- Report MinerU version when available.
|
||||
- Fail clearly when the CLI is missing.
|
||||
- Warn or fail clearly when version detection fails or the detected version is not MinerU 3.1.0.
|
||||
- GPU/CUDA/PyTorch visibility:
|
||||
- Report whether an NVIDIA GPU is visible when detectable.
|
||||
- Report CUDA/PyTorch visibility without requiring PyTorch in default tests.
|
||||
- Warn clearly when GPU/CUDA/PyTorch acceleration is unavailable.
|
||||
- Warn clearly when the detected GPU is GTX 1070 Ti or another Pascal/pre-Turing class GPU with likely CUDA/PyTorch compatibility risk.
|
||||
- Model/cache paths:
|
||||
- Report detectable local model/cache paths from documented environment variables or known local config locations.
|
||||
- Warn when model/cache paths cannot be detected.
|
||||
- Do not download, install, or validate large model files in default tests.
|
||||
- Local-only policy:
|
||||
- Report that runtime conversion allows only direct local `mineru` CLI execution and CLI-internal temporary local `mineru-api`.
|
||||
- Report that `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends are prohibited.
|
||||
|
||||
3. CLI command
|
||||
- `pdf2md doctor` exists.
|
||||
- `pdf2md --version` remains unchanged.
|
||||
- `pdf2md convert` behavior remains covered and unchanged except for parser integration.
|
||||
- Exit code policy:
|
||||
- `0` when all checks pass or only warnings exist.
|
||||
- Non-zero when any required check fails.
|
||||
- CLI output must be concise, deterministic, and testable.
|
||||
- CLI output must distinguish warnings from failures.
|
||||
|
||||
4. Setup documentation
|
||||
- README or setup section explains:
|
||||
- Windows PowerShell workflow.
|
||||
- Python 3.12 requirement.
|
||||
- `uv` usage and PATH note for `C:\Users\user\.local\bin`.
|
||||
- MinerU 3.1.0 local CLI expectation.
|
||||
- Model/cache setup expectations and where doctor looks.
|
||||
- NVIDIA GPU expectations and GTX 1070 Ti 8GB risk.
|
||||
- Strict-local runtime policy.
|
||||
- Difference between explicit setup downloads and runtime conversion, which must stay local-only.
|
||||
- If setup helper scripts are added, they must be explicit user-invoked helpers, not imported by package code, not called by `doctor`, and not used by default tests.
|
||||
|
||||
5. Tests
|
||||
- Unit tests for doctor success with mocked checks.
|
||||
- Unit tests for missing Python version support.
|
||||
- Unit tests for missing `uv`.
|
||||
- Unit tests for missing MinerU.
|
||||
- Unit tests for MinerU version detection failure or non-3.1.0 version.
|
||||
- Unit tests for missing GPU/CUDA/PyTorch warning behavior.
|
||||
- Unit tests for GTX 1070 Ti/Pascal risk warning behavior.
|
||||
- Unit tests for missing model/cache warning behavior.
|
||||
- CLI tests for `pdf2md doctor` success, warning-only success, and hard failure exit code.
|
||||
- Regression tests proving `pdf2md convert` and `--version` still work.
|
||||
- Tests proving default checks do not require real MinerU, GPU, CUDA, PyTorch, model files, network, `samples/`, Obsidian, or LaTeX tooling.
|
||||
|
||||
6. Handoff
|
||||
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, known failures, residual risks, and next action.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- Do not run real MinerU in default tests.
|
||||
- Do not install MinerU 3.1.0.
|
||||
- Do not download MinerU models.
|
||||
- Do not run real model setup during doctor or tests.
|
||||
- Do not parse, convert, or inspect sample PDFs.
|
||||
- Do not implement local fixture evaluation.
|
||||
- Do not change conversion orchestration behavior except for unavoidable CLI parser integration.
|
||||
- Do not add runtime engine selection.
|
||||
- Do not add alternate engines.
|
||||
- Do not add cloud, remote API, router, HTTP client backend, remote OpenAI-compatible backend, hosted renderer, or remote asset-fetching support.
|
||||
- Do not add a CLI/API option that disables strict-local policy.
|
||||
- Do not require real CUDA, GPU, PyTorch, MinerU, model files, network, Obsidian, LaTeX tooling, or `samples/` in default tests.
|
||||
- Do not claim that GTX 1070 Ti CUDA/PyTorch acceleration is guaranteed until local validation proves it.
|
||||
- Do not claim perfect setup automation.
|
||||
|
||||
## Work Packages
|
||||
|
||||
### WP8.1: Doctor Records And Mockable Check Boundaries
|
||||
|
||||
Owner:
|
||||
|
||||
- `local-setup-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add `doctor.py`.
|
||||
- Define doctor check/result records.
|
||||
- Keep all external probes injectable or mockable.
|
||||
- Implement deterministic aggregation and exit status policy.
|
||||
|
||||
Output:
|
||||
|
||||
- The CLI can report setup health without depending on real local tools in tests.
|
||||
|
||||
### WP8.2: Environment And MinerU Diagnostics
|
||||
|
||||
Owner:
|
||||
|
||||
- `local-setup-agent`
|
||||
- `mineru-integration-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add Python, `uv`, MinerU availability, MinerU version, GPU/CUDA/PyTorch, and model/cache checks.
|
||||
- Reuse the direct local MinerU adapter boundary where it fits.
|
||||
- Keep MinerU 3.1.0 as the only accepted engine target.
|
||||
|
||||
Output:
|
||||
|
||||
- Users get clear local setup failures before conversion.
|
||||
|
||||
### WP8.3: CLI Doctor Command
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
- `requirements-guard-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add `pdf2md doctor` without breaking `pdf2md convert` or `--version`.
|
||||
- Print deterministic pass/warn/fail lines and an overall status.
|
||||
- Return non-zero when required checks fail.
|
||||
|
||||
Output:
|
||||
|
||||
- The command-line workflow can diagnose setup state.
|
||||
|
||||
### WP8.4: Setup Documentation And Optional Explicit Helpers
|
||||
|
||||
Owner:
|
||||
|
||||
- `local-setup-agent`
|
||||
- `license-privacy-agent`
|
||||
- `requirements-guard-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Update README setup docs for Windows PowerShell, Python 3.12, `uv`, MinerU 3.1.0, model/cache, GPU, and local-only runtime policy.
|
||||
- Verify volatile install/setup claims against official docs before editing.
|
||||
- If adding scripts, keep them explicit, local setup-only, and never called by doctor, convert, import time, or default tests.
|
||||
|
||||
Output:
|
||||
|
||||
- Setup instructions are clear without weakening strict-local runtime policy.
|
||||
|
||||
### WP8.5: Independent Evaluation
|
||||
|
||||
Owner:
|
||||
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Review completed doctor behavior and docs against this contract.
|
||||
- Verify no default test executes real MinerU, uses GPU/CUDA, downloads models, uses network, or requires `samples/`.
|
||||
- Verify no runtime remote/API path or alternate engine is introduced.
|
||||
- Verify `samples/` remains untracked and unstaged.
|
||||
|
||||
Output:
|
||||
|
||||
- PASS/FAIL notes with any missing acceptance criteria.
|
||||
|
||||
## Verification Checks
|
||||
|
||||
Required:
|
||||
|
||||
- `git status --short --untracked-files=all` before staging confirms `samples/` remains untracked and unstaged.
|
||||
- `uv --version` is run and result is recorded.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest tests/test_doctor.py tests/test_cli.py` passes.
|
||||
- `uv run pytest` passes.
|
||||
- `git diff --check` passes.
|
||||
- Default tests do not require real MinerU, CUDA, GPU, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
- No model downloads occur.
|
||||
- No setup downloads occur from doctor, convert, imports, or tests.
|
||||
- No network calls are required in default tests.
|
||||
- No candidate engine comparison is reintroduced.
|
||||
- No alternate engine or runtime engine selection is added.
|
||||
- No CLI/API option disables strict-local policy.
|
||||
- No `--api-url`, router mode, HTTP client backend, remote API, or remote OpenAI-compatible backend support is added.
|
||||
- Doctor fails clearly when required dependencies are missing.
|
||||
- Doctor does not report the environment as healthy when MinerU is missing.
|
||||
- Doctor warnings are clear for GPU/CUDA/PyTorch/model-cache risk.
|
||||
- `pdf2md convert` tests still pass.
|
||||
- `pdf2md --version` still works.
|
||||
|
||||
Recommended:
|
||||
|
||||
- Keep doctor checks small, ordered, and deterministic.
|
||||
- Keep human-readable output stable enough for unit tests.
|
||||
- Use dependency injection rather than monkeypatching global process state where possible.
|
||||
- Treat real local GPU/MinerU/model probes as optional manual verification outside the default test suite.
|
||||
- Use `requirements-guard-agent` if setup wording risks weakening strict-local policy.
|
||||
- Use `research-agent` or `local-setup-agent` with live web verification before changing volatile installation commands or model-cache documentation.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
Sprint 8 fails and must stop for a user decision if any of these are true:
|
||||
|
||||
- Doctor reports a healthy environment when MinerU is missing.
|
||||
- Doctor says cloud/API fallback is supported.
|
||||
- Doctor, import time, or default tests install packages, download models, call network services, or run model setup.
|
||||
- Default tests require real MinerU, CUDA, GPU, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
- The implementation adds or permits `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
|
||||
- The implementation adds runtime engine selection or alternate engines.
|
||||
- `pdf2md doctor` breaks `pdf2md convert` or `pdf2md --version`.
|
||||
- The README or scripts imply runtime conversion can upload PDFs, page images, extracted text, or intermediates to remote services.
|
||||
- Setup helper scripts are invoked automatically by doctor, convert, import time, or tests.
|
||||
- `samples/` is staged or committed.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
Sprint 8 is complete when:
|
||||
|
||||
- `src/pdf2md/doctor.py` exists and owns local setup diagnostics.
|
||||
- `pdf2md doctor` exists.
|
||||
- Doctor returns a project-owned report with ordered checks and overall status.
|
||||
- Python 3.12, `uv`, MinerU availability/version, GPU/CUDA/PyTorch visibility, and model/cache path checks are implemented with mocked tests.
|
||||
- Missing `uv` is tested.
|
||||
- Missing MinerU is tested and produces a failure.
|
||||
- Missing GPU/CUDA/PyTorch is tested and produces clear warning behavior.
|
||||
- GTX 1070 Ti/Pascal risk warning behavior is tested.
|
||||
- Missing model/cache path warning behavior is tested.
|
||||
- `pdf2md doctor` exit code behavior is tested for success, warning-only success, and failure.
|
||||
- `pdf2md convert` and `pdf2md --version` regression tests still pass.
|
||||
- Setup docs explain local-only runtime behavior and do not imply cloud/API fallback.
|
||||
- Default tests do not require real MinerU, GPU, CUDA, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
- `uv sync` passes.
|
||||
- Targeted doctor/CLI tests pass.
|
||||
- `uv run pytest` passes.
|
||||
- `PROGRESS.md` records checks performed and residual risks.
|
||||
- Independent evaluation is complete.
|
||||
- The completed change is committed.
|
||||
|
||||
## Handoff Fields
|
||||
|
||||
Use these fields when Sprint 8 completes:
|
||||
|
||||
- Files changed:
|
||||
- Commands run:
|
||||
- Tests passed:
|
||||
- Tests blocked:
|
||||
- Known failures:
|
||||
- Residual risks:
|
||||
- User decisions needed:
|
||||
- Go/no-go recommendation for Sprint 9:
|
||||
- Next action:
|
||||
@@ -0,0 +1,320 @@
|
||||
# Sprint 9 Contract: Local Fixture Evaluation And V1 Release Gate
|
||||
|
||||
Status: Implemented
|
||||
Last updated: 2026-05-08
|
||||
|
||||
## Objective
|
||||
|
||||
Validate the v1 converter against local fixture workflows without committing sample PDFs or making the default test loop depend on MinerU models, GPU, CUDA, network access, Obsidian, or LaTeX tooling.
|
||||
|
||||
Sprint 9 must establish:
|
||||
|
||||
- A fast mocked integration suite that exercises the public conversion path end to end.
|
||||
- An optional, explicitly enabled local MinerU fixture evaluation path for `samples/`.
|
||||
- A fixture coverage manifest or checklist that records which local PDFs cover math, tables, figures/assets, reading order, Korean filenames, and metadata/report risks.
|
||||
- Release-gate documentation that distinguishes default automated checks from optional local MinerU/GPU checks.
|
||||
- Clear `PROGRESS.md` notes for local fixture coverage, skipped/blocked optional checks, known quality risks, and the v1 go/no-go recommendation.
|
||||
|
||||
Sprint 9 is an evaluation and release-gate sprint. It may add tests, local-only evaluation helpers, fixture manifests, and narrow compatibility fixes only when needed to evaluate the current v1 behavior. It must not add alternate engines, cloud/API paths, runtime engine selection, or automatic model downloads.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
Sprint 8 is complete:
|
||||
|
||||
- `pdf2md doctor` exists and reports Python, `uv`, MinerU CLI/version, GPU, PyTorch, model/cache, and strict-local policy status.
|
||||
- Local `pdf2md doctor` currently fails because the `mineru` CLI is not installed on PATH.
|
||||
- `pdf2md convert` exists and writes Markdown, metadata JSON, and `<stem>.report.md` with fake-adapter test coverage.
|
||||
- Default tests pass without real MinerU, CUDA, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
- `samples/` exists locally and is untracked. Observed local fixture files include:
|
||||
- `samples/FourNodeQuadrilateralShellElementMITC4.pdf`
|
||||
- `samples/MITC공부.pdf`
|
||||
- `samples/2007쉘구조물의유한요소해석에대하여.pdf`
|
||||
- `samples/유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf`
|
||||
- `samples/metadata.json`
|
||||
|
||||
Sprint 9 must preserve the untracked status of `samples/` unless the user explicitly requests otherwise.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed:
|
||||
|
||||
- `tests/integration/`
|
||||
- `tests/test_conversion.py`
|
||||
- `tests/test_cli.py`
|
||||
- `tests/test_report.py`
|
||||
- `tests/test_metadata.py`
|
||||
- `tests/test_quality.py`
|
||||
- `tests/conftest.py` only for markers or opt-in fixture controls
|
||||
- `src/pdf2md/mineru_adapter.py` only for narrow compatibility fixes backed by mocked or optional local MinerU output evidence
|
||||
- `src/pdf2md/conversion.py` only for narrow release-gate defects found by integration tests
|
||||
- `src/pdf2md/quality.py` only for local quality metric defects found by integration tests
|
||||
- `src/pdf2md/report.py` only for report defects found by integration tests
|
||||
- `README.md`
|
||||
- `docs/V1RELEASECHECKLIST.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md`
|
||||
- `docs/Sprints/SPRINT9CONTRACT.md`
|
||||
- `PLAN.md`
|
||||
- `PROGRESS.md`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- Committed files under `samples/`
|
||||
- Committed generated conversion outputs from local sample PDFs
|
||||
- Mandatory tests that require real MinerU, GPU, CUDA, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`
|
||||
- Automatic package installs or model downloads from tests, import time, doctor, convert, or helpers
|
||||
- Runtime engine selection or alternate conversion engines
|
||||
- Cloud OCR, remote LLM/VLM, hosted renderer, remote document parser, remote asset fetching, `--api-url`, router mode, HTTP client backends, remote APIs, or remote OpenAI-compatible backends
|
||||
- CLI/API options that disable strict-local policy
|
||||
- Claims that v1 perfectly reconstructs LaTeX, tables, or reading order
|
||||
|
||||
## Expected Outputs
|
||||
|
||||
1. Fast mocked integration suite
|
||||
- Exercises `convert_pdf` and/or `pdf2md convert` with a fake MinerU adapter through the real orchestration path.
|
||||
- Verifies Markdown, metadata JSON, and `<stem>.report.md` are all written.
|
||||
- Verifies output paths, asset links, warning counts, and report status stay consistent.
|
||||
- Verifies failures produce metadata/report warnings when possible and do not silently fallback.
|
||||
- Runs as part of `uv run pytest` without real MinerU, models, GPU, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
|
||||
2. Optional local MinerU fixture evaluation
|
||||
- Provides an explicit opt-in command or pytest marker/environment gate for real local MinerU sample evaluation.
|
||||
- Skips or reports a clear local blocker when `pdf2md doctor` fails because MinerU, model/cache paths, or GPU/PyTorch acceleration are unavailable.
|
||||
- Reads sample PDFs only from `samples/` or a user-provided local sample directory.
|
||||
- Writes generated outputs to a temporary or ignored output directory, never to tracked fixture paths.
|
||||
- Produces or records, for each attempted sample:
|
||||
- source filename
|
||||
- command run
|
||||
- exit code
|
||||
- generated Markdown path
|
||||
- generated metadata JSON path
|
||||
- generated `.report.md` path
|
||||
- warning count
|
||||
- math renderability or checker-unavailable count
|
||||
- table fallback/degradation count when available
|
||||
- missing or broken asset link count
|
||||
- page coverage when available
|
||||
- Does not mark optional evaluation as passed when MinerU is missing; it records the blocker.
|
||||
|
||||
3. Fixture coverage manifest or checklist
|
||||
- Maps local sample files to risk categories:
|
||||
- simple digital PDF
|
||||
- math-heavy PDF
|
||||
- multi-column or complex reading order
|
||||
- table with formulas
|
||||
- figure/caption/assets
|
||||
- Korean filename/path handling
|
||||
- May store only relative sample names, categories, and notes; it must not embed sample PDFs or generated outputs.
|
||||
- Records coverage gaps that need additional user-provided samples.
|
||||
|
||||
4. V1 release checklist
|
||||
- Defines default release gates:
|
||||
- `uv sync`
|
||||
- `uv run pytest`
|
||||
- `uv run pdf2md --version`
|
||||
- `uv run pdf2md doctor`
|
||||
- `git diff --check`
|
||||
- `git status --short --untracked-files=all`
|
||||
- Defines optional local MinerU release gates separately from default gates.
|
||||
- Requires Markdown, metadata JSON, and `.report.md` to exist before any sample conversion is considered successful.
|
||||
- Requires warnings and residual risks to be recorded in `PROGRESS.md`.
|
||||
- Makes local-only and no-sample-commit checks explicit.
|
||||
|
||||
5. Documentation
|
||||
- README or release checklist explains how to run default checks and optional local fixture checks.
|
||||
- Documentation states that optional fixture checks may be skipped or blocked until MinerU 3.1.0 and model/cache setup are available.
|
||||
- Documentation does not instruct users to use `--api-url`, router mode, HTTP client backends, remote APIs, or remote OpenAI-compatible backends.
|
||||
|
||||
6. Handoff
|
||||
- `PROGRESS.md` records changed files, commands run, tests passed or blocked, local fixture status, generated output location if any, known failures, residual risks, and next action.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
- Do not install MinerU.
|
||||
- Do not download MinerU models.
|
||||
- Do not run model setup automatically.
|
||||
- Do not require the local GTX 1070 Ti to pass CUDA/PyTorch checks in the default test loop.
|
||||
- Do not improve OCR/model accuracy.
|
||||
- Do not introduce a manual review UI or web UI.
|
||||
- Do not add alternate conversion engines or fallback engines.
|
||||
- Do not benchmark against cloud OCR/API services.
|
||||
- Do not commit sample PDFs, sample-derived outputs, or large binary fixtures.
|
||||
- Do not make text edit distance the only quality criterion.
|
||||
- Do not claim v1 is release-ready if metadata JSON or `.report.md` generation is missing.
|
||||
|
||||
## Work Packages
|
||||
|
||||
### WP9.1: Fast Mocked Integration Checks
|
||||
|
||||
Owner:
|
||||
|
||||
- `feature-generator-agent`
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add integration-level tests that use fake adapter output but run the public conversion orchestration and CLI paths.
|
||||
- Assert generated Markdown, metadata JSON, `.report.md`, assets, warnings, and summaries are mutually consistent.
|
||||
- Keep tests deterministic and independent of real samples.
|
||||
|
||||
Output:
|
||||
|
||||
- `uv run pytest` covers v1 file-output behavior without model or GPU dependencies.
|
||||
|
||||
### WP9.2: Optional MinerU Sample Evaluation Harness
|
||||
|
||||
Owner:
|
||||
|
||||
- `mineru-integration-agent`
|
||||
- `local-setup-agent`
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add an explicit opt-in local fixture command/test path.
|
||||
- Gate real MinerU execution behind an environment variable, marker, or explicit command documented in README/checklist.
|
||||
- Run `pdf2md doctor` or equivalent preflight before optional local MinerU evaluation.
|
||||
- Use temporary or ignored output directories.
|
||||
- Record blocked status clearly when MinerU/model/cache setup is missing.
|
||||
|
||||
Output:
|
||||
|
||||
- Local users can run real sample evaluation when setup is ready, while default tests stay fast and local.
|
||||
|
||||
### WP9.3: Fixture Coverage And Metrics
|
||||
|
||||
Owner:
|
||||
|
||||
- `evaluation-agent`
|
||||
- `obsidian-markdown-agent`
|
||||
- `metadata-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Define fixture categories and expected risk coverage.
|
||||
- Track math delimiter/renderability, tables, reading order, assets, page coverage, metadata fields, warning counts, and report usefulness.
|
||||
- Avoid scoring quality only by plain-text edit distance.
|
||||
|
||||
Output:
|
||||
|
||||
- Fixture coverage is explicit and gaps are visible.
|
||||
|
||||
### WP9.4: V1 Release Gate Documentation
|
||||
|
||||
Owner:
|
||||
|
||||
- `requirements-guard-agent`
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Add or update release checklist documentation.
|
||||
- Separate default release gates from optional local MinerU/GPU gates.
|
||||
- Keep strict-local wording consistent with `ARCHITECTURE.md`, `PRD.md`, and `README.md`.
|
||||
- Update `PLAN.md` and `PROGRESS.md` with the next action and release readiness state.
|
||||
|
||||
Output:
|
||||
|
||||
- A future agent can determine whether v1 is blocked, partial, or ready without relying on conversation history.
|
||||
|
||||
### WP9.5: Independent Evaluation
|
||||
|
||||
Owner:
|
||||
|
||||
- `evaluation-agent`
|
||||
|
||||
Actions:
|
||||
|
||||
- Review completed Sprint 9 work against this contract.
|
||||
- Verify default tests do not require real MinerU, GPU, CUDA, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
- Verify optional local MinerU evaluation is clearly gated.
|
||||
- Verify generated sample outputs and sample PDFs are not staged.
|
||||
- Verify release checklist cannot pass without Markdown, metadata JSON, and `.report.md`.
|
||||
|
||||
Output:
|
||||
|
||||
- PASS/FAIL notes with actionable findings and residual risk.
|
||||
|
||||
## Verification Checks
|
||||
|
||||
Required:
|
||||
|
||||
- `git status --short --untracked-files=all` before staging confirms `samples/` remains untracked and unstaged.
|
||||
- `uv --version` is run and result is recorded.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- Targeted integration tests pass.
|
||||
- `uv run pdf2md --version` passes.
|
||||
- `uv run pdf2md doctor` is run and its result is recorded as pass, warn, or blocked/fail.
|
||||
- `git diff --check` passes.
|
||||
- Default tests do not require real MinerU, CUDA, GPU, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
- No model downloads occur.
|
||||
- No setup downloads occur from tests, import time, doctor, convert, or helper scripts.
|
||||
- No network calls are required in default tests.
|
||||
- No candidate engine comparison is reintroduced.
|
||||
- No alternate engine or runtime engine selection is added.
|
||||
- No CLI/API option disables strict-local policy.
|
||||
- No `--api-url`, router mode, HTTP client backend, remote API, or remote OpenAI-compatible backend support is added.
|
||||
- Optional local MinerU checks are skipped or blocked clearly when setup is unavailable.
|
||||
- Sample PDFs and generated sample outputs are not staged or committed.
|
||||
- `PROGRESS.md` records local fixture coverage status and release readiness.
|
||||
|
||||
Recommended:
|
||||
|
||||
- Add a pytest marker or environment variable for optional local MinerU tests.
|
||||
- Keep optional output under a temporary directory or an ignored local output root.
|
||||
- Include at least one Korean filename/path check in fast mocked tests.
|
||||
- Include one fake output with math, one with a table warning, and one with an asset link.
|
||||
- Record source-to-output paths in release checklist examples.
|
||||
- Treat local doctor failure as a release blocker for real MinerU validation but not for the default fast test loop.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
Sprint 9 fails and must stop for a user decision if any of these are true:
|
||||
|
||||
- Default tests require real MinerU, GPU, CUDA, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
- Sample PDFs or generated sample outputs are staged or committed.
|
||||
- Optional real MinerU evaluation runs without an explicit opt-in gate.
|
||||
- Optional real MinerU evaluation writes generated output into tracked fixture paths.
|
||||
- V1 release checklist can pass without generated Markdown, metadata JSON, and `.report.md`.
|
||||
- Release status is marked ready when `pdf2md doctor` has a hard failure and no explicit user waiver is recorded.
|
||||
- The implementation adds runtime engine selection or alternate engines.
|
||||
- The implementation adds or permits `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
|
||||
- The implementation uses cloud/API fallback for any fixture evaluation.
|
||||
- The implementation hides MinerU failure or silently falls back to another engine.
|
||||
- Quality criteria ignore math, tables, reading order, assets, metadata, or report quality.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
Sprint 9 is complete when:
|
||||
|
||||
- `docs/Sprints/SPRINT9CONTRACT.md` exists and is referenced by relevant agents.
|
||||
- Fast mocked integration tests exist and pass under `uv run pytest`.
|
||||
- Optional local MinerU fixture evaluation is documented and explicitly gated.
|
||||
- Local fixture coverage categories and gaps are recorded.
|
||||
- Release checklist documentation exists or is updated.
|
||||
- `PROGRESS.md` records optional local MinerU status, including skipped/blocked reasons when applicable.
|
||||
- Default tests do not require real MinerU, GPU, CUDA, PyTorch, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
- No sample PDF or generated sample output is staged or committed.
|
||||
- `uv sync` passes.
|
||||
- `uv run pytest` passes.
|
||||
- `git diff --check` passes.
|
||||
- Independent evaluation is complete.
|
||||
- The completed change is committed.
|
||||
|
||||
## Handoff Fields
|
||||
|
||||
Use these fields when Sprint 9 completes:
|
||||
|
||||
- Files changed:
|
||||
- Commands run:
|
||||
- Tests passed:
|
||||
- Tests blocked:
|
||||
- Optional local MinerU status:
|
||||
- Fixture coverage:
|
||||
- Generated output locations:
|
||||
- Known failures:
|
||||
- Residual risks:
|
||||
- User decisions needed:
|
||||
- V1 release recommendation:
|
||||
- Go/no-go recommendation for next sprint:
|
||||
- Next action:
|
||||
Reference in New Issue
Block a user