PDFToMD/docs/V1IMPLEMENTATIONPLAN.md

# V1 Implementation Plan: Local PDF-to-Markdown Converter

Last updated: 2026-05-08

This document is the implementation plan for v1. It does not replace `PRD.md` or `ARCHITECTURE.md`; use those files as the source of product requirements and system design. This plan explains the order of work, sprint contracts, verification gates, and agent ownership for implementing the converter.

Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents.

## 1. V1 Outcome

v1 is complete when a local user can run:

```bash
uv run pdf2md doctor
uv run pdf2md convert paper.pdf --out out --metadata
uv run pdf2md convert pdfs --out out --recursive --metadata
```

and receive, for each PDF:

- Obsidian-friendly Markdown.
- A stable sibling assets directory when assets exist.
- `<stem>.metadata.json`.
- `<stem>.report.md`.
- Clear warnings when math, tables, assets, reading order, GPU availability, or MinerU execution are uncertain.

Long PDFs can be chunked explicitly:

```bash
uv run pdf2md convert paper.pdf --out out --chunk-pages
uv run pdf2md convert paper.pdf --out out --chunk-pages 20
```

Chunked conversion writes separate outputs per chunk and does not merge Markdown files.

The converter must use MinerU 3.1.0 through direct local CLI execution only. It must not silently fallback to another engine.

## 2. Non-Negotiable Constraints

- Python 3.12 and `uv`.
- MinerU 3.1.0 is the only conversion engine.
- Direct local MinerU CLI execution only.
- MinerU 3.1.0 may launch a temporary local `mineru-api` internally when CLI runs without `--api-url`.
- No cloud OCR, hosted LLM/VLM, remote document parser, `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
- Target hardware: NVIDIA GTX 1070 Ti 8GB.
- Digital PDFs with text layers are the v1 priority.
- `samples/` is local fixture context and must not be committed unless explicitly requested.
- Every substantial implementation chunk needs a sprint contract and independent evaluation.

## 3. Harness Operating Model

Use the project long-running harness only for substantial implementation work.

1. `harness-planner-agent` turns the next user request into a sprint contract.
2. `evaluation-agent` reviews the contract before code changes start.
3. `feature-generator-agent` implements one approved contract at a time.
4. `feature-generator-agent` runs self-checks and records residual risks.
5. `evaluation-agent` independently verifies the result against the contract.
6. The parent agent updates `PROGRESS.md`, commits the completed change, and leaves a handoff.

After a chunk is no longer active, archive completed-work details in `docs/WORKARCHIVE.md` and keep `PROGRESS.md` focused on current status, blockers, and next actions.

Each sprint contract must include:

- Objective.
- Touched surfaces.
- Expected outputs.
- Non-goals.
- Verification checks.
- Hard failure criteria.
- Handoff fields.

## 4. Proposed Repository Layout

Create this layout incrementally; do not scaffold unused modules before a sprint needs them.

```text
pyproject.toml
README.md
src/
  pdf2md/
    __init__.py
    cli.py
    conversion.py
    pdf_splitter.py
    paths.py
    mineru_adapter.py
    ir.py
    markdown.py
    metadata.py
    quality.py
    report.py
    doctor.py
tests/
  unit/
  integration/
  fixtures/
scripts/
  install-mineru.ps1
  install-models.py
```

Planned module responsibilities:

- `cli.py`: command parsing, CLI summaries, exit codes.
- `conversion.py`: orchestration for one PDF and batch input.
- `paths.py`: input discovery, output path planning, overwrite checks.
- `mineru_adapter.py`: direct local MinerU CLI boundary.
- `ir.py`: project-owned document/page/block/asset/warning records.
- `markdown.py`: Obsidian Markdown normalization.
- `metadata.py`: metadata schema creation and warning aggregation.
- `quality.py`: local checks for assets, math renderability, and output sanity.
- `report.py`: `<stem>.report.md` generation from metadata.
- `doctor.py`: environment, dependency, CUDA/GPU, MinerU, and cache diagnostics.

## 5. Sprint Sequence

### Sprint 0: Source And Environment Verification

Active contract:

- `docs/Sprints/SPRINT0CONTRACT.md`

Objective:

- Verify the facts needed before implementation starts.

Touched surfaces:

- `docs/KNOWLEDGEBASE.md`
- `docs/V1IMPLEMENTATIONPLAN.md` if sequencing changes
- `PROGRESS.md`

Expected outputs:

- Confirmed MinerU 3.1.0 install command, CLI invocation shape, version command, output paths, and local execution behavior.
- Confirmed Python 3.12, `uv`, CUDA/PyTorch, and GTX 1070 Ti 8GB risks.
- Confirmed license notes needed before redistribution.

Verification checks:

- All volatile facts cite official MinerU, Python, uv, PyTorch/CUDA, or license sources.
- No candidate engine comparison is reintroduced.
- No implementation code is created.

Hard failure criteria:

- MinerU 3.1.0 cannot be reasonably invoked through a direct local CLI on the target environment.
- Python 3.12 compatibility is not viable without changing project requirements.

Primary agents:

- `research-agent`
- `local-setup-agent`
- `license-privacy-agent`

### Sprint 1: Project Scaffold And Fast Test Loop

Active contract:

- `docs/Sprints/SPRINT1CONTRACT.md`

Objective:

- Create the minimal Python project structure and a fast local test loop.

Touched surfaces:

- `pyproject.toml`
- `src/pdf2md/__init__.py`
- `tests/`
- Development documentation if needed

Expected outputs:

- `uv sync` works.
- `uv run pytest` works.
- Project package imports as `pdf2md`.
- CLI entry point name `pdf2md` is reserved but may initially expose only `doctor` or a clear placeholder until later sprints.
- If `uv` is still unavailable locally, Sprint 1 records that blocker and is not marked complete.

Verification checks:

- Import test passes.
- Empty test suite or initial scaffold tests pass.
- No runtime network dependency is introduced.

Hard failure criteria:

- Project cannot be installed with `uv`.
- Scaffolding adds speculative config systems, extra engines, or unused abstractions.

Primary agents:

- `harness-planner-agent`
- `feature-generator-agent`
- `evaluation-agent`

### Sprint 2: Paths, Input Discovery, And Overwrite Planning

Active contract:

- `docs/Sprints/SPRINT2CONTRACT.md`

Objective:

- Implement deterministic input and output planning before conversion logic exists.

Touched surfaces:

- `paths.py`
- `conversion.py` skeleton if needed
- CLI path handling tests

Expected outputs:

- Single PDF discovery.
- Directory PDF discovery.
- Recursive traversal only when requested.
- Deterministic output paths for Markdown, assets, metadata JSON, report, and optional raw MinerU output.
- Existing-output protection unless `--overwrite` is passed.

Verification checks:

- Unit tests for single PDF path planning.
- Unit tests for directory and recursive discovery.
- Unit tests for overwrite behavior.
- Tests include Korean or non-ASCII filename handling using generated temporary files, not committed sample PDFs.

Hard failure criteria:

- Output planning can overwrite user files without explicit overwrite intent.
- Directory conversion descends recursively without `--recursive`.

Primary agents:

- `feature-generator-agent`
- `evaluation-agent`

### Sprint 3: Domain Records, Metadata, And Warning Model

Active contract:

- `docs/Sprints/SPRINT3CONTRACT.md`

Objective:

- Define project-owned records before binding to MinerU output.

Touched surfaces:

- `ir.py`
- `metadata.py`
- `report.py` skeleton if needed
- Unit tests

Expected outputs:

- Document, page, block, asset, and warning records.
- Stable warning codes from `ARCHITECTURE.md`.
- Metadata JSON builder with required top-level and summary fields.
- Warning aggregation logic.

Verification checks:

- Unit tests for metadata schema creation.
- Unit tests for warning aggregation.
- Unit tests for optional fields such as bbox and confidence being preserved only when present.

Hard failure criteria:

- Public API requires raw MinerU objects.
- Metadata omits source PDF, SHA-256, engine, pages, warnings, assets, or summary.

Primary agents:

- `metadata-agent`
- `feature-generator-agent`
- `evaluation-agent`

### Sprint 4: MinerU Adapter With Mocked Contract

Active contract:

- `docs/Sprints/SPRINT4CONTRACT.md`

Objective:

- Build the direct local MinerU adapter boundary with mocked outputs first.

Touched surfaces:

- `mineru_adapter.py`
- `doctor.py` partial checks
- Adapter tests with fake subprocess results and fake output directories

Expected outputs:

- Adapter availability check.
- Version check.
- Direct CLI command construction.
- Strict-local command validation.
- Subprocess execution wrapper capturing stdout, stderr, exit code, and paths.
- Parsed adapter result object with raw Markdown, raw structured data when available, assets, warnings, engine, engine version, options, exit code, and stderr.
- Baseline command shape based on MinerU 3.1.0 direct local CLI: `mineru -p <input> -o <output>`.
- Strict-local validation allows CLI-internal temporary local `mineru-api` orchestration, while rejecting `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends.

Verification checks:

- Mocked successful MinerU output test.
- Mocked missing MinerU test.
- Mocked non-zero exit test.
- Test that prohibited remote/API flags cannot be introduced.
- No real MinerU/model dependency in default tests.

Hard failure criteria:

- Adapter passes `--api-url`, uses router mode, uses an HTTP client backend, or connects to a remote API or remote OpenAI-compatible backend.
- Adapter falls back to another engine after MinerU failure.
- Tests require model downloads by default.

Primary agents:

- `mineru-integration-agent`
- `feature-generator-agent`
- `evaluation-agent`

### Sprint 5: Obsidian Markdown Normalization And Assets

Active contract:

- `docs/Sprints/SPRINT5CONTRACT.md`

Objective:

- Normalize MinerU/project IR output into Obsidian-friendly Markdown.

Touched surfaces:

- `markdown.py`
- `quality.py` partial asset link checks
- Unit tests

Expected outputs:

- Inline math delimiter normalization to `$...$`.
- Display math delimiter normalization to `$$...$$`.
- Blank-line normalization around display math.
- Relative asset link normalization.
- Simple table preservation and complex table fallback warnings.
- No visible page markers by default.

Verification checks:

- Unit tests for inline math.
- Unit tests for display math spacing.
- Unit tests for underscores/carets inside math.
- Unit tests for relative asset links.
- Unit tests for table fallback warning behavior.

Hard failure criteria:

- Normalization rewrites LaTeX semantics without deterministic tests.
- Generated links are absolute when relative links are required.
- Page provenance is only visible in Markdown and missing from metadata.

Primary agents:

- `obsidian-markdown-agent`
- `feature-generator-agent`
- `evaluation-agent`

### Sprint 6: Quality Checks And Report Generation

Active contract:

- `docs/Sprints/SPRINT6CONTRACT.md`

Objective:

- Produce local quality signals and human-readable reports from metadata.

Touched surfaces:

- `quality.py`
- `report.py`
- `metadata.py`
- Unit tests

Expected outputs:

- Missing asset link count.
- Math renderability check interface with graceful unavailable-tool handling.
- Pages-with-warnings summary.
- `<stem>.report.md` generated from metadata.
- Final status: `success`, `partial`, or `failed`.

Verification checks:

- Unit tests for report content.
- Unit tests for missing asset link count.
- Unit tests for math render failure aggregation.
- Report generation does not re-run MinerU.

Hard failure criteria:

- Report diverges from JSON metadata.
- Math render failures are silently ignored.
- Quality checks require network access.

Primary agents:

- `metadata-agent`
- `evaluation-agent`
- `feature-generator-agent`

### Sprint 7: Conversion Orchestrator, CLI, And Python API

Active contract:

- `docs/Sprints/SPRINT7CONTRACT.md`

Objective:

- Connect path planning, MinerU adapter, normalization, metadata, report, and summaries.

Touched surfaces:

- `conversion.py`
- `cli.py`
- `__init__.py`
- CLI and API tests

Expected outputs:

- `convert_pdf(input_path, output_dir, metadata=True)` public API.
- `pdf2md convert INPUT --out OUTPUT_DIR`.
- `--metadata`, `--keep-raw`, `--recursive`, `--overwrite`, `--gpu`, and `--strict-local` behavior.
- Batch conversion for directories.
- CLI summary with warning counts.

Verification checks:

- API test with mocked MinerU adapter.
- CLI single PDF test with mocked MinerU adapter.
- CLI directory test with mocked MinerU adapter.
- Existing output test.
- Failure summary test.

Hard failure criteria:

- Public API exposes raw MinerU objects as required return fields.
- CLI writes outputs after a hard failure that should stop conversion.
- CLI suppresses warning counts.

Primary agents:

- `feature-generator-agent`
- `requirements-guard-agent`
- `evaluation-agent`

### Sprint 8: Doctor And Setup Documentation

Active contract:

- `docs/Sprints/SPRINT8CONTRACT.md`

Status:

- Implemented.

Objective:

- Make local setup failures explicit before users run conversions.

Touched surfaces:

- `doctor.py`
- `cli.py`
- `README.md`
- `scripts/install-mineru.ps1`
- `scripts/install-models.py`
- Tests for mocked environment checks

Expected outputs:

- `pdf2md doctor` reports Python version, `uv`, CUDA/PyTorch GPU visibility, MinerU availability, MinerU version, and detectable model/cache paths.
- GPU unavailable warning is clear.
- Missing `uv` is reported clearly.
- Pre-Turing/Pascal GPU risk is reported clearly for GTX 1070 Ti compute capability 6.1.
- Missing required dependency causes doctor failure.
- Setup docs explain Windows PowerShell, Python 3.12, `uv`, MinerU, models, GPU expectations, and local-only behavior.

Verification checks:

- Mocked doctor tests for success, missing MinerU, missing GPU, and missing dependency.
- Documentation review for no cloud/API runtime path.

Hard failure criteria:

- Doctor says the environment is healthy when MinerU is missing.
- Doctor implies cloud/API fallback is supported.

Primary agents:

- `local-setup-agent`
- `license-privacy-agent`
- `evaluation-agent`

### Sprint 9: Local Fixture Evaluation And V1 Release Gate

Active contract:

- `docs/Sprints/SPRINT9CONTRACT.md`

Status:

- Implemented.

Objective:

- Validate the end-to-end v1 behavior against local samples without committing samples.

Touched surfaces:

- `tests/integration/`
- Optional local-only fixture manifest that does not include sample PDFs
- `README.md`
- `PROGRESS.md`

Expected outputs:

- Fast mocked integration suite.
- Optional MinerU-dependent local test command.
- Local sample coverage notes in `PROGRESS.md`.
- V1 release checklist status.

Verification checks:

- `uv run pytest` passes without model downloads.
- Optional MinerU test is clearly marked and skipped unless explicitly enabled.
- Representative sample produces Markdown, metadata JSON, report Markdown, and asset paths.
- Obsidian math delimiter expectations are met.
- No sample PDFs are staged.

Hard failure criteria:

- Default tests require GPU, MinerU models, or network access.
- Sample files are added to git unintentionally.
- V1 release checklist passes without metadata/report generation.

Primary agents and skills:

- `evaluation-agent`
- `requirements-guard-agent`
- `fixture-evaluation` skill

### Sprint 10: Pre-Conversion PDF Page Chunking

Active contract:

- `docs/Sprints/SPRINT10CONTRACT.md`

Status:

- Implemented.

Objective:

- Split long PDFs into temporary fixed-size page chunks before MinerU conversion.

Touched surfaces:

- `pdf_splitter.py`
- `conversion.py`
- `cli.py`
- `report.py`
- README and Sprint 10 documentation
- Unit tests for splitter, conversion, CLI, and report behavior

Expected outputs:

- `pdf2md convert INPUT --out OUTPUT --chunk-pages` enables 20-page chunks.
- `pdf2md convert INPUT --out OUTPUT --chunk-pages N` enables custom positive chunk size.
- `convert_pdf(..., chunk_pages=N)` returns a `BatchConversionResult` in chunk mode.
- Temporary chunk PDFs are deleted after conversion completes.
- Chunk Markdown files are separate and named with original page ranges.
- Metadata and report content expose original source path and chunk page ranges.

Verification checks:

- pypdf-based local blank PDF tests cover page counts, chunk ranges, and written chunk page counts.
- Mocked conversion tests verify one adapter call per chunk, failed-chunk continuation, chunk metadata/report context, and temporary chunk cleanup.
- CLI tests verify `--chunk-pages` without a value uses 20 pages.

Hard failure criteria:

- Chunking uploads document content or uses another conversion engine.
- Chunk outputs are merged.
- Default tests require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.

## 6. Cross-Cutting Acceptance Criteria

Every implementation sprint must preserve these acceptance criteria:

- No runtime remote document processing path exists.
- MinerU is the only conversion engine.
- Failures are explicit and traceable.
- Warnings are structured and countable.
- Markdown and metadata can be traced back to source pages where available.
- Reports are generated from metadata.
- Default tests are fast and local.
- `samples/` remains untracked unless explicitly requested.

## 7. First Implementation Request Contract Template

Use this template when implementation begins.

```markdown
## Sprint Contract

Objective:

Touched surfaces:

Expected outputs:

Non-goals:

Verification checks:

Hard failure criteria:

Handoff fields:
- Files changed:
- Commands run:
- Tests passed:
- Known failures:
- Residual risks:
- Next action:
```

## 8. Open Risks

- MinerU 3.1.0 install and CLI behavior are source-verified, but real local output still needs a later local probe before release.
- GTX 1070 Ti 8GB is visible locally, but it is Pascal compute capability 6.1; `doctor` and setup docs must make CUDA/PyTorch limits clear.
- `uv` is installed per-user at `C:\Users\user\.local\bin`, but a new shell may need PATH refresh before `uv` is visible.
- Formula renderability checks need a local tool choice; the implementation should start with an interface and graceful unavailable-tool warning if needed.
- Some PDFs will have tables or formulas that cannot be faithfully represented in Markdown; metadata and `.report.md` must surface this instead of hiding it.
- Redistribution license obligations must be reviewed before packaging, redistribution, or bundling model weights.

## 9. Recommended Next Step

Run optional real local MinerU validation on a long sample only when requested. Default verification should continue to use mocked adapters and generated temporary PDFs so it remains independent of MinerU, GPU, model files, network access, and `samples/`.

Facts carried forward from Sprint 0:

- MinerU is fixed to version 3.1.0.
- Direct local CLI command shape is `mineru -p <input> -o <output>`.
- MinerU output layout should be treated as optional-file based until locally probed.
- Python 3.12 is compatible with the pinned MinerU package range.
- GTX 1070 Ti CUDA/PyTorch support needs explicit doctor validation.
- MinerU/model license posture is acceptable for personal local use. Redistribution remains out of scope until reviewed.