21 KiB
V1 Implementation Plan: Local PDF-to-Markdown Converter
Last updated: 2026-05-08
This document is the implementation plan for v1. It does not replace PRD.md or ARCHITECTURE.md; use those files as the source of product requirements and system design. This plan explains the order of work, sprint contracts, verification gates, and agent ownership for implementing the converter.
Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the pdf2md convert CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the pdf2md doctor CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents. Sprint 11 implemented conservative MathJax warning mitigation for failed math spans.
1. V1 Outcome
v1 is complete when a local user can run:
uv run pdf2md doctor
uv run pdf2md convert paper.pdf --out out --metadata
uv run pdf2md convert pdfs --out out --recursive --metadata
and receive, for each PDF:
- Obsidian-friendly Markdown.
- A stable sibling assets directory when assets exist.
<stem>.metadata.json.<stem>.report.md.- Clear warnings when math, tables, assets, reading order, GPU availability, or MinerU execution are uncertain.
Long PDFs can be chunked explicitly:
uv run pdf2md convert paper.pdf --out out --chunk-pages
uv run pdf2md convert paper.pdf --out out --chunk-pages 20
Chunked conversion writes separate outputs per chunk and does not merge Markdown files.
The converter must use MinerU 3.1.0 through direct local CLI execution only. It must not silently fallback to another engine.
2. Non-Negotiable Constraints
- Python 3.12 and
uv. - MinerU 3.1.0 is the only conversion engine.
- Direct local MinerU CLI execution only.
- MinerU 3.1.0 may launch a temporary local
mineru-apiinternally when CLI runs without--api-url. - No cloud OCR, hosted LLM/VLM, remote document parser,
--api-url, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends. - Target hardware: NVIDIA GTX 1070 Ti 8GB.
- Digital PDFs with text layers are the v1 priority.
samples/is local fixture context and must not be committed unless explicitly requested.- Every substantial implementation chunk needs a sprint contract and independent evaluation.
3. Harness Operating Model
Use the project long-running harness only for substantial implementation work.
harness-planner-agentturns the next user request into a sprint contract.evaluation-agentreviews the contract before code changes start.feature-generator-agentimplements one approved contract at a time.feature-generator-agentruns self-checks and records residual risks.evaluation-agentindependently verifies the result against the contract.- The parent agent updates
PROGRESS.md, commits the completed change, and leaves a handoff.
After a chunk is no longer active, archive completed-work details in docs/WORKARCHIVE.md and keep PROGRESS.md focused on current status, blockers, and next actions.
Each sprint contract must include:
- Objective.
- Touched surfaces.
- Expected outputs.
- Non-goals.
- Verification checks.
- Hard failure criteria.
- Handoff fields.
4. Proposed Repository Layout
Create this layout incrementally; do not scaffold unused modules before a sprint needs them.
pyproject.toml
README.md
src/
pdf2md/
__init__.py
cli.py
conversion.py
pdf_splitter.py
paths.py
mineru_adapter.py
ir.py
markdown.py
metadata.py
quality.py
report.py
doctor.py
tests/
unit/
integration/
fixtures/
scripts/
install-mineru.ps1
install-models.py
Planned module responsibilities:
cli.py: command parsing, CLI summaries, exit codes.conversion.py: orchestration for one PDF and batch input.paths.py: input discovery, output path planning, overwrite checks.mineru_adapter.py: direct local MinerU CLI boundary.ir.py: project-owned document/page/block/asset/warning records.markdown.py: Obsidian Markdown normalization.metadata.py: metadata schema creation and warning aggregation.quality.py: local checks for assets, math renderability, and output sanity.report.py:<stem>.report.mdgeneration from metadata.doctor.py: environment, dependency, CUDA/GPU, MinerU, and cache diagnostics.
5. Sprint Sequence
Sprint 0: Source And Environment Verification
Active contract:
docs/Sprints/SPRINT0CONTRACT.md
Objective:
- Verify the facts needed before implementation starts.
Touched surfaces:
docs/KNOWLEDGEBASE.mddocs/V1IMPLEMENTATIONPLAN.mdif sequencing changesPROGRESS.md
Expected outputs:
- Confirmed MinerU 3.1.0 install command, CLI invocation shape, version command, output paths, and local execution behavior.
- Confirmed Python 3.12,
uv, CUDA/PyTorch, and GTX 1070 Ti 8GB risks. - Confirmed license notes needed before redistribution.
Verification checks:
- All volatile facts cite official MinerU, Python, uv, PyTorch/CUDA, or license sources.
- No candidate engine comparison is reintroduced.
- No implementation code is created.
Hard failure criteria:
- MinerU 3.1.0 cannot be reasonably invoked through a direct local CLI on the target environment.
- Python 3.12 compatibility is not viable without changing project requirements.
Primary agents:
research-agentlocal-setup-agentlicense-privacy-agent
Sprint 1: Project Scaffold And Fast Test Loop
Active contract:
docs/Sprints/SPRINT1CONTRACT.md
Objective:
- Create the minimal Python project structure and a fast local test loop.
Touched surfaces:
pyproject.tomlsrc/pdf2md/__init__.pytests/- Development documentation if needed
Expected outputs:
uv syncworks.uv run pytestworks.- Project package imports as
pdf2md. - CLI entry point name
pdf2mdis reserved but may initially expose onlydoctoror a clear placeholder until later sprints. - If
uvis still unavailable locally, Sprint 1 records that blocker and is not marked complete.
Verification checks:
- Import test passes.
- Empty test suite or initial scaffold tests pass.
- No runtime network dependency is introduced.
Hard failure criteria:
- Project cannot be installed with
uv. - Scaffolding adds speculative config systems, extra engines, or unused abstractions.
Primary agents:
harness-planner-agentfeature-generator-agentevaluation-agent
Sprint 2: Paths, Input Discovery, And Overwrite Planning
Active contract:
docs/Sprints/SPRINT2CONTRACT.md
Objective:
- Implement deterministic input and output planning before conversion logic exists.
Touched surfaces:
paths.pyconversion.pyskeleton if needed- CLI path handling tests
Expected outputs:
- Single PDF discovery.
- Directory PDF discovery.
- Recursive traversal only when requested.
- Deterministic output paths for Markdown, assets, metadata JSON, report, and optional raw MinerU output.
- Existing-output protection unless
--overwriteis passed.
Verification checks:
- Unit tests for single PDF path planning.
- Unit tests for directory and recursive discovery.
- Unit tests for overwrite behavior.
- Tests include Korean or non-ASCII filename handling using generated temporary files, not committed sample PDFs.
Hard failure criteria:
- Output planning can overwrite user files without explicit overwrite intent.
- Directory conversion descends recursively without
--recursive.
Primary agents:
feature-generator-agentevaluation-agent
Sprint 3: Domain Records, Metadata, And Warning Model
Active contract:
docs/Sprints/SPRINT3CONTRACT.md
Objective:
- Define project-owned records before binding to MinerU output.
Touched surfaces:
ir.pymetadata.pyreport.pyskeleton if needed- Unit tests
Expected outputs:
- Document, page, block, asset, and warning records.
- Stable warning codes from
ARCHITECTURE.md. - Metadata JSON builder with required top-level and summary fields.
- Warning aggregation logic.
Verification checks:
- Unit tests for metadata schema creation.
- Unit tests for warning aggregation.
- Unit tests for optional fields such as bbox and confidence being preserved only when present.
Hard failure criteria:
- Public API requires raw MinerU objects.
- Metadata omits source PDF, SHA-256, engine, pages, warnings, assets, or summary.
Primary agents:
metadata-agentfeature-generator-agentevaluation-agent
Sprint 4: MinerU Adapter With Mocked Contract
Active contract:
docs/Sprints/SPRINT4CONTRACT.md
Objective:
- Build the direct local MinerU adapter boundary with mocked outputs first.
Touched surfaces:
mineru_adapter.pydoctor.pypartial checks- Adapter tests with fake subprocess results and fake output directories
Expected outputs:
- Adapter availability check.
- Version check.
- Direct CLI command construction.
- Strict-local command validation.
- Subprocess execution wrapper capturing stdout, stderr, exit code, and paths.
- Parsed adapter result object with raw Markdown, raw structured data when available, assets, warnings, engine, engine version, options, exit code, and stderr.
- Baseline command shape based on MinerU 3.1.0 direct local CLI:
mineru -p <input> -o <output>. - Strict-local validation allows CLI-internal temporary local
mineru-apiorchestration, while rejecting--api-url, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends.
Verification checks:
- Mocked successful MinerU output test.
- Mocked missing MinerU test.
- Mocked non-zero exit test.
- Test that prohibited remote/API flags cannot be introduced.
- No real MinerU/model dependency in default tests.
Hard failure criteria:
- Adapter passes
--api-url, uses router mode, uses an HTTP client backend, or connects to a remote API or remote OpenAI-compatible backend. - Adapter falls back to another engine after MinerU failure.
- Tests require model downloads by default.
Primary agents:
mineru-integration-agentfeature-generator-agentevaluation-agent
Sprint 5: Obsidian Markdown Normalization And Assets
Active contract:
docs/Sprints/SPRINT5CONTRACT.md
Objective:
- Normalize MinerU/project IR output into Obsidian-friendly Markdown.
Touched surfaces:
markdown.pyquality.pypartial asset link checks- Unit tests
Expected outputs:
- Inline math delimiter normalization to
$...$. - Display math delimiter normalization to
$$...$$. - Blank-line normalization around display math.
- Relative asset link normalization.
- Simple table preservation and complex table fallback warnings.
- No visible page markers by default.
Verification checks:
- Unit tests for inline math.
- Unit tests for display math spacing.
- Unit tests for underscores/carets inside math.
- Unit tests for relative asset links.
- Unit tests for table fallback warning behavior.
Hard failure criteria:
- Normalization rewrites LaTeX semantics without deterministic tests.
- Generated links are absolute when relative links are required.
- Page provenance is only visible in Markdown and missing from metadata.
Primary agents:
obsidian-markdown-agentfeature-generator-agentevaluation-agent
Sprint 6: Quality Checks And Report Generation
Active contract:
docs/Sprints/SPRINT6CONTRACT.md
Objective:
- Produce local quality signals and human-readable reports from metadata.
Touched surfaces:
quality.pyreport.pymetadata.py- Unit tests
Expected outputs:
- Missing asset link count.
- Math renderability check interface with graceful unavailable-tool handling.
- Pages-with-warnings summary.
<stem>.report.mdgenerated from metadata.- Final status:
success,partial, orfailed.
Verification checks:
- Unit tests for report content.
- Unit tests for missing asset link count.
- Unit tests for math render failure aggregation.
- Report generation does not re-run MinerU.
Hard failure criteria:
- Report diverges from JSON metadata.
- Math render failures are silently ignored.
- Quality checks require network access.
Primary agents:
metadata-agentevaluation-agentfeature-generator-agent
Sprint 7: Conversion Orchestrator, CLI, And Python API
Active contract:
docs/Sprints/SPRINT7CONTRACT.md
Objective:
- Connect path planning, MinerU adapter, normalization, metadata, report, and summaries.
Touched surfaces:
conversion.pycli.py__init__.py- CLI and API tests
Expected outputs:
convert_pdf(input_path, output_dir, metadata=True)public API.pdf2md convert INPUT --out OUTPUT_DIR.--metadata,--keep-raw,--recursive,--overwrite,--gpu, and--strict-localbehavior.- Batch conversion for directories.
- CLI summary with warning counts.
Verification checks:
- API test with mocked MinerU adapter.
- CLI single PDF test with mocked MinerU adapter.
- CLI directory test with mocked MinerU adapter.
- Existing output test.
- Failure summary test.
Hard failure criteria:
- Public API exposes raw MinerU objects as required return fields.
- CLI writes outputs after a hard failure that should stop conversion.
- CLI suppresses warning counts.
Primary agents:
feature-generator-agentrequirements-guard-agentevaluation-agent
Sprint 8: Doctor And Setup Documentation
Active contract:
docs/Sprints/SPRINT8CONTRACT.md
Status:
- Implemented.
Objective:
- Make local setup failures explicit before users run conversions.
Touched surfaces:
doctor.pycli.pyREADME.mdscripts/install-mineru.ps1scripts/install-models.py- Tests for mocked environment checks
Expected outputs:
pdf2md doctorreports Python version,uv, CUDA/PyTorch GPU visibility, MinerU availability, MinerU version, and detectable model/cache paths.- GPU unavailable warning is clear.
- Missing
uvis reported clearly. - Pre-Turing/Pascal GPU risk is reported clearly for GTX 1070 Ti compute capability 6.1.
- Missing required dependency causes doctor failure.
- Setup docs explain Windows PowerShell, Python 3.12,
uv, MinerU, models, GPU expectations, and local-only behavior.
Verification checks:
- Mocked doctor tests for success, missing MinerU, missing GPU, and missing dependency.
- Documentation review for no cloud/API runtime path.
Hard failure criteria:
- Doctor says the environment is healthy when MinerU is missing.
- Doctor implies cloud/API fallback is supported.
Primary agents:
local-setup-agentlicense-privacy-agentevaluation-agent
Sprint 9: Local Fixture Evaluation And V1 Release Gate
Active contract:
docs/Sprints/SPRINT9CONTRACT.md
Status:
- Implemented.
Objective:
- Validate the end-to-end v1 behavior against local samples without committing samples.
Touched surfaces:
tests/integration/- Optional local-only fixture manifest that does not include sample PDFs
README.mdPROGRESS.md
Expected outputs:
- Fast mocked integration suite.
- Optional MinerU-dependent local test command.
- Local sample coverage notes in
PROGRESS.md. - V1 release checklist status.
Verification checks:
uv run pytestpasses without model downloads.- Optional MinerU test is clearly marked and skipped unless explicitly enabled.
- Representative sample produces Markdown, metadata JSON, report Markdown, and asset paths.
- Obsidian math delimiter expectations are met.
- No sample PDFs are staged.
Hard failure criteria:
- Default tests require GPU, MinerU models, or network access.
- Sample files are added to git unintentionally.
- V1 release checklist passes without metadata/report generation.
Primary agents and skills:
evaluation-agentrequirements-guard-agentfixture-evaluationskill
Sprint 10: Pre-Conversion PDF Page Chunking
Active contract:
docs/Sprints/SPRINT10CONTRACT.md
Status:
- Implemented.
Objective:
- Split long PDFs into temporary fixed-size page chunks before MinerU conversion.
Touched surfaces:
pdf_splitter.pyconversion.pycli.pyreport.py- README and Sprint 10 documentation
- Unit tests for splitter, conversion, CLI, and report behavior
Expected outputs:
pdf2md convert INPUT --out OUTPUT --chunk-pagesenables 20-page chunks.pdf2md convert INPUT --out OUTPUT --chunk-pages Nenables custom positive chunk size.convert_pdf(..., chunk_pages=N)returns aBatchConversionResultin chunk mode.- Temporary chunk PDFs are deleted after conversion completes.
- Chunk Markdown files are separate and named with original page ranges.
- Metadata and report content expose original source path and chunk page ranges.
Verification checks:
- pypdf-based local blank PDF tests cover page counts, chunk ranges, and written chunk page counts.
- Mocked conversion tests verify one adapter call per chunk, failed-chunk continuation, chunk metadata/report context, and temporary chunk cleanup.
- CLI tests verify
--chunk-pageswithout a value uses 20 pages.
Hard failure criteria:
- Chunking uploads document content or uses another conversion engine.
- Chunk outputs are merged.
- Default tests require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or
samples/.
Sprint 11: MathJax Warning Mitigation
Active contract:
docs/Sprints/SPRINT11CONTRACT.md
Status:
- Implemented.
Objective:
- Repair narrow MathJax-invalid formula artifacts after initial local validation and before final output writing.
Touched surfaces:
quality.pymath_repair.pyconversion.pyir.py- Unit tests for quality details, repair rules, conversion, and recheck behavior
Expected outputs:
- Failed math expression records expose body, display mode, span, and checker message.
- Repair candidates are generated only for failed math spans.
- Repeated same-direction scripts are disambiguated with an empty group.
- Truncated
\end{a}array endings are repaired when array environments are unbalanced. convertandrecheckshare the same repair behavior.- Applied repairs are recorded as
MATH_RENDER_REPAIREDinfo warnings and do not count as math render errors.
Verification checks:
- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or
samples/. samples/MITC공부.pdfvalidates locally withMath render error count: 0.
Hard failure criteria:
- Repair changes math spans that did not fail local MathJax validation.
- Repair claims success without candidate revalidation.
- Repair introduces remote services, alternate engines, or mandatory sample-dependent default tests.
6. Cross-Cutting Acceptance Criteria
Every implementation sprint must preserve these acceptance criteria:
- No runtime remote document processing path exists.
- MinerU is the only conversion engine.
- Failures are explicit and traceable.
- Warnings are structured and countable.
- Markdown and metadata can be traced back to source pages where available.
- Reports are generated from metadata.
- Default tests are fast and local.
samples/remains untracked unless explicitly requested.
7. First Implementation Request Contract Template
Use this template when implementation begins.
## Sprint Contract
Objective:
Touched surfaces:
Expected outputs:
Non-goals:
Verification checks:
Hard failure criteria:
Handoff fields:
- Files changed:
- Commands run:
- Tests passed:
- Known failures:
- Residual risks:
- Next action:
8. Open Risks
- MinerU 3.1.0 install and CLI behavior are source-verified, but real local output still needs a later local probe before release.
- GTX 1070 Ti 8GB is visible locally, but it is Pascal compute capability 6.1;
doctorand setup docs must make CUDA/PyTorch limits clear. uvis installed per-user atC:\Users\user\.local\bin, but a new shell may need PATH refresh beforeuvis visible.- Formula renderability checks and conservative warning mitigation are implemented, but formula reconstruction remains best effort and should keep warnings/provenance visible.
- Some PDFs will have tables or formulas that cannot be faithfully represented in Markdown; metadata and
.report.mdmust surface this instead of hiding it. - Redistribution license obligations must be reviewed before packaging, redistribution, or bundling model weights.
9. Recommended Next Step
Run optional real local MinerU validation on a long sample only when requested. Default verification should continue to use mocked adapters and generated temporary PDFs so it remains independent of MinerU, GPU, model files, network access, and samples/.
Facts carried forward from Sprint 0:
- MinerU is fixed to version 3.1.0.
- Direct local CLI command shape is
mineru -p <input> -o <output>. - MinerU output layout should be treated as optional-file based until locally probed.
- Python 3.12 is compatible with the pinned MinerU package range.
- GTX 1070 Ti CUDA/PyTorch support needs explicit doctor validation.
- MinerU/model license posture is acceptable for personal local use. Redistribution remains out of scope until reviewed.