# PRD: Local PDF-to-Markdown Converter

Last updated: 2026-05-07

## 1. Summary

Build a local-only CLI and Python library that converts math-heavy digital PDFs into Obsidian-friendly Markdown. The product prioritizes accurate LaTeX reconstruction for equations, preservation of document structure, stable asset links, and traceable page-level metadata.

The first version is for personal/research use, targets NVIDIA GPU machines, and uses MinerU 3.1.0 as the fixed conversion engine. It should process digital PDFs with existing text layers first. Scanned books, cloud OCR APIs, web UI, and manual review workflows are out of scope for v1.

## 2. Goals

- Convert a single PDF into one Markdown file plus assets, metadata JSON, and a human-readable quality report.
- Convert a folder of PDFs in batch mode.
- Preserve inline math as `$...$` and display math as `$$...$$`.
- Produce Markdown that opens cleanly in Obsidian.
- Use MinerU 3.1.0 locally.
- Keep enough metadata to diagnose formula, layout, and reading-order errors.
- Continue conversion automatically when a page or formula is low-confidence, while logging warnings.

## 3. Non-Goals

- No cloud OCR, cloud LLM, or third-party document upload in v1.
- No web app or GUI in v1.
- No manual review queue in v1.
- No optimization for low-quality scanned books in v1.
- No guaranteed perfect LaTeX reconstruction.
- No multi-user server or hosted API in v1.
- No commercial redistribution assumptions until dependency licenses are reviewed.

## 4. Target Users

Primary user:

- A researcher, student, or developer converting math-heavy papers/books into Obsidian notes.

## 5. Input Scope

Supported in v1:

- Local `.pdf` files.
- Directories containing `.pdf` files.
- Digital PDFs with embedded text layers.
- Academic papers with sections, references, figures, captions, tables, inline math, and display equations.

Best effort in v1:

- Multi-column academic layouts.
- Tables containing math.
- Figures and captions.
- Page numbers, headers, and footers.

Out of scope for v1 optimization:

- Poor-quality scans.
- Handwritten math.
- Camera photos.
- Password-protected PDFs.
- Damaged PDFs that cannot be opened by local tooling.

## 6. Output Scope

For each input PDF, the converter writes:

- A normalized Markdown file.
- An assets directory when MinerU extracts images or other media.
- A metadata JSON file.
- A human-readable quality report named `.report.md`.
- Optional raw MinerU outputs for debugging.

Markdown rules:

- Inline equations use `$...$`.
- Display equations use `$$...$$` on separate lines.
- Simple tables use Markdown pipe tables.
- Complex tables may use HTML when Markdown would lose structure.
- Images use relative links to the generated assets directory.
- Visible page markers should be avoided by default; page provenance belongs in metadata.
- Obsidian compatibility is the output standard.

Detailed Markdown normalization rules are defined in `ARCHITECTURE.md`.

## 7. CLI Requirements

The CLI binary should be named `pdf2md`.

Required commands:

```bash
pdf2md convert INPUT --out OUTPUT_DIR
pdf2md doctor
```

`convert` behavior:

- If `INPUT` is a PDF, convert that file.
- If `INPUT` is a directory, convert PDFs in that directory.
- Directory conversion requires `--recursive` to descend into subdirectories.
- Output filenames default to the source PDF stem plus `.md`.
- Asset directories default to `.assets`.
- Existing outputs are not overwritten unless `--overwrite` is passed.

Required `convert` options:

- `--out PATH`: output directory.
- `--metadata`: write metadata JSON. Enabled by default in v1.
- `--keep-raw`: keep raw MinerU output for debugging.
- `--recursive`: recursively process directory inputs.
- `--overwrite`: replace existing outputs.
- `--gpu DEVICE`: select CUDA device. Default: `cuda:0`.
- `--strict-local`: forbid remote network/cloud execution during conversion. Default: true.

`doctor` behavior:

- Report Python version.
- Report `uv` availability.
- Report CUDA/PyTorch GPU availability when detectable.
- Report MinerU availability.
- Report local model/cache paths when detectable.
- Warn if no NVIDIA GPU is available.
- Fail if required v1 runtime dependencies are missing.

## 8. Python Library Requirements

The library should expose a stable API suitable for scripts and tests.

Required high-level API:

```python
from pdf2md import convert_pdf

result = convert_pdf(
    input_path="paper.pdf",
    output_dir="out",
    metadata=True,
)
```

Required return fields:

- `markdown_path`
- `metadata_path`
- `assets_dir`
- `warnings`
- `engine`
- `pages_processed`

The public API should not expose raw MinerU objects as required return types. MinerU-specific data may be stored under optional metadata fields.

## 9. Metadata Requirements

When `--metadata` is enabled, write `.metadata.json`.

Required top-level fields:

- `source_pdf`
- `source_sha256`
- `created_at`
- `engine`
- `engine_version`
- `engine_options`
- `pages`
- `assets`
- `warnings`
- `summary`

Required summary fields:

- `pages_processed`
- `warning_count`
- `asset_count`
- `display_formula_count`
- `inline_formula_count`
- `math_render_error_count`

Warnings must be non-fatal unless the source file cannot be read or no output can be produced.

Detailed metadata fields, block types, and warning codes are defined in `ARCHITECTURE.md`.

## 10. Quality Report Requirements

For every conversion, write `.report.md`.

The report must be readable without opening the JSON metadata and include:

- Source PDF path.
- Output Markdown path.
- MinerU version.
- Page count.
- Warning count.
- Asset count.
- Inline formula count.
- Display formula count.
- Math render error count.
- Missing asset link count.
- A short list of pages with warnings.

## 11. Quality Policy

The product is fully automatic in v1.

- Low-confidence formulas are included in the output as best effort.
- Low-confidence pages are included in the output as best effort.
- The converter logs warnings and metadata records.
- Conversion uses MinerU's default local CLI execution. If MinerU cannot run or fails, the converter must emit a clear error/warning instead of silently falling back to another backend.
- Conversion fails only when the input cannot be opened, MinerU cannot run, output cannot be written, no usable output can be produced, or local-only policy is violated.

The CLI summary must report warning counts clearly.

## 12. Local-Only Policy

The implementation must not upload PDFs or page images to cloud APIs.

Prohibited in v1 runtime:

- Any cloud OCR API.
- Any hosted document parsing API.
- Any remote LLM or VLM call.
- Remote model inference endpoints.

Allowed:

- Local model files.
- Local Python packages.
- Local CLI tools.
- Documentation links.
- Explicit installation downloads initiated by the user during setup.

`--strict-local` is on by default.

The MinerU adapter must not use remote endpoints in strict-local mode.

Allowed in v1 runtime:

- Direct `mineru` CLI execution.
- The temporary local `mineru-api` process that MinerU 3.1.0 starts internally when the CLI runs without `--api-url`.

Prohibited in v1 runtime:

- `--api-url`.
- Remote APIs.
- Router mode.
- HTTP client backends.
- Remote OpenAI-compatible backends or inference endpoints.

Detailed strict-local enforcement rules are defined in `ARCHITECTURE.md`.

## 13. Installation Requirements

Use `uv` as the primary project workflow.

Expected setup commands:

```bash
uv sync
uv run pdf2md doctor
```

MinerU/model setup may require additional scripts, for example:

```bash
uv run scripts/install-mineru.ps1
uv run scripts/install-models.py
```

The project should document NVIDIA GPU/CUDA expectations and provide clear errors when GPU acceleration is unavailable.

## 14. Test Requirements

Required test categories:

- Unit tests for Markdown math delimiter normalization.
- Unit tests for asset path normalization.
- Unit tests for metadata schema creation.
- Unit tests for warning aggregation.
- MinerU adapter contract tests with mocked outputs.
- CLI tests for single PDF, directory input, overwrite behavior, and metadata output.

Fixture categories:

- Small digital PDF with simple text and math.
- Math-heavy academic paper page.
- Multi-column paper page.
- Table with formulas.
- Figure with caption.

Acceptance checks:

- Markdown exists after conversion.
- Metadata exists when requested.
- Quality report exists after conversion.
- Asset links resolve.
- Inline/display math delimiters match Obsidian expectations.
- Math render checks report failures instead of silently passing.
- No cloud calls are made.
- Warnings do not stop conversion unless MinerU cannot produce output.
- MinerU failure produces a clear error/warning and does not silently switch backend.

## 15. Release Criteria for v1

v1 is acceptable when:

- `pdf2md convert paper.pdf --out out --metadata` works on a representative digital academic PDF.
- `pdf2md convert pdfs --out out --recursive --metadata` works on a small folder.
- `pdf2md doctor` reports MinerU/GPU status clearly.
- The default output opens in Obsidian with math blocks rendered.
- Metadata links pages, blocks, warnings, and assets to the source PDF.
- `.report.md` summarizes warnings, formulas, assets, and render/link check results.
- The README or setup docs explain local-only behavior and GPU expectations.
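
The link check results mentioned in the release criteria, together with the "Asset links resolve" acceptance check and the missing asset link count in the quality report, could be backed by a check along the following lines. This is a minimal sketch under stated assumptions, not the project's implementation: `find_missing_asset_links` and `IMAGE_LINK` are hypothetical names, and the pattern only recognizes plain `![alt](path)` image links without titles or spaces in the path.

```python
import re
from pathlib import Path

# Hypothetical, simplified pattern for Markdown image links such as
# ![caption](paper.assets/fig1.png). Titles and paths with spaces are not handled.
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def find_missing_asset_links(markdown_path: Path) -> list[str]:
    """Return relative image targets in the Markdown file that do not exist on disk."""
    text = markdown_path.read_text(encoding="utf-8")
    missing = []
    for target in IMAGE_LINK.findall(text):
        # Remote URLs are skipped; only local relative links are checked.
        if "://" in target:
            continue
        # Relative links are resolved against the directory of the Markdown file.
        if not (markdown_path.parent / target).is_file():
            missing.append(target)
    return missing
```

The length of the returned list could feed the missing asset link count in the report, and an empty list would satisfy the acceptance check. HTML `<img>` tags inside complex tables would need a separate pass, which this sketch does not attempt.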