# Knowledge Base: Local PDF-to-Markdown Converter for Math-Heavy Documents Last updated: 2026-05-11 ## 1. Product Direction This project will build a local-first PDF-to-Markdown converter for math-heavy academic PDFs and books. The v1 target is intentionally narrow: - Processing policy: local-only. Do not send user PDFs to cloud OCR or external AI APIs. - Primary interface: CLI plus Python library. A later thin local desktop launcher may wrap the CLI, but it must not become a separate conversion pipeline. - Primary output: Obsidian-friendly Markdown. - Main conversion engine: MinerU 3.1.0. - Math output: inline math as `$...$`, display math as `$$...$$`. - Hardware target: NVIDIA GPU. - PDF scope: digital PDFs with an existing text layer first. Scanned books and poor-quality scans are out of scope for v1 optimization. - Quality workflow: fully automatic conversion. Low-confidence regions should be logged and represented in metadata, but conversion should not stop. - Install target: Python with `uv`; scripts should document local model download/setup. - License posture: personal use. License terms are not a v1 blocker for this local project, but document MinerU and transitive model/package licenses before any redistribution. The rest of this document records the research basis and implementation implications for these decisions. This file is background research, not a second requirements source. Use `PRD.md` for product requirements and `ARCHITECTURE.md` for implementation structure. ## 2. Why Math PDF Conversion Is Hard PDF is a visual/page-description format, not a semantic source format. In scientific PDFs, the displayed equation often survives visually while the source-level LaTeX structure is lost. The Nougat paper states the core issue clearly: scientific knowledge is frequently stored in PDFs, and the PDF format loses semantic information, especially for mathematical expressions. Source: [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418). For this project, equation conversion must be treated as a document understanding and formula recognition problem, not just text extraction. A robust converter needs to recover: - Reading order across multi-column layouts. - Inline and display equation boundaries. - LaTeX syntax that renders in common Markdown math renderers. - Tables, captions, figures, references, footnotes, and image assets. - Page-level provenance for debugging and downstream correction. Digital PDFs make the problem easier because embedded text can be extracted directly, but formulas may still appear as glyph streams, vector paths, images, or fragmented positioned characters. A v1 product should therefore use the PDF text layer where reliable and use document parsing/OCR models for layout and math reconstruction. ## 3. MinerU 3.1.0 Engine Strategy MinerU 3.1.0 is the fixed local parser for v1. Relevant source facts: - MinerU 3.1.0 is published on PyPI, requires Python `>=3.10,<3.14`, and describes support for PDF, images, DOCX, PPTX, and XLSX to Markdown and JSON. Source: [PyPI mineru 3.1.0](https://pypi.org/project/mineru/3.1.0/). - MinerU 3.1.0 release notes describe a license move to the MinerU Open Source License, a VLM main model upgrade to `MinerU2.5-Pro-2604-1.2B`, improved image/chart parsing, truncated paragraph merging, cross-page table merging, table-internal image recognition, and native PPTX/XLSX parsing. Source: [MinerU releases](https://github.com/opendatalab/MinerU/releases). - MinerU's current quick usage documents the CLI shape as `mineru -p -o ` and states that without `--api-url`, the CLI launches a temporary local `mineru-api`. Source: [MinerU quick usage docs](https://opendatalab.github.io/MinerU/usage/quick_usage/). - Starting with MinerU 3.0, the `mineru` command is an orchestration client on top of `mineru-api`. Source: [MinerU usage guide](https://opendatalab.github.io/MinerU/usage/). - MinerU model source configuration uses `MINERU_MODEL_SOURCE`, and local models are enabled with `MINERU_MODEL_SOURCE=local`. Source: [MinerU model source docs](https://opendatalab.github.io/MinerU/usage/model_source/). Implementation implications: - Wrap MinerU behind a project-owned adapter rather than binding the codebase to its CLI output. - Preserve both Markdown and structured JSON when available. - Treat MinerU output as the first-pass parse, then normalize for Obsidian. - Keep engine name/version, command/options, page ids, and warnings in metadata. - Expect model downloads and GPU runtime setup to be heavy; document this clearly. - Treat MinerU 3.1.0's CLI-internal temporary local `mineru-api` as allowed local orchestration. - Reject `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends in strict-local mode. - Do not implement runtime engine selection in v1. Risks: - MinerU version behavior may change quickly. - GTX 1070 Ti is below the current documented GPU acceleration recommendation for some MinerU 3.x paths; `pipeline` CPU or limited GPU behavior must be validated locally. - Some licenses are custom or changed over time; check the exact dependency license before redistribution if the project scope changes. - Output Markdown may require post-processing to match Obsidian math and asset path expectations. ## 4. Output Standard: Obsidian-Friendly Markdown The final Markdown should be optimized for Obsidian rendering and long-term note usage. Rules: - Inline math: `$...$`. - Display math: `$$...$$` on separate lines. - Store extracted images in the PDF output folder's shared `images/` directory, for example `paper/images/page-003_figure-01.png`. - Use relative links from the Markdown file to assets. - Preserve page boundaries in metadata, not by noisy visible page markers in the main Markdown. - Prefer normal Markdown tables for simple tables. - Use HTML tables only when Markdown tables would destroy merged cells or mathematical structure. - Do not silently drop formulas, figures, captions, or references. If exact conversion fails, keep a best-effort representation and log a warning. Post-processing should normalize: - Math delimiters: convert `\(...\)` to `$...$` and `\[...\]` to `$$...$$` where safe. - Display math spacing: ensure blank lines around display equations. - Escaping: avoid over-escaping underscores inside math. - Asset links: convert absolute/generated paths to stable relative paths. - Heading levels: avoid treating running headers, footers, and page numbers as section headings. - Repeated hyphenation and line breaks from PDF extraction. ## 5. Metadata and Provenance Metadata is necessary because fully automatic conversion can produce imperfect formulas and reading order. The converter must keep enough provenance to identify the source page, source region, engine version, warnings, and emitted assets for each result. The metadata schema is defined in `ARCHITECTURE.md`. ## 6. Evaluation Strategy Do not rely only on raw text edit distance. The project should evaluate multiple dimensions. ### Benchmarks to Learn From OmniDocBench: - Evaluates document parsing across text, formulas, tables, and reading order. - End-to-end scoring combines text edit distance, table TEDS, and formula CDM. - Provides formula recognition evaluation for display and inline formulas. Source: [OmniDocBench](https://github.com/opendatalab/OmniDocBench). Unit-test-style document parsing checks: - Uses simple, unambiguous, machine-checkable page facts instead of only soft edit-distance comparisons. - Includes arXiv math, old scans math, tables, headers/footers, multi-column layouts, long/tiny text, and base cases. - Explicitly notes that small equation symbol swaps can be critical even when edit distance is small. ParseBench: - Emphasizes semantic correctness for agentic document parsing: table structure, chart data, formatting, visual grounding, and content faithfulness. Source: [ParseBench](https://arxiv.org/abs/2604.08538). ### Project Acceptance Metrics For v1, use a small project fixture suite rather than trying to run every public benchmark. Required fixture categories: - A short digital PDF with simple inline and display math. - A math-heavy academic paper with multi-column layout. - A PDF page containing a table with formulas. - A PDF with figures, captions, references, and page numbers. Required checks: - Markdown file is generated. - Metadata JSON is generated when requested. - All pages have provenance records. - Inline math and display math render in a KaTeX/MathJax-compatible check. - Asset links resolve. - No cloud network calls occur during conversion. - MinerU failures produce warnings and a best-effort output, not a hard crash, unless the input file cannot be opened or no output can be produced. Recommended quantitative checks: - Count display math blocks detected. - Count math render failures. - Count missing asset links. - Count pages with warnings. - Compare selected known formulas against expected LaTeX or normalized render output. ## 7. Implementation Source of Truth Do not implement directly from this research note. - Use `PRD.md` for CLI, API, scope, tests, and release criteria. - Use `ARCHITECTURE.md` for the conversion pipeline, MinerU boundary, intermediate representation, metadata schema, and strict-local enforcement. ## 8. Implementation Risks - Formula correctness cannot be guaranteed purely by text extraction. - MinerU behavior may change across versions; adapter tests should pin expected behavior. - GPU dependencies can be difficult on Windows; `doctor` should detect CUDA, PyTorch, model paths, and engine availability. - Obsidian math rendering may differ from GitHub or Pandoc. - Licensing must be reviewed before packaging or redistributing models/tools. - Fully automatic mode means users may receive imperfect formulas; warnings and metadata are essential. ## 9. Sprint 0 Verification And Engine Update (2026-05-07) Sprint 0 originally verified MinerU 2.5.4 assumptions before implementation. After Sprint 0, the project owner changed the v1 engine target to MinerU 3.1.0 and redefined strict-local execution. The current implementation target is therefore MinerU 3.1.0, not the earlier 2.5.4 pin. ### 9.1 Recommendation Recommendation: `go-with-risks` for personal-use v1. The project can proceed to Sprint 1 after the `uv` workflow is available locally or Sprint 1 explicitly handles the bootstrap gap. Use `mineru[core]==3.1.0` or another explicitly reviewed 3.1.0 installation path until a later sprint contract changes it. Current strict-local policy: - Allowed: direct `mineru` CLI execution. - Allowed: the temporary local `mineru-api` process that MinerU 3.1.0 starts internally when the CLI runs without `--api-url`. - Prohibited: `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends. - Setup may download models only when the user explicitly runs setup commands. - Runtime conversion should use local model paths, for example with `MINERU_MODEL_SOURCE=local`. ### 9.2 Evidence Policy Applied Sprint 0 claims use official or primary sources where available. Each claim below is marked as either: - Direct fact: stated by a source or observed from an allowed local command. - Project inference: a project decision derived from source facts. Web research was allowed for documentation verification. Runtime converter design remains local-only. ### 9.3 MinerU 3.1.0 Facts | Area | Confirmed fact | Evidence type | Source | Implementation implication | | --- | --- | --- | --- | --- | | Package pin | PyPI lists MinerU `3.1.0`, released 2026-04-17, with Python `>=3.10,<3.14` and extras including `core`, `pipeline`, `vlm`, `vllm`, `lmdeploy`, `mlx`, `gradio`, and `all`. | Direct fact | [PyPI mineru 3.1.0](https://pypi.org/project/mineru/3.1.0/) | Pin v1 setup to MinerU 3.1.0 until changed by a later contract. | | Install path | Current MinerU docs show `uv pip install -U "mineru[all]"`; PyPI also supports extras including `core`. | Direct fact plus project inference | [PyPI mineru 3.1.0](https://pypi.org/project/mineru/3.1.0/) | Start with `mineru[core]==3.1.0` for the wrapper unless Sprint 1 or Sprint 8 proves that `all` is required. | | CLI shape | Current docs show `mineru -p -o `. | Direct fact | [MinerU quick usage docs](https://opendatalab.github.io/MinerU/usage/quick_usage/) | The adapter still calls the `mineru` CLI directly. | | Local temporary API | MinerU docs state that without `--api-url`, the CLI launches a temporary local `mineru-api`; with `--api-url`, the CLI connects to an existing local or remote FastAPI service. | Direct fact | [MinerU quick usage docs](https://opendatalab.github.io/MinerU/usage/quick_usage/), [MinerU CLI tools docs](https://opendatalab.github.io/MinerU/usage/cli_tools/) | Allow only the CLI-internal temporary local `mineru-api`; reject `--api-url` and user-managed API endpoints. | | Backend risk | Current docs describe `pipeline`, `vlm`, and `hybrid` paths plus HTTP-client variants for OpenAI-compatible servers. | Direct fact | [PyPI mineru 3.1.0](https://pypi.org/project/mineru/3.1.0/), [MinerU CLI tools docs](https://opendatalab.github.io/MinerU/usage/cli_tools/) | Strict-local validation must reject HTTP client backends and remote OpenAI-compatible backends. | | Output layout | MinerU output docs list Markdown plus visual debugging files and structured files such as `model.json`, `middle.json`, and `content_list.json`; the exact set depends on backend and input type. | Direct fact | [MinerU output files docs](https://opendatalab.github.io/MinerU/reference/output_files/) | Adapter parsing must tolerate optional files and backend-specific structured output. Keep raw output optional through `--keep-raw`. | | Model/cache behavior | MinerU uses Hugging Face and ModelScope by default and switches source through `MINERU_MODEL_SOURCE`; local parsing uses `MINERU_MODEL_SOURCE=local` after models are downloaded. | Direct fact | [MinerU model source docs](https://opendatalab.github.io/MinerU/usage/model_source/) | Setup scripts may download models; runtime conversion should require local model paths under strict-local mode. | | 3.1.0 capability update | Release notes say 3.1.0 upgrades the main VLM model to `MinerU2.5-Pro-2604-1.2B` and improves image/chart parsing, truncated paragraph merging, cross-page table merging, and image recognition inside tables. | Direct fact | [MinerU releases](https://github.com/opendatalab/MinerU/releases) | 3.1.0 is a better target for math-heavy and complex-layout documents, pending local hardware validation. | Adapter-facing fields that must remain optional until a local MinerU output probe is run: - Backend/parse method directory names. - Presence of `content_list`, `middle`, `model`, `layout`, `span`, and `origin` files. - Page/block confidence and bbox fields in structured output. - Asset naming and relative paths. - Exact stdout/stderr warning text. ### 9.4 Local Environment Facts Allowed local commands run during Sprint 0: ```powershell python --version uv --version nvidia-smi ``` Results: | Area | Result | Evidence type | Source or command | Impact | | --- | --- | --- | --- | --- | | Python | Local Python is `Python 3.12.7`. Python 3.12 target is viable for MinerU 3.1.0's declared `>=3.10,<3.14` range. | Direct fact plus project inference | `python --version`; [PyPI mineru 3.1.0](https://pypi.org/project/mineru/3.1.0/) | Continue targeting Python 3.12. Prefer current 3.12 patch in setup docs, but do not require a patch bump for v1. | | `uv` | `uv` is not on PATH. PowerShell reported that `uv` is not recognized. | Direct fact | `uv --version`; [uv installation docs](https://docs.astral.sh/uv/getting-started/installation/) | Sprint 1 cannot honestly claim `uv sync` works until `uv` is installed or the Sprint 1 contract includes bootstrap instructions. | | GPU | `nvidia-smi` detects NVIDIA GeForce GTX 1070 Ti, driver `577.00`, WDDM, 8192 MiB VRAM, about 5034 MiB free at check time. | Direct fact | `nvidia-smi` | GPU is visible, but available VRAM is tight for model workloads and must be reported by `doctor`. | | Compute capability | GTX 1070 Ti is Pascal and listed as CUDA compute capability `6.1`. | Direct fact | [NVIDIA legacy CUDA GPU table](https://developer.nvidia.com/cuda/gpus/legacy), [NVIDIA GTX 10-series specs](https://www.nvidia.com/en-us/geforce/10-series/10-series-specs/) | Warn on pre-Turing GPUs. Do not assume modern CUDA 12.8/12.9 PyTorch wheels work. | | PyTorch/CUDA risk | PyTorch project discussion states CUDA 12.8/12.9 builds remove Maxwell/Pascal support. `nvidia-smi` CUDA version is a driver capability ceiling, not proof that PyTorch CUDA works. | Direct fact plus project inference | [PyTorch CUDA architecture support discussion](https://dev-discuss.pytorch.org/t/cuda-toolkit-version-and-architecture-support-update-maxwell-and-pascal-architecture-support-removed-in-cuda-12-8-and-12-9-builds/3128), [PyTorch install docs](https://pytorch.org/get-started/locally/) | `pdf2md doctor` must test actual PyTorch import, CUDA runtime, device name, compute capability, and free memory. | Future `pdf2md doctor` checks: - `python --version`, requiring `>=3.12,<3.13`. - `uv --version`, failing clearly when unavailable. - PowerShell version and edition on Windows. - `nvidia-smi` GPU name, driver, WDDM/TCC mode, total/free VRAM. - GPU compute capability warning for `<7.5`, especially Pascal `6.1`. - `torch` import, `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, device name, compute capability, and free memory. - MinerU CLI availability, installed package version, and model/cache configuration. - Strict-local validation that runtime uses direct CLI, allows only CLI-internal temporary local `mineru-api`, uses local model paths, and rejects `--api-url`, router mode, HTTP client backends, remote APIs, and remote OpenAI-compatible backends. ### 9.5 License And Privacy Facts | Area | Confirmed fact | Evidence type | Source | Implementation implication | | --- | --- | --- | --- | --- | | MinerU 3.1.0 license | PyPI identifies MinerU 3.1.0 as `LicenseRef-MinerU-Open-Source-License`; release notes state MinerU moved from AGPLv3 to a custom license based on Apache 2.0. | Direct fact | [PyPI mineru 3.1.0](https://pypi.org/project/mineru/3.1.0/), [MinerU releases](https://github.com/opendatalab/MinerU/releases) | License is not a blocker for personal v1. Redistribution still needs review. | | Runtime privacy | MinerU docs expose remote-capable paths, including `--api-url`, router usage, HTTP client backends, and OpenAI-compatible backend URLs. | Direct fact | [MinerU usage guide](https://opendatalab.github.io/MinerU/usage/), [MinerU CLI tools docs](https://opendatalab.github.io/MinerU/usage/cli_tools/) | The wrapper must never upload PDFs, page images, extracted text, or intermediates. Setup downloads are separate from runtime conversion. | | Model downloads | `mineru-models-download` must use a remote model source for real downloads, while runtime local parsing uses `MINERU_MODEL_SOURCE=local`. | Direct fact | [MinerU model source docs](https://opendatalab.github.io/MinerU/usage/model_source/) | Download scripts are setup-only. Conversion runtime must not download or call remote inference. | ### 9.6 Go-With-Risks Register | Risk | Later sprint that must absorb it | Required handling | | --- | --- | --- | | `uv` is missing locally. | Sprint 1 and Sprint 8 | Install `uv` before claiming scaffold success, or make bootstrap documentation part of Sprint 1. `doctor` must report missing `uv`. | | GTX 1070 Ti is Pascal compute capability 6.1 and is below current documented GPU acceleration recommendations for some MinerU 3.x paths. | Sprint 8 | `doctor` must verify actual PyTorch/CUDA/MinerU backend availability and warn on pre-Turing GPUs. Setup docs should avoid assuming CUDA 12.8/12.9 PyTorch works on this GPU. | | MinerU 3.1.0 output layout is source-verified but not locally probed. | Sprint 4 and Sprint 9 | Adapter tests should mock optional outputs first. A later local MinerU probe should confirm real output paths before release. | | CLI-internal local `mineru-api` is allowed, but user-specified API paths are prohibited. | Sprint 4 and Sprint 8 | Adapter command validation must allow `mineru -p ... -o ...` without `--api-url`, while rejecting `--api-url`, router mode, HTTP client backends, remote APIs, and remote OpenAI-compatible backends. | | Runtime must be local-only while setup may download packages/models. | Sprint 4 and Sprint 8 | Separate setup/download commands from conversion runtime. Enforce `MINERU_MODEL_SOURCE=local` or equivalent local model configuration in strict-local mode. | No hard failure criteria are currently met. Direct local MinerU 3.1.0 CLI execution appears suitable for v1 under the redefined strict-local policy, Python 3.12 is compatible with the package metadata, local-only design remains viable, and license posture does not block personal use. ## 10. Source Index - [Nougat paper](https://arxiv.org/abs/2308.13418) - [MinerU docs](https://opendatalab.github.io/MinerU/zh/) - [MinerU GitHub](https://github.com/opendatalab/mineru) - [PyPI mineru 3.1.0](https://pypi.org/project/mineru/3.1.0/) - [MinerU releases](https://github.com/opendatalab/MinerU/releases) - [MinerU usage guide](https://opendatalab.github.io/MinerU/usage/) - [MinerU quick usage docs](https://opendatalab.github.io/MinerU/usage/quick_usage/) - [MinerU CLI tools docs](https://opendatalab.github.io/MinerU/usage/cli_tools/) - [MinerU output files docs](https://opendatalab.github.io/MinerU/reference/output_files/) - [MinerU model source docs](https://opendatalab.github.io/MinerU/usage/model_source/) - [uv installation docs](https://docs.astral.sh/uv/getting-started/installation/) - [PyTorch install docs](https://pytorch.org/get-started/locally/) - [PyTorch CUDA architecture support discussion](https://dev-discuss.pytorch.org/t/cuda-toolkit-version-and-architecture-support-update-maxwell-and-pascal-architecture-support-removed-in-cuda-12-8-and-12-9-builds/3128) - [NVIDIA legacy CUDA GPU table](https://developer.nvidia.com/cuda/gpus/legacy) - [NVIDIA GTX 10-series specs](https://www.nvidia.com/en-us/geforce/10-series/10-series-specs/) - [OmniDocBench](https://github.com/opendatalab/OmniDocBench) - [ParseBench](https://arxiv.org/abs/2604.08538)