22 KiB
Knowledge Base: Local PDF-to-Markdown Converter for Math-Heavy Documents
Last updated: 2026-05-07
1. Product Direction
This project will build a local-first PDF-to-Markdown converter for math-heavy academic PDFs and books. The v1 target is intentionally narrow:
- Processing policy: local-only. Do not send user PDFs to cloud OCR or external AI APIs.
- Primary interface: CLI plus Python library.
- Primary output: Obsidian-friendly Markdown.
- Main conversion engine: MinerU 3.1.0.
- Math output: inline math as
$...$, display math as$$...$$. - Hardware target: NVIDIA GPU.
- PDF scope: digital PDFs with an existing text layer first. Scanned books and poor-quality scans are out of scope for v1 optimization.
- Quality workflow: fully automatic conversion. Low-confidence regions should be logged and represented in metadata, but conversion should not stop.
- Install target: Python with
uv; scripts should document local model download/setup. - License posture: personal use. License terms are not a v1 blocker for this local project, but document MinerU and transitive model/package licenses before any redistribution.
The rest of this document records the research basis and implementation implications for these decisions.
This file is background research, not a second requirements source. Use PRD.md for product requirements and ARCHITECTURE.md for implementation structure.
2. Why Math PDF Conversion Is Hard
PDF is a visual/page-description format, not a semantic source format. In scientific PDFs, the displayed equation often survives visually while the source-level LaTeX structure is lost. The Nougat paper states the core issue clearly: scientific knowledge is frequently stored in PDFs, and the PDF format loses semantic information, especially for mathematical expressions. Source: Nougat: Neural Optical Understanding for Academic Documents.
For this project, equation conversion must be treated as a document understanding and formula recognition problem, not just text extraction. A robust converter needs to recover:
- Reading order across multi-column layouts.
- Inline and display equation boundaries.
- LaTeX syntax that renders in common Markdown math renderers.
- Tables, captions, figures, references, footnotes, and image assets.
- Page-level provenance for debugging and downstream correction.
Digital PDFs make the problem easier because embedded text can be extracted directly, but formulas may still appear as glyph streams, vector paths, images, or fragmented positioned characters. A v1 product should therefore use the PDF text layer where reliable and use document parsing/OCR models for layout and math reconstruction.
3. MinerU 3.1.0 Engine Strategy
MinerU 3.1.0 is the fixed local parser for v1.
Relevant source facts:
- MinerU 3.1.0 is published on PyPI, requires Python
>=3.10,<3.14, and describes support for PDF, images, DOCX, PPTX, and XLSX to Markdown and JSON. Source: PyPI mineru 3.1.0. - MinerU 3.1.0 release notes describe a license move to the MinerU Open Source License, a VLM main model upgrade to
MinerU2.5-Pro-2604-1.2B, improved image/chart parsing, truncated paragraph merging, cross-page table merging, table-internal image recognition, and native PPTX/XLSX parsing. Source: MinerU releases. - MinerU's current quick usage documents the CLI shape as
mineru -p <input_path> -o <output_path>and states that without--api-url, the CLI launches a temporary localmineru-api. Source: MinerU quick usage docs. - Starting with MinerU 3.0, the
minerucommand is an orchestration client on top ofmineru-api. Source: MinerU usage guide. - MinerU model source configuration uses
MINERU_MODEL_SOURCE, and local models are enabled withMINERU_MODEL_SOURCE=local. Source: MinerU model source docs.
Implementation implications:
- Wrap MinerU behind a project-owned adapter rather than binding the codebase to its CLI output.
- Preserve both Markdown and structured JSON when available.
- Treat MinerU output as the first-pass parse, then normalize for Obsidian.
- Keep engine name/version, command/options, page ids, and warnings in metadata.
- Expect model downloads and GPU runtime setup to be heavy; document this clearly.
- Treat MinerU 3.1.0's CLI-internal temporary local
mineru-apias allowed local orchestration. - Reject
--api-url, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends in strict-local mode. - Do not implement runtime engine selection in v1.
Risks:
- MinerU version behavior may change quickly.
- GTX 1070 Ti is below the current documented GPU acceleration recommendation for some MinerU 3.x paths;
pipelineCPU or limited GPU behavior must be validated locally. - Some licenses are custom or changed over time; check the exact dependency license before redistribution if the project scope changes.
- Output Markdown may require post-processing to match Obsidian math and asset path expectations.
4. Output Standard: Obsidian-Friendly Markdown
The final Markdown should be optimized for Obsidian rendering and long-term note usage.
Rules:
- Inline math:
$...$. - Display math:
$$...$$on separate lines. - Store extracted images in a sibling assets directory, for example
paper.assets/page-003-figure-01.png. - Use relative links from the Markdown file to assets.
- Preserve page boundaries in metadata, not by noisy visible page markers in the main Markdown.
- Prefer normal Markdown tables for simple tables.
- Use HTML tables only when Markdown tables would destroy merged cells or mathematical structure.
- Do not silently drop formulas, figures, captions, or references. If exact conversion fails, keep a best-effort representation and log a warning.
Post-processing should normalize:
- Math delimiters: convert
\(...\)to$...$and\[...\]to$$...$$where safe. - Display math spacing: ensure blank lines around display equations.
- Escaping: avoid over-escaping underscores inside math.
- Asset links: convert absolute/generated paths to stable relative paths.
- Heading levels: avoid treating running headers, footers, and page numbers as section headings.
- Repeated hyphenation and line breaks from PDF extraction.
5. Metadata and Provenance
Metadata is necessary because fully automatic conversion can produce imperfect formulas and reading order. The converter must keep enough provenance to identify the source page, source region, engine version, warnings, and emitted assets for each result.
The metadata schema is defined in ARCHITECTURE.md.
6. Evaluation Strategy
Do not rely only on raw text edit distance. The project should evaluate multiple dimensions.
Benchmarks to Learn From
OmniDocBench:
- Evaluates document parsing across text, formulas, tables, and reading order.
- End-to-end scoring combines text edit distance, table TEDS, and formula CDM.
- Provides formula recognition evaluation for display and inline formulas. Source: OmniDocBench.
Unit-test-style document parsing checks:
- Uses simple, unambiguous, machine-checkable page facts instead of only soft edit-distance comparisons.
- Includes arXiv math, old scans math, tables, headers/footers, multi-column layouts, long/tiny text, and base cases.
- Explicitly notes that small equation symbol swaps can be critical even when edit distance is small.
ParseBench:
- Emphasizes semantic correctness for agentic document parsing: table structure, chart data, formatting, visual grounding, and content faithfulness. Source: ParseBench.
Project Acceptance Metrics
For v1, use a small project fixture suite rather than trying to run every public benchmark.
Required fixture categories:
- A short digital PDF with simple inline and display math.
- A math-heavy academic paper with multi-column layout.
- A PDF page containing a table with formulas.
- A PDF with figures, captions, references, and page numbers.
Required checks:
- Markdown file is generated.
- Metadata JSON is generated when requested.
- All pages have provenance records.
- Inline math and display math render in a KaTeX/MathJax-compatible check.
- Asset links resolve.
- No cloud network calls occur during conversion.
- MinerU failures produce warnings and a best-effort output, not a hard crash, unless the input file cannot be opened or no output can be produced.
Recommended quantitative checks:
- Count display math blocks detected.
- Count math render failures.
- Count missing asset links.
- Count pages with warnings.
- Compare selected known formulas against expected LaTeX or normalized render output.
7. Implementation Source of Truth
Do not implement directly from this research note.
- Use
PRD.mdfor CLI, API, scope, tests, and release criteria. - Use
ARCHITECTURE.mdfor the conversion pipeline, MinerU boundary, intermediate representation, metadata schema, and strict-local enforcement.
8. Implementation Risks
- Formula correctness cannot be guaranteed purely by text extraction.
- MinerU behavior may change across versions; adapter tests should pin expected behavior.
- GPU dependencies can be difficult on Windows;
doctorshould detect CUDA, PyTorch, model paths, and engine availability. - Obsidian math rendering may differ from GitHub or Pandoc.
- Licensing must be reviewed before packaging or redistributing models/tools.
- Fully automatic mode means users may receive imperfect formulas; warnings and metadata are essential.
9. Sprint 0 Verification And Engine Update (2026-05-07)
Sprint 0 originally verified MinerU 2.5.4 assumptions before implementation. After Sprint 0, the project owner changed the v1 engine target to MinerU 3.1.0 and redefined strict-local execution. The current implementation target is therefore MinerU 3.1.0, not the earlier 2.5.4 pin.
9.1 Recommendation
Recommendation: go-with-risks for personal-use v1.
The project can proceed to Sprint 1 after the uv workflow is available locally or Sprint 1 explicitly handles the bootstrap gap. Use mineru[core]==3.1.0 or another explicitly reviewed 3.1.0 installation path until a later sprint contract changes it.
Current strict-local policy:
- Allowed: direct
mineruCLI execution. - Allowed: the temporary local
mineru-apiprocess that MinerU 3.1.0 starts internally when the CLI runs without--api-url. - Prohibited:
--api-url, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends. - Setup may download models only when the user explicitly runs setup commands.
- Runtime conversion should use local model paths, for example with
MINERU_MODEL_SOURCE=local.
9.2 Evidence Policy Applied
Sprint 0 claims use official or primary sources where available. Each claim below is marked as either:
- Direct fact: stated by a source or observed from an allowed local command.
- Project inference: a project decision derived from source facts.
Web research was allowed for documentation verification. Runtime converter design remains local-only.
9.3 MinerU 3.1.0 Facts
| Area | Confirmed fact | Evidence type | Source | Implementation implication |
|---|---|---|---|---|
| Package pin | PyPI lists MinerU 3.1.0, released 2026-04-17, with Python >=3.10,<3.14 and extras including core, pipeline, vlm, vllm, lmdeploy, mlx, gradio, and all. |
Direct fact | PyPI mineru 3.1.0 | Pin v1 setup to MinerU 3.1.0 until changed by a later contract. |
| Install path | Current MinerU docs show uv pip install -U "mineru[all]"; PyPI also supports extras including core. |
Direct fact plus project inference | PyPI mineru 3.1.0 | Start with mineru[core]==3.1.0 for the wrapper unless Sprint 1 or Sprint 8 proves that all is required. |
| CLI shape | Current docs show mineru -p <input_path> -o <output_path>. |
Direct fact | MinerU quick usage docs | The adapter still calls the mineru CLI directly. |
| Local temporary API | MinerU docs state that without --api-url, the CLI launches a temporary local mineru-api; with --api-url, the CLI connects to an existing local or remote FastAPI service. |
Direct fact | MinerU quick usage docs, MinerU CLI tools docs | Allow only the CLI-internal temporary local mineru-api; reject --api-url and user-managed API endpoints. |
| Backend risk | Current docs describe pipeline, vlm, and hybrid paths plus HTTP-client variants for OpenAI-compatible servers. |
Direct fact | PyPI mineru 3.1.0, MinerU CLI tools docs | Strict-local validation must reject HTTP client backends and remote OpenAI-compatible backends. |
| Output layout | MinerU output docs list Markdown plus visual debugging files and structured files such as model.json, middle.json, and content_list.json; the exact set depends on backend and input type. |
Direct fact | MinerU output files docs | Adapter parsing must tolerate optional files and backend-specific structured output. Keep raw output optional through --keep-raw. |
| Model/cache behavior | MinerU uses Hugging Face and ModelScope by default and switches source through MINERU_MODEL_SOURCE; local parsing uses MINERU_MODEL_SOURCE=local after models are downloaded. |
Direct fact | MinerU model source docs | Setup scripts may download models; runtime conversion should require local model paths under strict-local mode. |
| 3.1.0 capability update | Release notes say 3.1.0 upgrades the main VLM model to MinerU2.5-Pro-2604-1.2B and improves image/chart parsing, truncated paragraph merging, cross-page table merging, and image recognition inside tables. |
Direct fact | MinerU releases | 3.1.0 is a better target for math-heavy and complex-layout documents, pending local hardware validation. |
Adapter-facing fields that must remain optional until a local MinerU output probe is run:
- Backend/parse method directory names.
- Presence of
content_list,middle,model,layout,span, andoriginfiles. - Page/block confidence and bbox fields in structured output.
- Asset naming and relative paths.
- Exact stdout/stderr warning text.
9.4 Local Environment Facts
Allowed local commands run during Sprint 0:
python --version
uv --version
nvidia-smi
Results:
| Area | Result | Evidence type | Source or command | Impact |
|---|---|---|---|---|
| Python | Local Python is Python 3.12.7. Python 3.12 target is viable for MinerU 3.1.0's declared >=3.10,<3.14 range. |
Direct fact plus project inference | python --version; PyPI mineru 3.1.0 |
Continue targeting Python 3.12. Prefer current 3.12 patch in setup docs, but do not require a patch bump for v1. |
uv |
uv is not on PATH. PowerShell reported that uv is not recognized. |
Direct fact | uv --version; uv installation docs |
Sprint 1 cannot honestly claim uv sync works until uv is installed or the Sprint 1 contract includes bootstrap instructions. |
| GPU | nvidia-smi detects NVIDIA GeForce GTX 1070 Ti, driver 577.00, WDDM, 8192 MiB VRAM, about 5034 MiB free at check time. |
Direct fact | nvidia-smi |
GPU is visible, but available VRAM is tight for model workloads and must be reported by doctor. |
| Compute capability | GTX 1070 Ti is Pascal and listed as CUDA compute capability 6.1. |
Direct fact | NVIDIA legacy CUDA GPU table, NVIDIA GTX 10-series specs | Warn on pre-Turing GPUs. Do not assume modern CUDA 12.8/12.9 PyTorch wheels work. |
| PyTorch/CUDA risk | PyTorch project discussion states CUDA 12.8/12.9 builds remove Maxwell/Pascal support. nvidia-smi CUDA version is a driver capability ceiling, not proof that PyTorch CUDA works. |
Direct fact plus project inference | PyTorch CUDA architecture support discussion, PyTorch install docs | pdf2md doctor must test actual PyTorch import, CUDA runtime, device name, compute capability, and free memory. |
Future pdf2md doctor checks:
python --version, requiring>=3.12,<3.13.uv --version, failing clearly when unavailable.- PowerShell version and edition on Windows.
nvidia-smiGPU name, driver, WDDM/TCC mode, total/free VRAM.- GPU compute capability warning for
<7.5, especially Pascal6.1. torchimport,torch.__version__,torch.version.cuda,torch.cuda.is_available(), device name, compute capability, and free memory.- MinerU CLI availability, installed package version, and model/cache configuration.
- Strict-local validation that runtime uses direct CLI, allows only CLI-internal temporary local
mineru-api, uses local model paths, and rejects--api-url, router mode, HTTP client backends, remote APIs, and remote OpenAI-compatible backends.
9.5 License And Privacy Facts
| Area | Confirmed fact | Evidence type | Source | Implementation implication |
|---|---|---|---|---|
| MinerU 3.1.0 license | PyPI identifies MinerU 3.1.0 as LicenseRef-MinerU-Open-Source-License; release notes state MinerU moved from AGPLv3 to a custom license based on Apache 2.0. |
Direct fact | PyPI mineru 3.1.0, MinerU releases | License is not a blocker for personal v1. Redistribution still needs review. |
| Runtime privacy | MinerU docs expose remote-capable paths, including --api-url, router usage, HTTP client backends, and OpenAI-compatible backend URLs. |
Direct fact | MinerU usage guide, MinerU CLI tools docs | The wrapper must never upload PDFs, page images, extracted text, or intermediates. Setup downloads are separate from runtime conversion. |
| Model downloads | mineru-models-download must use a remote model source for real downloads, while runtime local parsing uses MINERU_MODEL_SOURCE=local. |
Direct fact | MinerU model source docs | Download scripts are setup-only. Conversion runtime must not download or call remote inference. |
9.6 Go-With-Risks Register
| Risk | Later sprint that must absorb it | Required handling |
|---|---|---|
uv is missing locally. |
Sprint 1 and Sprint 8 | Install uv before claiming scaffold success, or make bootstrap documentation part of Sprint 1. doctor must report missing uv. |
| GTX 1070 Ti is Pascal compute capability 6.1 and is below current documented GPU acceleration recommendations for some MinerU 3.x paths. | Sprint 8 | doctor must verify actual PyTorch/CUDA/MinerU backend availability and warn on pre-Turing GPUs. Setup docs should avoid assuming CUDA 12.8/12.9 PyTorch works on this GPU. |
| MinerU 3.1.0 output layout is source-verified but not locally probed. | Sprint 4 and Sprint 9 | Adapter tests should mock optional outputs first. A later local MinerU probe should confirm real output paths before release. |
CLI-internal local mineru-api is allowed, but user-specified API paths are prohibited. |
Sprint 4 and Sprint 8 | Adapter command validation must allow mineru -p ... -o ... without --api-url, while rejecting --api-url, router mode, HTTP client backends, remote APIs, and remote OpenAI-compatible backends. |
| Runtime must be local-only while setup may download packages/models. | Sprint 4 and Sprint 8 | Separate setup/download commands from conversion runtime. Enforce MINERU_MODEL_SOURCE=local or equivalent local model configuration in strict-local mode. |
No hard failure criteria are currently met. Direct local MinerU 3.1.0 CLI execution appears suitable for v1 under the redefined strict-local policy, Python 3.12 is compatible with the package metadata, local-only design remains viable, and license posture does not block personal use.
10. Source Index
- Nougat paper
- MinerU docs
- MinerU GitHub
- PyPI mineru 3.1.0
- MinerU releases
- MinerU usage guide
- MinerU quick usage docs
- MinerU CLI tools docs
- MinerU output files docs
- MinerU model source docs
- uv installation docs
- PyTorch install docs
- PyTorch CUDA architecture support discussion
- NVIDIA legacy CUDA GPU table
- NVIDIA GTX 10-series specs
- OmniDocBench
- ParseBench