baram2584/PDFToMD

김경종 88d6b92283 add pdftomd

2026-05-08 16:42:19 +09:00

Knowledge Base: Local PDF-to-Markdown Converter for Math-Heavy Documents

Last updated: 2026-05-07

1. Product Direction

This project will build a local-first PDF-to-Markdown converter for math-heavy academic PDFs and books. The v1 target is intentionally narrow:

Processing policy: local-only. Do not send user PDFs to cloud OCR or external AI APIs.
Primary interface: CLI plus Python library.
Primary output: Obsidian-friendly Markdown.
Main conversion engine: MinerU 3.1.0.
Math output: inline math as $...$, display math as $$...$$.
Hardware target: NVIDIA GPU.
PDF scope: digital PDFs with an existing text layer first. Scanned books and poor-quality scans are out of scope for v1 optimization.
Quality workflow: fully automatic conversion. Low-confidence regions should be logged and represented in metadata, but conversion should not stop.
Install target: Python with uv; scripts should document local model download/setup.
License posture: personal use. License terms are not a v1 blocker for this local project, but document MinerU and transitive model/package licenses before any redistribution.

The rest of this document records the research basis and implementation implications for these decisions.

This file is background research, not a second requirements source. Use PRD.md for product requirements and ARCHITECTURE.md for implementation structure.

2. Why Math PDF Conversion Is Hard

PDF is a visual/page-description format, not a semantic source format. In scientific PDFs, the displayed equation often survives visually while the source-level LaTeX structure is lost. The Nougat paper states the core issue clearly: scientific knowledge is frequently stored in PDFs, and the PDF format loses semantic information, especially for mathematical expressions. Source: Nougat: Neural Optical Understanding for Academic Documents.

For this project, equation conversion must be treated as a document understanding and formula recognition problem, not just text extraction. A robust converter needs to recover:

Reading order across multi-column layouts.
Inline and display equation boundaries.
LaTeX syntax that renders in common Markdown math renderers.
Tables, captions, figures, references, footnotes, and image assets.
Page-level provenance for debugging and downstream correction.

Digital PDFs make the problem easier because embedded text can be extracted directly, but formulas may still appear as glyph streams, vector paths, images, or fragmented positioned characters. A v1 product should therefore use the PDF text layer where reliable and use document parsing/OCR models for layout and math reconstruction.

3. MinerU 3.1.0 Engine Strategy

MinerU 3.1.0 is the fixed local parser for v1.

Relevant source facts:

MinerU 3.1.0 is published on PyPI, requires Python >=3.10,<3.14, and describes support for PDF, images, DOCX, PPTX, and XLSX to Markdown and JSON. Source: PyPI mineru 3.1.0.
MinerU 3.1.0 release notes describe a license move to the MinerU Open Source License, a VLM main model upgrade to MinerU2.5-Pro-2604-1.2B, improved image/chart parsing, truncated paragraph merging, cross-page table merging, table-internal image recognition, and native PPTX/XLSX parsing. Source: MinerU releases.
MinerU's current quick usage docs give the CLI shape as mineru -p <input_path> -o <output_path> and state that without --api-url, the CLI launches a temporary local mineru-api. Source: MinerU quick usage docs.
Starting with MinerU 3.0, the mineru command is an orchestration client on top of mineru-api. Source: MinerU usage guide.
MinerU model source configuration uses MINERU_MODEL_SOURCE, and local models are enabled with MINERU_MODEL_SOURCE=local. Source: MinerU model source docs.

Implementation implications:

Wrap MinerU behind a project-owned adapter rather than binding the codebase to its CLI output.
Preserve both Markdown and structured JSON when available.
Treat MinerU output as the first-pass parse, then normalize for Obsidian.
Keep engine name/version, command/options, page ids, and warnings in metadata.
Expect model downloads and GPU runtime setup to be heavy; document this clearly.
Treat MinerU 3.1.0's CLI-internal temporary local mineru-api as allowed local orchestration.
Reject --api-url, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends in strict-local mode.
Do not implement runtime engine selection in v1.
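
A minimal sketch of the adapter boundary described above. The `MineruInvocation` record and the `build_mineru_invocation` helper are hypothetical names, not part of MinerU; only the `mineru -p <input_path> -o <output_path>` shape and `MINERU_MODEL_SOURCE=local` come from the cited docs. The helper builds the command and environment without running anything, so the same record can be copied into metadata.

```python
import os
from dataclasses import dataclass, field
from pathlib import Path

ENGINE_NAME = "mineru"
ENGINE_VERSION = "3.1.0"  # fixed v1 engine pin from this document


@dataclass
class MineruInvocation:
    """Hypothetical record of one engine call, kept for metadata."""
    argv: list[str]
    env: dict[str, str]
    engine: str = ENGINE_NAME
    version: str = ENGINE_VERSION
    warnings: list[str] = field(default_factory=list)


def build_mineru_invocation(pdf: Path, out_dir: Path, base_env=None) -> MineruInvocation:
    """Build (but do not run) a strict-local MinerU CLI call."""
    env = dict(os.environ if base_env is None else base_env)
    # Runtime conversion should read already-downloaded local models.
    env["MINERU_MODEL_SOURCE"] = "local"
    # No --api-url: per the docs, the CLI then starts its own temporary
    # local mineru-api, which the strict-local policy allows.
    argv = ["mineru", "-p", str(pdf), "-o", str(out_dir)]
    return MineruInvocation(argv=argv, env=env)
```

The caller would pass `inv.argv` and `inv.env` to `subprocess.run` and store the record next to the converted output. Keeping construction separate from execution lets adapter tests check the command without MinerU installed.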

Risks:

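MinerU behavior may change quickly between versions.
The local GTX 1070 Ti (Pascal, compute capability 6.1; see section 9.4) is below the current documented GPU acceleration recommendation for some MinerU 3.x paths; behavior of the pipeline backend on CPU, or with limited GPU acceleration, must be validated locally.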
Some licenses are custom or changed over time; check the exact dependency license before redistribution if the project scope changes.
Output Markdown may require post-processing to match Obsidian math and asset path expectations.

4. Output Standard: Obsidian-Friendly Markdown

The final Markdown should be optimized for Obsidian rendering and long-term note usage.

Rules:

Inline math: $...$.
Display math: $$...$$ on separate lines.
Store extracted images in a sibling assets directory, for example paper.assets/page-003-figure-01.png.
Use relative links from the Markdown file to assets.
Preserve page boundaries in metadata, not by noisy visible page markers in the main Markdown.
Prefer normal Markdown tables for simple tables.
Use HTML tables only when Markdown tables would destroy merged cells or mathematical structure.
Do not silently drop formulas, figures, captions, or references. If exact conversion fails, keep a best-effort representation and log a warning.
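
The asset naming and relative-link rules above can be illustrated with a small helper. This is a sketch only: `asset_relative_link` and `image_markdown` are hypothetical names, and the `page-NNN-kind-NN` pattern is extrapolated from the single example path given in the rules.

```python
from pathlib import Path


def asset_relative_link(md_path: Path, page: int, kind: str, index: int, ext: str = "png") -> str:
    """Return a link, relative to the Markdown file, into '<stem>.assets'."""
    assets_dir = f"{md_path.stem}.assets"
    filename = f"page-{page:03d}-{kind}-{index:02d}.{ext}"
    # Forward slashes keep the link portable across Windows and POSIX.
    return f"{assets_dir}/{filename}"


def image_markdown(md_path: Path, page: int, index: int, alt: str = "") -> str:
    """Render a Markdown image reference for an extracted figure."""
    return f"![{alt}]({asset_relative_link(md_path, page, 'figure', index)})"
```

For `paper.md`, page 3, first figure, this yields `paper.assets/page-003-figure-01.png`, matching the example in the rules.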

Post-processing should normalize:

Math delimiters: convert \(...\) to $...$ and \[...\] to $$...$$ where safe.
Display math spacing: ensure blank lines around display equations.
Escaping: avoid over-escaping underscores inside math.
Asset links: convert absolute/generated paths to stable relative paths.
Heading levels: avoid treating running headers, footers, and page numbers as section headings.
Line breaks: repair repeated end-of-line hyphenation and spurious line breaks left by PDF extraction.
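
The delimiter and spacing rules could be sketched as below. `normalize_math_delimiters` is a hypothetical helper, and it is deliberately naive: it does not skip fenced code blocks, so a real implementation would need extra guards to satisfy the "where safe" condition.

```python
import re

# \[ ... \] (possibly spanning lines) becomes display math. The lookbehind
# skips LaTeX line-break spacing such as "\\[2pt]".
_DISPLAY = re.compile(r"(?<!\\)\\\[(.+?)(?<!\\)\\\]", re.DOTALL)
# \( ... \) on a single line becomes inline math.
_INLINE = re.compile(r"(?<!\\)\\\((.+?)(?<!\\)\\\)")


def normalize_math_delimiters(text: str) -> str:
    """Rewrite LaTeX-style delimiters into $...$ and $$...$$."""

    def display(match: re.Match) -> str:
        # Put $$ on its own lines, with blank lines around the block.
        return "\n\n$$\n" + match.group(1).strip() + "\n$$\n\n"

    def inline(match: re.Match) -> str:
        return "$" + match.group(1).strip() + "$"

    # Function replacements keep backslashes and underscores in the math
    # body untouched, so nothing inside the formula is re-escaped.
    text = _DISPLAY.sub(display, text)
    text = _INLINE.sub(inline, text)
    # Collapse the padding so at most one blank line surrounds each block.
    return re.sub(r"\n{3,}", "\n\n", text)
```

Running display replacement before inline replacement keeps the two passes independent; the final collapse only removes surplus blank lines and leaves formula content as produced by the engine.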

5. Metadata and Provenance

Metadata is necessary because fully automatic conversion can produce imperfect formulas and reading order. The converter must keep enough provenance to identify the source page, source region, engine version, warnings, and emitted assets for each result.

The metadata schema is defined in ARCHITECTURE.md.

6. Evaluation Strategy

Do not rely only on raw text edit distance: a single swapped symbol can invalidate an equation while barely changing the score. The project should evaluate multiple dimensions.

Benchmarks to Learn From

OmniDocBench:

Evaluates document parsing across text, formulas, tables, and reading order.
End-to-end scoring combines text edit distance, table TEDS, and formula CDM.
Provides formula recognition evaluation for display and inline formulas. Source: OmniDocBench.

Unit-test-style document parsing checks:

Uses simple, unambiguous, machine-checkable page facts instead of only soft edit-distance comparisons.
Includes arXiv math, old scans math, tables, headers/footers, multi-column layouts, long/tiny text, and base cases.
Explicitly notes that small equation symbol swaps can be critical even when edit distance is small.

ParseBench:

Emphasizes semantic correctness for agentic document parsing: table structure, chart data, formatting, visual grounding, and content faithfulness. Source: ParseBench.

Project Acceptance Metrics

For v1, use a small project fixture suite rather than trying to run every public benchmark.

Required fixture categories:

A short digital PDF with simple inline and display math.
A math-heavy academic paper with multi-column layout.
A PDF page containing a table with formulas.
A PDF with figures, captions, references, and page numbers.

Required checks:

Markdown file is generated.
Metadata JSON is generated when requested.
All pages have provenance records.
Inline math and display math render in a KaTeX/MathJax-compatible check.
Asset links resolve.
No cloud network calls occur during conversion.
MinerU failures produce warnings and a best-effort output, not a hard crash, unless the input file cannot be opened or no output can be produced.

Recommended quantitative checks:

Count display math blocks detected.
Count math render failures.
Count missing asset links.
Count pages with warnings.
Compare selected known formulas against expected LaTeX or normalized render output.
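
Two of the quantitative checks above (display math count and missing asset links) can be sketched with the standard library alone. The function names are hypothetical, and the display-math counter assumes `$$` delimiters are balanced and never appear inside code blocks.

```python
import re
from pathlib import Path

# Matches Markdown image links of the form ![alt](target).
_IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def count_display_math(markdown: str) -> int:
    """Count $$...$$ blocks by pairing up '$$' delimiters."""
    return markdown.count("$$") // 2


def missing_asset_links(md_path: Path) -> list[str]:
    """Return relative image targets that do not exist next to the file."""
    text = md_path.read_text(encoding="utf-8")
    missing = []
    for target in _IMAGE_LINK.findall(text):
        if "://" in target:
            continue  # external URLs are not local assets
        if not (md_path.parent / target).exists():
            missing.append(target)
    return missing
```

A fixture test could assert an expected display-math count for each sample PDF and require `missing_asset_links` to return an empty list.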

7. Implementation Source of Truth

Do not implement directly from this research note.

Use PRD.md for CLI, API, scope, tests, and release criteria.
Use ARCHITECTURE.md for the conversion pipeline, MinerU boundary, intermediate representation, metadata schema, and strict-local enforcement.

8. Implementation Risks

Formula correctness cannot be guaranteed purely by text extraction.
MinerU behavior may change across versions; adapter tests should pin expected behavior.
GPU dependencies can be difficult on Windows; doctor should detect CUDA, PyTorch, model paths, and engine availability.
Obsidian math rendering may differ from GitHub or Pandoc.
Licensing must be reviewed before packaging or redistributing models/tools.
Fully automatic mode means users may receive imperfect formulas; warnings and metadata are essential.

9. Sprint 0 Verification And Engine Update (2026-05-07)

Sprint 0 originally verified MinerU 2.5.4 assumptions before implementation. After Sprint 0, the project owner changed the v1 engine target to MinerU 3.1.0 and redefined strict-local execution. The current implementation target is therefore MinerU 3.1.0, not the earlier 2.5.4 pin.

9.1 Recommendation

Recommendation: go-with-risks for personal-use v1.

The project can proceed to Sprint 1 after the uv workflow is available locally or Sprint 1 explicitly handles the bootstrap gap. Use mineru[core]==3.1.0 or another explicitly reviewed 3.1.0 installation path until a later sprint contract changes it.

Current strict-local policy:

Allowed: direct mineru CLI execution.
Allowed: the temporary local mineru-api process that MinerU 3.1.0 starts internally when the CLI runs without --api-url.
Prohibited: --api-url, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends.
Setup may download models only when the user explicitly runs setup commands.
Runtime conversion should use local model paths, for example with MINERU_MODEL_SOURCE=local.

9.2 Evidence Policy Applied

Sprint 0 claims use official or primary sources where available. Each claim below is marked as either:

Direct fact: stated by a source or observed from an allowed local command.
Project inference: a project decision derived from source facts.

Web research was allowed for documentation verification. Runtime converter design remains local-only.

9.3 MinerU 3.1.0 Facts

Area	Confirmed fact	Evidence type	Source	Implementation implication
Package pin	PyPI lists MinerU `3.1.0`, released 2026-04-17, with Python `>=3.10,<3.14` and extras including `core`, `pipeline`, `vlm`, `vllm`, `lmdeploy`, `mlx`, `gradio`, and `all`.	Direct fact	PyPI mineru 3.1.0	Pin v1 setup to MinerU 3.1.0 until changed by a later contract.
Install path	Current MinerU docs show `uv pip install -U "mineru[all]"`; PyPI also supports extras including `core`.	Direct fact plus project inference	PyPI mineru 3.1.0	Start with `mineru[core]==3.1.0` for the wrapper unless Sprint 1 or Sprint 8 proves that `all` is required.
CLI shape	Current docs show `mineru -p <input_path> -o <output_path>`.	Direct fact	MinerU quick usage docs	The adapter still calls the `mineru` CLI directly.
Local temporary API	MinerU docs state that without `--api-url`, the CLI launches a temporary local `mineru-api`; with `--api-url`, the CLI connects to an existing local or remote FastAPI service.	Direct fact	MinerU quick usage docs, MinerU CLI tools docs	Allow only the CLI-internal temporary local `mineru-api`; reject `--api-url` and user-managed API endpoints.
Backend risk	Current docs describe `pipeline`, `vlm`, and `hybrid` paths plus HTTP-client variants for OpenAI-compatible servers.	Direct fact	PyPI mineru 3.1.0, MinerU CLI tools docs	Strict-local validation must reject HTTP client backends and remote OpenAI-compatible backends.
Output layout	MinerU output docs list Markdown plus visual debugging files and structured files such as `model.json`, `middle.json`, and `content_list.json`; the exact set depends on backend and input type.	Direct fact	MinerU output files docs	Adapter parsing must tolerate optional files and backend-specific structured output. Keep raw output optional through `--keep-raw`.
Model/cache behavior	MinerU uses Hugging Face and ModelScope by default and switches source through `MINERU_MODEL_SOURCE`; local parsing uses `MINERU_MODEL_SOURCE=local` after models are downloaded.	Direct fact	MinerU model source docs	Setup scripts may download models; runtime conversion should require local model paths under strict-local mode.
3.1.0 capability update	Release notes say 3.1.0 upgrades the main VLM model to `MinerU2.5-Pro-2604-1.2B` and improves image/chart parsing, truncated paragraph merging, cross-page table merging, and image recognition inside tables.	Direct fact	MinerU releases	3.1.0 is a better target for math-heavy and complex-layout documents, pending local hardware validation.

Adapter-facing fields that must remain optional until a local MinerU output probe is run:

Backend/parse method directory names.
Presence of content_list, middle, model, layout, span, and origin files.
Page/block confidence and bbox fields in structured output.
Asset naming and relative paths.
Exact stdout/stderr warning text.
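
Until the local probe is run, the adapter can treat every structured file as optional. The sketch below assumes, without local confirmation, that the files can be found by suffix somewhere under the output directory; `probe_output` and the returned keys are hypothetical.

```python
import json
from pathlib import Path

# (result key, file-name suffix) pairs. The file names come from the MinerU
# output docs; where they sit in the directory tree is not assumed.
OPTIONAL_FILES = (
    ("content_list", "content_list.json"),
    ("middle", "middle.json"),
    ("model", "model.json"),
)


def probe_output(out_dir: Path) -> dict:
    """Collect whatever MinerU produced; record warnings instead of failing."""
    found = {"markdown": None, "warnings": []}
    md_files = sorted(out_dir.rglob("*.md"))
    if md_files:
        found["markdown"] = md_files[0]
    else:
        found["warnings"].append("no Markdown file found")
    for key, suffix in OPTIONAL_FILES:
        matches = sorted(out_dir.rglob(f"*{suffix}"))
        if not matches:
            found[key] = None
            found["warnings"].append(f"optional file missing: {suffix}")
            continue
        try:
            found[key] = json.loads(matches[0].read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            found[key] = None
            found["warnings"].append(f"unreadable {matches[0].name}: {exc}")
    return found
```

Missing or malformed files become warnings, in line with the rule that conversion should not stop on partial engine output.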

9.4 Local Environment Facts

Allowed local commands run during Sprint 0:

python --version
uv --version
nvidia-smi

Results:

Area	Result	Evidence type	Source or command	Impact
Python	Local Python is `Python 3.12.7`. Python 3.12 target is viable for MinerU 3.1.0's declared `>=3.10,<3.14` range.	Direct fact plus project inference	`python --version`; PyPI mineru 3.1.0	Continue targeting Python 3.12. Prefer current 3.12 patch in setup docs, but do not require a patch bump for v1.
`uv`	`uv` is not on PATH. PowerShell reported that `uv` is not recognized.	Direct fact	`uv --version`; uv installation docs	Sprint 1 cannot honestly claim `uv sync` works until `uv` is installed or the Sprint 1 contract includes bootstrap instructions.
GPU	`nvidia-smi` detects NVIDIA GeForce GTX 1070 Ti, driver `577.00`, WDDM, 8192 MiB VRAM, about 5034 MiB free at check time.	Direct fact	`nvidia-smi`	GPU is visible, but available VRAM is tight for model workloads and must be reported by `doctor`.
Compute capability	GTX 1070 Ti is Pascal and listed as CUDA compute capability `6.1`.	Direct fact	NVIDIA legacy CUDA GPU table, NVIDIA GTX 10-series specs	Warn on pre-Turing GPUs. Do not assume modern CUDA 12.8/12.9 PyTorch wheels work.
PyTorch/CUDA risk	PyTorch project discussion states CUDA 12.8/12.9 builds remove Maxwell/Pascal support. `nvidia-smi` CUDA version is a driver capability ceiling, not proof that PyTorch CUDA works.	Direct fact plus project inference	PyTorch CUDA architecture support discussion, PyTorch install docs	`pdf2md doctor` must test actual PyTorch import, CUDA runtime, device name, compute capability, and free memory.

Future pdf2md doctor checks:

python --version, requiring >=3.12,<3.13.
uv --version, failing clearly when unavailable.
PowerShell version and edition on Windows.
nvidia-smi GPU name, driver, WDDM/TCC mode, total/free VRAM.
GPU compute capability warning for <7.5, especially Pascal 6.1.
torch import, torch.__version__, torch.version.cuda, torch.cuda.is_available(), device name, compute capability, and free memory.
MinerU CLI availability, installed package version, and model/cache configuration.
Strict-local validation that runtime uses direct CLI, allows only CLI-internal temporary local mineru-api, uses local model paths, and rejects --api-url, router mode, HTTP client backends, remote APIs, and remote OpenAI-compatible backends.
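
A few of these checks can be sketched as small, separately testable functions. All function names and the report keys are hypothetical; the `torch` calls used (`torch.cuda.is_available`, `get_device_name`, `get_device_capability`, `mem_get_info`) are standard PyTorch APIs, and the sketch tolerates PyTorch being absent.

```python
import shutil
import sys


def check_python(version=None) -> tuple[bool, str]:
    """Pass only for Python 3.12.x, per the >=3.12,<3.13 target."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) == (3, 12), f"Python {major}.{minor} (requires >=3.12,<3.13)"


def check_uv() -> tuple[bool, str]:
    """Report whether uv is on PATH."""
    path = shutil.which("uv")
    return path is not None, path or "uv not found on PATH"


def capability_warning(capability: tuple[int, int]):
    """Return a warning string for pre-Turing GPUs, else None."""
    if capability < (7, 5):
        return f"compute capability {capability[0]}.{capability[1]} is pre-Turing (<7.5)"
    return None


def check_torch() -> dict:
    """Probe the actual PyTorch/CUDA runtime rather than trusting nvidia-smi."""
    report = {"installed": False, "cuda_available": False, "warnings": []}
    try:
        import torch
    except (ImportError, OSError) as exc:
        report["warnings"].append(f"torch is not importable: {exc}")
        return report
    report["installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda
    if not torch.cuda.is_available():
        report["warnings"].append("torch.cuda.is_available() is False")
        return report
    report["cuda_available"] = True
    report["device"] = torch.cuda.get_device_name(0)
    report["capability"] = torch.cuda.get_device_capability(0)
    free_bytes, _total_bytes = torch.cuda.mem_get_info(0)
    report["free_mib"] = free_bytes // (1024 * 1024)
    warning = capability_warning(report["capability"])
    if warning:
        report["warnings"].append(warning)
    return report
```

On the Sprint 0 machine, `capability_warning((6, 1))` would return a warning for the GTX 1070 Ti, while `check_uv()` would report that `uv` is missing.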

9.5 License And Privacy Facts

Area	Confirmed fact	Evidence type	Source	Implementation implication
MinerU 3.1.0 license	PyPI identifies MinerU 3.1.0 as `LicenseRef-MinerU-Open-Source-License`; release notes state MinerU moved from AGPLv3 to a custom license based on Apache 2.0.	Direct fact	PyPI mineru 3.1.0, MinerU releases	License is not a blocker for personal v1. Redistribution still needs review.
Runtime privacy	MinerU docs expose remote-capable paths, including `--api-url`, router usage, HTTP client backends, and OpenAI-compatible backend URLs.	Direct fact	MinerU usage guide, MinerU CLI tools docs	The wrapper must never upload PDFs, page images, extracted text, or intermediates. Setup downloads are separate from runtime conversion.
Model downloads	`mineru-models-download` must use a remote model source for real downloads, while runtime local parsing uses `MINERU_MODEL_SOURCE=local`.	Direct fact	MinerU model source docs	Download scripts are setup-only. Conversion runtime must not download or call remote inference.

9.6 Go-With-Risks Register

Risk	Later sprint that must absorb it	Required handling
`uv` is missing locally.	Sprint 1 and Sprint 8	Install `uv` before claiming scaffold success, or make bootstrap documentation part of Sprint 1. `doctor` must report missing `uv`.
GTX 1070 Ti is Pascal compute capability 6.1 and is below current documented GPU acceleration recommendations for some MinerU 3.x paths.	Sprint 8	`doctor` must verify actual PyTorch/CUDA/MinerU backend availability and warn on pre-Turing GPUs. Setup docs should avoid assuming CUDA 12.8/12.9 PyTorch works on this GPU.
MinerU 3.1.0 output layout is source-verified but not locally probed.	Sprint 4 and Sprint 9	Adapter tests should mock optional outputs first. A later local MinerU probe should confirm real output paths before release.
CLI-internal local `mineru-api` is allowed, but user-specified API paths are prohibited.	Sprint 4 and Sprint 8	Adapter command validation must allow `mineru -p ... -o ...` without `--api-url`, while rejecting `--api-url`, router mode, HTTP client backends, remote APIs, and remote OpenAI-compatible backends.
Runtime must be local-only while setup may download packages/models.	Sprint 4 and Sprint 8	Separate setup/download commands from conversion runtime. Enforce `MINERU_MODEL_SOURCE=local` or equivalent local model configuration in strict-local mode.

No hard failure criteria are currently met. Direct local MinerU 3.1.0 CLI execution appears suitable for v1 under the redefined strict-local policy, Python 3.12 is compatible with the package metadata, local-only design remains viable, and license posture does not block personal use.
