11 KiB
Architecture: Local PDF-to-Markdown Converter
Last updated: 2026-05-13
1. Overview
The system converts math-heavy digital PDFs into Obsidian-friendly Markdown using MinerU 3.1.0 as the fixed local conversion engine. Product requirements live in PRD.md; agent workflow rules live in AGENTS.md; research notes live in docs/KNOWLEDGEBASE.md.
The architecture separates MinerU execution from project-owned normalization and internal provenance/reporting. This boundary exists only to isolate MinerU I/O; it is not a pluggable engine system.
2. System Layers
- CLI/API layer
- Parse command arguments and Python API parameters.
- Discover input PDFs.
- Plan output paths.
- Enforce overwrite behavior.
- Print conversion summaries.
Optional local UI launcher sits above this layer and invokes the project-owned pdf2md CLI. It can run a selected folder by discovering direct-child PDFs and sequentially invoking existing pdf2md convert commands. It must not call MinerU directly, add a second conversion engine, run parallel GPU conversions by default, or expose remote/API runtime paths.
-
MinerU adapter layer
- Validate MinerU 3.1.0 installation and version.
- Run MinerU through direct local CLI execution.
- Capture raw Markdown, structured output, assets, logs, and exit status.
- Enforce strict-local execution.
-
Intermediate representation layer
- Convert MinerU-specific output into project-owned document/page/block objects.
- Preserve page index, bbox, confidence, source engine, and asset references returned by MinerU.
- Prevent raw MinerU objects from becoming public API return types.
-
Normalization layer
- Convert project-owned objects and MinerU Markdown into Obsidian-friendly Markdown.
- Normalize math delimiters, display math spacing, headings, tables, and asset links.
-
Quality and reporting layer
- Run link checks and math renderability checks with local tooling.
- Aggregate structured warnings.
- Build internal metadata-like records for reports and result summaries.
- Write quality report Markdown and optional raw MinerU diagnostics.
3. Conversion Pipeline
-
Input discovery
- Accept a single PDF or a directory.
- Require
--recursivefor subdirectory traversal. - Validate that each selected input is a local PDF.
- Compute source SHA-256.
-
MinerU conversion
- Create an isolated work directory per input PDF.
- Run the MinerU 3.1.0 adapter through the direct
mineruCLI. - Capture raw Markdown, raw JSON/structured output when available, extracted assets, warnings, and logs.
- When
--chunk-pagesis active, write one-page temporary PDFs, run MinerU once per source page, and group successful page Markdown into final outputs of the configured page count.
-
Intermediate representation
- Build document/page/block records from MinerU output.
- Preserve provenance data instead of relying only on final Markdown text.
-
Obsidian normalization
- Normalize inline math to
$...$. - Normalize display math to
$$...$$blocks on separate lines. - Normalize image links to stable relative asset paths.
- Normalize tables without destroying complex table structure.
- Normalize inline math to
-
Quality checks
- Verify generated asset links.
- Check math renderability when local tooling is available.
- Compare local pypdf text-layer extraction with Markdown text where page mapping is credible.
- Emit warnings without stopping conversion unless no usable output can be produced.
-
Output writing
- Write final Markdown parts under
<output>/<stem>/. - Write extracted assets under
<output>/<stem>/images/. - Write one report at
<output>/<stem>/<stem>_report.md. - Keep raw MinerU output when requested.
- In grouped page conversion mode, write one public Markdown part per grouped page range and delete temporary one-page PDFs plus intermediate per-page outputs.
- Write final Markdown parts under
4. MinerU Adapter Contract
The MinerU adapter exposes:
nameis_available()version()doctor()convert(input_pdf, work_dir, options)
Adapter conversion output contains:
raw_markdownraw_structuredassetspageswarningsengineengine_versionengine_optionsexit_codestderr
The adapter must fail fast if it cannot run in strict-local mode. Runtime engine selection is not part of v1.
The default conversion device is cuda:0. Because MinerU 3.1.0 selects its local device through environment/config rather than a dedicated CLI GPU flag, the adapter must set the MinerU subprocess environment to request CUDA by default while keeping the command shape direct and local.
Runtime tuning is project-owned and strict-local:
--gpu autoselects the visible NVIDIA GPU with the largest VRAM from localnvidia-smiinventory.--mineru-profile autois the default.- Safe profile settings are used for GTX 1070 Ti 8GB, pre-Turing, low-VRAM GPUs, or unavailable inventory.
- Stronger settings are used only for 16GB+ Turing-or-newer GPUs.
- Tuning is applied only through allowlisted MinerU subprocess environment variables:
MINERU_PROCESSING_WINDOW_SIZE,MINERU_API_MAX_CONCURRENT_REQUESTS, andMINERU_PDF_RENDER_THREADS. - The adapter must not add MinerU backend flags, API URLs, router mode, HTTP client backend use, remote OpenAI-compatible endpoints, or
MINERU_HYBRID_BATCH_RATIO.
Resolved profile details must be recorded in engine_options["mineru_profile"], including requested profile, applied profile, environment values, and selected GPU details when known.
Allowed MinerU execution in v1:
- Direct local
mineruCLI execution. - The temporary local
mineru-apiprocess that MinerU 3.1.0 starts internally when the CLI runs without--api-url.
Prohibited MinerU execution in v1:
- Passing
--api-url. - Remote APIs.
- Router mode.
- HTTP client backends.
- Remote OpenAI-compatible backends or inference endpoints.
5. Intermediate Representation
The project uses a small internal representation for normalization and metadata.
Required concepts:
- Document
- Page
- Block
- Asset
- Warning
Required block types:
headingparagraphinline_formuladisplay_formulatablefigurecaptionfootnotereferenceunknown
Record these page/block fields when MinerU returns them. Do not invent missing values.
- Page index.
- Page dimensions.
- Bounding boxes.
- Confidence.
- Source engine, fixed to MinerU 3.1.0 in v1.
- Markdown character span.
6. Markdown Normalization
Final Markdown must prioritize Obsidian.
- Use
$...$for inline math. - Use display math blocks with
$$on their own lines. - Keep blank lines around display math.
- Do not escape underscores or carets inside math unnecessarily.
- Prefer Markdown tables for simple tables.
- Use HTML tables for complex tables when Markdown would lose structure.
- Store figures/images in the stable
images/directory under the PDF output folder. - Do not add visible page separators in v1; grouped page conversion may add invisible HTML comments such as
<!-- source-page: 7 -->for provenance. - Preserve captions and references when MinerU provides them.
7. Internal Provenance Schema
New conversions do not write a public metadata JSON sidecar. The same schema shape remains useful internally for report generation, warning aggregation, and tests.
Required top-level fields:
source_pdfsource_sha256created_atengineengine_versionengine_optionspagesassetswarningssummary
Optional top-level fields:
text_fidelity: page-level local pypdf-vs-Markdown text diagnostics when source text can be extracted or page mapping uncertainty needs to be recorded.
Required summary fields:
pages_processedwarning_countasset_countdisplay_formula_countinline_formula_countmath_render_error_count
Optional text fidelity summary fields:
text_fidelity_checked_page_counttext_fidelity_low_page_counttext_fidelity_unexpected_cjk_counttext_fidelity_replacement_candidate_page_counttext_fidelity_page_mapping_uncertain_count
Grouped page conversion records these engine_options entries:
chunk: original source PDF path, grouped output index, total grouped outputs, and original source page range.page_conversion:single_pagemode, MinerU input page count of 1, grouped output page count, and failed source page numbers.parts: aggregate report records for output Markdown part paths, source page ranges, status, warning counts, and failed source pages.output_folder: the PDF-stem output folder.
Warning records include:
codeseveritypage_indexbboxmessage
Stable warning code examples:
ENGINE_MISSINGGPU_UNAVAILABLELOW_CONFIDENCE_FORMULAMATH_RENDER_FAILEDMATH_RENDER_REPAIREDASSET_LINK_MISSINGREADING_ORDER_UNCERTAINSTRICT_LOCAL_VIOLATIONMINERU_CLI_FAILEDMINERU_PROFILE_ADJUSTEDTEXT_LAYER_AVAILABLETEXT_FIDELITY_LOWUNEXPECTED_CJK_IN_KOREAN_TEXTHANGUL_SPACING_SUSPECTTEXT_PAGE_MAPPING_UNCERTAIN
8. Quality Report
Every conversion writes <stem>/<stem>_report.md.
The report is derived from internal provenance and local quality checks. It contains:
- Source and output paths.
- Markdown part paths and source page ranges.
- MinerU version and execution mode.
- Pages processed.
- Warning count.
- Asset count and missing asset link count.
- Inline and display formula counts.
- Math render error count.
- Text fidelity summary when pypdf diagnostics are available.
- Pages with warnings.
- Final status:
success,partial, orfailed.
9. Local-Only Enforcement
The implementation must not upload PDFs, page images, or extracted text to remote services.
Strict-local mode is on by default. The MinerU adapter must not call cloud OCR APIs, hosted document parsing APIs, hosted LLM/VLM APIs, or remote model inference endpoints.
Local-only execution means direct mineru CLI execution in v1. MinerU 3.1.0's CLI-internal temporary local mineru-api process is allowed because it is local orchestration owned by the CLI invocation. User-specified API URLs, router mode, HTTP client backends, remote APIs, and remote OpenAI-compatible backends are not allowed.
Allowed network activity is limited to documentation, package/model installation initiated by setup commands, or tests explicitly marked as network tests and disabled by default.
10. Failure Policy
Conversion should continue automatically when possible.
- Low-confidence formulas are included as best effort and recorded as warnings.
- Low-confidence pages are included as best effort and recorded as warnings.
- Run MinerU with its default local CLI behavior first.
- If MinerU cannot run or returns a failure, report the failure clearly and do not silently switch backend.
- Conversion fails only when the input cannot be opened, MinerU cannot run, no usable output can be produced, output cannot be written, or strict-local policy is violated.
- CLI summaries must report warning counts clearly.