# Architecture: Local PDF-to-Markdown Converter Last updated: 2026-05-13 ## 1. Overview The system converts math-heavy digital PDFs into Obsidian-friendly Markdown using MinerU 3.1.0 as the fixed local conversion engine. Product requirements live in `PRD.md`; agent workflow rules live in `AGENTS.md`; research notes live in `docs/KNOWLEDGEBASE.md`. The architecture separates MinerU execution from project-owned normalization and internal provenance/reporting. This boundary exists only to isolate MinerU I/O; it is not a pluggable engine system. ## 2. System Layers 1. CLI/API layer - Parse command arguments and Python API parameters. - Discover input PDFs. - Plan output paths. - Enforce overwrite behavior. - Print conversion summaries. Optional local UI launcher sits above this layer and invokes the project-owned `pdf2md` CLI. It can run a selected folder by discovering direct-child PDFs and sequentially invoking existing `pdf2md convert` commands. It must not call MinerU directly, add a second conversion engine, run parallel GPU conversions by default, or expose remote/API runtime paths. 2. MinerU adapter layer - Validate MinerU 3.1.0 installation and version. - Run MinerU through direct local CLI execution. - Capture raw Markdown, structured output, assets, logs, and exit status. - Enforce strict-local execution. 3. Intermediate representation layer - Convert MinerU-specific output into project-owned document/page/block objects. - Preserve page index, bbox, confidence, source engine, and asset references returned by MinerU. - Prevent raw MinerU objects from becoming public API return types. 4. Normalization layer - Convert project-owned objects and MinerU Markdown into Obsidian-friendly Markdown. - Normalize math delimiters, display math spacing, headings, tables, and asset links. 5. Quality and reporting layer - Run link checks and math renderability checks with local tooling. - Aggregate structured warnings. - Build internal metadata-like records for reports and result summaries. - Write quality report Markdown and optional raw MinerU diagnostics. ## 3. Conversion Pipeline 1. Input discovery - Accept a single PDF or a directory. - Require `--recursive` for subdirectory traversal. - Validate that each selected input is a local PDF. - Compute source SHA-256. 2. MinerU conversion - Create an isolated work directory per input PDF. - Run the MinerU 3.1.0 adapter through the direct `mineru` CLI. - Capture raw Markdown, raw JSON/structured output when available, extracted assets, warnings, and logs. - When `--chunk-pages` is active, write one-page temporary PDFs, run MinerU once per source page, and group successful page Markdown into final outputs of the configured page count. 3. Intermediate representation - Build document/page/block records from MinerU output. - Preserve provenance data instead of relying only on final Markdown text. 4. Obsidian normalization - Normalize inline math to `$...$`. - Normalize display math to `$$...$$` blocks on separate lines. - Normalize image links to stable relative asset paths. - Normalize tables without destroying complex table structure. 5. Quality checks - Verify generated asset links. - Check math renderability when local tooling is available. - Compare local pypdf text-layer extraction with Markdown text where page mapping is credible. - Emit warnings without stopping conversion unless no usable output can be produced. 6. Output writing - Write final Markdown parts under `//`. - Write extracted assets under `//images/`. - Write one report at `//_report.md`. - Keep raw MinerU output when requested. - In grouped page conversion mode, write one public Markdown part per grouped page range and delete temporary one-page PDFs plus intermediate per-page outputs. ## 4. MinerU Adapter Contract The MinerU adapter exposes: - `name` - `is_available()` - `version()` - `doctor()` - `convert(input_pdf, work_dir, options)` Adapter conversion output contains: - `raw_markdown` - `raw_structured` - `assets` - `pages` - `warnings` - `engine` - `engine_version` - `engine_options` - `exit_code` - `stderr` The adapter must fail fast if it cannot run in strict-local mode. Runtime engine selection is not part of v1. The default conversion device is `cuda:0`. Because MinerU 3.1.0 selects its local device through environment/config rather than a dedicated CLI GPU flag, the adapter must set the MinerU subprocess environment to request CUDA by default while keeping the command shape direct and local. Runtime tuning is project-owned and strict-local: - `--gpu auto` selects the visible NVIDIA GPU with the largest VRAM from local `nvidia-smi` inventory. - `--mineru-profile auto` is the default. - Safe profile settings are used for GTX 1070 Ti 8GB, pre-Turing, low-VRAM GPUs, or unavailable inventory. - Stronger settings are used only for 16GB+ Turing-or-newer GPUs. - Tuning is applied only through allowlisted MinerU subprocess environment variables: `MINERU_PROCESSING_WINDOW_SIZE`, `MINERU_API_MAX_CONCURRENT_REQUESTS`, and `MINERU_PDF_RENDER_THREADS`. - The adapter must not add MinerU backend flags, API URLs, router mode, HTTP client backend use, remote OpenAI-compatible endpoints, or `MINERU_HYBRID_BATCH_RATIO`. Resolved profile details must be recorded in `engine_options["mineru_profile"]`, including requested profile, applied profile, environment values, and selected GPU details when known. Allowed MinerU execution in v1: - Direct local `mineru` CLI execution. - The temporary local `mineru-api` process that MinerU 3.1.0 starts internally when the CLI runs without `--api-url`. Prohibited MinerU execution in v1: - Passing `--api-url`. - Remote APIs. - Router mode. - HTTP client backends. - Remote OpenAI-compatible backends or inference endpoints. ## 5. Intermediate Representation The project uses a small internal representation for normalization and metadata. Required concepts: - Document - Page - Block - Asset - Warning Required block types: - `heading` - `paragraph` - `inline_formula` - `display_formula` - `table` - `figure` - `caption` - `footnote` - `reference` - `unknown` Record these page/block fields when MinerU returns them. Do not invent missing values. - Page index. - Page dimensions. - Bounding boxes. - Confidence. - Source engine, fixed to MinerU 3.1.0 in v1. - Markdown character span. ## 6. Markdown Normalization Final Markdown must prioritize Obsidian. - Use `$...$` for inline math. - Use display math blocks with `$$` on their own lines. - Keep blank lines around display math. - Do not escape underscores or carets inside math unnecessarily. - Prefer Markdown tables for simple tables. - Use HTML tables for complex tables when Markdown would lose structure. - Store figures/images in the stable `images/` directory under the PDF output folder. - Do not add visible page separators in v1; grouped page conversion may add invisible HTML comments such as `` for provenance. - Preserve captions and references when MinerU provides them. ## 7. Internal Provenance Schema New conversions do not write a public metadata JSON sidecar. The same schema shape remains useful internally for report generation, warning aggregation, and tests. Required top-level fields: - `source_pdf` - `source_sha256` - `created_at` - `engine` - `engine_version` - `engine_options` - `pages` - `assets` - `warnings` - `summary` Optional top-level fields: - `text_fidelity`: page-level local pypdf-vs-Markdown text diagnostics when source text can be extracted or page mapping uncertainty needs to be recorded. Required summary fields: - `pages_processed` - `warning_count` - `asset_count` - `display_formula_count` - `inline_formula_count` - `math_render_error_count` Optional text fidelity summary fields: - `text_fidelity_checked_page_count` - `text_fidelity_low_page_count` - `text_fidelity_unexpected_cjk_count` - `text_fidelity_replacement_candidate_page_count` - `text_fidelity_page_mapping_uncertain_count` Grouped page conversion records these `engine_options` entries: - `chunk`: original source PDF path, grouped output index, total grouped outputs, and original source page range. - `page_conversion`: `single_page` mode, MinerU input page count of 1, grouped output page count, and failed source page numbers. - `parts`: aggregate report records for output Markdown part paths, source page ranges, status, warning counts, and failed source pages. - `output_folder`: the PDF-stem output folder. Warning records include: - `code` - `severity` - `page_index` - `bbox` - `message` Stable warning code examples: - `ENGINE_MISSING` - `GPU_UNAVAILABLE` - `LOW_CONFIDENCE_FORMULA` - `MATH_RENDER_FAILED` - `MATH_RENDER_REPAIRED` - `ASSET_LINK_MISSING` - `READING_ORDER_UNCERTAIN` - `STRICT_LOCAL_VIOLATION` - `MINERU_CLI_FAILED` - `MINERU_PROFILE_ADJUSTED` - `TEXT_LAYER_AVAILABLE` - `TEXT_FIDELITY_LOW` - `UNEXPECTED_CJK_IN_KOREAN_TEXT` - `HANGUL_SPACING_SUSPECT` - `TEXT_PAGE_MAPPING_UNCERTAIN` ## 8. Quality Report Every conversion writes `/_report.md`. The report is derived from internal provenance and local quality checks. It contains: - Source and output paths. - Markdown part paths and source page ranges. - MinerU version and execution mode. - Pages processed. - Warning count. - Asset count and missing asset link count. - Inline and display formula counts. - Math render error count. - Text fidelity summary when pypdf diagnostics are available. - Pages with warnings. - Final status: `success`, `partial`, or `failed`. ## 9. Local-Only Enforcement The implementation must not upload PDFs, page images, or extracted text to remote services. Strict-local mode is on by default. The MinerU adapter must not call cloud OCR APIs, hosted document parsing APIs, hosted LLM/VLM APIs, or remote model inference endpoints. Local-only execution means direct `mineru` CLI execution in v1. MinerU 3.1.0's CLI-internal temporary local `mineru-api` process is allowed because it is local orchestration owned by the CLI invocation. User-specified API URLs, router mode, HTTP client backends, remote APIs, and remote OpenAI-compatible backends are not allowed. Allowed network activity is limited to documentation, package/model installation initiated by setup commands, or tests explicitly marked as network tests and disabled by default. ## 10. Failure Policy Conversion should continue automatically when possible. - Low-confidence formulas are included as best effort and recorded as warnings. - Low-confidence pages are included as best effort and recorded as warnings. - Run MinerU with its default local CLI behavior first. - If MinerU cannot run or returns a failure, report the failure clearly and do not silently switch backend. - Conversion fails only when the input cannot be opened, MinerU cannot run, no usable output can be produced, output cannot be written, or strict-local policy is violated. - CLI summaries must report warning counts clearly.