Files
PDFToMD/ARCHITECTURE.md
T
2026-05-14 10:16:59 +09:00

11 KiB

Architecture: Local PDF-to-Markdown Converter

Last updated: 2026-05-13

1. Overview

The system converts math-heavy digital PDFs into Obsidian-friendly Markdown using MinerU 3.1.0 as the fixed local conversion engine. Product requirements live in PRD.md; agent workflow rules live in AGENTS.md; research notes live in docs/KNOWLEDGEBASE.md.

The architecture separates MinerU execution from project-owned normalization and internal provenance/reporting. This boundary exists only to isolate MinerU I/O; it is not a pluggable engine system.

2. System Layers

  1. CLI/API layer
    • Parse command arguments and Python API parameters.
    • Discover input PDFs.
    • Plan output paths.
    • Enforce overwrite behavior.
    • Print conversion summaries.

Optional local UI launcher sits above this layer and invokes the project-owned pdf2md CLI. It can run a selected folder by discovering direct-child PDFs and sequentially invoking existing pdf2md convert commands. It must not call MinerU directly, add a second conversion engine, run parallel GPU conversions by default, or expose remote/API runtime paths.

  1. MinerU adapter layer

    • Validate MinerU 3.1.0 installation and version.
    • Run MinerU through direct local CLI execution.
    • Capture raw Markdown, structured output, assets, logs, and exit status.
    • Enforce strict-local execution.
  2. Intermediate representation layer

    • Convert MinerU-specific output into project-owned document/page/block objects.
    • Preserve page index, bbox, confidence, source engine, and asset references returned by MinerU.
    • Prevent raw MinerU objects from becoming public API return types.
  3. Normalization layer

    • Convert project-owned objects and MinerU Markdown into Obsidian-friendly Markdown.
    • Normalize math delimiters, display math spacing, headings, tables, and asset links.
  4. Quality and reporting layer

    • Run link checks and math renderability checks with local tooling.
    • Aggregate structured warnings.
    • Build internal metadata-like records for reports and result summaries.
    • Write quality report Markdown and optional raw MinerU diagnostics.

3. Conversion Pipeline

  1. Input discovery

    • Accept a single PDF or a directory.
    • Require --recursive for subdirectory traversal.
    • Validate that each selected input is a local PDF.
    • Compute source SHA-256.
  2. MinerU conversion

    • Create an isolated work directory per input PDF.
    • Run the MinerU 3.1.0 adapter through the direct mineru CLI.
    • Capture raw Markdown, raw JSON/structured output when available, extracted assets, warnings, and logs.
    • When --chunk-pages is active, write one-page temporary PDFs, run MinerU once per source page, and group successful page Markdown into final outputs of the configured page count.
  3. Intermediate representation

    • Build document/page/block records from MinerU output.
    • Preserve provenance data instead of relying only on final Markdown text.
  4. Obsidian normalization

    • Normalize inline math to $...$.
    • Normalize display math to $$...$$ blocks on separate lines.
    • Normalize image links to stable relative asset paths.
    • Normalize tables without destroying complex table structure.
  5. Quality checks

    • Verify generated asset links.
    • Check math renderability when local tooling is available.
    • Compare local pypdf text-layer extraction with Markdown text where page mapping is credible.
    • Emit warnings without stopping conversion unless no usable output can be produced.
  6. Output writing

    • Write final Markdown parts under <output>/<stem>/.
    • Write extracted assets under <output>/<stem>/images/.
    • Write one report at <output>/<stem>/<stem>_report.md.
    • Keep raw MinerU output when requested.
    • In grouped page conversion mode, write one public Markdown part per grouped page range and delete temporary one-page PDFs plus intermediate per-page outputs.

4. MinerU Adapter Contract

The MinerU adapter exposes:

  • name
  • is_available()
  • version()
  • doctor()
  • convert(input_pdf, work_dir, options)

Adapter conversion output contains:

  • raw_markdown
  • raw_structured
  • assets
  • pages
  • warnings
  • engine
  • engine_version
  • engine_options
  • exit_code
  • stderr

The adapter must fail fast if it cannot run in strict-local mode. Runtime engine selection is not part of v1.

The default conversion device is cuda:0. Because MinerU 3.1.0 selects its local device through environment/config rather than a dedicated CLI GPU flag, the adapter must set the MinerU subprocess environment to request CUDA by default while keeping the command shape direct and local.

Runtime tuning is project-owned and strict-local:

  • --gpu auto selects the visible NVIDIA GPU with the largest VRAM from local nvidia-smi inventory.
  • --mineru-profile auto is the default.
  • Safe profile settings are used for GTX 1070 Ti 8GB, pre-Turing, low-VRAM GPUs, or unavailable inventory.
  • Stronger settings are used only for 16GB+ Turing-or-newer GPUs.
  • Tuning is applied only through allowlisted MinerU subprocess environment variables: MINERU_PROCESSING_WINDOW_SIZE, MINERU_API_MAX_CONCURRENT_REQUESTS, and MINERU_PDF_RENDER_THREADS.
  • The adapter must not add MinerU backend flags, API URLs, router mode, HTTP client backend use, remote OpenAI-compatible endpoints, or MINERU_HYBRID_BATCH_RATIO.

Resolved profile details must be recorded in engine_options["mineru_profile"], including requested profile, applied profile, environment values, and selected GPU details when known.

Allowed MinerU execution in v1:

  • Direct local mineru CLI execution.
  • The temporary local mineru-api process that MinerU 3.1.0 starts internally when the CLI runs without --api-url.

Prohibited MinerU execution in v1:

  • Passing --api-url.
  • Remote APIs.
  • Router mode.
  • HTTP client backends.
  • Remote OpenAI-compatible backends or inference endpoints.

5. Intermediate Representation

The project uses a small internal representation for normalization and metadata.

Required concepts:

  • Document
  • Page
  • Block
  • Asset
  • Warning

Required block types:

  • heading
  • paragraph
  • inline_formula
  • display_formula
  • table
  • figure
  • caption
  • footnote
  • reference
  • unknown

Record these page/block fields when MinerU returns them. Do not invent missing values.

  • Page index.
  • Page dimensions.
  • Bounding boxes.
  • Confidence.
  • Source engine, fixed to MinerU 3.1.0 in v1.
  • Markdown character span.

6. Markdown Normalization

Final Markdown must prioritize Obsidian.

  • Use $...$ for inline math.
  • Use display math blocks with $$ on their own lines.
  • Keep blank lines around display math.
  • Do not escape underscores or carets inside math unnecessarily.
  • Prefer Markdown tables for simple tables.
  • Use HTML tables for complex tables when Markdown would lose structure.
  • Store figures/images in the stable images/ directory under the PDF output folder.
  • Do not add visible page separators in v1; grouped page conversion may add invisible HTML comments such as <!-- source-page: 7 --> for provenance.
  • Preserve captions and references when MinerU provides them.

7. Internal Provenance Schema

New conversions do not write a public metadata JSON sidecar. The same schema shape remains useful internally for report generation, warning aggregation, and tests.

Required top-level fields:

  • source_pdf
  • source_sha256
  • created_at
  • engine
  • engine_version
  • engine_options
  • pages
  • assets
  • warnings
  • summary

Optional top-level fields:

  • text_fidelity: page-level local pypdf-vs-Markdown text diagnostics when source text can be extracted or page mapping uncertainty needs to be recorded.

Required summary fields:

  • pages_processed
  • warning_count
  • asset_count
  • display_formula_count
  • inline_formula_count
  • math_render_error_count

Optional text fidelity summary fields:

  • text_fidelity_checked_page_count
  • text_fidelity_low_page_count
  • text_fidelity_unexpected_cjk_count
  • text_fidelity_replacement_candidate_page_count
  • text_fidelity_page_mapping_uncertain_count

Grouped page conversion records these engine_options entries:

  • chunk: original source PDF path, grouped output index, total grouped outputs, and original source page range.
  • page_conversion: single_page mode, MinerU input page count of 1, grouped output page count, and failed source page numbers.
  • parts: aggregate report records for output Markdown part paths, source page ranges, status, warning counts, and failed source pages.
  • output_folder: the PDF-stem output folder.

Warning records include:

  • code
  • severity
  • page_index
  • bbox
  • message

Stable warning code examples:

  • ENGINE_MISSING
  • GPU_UNAVAILABLE
  • LOW_CONFIDENCE_FORMULA
  • MATH_RENDER_FAILED
  • MATH_RENDER_REPAIRED
  • ASSET_LINK_MISSING
  • READING_ORDER_UNCERTAIN
  • STRICT_LOCAL_VIOLATION
  • MINERU_CLI_FAILED
  • MINERU_PROFILE_ADJUSTED
  • TEXT_LAYER_AVAILABLE
  • TEXT_FIDELITY_LOW
  • UNEXPECTED_CJK_IN_KOREAN_TEXT
  • HANGUL_SPACING_SUSPECT
  • TEXT_PAGE_MAPPING_UNCERTAIN

8. Quality Report

Every conversion writes <stem>/<stem>_report.md.

The report is derived from internal provenance and local quality checks. It contains:

  • Source and output paths.
  • Markdown part paths and source page ranges.
  • MinerU version and execution mode.
  • Pages processed.
  • Warning count.
  • Asset count and missing asset link count.
  • Inline and display formula counts.
  • Math render error count.
  • Text fidelity summary when pypdf diagnostics are available.
  • Pages with warnings.
  • Final status: success, partial, or failed.

9. Local-Only Enforcement

The implementation must not upload PDFs, page images, or extracted text to remote services.

Strict-local mode is on by default. The MinerU adapter must not call cloud OCR APIs, hosted document parsing APIs, hosted LLM/VLM APIs, or remote model inference endpoints.

Local-only execution means direct mineru CLI execution in v1. MinerU 3.1.0's CLI-internal temporary local mineru-api process is allowed because it is local orchestration owned by the CLI invocation. User-specified API URLs, router mode, HTTP client backends, remote APIs, and remote OpenAI-compatible backends are not allowed.

Allowed network activity is limited to documentation, package/model installation initiated by setup commands, or tests explicitly marked as network tests and disabled by default.

10. Failure Policy

Conversion should continue automatically when possible.

  • Low-confidence formulas are included as best effort and recorded as warnings.
  • Low-confidence pages are included as best effort and recorded as warnings.
  • Run MinerU with its default local CLI behavior first.
  • If MinerU cannot run or returns a failure, report the failure clearly and do not silently switch backend.
  • Conversion fails only when the input cannot be opened, MinerU cannot run, no usable output can be produced, output cannot be written, or strict-local policy is violated.
  • CLI summaries must report warning counts clearly.