Files
PDFToMD/ARCHITECTURE.md
T
2026-05-08 16:42:19 +09:00

244 lines
7.6 KiB
Markdown

# Architecture: Local PDF-to-Markdown Converter
Last updated: 2026-05-07
## 1. Overview
The system converts math-heavy digital PDFs into Obsidian-friendly Markdown using MinerU 3.1.0 as the fixed local conversion engine. Product requirements live in `PRD.md`; agent workflow rules live in `AGENTS.md`; research notes live in `docs/KNOWLEDGEBASE.md`.
The architecture separates MinerU execution from project-owned normalization and metadata. This boundary exists only to isolate MinerU I/O; it is not a pluggable engine system.
## 2. System Layers
1. CLI/API layer
- Parse command arguments and Python API parameters.
- Discover input PDFs.
- Plan output paths.
- Enforce overwrite behavior.
- Print conversion summaries.
2. MinerU adapter layer
- Validate MinerU 3.1.0 installation and version.
- Run MinerU through direct local CLI execution.
- Capture raw Markdown, structured output, assets, logs, and exit status.
- Enforce strict-local execution.
3. Intermediate representation layer
- Convert MinerU-specific output into project-owned document/page/block objects.
- Preserve page index, bbox, confidence, source engine, and asset references returned by MinerU.
- Prevent raw MinerU objects from becoming public API return types.
4. Normalization layer
- Convert project-owned objects and MinerU Markdown into Obsidian-friendly Markdown.
- Normalize math delimiters, display math spacing, headings, tables, and asset links.
5. Quality and metadata layer
- Run link checks and math renderability checks with local tooling.
- Aggregate structured warnings.
- Write metadata JSON, quality report Markdown, and optional raw MinerU diagnostics.
## 3. Conversion Pipeline
1. Input discovery
- Accept a single PDF or a directory.
- Require `--recursive` for subdirectory traversal.
- Validate that each selected input is a local PDF.
- Compute source SHA-256.
2. MinerU conversion
- Create an isolated work directory per input PDF.
- Run the MinerU 3.1.0 adapter through the direct `mineru` CLI.
- Capture raw Markdown, raw JSON/structured output when available, extracted assets, warnings, and logs.
3. Intermediate representation
- Build document/page/block records from MinerU output.
- Preserve provenance data instead of relying only on final Markdown text.
4. Obsidian normalization
- Normalize inline math to `$...$`.
- Normalize display math to `$$...$$` blocks on separate lines.
- Normalize image links to stable relative asset paths.
- Normalize tables without destroying complex table structure.
5. Quality checks
- Verify generated asset links.
- Check math renderability when local tooling is available.
- Emit warnings without stopping conversion unless no usable output can be produced.
6. Output writing
- Write final Markdown.
- Write extracted assets.
- Write metadata JSON.
- Write `<stem>.report.md`.
- Keep raw MinerU output when requested.
## 4. MinerU Adapter Contract
The MinerU adapter exposes:
- `name`
- `is_available()`
- `version()`
- `doctor()`
- `convert(input_pdf, work_dir, options)`
Adapter conversion output contains:
- `raw_markdown`
- `raw_structured`
- `assets`
- `pages`
- `warnings`
- `engine`
- `engine_version`
- `engine_options`
- `exit_code`
- `stderr`
The adapter must fail fast if it cannot run in strict-local mode. Runtime engine selection is not part of v1.
The default conversion device is `cuda:0`. Because MinerU 3.1.0 selects its local device through environment/config rather than a dedicated CLI GPU flag, the adapter must set the MinerU subprocess environment to request CUDA by default while keeping the command shape direct and local.
Allowed MinerU execution in v1:
- Direct local `mineru` CLI execution.
- The temporary local `mineru-api` process that MinerU 3.1.0 starts internally when the CLI runs without `--api-url`.
Prohibited MinerU execution in v1:
- Passing `--api-url`.
- Remote APIs.
- Router mode.
- HTTP client backends.
- Remote OpenAI-compatible backends or inference endpoints.
## 5. Intermediate Representation
The project uses a small internal representation for normalization and metadata.
Required concepts:
- Document
- Page
- Block
- Asset
- Warning
Required block types:
- `heading`
- `paragraph`
- `inline_formula`
- `display_formula`
- `table`
- `figure`
- `caption`
- `footnote`
- `reference`
- `unknown`
Record these page/block fields when MinerU returns them. Do not invent missing values.
- Page index.
- Page dimensions.
- Bounding boxes.
- Confidence.
- Source engine, fixed to MinerU 3.1.0 in v1.
- Markdown character span.
## 6. Markdown Normalization
Final Markdown must prioritize Obsidian.
- Use `$...$` for inline math.
- Use display math blocks with `$$` on their own lines.
- Keep blank lines around display math.
- Do not escape underscores or carets inside math unnecessarily.
- Prefer Markdown tables for simple tables.
- Use HTML tables for complex tables when Markdown would lose structure.
- Store figures/images in a stable relative assets directory.
- Do not add visible page separators in v1.
- Preserve captions and references when MinerU provides them.
## 7. Metadata Schema
When metadata is enabled, write `<stem>.metadata.json`.
Required top-level fields:
- `source_pdf`
- `source_sha256`
- `created_at`
- `engine`
- `engine_version`
- `engine_options`
- `pages`
- `assets`
- `warnings`
- `summary`
Required summary fields:
- `pages_processed`
- `warning_count`
- `asset_count`
- `display_formula_count`
- `inline_formula_count`
- `math_render_error_count`
Warning records include:
- `code`
- `severity`
- `page_index`
- `bbox`
- `message`
Stable warning code examples:
- `ENGINE_MISSING`
- `GPU_UNAVAILABLE`
- `LOW_CONFIDENCE_FORMULA`
- `MATH_RENDER_FAILED`
- `ASSET_LINK_MISSING`
- `READING_ORDER_UNCERTAIN`
- `STRICT_LOCAL_VIOLATION`
- `MINERU_CLI_FAILED`
## 8. Quality Report
Every conversion writes `<stem>.report.md`.
The report is derived from metadata and local quality checks. It contains:
- Source and output paths.
- MinerU version and execution mode.
- Pages processed.
- Warning count.
- Asset count and missing asset link count.
- Inline and display formula counts.
- Math render error count.
- Pages with warnings.
- Final status: `success`, `partial`, or `failed`.
## 9. Local-Only Enforcement
The implementation must not upload PDFs, page images, or extracted text to remote services.
Strict-local mode is on by default. The MinerU adapter must not call cloud OCR APIs, hosted document parsing APIs, hosted LLM/VLM APIs, or remote model inference endpoints.
Local-only execution means direct `mineru` CLI execution in v1. MinerU 3.1.0's CLI-internal temporary local `mineru-api` process is allowed because it is local orchestration owned by the CLI invocation. User-specified API URLs, router mode, HTTP client backends, remote APIs, and remote OpenAI-compatible backends are not allowed.
Allowed network activity is limited to documentation, package/model installation initiated by setup commands, or tests explicitly marked as network tests and disabled by default.
## 10. Failure Policy
Conversion should continue automatically when possible.
- Low-confidence formulas are included as best effort and recorded as warnings.
- Low-confidence pages are included as best effort and recorded as warnings.
- Run MinerU with its default local CLI behavior first.
- If MinerU cannot run or returns a failure, report the failure clearly and do not silently switch backend.
- Conversion fails only when the input cannot be opened, MinerU cannot run, no usable output can be produced, output cannot be written, or strict-local policy is violated.
- CLI summaries must report warning counts clearly.