# Architecture: Local PDF-to-Markdown Converter
Last updated: 2026-05-13
## 1. Overview
The system converts math-heavy digital PDFs into Obsidian-friendly Markdown using MinerU 3.1.0 as the fixed local conversion engine. Product requirements live in `PRD.md`; agent workflow rules live in `AGENTS.md`; research notes live in `docs/KNOWLEDGEBASE.md`.
The architecture separates MinerU execution from project-owned normalization and internal provenance/reporting. This boundary exists only to isolate MinerU I/O; it is not a pluggable engine system.
## 2. System Layers
1. CLI/API layer
   - Parse command arguments and Python API parameters.
   - Discover input PDFs.
   - Plan output paths.
   - Enforce overwrite behavior.
   - Print conversion summaries.
An optional local UI launcher sits above this layer and invokes the project-owned `pdf2md` CLI. It can process a selected folder by discovering its direct-child PDFs and invoking the existing `pdf2md convert` command once per PDF, sequentially. It must not call MinerU directly, add a second conversion engine, run parallel GPU conversions by default, or expose remote/API runtime paths.
2. MinerU adapter layer
   - Validate MinerU 3.1.0 installation and version.
   - Run MinerU through direct local CLI execution.
   - Capture raw Markdown, structured output, assets, logs, and exit status.
   - Enforce strict-local execution.
3. Intermediate representation layer
   - Convert MinerU-specific output into project-owned document/page/block objects.
   - Preserve page index, bbox, confidence, source engine, and asset references returned by MinerU.
   - Prevent raw MinerU objects from becoming public API return types.
4. Normalization layer
   - Convert project-owned objects and MinerU Markdown into Obsidian-friendly Markdown.
   - Normalize math delimiters, display math spacing, headings, tables, and asset links.
5. Quality and reporting layer
   - Run link checks and math renderability checks with local tooling.
   - Aggregate structured warnings.
   - Build internal metadata-like records for reports and result summaries.
   - Write quality report Markdown and optional raw MinerU diagnostics.
## 3. Conversion Pipeline
1. Input discovery
   - Accept a single PDF or a directory.
   - Require `--recursive` for subdirectory traversal.
   - Validate that each selected input is a local PDF.
   - Compute source SHA-256.
2. MinerU conversion
   - Create an isolated work directory per input PDF.
   - Run the MinerU 3.1.0 adapter through the direct `mineru` CLI.
   - Capture raw Markdown, raw JSON/structured output when available, extracted assets, warnings, and logs.
   - When `--chunk-pages` is active, write one-page temporary PDFs, run MinerU once per source page, and group successful page Markdown into final outputs of the configured page count.
3. Intermediate representation
   - Build document/page/block records from MinerU output.
   - Preserve provenance data instead of relying only on final Markdown text.
4. Obsidian normalization
   - Normalize inline math to `$...$`.
   - Normalize display math to `$$...$$` blocks on separate lines.
   - Normalize image links to stable relative asset paths.
   - Normalize tables without destroying complex table structure.
5. Quality checks
   - Verify generated asset links.
   - Check math renderability when local tooling is available.
   - Compare local pypdf text-layer extraction with Markdown text where page mapping is credible.
   - Emit warnings without stopping conversion unless no usable output can be produced.
6. Output writing
   - Write final Markdown parts under `<output>/<stem>/`.
   - Write extracted assets under `<output>/<stem>/images/`.
   - Write one report at `<output>/<stem>/<stem>_report.md`.
   - Keep raw MinerU output when requested.
   - In grouped page conversion mode, write one public Markdown part per grouped page range and delete temporary one-page PDFs plus intermediate per-page outputs.
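
The input discovery and output planning steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: `PlannedInput`, `discover_pdfs`, `looks_like_pdf`, and `plan_outputs` are hypothetical names, and the `%PDF-` header check is an assumed way to validate that an input is a PDF.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PlannedInput:
    """Hypothetical record pairing one source PDF with its planned output paths."""

    source_pdf: Path
    source_sha256: str
    output_dir: Path   # <output>/<stem>/
    images_dir: Path   # <output>/<stem>/images/
    report_path: Path  # <output>/<stem>/<stem>_report.md


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a large PDF is never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_like_pdf(path: Path) -> bool:
    """Cheap local check (an assumption): .pdf extension plus the %PDF- header bytes."""
    if path.suffix.lower() != ".pdf" or not path.is_file():
        return False
    with path.open("rb") as handle:
        return handle.read(5) == b"%PDF-"


def discover_pdfs(target: Path, recursive: bool = False) -> list[Path]:
    """Accept one PDF or a directory; descend into subdirectories only when asked."""
    if target.is_dir():
        candidates = target.rglob("*") if recursive else target.glob("*")
    else:
        candidates = [target]
    return sorted(p for p in candidates if looks_like_pdf(p))


def plan_outputs(source_pdf: Path, output_root: Path) -> PlannedInput:
    """Derive the per-PDF output folder, image folder, and report path from the stem."""
    stem = source_pdf.stem
    output_dir = output_root / stem
    return PlannedInput(
        source_pdf=source_pdf,
        source_sha256=sha256_of(source_pdf),
        output_dir=output_dir,
        images_dir=output_dir / "images",
        report_path=output_dir / f"{stem}_report.md",
    )
```

Keeping discovery and planning free of side effects (nothing is created on disk here) would let overwrite checks run before any MinerU work starts.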
## 4. MinerU Adapter Contract
The MinerU adapter exposes:
- `name`
- `is_available()`
- `version()`
- `doctor()`
- `convert(input_pdf, work_dir, options)`
Adapter conversion output contains:
- `raw_markdown`
- `raw_structured`
- `assets`
- `pages`
- `warnings`
- `engine`
- `engine_version`
- `engine_options`
- `exit_code`
- `stderr`
The adapter must fail fast if it cannot run in strict-local mode. Runtime engine selection is not part of v1.
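
One way to express this contract in Python is a `Protocol` plus a result dataclass. The sketch below is illustrative only: the member and field names come from the lists above, but every type annotation, the `doctor()` return shape, the `"mineru"` engine name, and the `MinerUAdapterProtocol` and `StubAdapter` class names are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Protocol


@dataclass
class AdapterResult:
    """Field names follow the contract; the types are assumptions."""

    raw_markdown: str
    raw_structured: dict[str, Any] | None
    assets: list[Path]
    pages: list[dict[str, Any]]
    warnings: list[dict[str, Any]]
    engine: str
    engine_version: str
    engine_options: dict[str, Any] = field(default_factory=dict)
    exit_code: int = 0
    stderr: str = ""


class MinerUAdapterProtocol(Protocol):
    """Structural type for the single adapter; not a plug-in registry."""

    name: str

    def is_available(self) -> bool: ...
    def version(self) -> str | None: ...
    def doctor(self) -> list[str]: ...
    def convert(
        self, input_pdf: Path, work_dir: Path, options: dict[str, Any]
    ) -> AdapterResult: ...


class StubAdapter:
    """Test double that satisfies the protocol without launching MinerU."""

    name = "mineru"

    def is_available(self) -> bool:
        return True

    def version(self) -> str | None:
        return "3.1.0"

    def doctor(self) -> list[str]:
        # Hypothetical shape: human-readable problems, empty when healthy.
        return []

    def convert(
        self, input_pdf: Path, work_dir: Path, options: dict[str, Any]
    ) -> AdapterResult:
        return AdapterResult(
            raw_markdown=f"# {input_pdf.stem}\n",
            raw_structured=None,
            assets=[],
            pages=[],
            warnings=[],
            engine=self.name,
            engine_version="3.1.0",
            engine_options=dict(options),
        )
```

A stub like this lets normalization and reporting tests run without a GPU or a MinerU installation, which matches the stated purpose of the boundary: isolating MinerU I/O.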
The default conversion device is `cuda:0`. Because MinerU 3.1.0 selects its local device through environment/config rather than a dedicated CLI GPU flag, the adapter must set the MinerU subprocess environment to request CUDA by default while keeping the command shape direct and local.
Runtime tuning is project-owned and strict-local:
- `--gpu auto` selects the visible NVIDIA GPU with the largest VRAM from local `nvidia-smi` inventory.
- `--mineru-profile auto` is the default.
- Safe profile settings are used for the GTX 1070 Ti 8GB, other pre-Turing or low-VRAM GPUs, and whenever GPU inventory is unavailable.
- Stronger settings are used only for 16GB+ Turing-or-newer GPUs.
- Tuning is applied only through allowlisted MinerU subprocess environment variables: `MINERU_PROCESSING_WINDOW_SIZE`, `MINERU_API_MAX_CONCURRENT_REQUESTS`, and `MINERU_PDF_RENDER_THREADS`.
- The adapter must not add MinerU backend flags, API URLs, router mode, HTTP client backends, or remote OpenAI-compatible endpoints, and must not set `MINERU_HYBRID_BATCH_RATIO`.
Resolved profile details must be recorded in `engine_options["mineru_profile"]`, including requested profile, applied profile, environment values, and selected GPU details when known.
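
The tuning rules above can be sketched as three small pure functions. The environment variable names are the allowlisted ones from this section; everything else is an assumption for illustration, including the `GpuInfo` record, the `"safe"` and `"strong"` profile labels, the exact 16 GiB threshold (real inventories may report slightly less than the nominal size), and the use of CUDA compute capability 7.5 as the Turing boundary.

```python
from __future__ import annotations

from dataclasses import dataclass

# Names come from the allowlist above; no other variable may be set for tuning.
ALLOWED_TUNING_VARS = frozenset({
    "MINERU_PROCESSING_WINDOW_SIZE",
    "MINERU_API_MAX_CONCURRENT_REQUESTS",
    "MINERU_PDF_RENDER_THREADS",
})


@dataclass(frozen=True)
class GpuInfo:
    """Hypothetical inventory row, for example parsed from local nvidia-smi output."""

    index: int
    name: str
    vram_mib: int
    compute_capability: tuple[int, int]  # Turing GPUs report (7, 5)


def pick_gpu(inventory: list[GpuInfo]) -> GpuInfo | None:
    """`--gpu auto`: choose the visible GPU with the most VRAM, if any."""
    return max(inventory, key=lambda gpu: gpu.vram_mib, default=None)


def resolve_profile(gpu: GpuInfo | None, strong_min_vram_mib: int = 16 * 1024) -> str:
    """Fall back to the safe profile unless the GPU is Turing-or-newer with 16GB+."""
    if gpu is None:
        return "safe"  # inventory unavailable
    if gpu.compute_capability >= (7, 5) and gpu.vram_mib >= strong_min_vram_mib:
        return "strong"
    return "safe"


def build_env(base_env: dict[str, str], tuning: dict[str, str]) -> dict[str, str]:
    """Merge tuning values into the subprocess environment, rejecting unlisted names."""
    rejected = sorted(set(tuning) - ALLOWED_TUNING_VARS)
    if rejected:
        raise ValueError(f"tuning variables not on the allowlist: {rejected}")
    return {**base_env, **tuning}
```

Rejecting unlisted names, rather than silently dropping them, makes an attempt to set something like `MINERU_HYBRID_BATCH_RATIO` fail loudly during development.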
Allowed MinerU execution in v1:
- Direct local `mineru` CLI execution.
- The temporary local `mineru-api` process that MinerU 3.1.0 starts internally when the CLI runs without `--api-url`.
Prohibited MinerU execution in v1:
- Passing `--api-url`.
- Remote APIs.
- Router mode.
- HTTP client backends.
- Remote OpenAI-compatible backends or inference endpoints.
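
A pre-launch guard can enforce part of this list mechanically. The sketch below only checks that the executable is `mineru` and that `--api-url` is absent, because that is the only prohibited flag this document names; the exception class and function name are hypothetical, and the other prohibited modes would need their own checks.

```python
from __future__ import annotations

from pathlib import Path


class StrictLocalViolation(RuntimeError):
    """Raised before launch; corresponds to the STRICT_LOCAL_VIOLATION warning code."""

    code = "STRICT_LOCAL_VIOLATION"


def assert_strict_local(argv: list[str]) -> None:
    """Reject a command line that is not a direct, local `mineru` invocation."""
    # Path(...).stem tolerates a full path or a Windows `.exe` suffix.
    if not argv or Path(argv[0]).stem != "mineru":
        raise StrictLocalViolation("only direct `mineru` CLI execution is allowed")
    for arg in argv[1:]:
        # Covers both `--api-url URL` and `--api-url=URL`.
        if arg == "--api-url" or arg.startswith("--api-url="):
            raise StrictLocalViolation("`--api-url` is prohibited in v1")
```

Running the guard immediately before the subprocess call, on the final argument list, keeps the check close to the point where a violation would take effect.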
## 5. Intermediate Representation
The project uses a small internal representation for normalization and metadata.
Required concepts:
- Document
- Page
- Block
- Asset
- Warning
Required block types:
- `heading`
- `paragraph`
- `inline_formula`
- `display_formula`
- `table`
- `figure`
- `caption`
- `footnote`
- `reference`
- `unknown`
Record these page/block fields when MinerU returns them. Do not invent missing values.
- Page index.
- Page dimensions.
- Bounding boxes.
- Confidence.
- Source engine, fixed to MinerU 3.1.0 in v1.
- Markdown character span.
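
As an illustration of these concepts, the following sketch models blocks and pages with dataclasses and shows the "do not invent missing values" rule as `None` defaults. The block type values are the ones listed above; the class layout, field types, and the raw dictionary keys read by `block_from_raw` are assumptions, not MinerU's actual output schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class BlockType(str, Enum):
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    INLINE_FORMULA = "inline_formula"
    DISPLAY_FORMULA = "display_formula"
    TABLE = "table"
    FIGURE = "figure"
    CAPTION = "caption"
    FOOTNOTE = "footnote"
    REFERENCE = "reference"
    UNKNOWN = "unknown"


@dataclass
class Block:
    type: BlockType
    text: str = ""
    page_index: int | None = None
    bbox: tuple[float, ...] | None = None
    confidence: float | None = None
    markdown_span: tuple[int, int] | None = None
    source_engine: str = "MinerU 3.1.0"  # fixed in v1


@dataclass
class Page:
    index: int
    width: float | None = None
    height: float | None = None
    blocks: list[Block] = field(default_factory=list)


def block_from_raw(raw: dict[str, Any]) -> Block:
    """Map a raw record to a Block; absent values stay None instead of being guessed."""
    try:
        block_type = BlockType(raw.get("type"))
    except ValueError:
        block_type = BlockType.UNKNOWN  # unrecognized or missing type
    bbox = raw.get("bbox")
    span = raw.get("markdown_span")
    return Block(
        type=block_type,
        text=raw.get("text", ""),
        page_index=raw.get("page_index"),
        bbox=tuple(bbox) if bbox is not None else None,
        confidence=raw.get("confidence"),
        markdown_span=tuple(span) if span is not None else None,
    )
```

Because these are project-owned types, they can be returned from the public API without exposing raw MinerU objects.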
## 6. Markdown Normalization
Final Markdown must prioritize Obsidian.
- Use `$...$` for inline math.
- Use display math blocks with `$$` on their own lines.
- Keep blank lines around display math.
- Do not escape underscores or carets inside math unnecessarily.
- Prefer Markdown tables for simple tables.
- Use HTML tables for complex tables when Markdown would lose structure.
- Store figures/images in the stable `images/` directory under the PDF output folder.
- Do not add visible page separators in v1; grouped page conversion may add invisible HTML comments such as `<!-- source-page: 7 -->` for provenance.
- Preserve captions and references when MinerU provides them.
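
The math delimiter rules can be illustrated with a small regex pass. This is a rough sketch under a stated assumption: that the raw Markdown may contain LaTeX-style `\(...\)` and `\[...\]` delimiters, which this document does not confirm for MinerU 3.1.0. It also does not skip fenced code blocks, which a real normalizer would need to do. `normalize_math` is a hypothetical name.

```python
import re

# The lookbehind skips LaTeX line breaks with spacing such as `\\[2ex]`.
_DISPLAY = re.compile(r"[ \t]*(?<!\\)\\\[(.+?)\\\][ \t]*", re.DOTALL)
_INLINE = re.compile(r"(?<!\\)\\\((.+?)\\\)", re.DOTALL)


def normalize_math(markdown: str) -> str:
    """Rewrite \\(..\\) as $..$ and \\[..\\] as $$ blocks with blank lines around them."""

    def display(match: re.Match) -> str:
        # `$$` sits on its own line; the math body is left untouched (no escaping).
        return f"\n\n$$\n{match.group(1).strip()}\n$$\n\n"

    text = _DISPLAY.sub(display, markdown)
    text = _INLINE.sub(lambda m: f"${m.group(1).strip()}$", text)
    # Collapse runs of three or more newlines created by the display rewrite.
    return re.sub(r"\n{3,}", "\n\n", text)
```

Using replacement functions instead of replacement strings avoids `re` interpreting backslashes in the math body, so underscores, carets, and commands such as `\pi` pass through unchanged.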
## 7. Internal Provenance Schema
New conversions do not write a public metadata JSON sidecar. The same schema shape remains useful internally for report generation, warning aggregation, and tests.
Required top-level fields:
- `source_pdf`
- `source_sha256`
- `created_at`
- `engine`
- `engine_version`
- `engine_options`
- `pages`
- `assets`
- `warnings`
- `summary`
Optional top-level fields:
- `text_fidelity`: page-level local pypdf-vs-Markdown text diagnostics when source text can be extracted or page mapping uncertainty needs to be recorded.
Required summary fields:
- `pages_processed`
- `warning_count`
- `asset_count`
- `display_formula_count`
- `inline_formula_count`
- `math_render_error_count`
Optional text fidelity summary fields:
- `text_fidelity_checked_page_count`
- `text_fidelity_low_page_count`
- `text_fidelity_unexpected_cjk_count`
- `text_fidelity_replacement_candidate_page_count`
- `text_fidelity_page_mapping_uncertain_count`
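
To make the text fidelity diagnostics concrete, here is a hedged sketch of a per-page comparison. It is a stand-in, not the project's metric: it uses `difflib.SequenceMatcher` as the similarity measure, counts Han ideographs (U+4E00 to U+9FFF) that appear in the Markdown but not in the source text as "unexpected CJK", and takes plain strings so that the pypdf extraction step stays outside the function. `page_text_fidelity` and its result keys are hypothetical.

```python
from __future__ import annotations

import re
from difflib import SequenceMatcher

_HAN = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs block


def _collapse(text: str) -> str:
    """Normalize whitespace so line wrapping differences do not affect the score."""
    return " ".join(text.split())


def page_text_fidelity(source_text: str, markdown_text: str) -> dict[str, float | int | None]:
    """Compare one page of text-layer output with the Markdown mapped to that page."""
    source = _collapse(source_text)
    rendered = _collapse(markdown_text)
    # Without source text there is nothing to compare against, so report None.
    ratio = (
        SequenceMatcher(None, source, rendered, autojunk=False).ratio()
        if source
        else None
    )
    unexpected = [ch for ch in _HAN.findall(rendered) if ch not in source]
    return {"ratio": ratio, "unexpected_cjk_count": len(unexpected)}
```

`SequenceMatcher` is quadratic in the worst case, so a sketch like this is only reasonable per page, not across a whole document.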
Grouped page conversion records these `engine_options` entries:
- `chunk`: original source PDF path, grouped output index, total grouped outputs, and original source page range.
- `page_conversion`: `single_page` mode, MinerU input page count of 1, grouped output page count, and failed source page numbers.
- `parts`: aggregate report records for output Markdown part paths, source page ranges, status, warning counts, and failed source pages.
- `output_folder`: the PDF-stem output folder.
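
The grouping behind these records can be sketched as follows. The function takes per-page Markdown (with `None` for a page whose MinerU run failed) and produces one record per grouped output. The record keys, the 1-based page numbering, and the choice to skip failed pages in the Markdown body are assumptions; only the `<!-- source-page: N -->` comment format is taken from this document.

```python
from __future__ import annotations

import math


def group_pages(
    page_markdown: dict[int, str | None], pages_per_part: int
) -> list[dict[str, object]]:
    """Group single-page results into parts of `pages_per_part` source pages each."""
    if pages_per_part < 1:
        raise ValueError("pages_per_part must be at least 1")
    page_numbers = sorted(page_markdown)
    total_parts = math.ceil(len(page_numbers) / pages_per_part)
    parts: list[dict[str, object]] = []
    for part_index in range(total_parts):
        start = part_index * pages_per_part
        chunk = page_numbers[start:start + pages_per_part]
        body: list[str] = []
        failed: list[int] = []
        for number in chunk:
            text = page_markdown[number]
            if text is None:
                failed.append(number)  # recorded, but not written into the part
                continue
            # Invisible provenance comment, then the page's Markdown.
            body.append(f"<!-- source-page: {number} -->\n{text.strip()}")
        parts.append({
            "index": part_index + 1,
            "total": total_parts,
            "source_pages": (chunk[0], chunk[-1]),
            "failed_source_pages": failed,
            "markdown": "\n\n".join(body) + "\n",
        })
    return parts
```

With five source pages and `pages_per_part=2`, this yields three parts covering pages 1 to 2, 3 to 4, and 5.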
Warning records include:
- `code`
- `severity`
- `page_index`
- `bbox`
- `message`
Stable warning code examples:
- `ENGINE_MISSING`
- `GPU_UNAVAILABLE`
- `LOW_CONFIDENCE_FORMULA`
- `MATH_RENDER_FAILED`
- `MATH_RENDER_REPAIRED`
- `ASSET_LINK_MISSING`
- `READING_ORDER_UNCERTAIN`
- `STRICT_LOCAL_VIOLATION`
- `MINERU_CLI_FAILED`
- `MINERU_PROFILE_ADJUSTED`
- `TEXT_LAYER_AVAILABLE`
- `TEXT_FIDELITY_LOW`
- `UNEXPECTED_CJK_IN_KOREAN_TEXT`
- `HANGUL_SPACING_SUSPECT`
- `TEXT_PAGE_MAPPING_UNCERTAIN`
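
A warning record and its aggregation into the required summary fields might look like the sketch below. The field names match the lists above; the types, the `"warning"` severity value used in the usage note, and the rule that `math_render_error_count` equals the number of `MATH_RENDER_FAILED` warnings are assumptions.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversionWarning:
    code: str
    severity: str
    message: str
    page_index: int | None = None  # None when the warning is not tied to a page
    bbox: tuple[float, float, float, float] | None = None


def build_summary(
    warnings: list[ConversionWarning],
    *,
    pages_processed: int,
    asset_count: int,
    display_formula_count: int,
    inline_formula_count: int,
) -> dict[str, int]:
    """Assemble the required summary fields from counts and aggregated warnings."""
    by_code = Counter(w.code for w in warnings)
    return {
        "pages_processed": pages_processed,
        "warning_count": len(warnings),
        "asset_count": asset_count,
        "display_formula_count": display_formula_count,
        "inline_formula_count": inline_formula_count,
        # Assumption: each MATH_RENDER_FAILED warning is one render error.
        "math_render_error_count": by_code["MATH_RENDER_FAILED"],
    }


def pages_with_warnings(warnings: list[ConversionWarning]) -> list[int]:
    """Sorted, de-duplicated page indexes for the report's warning listing."""
    return sorted({w.page_index for w in warnings if w.page_index is not None})
```

For example, a `MATH_RENDER_FAILED` and a `LOW_CONFIDENCE_FORMULA` warning on the same page plus one page-less `GPU_UNAVAILABLE` warning give a `warning_count` of 3, a `math_render_error_count` of 1, and a single page in the warning listing.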
## 8. Quality Report
Every conversion writes one report at `<output>/<stem>/<stem>_report.md`.
The report is derived from internal provenance and local quality checks. It contains:
- Source and output paths.
- Markdown part paths and source page ranges.
- MinerU version and execution mode.
- Pages processed.
- Warning count.
- Asset count and missing asset link count.
- Inline and display formula counts.
- Math render error count.
- Text fidelity summary when pypdf diagnostics are available.
- Pages with warnings.
- Final status: `success`, `partial`, or `failed`.
## 9. Local-Only Enforcement
The implementation must not upload PDFs, page images, or extracted text to remote services.
Strict-local mode is on by default. The MinerU adapter must not call cloud OCR APIs, hosted document parsing APIs, hosted LLM/VLM APIs, or remote model inference endpoints.
Local-only execution means direct `mineru` CLI execution in v1. MinerU 3.1.0's CLI-internal temporary local `mineru-api` process is allowed because it is local orchestration owned by the CLI invocation. User-specified API URLs, router mode, HTTP client backends, remote APIs, and remote OpenAI-compatible backends are not allowed.
Allowed network activity is limited to documentation lookups, package or model installation initiated by setup commands, and tests explicitly marked as network tests, which are disabled by default.
## 10. Failure Policy
Conversion should continue automatically when possible.
- Low-confidence formulas are included as best effort and recorded as warnings.
- Low-confidence pages are included as best effort and recorded as warnings.
- Run MinerU with its default local CLI behavior first.
- If MinerU cannot run or returns a failure, report the failure clearly and do not silently switch backends.
- Conversion fails only when the input cannot be opened, MinerU cannot run, no usable output can be produced, output cannot be written, or strict-local policy is violated.
- CLI summaries must report warning counts clearly.
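
The policy above separates fatal conditions from warnings. A minimal sketch of that split, with loudly labeled assumptions: the `FatalReason` names are hypothetical labels for the five fatal conditions, the mapping of failed source pages to `partial` is a guess at how the report status is derived, and the summary line format is illustrative.

```python
from __future__ import annotations

from enum import Enum


class FatalReason(Enum):
    """Hypothetical labels for the only conditions that fail a conversion."""

    INPUT_UNREADABLE = "input cannot be opened"
    ENGINE_CANNOT_RUN = "MinerU cannot run"
    NO_USABLE_OUTPUT = "no usable output can be produced"
    OUTPUT_NOT_WRITABLE = "output cannot be written"
    STRICT_LOCAL_VIOLATION = "strict-local policy is violated"


def final_status(fatal: FatalReason | None, failed_pages: list[int]) -> str:
    """Warnings never change the status here; only fatal reasons and failed pages do."""
    if fatal is not None:
        return "failed"
    # Assumption: some pages failed but usable output exists, so report `partial`.
    return "partial" if failed_pages else "success"


def summary_line(stem: str, status: str, warning_count: int) -> str:
    """One CLI line per PDF that always states the warning count."""
    noun = "warning" if warning_count == 1 else "warnings"
    return f"{stem}: {status} ({warning_count} {noun})"
```

Keeping low-confidence formulas and pages out of the status decision mirrors the policy that they are included as best effort and surfaced only through warnings.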