288 lines
11 KiB
Markdown
288 lines
11 KiB
Markdown
# Architecture: Local PDF-to-Markdown Converter
|
|
|
|
Last updated: 2026-05-13
|
|
|
|
## 1. Overview
|
|
|
|
The system converts math-heavy digital PDFs into Obsidian-friendly Markdown using MinerU 3.1.0 as the fixed local conversion engine. Product requirements live in `PRD.md`; agent workflow rules live in `AGENTS.md`; research notes live in `docs/KNOWLEDGEBASE.md`.
|
|
|
|
The architecture separates MinerU execution from project-owned normalization and internal provenance/reporting. This boundary exists only to isolate MinerU I/O; it is not a pluggable engine system.
|
|
|
|
## 2. System Layers
|
|
|
|
1. CLI/API layer
|
|
- Parse command arguments and Python API parameters.
|
|
- Discover input PDFs.
|
|
- Plan output paths.
|
|
- Enforce overwrite behavior.
|
|
- Print conversion summaries.
|
|
|
|
Optional local UI launcher sits above this layer and invokes the project-owned `pdf2md` CLI. It can run a selected folder by discovering direct-child PDFs and sequentially invoking existing `pdf2md convert` commands. It must not call MinerU directly, add a second conversion engine, run parallel GPU conversions by default, or expose remote/API runtime paths.
|
|
|
|
2. MinerU adapter layer
|
|
- Validate MinerU 3.1.0 installation and version.
|
|
- Run MinerU through direct local CLI execution.
|
|
- Capture raw Markdown, structured output, assets, logs, and exit status.
|
|
- Enforce strict-local execution.
|
|
|
|
3. Intermediate representation layer
|
|
- Convert MinerU-specific output into project-owned document/page/block objects.
|
|
- Preserve page index, bbox, confidence, source engine, and asset references returned by MinerU.
|
|
- Prevent raw MinerU objects from becoming public API return types.
|
|
|
|
4. Normalization layer
|
|
- Convert project-owned objects and MinerU Markdown into Obsidian-friendly Markdown.
|
|
- Normalize math delimiters, display math spacing, headings, tables, and asset links.
|
|
|
|
5. Quality and reporting layer
|
|
- Run link checks and math renderability checks with local tooling.
|
|
- Aggregate structured warnings.
|
|
- Build internal metadata-like records for reports and result summaries.
|
|
- Write quality report Markdown and optional raw MinerU diagnostics.
|
|
|
|
## 3. Conversion Pipeline
|
|
|
|
1. Input discovery
|
|
- Accept a single PDF or a directory.
|
|
- Require `--recursive` for subdirectory traversal.
|
|
- Validate that each selected input is a local PDF.
|
|
- Compute source SHA-256.
|
|
|
|
2. MinerU conversion
|
|
- Create an isolated work directory per input PDF.
|
|
- Run the MinerU 3.1.0 adapter through the direct `mineru` CLI.
|
|
- Capture raw Markdown, raw JSON/structured output when available, extracted assets, warnings, and logs.
|
|
- When `--chunk-pages` is active, write one-page temporary PDFs, run MinerU once per source page, and group successful page Markdown into final outputs of the configured page count.
|
|
|
|
3. Intermediate representation
|
|
- Build document/page/block records from MinerU output.
|
|
- Preserve provenance data instead of relying only on final Markdown text.
|
|
|
|
4. Obsidian normalization
|
|
- Normalize inline math to `$...$`.
|
|
- Normalize display math to `$$...$$` blocks on separate lines.
|
|
- Normalize image links to stable relative asset paths.
|
|
- Normalize tables without destroying complex table structure.
|
|
|
|
5. Quality checks
|
|
- Verify generated asset links.
|
|
- Check math renderability when local tooling is available.
|
|
- Compare local pypdf text-layer extraction with Markdown text where page mapping is credible.
|
|
- Emit warnings without stopping conversion unless no usable output can be produced.
|
|
|
|
6. Output writing
|
|
- Write final Markdown parts under `<output>/<stem>/`.
|
|
- Write extracted assets under `<output>/<stem>/images/`.
|
|
- Write one report at `<output>/<stem>/<stem>_report.md`.
|
|
- Keep raw MinerU output when requested.
|
|
- In grouped page conversion mode, write one public Markdown part per grouped page range and delete temporary one-page PDFs plus intermediate per-page outputs.
|
|
|
|
## 4. MinerU Adapter Contract
|
|
|
|
The MinerU adapter exposes:
|
|
|
|
- `name`
|
|
- `is_available()`
|
|
- `version()`
|
|
- `doctor()`
|
|
- `convert(input_pdf, work_dir, options)`
|
|
|
|
Adapter conversion output contains:
|
|
|
|
- `raw_markdown`
|
|
- `raw_structured`
|
|
- `assets`
|
|
- `pages`
|
|
- `warnings`
|
|
- `engine`
|
|
- `engine_version`
|
|
- `engine_options`
|
|
- `exit_code`
|
|
- `stderr`
|
|
|
|
The adapter must fail fast if it cannot run in strict-local mode. Runtime engine selection is not part of v1.
|
|
|
|
The default conversion device is `cuda:0`. Because MinerU 3.1.0 selects its local device through environment/config rather than a dedicated CLI GPU flag, the adapter must set the MinerU subprocess environment to request CUDA by default while keeping the command shape direct and local.
|
|
|
|
Runtime tuning is project-owned and strict-local:
|
|
|
|
- `--gpu auto` selects the visible NVIDIA GPU with the largest VRAM from local `nvidia-smi` inventory.
|
|
- `--mineru-profile auto` is the default.
|
|
- Safe profile settings are used for GTX 1070 Ti 8GB, pre-Turing, low-VRAM GPUs, or unavailable inventory.
|
|
- Stronger settings are used only for 16GB+ Turing-or-newer GPUs.
|
|
- Tuning is applied only through allowlisted MinerU subprocess environment variables: `MINERU_PROCESSING_WINDOW_SIZE`, `MINERU_API_MAX_CONCURRENT_REQUESTS`, and `MINERU_PDF_RENDER_THREADS`.
|
|
- The adapter must not add MinerU backend flags, API URLs, router mode, HTTP client backend use, remote OpenAI-compatible endpoints, or `MINERU_HYBRID_BATCH_RATIO`.
|
|
|
|
Resolved profile details must be recorded in `engine_options["mineru_profile"]`, including requested profile, applied profile, environment values, and selected GPU details when known.
|
|
|
|
Allowed MinerU execution in v1:
|
|
|
|
- Direct local `mineru` CLI execution.
|
|
- The temporary local `mineru-api` process that MinerU 3.1.0 starts internally when the CLI runs without `--api-url`.
|
|
|
|
Prohibited MinerU execution in v1:
|
|
|
|
- Passing `--api-url`.
|
|
- Remote APIs.
|
|
- Router mode.
|
|
- HTTP client backends.
|
|
- Remote OpenAI-compatible backends or inference endpoints.
|
|
|
|
## 5. Intermediate Representation
|
|
|
|
The project uses a small internal representation for normalization and metadata.
|
|
|
|
Required concepts:
|
|
|
|
- Document
|
|
- Page
|
|
- Block
|
|
- Asset
|
|
- Warning
|
|
|
|
Required block types:
|
|
|
|
- `heading`
|
|
- `paragraph`
|
|
- `inline_formula`
|
|
- `display_formula`
|
|
- `table`
|
|
- `figure`
|
|
- `caption`
|
|
- `footnote`
|
|
- `reference`
|
|
- `unknown`
|
|
|
|
Record these page/block fields when MinerU returns them. Do not invent missing values.
|
|
|
|
- Page index.
|
|
- Page dimensions.
|
|
- Bounding boxes.
|
|
- Confidence.
|
|
- Source engine, fixed to MinerU 3.1.0 in v1.
|
|
- Markdown character span.
|
|
|
|
## 6. Markdown Normalization
|
|
|
|
Final Markdown must prioritize Obsidian.
|
|
|
|
- Use `$...$` for inline math.
|
|
- Use display math blocks with `$$` on their own lines.
|
|
- Keep blank lines around display math.
|
|
- Do not escape underscores or carets inside math unnecessarily.
|
|
- Prefer Markdown tables for simple tables.
|
|
- Use HTML tables for complex tables when Markdown would lose structure.
|
|
- Store figures/images in the stable `images/` directory under the PDF output folder.
|
|
- Do not add visible page separators in v1; grouped page conversion may add invisible HTML comments such as `<!-- source-page: 7 -->` for provenance.
|
|
- Preserve captions and references when MinerU provides them.
|
|
|
|
## 7. Internal Provenance Schema
|
|
|
|
New conversions do not write a public metadata JSON sidecar. The same schema shape remains useful internally for report generation, warning aggregation, and tests.
|
|
|
|
Required top-level fields:
|
|
|
|
- `source_pdf`
|
|
- `source_sha256`
|
|
- `created_at`
|
|
- `engine`
|
|
- `engine_version`
|
|
- `engine_options`
|
|
- `pages`
|
|
- `assets`
|
|
- `warnings`
|
|
- `summary`
|
|
|
|
Optional top-level fields:
|
|
|
|
- `text_fidelity`: page-level local pypdf-vs-Markdown text diagnostics when source text can be extracted or page mapping uncertainty needs to be recorded.
|
|
|
|
Required summary fields:
|
|
|
|
- `pages_processed`
|
|
- `warning_count`
|
|
- `asset_count`
|
|
- `display_formula_count`
|
|
- `inline_formula_count`
|
|
- `math_render_error_count`
|
|
|
|
Optional text fidelity summary fields:
|
|
|
|
- `text_fidelity_checked_page_count`
|
|
- `text_fidelity_low_page_count`
|
|
- `text_fidelity_unexpected_cjk_count`
|
|
- `text_fidelity_replacement_candidate_page_count`
|
|
- `text_fidelity_page_mapping_uncertain_count`
|
|
|
|
Grouped page conversion records these `engine_options` entries:
|
|
|
|
- `chunk`: original source PDF path, grouped output index, total grouped outputs, and original source page range.
|
|
- `page_conversion`: `single_page` mode, MinerU input page count of 1, grouped output page count, and failed source page numbers.
|
|
- `parts`: aggregate report records for output Markdown part paths, source page ranges, status, warning counts, and failed source pages.
|
|
- `output_folder`: the PDF-stem output folder.
|
|
|
|
Warning records include:
|
|
|
|
- `code`
|
|
- `severity`
|
|
- `page_index`
|
|
- `bbox`
|
|
- `message`
|
|
|
|
Stable warning code examples:
|
|
|
|
- `ENGINE_MISSING`
|
|
- `GPU_UNAVAILABLE`
|
|
- `LOW_CONFIDENCE_FORMULA`
|
|
- `MATH_RENDER_FAILED`
|
|
- `MATH_RENDER_REPAIRED`
|
|
- `ASSET_LINK_MISSING`
|
|
- `READING_ORDER_UNCERTAIN`
|
|
- `STRICT_LOCAL_VIOLATION`
|
|
- `MINERU_CLI_FAILED`
|
|
- `MINERU_PROFILE_ADJUSTED`
|
|
- `TEXT_LAYER_AVAILABLE`
|
|
- `TEXT_FIDELITY_LOW`
|
|
- `UNEXPECTED_CJK_IN_KOREAN_TEXT`
|
|
- `HANGUL_SPACING_SUSPECT`
|
|
- `TEXT_PAGE_MAPPING_UNCERTAIN`
|
|
|
|
## 8. Quality Report
|
|
|
|
Every conversion writes `<stem>/<stem>_report.md`.
|
|
|
|
The report is derived from internal provenance and local quality checks. It contains:
|
|
|
|
- Source and output paths.
|
|
- Markdown part paths and source page ranges.
|
|
- MinerU version and execution mode.
|
|
- Pages processed.
|
|
- Warning count.
|
|
- Asset count and missing asset link count.
|
|
- Inline and display formula counts.
|
|
- Math render error count.
|
|
- Text fidelity summary when pypdf diagnostics are available.
|
|
- Pages with warnings.
|
|
- Final status: `success`, `partial`, or `failed`.
|
|
|
|
## 9. Local-Only Enforcement
|
|
|
|
The implementation must not upload PDFs, page images, or extracted text to remote services.
|
|
|
|
Strict-local mode is on by default. The MinerU adapter must not call cloud OCR APIs, hosted document parsing APIs, hosted LLM/VLM APIs, or remote model inference endpoints.
|
|
|
|
Local-only execution means direct `mineru` CLI execution in v1. MinerU 3.1.0's CLI-internal temporary local `mineru-api` process is allowed because it is local orchestration owned by the CLI invocation. User-specified API URLs, router mode, HTTP client backends, remote APIs, and remote OpenAI-compatible backends are not allowed.
|
|
|
|
Allowed network activity is limited to documentation, package/model installation initiated by setup commands, or tests explicitly marked as network tests and disabled by default.
|
|
|
|
## 10. Failure Policy
|
|
|
|
Conversion should continue automatically when possible.
|
|
|
|
- Low-confidence formulas are included as best effort and recorded as warnings.
|
|
- Low-confidence pages are included as best effort and recorded as warnings.
|
|
- Run MinerU with its default local CLI behavior first.
|
|
- If MinerU cannot run or returns a failure, report the failure clearly and do not silently switch backend.
|
|
- Conversion fails only when the input cannot be opened, MinerU cannot run, no usable output can be produced, output cannot be written, or strict-local policy is violated.
|
|
- CLI summaries must report warning counts clearly.
|