modify pdftomd

김경종
2026-05-14 10:16:59 +09:00
parent 2232b51fc9
commit dc11880140
69 changed files with 7784 additions and 1150 deletions
+57 -14
@@ -1,12 +1,12 @@
# Architecture: Local PDF-to-Markdown Converter
Last updated: 2026-05-07
Last updated: 2026-05-13
## 1. Overview
The system converts math-heavy digital PDFs into Obsidian-friendly Markdown using MinerU 3.1.0 as the fixed local conversion engine. Product requirements live in `PRD.md`; agent workflow rules live in `AGENTS.md`; research notes live in `docs/KNOWLEDGEBASE.md`.
The architecture separates MinerU execution from project-owned normalization and metadata. This boundary exists only to isolate MinerU I/O; it is not a pluggable engine system.
The architecture separates MinerU execution from project-owned normalization and internal provenance/reporting. This boundary exists only to isolate MinerU I/O; it is not a pluggable engine system.
## 2. System Layers
@@ -17,6 +17,8 @@ The architecture separates MinerU execution from project-owned normalization and
- Enforce overwrite behavior.
- Print conversion summaries.
Optional local UI launcher sits above this layer and invokes the project-owned `pdf2md` CLI. It can run a selected folder by discovering direct-child PDFs and sequentially invoking existing `pdf2md convert` commands. It must not call MinerU directly, add a second conversion engine, run parallel GPU conversions by default, or expose remote/API runtime paths.
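A minimal sketch of the folder run described above: it lists only direct-child PDFs and invokes the CLI one file at a time. The function names are hypothetical, and passing the PDF path as a positional argument to `pdf2md convert` is an assumption; the real argument shape may differ.

```python
import subprocess
from pathlib import Path


def discover_direct_child_pdfs(folder: Path) -> list[Path]:
    """Return PDFs directly inside `folder` (no recursion), sorted by path."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def build_convert_command(pdf: Path) -> list[str]:
    # Assumption: the input PDF is a positional argument of `pdf2md convert`.
    return ["pdf2md", "convert", str(pdf)]


def run_folder(folder: Path, runner=subprocess.run) -> list[int]:
    """Convert each PDF sequentially and collect the CLI return codes."""
    return_codes = []
    for pdf in discover_direct_child_pdfs(folder):
        # One subprocess at a time: no parallel GPU conversions.
        completed = runner(build_convert_command(pdf), check=False)
        return_codes.append(completed.returncode)
    return return_codes
```

The launcher only ever builds `pdf2md` commands, so the MinerU boundary stays inside the CLI and adapter layers.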
2. MinerU adapter layer
- Validate MinerU 3.1.0 installation and version.
- Run MinerU through direct local CLI execution.
@@ -32,10 +34,11 @@ The architecture separates MinerU execution from project-owned normalization and
- Convert project-owned objects and MinerU Markdown into Obsidian-friendly Markdown.
- Normalize math delimiters, display math spacing, headings, tables, and asset links.
5. Quality and metadata layer
5. Quality and reporting layer
- Run link checks and math renderability checks with local tooling.
- Aggregate structured warnings.
- Write metadata JSON, quality report Markdown, and optional raw MinerU diagnostics.
- Build internal metadata-like records for reports and result summaries.
- Write quality report Markdown and optional raw MinerU diagnostics.
## 3. Conversion Pipeline
@@ -49,6 +52,7 @@ The architecture separates MinerU execution from project-owned normalization and
- Create an isolated work directory per input PDF.
- Run the MinerU 3.1.0 adapter through the direct `mineru` CLI.
- Capture raw Markdown, raw JSON/structured output when available, extracted assets, warnings, and logs.
- When `--chunk-pages` is active, write one-page temporary PDFs, run MinerU once per source page, and group successful page Markdown into final outputs of the configured page count.
3. Intermediate representation
- Build document/page/block records from MinerU output.
@@ -63,14 +67,15 @@ The architecture separates MinerU execution from project-owned normalization and
5. Quality checks
- Verify generated asset links.
- Check math renderability when local tooling is available.
- Compare local pypdf text-layer extraction with Markdown text where page mapping is credible.
- Emit warnings without stopping conversion unless no usable output can be produced.
6. Output writing
- Write final Markdown.
- Write extracted assets.
- Write metadata JSON.
- Write `<stem>.report.md`.
- Write final Markdown parts under `<output>/<stem>/`.
- Write extracted assets under `<output>/<stem>/images/`.
- Write one report at `<output>/<stem>/<stem>_report.md`.
- Keep raw MinerU output when requested.
- In grouped page conversion mode, write one public Markdown part per grouped page range and delete temporary one-page PDFs plus intermediate per-page outputs.
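The grouping and output layout above can be sketched as follows. This is an illustration under stated assumptions: `group_pages` and `output_layout` are hypothetical helper names, failed-page handling is left out, and the naming of individual Markdown part files is not specified here, so it is omitted.

```python
from pathlib import Path


def group_pages(page_count: int, chunk_pages: int) -> list[tuple[int, int]]:
    """Split 1-based source pages into inclusive (first, last) ranges."""
    if page_count < 1 or chunk_pages < 1:
        raise ValueError("page_count and chunk_pages must be positive")
    return [
        (start, min(start + chunk_pages - 1, page_count))
        for start in range(1, page_count + 1, chunk_pages)
    ]


def output_layout(output: Path, stem: str) -> dict[str, Path]:
    """Return the per-PDF folder, its images directory, and the report path."""
    folder = output / stem
    return {
        "folder": folder,
        "images": folder / "images",
        "report": folder / f"{stem}_report.md",
    }
```

For example, `group_pages(7, 3)` yields three ranges, the last of which covers only page 7.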
## 4. MinerU Adapter Contract
@@ -99,6 +104,17 @@ The adapter must fail fast if it cannot run in strict-local mode. Runtime engine
The default conversion device is `cuda:0`. Because MinerU 3.1.0 selects its local device through environment/config rather than a dedicated CLI GPU flag, the adapter must set the MinerU subprocess environment to request CUDA by default while keeping the command shape direct and local.
Runtime tuning is project-owned and strict-local:
- `--gpu auto` selects the visible NVIDIA GPU with the largest VRAM from local `nvidia-smi` inventory.
- `--mineru-profile auto` is the default.
- Safe profile settings are used for the GTX 1070 Ti 8GB, for other pre-Turing or low-VRAM GPUs, and when GPU inventory is unavailable.
- Stronger settings are used only for 16GB+ Turing-or-newer GPUs.
- Tuning is applied only through allowlisted MinerU subprocess environment variables: `MINERU_PROCESSING_WINDOW_SIZE`, `MINERU_API_MAX_CONCURRENT_REQUESTS`, and `MINERU_PDF_RENDER_THREADS`.
- The adapter must not add MinerU backend flags, API URLs, router mode, HTTP client backend use, remote OpenAI-compatible endpoints, or `MINERU_HYBRID_BATCH_RATIO`.
Resolved profile details must be recorded in `engine_options["mineru_profile"]`, including requested profile, applied profile, environment values, and selected GPU details when known.
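The selection rules above might look roughly like this. It is a sketch, not the project's implementation: the `Gpu` record, the profile names `"safe"` and `"strong"`, and the 16,000 MiB threshold are assumptions, and the inventory is passed in already parsed rather than read from `nvidia-smi`.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence

# Allowlisted MinerU subprocess environment variables, as listed above.
ALLOWED_ENV = frozenset({
    "MINERU_PROCESSING_WINDOW_SIZE",
    "MINERU_API_MAX_CONCURRENT_REQUESTS",
    "MINERU_PDF_RENDER_THREADS",
})

TURING = (7, 5)  # CUDA compute capability of the Turing generation
# Assumption: drivers may report slightly under 16384 MiB for nominal 16GB cards.
STRONG_MIN_VRAM_MIB = 16_000


@dataclass(frozen=True)
class Gpu:  # hypothetical inventory record
    index: int
    name: str
    vram_mib: int
    compute_capability: tuple[int, int]


def select_gpu(gpus: Sequence[Gpu]) -> Optional[Gpu]:
    """`--gpu auto`: pick the visible GPU with the largest VRAM, if any."""
    return max(gpus, key=lambda g: g.vram_mib) if gpus else None


def resolve_profile(gpu: Optional[Gpu]) -> str:
    """Fall back to the safe profile unless the GPU clearly qualifies."""
    if gpu is None:
        return "safe"  # inventory unavailable
    if gpu.compute_capability >= TURING and gpu.vram_mib >= STRONG_MIN_VRAM_MIB:
        return "strong"
    return "safe"  # pre-Turing or low VRAM


def filter_env(requested: Mapping[str, str]) -> dict[str, str]:
    """Reject any tuning variable that is not on the allowlist."""
    rejected = sorted(set(requested) - ALLOWED_ENV)
    if rejected:
        raise ValueError(f"non-allowlisted variables: {rejected}")
    return dict(requested)
```

Rejecting unknown variables outright, rather than silently dropping them, keeps a stray `MINERU_HYBRID_BATCH_RATIO` from reaching the subprocess unnoticed.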
Allowed MinerU execution in v1:
- Direct local `mineru` CLI execution.
@@ -156,13 +172,13 @@ Final Markdown must prioritize Obsidian.
- Do not escape underscores or carets inside math unnecessarily.
- Prefer Markdown tables for simple tables.
- Use HTML tables for complex tables when Markdown would lose structure.
- Store figures/images in a stable relative assets directory.
- Do not add visible page separators in v1.
- Store figures/images in the stable `images/` directory under the PDF output folder.
- Do not add visible page separators in v1; grouped page conversion may add invisible HTML comments such as `<!-- source-page: 7 -->` for provenance.
- Preserve captions and references when MinerU provides them.
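As an illustration of the invisible provenance comments mentioned above, grouped page Markdown could be joined like this. The function name and the blank-line spacing are assumptions; only the comment format comes from the text.

```python
def join_pages_with_provenance(pages: list[tuple[int, str]]) -> str:
    """Join (source_page, markdown) pairs, prefixing each with an HTML comment.

    The comment does not render in Obsidian, so no visible page separator
    is added to the output.
    """
    parts = []
    for page_number, markdown in pages:
        marker = f"<!-- source-page: {page_number} -->"
        parts.append(f"{marker}\n\n{markdown.strip()}")
    return "\n\n".join(parts) + "\n"
```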
## 7. Metadata Schema
## 7. Internal Provenance Schema
When metadata is enabled, write `<stem>.metadata.json`.
New conversions do not write a public metadata JSON sidecar. The same schema shape remains useful internally for report generation, warning aggregation, and tests.
Required top-level fields:
@@ -177,6 +193,10 @@ Required top-level fields:
- `warnings`
- `summary`
Optional top-level fields:
- `text_fidelity`: page-level local pypdf-vs-Markdown text diagnostics when source text can be extracted or page mapping uncertainty needs to be recorded.
Required summary fields:
- `pages_processed`
@@ -186,6 +206,21 @@ Required summary fields:
- `inline_formula_count`
- `math_render_error_count`
Optional text fidelity summary fields:
- `text_fidelity_checked_page_count`
- `text_fidelity_low_page_count`
- `text_fidelity_unexpected_cjk_count`
- `text_fidelity_replacement_candidate_page_count`
- `text_fidelity_page_mapping_uncertain_count`
Grouped page conversion records these `engine_options` entries:
- `chunk`: original source PDF path, grouped output index, total grouped outputs, and original source page range.
- `page_conversion`: `single_page` mode, MinerU input page count of 1, grouped output page count, and failed source page numbers.
- `parts`: aggregate report records for output Markdown part paths, source page ranges, status, warning counts, and failed source pages.
- `output_folder`: the PDF-stem output folder.
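For orientation, a grouped-conversion `engine_options` record could be assembled as in the sketch below. Only the four top-level keys and the `single_page` mode come from the list above; every nested key name, the function name, and its parameters are hypothetical.

```python
from pathlib import Path


def build_engine_options(
    source_pdf: Path,
    output_folder: Path,
    index: int,
    total: int,
    page_range: tuple[int, int],
    chunk_pages: int,
    failed_pages: list[int],
    parts: list[dict],
) -> dict:
    """Assemble the grouped page conversion entries (nested keys are illustrative)."""
    return {
        "chunk": {
            "source_pdf": str(source_pdf),
            "index": index,
            "total": total,
            "source_page_range": list(page_range),
        },
        "page_conversion": {
            "mode": "single_page",
            "mineru_input_page_count": 1,  # MinerU always sees one page
            "grouped_output_page_count": chunk_pages,
            "failed_source_pages": sorted(failed_pages),
        },
        "parts": parts,
        "output_folder": str(output_folder),
    }
```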
Warning records include:
- `code`
@@ -205,20 +240,28 @@ Stable warning code examples:
- `READING_ORDER_UNCERTAIN`
- `STRICT_LOCAL_VIOLATION`
- `MINERU_CLI_FAILED`
- `MINERU_PROFILE_ADJUSTED`
- `TEXT_LAYER_AVAILABLE`
- `TEXT_FIDELITY_LOW`
- `UNEXPECTED_CJK_IN_KOREAN_TEXT`
- `HANGUL_SPACING_SUSPECT`
- `TEXT_PAGE_MAPPING_UNCERTAIN`
## 8. Quality Report
Every conversion writes `<stem>.report.md`.
Every conversion writes `<stem>/<stem>_report.md`.
The report is derived from metadata and local quality checks. It contains:
The report is derived from internal provenance and local quality checks. It contains:
- Source and output paths.
- Markdown part paths and source page ranges.
- MinerU version and execution mode.
- Pages processed.
- Warning count.
- Asset count and missing asset link count.
- Inline and display formula counts.
- Math render error count.
- Text fidelity summary when pypdf diagnostics are available.
- Pages with warnings.
- Final status: `success`, `partial`, or `failed`.