modify pdftomd
+57
-14
@@ -1,12 +1,12 @@
# Architecture: Local PDF-to-Markdown Converter

Last updated: 2026-05-07
Last updated: 2026-05-13

## 1. Overview

The system converts math-heavy digital PDFs into Obsidian-friendly Markdown using MinerU 3.1.0 as the fixed local conversion engine. Product requirements live in `PRD.md`; agent workflow rules live in `AGENTS.md`; research notes live in `docs/KNOWLEDGEBASE.md`.

The architecture separates MinerU execution from project-owned normalization and metadata. This boundary exists only to isolate MinerU I/O; it is not a pluggable engine system.
The architecture separates MinerU execution from project-owned normalization and internal provenance/reporting. This boundary exists only to isolate MinerU I/O; it is not a pluggable engine system.

## 2. System Layers

@@ -17,6 +17,8 @@ The architecture separates MinerU execution from project-owned normalization and
- Enforce overwrite behavior.
- Print conversion summaries.

Optional local UI launcher sits above this layer and invokes the project-owned `pdf2md` CLI. It can run a selected folder by discovering direct-child PDFs and sequentially invoking existing `pdf2md convert` commands. It must not call MinerU directly, add a second conversion engine, run parallel GPU conversions by default, or expose remote/API runtime paths.
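
The launcher behavior described above can be sketched as follows. This is a minimal illustration, not the project's implementation: only the `pdf2md convert` command comes from the document, while the positional input argument, the `--output` flag, and all function names are assumptions.

```python
import subprocess
from pathlib import Path


def discover_pdfs(folder: Path) -> list[Path]:
    """Return the direct-child PDFs of `folder` (no recursion), sorted by name."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def build_commands(folder: Path, output: Path) -> list[list[str]]:
    """Build one `pdf2md convert` command per PDF; MinerU is never called directly."""
    # Assumption: the input path is positional and `--output` names the target folder.
    return [
        ["pdf2md", "convert", str(pdf), "--output", str(output)]
        for pdf in discover_pdfs(folder)
    ]


def run_sequentially(commands: list[list[str]]) -> list[int]:
    """Run the commands one at a time so only one GPU conversion is active."""
    return [subprocess.run(cmd, check=False).returncode for cmd in commands]
```

Running the commands in a plain loop, rather than a process pool, is what keeps the launcher from starting parallel GPU conversions by default.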

2. MinerU adapter layer
- Validate MinerU 3.1.0 installation and version.
- Run MinerU through direct local CLI execution.
@@ -32,10 +34,11 @@ The architecture separates MinerU execution from project-owned normalization and
- Convert project-owned objects and MinerU Markdown into Obsidian-friendly Markdown.
- Normalize math delimiters, display math spacing, headings, tables, and asset links.

5. Quality and metadata layer
5. Quality and reporting layer
- Run link checks and math renderability checks with local tooling.
- Aggregate structured warnings.
- Write metadata JSON, quality report Markdown, and optional raw MinerU diagnostics.
- Build internal metadata-like records for reports and result summaries.
- Write quality report Markdown and optional raw MinerU diagnostics.
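
A local link check of the kind this layer runs might look like the sketch below. It assumes standard Markdown image syntax and relative paths resolved against the output folder; the function name and the regular expression are illustrative, not taken from the project.

```python
import re
from pathlib import Path

# Matches Markdown image links such as ![alt](images/fig1.png "title").
IMAGE_LINK = re.compile(r'!\[[^\]]*\]\(([^)\s]+)(?:\s+"[^"]*")?\)')

# Matches targets that start with a URL scheme such as https: or data:.
URL_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def find_missing_assets(markdown: str, base_dir: Path) -> list[str]:
    """Return relative image targets that do not exist under `base_dir`."""
    missing = []
    for target in IMAGE_LINK.findall(markdown):
        if URL_SCHEME.match(target):
            continue  # Only local, relative asset links are checked.
        if not (base_dir / target).is_file():
            missing.append(target)
    return missing
```

The length of the returned list could feed the missing asset link count in the report.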

## 3. Conversion Pipeline

@@ -49,6 +52,7 @@ The architecture separates MinerU execution from project-owned normalization and
- Create an isolated work directory per input PDF.
- Run the MinerU 3.1.0 adapter through the direct `mineru` CLI.
- Capture raw Markdown, raw JSON/structured output when available, extracted assets, warnings, and logs.
- When `--chunk-pages` is active, write one-page temporary PDFs, run MinerU once per source page, and group successful page Markdown into final outputs of the configured page count.

3. Intermediate representation
- Build document/page/block records from MinerU output.
@@ -63,14 +67,15 @@ The architecture separates MinerU execution from project-owned normalization and
5. Quality checks
- Verify generated asset links.
- Check math renderability when local tooling is available.
- Compare local pypdf text-layer extraction with Markdown text where page mapping is credible.
- Emit warnings without stopping conversion unless no usable output can be produced.

6. Output writing
- Write final Markdown.
- Write extracted assets.
- Write metadata JSON.
- Write `<stem>.report.md`.
- Write final Markdown parts under `<output>/<stem>/`.
- Write extracted assets under `<output>/<stem>/images/`.
- Write one report at `<output>/<stem>/<stem>_report.md`.
- Keep raw MinerU output when requested.
- In grouped page conversion mode, write one public Markdown part per grouped page range and delete temporary one-page PDFs plus intermediate per-page outputs.
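
The grouping step of grouped page conversion can be sketched as below. It assumes that groups are cut by source page range (so a failed page leaves a gap inside its group rather than shifting later pages); the function name and dictionary keys are hypothetical.

```python
def group_pages(total_pages: int, chunk_pages: int, failed_pages=frozenset()) -> list[dict]:
    """Split pages 1..total_pages into consecutive groups of `chunk_pages`."""
    if total_pages < 0 or chunk_pages < 1:
        raise ValueError("total_pages must be >= 0 and chunk_pages >= 1")
    groups = []
    for start in range(1, total_pages + 1, chunk_pages):
        end = min(start + chunk_pages - 1, total_pages)
        pages = range(start, end + 1)
        groups.append({
            "index": len(groups) + 1,            # 1-based grouped output index
            "source_pages": (start, end),        # inclusive source page range
            "ok_pages": [p for p in pages if p not in failed_pages],
            "failed_pages": [p for p in pages if p in failed_pages],
        })
    return groups
```

For example, `group_pages(7, 3, {5})` yields three groups covering pages 1-3, 4-6, and 7, with page 5 listed as failed in the second group.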

## 4. MinerU Adapter Contract

@@ -99,6 +104,17 @@ The adapter must fail fast if it cannot run in strict-local mode. Runtime engine

The default conversion device is `cuda:0`. Because MinerU 3.1.0 selects its local device through environment/config rather than a dedicated CLI GPU flag, the adapter must set the MinerU subprocess environment to request CUDA by default while keeping the command shape direct and local.

Runtime tuning is project-owned and strict-local:

- `--gpu auto` selects the visible NVIDIA GPU with the largest VRAM from local `nvidia-smi` inventory.
- `--mineru-profile auto` is the default.
- Safe profile settings are used for GTX 1070 Ti 8GB, pre-Turing, low-VRAM GPUs, or unavailable inventory.
- Stronger settings are used only for 16GB+ Turing-or-newer GPUs.
- Tuning is applied only through allowlisted MinerU subprocess environment variables: `MINERU_PROCESSING_WINDOW_SIZE`, `MINERU_API_MAX_CONCURRENT_REQUESTS`, and `MINERU_PDF_RENDER_THREADS`.
- The adapter must not add MinerU backend flags, API URLs, router mode, HTTP client backend use, remote OpenAI-compatible endpoints, or `MINERU_HYBRID_BATCH_RATIO`.

Resolved profile details must be recorded in `engine_options["mineru_profile"]`, including requested profile, applied profile, environment values, and selected GPU details when known.
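
The tuning rules above could be resolved roughly as in this sketch. The three environment variable names and the 16GB and Turing conditions come from the document; everything else is an assumption: the profile names `safe` and `strong`, the placeholder values, the use of CUDA compute capability 7.5 as the Turing cutoff, the exact MiB threshold, and the keys of the recorded profile.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_ENV = frozenset({
    "MINERU_PROCESSING_WINDOW_SIZE",
    "MINERU_API_MAX_CONCURRENT_REQUESTS",
    "MINERU_PDF_RENDER_THREADS",
})

# Placeholder values only; the document does not state the real settings.
PROFILE_ENV = {
    "safe": {"MINERU_PROCESSING_WINDOW_SIZE": "1", "MINERU_PDF_RENDER_THREADS": "1"},
    "strong": {"MINERU_PROCESSING_WINDOW_SIZE": "4", "MINERU_PDF_RENDER_THREADS": "4"},
}


@dataclass(frozen=True)
class Gpu:
    index: int
    name: str
    vram_mib: int
    compute_capability: tuple[int, int]  # Turing GPUs report (7, 5)


def select_gpu(inventory: list[Gpu]) -> Optional[Gpu]:
    """Pick the GPU with the largest VRAM, or None if inventory is unavailable."""
    return max(inventory, key=lambda g: g.vram_mib) if inventory else None


def resolve_profile(gpu: Optional[Gpu]) -> str:
    """Use the stronger profile only for 16GB+ Turing-or-newer GPUs."""
    if gpu and gpu.vram_mib >= 16 * 1024 and gpu.compute_capability >= (7, 5):
        return "strong"
    return "safe"


def build_env(base_env: dict, profile: str) -> dict:
    """Overlay tuning variables on the subprocess environment, enforcing the allowlist."""
    tuning = PROFILE_ENV[profile]
    rejected = set(tuning) - ALLOWED_ENV
    if rejected:
        raise ValueError(f"non-allowlisted tuning variables: {sorted(rejected)}")
    return {**base_env, **tuning}


def profile_record(requested: str, gpu: Optional[Gpu]) -> dict:
    """Build a record shaped like `engine_options["mineru_profile"]` (keys are hypothetical)."""
    applied = resolve_profile(gpu) if requested == "auto" else requested
    return {
        "requested": requested,
        "applied": applied,
        "env": dict(PROFILE_ENV[applied]),
        "gpu": None if gpu is None else {"index": gpu.index, "name": gpu.name, "vram_mib": gpu.vram_mib},
    }
```

Checking the allowlist inside `build_env` means a mistyped or forbidden variable such as `MINERU_HYBRID_BATCH_RATIO` fails loudly instead of reaching the MinerU subprocess.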

Allowed MinerU execution in v1:

- Direct local `mineru` CLI execution.
@@ -156,13 +172,13 @@ Final Markdown must prioritize Obsidian.
- Do not escape underscores or carets inside math unnecessarily.
- Prefer Markdown tables for simple tables.
- Use HTML tables for complex tables when Markdown would lose structure.
- Store figures/images in a stable relative assets directory.
- Do not add visible page separators in v1.
- Store figures/images in the stable `images/` directory under the PDF output folder.
- Do not add visible page separators in v1; grouped page conversion may add invisible HTML comments such as `<!-- source-page: 7 -->` for provenance.
- Preserve captions and references when MinerU provides them.
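
The invisible provenance comments mentioned above could be written and read back as in this sketch. The comment format follows the `<!-- source-page: 7 -->` example in the list; the function names and the choice to place each comment directly before its page content are assumptions.

```python
import re

SOURCE_PAGE = re.compile(r"<!-- source-page: (\d+) -->")


def join_pages(pages: dict[int, str]) -> str:
    """Join per-page Markdown in page order, prefixing each page with a comment."""
    parts = [
        f"<!-- source-page: {number} -->\n{pages[number].strip()}"
        for number in sorted(pages)
    ]
    return "\n\n".join(parts) + "\n"


def source_pages(markdown: str) -> list[int]:
    """Recover the source page numbers recorded in a grouped Markdown part."""
    return [int(n) for n in SOURCE_PAGE.findall(markdown)]
```

Because HTML comments do not render in Obsidian's reading view, the joined part shows no visible page separators while tools can still map text back to source pages.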

## 7. Metadata Schema
## 7. Internal Provenance Schema

When metadata is enabled, write `<stem>.metadata.json`.
New conversions do not write a public metadata JSON sidecar. The same schema shape remains useful internally for report generation, warning aggregation, and tests.

Required top-level fields:

@@ -177,6 +193,10 @@ Required top-level fields:
- `warnings`
- `summary`

Optional top-level fields:

- `text_fidelity`: page-level local pypdf-vs-Markdown text diagnostics when source text can be extracted or page mapping uncertainty needs to be recorded.
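
One way to produce a page-level text fidelity score is a similarity ratio between the pypdf text layer and the Markdown text for the same page. The sketch below is an assumption about how such a diagnostic could work, not the project's method: the normalization, the use of `difflib.SequenceMatcher`, and the `LOW_THRESHOLD` value are all illustrative.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

LOW_THRESHOLD = 0.8  # Hypothetical cutoff for a "low fidelity" page.


def normalize(text: str) -> str:
    """Collapse whitespace and case so layout differences do not dominate the score."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def text_fidelity(source_text: str, markdown_text: str) -> Optional[float]:
    """Return a 0..1 similarity ratio, or None when the page has no text layer."""
    # `source_text` would come from pypdf, e.g. PdfReader(path).pages[i].extract_text().
    source = normalize(source_text)
    if not source:
        return None
    return SequenceMatcher(None, source, normalize(markdown_text), autojunk=False).ratio()


def is_low_fidelity(score: Optional[float]) -> bool:
    """Flag a page only when a score exists and falls below the cutoff."""
    return score is not None and score < LOW_THRESHOLD
```

Returning `None` for pages without extractable source text keeps "not checked" distinct from "checked and low", which matches the separate checked and low page counts in the summary.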

Required summary fields:

- `pages_processed`
@@ -186,6 +206,21 @@ Required summary fields:
- `inline_formula_count`
- `math_render_error_count`

Optional text fidelity summary fields:

- `text_fidelity_checked_page_count`
- `text_fidelity_low_page_count`
- `text_fidelity_unexpected_cjk_count`
- `text_fidelity_replacement_candidate_page_count`
- `text_fidelity_page_mapping_uncertain_count`

Grouped page conversion records these `engine_options` entries:

- `chunk`: original source PDF path, grouped output index, total grouped outputs, and original source page range.
- `page_conversion`: `single_page` mode, MinerU input page count of 1, grouped output page count, and failed source page numbers.
- `parts`: aggregate report records for output Markdown part paths, source page ranges, status, warning counts, and failed source pages.
- `output_folder`: the PDF-stem output folder.
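
To make the shape of these entries concrete, the sketch below builds one such record. The four top-level keys and the `single_page` mode come from the list above; every nested key name is a guess chosen for readability, since the document describes the contents but does not name the inner fields.

```python
def grouped_engine_options(source_pdf, output_folder, index, total,
                           page_range, output_page_count, failed_pages, parts):
    """Assemble the grouped-conversion `engine_options` entries (nested keys are hypothetical)."""
    first, last = page_range
    return {
        "chunk": {
            "source_pdf": source_pdf,
            "index": index,                      # grouped output index
            "total": total,                      # total grouped outputs
            "source_page_range": [first, last],  # original source pages
        },
        "page_conversion": {
            "mode": "single_page",
            "mineru_input_page_count": 1,
            "output_page_count": output_page_count,
            "failed_source_pages": sorted(failed_pages),
        },
        "parts": list(parts),
        "output_folder": output_folder,
    }
```

Keeping the record JSON-serializable (lists rather than tuples or sets) lets the same structure feed reports and tests even though no public metadata sidecar is written.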

Warning records include:

- `code`
@@ -205,20 +240,28 @@ Stable warning code examples:
- `READING_ORDER_UNCERTAIN`
- `STRICT_LOCAL_VIOLATION`
- `MINERU_CLI_FAILED`
- `MINERU_PROFILE_ADJUSTED`
- `TEXT_LAYER_AVAILABLE`
- `TEXT_FIDELITY_LOW`
- `UNEXPECTED_CJK_IN_KOREAN_TEXT`
- `HANGUL_SPACING_SUSPECT`
- `TEXT_PAGE_MAPPING_UNCERTAIN`

## 8. Quality Report

Every conversion writes `<stem>.report.md`.
Every conversion writes `<stem>/<stem>_report.md`.

The report is derived from metadata and local quality checks. It contains:
The report is derived from internal provenance and local quality checks. It contains:

- Source and output paths.
- Markdown part paths and source page ranges.
- MinerU version and execution mode.
- Pages processed.
- Warning count.
- Asset count and missing asset link count.
- Inline and display formula counts.
- Math render error count.
- Text fidelity summary when pypdf diagnostics are available.
- Pages with warnings.
- Final status: `success`, `partial`, or `failed`.
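
The document names the three status values but not the rule that picks one. A plausible rule, stated here purely as an assumption, is to report `failed` when no page produced usable output, `partial` when some source pages failed, and `success` otherwise:

```python
def final_status(total_pages: int, failed_pages: int) -> str:
    """Map page outcomes to a report status (the rule itself is an assumption)."""
    if total_pages <= 0 or failed_pages >= total_pages:
        return "failed"   # No usable output could be produced.
    if failed_pages > 0:
        return "partial"  # Some source pages failed, others converted.
    return "success"
```

Under this rule, warnings alone never change the status, which is consistent with the pipeline emitting warnings without stopping conversion.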