modify pdftomd

2026-05-14 10:16:59 +09:00
parent 2232b51fc9
commit dc11880140
69 changed files with 7784 additions and 1150 deletions
@@ -1,10 +1,10 @@
-# ConvertPDFToMD
+# ConvertPDFToMD

 Local-only PDF-to-Markdown converter for math-heavy digital documents.

 ## Status

-The project currently provides a Python package, `pdf2md convert`, Markdown recheck via `pdf2md recheck`, metadata/report output, mocked MinerU adapter tests, `pdf2md doctor` setup diagnostics, and Sprint 9 release-gate documentation. Real local MinerU sample validation remains optional and may be blocked until MinerU 3.1.0 and local model/cache setup are available.
+The project currently provides a Python package, `pdf2md convert`, legacy Markdown recheck via `pdf2md recheck`, simplified Markdown/report output, mocked MinerU adapter tests, `pdf2md doctor` setup diagnostics, NVIDIA GPU inventory/profile reporting, opt-in grouped page conversion for long PDFs, local MathJax warning mitigation, release-gate documentation, and a minimal Windows UI launcher with direct-folder PDF batch conversion. Real local MinerU sample validation is optional and should run only against local PDFs with generated outputs kept ignored.

 ## Setup

@@ -46,7 +46,7 @@ uv run pdf2md doctor

 The checker runs through local Node.js and the local `mathjax` package only. It never uses a CDN or hosted renderer, and conversion still completes if Node.js or MathJax is missing.

-For release checks, see [docs/V1RELEASECHECKLIST.md](docs/V1RELEASECHECKLIST.md). It separates the default fast gates from optional local MinerU/GPU/sample fixture evaluation. Optional fixture runs use `PDF2MD_RUN_MINERU_FIXTURES=1`, should use only local PDFs, write generated outputs to a temporary or ignored local directory, and count a sample conversion as successful only when Markdown, metadata JSON, and `.report.md` outputs all exist.
+For release checks, see [docs/V1RELEASECHECKLIST.md](docs/V1RELEASECHECKLIST.md). It separates the default fast gates from optional local MinerU/GPU/sample fixture evaluation. Optional fixture runs use `PDF2MD_RUN_MINERU_FIXTURES=1`, should use only local PDFs, write generated outputs to a temporary or ignored local directory, and count a sample conversion as successful only when Markdown part files and the single `_report.md` output exist.

 Install MinerU 3.1.0 as an explicit local setup step so the `mineru` executable is available on PATH. This project calls MinerU only through the direct local CLI shape:

@@ -56,13 +56,33 @@ mineru -p <input_path> -o <output_path>

 `pdf2md convert` requests GPU execution by default with `--gpu cuda:0`. The adapter maps that to MinerU's local `MINERU_DEVICE_MODE=cuda` and `CUDA_VISIBLE_DEVICES=0` environment for the MinerU subprocess. Actual GPU execution still requires a CUDA-capable local PyTorch/MinerU stack; `doctor` reports when PyTorch is CPU-only or CUDA is unavailable.

+MinerU runtime tuning is controlled with `--mineru-profile auto|safe|performance`; the default is `auto`. `auto` keeps GTX 1070 Ti 8GB, pre-Turing, and other low-VRAM GPUs on safe settings. Use `--gpu auto` on a stronger NVIDIA machine when you want the converter to choose the visible GPU with the most VRAM and record the selected GPU/profile in the report and internal provenance:
+
+```powershell
+uv run pdf2md convert paper.pdf --out outputs --gpu auto --mineru-profile auto
+```
+
+The default public output layout is:
+
+```text
+outputs/
+  paper/
+    paper_001.md
+    paper_report.md
+    images/
+```
+
+When `--chunk-pages` creates more than one grouped output, additional Markdown files use `paper_002.md`, `paper_003.md`, and so on. New conversions do not write public `.metadata.json` sidecars; report content is derived from internal provenance and local checks.
+
+Profile tuning uses only local environment variables for the MinerU subprocess: `MINERU_PROCESSING_WINDOW_SIZE`, `MINERU_API_MAX_CONCURRENT_REQUESTS`, and `MINERU_PDF_RENDER_THREADS`. It does not add MinerU backend selection, `--api-url`, router mode, HTTP client backends, or remote endpoints. Explicit `--mineru-profile performance` is downgraded to safe with a warning when the selected GPU is below 16GB VRAM or has pre-Turing risk.
+
 Run setup diagnostics before conversion:

 ```powershell
 uv run pdf2md doctor
 ```

-`doctor` checks Python 3.12, `uv`, the MinerU CLI and version, NVIDIA GPU visibility through `nvidia-smi`, PyTorch CUDA visibility when PyTorch is installed, local model/cache/config paths, local MathJax checker availability, and the strict-local runtime policy. It does not install packages, download models, run conversions, or inspect `samples/`.
+`doctor` checks Python 3.12, `uv`, the MinerU CLI and version, NVIDIA GPU visibility through `nvidia-smi`, PyTorch CUDA visibility when PyTorch is installed, local model/cache/config paths, local MathJax checker availability, and the strict-local runtime policy. It also reports visible GPU indexes, VRAM, driver versions, the `--gpu auto` selection, and the recommended MinerU profile. It does not install packages, download models, run conversions, or inspect `samples/`.

 The model/cache check looks for these environment variables when present:

@@ -78,14 +98,13 @@ It also checks for `%USERPROFILE%\mineru.json`, which MinerU documents as its de

 ## Rechecking Markdown

-After editing a generated Markdown file, rerun local quality checks and regenerate the adjacent metadata/report files:
+`pdf2md recheck` is currently a legacy maintenance command for Markdown files that still have adjacent metadata JSON from the older output layout:

 ```powershell
-uv run pdf2md recheck outputs/MITC공부/MITC공부.md
+uv run pdf2md recheck outputs/legacy-paper.md
 ```

-`recheck` reads the existing `<stem>.metadata.json` for source PDF, engine, page, and asset provenance. It replaces quality warnings that can be recalculated from the current Markdown, including MathJax render failures and local asset-link warnings, then rewrites `<stem>.metadata.json` and `<stem>.report.md`.
-
+`recheck` reads an existing legacy `<stem>.metadata.json` for source PDF, engine, page, and asset provenance. New simplified outputs do not persist metadata JSON, so metadata-free recheck is intentionally deferred to a later sprint.
 ## Runtime Policy

 Runtime conversion is strict-local. Allowed: direct `mineru` CLI execution and the CLI-internal temporary local `mineru-api` that MinerU starts when `--api-url` is omitted. Prohibited: `--api-url`, remote APIs, router mode, HTTP client backends, remote OpenAI-compatible backends, hosted renderers, and cloud fallbacks.
@@ -96,16 +115,36 @@ The target GPU is NVIDIA GTX 1070 Ti 8GB. `doctor` warns for GTX 1070 Ti/Pascal/

 ## Long PDFs

-Chunking is opt-in for long PDFs. Use `--chunk-pages` with no value to split into 20-page chunks, or pass an explicit positive page count:
+Grouped page conversion is opt-in for long PDFs. Use `--chunk-pages` with no value to group outputs by 20 source pages, or pass an explicit positive group size:

 ```powershell
 uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages
 uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages 20
 ```

-Chunk PDFs are written to a temporary local directory before each MinerU run and are deleted after conversion completes. The generated Markdown files are not merged; each chunk gets its own Markdown, metadata JSON, report Markdown, and assets directory named with the original page range.
+When `--chunk-pages` is active, the converter writes one-page temporary PDFs and sends only one source page to MinerU per run. Successful page Markdown is then grouped into final Markdown files under the PDF output folder, such as `outputs/paper/paper_001.md` and `outputs/paper/paper_002.md`. Temporary one-page PDFs and intermediate per-page outputs are deleted after conversion completes.

-The Python API keeps non-chunked behavior unchanged. `convert_pdf(..., chunk_pages=20)` returns a `BatchConversionResult` with one `ConversionResult` per chunk.
+Grouped outputs keep invisible Obsidian-friendly page comments such as `<!-- source-page: 7 -->`; failed page conversions are recorded as comments plus report warnings. Page assets are copied into the shared `outputs/paper/images/` folder with deterministic page-prefixed names to avoid filename collisions.
+
+The Python API keeps non-chunked behavior unchanged. `convert_pdf(..., chunk_pages=20)` returns a `BatchConversionResult` with one `ConversionResult` per grouped output file.
+
+## Windows UI Launcher
+
+The first UI is a minimal local Windows launcher implemented under `src/pdf2md_ui/`. It calls the existing `pdf2md` CLI or `uv run pdf2md`; it does not call MinerU directly and does not bundle MinerU, CUDA PyTorch, model weights, Node.js, or MathJax into the UI executable. The UI exposes the current conversion controls, including grouped pages, GPU device or `auto`, and MinerU profile `auto|safe|performance`. Folder conversion selects direct-child PDFs only and runs the existing CLI conversion command once per PDF sequentially.
+
+Run it from source:
+
+```powershell
+uv run python -m pdf2md_ui.app
+```
+
+Build the UI executable:
+
+```powershell
+uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py
+```
+
+The expected local artifact is `dist\pdf2md-ui.exe`. The UI remains a launcher over a healthy local runtime, so run `pdf2md doctor` before relying on conversions. For the simplified output layout, select the output root; the CLI creates the final `<stem>\` folder inside it.

 ## References