# ConvertPDFToMD
Local-only PDF-to-Markdown converter for math-heavy digital documents.
## Status
The project currently provides:
- a Python package
- the `pdf2md convert` command
- legacy Markdown recheck via `pdf2md recheck`
- simplified Markdown/report output
- mocked MinerU adapter tests
- `pdf2md doctor` setup diagnostics
- NVIDIA GPU inventory/profile reporting
- opt-in grouped page conversion for long PDFs
- local MathJax warning mitigation
- release-gate documentation
- a minimal Windows UI launcher with direct-folder PDF batch conversion

Real local MinerU sample validation is optional and should run only against local PDFs, with generated outputs kept in ignored directories.
## Setup
Use Windows PowerShell with Python 3.12. If `uv` is installed but a new shell cannot find it, add the per-user install directory to PATH for the current session:
```powershell
$env:Path = "$env:USERPROFILE\.local\bin;$env:Path"
```
Sync the project and run the fast local test loop:
```powershell
uv sync
uv run pytest
uv run pdf2md --version
```
For the local GTX 1070 Ti runtime, install CUDA-enabled PyTorch before installing MinerU so MinerU does not resolve to a CPU-only torch wheel:
```powershell
uv sync
uv pip install --index-url https://download.pytorch.org/whl/cu126 torch==2.6.0 torchvision==0.21.0
uv pip install "mineru[core]==3.1.0"
uv run mineru-models-download -s huggingface -m all
[Environment]::SetEnvironmentVariable("MINERU_MODEL_SOURCE", "local", "User")
$env:MINERU_MODEL_SOURCE = "local"
uv run pdf2md doctor
```
Run `uv sync` before the runtime install commands. If you run `uv sync` again later, repeat the runtime install commands because MinerU and CUDA PyTorch are intentionally not part of the default fast test dependency set.
Install the optional local MathJax checker when you want formula renderability counts to reflect real MathJax parsing instead of the nonfatal "checker unavailable" warning:
```powershell
npm install
npm run mathjax-checker:health
uv run pdf2md doctor
```
The checker runs through local Node.js and the local `mathjax` package only. It never uses a CDN or hosted renderer, and conversion still completes if Node.js or MathJax is missing.
For release checks, see [docs/V1RELEASECHECKLIST.md](docs/V1RELEASECHECKLIST.md). It separates the default fast gates from optional local MinerU/GPU/sample fixture evaluation. Optional fixture runs use `PDF2MD_RUN_MINERU_FIXTURES=1`, should use only local PDFs, write generated outputs to a temporary or ignored local directory, and count a sample conversion as successful only when Markdown part files and the single `_report.md` output exist.
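The success criterion for an optional fixture run can be written as a small check. The sketch below is illustrative only: the function name `sample_conversion_succeeded` and the numbered part-file pattern are assumptions derived from the `paper_001.md` naming described in this README, not the project's actual test helper.

```python
import re
from pathlib import Path

# Assumed part-file pattern: <stem>_001.md, <stem>_002.md, ...
PART_RE = re.compile(r"_\d{3,}\.md$")


def sample_conversion_succeeded(pdf_output_dir: Path) -> bool:
    """True when at least one Markdown part and exactly one report exist."""
    if not pdf_output_dir.is_dir():
        return False
    names = [p.name for p in pdf_output_dir.iterdir() if p.is_file()]
    parts = [n for n in names if PART_RE.search(n)]
    reports = [n for n in names if n.endswith("_report.md")]
    return len(parts) >= 1 and len(reports) == 1
```

A fixture test could call this on the temporary output folder for each sample PDF after conversion finishes.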
Install MinerU 3.1.0 as an explicit local setup step so the `mineru` executable is available on PATH. This project calls MinerU only through the direct local CLI shape:
```powershell
mineru -p <input_path> -o <output_path>
```
`pdf2md convert` requests GPU execution by default with `--gpu cuda:0`. The adapter maps that to `MINERU_DEVICE_MODE=cuda` and `CUDA_VISIBLE_DEVICES=0` in the environment of the MinerU subprocess. Actual GPU execution still requires a CUDA-capable local PyTorch/MinerU stack; `doctor` reports when PyTorch is CPU-only or CUDA is unavailable.
MinerU runtime tuning is controlled with `--mineru-profile auto|safe|performance`; the default is `auto`. `auto` keeps GTX 1070 Ti 8GB, pre-Turing, and other low-VRAM GPUs on safe settings. Use `--gpu auto` on a stronger NVIDIA machine when you want the converter to choose the visible GPU with the most VRAM and record the selected GPU/profile in the report and internal provenance:
```powershell
uv run pdf2md convert paper.pdf --out outputs --gpu auto --mineru-profile auto
```
The default public output layout is:
```text
outputs/
  paper/
    paper_001.md
    paper_report.md
    images/
```
When `--chunk-pages` creates more than one grouped output, additional Markdown files use `paper_002.md`, `paper_003.md`, and so on. New conversions do not write public `.metadata.json` sidecars; report content is derived from internal provenance and local checks.
Profile tuning uses only local environment variables for the MinerU subprocess: `MINERU_PROCESSING_WINDOW_SIZE`, `MINERU_API_MAX_CONCURRENT_REQUESTS`, and `MINERU_PDF_RENDER_THREADS`. It does not add MinerU backend selection, `--api-url`, router mode, HTTP client backends, or remote endpoints. Explicit `--mineru-profile performance` is downgraded to safe with a warning when the selected GPU is below 16GB VRAM or has pre-Turing risk.
Run setup diagnostics before conversion:
```powershell
uv run pdf2md doctor
```
`doctor` checks Python 3.12, `uv`, the MinerU CLI and version, NVIDIA GPU visibility through `nvidia-smi`, PyTorch CUDA visibility when PyTorch is installed, local model/cache/config paths, local MathJax checker availability, and the strict-local runtime policy. It also reports visible GPU indexes, VRAM, driver versions, the `--gpu auto` selection, and the recommended MinerU profile. It does not install packages, download models, run conversions, or inspect `samples/`.
The model/cache check looks for these environment variables when present:
- `MINERU_MODEL_SOURCE`
- `MINERU_MODEL_DIR`
- `MINERU_CACHE_DIR`
- `MINERU_TOOLS_CONFIG_JSON`
- `HF_HOME`
- `HUGGINGFACE_HUB_CACHE`
- `MODELSCOPE_CACHE`
It also checks for `%USERPROFILE%\mineru.json`, which MinerU documents as its default user config location. Missing model/cache paths are warnings because model download and cache population must be explicit setup actions.
## Rechecking Markdown
`pdf2md recheck` is currently a legacy maintenance command for Markdown files that still have adjacent metadata JSON from the older output layout:
```powershell
uv run pdf2md recheck outputs/legacy-paper.md
```
`recheck` reads an existing legacy `<stem>.metadata.json` for source PDF, engine, page, and asset provenance. New simplified outputs do not persist metadata JSON, so metadata-free recheck is intentionally deferred to a later sprint.
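To see whether a Markdown file is eligible for legacy recheck, the adjacent metadata path can be derived from the Markdown path. The helper names below are hypothetical and only illustrate the `<stem>.metadata.json` convention described above.

```python
from pathlib import Path


def legacy_metadata_path(markdown_path: Path) -> Path:
    """Return the adjacent <stem>.metadata.json path for a Markdown file."""
    return markdown_path.with_name(f"{markdown_path.stem}.metadata.json")


def can_recheck(markdown_path: Path) -> bool:
    """True when both the Markdown file and its legacy metadata exist."""
    return markdown_path.is_file() and legacy_metadata_path(markdown_path).is_file()
```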
## Runtime Policy
Runtime conversion is strict-local. Allowed: direct `mineru` CLI execution and the CLI-internal temporary local `mineru-api` that MinerU starts when `--api-url` is omitted. Prohibited: `--api-url`, remote APIs, router mode, HTTP client backends, remote OpenAI-compatible backends, hosted renderers, and cloud fallbacks.
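One way to enforce part of this policy in code is to build the MinerU command from a fixed shape and reject the prohibited flag before launching the subprocess. The sketch below uses hypothetical helper names and covers only the `--api-url` rule; it does not claim to reflect how the adapter is actually written.

```python
from pathlib import Path


def build_mineru_command(input_path: Path, output_path: Path) -> list[str]:
    """Build the direct local CLI shape: mineru -p <input> -o <output>."""
    return ["mineru", "-p", str(input_path), "-o", str(output_path)]


def assert_strict_local(argv: list[str]) -> None:
    """Raise ValueError if the command would target an external API."""
    for arg in argv:
        if arg == "--api-url" or arg.startswith("--api-url="):
            raise ValueError("--api-url is prohibited by the strict-local runtime policy")
```

Calling `assert_strict_local(build_mineru_command(...))` just before `subprocess.run` gives a cheap guard that also works as a regression test target.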
Setup may require explicit user-initiated package or model downloads. Those setup downloads are separate from runtime conversion; `pdf2md doctor`, `pdf2md convert`, imports, and default tests must not download packages or models.
The target GPU is NVIDIA GTX 1070 Ti 8GB. `doctor` warns for GTX 1070 Ti/Pascal/pre-Turing GPUs because local CUDA/PyTorch compatibility and VRAM pressure must be validated on the actual machine before relying on acceleration.
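A quick way to validate the local PyTorch stack on the actual machine is sketched below. The function names and status strings are hypothetical; the compute capability comparison relies on Turing reporting 7.5 and the Pascal-based GTX 1070 Ti reporting 6.1.

```python
def is_pre_turing(capability: tuple[int, int]) -> bool:
    """True for compute capabilities below Turing's 7.5 (Pascal is 6.x)."""
    return capability < (7, 5)


def torch_cuda_status() -> str:
    """Summarize whether the installed PyTorch can use CUDA."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    if torch.version.cuda is None:
        return "cpu-only-build"  # wheel was built without CUDA support
    if not torch.cuda.is_available():
        return "cuda-unavailable"  # CUDA build, but no usable driver or GPU
    if is_pre_turing(torch.cuda.get_device_capability(0)):
        return "cuda-ok-pre-turing"
    return "cuda-ok"
```

Running `uv run python -c "..."` with a call to `torch_cuda_status()` after the runtime install commands shows whether the CUDA wheel was kept or replaced by a CPU-only one.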
## Long PDFs
Grouped page conversion is opt-in for long PDFs. Use `--chunk-pages` with no value to group outputs by 20 source pages, or pass an explicit positive group size:
```powershell
uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages
uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages 20
```
When `--chunk-pages` is active, the converter writes one-page temporary PDFs and sends only one source page to MinerU per run. Successful page Markdown is then grouped into final Markdown files under the PDF output folder, such as `outputs/paper/paper_001.md` and `outputs/paper/paper_002.md`. Temporary one-page PDFs and intermediate per-page outputs are deleted after conversion completes.
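The grouping arithmetic can be sketched as follows. The function name `plan_groups` is hypothetical, and the sketch assumes every source page is assigned to a group by page number, whether or not its conversion later succeeds.

```python
def plan_groups(stem: str, page_count: int, chunk_pages: int = 20) -> list[tuple[str, range]]:
    """Map 1-based source pages onto numbered output Markdown files."""
    if chunk_pages < 1:
        raise ValueError("chunk_pages must be a positive integer")
    groups = []
    starts = range(1, page_count + 1, chunk_pages)
    for group_index, start in enumerate(starts, start=1):
        end = min(start + chunk_pages - 1, page_count)
        groups.append((f"{stem}_{group_index:03d}.md", range(start, end + 1)))
    return groups
```

For a 45-page `paper.pdf` with the default group size, this yields `paper_001.md` for pages 1 to 20, `paper_002.md` for pages 21 to 40, and `paper_003.md` for pages 41 to 45.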
Grouped outputs keep invisible Obsidian-friendly page comments such as `<!-- source-page: 7 -->`; failed page conversions are recorded as comments plus report warnings. Page assets are copied into the shared `outputs/paper/images/` folder with deterministic page-prefixed names to avoid filename collisions.
The Python API keeps non-chunked behavior unchanged. `convert_pdf(..., chunk_pages=20)` returns a `BatchConversionResult` with one `ConversionResult` per grouped output file.
## Windows UI Launcher
The first UI is a minimal local Windows launcher implemented under `src/pdf2md_ui/`. It calls the existing `pdf2md` CLI or `uv run pdf2md`; it does not call MinerU directly and does not bundle MinerU, CUDA PyTorch, model weights, Node.js, or MathJax into the UI executable. The UI exposes the current conversion controls, including grouped pages, GPU device or `auto`, and MinerU profile `auto|safe|performance`. Folder conversion selects direct-child PDFs only and runs the existing CLI conversion command once per PDF sequentially.
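The folder behavior described above can be approximated with the following sketch. It is not the launcher's code: the function names are hypothetical, the alphabetical ordering is an assumption, and the command uses the `uv run pdf2md convert <pdf> --out <dir>` shape shown earlier in this README.

```python
import subprocess
from pathlib import Path


def direct_child_pdfs(folder: Path) -> list[Path]:
    """List PDFs directly inside folder, ignoring subfolders."""
    pdfs = [p for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"]
    return sorted(pdfs, key=lambda p: p.name.lower())


def convert_folder(folder: Path, out_root: Path) -> dict[str, int]:
    """Run one CLI conversion per PDF, sequentially; return exit codes."""
    exit_codes = {}
    for pdf in direct_child_pdfs(folder):
        cmd = ["uv", "run", "pdf2md", "convert", str(pdf), "--out", str(out_root)]
        # check=False so one failed PDF does not stop the rest of the batch.
        exit_codes[pdf.name] = subprocess.run(cmd, check=False).returncode
    return exit_codes
```

Running the conversions one at a time avoids several MinerU processes competing for the same GPU memory.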
Run it from source:
```powershell
uv run python -m pdf2md_ui.app
```
Build the UI executable:
```powershell
uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py
```
The expected local artifact is `dist\pdf2md-ui.exe`. The UI remains a launcher over a healthy local runtime, so run `pdf2md doctor` before relying on conversions. For the simplified output layout, select the output root; the CLI creates the final `<stem>\` folder inside it.
## References
Sources checked on 2026-05-08:
- MinerU Quick Usage: https://opendatalab.github.io/MinerU/usage/quick_usage/
- MinerU CLI Tools: https://opendatalab.github.io/MinerU/usage/cli_tools/
- MinerU Model Source: https://opendatalab.github.io/MinerU/usage/model_source/
- MinerU GitHub README/release notes: https://github.com/opendatalab/MinerU
- uv project sync documentation: https://docs.astral.sh/uv/concepts/projects/sync/
- PyTorch previous versions: https://docs.pytorch.org/get-started/previous-versions/
- PyTorch CUDA architecture support update: https://dev-discuss.pytorch.org/t/cuda-toolkit-version-and-architecture-support-update-maxwell-and-pascal-architecture-support-removed-in-cuda-12-8-and-12-9-builds/3128
- PyTorch CUDA availability API: https://docs.pytorch.org/docs/2.11/generated/torch.cuda.is_available.html