baram2584/PDFToMD

김경종 dc11880140 modify pdftomd

2026-05-14 10:16:59 +09:00

ConvertPDFToMD

Local-only PDF-to-Markdown converter for math-heavy digital documents.

Status

The project currently provides: a Python package; the pdf2md convert command; a legacy Markdown recheck command, pdf2md recheck; simplified Markdown and report output; mocked MinerU adapter tests; pdf2md doctor setup diagnostics; NVIDIA GPU inventory and profile reporting; opt-in grouped page conversion for long PDFs; local MathJax warning mitigation; release-gate documentation; and a minimal Windows UI launcher with direct-folder PDF batch conversion. Validation against real local MinerU samples is optional. Run it only against local PDFs, and keep the generated outputs in an ignored location.

Setup

Use Windows PowerShell with Python 3.12. If uv is installed but a new shell cannot find it, add the per-user install directory to PATH for the current session:

$env:Path = "C:\Users\user\.local\bin;$env:Path"

Sync the project and run the fast local test loop:

uv sync
uv run pytest
uv run pdf2md --version

For the local GTX 1070 Ti runtime, install CUDA-enabled PyTorch before installing MinerU so MinerU does not resolve to a CPU-only torch wheel:

uv sync
uv pip install --index-url https://download.pytorch.org/whl/cu126 torch==2.6.0 torchvision==0.21.0
uv pip install "mineru[core]==3.1.0"
uv run mineru-models-download -s huggingface -m all
[Environment]::SetEnvironmentVariable("MINERU_MODEL_SOURCE", "local", "User")
$env:MINERU_MODEL_SOURCE = "local"
uv run pdf2md doctor

Run uv sync before the runtime install commands. If you run uv sync again later, repeat the runtime install commands because MinerU and CUDA PyTorch are intentionally not part of the default fast test dependency set.

Install the optional local MathJax checker when you want formula renderability counts to reflect real MathJax parsing instead of the nonfatal "checker unavailable" warning:

npm install
npm run mathjax-checker:health
uv run pdf2md doctor

The checker runs through local Node.js and the local mathjax package only. It never uses a CDN or hosted renderer, and conversion still completes if Node.js or MathJax is missing.

For release checks, see docs/V1RELEASECHECKLIST.md. It separates the default fast gates from optional local MinerU/GPU/sample fixture evaluation. Optional fixture runs use PDF2MD_RUN_MINERU_FIXTURES=1, should use only local PDFs, write generated outputs to a temporary or ignored local directory, and count a sample conversion as successful only when Markdown part files and the single _report.md output exist.
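
The success criterion for an optional fixture run can be sketched as a small local check. This is a minimal illustration, not the project's test code: the helper name fixture_conversion_succeeded is hypothetical, and the file naming assumes the <stem>_NNN.md and <stem>_report.md layout this README describes.

```python
import re
from pathlib import Path


def fixture_conversion_succeeded(pdf_output_dir: Path, stem: str) -> bool:
    """Hypothetical helper: True when part files and exactly one report exist."""
    # Part files are assumed to be named <stem>_001.md, <stem>_002.md, ...
    part_pattern = re.compile(rf"^{re.escape(stem)}_\d{{3,}}\.md$")
    parts = [p for p in pdf_output_dir.glob("*.md") if part_pattern.match(p.name)]
    # The report is assumed to be the single <stem>_report.md file.
    reports = list(pdf_output_dir.glob("*_report.md"))
    return bool(parts) and len(reports) == 1
```

A fixture test could call this on the per-PDF output folder, for example outputs/paper with stem "paper".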

Install MinerU 3.1.0 as an explicit local setup step so the mineru executable is available on PATH. This project calls MinerU only through the direct local CLI shape:

mineru -p <input_path> -o <output_path>

pdf2md convert requests GPU execution by default with --gpu cuda:0. The adapter maps that to MinerU's local MINERU_DEVICE_MODE=cuda and CUDA_VISIBLE_DEVICES=0 environment for the MinerU subprocess. Actual GPU execution still requires a CUDA-capable local PyTorch/MinerU stack; doctor reports when PyTorch is CPU-only or CUDA is unavailable.

MinerU runtime tuning is controlled with --mineru-profile auto|safe|performance; the default is auto. auto keeps GTX 1070 Ti 8GB, pre-Turing, and other low-VRAM GPUs on safe settings. Use --gpu auto on a stronger NVIDIA machine when you want the converter to choose the visible GPU with the most VRAM and record the selected GPU/profile in the report and internal provenance:

uv run pdf2md convert paper.pdf --out outputs --gpu auto --mineru-profile auto

The default public output layout is:

outputs/
  paper/
    paper_001.md
    paper_report.md
    images/

When --chunk-pages creates more than one grouped output, additional Markdown files use paper_002.md, paper_003.md, and so on. New conversions do not write public .metadata.json sidecars; report content is derived from internal provenance and local checks.
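
The naming scheme above can be reproduced with a few lines of path arithmetic. The function names below are hypothetical and exist only to make the layout concrete.

```python
from pathlib import Path


def planned_markdown_paths(out_root: Path, stem: str, group_count: int) -> list[Path]:
    """Hypothetical helper: <out_root>/<stem>/<stem>_001.md, _002.md, ..."""
    folder = out_root / stem
    return [folder / f"{stem}_{n:03d}.md" for n in range(1, group_count + 1)]


def planned_report_path(out_root: Path, stem: str) -> Path:
    """Hypothetical helper: the single report file for one PDF."""
    return out_root / stem / f"{stem}_report.md"
```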

Profile tuning uses only local environment variables for the MinerU subprocess: MINERU_PROCESSING_WINDOW_SIZE, MINERU_API_MAX_CONCURRENT_REQUESTS, and MINERU_PDF_RENDER_THREADS. It does not add MinerU backend selection, --api-url, router mode, HTTP client backends, or remote endpoints. Explicit --mineru-profile performance is downgraded to safe with a warning when the selected GPU is below 16GB VRAM or has pre-Turing risk.

Run setup diagnostics before conversion:

uv run pdf2md doctor

doctor checks Python 3.12, uv, the MinerU CLI and version, NVIDIA GPU visibility through nvidia-smi, PyTorch CUDA visibility when PyTorch is installed, local model/cache/config paths, local MathJax checker availability, and the strict-local runtime policy. It also reports visible GPU indexes, VRAM, driver versions, the --gpu auto selection, and the recommended MinerU profile. It does not install packages, download models, run conversions, or inspect samples/.
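
For readers who want to see what a GPU inventory read through nvidia-smi can look like, here is a minimal sketch. QUERY_ARGS and parse_gpu_inventory are hypothetical names, and the project's own implementation may query different fields; the sketch only parses text and does not run nvidia-smi itself.

```python
import csv
import io

# nvidia-smi query that prints one CSV row per GPU, without header or units.
QUERY_ARGS = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total,driver_version",
    "--format=csv,noheader,nounits",
]


def parse_gpu_inventory(csv_text: str) -> list[dict[str, object]]:
    """Hypothetical helper: parse the CSV text produced by QUERY_ARGS."""
    rows: list[dict[str, object]] = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        index, name, memory_mib, driver = (f.strip() for f in fields)
        if not (index.isdigit() and memory_mib.isdigit()):
            continue  # skip rows with unavailable values
        rows.append(
            {"index": int(index), "name": name, "vram_mib": int(memory_mib), "driver": driver}
        )
    return rows
```

The text to parse would come from running QUERY_ARGS with subprocess.run(..., capture_output=True, text=True) on a machine where nvidia-smi is on PATH.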

The model/cache check looks for these environment variables when present:

MINERU_MODEL_SOURCE
MINERU_MODEL_DIR
MINERU_CACHE_DIR
MINERU_TOOLS_CONFIG_JSON
HF_HOME
HUGGINGFACE_HUB_CACHE
MODELSCOPE_CACHE

It also checks for %USERPROFILE%\mineru.json, which MinerU documents as its default user config location. Missing model/cache paths are warnings because model download and cache population must be explicit setup actions.
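
A rough sketch of this kind of warning-only check is shown below. model_path_warnings and PATH_VARIABLES are hypothetical names; MINERU_MODEL_SOURCE is left out of the path list because it holds a source name such as local rather than a filesystem path.

```python
from collections.abc import Mapping
from pathlib import Path

# Variables from the list above that are expected to hold filesystem paths.
PATH_VARIABLES = (
    "MINERU_MODEL_DIR",
    "MINERU_CACHE_DIR",
    "MINERU_TOOLS_CONFIG_JSON",
    "HF_HOME",
    "HUGGINGFACE_HUB_CACHE",
    "MODELSCOPE_CACHE",
)


def model_path_warnings(environ: Mapping[str, str], home: Path) -> list[str]:
    """Hypothetical helper: return warnings only; never download or create anything."""
    warnings: list[str] = []
    for name in PATH_VARIABLES:
        value = environ.get(name)
        # Only variables that are present are checked.
        if value and not Path(value).exists():
            warnings.append(f"{name} points to a missing path: {value}")
    config = home / "mineru.json"
    if not config.is_file():
        warnings.append(f"user config not found: {config}")
    return warnings
```

In real use the arguments would be os.environ and Path.home(); passing them in keeps the check easy to test without touching the machine's actual configuration.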

Rechecking Markdown

pdf2md recheck is currently a legacy maintenance command for Markdown files that still have adjacent metadata JSON from the older output layout:

uv run pdf2md recheck outputs/legacy-paper.md

recheck reads an existing legacy <stem>.metadata.json for source PDF, engine, page, and asset provenance. New simplified outputs do not persist metadata JSON, so metadata-free recheck is intentionally deferred to a later sprint.

Runtime Policy

Runtime conversion is strict-local. Allowed: direct mineru CLI execution and the CLI-internal temporary local mineru-api that MinerU starts when --api-url is omitted. Prohibited: --api-url, remote APIs, router mode, HTTP client backends, remote OpenAI-compatible backends, hosted renderers, and cloud fallbacks.
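
One way to enforce part of this policy in code is to build the MinerU command in a single place and reject the prohibited flag before launching. The function names are hypothetical, and the guard below covers only --api-url because that is the only prohibited flag this README names explicitly.

```python
def build_mineru_command(input_path: str, output_path: str) -> list[str]:
    """Hypothetical helper: the direct local CLI shape, mineru -p <input> -o <output>."""
    return ["mineru", "-p", input_path, "-o", output_path]


def assert_strict_local(argv: list[str]) -> None:
    """Hypothetical guard: raise if the command would point MinerU at an API URL."""
    for arg in argv:
        if arg == "--api-url" or arg.startswith("--api-url="):
            raise ValueError("--api-url is prohibited by the strict-local runtime policy")
```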

Setup may require explicit user-initiated package or model downloads. Those setup downloads are separate from runtime conversion; pdf2md doctor, pdf2md convert, imports, and default tests must not download packages or models.

The target GPU is NVIDIA GTX 1070 Ti 8GB. doctor warns for GTX 1070 Ti/Pascal/pre-Turing GPUs because local CUDA/PyTorch compatibility and VRAM pressure must be validated on the actual machine before relying on acceleration.

Long PDFs

Grouped page conversion is opt-in for long PDFs. Use --chunk-pages with no value to group outputs by 20 source pages, or pass an explicit positive group size:

uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages
uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages 20

When --chunk-pages is active, the converter writes one-page temporary PDFs and sends only one source page to MinerU per run. Successful page Markdown is then grouped into final Markdown files under the PDF output folder, such as outputs/paper/paper_001.md and outputs/paper/paper_002.md. Temporary one-page PDFs and intermediate per-page outputs are deleted after conversion completes.
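
The grouping step amounts to splitting the 1-based source page numbers into consecutive runs. A minimal sketch, with group_pages as a hypothetical name and the default of 20 taken from the description above:

```python
def group_pages(total_pages: int, group_size: int = 20) -> list[range]:
    """Hypothetical helper: split pages 1..total_pages into consecutive groups."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if group_size <= 0:
        raise ValueError("group_size must be a positive integer")
    return [
        range(start, min(start + group_size, total_pages + 1))
        for start in range(1, total_pages + 1, group_size)
    ]
```

For a 45-page PDF with the default size, this yields three groups covering pages 1 to 20, 21 to 40, and 41 to 45, which would correspond to paper_001.md, paper_002.md, and paper_003.md.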

Grouped outputs keep invisible, Obsidian-friendly page comments that mark page boundaries. Failed page conversions are recorded as comments plus report warnings. Page assets are copied into the shared outputs/paper/images/ folder with deterministic page-prefixed names to avoid filename collisions.

The Python API keeps non-chunked behavior unchanged. convert_pdf(..., chunk_pages=20) returns a BatchConversionResult with one ConversionResult per grouped output file.

Windows UI Launcher

The first UI is a minimal local Windows launcher implemented under src/pdf2md_ui/. It calls the existing pdf2md CLI or uv run pdf2md; it does not call MinerU directly and does not bundle MinerU, CUDA PyTorch, model weights, Node.js, or MathJax into the UI executable. The UI exposes the current conversion controls, including grouped pages, GPU device or auto, and MinerU profile auto|safe|performance. Folder conversion selects direct-child PDFs only and runs the existing CLI conversion command once per PDF sequentially.
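
The folder rule (direct-child PDFs only, one CLI run per PDF, sequentially) can be sketched as follows. The function names are hypothetical, the sorted order is an assumption, and the command shape mirrors the uv run pdf2md convert examples in this README.

```python
import subprocess
from pathlib import Path


def direct_child_pdfs(folder: Path) -> list[Path]:
    """Hypothetical helper: PDFs directly inside folder, without recursing."""
    return sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )


def conversion_commands(folder: Path, out_root: Path) -> list[list[str]]:
    """Build one CLI command per PDF."""
    return [
        ["uv", "run", "pdf2md", "convert", str(pdf), "--out", str(out_root)]
        for pdf in direct_child_pdfs(folder)
    ]


def run_sequentially(commands: list[list[str]]) -> list[int]:
    """Run each command in turn and collect the exit codes."""
    return [subprocess.run(cmd, check=False).returncode for cmd in commands]
```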

Run it from source:

uv run python -m pdf2md_ui.app

Build the UI executable:

uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py

The expected local artifact is dist\pdf2md-ui.exe. The UI remains a launcher over a healthy local runtime, so run pdf2md doctor before relying on conversions. For the simplified output layout, select the output root; the CLI creates the final <stem>\ folder inside it.

References

Sources checked on 2026-05-08:

MinerU Quick Usage: https://opendatalab.github.io/MinerU/usage/quick_usage/
MinerU CLI Tools: https://opendatalab.github.io/MinerU/usage/cli_tools/
MinerU Model Source: https://opendatalab.github.io/MinerU/usage/model_source/
MinerU GitHub README/release notes: https://github.com/opendatalab/MinerU
uv project sync documentation: https://docs.astral.sh/uv/concepts/projects/sync/
PyTorch previous versions: https://docs.pytorch.org/get-started/previous-versions/
PyTorch CUDA architecture support update: https://dev-discuss.pytorch.org/t/cuda-toolkit-version-and-architecture-support-update-maxwell-and-pascal-architecture-support-removed-in-cuda-12-8-and-12-9-builds/3128
PyTorch CUDA availability API: https://docs.pytorch.org/docs/2.11/generated/torch.cuda.is_available.html
