
# ConvertPDFToMD

Local-only PDF-to-Markdown converter for math-heavy digital documents.

## Status

The project currently provides a Python package with the `pdf2md convert`, `pdf2md recheck`, and `pdf2md doctor` commands, metadata/report output, mocked MinerU adapter tests, and Sprint 9 release-gate documentation. Validation against real local MinerU samples remains optional and may be blocked until MinerU 3.1.0 and the local model/cache setup are available.

## Setup

Use Windows PowerShell with Python 3.12. If `uv` is installed but a new shell cannot find it, add the per-user install directory to PATH for the current session:

```powershell
$env:Path = "$env:USERPROFILE\.local\bin;$env:Path"
```

Sync the project and run the fast local test loop:

```powershell
uv sync
uv run pytest
uv run pdf2md --version
```

For the local GTX 1070 Ti runtime, install CUDA-enabled PyTorch before installing MinerU so MinerU does not resolve to a CPU-only torch wheel:

```powershell
uv sync
uv pip install --index-url https://download.pytorch.org/whl/cu126 torch==2.6.0 torchvision==0.21.0
uv pip install "mineru[core]==3.1.0"
uv run mineru-models-download -s huggingface -m all
[Environment]::SetEnvironmentVariable("MINERU_MODEL_SOURCE", "local", "User")
$env:MINERU_MODEL_SOURCE = "local"
uv run pdf2md doctor
```

Run `uv sync` before the runtime install commands. If you run `uv sync` again later, repeat the runtime install commands: MinerU and CUDA PyTorch are intentionally not part of the default fast test dependency set, so `uv sync` removes them when it restores the environment to the locked dependencies.

Install the optional local MathJax checker when you want formula renderability counts to reflect real MathJax parsing instead of the nonfatal "checker unavailable" warning:

```powershell
npm install
npm run mathjax-checker:health
uv run pdf2md doctor
```

The checker runs through local Node.js and the local `mathjax` package only. It never uses a CDN or hosted renderer, and conversion still completes if Node.js or MathJax is missing.

For release checks, see [docs/V1RELEASECHECKLIST.md](docs/V1RELEASECHECKLIST.md). It separates the default fast gates from optional local MinerU/GPU/sample fixture evaluation. Optional fixture runs use `PDF2MD_RUN_MINERU_FIXTURES=1`, should use only local PDFs, write generated outputs to a temporary or ignored local directory, and count a sample conversion as successful only when Markdown, metadata JSON, and `.report.md` outputs all exist.
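The success rule for optional fixture runs can be sketched as a small check. This is an illustration only: the function names are hypothetical, and the adjacent file names assume the `<stem>.metadata.json` and `<stem>.report.md` naming that `pdf2md recheck` uses.

```python
import os
from pathlib import Path


def fixtures_enabled() -> bool:
    """Optional MinerU fixture runs are opt-in through an environment variable."""
    return os.environ.get("PDF2MD_RUN_MINERU_FIXTURES") == "1"


def sample_outputs_complete(markdown_path: Path) -> bool:
    """Return True only when Markdown, metadata JSON, and report all exist.

    Assumption: metadata and report files sit next to the Markdown file
    and share its stem.
    """
    parent = markdown_path.parent
    stem = markdown_path.stem
    expected = [
        markdown_path,
        parent / f"{stem}.metadata.json",
        parent / f"{stem}.report.md",
    ]
    return all(path.is_file() for path in expected)
```

A conversion that produced Markdown but no report would count as a failure under this check.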

Install MinerU 3.1.0 as an explicit local setup step so the `mineru` executable is available on PATH. This project calls MinerU only through the direct local CLI shape:

```powershell
mineru -p <input_path> -o <output_path>
```

`pdf2md convert` requests GPU execution by default with `--gpu cuda:0`. The adapter maps that value to the environment variables `MINERU_DEVICE_MODE=cuda` and `CUDA_VISIBLE_DEVICES=0` for the MinerU subprocess. Actual GPU execution still requires a CUDA-capable local PyTorch/MinerU stack; `doctor` reports when PyTorch is CPU-only or CUDA is unavailable.

Run setup diagnostics before conversion:

```powershell
uv run pdf2md doctor
```

`doctor` checks Python 3.12, `uv`, the MinerU CLI and version, NVIDIA GPU visibility through `nvidia-smi`, PyTorch CUDA visibility when PyTorch is installed, local model/cache/config paths, local MathJax checker availability, and the strict-local runtime policy. It does not install packages, download models, run conversions, or inspect `samples/`.
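As an illustration of the PyTorch CUDA visibility check, the sketch below distinguishes a missing PyTorch install, a CPU-only build, and a CUDA build that cannot see a device. The function name and the status strings are assumptions, not `doctor`'s actual output. It uses `torch.version.cuda` and `torch.cuda.is_available()` and imports PyTorch only when it is installed.

```python
import importlib.util


def torch_cuda_status() -> str:
    """Classify the local PyTorch install without requiring PyTorch."""
    if importlib.util.find_spec("torch") is None:
        return "not-installed"
    import torch  # imported lazily so the check works without PyTorch

    if torch.version.cuda is None:
        # CPU-only wheels are built without a CUDA toolkit version.
        return "cpu-only"
    if torch.cuda.is_available():
        return "cuda-available"
    return "cuda-unavailable"
```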

The model/cache check looks for these environment variables when present:

- `MINERU_MODEL_SOURCE`
- `MINERU_MODEL_DIR`
- `MINERU_CACHE_DIR`
- `MINERU_TOOLS_CONFIG_JSON`
- `HF_HOME`
- `HUGGINGFACE_HUB_CACHE`
- `MODELSCOPE_CACHE`

It also checks for `%USERPROFILE%\mineru.json`, which MinerU documents as its default user config location. Missing model/cache paths are warnings because model download and cache population must be explicit setup actions.

## Rechecking Markdown

After editing a generated Markdown file, rerun local quality checks and regenerate the adjacent metadata/report files:

```powershell
uv run pdf2md recheck outputs/MITC공부/MITC공부.md
```

`recheck` reads the existing `<stem>.metadata.json` for source PDF, engine, page, and asset provenance. It replaces quality warnings that can be recalculated from the current Markdown, including MathJax render failures and local asset-link warnings, then rewrites `<stem>.metadata.json` and `<stem>.report.md`.

## Runtime Policy

Runtime conversion is strict-local. Allowed: direct `mineru` CLI execution and the CLI-internal temporary local `mineru-api` that MinerU starts when `--api-url` is omitted. Prohibited: `--api-url`, remote APIs, router mode, HTTP client backends, remote OpenAI-compatible backends, hosted renderers, and cloud fallbacks.
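A guard for the strict-local policy could be sketched like this. Both function names are hypothetical. The check covers only the `--api-url` flag named above, since the exact flags for the other prohibited modes are not listed here.

```python
def build_mineru_command(input_path: str, output_path: str) -> list[str]:
    """Build the direct local CLI shape: mineru -p <input> -o <output>."""
    return ["mineru", "-p", input_path, "-o", output_path]


def check_strict_local(command: list[str]) -> None:
    """Raise ValueError if the command would point MinerU at an API URL."""
    for arg in command:
        if arg == "--api-url" or arg.startswith("--api-url="):
            raise ValueError("--api-url is prohibited by the strict-local policy")
```

Running the guard immediately before starting the subprocess means a prohibited flag fails fast instead of reaching MinerU.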

Setup may require explicit user-initiated package or model downloads. Those setup downloads are separate from runtime conversion; `pdf2md doctor`, `pdf2md convert`, imports, and default tests must not download packages or models.

The target GPU is NVIDIA GTX 1070 Ti 8GB. `doctor` warns for GTX 1070 Ti/Pascal/pre-Turing GPUs because local CUDA/PyTorch compatibility and VRAM pressure must be validated on the actual machine before relying on acceleration.
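One way to classify a GPU as pre-Turing is by CUDA compute capability: Turing is 7.5, and the Pascal-based GTX 1070 Ti is 6.1. The sketch below is an assumption about how such a warning could be derived, not a description of how `doctor` detects the GPU. The function name is hypothetical. When PyTorch with CUDA is installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple of this shape.

```python
TURING_CAPABILITY = (7, 5)


def is_pre_turing(capability: tuple[int, int]) -> bool:
    """Return True for compute capabilities older than Turing (7.5)."""
    # Tuples compare element by element, so (6, 1) < (7, 5) and (7, 0) < (7, 5).
    return capability < TURING_CAPABILITY
```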

## Long PDFs

Chunking is opt-in for long PDFs. Use `--chunk-pages` with no value to split into 20-page chunks, or pass an explicit positive page count:

```powershell
uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages
uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages 20
```

Chunk PDFs are written to a temporary local directory before each MinerU run and are deleted after conversion completes. The generated Markdown files are not merged; each chunk gets its own Markdown, metadata JSON, report Markdown, and assets directory named with the original page range.

The Python API keeps non-chunked behavior unchanged. `convert_pdf(..., chunk_pages=20)` returns a `BatchConversionResult` with one `ConversionResult` per chunk.
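The page ranges produced by chunking can be illustrated with a short sketch. The function name is hypothetical, and the 1-based inclusive ranges are an assumption about how the original page range is expressed.

```python
def chunk_page_ranges(total_pages: int, chunk_pages: int = 20) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into consecutive inclusive ranges."""
    if total_pages < 1 or chunk_pages < 1:
        raise ValueError("total_pages and chunk_pages must be positive")
    return [
        (start, min(start + chunk_pages - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_pages)
    ]
```

Under these assumptions, a 45-page PDF with the default chunk size yields the ranges 1-20, 21-40, and 41-45, with the last chunk shorter than the rest.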

## References

Sources checked on 2026-05-08:

- MinerU Quick Usage: https://opendatalab.github.io/MinerU/usage/quick_usage/
- MinerU CLI Tools: https://opendatalab.github.io/MinerU/usage/cli_tools/
- MinerU Model Source: https://opendatalab.github.io/MinerU/usage/model_source/
- MinerU GitHub README/release notes: https://github.com/opendatalab/MinerU
- uv project sync documentation: https://docs.astral.sh/uv/concepts/projects/sync/
- PyTorch previous versions: https://docs.pytorch.org/get-started/previous-versions/
- PyTorch CUDA architecture support update: https://dev-discuss.pytorch.org/t/cuda-toolkit-version-and-architecture-support-update-maxwell-and-pascal-architecture-support-removed-in-cuda-12-8-and-12-9-builds/3128
- PyTorch CUDA availability API: https://docs.pytorch.org/docs/2.11/generated/torch.cuda.is_available.html