# ConvertPDFToMD

Local-only PDF-to-Markdown converter for math-heavy digital documents.

## Status

The project currently provides a Python package, `pdf2md convert`, Markdown recheck via `pdf2md recheck`, metadata/report output, mocked MinerU adapter tests, `pdf2md doctor` setup diagnostics, and Sprint 9 release-gate documentation. Real local MinerU sample validation remains optional and may be blocked until MinerU 3.1.0 and local model/cache setup are available.

## Setup

Use Windows PowerShell with Python 3.12. If `uv` is installed but a new shell cannot find it, add the per-user install directory to PATH for the current session:

```powershell
$env:Path = "$env:USERPROFILE\.local\bin;$env:Path"
```

Sync the project and run the fast local test loop:

```powershell
uv sync
uv run pytest
uv run pdf2md --version
```

For the local GTX 1070 Ti runtime, install CUDA-enabled PyTorch before installing MinerU so MinerU does not resolve to a CPU-only torch wheel:

```powershell
uv sync
uv pip install --index-url https://download.pytorch.org/whl/cu126 torch==2.6.0 torchvision==0.21.0
uv pip install "mineru[core]==3.1.0"
uv run mineru-models-download -s huggingface -m all
[Environment]::SetEnvironmentVariable("MINERU_MODEL_SOURCE", "local", "User")
$env:MINERU_MODEL_SOURCE = "local"
uv run pdf2md doctor
```

Run `uv sync` before the runtime install commands. If you run `uv sync` again later, repeat the runtime install commands because MinerU and CUDA PyTorch are intentionally not part of the default fast test dependency set.

Install the optional local MathJax checker when you want formula renderability counts to reflect real MathJax parsing instead of the nonfatal "checker unavailable" warning:

```powershell
npm install
npm run mathjax-checker:health
uv run pdf2md doctor
```

The checker runs through local Node.js and the local `mathjax` package only. It never uses a CDN or hosted renderer, and conversion still completes if Node.js or MathJax is missing.

For release checks, see [docs/V1RELEASECHECKLIST.md](docs/V1RELEASECHECKLIST.md). It separates the default fast gates from optional local MinerU/GPU/sample fixture evaluation. Optional fixture runs set `PDF2MD_RUN_MINERU_FIXTURES=1`. They should use only local PDFs, write generated outputs to a temporary or ignored local directory, and count a sample conversion as successful only when the Markdown, metadata JSON, and `.report.md` outputs all exist.
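
The "all three outputs exist" rule can be sketched as a small check. This is an illustration only, not the project's test code: the helper name `sample_outputs_complete` is hypothetical, and it assumes the metadata and report files sit next to the Markdown file as `<stem>.metadata.json` and `<stem>.report.md`.

```python
from pathlib import Path


def sample_outputs_complete(markdown_path: Path) -> bool:
    """Return True only if the Markdown, metadata JSON, and report all exist.

    Hypothetical helper: assumes the adjacent-file naming
    <stem>.md, <stem>.metadata.json, <stem>.report.md.
    """
    stem = markdown_path.stem  # "doc" for "outputs/doc/doc.md"
    required = [
        markdown_path,
        markdown_path.with_name(stem + ".metadata.json"),
        markdown_path.with_name(stem + ".report.md"),
    ]
    # A partial set of outputs does not count as a successful conversion.
    return all(path.is_file() for path in required)
```

A fixture run could call this once per sample and treat a `False` result as a failed conversion, even when the Markdown file itself was written.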

Install MinerU 3.1.0 as an explicit local setup step so the `mineru` executable is available on PATH. This project calls MinerU only through the direct local CLI shape:

```powershell
mineru -p <input_path> -o <output_path>
```

`pdf2md convert` requests GPU execution by default with `--gpu cuda:0`. The adapter maps that to MinerU's local `MINERU_DEVICE_MODE=cuda` and `CUDA_VISIBLE_DEVICES=0` environment for the MinerU subprocess. Actual GPU execution still requires a CUDA-capable local PyTorch/MinerU stack; `doctor` reports when PyTorch is CPU-only or CUDA is unavailable.
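
As a rough illustration of that mapping, the sketch below turns a `--gpu` value into the subprocess environment. The function name `mineru_env_for_gpu` is hypothetical, and only the `cuda:0` case is documented above; handling other device indices the same way is an assumption.

```python
import os
from typing import Dict, Mapping, Optional


def mineru_env_for_gpu(gpu: str, base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build the environment for the MinerU subprocess from a value like "cuda:0".

    Hypothetical helper, not the adapter's actual code.
    """
    env = dict(os.environ if base_env is None else base_env)
    device, _, index = gpu.partition(":")
    if device != "cuda":
        raise ValueError(f"unsupported device: {gpu!r}")
    index = index or "0"  # assumption: a bare "cuda" means device 0
    if not index.isdigit():
        raise ValueError(f"invalid CUDA device index: {gpu!r}")
    env["MINERU_DEVICE_MODE"] = "cuda"
    env["CUDA_VISIBLE_DEVICES"] = index
    return env
```

The returned dictionary could be passed as `env=` to `subprocess.run`, which keeps the parent process environment unchanged.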

Run setup diagnostics before conversion:

```powershell
uv run pdf2md doctor
```

`doctor` checks Python 3.12, `uv`, the MinerU CLI and version, NVIDIA GPU visibility through `nvidia-smi`, PyTorch CUDA visibility when PyTorch is installed, local model/cache/config paths, local MathJax checker availability, and the strict-local runtime policy. It does not install packages, download models, run conversions, or inspect `samples/`.
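
Several of these checks can be approximated with the standard library. The sketch below is not the `doctor` implementation: the function name `quick_checks` and the result keys are hypothetical, and it only tests whether executables are on PATH rather than querying their versions.

```python
import shutil
import sys
from typing import Dict, Optional


def quick_checks() -> Dict[str, Optional[bool]]:
    """Read-only environment checks: nothing is installed or downloaded."""
    checks: Dict[str, Optional[bool]] = {
        "python_3_12": sys.version_info[:2] == (3, 12),
        "uv": shutil.which("uv") is not None,
        "mineru": shutil.which("mineru") is not None,
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
    }
    try:
        import torch  # optional: only inspected when already installed
    except ImportError:
        checks["torch_cuda"] = None  # not installed, so not evaluated
    else:
        checks["torch_cuda"] = bool(torch.cuda.is_available())
    return checks


if __name__ == "__main__":
    for name, ok in quick_checks().items():
        print(f"{name}: {ok}")
```

Reporting `None` for the PyTorch entry keeps "not installed" distinct from "installed but CPU-only", which matches the "when PyTorch is installed" wording above.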

The model/cache check looks for these environment variables when present:

- `MINERU_MODEL_SOURCE`
- `MINERU_MODEL_DIR`
- `MINERU_CACHE_DIR`
- `MINERU_TOOLS_CONFIG_JSON`
- `HF_HOME`
- `HUGGINGFACE_HUB_CACHE`
- `MODELSCOPE_CACHE`

It also checks for `%USERPROFILE%\mineru.json`, which MinerU documents as its default user config location. Missing model/cache paths are warnings because model download and cache population must be explicit setup actions.
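
A minimal sketch of this kind of check follows. The function name `model_cache_warnings` and the warning texts are hypothetical, and treating `MINERU_MODEL_SOURCE` as a mode string rather than a path is an assumption based on the `local` value used during setup.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional

MODEL_ENV_VARS = (
    "MINERU_MODEL_SOURCE",
    "MINERU_MODEL_DIR",
    "MINERU_CACHE_DIR",
    "MINERU_TOOLS_CONFIG_JSON",
    "HF_HOME",
    "HUGGINGFACE_HUB_CACHE",
    "MODELSCOPE_CACHE",
)


def model_cache_warnings(
    env: Optional[Mapping[str, str]] = None, home: Optional[Path] = None
) -> List[str]:
    """Collect warnings (never errors) for missing model/cache/config paths."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home  # %USERPROFILE% on Windows
    warnings: List[str] = []
    for name in MODEL_ENV_VARS:
        value = env.get(name)
        if value is None:
            continue  # variables are only inspected when present
        if name == "MINERU_MODEL_SOURCE":
            continue  # assumption: a mode such as "local", not a path
        if not Path(value).exists():
            warnings.append(f"{name} points to a missing path: {value}")
    if not (home / "mineru.json").is_file():
        warnings.append("mineru.json not found in the user profile directory")
    return warnings
```

Returning a list of messages instead of raising keeps missing paths nonfatal, in line with the warning-only behavior described above.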

## Rechecking Markdown

After editing a generated Markdown file, rerun local quality checks and regenerate the adjacent metadata/report files:

```powershell
uv run pdf2md recheck outputs/MITC공부/MITC공부.md
```

`recheck` reads the existing `<stem>.metadata.json` for source PDF, engine, page, and asset provenance. It replaces quality warnings that can be recalculated from the current Markdown, including MathJax render failures and local asset-link warnings, then rewrites `<stem>.metadata.json` and `<stem>.report.md`.
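
The replacement step can be pictured as a merge of stored and freshly computed warnings. The sketch below assumes a hypothetical schema in which each warning is a dictionary with a `kind` key; the real metadata layout, the kind names, and the helper name `refresh_warnings` are not taken from the project.

```python
from typing import Dict, Iterable, List

# Hypothetical kind names for warnings that can be recomputed from Markdown alone.
RECALCULABLE_KINDS = {"mathjax_render_failure", "asset_link"}


def refresh_warnings(
    stored: Iterable[Dict[str, str]], recomputed: Iterable[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Keep provenance-style warnings and swap in freshly computed ones."""
    # Warnings that depend on the original conversion cannot be recalculated,
    # so they are carried over unchanged.
    kept = [w for w in stored if w["kind"] not in RECALCULABLE_KINDS]
    # Stale recalculable warnings are dropped in favor of the new results.
    return kept + list(recomputed)
```

With this shape, fixing a formula in the Markdown removes its stale render-failure warning on the next `recheck`, while conversion-time warnings remain in the metadata.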

## Runtime Policy

Runtime conversion is strict-local. Allowed: direct `mineru` CLI execution and the CLI-internal temporary local `mineru-api` that MinerU starts when `--api-url` is omitted. Prohibited: `--api-url`, remote APIs, router mode, HTTP client backends, remote OpenAI-compatible backends, hosted renderers, and cloud fallbacks.
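
One way to enforce part of this policy is to build the MinerU argument list in a single place and reject `--api-url` there. The sketch below is illustrative: `build_mineru_command` is a hypothetical name, and it guards only the `--api-url` flag named above, not the other prohibited modes.

```python
from pathlib import Path
from typing import List, Sequence, Union

PathLike = Union[str, Path]


def build_mineru_command(
    input_path: PathLike, output_path: PathLike, extra_args: Sequence[str] = ()
) -> List[str]:
    """Return the direct local CLI shape: mineru -p <input_path> -o <output_path>."""
    for arg in extra_args:
        # Both "--api-url URL" and "--api-url=URL" spellings are refused.
        if arg == "--api-url" or arg.startswith("--api-url="):
            raise ValueError("--api-url is prohibited by the strict-local policy")
    return ["mineru", "-p", str(input_path), "-o", str(output_path), *extra_args]
```

Passing the result to `subprocess.run` as a list avoids shell quoting issues with paths that contain spaces or non-ASCII characters.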

Setup may require explicit user-initiated package or model downloads. Those setup downloads are separate from runtime conversion; `pdf2md doctor`, `pdf2md convert`, imports, and default tests must not download packages or models.

The target GPU is NVIDIA GTX 1070 Ti 8GB. `doctor` warns for GTX 1070 Ti/Pascal/pre-Turing GPUs because local CUDA/PyTorch compatibility and VRAM pressure must be validated on the actual machine before relying on acceleration.
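
A pre-Turing test can be expressed in terms of CUDA compute capability: Pascal cards such as the GTX 1070 Ti report 6.1, and Turing starts at 7.5. The sketch below is an assumption about how such a warning could be decided, not the `doctor` logic; `is_pre_turing` is a hypothetical name, and the tuple would typically come from `torch.cuda.get_device_capability()` when PyTorch is installed.

```python
from typing import Tuple

TURING_CAPABILITY = (7, 5)  # first Turing compute capability


def is_pre_turing(capability: Tuple[int, int]) -> bool:
    """Return True for GPUs older than Turing, given (major, minor)."""
    # Tuple comparison orders by major version first, then minor.
    return capability < TURING_CAPABILITY


if __name__ == "__main__":
    print(is_pre_turing((6, 1)))  # GTX 1070 Ti (Pascal)
```

A `True` result signals only that the CUDA/PyTorch combination needs validation on the actual machine; it does not mean acceleration is unavailable.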

## Long PDFs

Chunking is opt-in for long PDFs. Use `--chunk-pages` with no value to split into 20-page chunks, or pass an explicit positive page count:

```powershell
uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages
uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages 20
```

Chunk PDFs are written to a temporary local directory before each MinerU run and are deleted after conversion completes. The generated Markdown files are not merged; each chunk gets its own Markdown, metadata JSON, report Markdown, and assets directory named with the original page range.
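
The page ranges that result from a given chunk size can be worked out as follows. This is a sketch under stated assumptions: `chunk_page_ranges` is a hypothetical helper, and the 1-based inclusive numbering is assumed rather than taken from the project's output naming.

```python
from typing import List, Tuple


def chunk_page_ranges(total_pages: int, chunk_pages: int = 20) -> List[Tuple[int, int]]:
    """Split 1..total_pages into consecutive (first, last) page ranges."""
    if chunk_pages <= 0:
        raise ValueError("chunk_pages must be a positive page count")
    return [
        # The final chunk is shortened so it never runs past the last page.
        (start, min(start + chunk_pages - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_pages)
    ]


if __name__ == "__main__":
    print(chunk_page_ranges(45))  # [(1, 20), (21, 40), (41, 45)]
```

For a 45-page PDF with the default size, this gives three chunks, so three separate sets of outputs would be expected.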

The Python API keeps non-chunked behavior unchanged. `convert_pdf(..., chunk_pages=20)` returns a `BatchConversionResult` with one `ConversionResult` per chunk.

## References

Sources checked on 2026-05-08:

- MinerU Quick Usage: https://opendatalab.github.io/MinerU/usage/quick_usage/
- MinerU CLI Tools: https://opendatalab.github.io/MinerU/usage/cli_tools/
- MinerU Model Source: https://opendatalab.github.io/MinerU/usage/model_source/
- MinerU GitHub README/release notes: https://github.com/opendatalab/MinerU
- uv project sync documentation: https://docs.astral.sh/uv/concepts/projects/sync/
- PyTorch previous versions: https://docs.pytorch.org/get-started/previous-versions/
- PyTorch CUDA architecture support update: https://dev-discuss.pytorch.org/t/cuda-toolkit-version-and-architecture-support-update-maxwell-and-pascal-architecture-support-removed-in-cuda-12-8-and-12-9-builds/3128
- PyTorch CUDA availability API: https://docs.pytorch.org/docs/2.11/generated/torch.cuda.is_available.html