baram2584/PDFToMD

김경종 73f955a8ce modify progress.md

2026-05-08 16:44:58 +09:00

Name                   Last commit         Date
.codex                 add pdftomd         2026-05-08 16:42:19 +09:00
docs                   add pdftomd         2026-05-08 16:42:19 +09:00
samples                add pdftomd         2026-05-08 16:42:19 +09:00
src/pdf2md             add pdftomd         2026-05-08 16:42:19 +09:00
tests                  add pdftomd         2026-05-08 16:42:19 +09:00
tools/mathjax-checker  add pdftomd         2026-05-08 16:42:19 +09:00
.gitignore             add pdftomd         2026-05-08 16:42:19 +09:00
AGENTS.md              add pdftomd         2026-05-08 16:42:19 +09:00
ARCHITECTURE.md        add pdftomd         2026-05-08 16:42:19 +09:00
package-lock.json      add pdftomd         2026-05-08 16:42:19 +09:00
package.json           add pdftomd         2026-05-08 16:42:19 +09:00
PLAN.md                add pdftomd         2026-05-08 16:42:19 +09:00
PRD.md                 add pdftomd         2026-05-08 16:42:19 +09:00
PROGRESS.md            modify progress.md  2026-05-08 16:44:58 +09:00
pyproject.toml         add pdftomd         2026-05-08 16:42:19 +09:00
README.md              add pdftomd         2026-05-08 16:42:19 +09:00
uv.lock                add pdftomd         2026-05-08 16:42:19 +09:00

README.md

ConvertPDFToMD

Local-only PDF-to-Markdown converter for math-heavy digital documents.

Status

The project currently provides a Python package, the pdf2md convert command, metadata and report output, mocked MinerU adapter tests, the pdf2md doctor setup diagnostics, and Sprint 9 release-gate documentation. Validation against real local MinerU samples remains optional and may be blocked until MinerU 3.1.0 and the local model/cache setup are available.

Setup

Use Windows PowerShell with Python 3.12. If uv is installed but a new shell cannot find it, add the per-user install directory to PATH for the current session:

$env:Path = "C:\Users\user\.local\bin;$env:Path"

Sync the project and run the fast local test loop:

uv sync
uv run pytest
uv run pdf2md --version

For the local GTX 1070 Ti runtime, install CUDA-enabled PyTorch before installing MinerU so MinerU does not resolve to a CPU-only torch wheel:

uv sync
uv pip install --index-url https://download.pytorch.org/whl/cu126 torch==2.6.0 torchvision==0.21.0
uv pip install "mineru[core]==3.1.0"
uv run mineru-models-download -s huggingface -m all
[Environment]::SetEnvironmentVariable("MINERU_MODEL_SOURCE", "local", "User")
$env:MINERU_MODEL_SOURCE = "local"
uv run pdf2md doctor

Run uv sync before the runtime install commands. If you run uv sync again later, repeat the runtime install commands because MinerU and CUDA PyTorch are intentionally not part of the default fast test dependency set.

Install the optional local MathJax checker when you want formula renderability counts to reflect real MathJax parsing instead of the nonfatal "checker unavailable" warning:

npm install
npm run mathjax-checker:health
uv run pdf2md doctor

The checker runs through local Node.js and the local mathjax package only. It never uses a CDN or hosted renderer, and conversion still completes if Node.js or MathJax is missing.
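The availability rule above can be sketched in Python. This is a minimal illustration, not the project's implementation: the function name mathjax_checker_available and the node_modules/mathjax location are assumptions based on a standard npm install of a package named mathjax.

```python
import shutil
from pathlib import Path


def mathjax_checker_available(project_root: Path) -> bool:
    """Return True only when both local Node.js and a local mathjax package exist."""
    # shutil.which looks for the executable on PATH; nothing is fetched from a network.
    node = shutil.which("node")
    # Assumption: npm install placed the package under node_modules/mathjax.
    package_dir = project_root / "node_modules" / "mathjax"
    return node is not None and package_dir.is_dir()
```

A caller would treat a False result as the nonfatal "checker unavailable" warning and continue the conversion.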

For release checks, see docs/V1RELEASECHECKLIST.md. It separates the default fast gates from optional local MinerU/GPU/sample fixture evaluation. Optional fixture runs are enabled with PDF2MD_RUN_MINERU_FIXTURES=1. They should use only local PDFs, write generated outputs to a temporary or ignored local directory, and count a sample conversion as successful only when the Markdown, metadata JSON, and .report.md outputs all exist.
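The success criterion for a sample conversion can be expressed as a small check. The sketch below is hedged: the function name sample_succeeded and the .metadata.json suffix are hypothetical, since only the .report.md suffix is stated here.

```python
from pathlib import Path


def sample_succeeded(out_dir: Path, stem: str) -> bool:
    """A sample counts as successful only if all three outputs exist."""
    expected = [
        out_dir / f"{stem}.md",             # converted Markdown
        out_dir / f"{stem}.metadata.json",  # hypothetical metadata file name
        out_dir / f"{stem}.report.md",      # report Markdown
    ]
    return all(path.is_file() for path in expected)
```

Requiring all three files keeps a partially written output directory from being counted as a pass.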

Install MinerU 3.1.0 as an explicit local setup step so the mineru executable is available on PATH. This project calls MinerU only through the direct local CLI shape:

mineru -p <input_path> -o <output_path>
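As an illustration of that CLI shape, the argument list could be built as follows. The helper name build_mineru_command is hypothetical and the project's adapter may be structured differently.

```python
from pathlib import Path


def build_mineru_command(input_path: Path, output_path: Path) -> list[str]:
    """Return the argv list for the direct local MinerU CLI call."""
    # Only -p and -o are passed; --api-url is never added, matching the
    # strict-local runtime policy.
    return ["mineru", "-p", str(input_path), "-o", str(output_path)]
```

Passing a list rather than a shell string to subprocess.run avoids quoting problems with Windows paths that contain spaces.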

pdf2md convert requests GPU execution by default with --gpu cuda:0. The adapter maps that to MINERU_DEVICE_MODE=cuda and CUDA_VISIBLE_DEVICES=0 in the environment of the MinerU subprocess. Actual GPU execution still requires a CUDA-capable local PyTorch/MinerU stack; doctor reports when PyTorch is CPU-only or CUDA is unavailable.
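The mapping from the --gpu value to the subprocess environment might look like the sketch below. The function name gpu_env is hypothetical, and accepting only the cuda:<index> form is an assumption made for this example.

```python
import os
from typing import Mapping, Optional


def gpu_env(gpu: str = "cuda:0",
            base_env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Translate a --gpu value such as 'cuda:0' into subprocess environment variables."""
    env = dict(os.environ if base_env is None else base_env)
    device, _, index = gpu.partition(":")
    # Assumption: only the 'cuda:<index>' form is accepted in this sketch.
    if device != "cuda" or not (index.isascii() and index.isdigit()):
        raise ValueError(f"unsupported --gpu value: {gpu!r}")
    env["MINERU_DEVICE_MODE"] = "cuda"   # device mode named in the text above
    env["CUDA_VISIBLE_DEVICES"] = index  # expose only the requested GPU
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run so that the settings apply only to the MinerU child process and not to the parent shell.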

Run setup diagnostics before conversion:

uv run pdf2md doctor

doctor checks Python 3.12, uv, the MinerU CLI and version, NVIDIA GPU visibility through nvidia-smi, PyTorch CUDA visibility when PyTorch is installed, local model/cache/config paths, local MathJax checker availability, and the strict-local runtime policy. It does not install packages, download models, run conversions, or inspect samples/.

The model/cache check looks for these environment variables when present:

MINERU_MODEL_SOURCE
MINERU_MODEL_DIR
MINERU_CACHE_DIR
MINERU_TOOLS_CONFIG_JSON
HF_HOME
HUGGINGFACE_HUB_CACHE
MODELSCOPE_CACHE

It also checks for %USERPROFILE%\mineru.json, which MinerU documents as its default user config location. Missing model/cache paths are warnings because model download and cache population must be explicit setup actions.
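A rough sketch of such a model/cache check is shown below. It is illustrative only: the function name check_model_paths, the warning wording, the choice to treat every variable except MINERU_MODEL_SOURCE as a filesystem path, and the warning for a non-local model source are all assumptions.

```python
from pathlib import Path
from typing import Mapping

# Assumption: these variables hold filesystem paths when they are set.
PATH_VARS = (
    "MINERU_MODEL_DIR",
    "MINERU_CACHE_DIR",
    "MINERU_TOOLS_CONFIG_JSON",
    "HF_HOME",
    "HUGGINGFACE_HUB_CACHE",
    "MODELSCOPE_CACHE",
)


def check_model_paths(env: Mapping[str, str], home: Path) -> list[str]:
    """Return warning strings; an empty list means nothing was flagged."""
    warnings = []
    if env.get("MINERU_MODEL_SOURCE") != "local":
        warnings.append("MINERU_MODEL_SOURCE is not set to 'local'")
    for name in PATH_VARS:
        value = env.get(name)
        # Unset variables are skipped; only set-but-missing paths are flagged.
        if value and not Path(value).exists():
            warnings.append(f"{name} points to a missing path: {value}")
    if not (home / "mineru.json").is_file():
        warnings.append(f"no mineru.json found in {home}")
    return warnings
```

Calling check_model_paths(os.environ, Path.home()) would inspect the current session. The function only reads the environment and the filesystem, so it never triggers a download.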

Runtime Policy

Runtime conversion is strict-local. Allowed: direct mineru CLI execution and the CLI-internal temporary local mineru-api that MinerU starts when --api-url is omitted. Prohibited: --api-url, remote APIs, router mode, HTTP client backends, remote OpenAI-compatible backends, hosted renderers, and cloud fallbacks.
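One way to enforce part of this policy is a guard that runs before the MinerU subprocess is launched. The sketch below is an assumption about how that could be done; the name check_strict_local is hypothetical, and only --api-url is modeled because it is the only flag spelled out in the policy.

```python
from typing import Sequence

# Only --api-url is named explicitly in the policy text; other prohibited
# modes are not modeled here because their flag spellings are not given.
PROHIBITED_FLAGS = frozenset({"--api-url"})


def check_strict_local(argv: Sequence[str]) -> None:
    """Raise ValueError if argv contains a flag the strict-local policy forbids."""
    for arg in argv:
        flag = arg.split("=", 1)[0]  # also catches the --flag=value spelling
        if flag in PROHIBITED_FLAGS:
            raise ValueError(f"{flag} is not allowed in strict-local mode")
```

Failing before the subprocess starts means a misconfigured call cannot reach a remote endpoint at all.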

Setup may require explicit user-initiated package or model downloads. Those setup downloads are separate from runtime conversion; pdf2md doctor, pdf2md convert, imports, and default tests must not download packages or models.

The target GPU is NVIDIA GTX 1070 Ti 8GB. doctor warns for GTX 1070 Ti/Pascal/pre-Turing GPUs because local CUDA/PyTorch compatibility and VRAM pressure must be validated on the actual machine before relying on acceleration.
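A pre-Turing check can be written against the CUDA compute capability. This is a sketch under assumptions: how doctor actually detects the GPU generation is not described here, and the function name is_pre_turing is hypothetical. The GTX 1070 Ti (Pascal) reports compute capability 6.1, and Turing starts at 7.5.

```python
TURING = (7, 5)  # first Turing compute capability


def is_pre_turing(compute_capability: tuple[int, int]) -> bool:
    """Return True for GPUs older than Turing, e.g. Pascal at (6, 1)."""
    # Tuples compare element by element, so (6, 1) < (7, 5) and (7, 0) < (7, 5).
    return compute_capability < TURING
```

When a CUDA-enabled PyTorch is installed, torch.cuda.get_device_capability() returns a (major, minor) tuple that could be passed to this function.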

Long PDFs

Chunking is opt-in for long PDFs. Use --chunk-pages with no value to split into 20-page chunks, or pass an explicit positive page count:

uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages
uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages 20

Chunk PDFs are written to a temporary local directory before each MinerU run and are deleted after conversion completes. The generated Markdown files are not merged; each chunk gets its own Markdown, metadata JSON, report Markdown, and assets directory named with the original page range.
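The page ranges used for chunking can be computed as in the sketch below. The function name chunk_page_ranges and the 1-based inclusive page numbering are assumptions; the project's own naming of chunk outputs may differ.

```python
def chunk_page_ranges(total_pages: int, chunk_pages: int = 20) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (start, end) ranges."""
    if total_pages < 0 or chunk_pages <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_pages must be positive")
    return [
        # The last chunk is clamped so it never runs past the final page.
        (start, min(start + chunk_pages - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_pages)
    ]
```

For a 45-page PDF with the default size, this yields (1, 20), (21, 40), and (41, 45), so the last chunk is shorter than the others.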

The Python API keeps non-chunked behavior unchanged. convert_pdf(..., chunk_pages=20) returns a BatchConversionResult with one ConversionResult per chunk.

References

Sources checked on 2026-05-08:

MinerU Quick Usage: https://opendatalab.github.io/MinerU/usage/quick_usage/
MinerU CLI Tools: https://opendatalab.github.io/MinerU/usage/cli_tools/
MinerU Model Source: https://opendatalab.github.io/MinerU/usage/model_source/
MinerU GitHub README/release notes: https://github.com/opendatalab/MinerU
uv project sync documentation: https://docs.astral.sh/uv/concepts/projects/sync/
PyTorch previous versions: https://docs.pytorch.org/get-started/previous-versions/
PyTorch CUDA architecture support update: https://dev-discuss.pytorch.org/t/cuda-toolkit-version-and-architecture-support-update-maxwell-and-pascal-architecture-support-removed-in-cuda-12-8-and-12-9-builds/3128
PyTorch CUDA availability API: https://docs.pytorch.org/docs/2.11/generated/torch.cuda.is_available.html