PDFToMD/docs/TOOLCHAIN.md
김경종 7e985ae94a add files
2026-04-30 17:05:19 +09:00

Toolchain Notes

This document summarizes the researched toolchain choices and local compatibility decisions.

Verified Environment

  • OS: Windows 10
  • GPU: NVIDIA GeForce GTX 1070 Ti
  • VRAM: 8 GB
  • NVIDIA driver: 577.00
  • Highest CUDA version supported by the driver (reported by nvidia-smi): 12.9
  • User-installed CUDA toolkit: 12.4
  • Python: 3.11.15 in repo-local venv
  • Environment manager: Conda / Miniforge

Python Dependencies

Use one repo-local venv and install from requirements.txt.

Key pins:

  • torch==2.7.1+cu126
  • torchvision==0.22.1+cu126
  • marker-pdf==1.10.2
  • nougat-ocr==0.1.17
  • transformers==4.57.6
  • albumentations==1.3.1
  • pymupdf==1.27.2.3
  • pandas==3.0.2
  • pytest==9.0.3
  • pypdfium2==4.30.0
  • opencv-python-headless==4.11.0.86
  • Pillow==10.4.0
  • fsspec==2026.2.0
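
The pins above can be checked against the active venv with a short script. The following is a minimal sketch, not part of the project: the PINS subset and the helper names installed_version and find_mismatches are hypothetical, and only the standard-library importlib.metadata module is used.

```python
from importlib import metadata
from typing import Callable, Optional

# A subset of the pins listed above; extend as needed.
PINS = {
    "torch": "2.7.1+cu126",
    "torchvision": "0.22.1+cu126",
    "marker-pdf": "1.10.2",
    "nougat-ocr": "0.1.17",
    "transformers": "4.57.6",
    "albumentations": "1.3.1",
}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pins: dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> dict[str, tuple[str, Optional[str]]]:
    """Map each package whose installed version differs from its pin
    to a (pinned, installed) pair."""
    mismatches = {}
    for name, pinned in pins.items():
        found = lookup(name)
        if found != pinned:
            mismatches[name] = (pinned, found)
    return mismatches


if __name__ == "__main__":
    for name, (pinned, found) in find_mismatches(PINS).items():
        print(f"{name}: pinned {pinned}, installed {found or 'missing'}")
```

The lookup parameter exists only so the comparison can be exercised without touching the real environment.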

PyTorch / CUDA Decision

  • torch==2.11.0+cu128 imports on this machine but does not support the GTX 1070 Ti (sm_61) at runtime.
  • torch==2.7.1+cu126 satisfies Marker's torch>=2.7.0 requirement and successfully runs CUDA tensor operations on the GTX 1070 Ti.
  • Keep this pin unless a newer official PyTorch wheel is verified to support sm_61.
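
Verifying a candidate wheel could look roughly like the sketch below. It compares the GPU's compute capability with the architectures the installed build was compiled for, then runs one small CUDA operation, since a successful import alone proves nothing. The helper names are hypothetical. The rule used here (same major version, minor version not above the device's) is a simplifying assumption that ignores PTX JIT targets.

```python
import re
from typing import Iterable, Optional


def parse_sm(arch: str) -> Optional[tuple[int, int]]:
    """Parse 'sm_61' into (6, 1); return None for non-sm entries."""
    match = re.fullmatch(r"sm_(\d+)[a-z]?", arch)
    if match is None:
        return None
    return divmod(int(match.group(1)), 10)


def arch_supported(capability: tuple[int, int], arch_list: Iterable[str]) -> bool:
    """True if some compiled sm_XY shares the device's major version
    and has a minor version no higher than the device's."""
    major, minor = capability
    for arch in arch_list:
        parsed = parse_sm(arch)
        if parsed is not None and parsed[0] == major and parsed[1] <= minor:
            return True
    return False


def check_local_torch() -> bool:
    """Check the installed torch build against GPU 0 and run one CUDA op."""
    import torch  # imported lazily so the helpers above work without torch

    if not torch.cuda.is_available():
        return False
    capability = torch.cuda.get_device_capability(0)
    if not arch_supported(capability, torch.cuda.get_arch_list()):
        return False
    # Importing is not enough: launch a real kernel and read the result back.
    return torch.ones(4, device="cuda").sum().item() == 4.0


if __name__ == "__main__":
    print("CUDA usable on this GPU:", check_local_torch())
```

The architecture lists passed to arch_supported in any test should be treated as illustrative inputs, not as the actual contents of a particular PyTorch wheel.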

Marker

  • Marker is the primary document parser.
  • It handles layout analysis, OCR, reading order, body text, headings, tables, figures, captions, and semantic block roles.
  • It should be consumed through structured output or adapter APIs where possible, not by scraping final Markdown text.

Nougat

  • Nougat is used only for formulas and mathematical expressions.
  • nougat-ocr==0.1.17 has loose dependency bounds, so the project pins compatible versions.
  • transformers 5.x breaks Nougat imports.
  • albumentations 2.x breaks Nougat transform initialization.
  • If Nougat fails, the pipeline must fall back to Marker's source text for that formula.
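
The fallback rule can be sketched as a small wrapper. This is an illustration under assumptions, not the project's adapter: formula_text and the recognize callable are hypothetical names, the actual Nougat call is left to the caller, and treating an empty result as a failure is an added assumption.

```python
import logging
from typing import Callable

logger = logging.getLogger("pdftomd.formula")  # hypothetical logger name


def formula_text(marker_text: str, recognize: Callable[[], str]) -> str:
    """Return the Nougat result, or Marker's text if Nougat fails.

    `recognize` is a zero-argument callable wrapping the Nougat call.
    """
    try:
        result = recognize()
    except Exception:
        # Any Nougat failure is non-fatal: log it and keep Marker's text.
        logger.warning("Nougat failed; falling back to Marker text", exc_info=True)
        return marker_text
    if not result.strip():
        # Assumption: an empty result is treated like a failure.
        return marker_text
    return result
```

Catching Exception broadly is deliberate here, because import-time and model-time errors from Nougat should all lead to the same fallback.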

PyMuPDF

  • PyMuPDF is used for lightweight page analysis, page counts, text-layer quality checks, OCR intervention planning, chunk planning, and low-level PDF/page operations.
  • It is not the primary document parser.
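
A text-layer quality check of the kind described above might start from something like this sketch. The character-count heuristic, the min_chars threshold, and the function names are assumptions made for illustration. Only pymupdf.open and Page.get_text are used from PyMuPDF.

```python
def needs_ocr(page_text: str, min_chars: int = 50) -> bool:
    """Flag a page whose text layer has too few non-whitespace characters."""
    visible = sum(1 for ch in page_text if not ch.isspace())
    return visible < min_chars


def plan_ocr(pdf_path: str, min_chars: int = 50) -> list[int]:
    """Return zero-based indices of pages that likely need OCR."""
    import pymupdf  # imported lazily so needs_ocr works without PyMuPDF

    with pymupdf.open(pdf_path) as doc:
        return [
            index
            for index, page in enumerate(doc)
            if needs_ocr(page.get_text(), min_chars)
        ]
```

A real check would probably also look at garbled-character ratios or image coverage. The count alone only separates pages with no usable text layer from the rest.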

Comparison Baselines

These tools are useful for research or quality comparison but are not the primary architecture:

  • PyMuPDF4LLM
  • Docling
  • MinerU
  • MarkItDown

Do not switch the primary parser without updating docs/ADR.md, docs/ARCHITECTURE.md, and docs/CONVERSION_POLICY.md.

Markdown And Math Rendering

  • Markdown table output should target GitHub Flavored Markdown where possible.
  • Complex tables may use limited HTML <table>.
  • Math output uses $ ... $ for inline formulas and $$ ... $$ for block formulas.
  • $...$ can conflict with ordinary dollar signs, so delimiter validation and repair are required.
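
Delimiter validation could begin with a balance check like the one below. It is a minimal sketch: the function name is hypothetical, it treats \$ as an escaped literal dollar sign, and it only detects unbalanced delimiters. Repair is not shown.

```python
import re


def has_unbalanced_math_delimiters(text: str) -> bool:
    """Return True if $ or $$ delimiters in `text` are not paired."""
    # Escaped dollars (\$) are literal currency signs, not delimiters.
    stripped = re.sub(r"\\\$", "", text)
    # Block delimiters must come in pairs.
    if len(re.findall(r"\$\$", stripped)) % 2 == 1:
        return True
    # With block delimiters removed, the remaining single dollars
    # must also come in pairs.
    inline = stripped.replace("$$", "")
    return inline.count("$") % 2 == 1
```

A string such as "Costs $5 and $x$" is flagged because the currency sign is not escaped, which is the conflict the last bullet describes.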

Model Cache

  • Use explicit local cache paths for Marker/Nougat/Hugging Face model downloads.
  • README should include model pre-download and offline execution instructions before the engine is released.
  • Default project-local model cache path is .models/.
  • PDFTOMD_MODEL_CACHE can override the default cache root.
  • The runtime cache policy sets the Hugging Face cache environment variables relative to that root and does not download models during validation.
  • Runtime logs and resume state are runtime artifacts under output/.pdftomd-runtime/<document-slug>/, not generated document sidecars.
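
The cache-root rule can be sketched as a pure function that touches neither the filesystem nor the network. The function names and the huggingface subdirectory are hypothetical. HF_HOME is a real Hugging Face variable, but which variables the project actually sets is not stated here, so treat that choice as an assumption.

```python
from pathlib import Path
from typing import Mapping

DEFAULT_CACHE_DIR = ".models"
OVERRIDE_VAR = "PDFTOMD_MODEL_CACHE"


def resolve_cache_root(environ: Mapping[str, str], project_root: Path) -> Path:
    """Use PDFTOMD_MODEL_CACHE if set and non-empty, else <project>/.models."""
    override = environ.get(OVERRIDE_VAR, "").strip()
    return Path(override) if override else project_root / DEFAULT_CACHE_DIR


def cache_environment(
    environ: Mapping[str, str], project_root: Path
) -> dict[str, str]:
    """Build the cache-related variables to export; nothing is downloaded."""
    root = resolve_cache_root(environ, project_root)
    # Assumption: Hugging Face files live in a "huggingface" subdirectory.
    return {"HF_HOME": str(root / "huggingface")}
```

The returned mapping would be merged into os.environ before Marker, Nougat, or transformers are imported, since those libraries typically read cache settings at import time.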

Licensing Notes

  • The current usage context is personal use.
  • Before redistribution or commercial use, revisit the implications of Marker's GPL license and the model-weight licenses.
  • Process or API isolation can reduce coupling risk, but it is not a substitute for legal review.