김경종 7e985ae94a add files
2026-04-30 17:05:19 +09:00

# Toolchain Notes
This document summarizes the researched toolchain choices and local compatibility decisions.
## Verified Environment
- OS: Windows 10
- GPU: NVIDIA GeForce GTX 1070 Ti
- VRAM: 8 GB
- NVIDIA driver: 577.00
- Highest CUDA version supported by the driver (as reported by `nvidia-smi`): 12.9
- User-installed CUDA toolkit: 12.4
- Python: 3.11.15 in repo-local `venv`
- Environment manager: Conda / Miniforge
## Python Dependencies
Use one repo-local `venv` and install from `requirements.txt`.
Key pins:
- `torch==2.7.1+cu126`
- `torchvision==0.22.1+cu126`
- `marker-pdf==1.10.2`
- `nougat-ocr==0.1.17`
- `transformers==4.57.6`
- `albumentations==1.3.1`
- `pymupdf==1.27.2.3`
- `pandas==3.0.2`
- `pytest==9.0.3`
- `pypdfium2==4.30.0`
- `opencv-python-headless==4.11.0.86`
- `Pillow==10.4.0`
- `fsspec==2026.2.0`
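
The pins above can be checked against the active environment before a run. The following is a minimal sketch, not part of the project: the helper names `find_pin_mismatches` and `installed_versions` are hypothetical, and only a subset of the pins is copied in for brevity.

```python
from importlib import metadata

# A subset of the pins listed above; extend as needed.
PINS = {
    "torch": "2.7.1+cu126",
    "marker-pdf": "1.10.2",
    "nougat-ocr": "0.1.17",
    "transformers": "4.57.6",
    "albumentations": "1.3.1",
}


def find_pin_mismatches(pins, installed):
    """Return {package: (expected, actual)} for every pin that is not met.

    `installed` maps package name to a version string, or None when missing.
    """
    return {
        name: (expected, installed.get(name))
        for name, expected in pins.items()
        if installed.get(name) != expected
    }


def installed_versions(names):
    """Look up installed versions via importlib.metadata (None if absent)."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    mismatches = find_pin_mismatches(PINS, installed_versions(PINS))
    for name, (expected, actual) in mismatches.items():
        print(f"{name}: expected {expected}, found {actual}")
```

Keeping the comparison separate from the metadata lookup makes the check easy to unit-test without installing the heavy packages.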
## PyTorch / CUDA Decision
- `torch==2.11.0+cu128` imports on this machine but does not support GTX 1070 Ti `sm_61` at runtime.
- `torch==2.7.1+cu126` satisfies Marker `torch>=2.7.0` and successfully runs CUDA tensor operations on GTX 1070 Ti.
- Keep this pin unless a newer official PyTorch wheel is verified to support `sm_61`.
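
Because a wheel can import cleanly and still fail at kernel launch, a candidate PyTorch upgrade is worth checking in two steps: compare the device's compute capability with the architectures the wheel was built for, then run a small CUDA operation. The sketch below assumes the standard `torch.cuda` query functions; the helper names are hypothetical, and the exact-match rule is deliberately conservative (it ignores PTX forward compatibility).

```python
def arch_tag(capability):
    """Turn a (major, minor) compute capability into a tag such as 'sm_61'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def wheel_targets_device(capability, arch_list):
    """Conservative check: the wheel lists the device's exact architecture."""
    return arch_tag(capability) in arch_list


def check_local_gpu():
    """Query the installed torch build; requires torch and a visible GPU."""
    import torch  # imported lazily so the helpers above work without torch

    if not torch.cuda.is_available():
        return False
    capability = torch.cuda.get_device_capability(0)
    if not wheel_targets_device(capability, torch.cuda.get_arch_list()):
        return False
    # Run a small kernel: importing torch successfully proves nothing here.
    x = torch.ones(8, device="cuda")
    return float((x * 2).sum().item()) == 16.0
```

On the machine described above, `check_local_gpu()` would be the gate for changing the pin: only a wheel for which it returns `True` should replace `torch==2.7.1+cu126`.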
## Marker
- Marker is the primary document parser.
- It handles layout analysis, OCR, reading order, body text, headings, tables, figures, captions, and semantic block roles.
- It should be consumed through structured output or adapter APIs where possible, not by scraping final Markdown text.
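
To illustrate consuming structured output rather than scraping Markdown, here is a minimal adapter sketch. It assumes a nested dictionary in which each node carries `block_type`, `html`, and `children` keys; these field names are modeled on Marker's JSON output but should be treated as assumptions and verified against the pinned Marker version. `Block` and `iter_blocks` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    role: str  # semantic role, e.g. "SectionHeader", "Text", "Table"
    html: str  # content carried by this block


def iter_blocks(node):
    """Depth-first walk over a nested block dict, yielding leaf blocks."""
    children = node.get("children") or []
    if not children:
        yield Block(
            role=node.get("block_type", "Unknown"),
            html=node.get("html", ""),
        )
        return
    for child in children:
        yield from iter_blocks(child)
```

Downstream code can then branch on `Block.role` instead of re-parsing heading markers or table pipes out of rendered Markdown.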
## Nougat
- Nougat is used only for formulas and mathematical expressions.
- `nougat-ocr==0.1.17` has loose dependency bounds, so the project pins compatible versions.
- `transformers` 5.x breaks Nougat imports.
- `albumentations` 2.x breaks Nougat transform initialization.
- Nougat failure must fall back to Marker source text.
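
The fallback rule can be sketched as a small wrapper. This is an illustration under assumptions, not the project's implementation: `run_nougat` is a hypothetical zero-argument callable that wraps the real Nougat call, and treating empty output as a failure is an added assumption.

```python
import logging

logger = logging.getLogger("pdftomd.formula")


def resolve_formula(marker_text, run_nougat):
    """Return (text, source) and never raise when Nougat fails.

    `source` is "nougat" on success and "marker" when falling back.
    """
    try:
        latex = run_nougat()
    except Exception as exc:  # import errors, CUDA errors, model errors
        logger.warning("Nougat failed, using Marker text: %s", exc)
        return marker_text, "marker"
    if not latex or not latex.strip():
        # Assumption: empty output counts as a failure too.
        return marker_text, "marker"
    return latex.strip(), "nougat"
```

Returning the source alongside the text lets later stages record which formulas came from the fallback path.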
## PyMuPDF
- PyMuPDF is used for lightweight page analysis, page counts, text-layer quality checks, OCR intervention planning, chunk planning, and low-level PDF/page operations.
- It is not the primary document parser.
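
A text-layer quality check of the kind described above might look like the following sketch. The thresholds (`min_chars`, `max_replacement_ratio`) and both function names are illustrative assumptions, not project settings; the heuristic only looks at text length and the share of U+FFFD replacement characters.

```python
def text_layer_looks_usable(text, min_chars=50, max_replacement_ratio=0.05):
    """Heuristic: enough text, and few U+FFFD replacement characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    bad = stripped.count("\ufffd")
    return bad / len(stripped) <= max_replacement_ratio


def pages_needing_ocr(pdf_path):
    """Return 0-based page numbers whose text layer fails the heuristic."""
    import pymupdf  # imported lazily; requires PyMuPDF to be installed

    with pymupdf.open(pdf_path) as doc:
        return [
            page.number
            for page in doc
            if not text_layer_looks_usable(page.get_text())
        ]
```

The resulting page list is an input to OCR intervention planning only; parsing itself stays with Marker.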
## Comparison Baselines
These tools are useful for research or quality comparison but are not the primary architecture:
- PyMuPDF4LLM
- Docling
- MinerU
- MarkItDown
Do not switch the primary parser without updating `docs/ADR.md`, `docs/ARCHITECTURE.md`, and `docs/CONVERSION_POLICY.md`.
## Reference Links
- Marker PyPI: https://pypi.org/project/marker-pdf/
- Nougat GitHub: https://github.com/facebookresearch/nougat
- PyMuPDF documentation: https://pymupdf.readthedocs.io/
- PyTorch previous versions: https://docs.pytorch.org/get-started/previous-versions/
- GitHub Flavored Markdown spec: https://github.github.io/gfm/
- MathJax TeX delimiters: https://docs.mathjax.org/en/latest/input/tex/delimiters.html
- Docling GitHub: https://github.com/docling-project/docling
- MinerU GitHub: https://github.com/opendatalab/MinerU
## Markdown And Math Rendering
- Markdown table output should target GitHub Flavored Markdown where possible.
- Complex tables may use limited HTML `<table>`.
- Math output uses `$ ... $` for inline formulas and `$$ ... $$` for block formulas.
- `$...$` can conflict with ordinary dollar signs, so delimiter validation and repair are required.
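
Delimiter validation and repair could be approached as in the sketch below. It is a heuristic under stated assumptions: balance is judged by counting unescaped delimiters, and the repair step only escapes a `$` that looks like a currency amount (digits followed by whitespace, punctuation, or end of text). The function names are hypothetical, and text that stays unbalanced is returned unchanged for manual review.

```python
import re

# Matches an escaped dollar (\$), a block delimiter ($$), or a single $.
_TOKEN = re.compile(r"\\\$|\$\$|\$")
# A lone $ followed by a number that ends at whitespace, punctuation, or EOF.
_CURRENCY = re.compile(
    r"(?<![\\$])\$(?=\d[\d,]*(?:\.\d+)?(?:\s|$|[.,;:!?)]))"
)


def delimiters_balanced(text):
    """True when unescaped `$` and `$$` each occur an even number of times."""
    inline = block = 0
    for match in _TOKEN.finditer(text):
        token = match.group()
        if token == "$$":
            block += 1
        elif token == "$":
            inline += 1
    return inline % 2 == 0 and block % 2 == 0


def repair_delimiters(text):
    """Escape currency-like dollars, but only if that restores balance."""
    if delimiters_balanced(text):
        return text
    candidate = _CURRENCY.sub(r"\\$", text)
    return candidate if delimiters_balanced(candidate) else text
```

Repair runs only on unbalanced text because a formula such as `$5 + x$` also starts with a digit and must not be escaped.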
## Model Cache
- Use explicit local cache paths for Marker/Nougat/Hugging Face model downloads.
- The README should include model pre-download and offline execution instructions before the engine is released.
- Default project-local model cache path is `.models/`.
- `PDFTOMD_MODEL_CACHE` can override the default cache root.
- The runtime cache policy derives the Hugging Face cache environment variables from that root and does not download models during validation.
- Runtime logs and resume state are runtime artifacts under `output/.pdftomd-runtime/<document-slug>/`, not generated document sidecars.
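
The cache-root resolution described above can be sketched as follows. `PDFTOMD_MODEL_CACHE` and `.models/` come from this document; `HF_HOME`, `HF_HUB_CACHE`, and `HF_HUB_OFFLINE` are standard Hugging Face Hub variables. The `huggingface` subdirectory layout and the function names are assumptions made for illustration.

```python
from pathlib import Path

DEFAULT_CACHE_DIR = ".models"


def resolve_cache_root(env, project_root):
    """Use PDFTOMD_MODEL_CACHE if set, else <project_root>/.models."""
    override = env.get("PDFTOMD_MODEL_CACHE", "").strip()
    if override:
        return Path(override)
    return Path(project_root) / DEFAULT_CACHE_DIR


def huggingface_env(cache_root, offline=False):
    """Build Hugging Face cache variables under the root; downloads nothing."""
    hf_home = Path(cache_root) / "huggingface"  # assumed layout
    env = {
        "HF_HOME": str(hf_home),
        "HF_HUB_CACHE": str(hf_home / "hub"),
    }
    if offline:
        env["HF_HUB_OFFLINE"] = "1"
    return env
```

A caller would typically apply the result with `os.environ.update(huggingface_env(resolve_cache_root(os.environ, Path.cwd())))` before importing any Hugging Face library, since those libraries read the variables at import time.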
## Licensing Notes
- Current user context is personal use.
- Before redistribution or commercial use, revisit the implications of Marker's GPL license and of the model-weight licenses.
- Process or API isolation can reduce coupling risk, but it is not a substitute for legal review.