# Toolchain Notes

This document summarizes the researched toolchain choices and local compatibility decisions.

## Verified Environment

- OS: Windows 10
- GPU: NVIDIA GeForce GTX 1070 Ti (`sm_61`)
- VRAM: 8 GB
- NVIDIA driver: 577.00
- CUDA version reported by `nvidia-smi`: 12.9 (the highest CUDA version the driver supports, not an installed toolkit)
- User-installed CUDA toolkit: 12.4
- Python: 3.11.15 in repo-local `venv`
- Environment manager: Conda / Miniforge

## Python Dependencies

Use a single repo-local `venv` and install all dependencies from `requirements.txt`. The `+cu126` builds of `torch` and `torchvision` are not published on PyPI; they come from the PyTorch wheel index (`https://download.pytorch.org/whl/cu126`), so pip must be able to reach that index during installation.

Key pins:

- `torch==2.7.1+cu126`
- `torchvision==0.22.1+cu126`
- `marker-pdf==1.10.2`
- `nougat-ocr==0.1.17`
- `transformers==4.57.6`
- `albumentations==1.3.1`
- `pymupdf==1.27.2.3`
- `pandas==3.0.2`
- `pytest==9.0.3`
- `pypdfium2==4.30.0`
- `opencv-python-headless==4.11.0.86`
- `Pillow==10.4.0`
- `fsspec==2026.2.0`

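To catch drift from the pins listed above, a small check against the installed package metadata can help. The following is a minimal sketch, not part of the project: the `PINS` subset and the function name `find_pin_mismatches` are hypothetical, and only the standard-library `importlib.metadata` module is used.

```python
from importlib import metadata

# Illustrative subset of the pins from this document; extend as needed.
PINS = {
    "torch": "2.7.1+cu126",
    "torchvision": "0.22.1+cu126",
    "marker-pdf": "1.10.2",
    "nougat-ocr": "0.1.17",
    "transformers": "4.57.6",
    "albumentations": "1.3.1",
}


def find_pin_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch.

    `get_version` is injectable so the check can be tested without
    installing the real packages.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches
```

Calling `find_pin_mismatches(PINS)` inside the activated `venv` returns an empty dict when every listed package matches its pin.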
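## PyTorch / CUDA Decision

- `torch==2.11.0+cu128` imports on this machine but does not support the GTX 1070 Ti (`sm_61`) at runtime, so a successful import alone does not prove GPU compatibility.
- `torch==2.7.1+cu126` satisfies Marker's `torch>=2.7.0` requirement and successfully runs CUDA tensor operations on the GTX 1070 Ti.
- PyTorch wheels bring their own CUDA runtime libraries, so the locally installed 12.4 toolkit does not need to match `cu126`; the driver only has to support CUDA 12.6 or newer, which driver 577.00 does (12.9).
- Keep this pin unless a newer official PyTorch wheel is verified to support `sm_61`.
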
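Verifying a candidate wheel against `sm_61` can be sketched in two steps: compare the GPU's compute capability with the architectures the wheel was compiled for, then run a real CUDA operation. The helper names here are hypothetical. The matching rule assumes CUDA's binary-compatibility convention (a kernel built for `sm_XY` runs on devices of the same major version with an equal or higher minor version) and ignores PTX JIT fallback, so the tensor operation remains the deciding test.

```python
import re

_SM_ENTRY = re.compile(r"^sm_(\d+)(\d)$")


def parse_sm(entry):
    """Parse 'sm_61' into (6, 1); return None for other entries such as 'compute_90'."""
    match = _SM_ENTRY.match(entry)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_native_kernels(capability, arch_list):
    """True if arch_list holds an sm entry with the same major and a minor <= the GPU's."""
    major, minor = capability
    for entry in arch_list:
        parsed = parse_sm(entry)
        if parsed is not None and parsed[0] == major and parsed[1] <= minor:
            return True
    return False


def cuda_smoke_test():
    """Run one small CUDA tensor op; return True only if it succeeds.

    Requires torch and a CUDA device. torch is imported lazily so the
    helpers above stay usable without it.
    """
    import torch

    if not torch.cuda.is_available():
        return False
    try:
        x = torch.ones(8, device="cuda")
        return float((x * 2).sum().item()) == 16.0
    except RuntimeError:
        # Typical failure mode when no kernel image exists for the device.
        return False
```

On the target machine, `has_native_kernels(torch.cuda.get_device_capability(0), torch.cuda.get_arch_list())` gives a quick pre-check, and `cuda_smoke_test()` confirms it with an actual operation.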
## Marker

- Marker is the primary document parser.
- It handles layout analysis, OCR, reading order, body text, headings, tables, figures, captions, and semantic block roles.
- Consume Marker through its structured output or through adapter APIs where possible, not by scraping the final Markdown text.

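The preference for structured output over Markdown scraping can be illustrated with a small adapter that walks a nested block tree. This is a sketch under assumptions: it takes a plain dict shaped like a JSON block tree with `block_type`, `id`, and `children` keys, and the helper names are hypothetical. Check the field names against the Marker version in use before relying on them.

```python
def iter_blocks(node):
    """Depth-first walk over a nested block tree, yielding every block dict."""
    yield node
    # `children` may be missing, None, or an empty list on leaf blocks.
    for child in node.get("children") or []:
        yield from iter_blocks(child)


def collect_by_role(root, roles):
    """Group block ids by block_type for the requested roles."""
    found = {role: [] for role in roles}
    for block in iter_blocks(root):
        role = block.get("block_type")
        if role in found:
            found[role].append(block.get("id"))
    return found
```

An adapter like this keeps downstream code dependent on block roles rather than on how a table or heading happens to be rendered in Markdown.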
## Nougat

- Nougat is used only for formulas and mathematical expressions.
- `nougat-ocr==0.1.17` has loose dependency bounds, so the project pins compatible versions of its dependencies.
- `transformers` 5.x breaks Nougat imports (hence the `transformers==4.57.6` pin).
- `albumentations` 2.x breaks Nougat transform initialization (hence the `albumentations==1.3.1` pin).
- If Nougat fails, the pipeline must fall back to the Marker source text.

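The fallback rule can be sketched as a thin wrapper around the Nougat call. Everything named here is hypothetical: `run_nougat` stands for whatever callable the project uses to invoke Nougat on a formula, and treating empty output as a failure is an assumption of this sketch, not a documented policy.

```python
import logging

logger = logging.getLogger("formula")


def formula_with_fallback(marker_text, run_nougat):
    """Return (text, source): Nougat output if usable, otherwise Marker's text.

    `run_nougat` is a zero-argument callable that returns a string or raises.
    """
    try:
        result = run_nougat()
    except Exception:
        # Covers import errors, CUDA errors, and model failures alike.
        logger.warning("Nougat failed; falling back to Marker text", exc_info=True)
        return marker_text, "marker"
    if not result or not result.strip():
        # Assumption: blank output counts as a failure.
        return marker_text, "marker"
    return result.strip(), "nougat"
```

Returning the source label alongside the text lets later stages record which engine produced each formula.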
## PyMuPDF

- PyMuPDF is used for lightweight page analysis, page counts, text-layer quality checks, OCR intervention planning, chunk planning, and low-level PDF/page operations.
- It is not the primary document parser.

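A text-layer quality check of the kind listed above might look like the following sketch. The thresholds (`min_chars`, `max_bad_ratio`) and both function names are hypothetical and would need tuning on real documents; the PyMuPDF calls used are `pymupdf.open`, page iteration, `Page.get_text`, and `Page.number`.

```python
def text_layer_is_usable(text, min_chars=50, max_bad_ratio=0.1):
    """Heuristic: enough text, and few replacement or control characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    bad = sum(
        1
        for ch in stripped
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\n\t\r")
    )
    return bad / len(stripped) <= max_bad_ratio


def pages_needing_ocr(pdf_path, **thresholds):
    """Return 0-based page numbers whose embedded text layer looks unusable."""
    import pymupdf  # lazy import so the heuristic above works without PyMuPDF

    flagged = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            if not text_layer_is_usable(page.get_text(), **thresholds):
                flagged.append(page.number)
    return flagged
```

The flagged page list can then feed OCR intervention planning without invoking the heavier parser on every page.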
## Comparison Baselines

These tools are useful for research or quality comparison but are not the primary architecture:

- PyMuPDF4LLM
- Docling
- MinerU
- MarkItDown

Do not switch the primary parser without updating `docs/ADR.md`, `docs/ARCHITECTURE.md`, and `docs/CONVERSION_POLICY.md`.

## Reference Links

- Marker PyPI: https://pypi.org/project/marker-pdf/
- Nougat GitHub: https://github.com/facebookresearch/nougat
- PyMuPDF documentation: https://pymupdf.readthedocs.io/
- PyTorch previous versions: https://docs.pytorch.org/get-started/previous-versions/
- GitHub Flavored Markdown spec: https://github.github.io/gfm/
- MathJax TeX delimiters: https://docs.mathjax.org/en/latest/input/tex/delimiters.html
- Docling GitHub: https://github.com/docling-project/docling
- MinerU GitHub: https://github.com/opendatalab/MinerU

## Markdown And Math Rendering

- Markdown table output should target GitHub Flavored Markdown (GFM) where possible.
- Complex tables that GFM cannot express, such as tables with merged cells, may use limited HTML `<table>` markup.
- Math output uses `$ ... $` for inline formulas and `$$ ... $$` for block formulas.
- `$...$` can conflict with ordinary dollar signs, so delimiter validation and repair are required.

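Delimiter validation and repair could start from a line-based check like the sketch below. It is deliberately conservative and makes several assumptions: block math is fenced by lines containing only `$$`, inline math does not span lines, and a `$` directly followed by a digit on an unbalanced line is currency. The function names are hypothetical.

```python
import re

# An unescaped dollar sign directly followed by a digit, e.g. "$5".
_CURRENCY = re.compile(r"(?<!\\)\$(?=\d)")


def _inline_dollar_count(line):
    """Count `$` delimiters, ignoring escaped `\\$` and same-line `$$` pairs."""
    visible = line.replace("\\$", "").replace("$$", "")
    return visible.count("$")


def find_unbalanced_lines(markdown):
    """Return 1-based line numbers whose inline `$` delimiters are unpaired."""
    bad = []
    in_block = False
    for number, line in enumerate(markdown.split("\n"), start=1):
        if line.strip() == "$$":
            in_block = not in_block  # entering or leaving a block formula
            continue
        if not in_block and _inline_dollar_count(line) % 2:
            bad.append(number)
    return bad


def repair_currency(markdown):
    """Escape currency-like `$` on unbalanced lines only; other lines are untouched."""
    lines = markdown.split("\n")
    for number in find_unbalanced_lines(markdown):
        lines[number - 1] = _CURRENCY.sub(r"\\$", lines[number - 1])
    return "\n".join(lines)
```

After a repair pass, running `find_unbalanced_lines` again shows which lines still need attention. A line with two currency amounts looks balanced to this check, so it is a starting point, not a complete validator.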
## Model Cache

- Use explicit local cache paths for Marker, Nougat, and Hugging Face model downloads.
- The README should include model pre-download and offline execution instructions before the engine is released.
- The default project-local model cache path is `.models/`.
- `PDFTOMD_MODEL_CACHE` can override the default cache root.
- The runtime cache policy derives the Hugging Face cache environment variables from that root and does not download models during validation.
- Runtime logs and resume state are runtime artifacts stored under `output/.pdftomd-runtime/<document-slug>/`; they are not generated document sidecars.

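The cache-root resolution described above can be sketched as follows. `PDFTOMD_MODEL_CACHE` and `.models/` come from this document; the function names, the `huggingface/` sub-directory, and the choice of `HF_HOME` and `HF_HUB_CACHE` as the exposed variables are assumptions of this sketch. The code only computes paths: it creates no directories and triggers no downloads.

```python
import os
from pathlib import Path

DEFAULT_CACHE_ROOT = Path(".models")


def resolve_cache_root(environ=None):
    """Return the model cache root, honoring the PDFTOMD_MODEL_CACHE override."""
    environ = os.environ if environ is None else environ
    override = environ.get("PDFTOMD_MODEL_CACHE", "").strip()
    return Path(override) if override else DEFAULT_CACHE_ROOT


def huggingface_cache_env(environ=None):
    """Build Hugging Face cache variables rooted at the resolved cache path."""
    root = resolve_cache_root(environ)
    hf_home = root / "huggingface"  # assumed sub-layout, not confirmed by the project
    return {
        "HF_HOME": str(hf_home),
        "HF_HUB_CACHE": str(hf_home / "hub"),
    }
```

A caller would merge the returned mapping into the process environment before importing any model library. For offline runs, Hugging Face also honors `HF_HUB_OFFLINE=1`, which could be added to the same mapping.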
## Licensing Notes

- The current usage context is personal use.
- Before redistribution or commercial use, revisit the implications of Marker's GPL license and of the model-weight licenses.
- Process or API isolation can reduce coupling risk, but it is not a substitute for legal review.