# Toolchain Notes

This document summarizes the researched toolchain choices and local compatibility decisions.

## Verified Environment

- OS: Windows 10
- GPU: NVIDIA GeForce GTX 1070 Ti
- VRAM: 8 GB
- NVIDIA driver: 577.00
- CUDA version reported by `nvidia-smi` (the highest version the driver supports): 12.9
- User-installed CUDA toolkit: 12.4
- Python: 3.11.15 in repo-local `venv`
- Environment manager: Conda / Miniforge

## Python Dependencies

Use one repo-local `venv` and install from `requirements.txt`.

Key pins:

- `torch==2.7.1+cu126`
- `torchvision==0.22.1+cu126`
- `marker-pdf==1.10.2`
- `nougat-ocr==0.1.17`
- `transformers==4.57.6`
- `albumentations==1.3.1`
- `pymupdf==1.27.2.3`
- `pandas==3.0.2`
- `pytest==9.0.3`
- `pypdfium2==4.30.0`
- `opencv-python-headless==4.11.0.86`
- `Pillow==10.4.0`
- `fsspec==2026.2.0`
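Because several of these pins exist to avoid known breakage, it can help to check the active environment against them before running anything heavy. The following is a minimal sketch using only the standard library; the `PINS` subset, `installed_version`, and `find_pin_mismatches` are hypothetical names chosen for illustration, not part of the project.

```python
from importlib import metadata

# A few of the pins listed above; extend as needed (illustrative subset).
PINS = {
    "torch": "2.7.1+cu126",
    "torchvision": "0.22.1+cu126",
    "transformers": "4.57.6",
    "albumentations": "1.3.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_pin_mismatches(pins, get_version=installed_version):
    """Map each mismatched package to an (expected, found) tuple."""
    mismatches = {}
    for name, expected in pins.items():
        found = get_version(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_pin_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

The version lookup is passed in as a parameter so the comparison can be exercised in a test without installing the real packages.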

## PyTorch / CUDA Decision

- `torch==2.11.0+cu128` imports on this machine but does not support the GTX 1070 Ti (`sm_61`) at runtime.
- `torch==2.7.1+cu126` satisfies Marker's `torch>=2.7.0` requirement and successfully runs CUDA tensor operations on the GTX 1070 Ti.
- Keep this pin unless a newer official PyTorch wheel is verified to support `sm_61`.
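One way to verify a candidate wheel is to compare the device's compute capability against the architectures the wheel was built for, then run a small tensor operation. The sketch below assumes an exact `sm_XY` match is required, which is a conservative simplification; the helper names are hypothetical, while `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are existing PyTorch calls.

```python
def capability_to_arch(capability):
    """Convert a (major, minor) compute capability into an 'sm_XY' string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def wheel_supports_device(capability, arch_list):
    """Return True if the wheel lists kernels for exactly this architecture."""
    return capability_to_arch(capability) in arch_list


def report_device_support(device_index=0):
    """Print whether the installed torch wheel covers the local GPU."""
    import torch  # lazy import so the helpers above work without torch

    if not torch.cuda.is_available():
        print("CUDA is not available")
        return False
    capability = torch.cuda.get_device_capability(device_index)
    arch_list = torch.cuda.get_arch_list()
    print(capability_to_arch(capability), arch_list)
    if not wheel_supports_device(capability, arch_list):
        print("Wheel does not list this architecture; expect runtime failures")
        return False
    # A small tensor operation confirms that kernels actually launch.
    x = torch.ones(4, device=f"cuda:{device_index}")
    print((x * 2).sum().item())
    return True
```

Calling `report_device_support()` inside the `venv` after changing the torch pin gives a quick signal before any model is loaded.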

## Marker

- Marker is the primary document parser.
- It handles layout analysis, OCR, reading order, body text, headings, tables, figures, captions, and semantic block roles.
- It should be consumed through structured output or adapter APIs where possible, not by scraping final Markdown text.
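As an illustration of consuming structured output rather than scraping Markdown, the sketch below walks a nested block tree and selects blocks by role. The dictionary shape (`id`, `block_type`, `html`, `children`) is an assumption modeled on Marker's JSON output and should be checked against the installed version; `iter_blocks`, `blocks_by_role`, and the sample tree are hypothetical.

```python
def iter_blocks(block):
    """Yield every block in a nested tree, parents before children."""
    yield block
    for child in block.get("children") or []:
        yield from iter_blocks(child)


def blocks_by_role(root, role):
    """Collect blocks whose block_type matches the requested role."""
    return [b for b in iter_blocks(root) if b.get("block_type") == role]


# Hand-written sample tree in the assumed shape (not real Marker output).
SAMPLE_PAGE = {
    "id": "/page/0/Page/0",
    "block_type": "Page",
    "children": [
        {"id": "/page/0/SectionHeader/0", "block_type": "SectionHeader",
         "html": "<h1>Title</h1>", "children": None},
        {"id": "/page/0/Text/1", "block_type": "Text",
         "html": "<p>Body</p>", "children": None},
        {"id": "/page/0/Equation/2", "block_type": "Equation",
         "html": "<math>x</math>", "children": None},
    ],
}

equations = blocks_by_role(SAMPLE_PAGE, "Equation")
```

Keeping this traversal in an adapter layer means a change in Marker's output shape is absorbed in one place.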

## Nougat

- Nougat is used only for formulas and mathematical expressions.
- `nougat-ocr==0.1.17` has loose dependency bounds, so the project pins compatible versions.
- `transformers 5.x` breaks Nougat imports.
- `albumentations 2.x` breaks Nougat transform initialization.
- If Nougat fails, the pipeline must fall back to Marker's source text.
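The fallback rule can be sketched as a thin wrapper around whatever callable runs Nougat. Here `formula_text` and its `nougat_convert` parameter are hypothetical names, and treating an empty result as a failure is an assumption of this sketch rather than a documented project rule.

```python
import logging

logger = logging.getLogger(__name__)


def formula_text(region, marker_text, nougat_convert):
    """Return Nougat output for a formula region, or Marker's text on failure."""
    try:
        result = nougat_convert(region)
    except Exception:  # broad by design: any Nougat failure triggers fallback
        logger.exception("Nougat failed; falling back to Marker text")
        return marker_text
    # Assumption: an empty or whitespace-only result also counts as a failure.
    if not result or not result.strip():
        return marker_text
    return result
```

Passing the converter in as an argument keeps the Nougat import out of this module, so an import-time break from an incompatible `transformers` or `albumentations` can be caught at one call site.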

## PyMuPDF

- PyMuPDF is used for lightweight page analysis, page counts, text-layer quality checks, OCR intervention planning, chunk planning, and low-level PDF/page operations.
- It is not the primary document parser.
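A text-layer quality check of the kind mentioned above might look like the following sketch. The thresholds and the heuristic itself (minimum length, share of U+FFFD replacement characters) are assumptions for illustration; `pymupdf.open`, `page.get_text()`, and `page.number` are existing PyMuPDF calls.

```python
def text_layer_looks_usable(text, min_chars=50, max_replacement_ratio=0.05):
    """Heuristic: enough characters and few U+FFFD replacement characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    bad = stripped.count("\ufffd")
    return bad / len(stripped) <= max_replacement_ratio


def pages_needing_ocr(pdf_path):
    """Return zero-based page numbers whose text layer fails the heuristic."""
    import pymupdf  # lazy import; older PyMuPDF releases use `import fitz`

    flagged = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            if not text_layer_looks_usable(page.get_text()):
                flagged.append(page.number)
    return flagged
```

The list returned by `pages_needing_ocr` could feed OCR intervention planning, with the thresholds tuned on real documents.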

## Comparison Baselines

These tools are useful for research or quality comparison but are not the primary architecture:

- PyMuPDF4LLM
- Docling
- MinerU
- MarkItDown

Do not switch the primary parser without updating `docs/ADR.md`, `docs/ARCHITECTURE.md`, and `docs/CONVERSION_POLICY.md`.

## Reference Links

- Marker PyPI: https://pypi.org/project/marker-pdf/
- Nougat GitHub: https://github.com/facebookresearch/nougat
- PyMuPDF documentation: https://pymupdf.readthedocs.io/
- PyTorch previous versions: https://docs.pytorch.org/get-started/previous-versions/
- GitHub Flavored Markdown spec: https://github.github.io/gfm/
- MathJax TeX delimiters: https://docs.mathjax.org/en/latest/input/tex/delimiters.html
- Docling GitHub: https://github.com/docling-project/docling
- MinerU GitHub: https://github.com/opendatalab/MinerU

## Markdown And Math Rendering

- Markdown table output should target GitHub Flavored Markdown where possible.
- Complex tables may fall back to a limited subset of HTML `<table>` markup.
- Math output uses `$ ... $` for inline formulas and `$$ ... $$` for block formulas.
- `$...$` can conflict with ordinary dollar signs, so delimiter validation and repair are required.
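Delimiter validation and repair could start from a simple pairing check, as in the sketch below. It is a heuristic under stated assumptions: a `$` immediately followed by a digit is treated as currency, and repair is attempted only when the delimiters fail to pair up. A line such as `It costs $5 and $10` still passes as balanced, so real validation would need more context. All names here are hypothetical.

```python
import re

# An unescaped "$$" or "$"; the lookbehind skips dollars preceded by a backslash.
DELIMITER = re.compile(r"(?<!\\)(\$\$|\$)")
# Assumption of this sketch: "$" directly followed by a digit is currency.
CURRENCY = re.compile(r"(?<!\\)\$(?=\d)")


def delimiters_balanced(text):
    """Return True if inline and block math delimiters each pair up."""
    tokens = DELIMITER.findall(text)
    return tokens.count("$") % 2 == 0 and tokens.count("$$") % 2 == 0


def repair_line(text):
    """Escape currency-like dollars only when delimiters fail to pair up."""
    if delimiters_balanced(text):
        return text
    return CURRENCY.sub(r"\\$", text)
```

Running `repair_line` per line and re-checking with `delimiters_balanced` gives a cheap first pass before any stricter validation.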

## Model Cache

- Use explicit local cache paths for Marker/Nougat/Hugging Face model downloads.
- The README should include model pre-download and offline execution instructions before the engine is released.
- The default project-local model cache path is `.models/`.
- `PDFTOMD_MODEL_CACHE` can override the default cache root.
- The runtime cache policy derives the Hugging Face cache environment variables from that root and does not download models during validation.
- Runtime logs and resume state are runtime artifacts under `output/.pdftomd-runtime/<document-slug>/`, not generated document sidecars.
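The cache-root resolution described above can be sketched as two small pure functions. `PDFTOMD_MODEL_CACHE` and `.models/` come from these notes, and `HF_HOME` and `HF_HUB_OFFLINE` are existing Hugging Face environment variables. The `huggingface` subdirectory, the function names, and the choice of which variables to set are assumptions of this sketch.

```python
import os
from pathlib import Path

DEFAULT_CACHE_DIR = ".models"


def resolve_cache_root(environ=None, project_root="."):
    """Pick the override from PDFTOMD_MODEL_CACHE, else the project default."""
    environ = os.environ if environ is None else environ
    override = environ.get("PDFTOMD_MODEL_CACHE")
    if override:
        return Path(override)
    return Path(project_root) / DEFAULT_CACHE_DIR


def hf_cache_env(cache_root, offline=False):
    """Build Hugging Face cache variables without touching the network."""
    # Assumption: Hugging Face files live in a subdirectory of the cache root.
    env = {"HF_HOME": str(Path(cache_root) / "huggingface")}
    if offline:
        env["HF_HUB_OFFLINE"] = "1"
    return env
```

Neither function creates directories or downloads anything, so both are safe to call during validation. The resulting variables would need to be applied to `os.environ` before any Hugging Face library is imported.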

## Licensing Notes

- Current user context is personal use.
- Before redistribution or commercial use, revisit the implications of Marker's GPL license and the model-weight licenses.
- Process or API isolation can reduce coupling risk, but it is not a substitute for legal review.