김경종 7e985ae94a add files
2026-04-30 17:05:19 +09:00

PDFtoMD

PDFtoMD is a local-first conversion engine that turns math-, engineering-, and mechanics-focused PDFs into a bundle of Markdown documents that an AI Agent can read easily.

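The goal is not plain text extraction but a structured conversion that preserves the source document's reading order, paragraph flow, equations, tables, figures, captions, and in-text references.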

Status

  • Current phase: Harness foundation planning.
  • Implementation: not started.
  • Primary target: Windows 10 native CLI/library engine.
  • UI: future PyQt thin client.

Core Direction

  • Marker handles document structure, reading order, OCR/layout, body text, tables, figures, headings, and captions.
  • Nougat handles only mathematical expressions and formula blocks.
  • PyMuPDF handles lightweight page analysis, text-layer quality checks, page counts, chunk planning, and low-level PDF operations.
  • Mixed text/scanned PDFs are in scope.
  • Output is chunked Markdown plus image/table assets under a document slug directory.
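The text-layer check and chunk planning assigned to PyMuPDF above can be sketched in plain Python. This is a minimal illustration, not the project's design: the names `Chunk`, `classify_pages`, and `plan_chunks`, the 50-character threshold, and the fixed 10-page chunk size are all assumptions. The per-page character counts are taken as input; in practice they could come from something like `len(page.get_text())` in PyMuPDF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    start_page: int          # 0-based, inclusive
    end_page: int            # 0-based, exclusive
    has_scanned_pages: bool  # True if any page in the range lacks a usable text layer


def classify_pages(char_counts, min_chars=50):
    """Label each page 'text' or 'scanned' from its extracted character count.

    min_chars is a hypothetical threshold, not a value taken from the project.
    """
    return ["text" if n >= min_chars else "scanned" for n in char_counts]


def plan_chunks(char_counts, pages_per_chunk=10, min_chars=50):
    """Split a document into fixed-size page ranges and flag ranges that need OCR."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    labels = classify_pages(char_counts, min_chars)
    chunks = []
    for start in range(0, len(labels), pages_per_chunk):
        end = min(start + pages_per_chunk, len(labels))
        chunks.append(Chunk(start, end, "scanned" in labels[start:end]))
    return chunks
```

Flagging scanned pages per chunk is one way to handle mixed text/scanned PDFs: a chunk with a clean text layer could skip OCR, while a flagged chunk would go through the full layout and OCR path.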

Environment

Use one repo-local Python 3.11 environment.

conda create -p .\venv python=3.11 -y
.\venv\python.exe -m pip install -r requirements.txt

Verified local baseline:

  • Windows 10
  • NVIDIA GeForce GTX 1070 Ti, 8 GB VRAM
  • NVIDIA driver 577.00
  • PyTorch 2.7.1+cu126
  • Marker 1.10.2
  • Nougat OCR 0.1.17
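To check that an environment still matches this baseline, a small drift report can compare installed package versions against the pinned ones. The sketch below is only an illustration: the helper names are hypothetical, and the distribution names `torch`, `marker-pdf`, and `nougat-ocr` are assumed to be the ones listed in requirements.txt.

```python
from importlib import metadata

# Versions from the verified baseline above. The distribution names are
# assumptions (torch, marker-pdf, nougat-ocr), not confirmed by this README.
BASELINE = {
    "torch": "2.7.1+cu126",
    "marker-pdf": "1.10.2",
    "nougat-ocr": "0.1.17",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def baseline_drift(baseline=BASELINE, lookup=installed_version):
    """Map each package that differs from the baseline to (expected, found)."""
    drift = {}
    for name, expected in baseline.items():
        found = lookup(name)
        if found != expected:
            drift[name] = (expected, found)
    return drift


if __name__ == "__main__":
    for name, (expected, found) in baseline_drift().items():
        print(f"{name}: expected {expected}, found {found}")
```

Run with `.\venv\python.exe`, an empty report would mean the three packages match the baseline; the OS, GPU, and driver entries are outside what this check covers.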

Verification

.\venv\python.exe scripts\validate_workspace.py
.\venv\python.exe -m pip check
.\venv\python.exe -c "import torch; x=torch.ones((1,), device='cuda'); print(torch.__version__, torch.version.cuda, x.item())"
.\venv\Scripts\nougat.exe --help

By default, scripts/validate_workspace.py discovers repo-local Python validation on its own: it prefers .\venv\python.exe, compiles the Harness scripts, and runs scripts/test_*.py with pytest. Setting HARNESS_VALIDATION_COMMANDS or defining npm scripts overrides this discovery.
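The discovery order described above might look roughly like the following. This is a hypothetical sketch and not the contents of scripts/validate_workspace.py: the function name `discover_validation`, the use of `compileall`, and the treatment of HARNESS_VALIDATION_COMMANDS as a single opaque command string are assumptions, and the npm-script override is left out.

```python
import os
from pathlib import Path


def discover_validation(repo_root, env=None):
    """Return (source, commands) for the validation run."""
    env = os.environ if env is None else env
    root = Path(repo_root)

    # Assumption: the override is passed through as one opaque command string.
    override = env.get("HARNESS_VALIDATION_COMMANDS", "").strip()
    if override:
        return "override", [override]

    # Prefer the repo-local interpreter; fall back to whatever "python" resolves to.
    local_python = root / "venv" / "python.exe"
    python = str(local_python) if local_python.is_file() else "python"

    # Compile everything under scripts/, then run any test_*.py files with pytest.
    scripts = root / "scripts"
    commands = [[python, "-m", "compileall", "-q", str(scripts)]]
    tests = sorted(str(p) for p in scripts.glob("test_*.py"))
    if tests:
        commands.append([python, "-m", "pytest", *tests])
    return "python", commands
```

Returning the commands instead of running them keeps the discovery step easy to test without a GPU or a populated environment.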

Important Documents

  • AGENTS.md: persistent repository instructions.
  • PLAN.md: multi-agent planning state.
  • PROGRESS.md: multi-agent progress state.
  • phases/: executable Harness phase tickets.
  • docs/PRD.md: product requirements.
  • docs/ARCHITECTURE.md: engine architecture.
  • docs/CONVERSION_POLICY.md: detailed conversion decisions.
  • docs/HARNESS.md: planner/generator/evaluator Harness workflow.
  • docs/IMPLEMENTATION_PLAN.md: full phase-by-phase implementation roadmap.
  • docs/ADR.md: architecture decision records.
  • docs/TOOLCHAIN.md: toolchain and dependency notes.
  • docs/UI_GUIDE.md: future PyQt UI guidance.

Sample Corpus

The samples/ directory is used for quality evaluation and regression tests. The current sample set covers PDFs with Korean filenames, engineering and mechanics documents with formulas and figures, and one long 76-page document.

Before implementation, create a sample metadata mapping file that tags each PDF by text-layer quality, scanned pages, multi-column layout, formula density, table density, figure density, and Korean filename coverage.
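One possible shape for that mapping file is a JSON document keyed by filename, with one field per tag named above. The sketch below is an assumption about format only: the `SampleTags` class, the field names, the density labels, and the example filename are hypothetical and do not describe files in samples/.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SampleTags:
    filename: str
    text_layer_quality: str                            # e.g. "good", "poor", "none"
    scanned_pages: list = field(default_factory=list)  # 1-based page numbers
    multi_column: bool = False
    formula_density: str = "low"                       # "low", "medium", "high"
    table_density: str = "low"
    figure_density: str = "low"
    korean_filename: bool = False


def dump_mapping(samples):
    """Serialize tags keyed by filename; ensure_ascii=False keeps Hangul readable."""
    mapping = {s.filename: asdict(s) for s in samples}
    return json.dumps(mapping, ensure_ascii=False, indent=2, sort_keys=True)


def load_mapping(text):
    """Rebuild SampleTags objects from the JSON text produced by dump_mapping."""
    return {name: SampleTags(**tags) for name, tags in json.loads(text).items()}


# Hypothetical entry for illustration only.
example = SampleTags(
    filename="역학_샘플.pdf",
    text_layer_quality="poor",
    scanned_pages=[3, 4],
    formula_density="high",
    korean_filename=True,
)
```

When writing the JSON to disk, opening the file with `encoding="utf-8"` avoids depending on the Windows default code page for Korean filenames.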