Architecture

Scope

The current implementation target is the first goal: a Windows-native, local-first CLI/library conversion engine.

Default parser: Marker
Default formula parser: Nougat
PDF analysis and chunk planning: PyMuPDF
Output: Markdown chunk files plus assets
Default chunk target: 20 pages
Default runtime: CUDA
UI, a hosted API, and a default LLM correction path are outside the scope of the first goal.

Architecture Principles

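Keep the Marker-first architecture.
Nougat is a formula parser, not a whole-document parser.
PyMuPDF handles fast page-level analysis and chunk planning before the heavy conversion runs.
Output must be a deterministic Markdown bundle that an AI agent can navigate easily.
Possible loss in complex tables, figures, and formulas is handled through fallbacks and quality validation.
Generated Markdown should center on the content of the source document and must not be polluted with warning or error logs.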

Pipeline

Input normalization
- Normalize the PDF path with pathlib.
- Support Korean characters, spaces, and long Windows paths.
- Generate the document slug deterministically.
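The normalization step can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names `normalize_pdf_path` and `document_slug`, the NFC normalization, and the hash fallback for empty slugs are all assumptions. Long-path handling (for example, whether long paths are enabled on the host) is not covered here.

```python
import hashlib
import re
import unicodedata
from pathlib import Path


def normalize_pdf_path(raw: str) -> Path:
    """Return an absolute path; the file does not have to exist yet."""
    return Path(raw).expanduser().resolve()


def document_slug(pdf_path: Path) -> str:
    """Derive a slug from the file stem (hypothetical rule set).

    Letters and digits, including Hangul, are kept. Runs of whitespace
    and punctuation collapse to a single '-'.
    """
    # NFC keeps composed and decomposed Hangul from producing different slugs.
    stem = unicodedata.normalize("NFC", pdf_path.stem)
    slug = re.sub(r"[^\w]+", "-", stem).strip("-_").lower()
    if not slug:
        # Assumption: fall back to a short hash when nothing usable remains.
        slug = "doc-" + hashlib.sha1(stem.encode("utf-8")).hexdigest()[:8]
    return slug
```

Because the slug depends only on the file stem, the same input name always maps to the same output directory name.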
Page pre-analysis
- Use PyMuPDF to check page count, text length, image count, and text-layer quality.
- Estimate per page whether OCR is needed.
- For long documents, plan chunks with a 20-page target while taking preservation of logical block boundaries into account.
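A rough sketch of the pre-analysis and chunk planning, under stated assumptions: `PageStats`, `analyze_pages`, `plan_chunks`, the 50-character OCR threshold, and the 3-page `slack` window are illustrative and not taken from the project. Only the PyMuPDF calls (`fitz.open`, `page.get_text`, `page.get_images`) are existing library API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageStats:
    index: int        # 0-based page index
    text_chars: int   # length of the extracted text layer
    image_count: int  # number of embedded images
    needs_ocr: bool   # heuristic guess, not a guarantee


def analyze_pages(pdf_path: str, min_text_chars: int = 50) -> list[PageStats]:
    """Collect cheap per-page statistics with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so plan_chunks stays dependency-free

    stats = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            chars = len(page.get_text("text").strip())
            images = len(page.get_images(full=True))
            # Assumption: a nearly empty text layer means the page needs OCR.
            stats.append(PageStats(page.number, chars, images, chars < min_text_chars))
    return stats


def plan_chunks(
    page_count: int,
    target: int = 20,
    boundaries: frozenset[int] = frozenset(),
    slack: int = 3,
) -> list[tuple[int, int]]:
    """Split pages into (start, end) ranges with an exclusive end.

    `boundaries` holds page indexes where a logical block (such as a
    chapter) starts. A cut moves to the nearest boundary within `slack`
    pages of the target position, if there is one.
    """
    chunks = []
    start = 0
    while start < page_count:
        ideal = start + target
        if ideal >= page_count:
            chunks.append((start, page_count))
            break
        candidates = [
            b for b in boundaries
            if start < b < page_count and abs(b - ideal) <= slack
        ]
        cut = min(candidates, key=lambda b: (abs(b - ideal), b)) if candidates else ideal
        chunks.append((start, cut))
        start = cut
    return chunks
```

For example, `plan_chunks(45)` yields three ranges of 20, 20, and 5 pages, while passing `boundaries=frozenset({18, 41})` shifts the cuts to pages 18 and 41.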
Marker parse
- Marker handles layout, OCR, reading order, body text, headings, tables, figures, captions, and semantic blocks.
- Map the Marker Document Model, or an equivalent structured output, to the internal block model.
Formula handoff
- Pass only Marker equation blocks, or blocks in which a formula pattern was detected, to Nougat.
- Treat Nougat results as candidate LaTeX strings that must pass validation and the fallback policy.
- If Nougat fails, use the original formula string from Marker.
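The candidate-then-fallback rule might look like the sketch below. It deliberately does not call Nougat: the caller passes in whatever string the formula parser produced, or `None` on failure. The names `is_plausible_latex` and `resolve_formula`, and the choice of brace balance as the validation check, are assumptions made for illustration.

```python
from typing import Optional


def is_plausible_latex(latex: str) -> bool:
    """Cheap structural check: non-empty with balanced, unescaped braces."""
    if not latex.strip():
        return False
    depth = 0
    escaped = False
    for ch in latex:
        if escaped:
            escaped = False  # the character after a backslash is skipped
            continue
        if ch == "\\":
            escaped = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def resolve_formula(marker_text: str, nougat_latex: Optional[str]) -> tuple[str, str]:
    """Return (formula, source), where source is 'nougat' or 'marker'."""
    if nougat_latex is not None and is_plausible_latex(nougat_latex):
        return nougat_latex.strip(), "nougat"
    # Fallback: keep the formula string that Marker extracted.
    return marker_text, "marker"
```

Returning the source alongside the formula lets the caller log how often the fallback fired without writing anything into the generated Markdown.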
Semantic enrichment
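- Identify formula numbers, figure numbers, table numbers, captions, and in-text references.
- When identification confidence is high enough, connect them with internal Markdown links.
- Remove or separate repeated header, footer, and page-number patterns from the body flow.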
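One possible way to find repeated header, footer, and page-number lines is to mask digits and count how often the first and last line of each page recur. The sketch below assumes each page is already available as a list of text lines; the helper names and the 60% threshold are hypothetical, and the project may use a different heuristic.

```python
import math
import re
from collections import Counter


def _mask(line: str) -> str:
    """Replace digit runs so 'Page 1' and 'Page 2' compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def repeated_edge_lines(pages: list[list[str]], min_ratio: float = 0.6) -> set[str]:
    """Masked first/last lines that recur on at least `min_ratio` of pages."""
    counts = Counter()
    for lines in pages:
        if lines:
            # A set avoids double counting on single-line pages.
            counts.update({_mask(lines[0]), _mask(lines[-1])})
    threshold = max(2, math.ceil(min_ratio * len(pages)))
    return {key for key, n in counts.items() if n >= threshold}


def strip_edge_lines(pages: list[list[str]], repeated: set[str]) -> list[list[str]]:
    """Drop a page's first/last line when it matches a repeated pattern."""
    cleaned = []
    for lines in pages:
        body = list(lines)
        if body and _mask(body[0]) in repeated:
            body = body[1:]
        if body and _mask(body[-1]) in repeated:
            body = body[:-1]
        cleaned.append(body)
    return cleaned
```

Only the outermost line on each edge is examined here, so multi-line headers would need a wider window.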
Markdown rendering
- Render heading, paragraph, list, blockquote, table, figure, and equation blocks to Markdown.
- Prefer Markdown tables; for complex tables, fall back to a restricted HTML table or an image.
- Each chunk may carry minimal frontmatter such as the document title, page range, and chunk number.
Asset writing
- Save images under images/ with deterministic file names.
- When a figure number exists, prefer {document-slug}_fig-{figure-number}.png.
- On a collision or a missing number, use chunk/page/block identifiers.
- Reduce duplicate asset storage with hash-based deduplication.
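The naming and deduplication rules above could be combined roughly as follows. `AssetWriter` is a hypothetical class; the three-digit zero padding follows the example file names in the output contract, while the `c…-p…-b…` fallback pattern, integer figure numbers, and SHA-256 as the hash are assumptions.

```python
import hashlib
from pathlib import Path
from typing import Optional


class AssetWriter:
    """Writes image bytes under <out_dir>/images with deterministic names."""

    def __init__(self, out_dir: Path, slug: str) -> None:
        self.images_dir = out_dir / "images"
        self.slug = slug
        self._by_hash: dict[str, Path] = {}  # content hash -> path already written

    def _name(self, figure_number: Optional[int], chunk: int, page: int, block: int) -> str:
        if figure_number is not None:
            name = f"{self.slug}_fig-{figure_number:03d}.png"
            if not (self.images_dir / name).exists():
                return name
        # Collision or missing figure number: use positional identifiers.
        return f"{self.slug}_c{chunk:03d}-p{page:04d}-b{block:03d}.png"

    def write(self, data: bytes, *, chunk: int, page: int, block: int,
              figure_number: Optional[int] = None) -> Path:
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._by_hash:
            # Identical bytes were already stored: reuse that file.
            return self._by_hash[digest]
        self.images_dir.mkdir(parents=True, exist_ok=True)
        path = self.images_dir / self._name(figure_number, chunk, page, block)
        path.write_bytes(data)
        self._by_hash[digest] = path
        return path
```

The hash index lives in memory for one conversion run, so deduplication across runs would need the resume cache or a scan of the existing images/ directory.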
Validation and reporting
- Validate math delimiter balance, LaTeX environment pairs, table parseability, image link existence, caption matching, and chunk boundary integrity.
- The CLI shows a progress bar and per-chunk success or failure.
- Errors and warnings are written to stderr and a local log.
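Two of the checks listed above, math delimiter balance and LaTeX environment pairing, can be approximated with a few lines of Python. This is a simplified sketch with hypothetical function names: it counts `$` and `$$` delimiters only, and it does not skip code spans or fenced code, which a real validator would probably need to do.

```python
import re


def math_delimiters_balanced(markdown: str) -> bool:
    """True when $ and $$ delimiters occur in pairs (escaped dollars ignored)."""
    text = re.sub(r"\\\$", "", markdown)       # drop escaped dollar signs
    display = text.count("$$")                 # display-math delimiters
    inline = text.replace("$$", "").count("$") # remaining inline delimiters
    return display % 2 == 0 and inline % 2 == 0


def unmatched_environments(latex: str) -> list[str]:
    """Return environment names whose begin/end markers do not pair up."""
    stack: list[str] = []
    problems: list[str] = []
    for kind, name in re.findall(r"\\(begin|end)\{([^}]*)\}", latex):
        if kind == "begin":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(name)  # an end marker without a matching begin
    # Anything left on the stack was opened but never closed.
    return problems + stack
```

An empty list from `unmatched_environments` means every `\begin{...}` was closed in order by a matching `\end{...}`.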

Planned Layout

samples/             # regression and quality corpus
tests/               # focused pytest coverage
scripts/             # validation / harness helpers
phases/              # executable Harness phase tickets
src/                 # source package, planned
venv/                # repo-local Windows virtual environment, ignored by git
output/              # conversion output, ignored by git

Harness Boundary

docs/HARNESS.md defines the planner/generator/evaluator workflow for long-running work.
phases/ files are execution tickets, not architecture policy. Architecture policy remains in docs/ARCHITECTURE.md, docs/CONVERSION_POLICY.md, and docs/ADR.md.
Each implementation phase must keep parser, formula, pre-analysis, renderer, runtime, and UI responsibilities separated according to this document.
Evaluator checks should use hard thresholds from each step's Sprint Contract and the focused quality strategy below.

Output Contract

Output is bundled under a directory named after the document slug.

output/
└── document-slug/
    ├── document-slug_001.md
    ├── document-slug_002.md
    └── images/
        ├── document-slug_fig-001.png
        └── document-slug_fig-003.png

Detailed rules:

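Chunk Markdown files are named <slug>_<chunk-index:03d>.md.
Image assets go under images/.
The same input with the same options must produce the same output paths.
Separate per-document sidecar metadata or log artifacts are not part of the default output contract.
The local log and resume state/cache are runtime artifacts and are kept distinct from the document output contract.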
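As a small illustration of the naming rule, the helpers below build the chunk and image paths from the slug. The function names are hypothetical, and the 1-based chunk index is inferred from the `document-slug_001.md` example in the tree above.

```python
from pathlib import Path


def chunk_path(output_root: Path, slug: str, chunk_index: int) -> Path:
    """Path of one chunk file: <root>/<slug>/<slug>_<chunk-index:03d>.md."""
    return output_root / slug / f"{slug}_{chunk_index:03d}.md"


def images_dir(output_root: Path, slug: str) -> Path:
    """Directory that holds the image assets of one document."""
    return output_root / slug / "images"
```

Both helpers are pure functions of their arguments, which is what makes the "same input, same options, same output path" rule easy to test.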

Runtime Policy

The default runtime is cuda.
With an explicit --runtime cuda or --device cuda, fail fast if CUDA is not ready.
--runtime auto prints a CPU fallback warning when it has to fall back.
On a GTX 1070 Ti 8GB, start with a batch size of about 1 to 2.
On GPU OOM, retry with a smaller batch or page unit where possible.
The default formula parser is nougat.
The verified PyTorch baseline is torch==2.7.1+cu126.
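The fail-fast and fallback rules can be expressed as a small pure function. This sketch takes CUDA availability as a parameter so it runs without PyTorch installed; `resolve_device`, `RuntimeNotReady`, `shrink_batch`, and accepting `cpu` as an explicit value are assumptions, not the project's CLI.

```python
import warnings
from typing import Optional


class RuntimeNotReady(RuntimeError):
    """Raised when an explicitly requested runtime is unavailable."""


def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map a requested runtime ('cuda', 'auto', 'cpu') to an actual device."""
    if requested == "cuda":
        if not cuda_available:
            # Explicit CUDA request: fail fast instead of silently using CPU.
            raise RuntimeNotReady("CUDA was requested but is not available")
        return "cuda"
    if requested == "auto":
        if cuda_available:
            return "cuda"
        warnings.warn("CUDA is not available; falling back to CPU", RuntimeWarning)
        return "cpu"
    if requested == "cpu":
        return "cpu"
    raise ValueError(f"unknown runtime: {requested!r}")


def shrink_batch(batch_size: int) -> Optional[int]:
    """Halve the batch after an OOM; None means nothing smaller is left."""
    return batch_size // 2 if batch_size > 1 else None
```

In a real run, the `cuda_available` argument would presumably come from `torch.cuda.is_available()`, and `shrink_batch` would be called from whatever code catches the out-of-memory error.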

Environment

Use a single repo-local Python 3.11 venv.

conda create -p .\venv python=3.11 -y
.\venv\python.exe -m pip install -r requirements.txt

Key pins:

torch==2.7.1+cu126
torchvision==0.22.1+cu126
marker-pdf==1.10.2
nougat-ocr==0.1.17
transformers==4.57.6
albumentations==1.3.1
pypdfium2==4.30.0
opencv-python-headless==4.11.0.86
Pillow==10.4.0
fsspec==2026.2.0

Model Cache And Offline Mode

The model cache location must be managed explicitly.
After the initial download, offline runs must prefer the weights that are already downloaded.
The README must add separate procedures for model download and offline execution.

Quality Strategy

Full-Markdown snapshot comparison is not used as the primary verification method.
Focused assertions take priority.
Verification targets:
- heading hierarchy
- math delimiter balance
- LaTeX \begin / \end pairs
- image link existence
- figure/table/formula caption matching
- table parseability
- chunk boundary integrity
- Windows path and Korean filename handling
- no-exception conversion
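Two of these targets, image link existence and heading hierarchy, lend themselves to focused assertions like the sketch below. The helper names are hypothetical, and the checks are intentionally narrow: remote image URLs are skipped, and headings inside code blocks are not excluded.

```python
import re
from pathlib import Path

# Matches Markdown image syntax and captures the link target.
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def missing_image_links(markdown: str, chunk_dir: Path) -> list[str]:
    """Relative image targets that do not exist next to the chunk file."""
    return [
        target
        for target in IMAGE_LINK.findall(markdown)
        if not target.startswith(("http://", "https://"))
        and not (chunk_dir / target).is_file()
    ]


def heading_level_jumps(markdown: str) -> list[str]:
    """Headings that skip a level, such as '#' followed directly by '###'."""
    jumps: list[str] = []
    previous = 0
    for line in markdown.splitlines():
        match = re.match(r"(#{1,6})\s+\S", line)
        if match:
            level = len(match.group(1))
            if previous and level > previous + 1:
                jumps.append(line.strip())
            previous = level
    return jumps
```

A pytest case can then assert that both functions return an empty list for each generated chunk, which is far less brittle than comparing whole Markdown snapshots.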

Out of Scope for the First Goal

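PyQt UI implementation
Making a hosted conversion API the default path
Making an LLM correction mode the default path
Separate sidecar metadata or log artifacts shipped alongside the generated documents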
