add files

2026-04-30 17:05:19 +09:00
parent f3e01b5a8c
commit 7e985ae94a
135 changed files with 41205 additions and 0 deletions
@@ -0,0 +1,142 @@
+# Architecture Decision Records
+
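+## Philosophy
+Core values of the project:
+- Accurate formula conversion
+- Local operation
+- Memory-efficient operation
+- A deterministic Markdown bundle that is easy for an AI Agent to navigate
+- Preservation of the source document's structure and reference relationships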
+
+---
+
+## ADR-001: Marker-first document parsing
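+**Decision**: Use Marker as the default PDF parser.
+
+**Rationale**:
+- Marker is well suited to tracking document structure, including layout, OCR, reading order, tables, figures, captions, and headings.
+- The project goal is not plain text extraction but reconstructing the source document's logical structure as Markdown.
+
+**Trade-offs**:
+- The Marker dependency and its model weights must be managed.
+- If distribution becomes a possibility, the GPL and model licenses must be reviewed.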
+
+---
+
+## ADR-002: Nougat as formula-only parser
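+**Decision**: Use Nougat only as a parser for formulas and mathematical expressions, not as a full PDF parser.
+
+**Rationale**:
+- Nougat is strong at converting formulas in academic documents to LaTeX.
+- Marker must own the overall document structure so that the reading order, table, figure, and caption paths stay consistent.
+
+**Trade-offs**:
+- A handoff/fallback layer is needed to connect Marker blocks with Nougat results.
+- When Nougat fails, the Marker source string must be used as the fallback.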
+
+---
+
+## ADR-003: PyMuPDF page pre-analysis and chunk planning
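+**Decision**: Use PyMuPDF for page counts, text-layer quality, OCR need detection, chunk planning, and low-level PDF operations.
+
+**Rationale**:
+- Fast page-level analysis is needed before running the heavy parsers.
+- Mixed PDFs require a per-page decision on whether OCR should intervene.
+- Long PDFs are split into chunks with a 20-page target, while taking logical block boundaries into account.
+
+**Trade-offs**:
+- An adapter is needed to reconcile PyMuPDF analysis results with Marker layout results.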
+
+---
+
+## ADR-004: Single Python 3.11 environment
+**Decision**: Use a single repo-local Python 3.11 `venv`.
+
+**Rationale**:
+- It simplifies the development and execution paths.
+- Marker and Nougat work together in one environment when explicit dependency pins are in place.
+
+**Verified key pins**:
+- `torch==2.7.1+cu126`
+- `torchvision==0.22.1+cu126`
+- `marker-pdf==1.10.2`
+- `nougat-ocr==0.1.17`
+- `transformers==4.57.6`
+- `albumentations==1.3.1`
+- `pypdfium2==4.30.0`
+- `opencv-python-headless==4.11.0.86`
+- `Pillow==10.4.0`
+- `fsspec==2026.2.0`
+
+**Trade-offs**:
+- Because of Nougat's loose dependency bounds, the requirements pins must be maintained strictly.
+- The latest PyTorch cannot be adopted unconditionally. `torch==2.7.1+cu126` is used to keep GTX 1070 Ti `sm_61` support.
+
+---
+
+## ADR-005: Markdown bundle output without document sidecars by default
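+**Decision**: Limit the default output to chunk Markdown files and an asset directory.
+
+**Rationale**:
+- Prioritize output that is easy for an AI Agent to read and navigate.
+- Separate sidecar output does not widen the scope until the user explicitly requests it.
+
+**Trade-offs**:
+- Conversion diagnostics must be kept separate from document output.
+- Runtime log/state/cache files are allowed but must be distinguished from the document output contract.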
+
+---
+
+## ADR-006: Focused quality assertions over full snapshots
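+**Decision**: Prefer focused assertions over full Markdown snapshot comparison.
+
+**Rationale**:
+- PDF conversion results are sensitive to line breaks, spacing, and parser version.
+- The core of quality is headings, formulas, tables, images, captions, links, chunk integrity, and the absence of exceptions.
+
+**Trade-offs**:
+- Test design becomes more fine-grained.
+- A sample metadata mapping is required.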
+
+---
+
+## ADR-007: Runtime fallback policy
+**Decision**:
+- An explicit `--runtime cuda` or `--device cuda` fails fast when CUDA fails.
+- `--runtime auto` warns and then allows CPU fallback.
+- On GPU OOM, retry with a smaller batch/page unit where possible.
+
+**Rationale**:
+- When the user has explicitly requested CUDA, a silent switch to CPU causes unpredictable latency.
+- Auto mode must provide flexible execution.
+
+**Trade-offs**:
+- Runtime state and error reporting are required.
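
The fallback rule in ADR-007 can be sketched as a small pure function. This is only an illustration under stated assumptions: `resolve_device` and `CudaUnavailableError` are hypothetical names that do not come from the project, and the caller is assumed to have already probed CUDA availability.

```python
import warnings


class CudaUnavailableError(RuntimeError):
    """Hypothetical error raised when CUDA was requested explicitly but is unusable."""


def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map a requested runtime ('cuda', 'auto', 'cpu') to a concrete device name."""
    if requested == "cpu":
        return "cpu"
    if requested == "cuda":
        # Explicit CUDA request: fail fast instead of silently switching to CPU.
        if not cuda_available:
            raise CudaUnavailableError("CUDA was requested explicitly but is not available")
        return "cuda"
    if requested == "auto":
        if cuda_available:
            return "cuda"
        # Auto mode: warn, then allow the CPU fallback.
        warnings.warn("CUDA is not available; falling back to CPU", RuntimeWarning)
        return "cpu"
    raise ValueError(f"unknown runtime: {requested!r}")
```

Keeping the decision free of any GPU library call makes it testable on machines without CUDA; the OOM retry rule is a separate concern and is not covered here.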
+
+---
+
+## ADR-008: Future PyQt UI as thin client
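+**Decision**: The PyQt UI does not implement the conversion engine itself; it is a thin client that calls the CLI/library API.
+
+**Rationale**:
+- The first goal is stabilizing the CLI/library engine.
+- Separating the responsibilities of the UI and the core engine makes testing and maintenance easier.
+
+**Trade-offs**:
+- The core API contract must be stabilized before UI design.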
+
+---
+
+## ADR-009: File-based planner/generator/evaluator Harness
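+**Decision**: Manage long-running work with a Harness workflow that uses `planner -> generator -> evaluator` role separation and file-based handoff.
+
+**Rationale**:
+- The PDF conversion engine is long-running work in which parsers, OCR, formulas, tables, figures, runtime, and tests are intertwined, so consistency is hard to maintain within a single conversation.
+- Small, self-contained phase steps make it easy for a new agent to pick up the work with fresh context.
+- Separating the implementation agent from the evaluation agent reduces self-evaluation bias and makes it possible to enforce verification based on hard thresholds.
+- Handoff through `PLAN.md`, `PROGRESS.md`, and the `phases/` files allows the current state to be reconstructed outside the conversation.
+
+**Trade-offs**:
+- Each step incurs the cost of writing a Sprint Contract and verification criteria.
+- Adding too many agents, hooks, and commands can turn the Harness itself into a maintenance burden, so follow the simplification rule in `docs/HARNESS.md`.
+- Hooks are only an auxiliary mechanism and do not replace evaluator review or acceptance criteria.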
@@ -0,0 +1,152 @@
+# Architecture
+
+## Scope
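+The current implementation target is the first goal: a Windows-native, local-first CLI/library conversion engine.
+
+- Default parser: `Marker`
+- Default formula parser: `Nougat`
+- PDF analysis and chunk planning: `PyMuPDF`
+- Output: Markdown chunk files plus assets
+- Default chunk target: 20 pages
+- Default runtime: CUDA
+- The UI, a hosted API, and a default LLM correction path are outside the scope of the first goal.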
+
+## Architecture Principles
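+- Maintain the Marker-first architecture.
+- Nougat is a formula parser, not a full-document parser.
+- PyMuPDF handles fast page-level analysis and chunk planning before heavy conversion.
+- The output must be a deterministic Markdown bundle that is easy for an AI Agent to navigate.
+- Possible loss in complex tables, figures, and formulas is handled through fallbacks and quality validation.
+- Generated Markdown must center on the source document's content and must not be polluted with warning/error logs.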
+
+## Pipeline
+1. Input normalization
+   - Normalize the PDF path with `pathlib`.
+   - Support Korean characters, spaces, and long Windows paths.
+   - Generate the document slug deterministically.
+
+2. Page pre-analysis
+   - Use PyMuPDF to check page count, text length, image count, and text-layer quality.
+   - Estimate per page whether OCR is needed.
+   - For long documents, plan chunks with a 20-page target while taking logical block boundary preservation into account.
+
+3. Marker parse
+   - Marker handles layout, OCR, reading order, body text, headings, tables, figures, captions, and semantic blocks.
+   - Map the Marker Document Model, or equivalent structured output, to the internal block model.
+
+4. Formula handoff
+   - Pass only Marker equation blocks, or blocks in which a formula pattern is detected, to Nougat.
+   - Treat Nougat results as candidate LaTeX strings that must pass the validation and fallback policy.
+   - When Nougat fails, use the Marker source formula string.
+
+5. Semantic enrichment
+   - Identify formula numbers, figure numbers, table numbers, captions, and body references.
+   - When identification confidence is sufficient, connect them with internal Markdown links.
+   - Remove repeated header/footer/page-number patterns from the body flow, or separate them from it.
+
+6. Markdown rendering
+   - Render heading, paragraph, list, blockquote, table, figure, and equation blocks as Markdown.
+   - Prefer Markdown tables, but use a limited HTML table or an image fallback for complex tables.
+   - Each chunk may carry minimal frontmatter such as document title, page range, and chunk number.
+
+7. Asset writing
+   - Save images under `images/` with deterministic filenames.
+   - When a figure number exists, prefer `{document-slug}_fig-{figure-number}.png`.
+   - On collision or when no number exists, use chunk/page/block identifiers.
+   - Reduce duplicate asset storage with hash-based deduplication.
+
+8. Validation and reporting
+   - Validate math delimiter balance, LaTeX environment pairs, table parseability, image link existence, caption matching, and chunk boundary integrity.
+   - The CLI shows a progress bar and per-chunk success/failure.
+   - Errors and warnings are written to stderr and a local log.
+
+## Planned Layout
+```text
+samples/             # regression and quality corpus
+tests/               # focused pytest coverage
+scripts/             # validation / harness helpers
+phases/              # executable Harness phase tickets
+src/                 # source package, planned
+venv/                # repo-local Windows virtual environment, ignored by git
+output/              # conversion output, ignored by git
+```
+
+## Harness Boundary
+- `docs/HARNESS.md` defines the planner/generator/evaluator workflow for long-running work.
+- `phases/` files are execution tickets, not architecture policy. Architecture policy remains in `docs/ARCHITECTURE.md`, `docs/CONVERSION_POLICY.md`, and `docs/ADR.md`.
+- Each implementation phase must keep parser, formula, pre-analysis, renderer, runtime, and UI responsibilities separated according to this document.
+- Evaluator checks should use hard thresholds from each step's Sprint Contract and the focused quality strategy below.
+
+## Output Contract
+Output is grouped under a directory named after the document slug.
+
+```text
+output/
+└── document-slug/
+    ├── document-slug_001.md
+    ├── document-slug_002.md
+    └── images/
+        ├── document-slug_fig-001.png
+        └── document-slug_fig-003.png
+```
+
+Detailed rules:
+- Chunk Markdown filenames follow `<slug>_<chunk-index:03d>.md`.
+- Image assets go under `images/`.
+- The same input with the same options must produce the same output paths.
+- Separate document sidecar metadata/log output is not part of the default output contract.
+- The local log and resume state/cache are runtime artifacts and are distinct from the document output contract.
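
As a rough illustration of the naming rules above, the output paths could be built with `pathlib` as follows. The function names are hypothetical, and two details are inferred from the example tree rather than stated as rules: chunk indices start at 1, and figure numbers are zero-padded to three digits.

```python
from pathlib import Path


def chunk_markdown_path(output_root: Path, slug: str, chunk_index: int) -> Path:
    """Build `<slug>/<slug>_<chunk-index:03d>.md` under the output root."""
    if chunk_index < 1:
        # Assumption: indices are 1-based, as in the example tree (`_001`, `_002`).
        raise ValueError("chunk_index is 1-based")
    return output_root / slug / f"{slug}_{chunk_index:03d}.md"


def figure_asset_path(output_root: Path, slug: str, figure_number: int) -> Path:
    """Build `<slug>/images/<slug>_fig-NNN.png`; the 3-digit padding is an assumption."""
    return output_root / slug / "images" / f"{slug}_fig-{figure_number:03d}.png"
```

Because both functions depend only on their arguments, the same input and options always map to the same path, which is the determinism property the contract asks for.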
+
+## Runtime Policy
+- The default runtime is `cuda`.
+- With an explicit `--runtime cuda` or `--device cuda`, fail fast if CUDA is not ready.
+- `--runtime auto` prints a CPU fallback warning when needed.
+- On a GTX 1070 Ti with 8 GB, start with a batch size of about 1 to 2.
+- On GPU OOM, retry with a smaller batch/page unit where possible.
+- The default formula parser is `nougat`.
+- The verified PyTorch baseline is `torch==2.7.1+cu126`.
+
+## Environment
+Use a single repo-local Python 3.11 `venv`.
+
+```powershell
+conda create -p .\venv python=3.11 -y
+.\venv\python.exe -m pip install -r requirements.txt
+```
+
+Key pins:
+- `torch==2.7.1+cu126`
+- `torchvision==0.22.1+cu126`
+- `marker-pdf==1.10.2`
+- `nougat-ocr==0.1.17`
+- `transformers==4.57.6`
+- `albumentations==1.3.1`
+- `pypdfium2==4.30.0`
+- `opencv-python-headless==4.11.0.86`
+- `Pillow==10.4.0`
+- `fsspec==2026.2.0`
+
+## Model Cache And Offline Mode
+- Model cache locations must be managed explicitly.
+- After the initial download, offline runs must prefer the weights that are already downloaded.
+- The README must add separate instructions for model download and offline execution.
+
+## Quality Strategy
+- Full Markdown snapshot comparison is not used as the primary verification method.
+- Prefer focused assertions.
+- Verification targets:
+  - heading hierarchy
+  - math delimiter balance
+  - LaTeX `\begin` / `\end` pairs
+  - image link existence
+  - figure/table/formula caption matching
+  - table parseability
+  - chunk boundary integrity
+  - Windows path and Korean filename handling
+  - no-exception conversion
+
+## Out of Scope for the First Goal
+- PyQt UI implementation
+- Making a hosted conversion API the default path
+- Making an LLM correction mode the default path
+- Separate sidecar metadata/log output distributed alongside generated documents
@@ -0,0 +1,91 @@
+# Conversion Policy
+
+This document records implementation decisions for the PDF-to-Markdown conversion engine. It is planning guidance, not implementation code.
+
+## Input Classification
+- Support mixed PDFs by default: text-layer pages, scanned pages, and mixed pages can appear in the same document.
+- Use PyMuPDF or equivalent lightweight page analysis before heavy parsing to estimate text-layer quality per page.
+- Decide OCR intervention per page instead of treating the entire PDF as text-only or scan-only.
+- Prefer Marker's OCR/layout functionality for scanned or weak text-layer pages.
+
+## Parser Responsibilities
+- Marker owns overall layout tracking, reading order, body extraction, table structure, image extraction, headings, captions, and semantic block roles.
+- Nougat owns only mathematical expressions and formula block parsing.
+- Do not use Nougat as the main document parser.
+- Send a block to Nougat when Marker identifies it as an equation area or when text-pattern detection marks it as mathematical content.
+- If Nougat conversion fails, preserve information by falling back to Marker's extracted source text.
+
+## Formula Handling
+- Treat formulas embedded inside a sentence without independent line spacing as inline formulas.
+- Treat formulas occupying independent line space or vertical whitespace as block formulas.
+- Preserve formula numbers detected near the right or bottom side of a formula region.
+- Attach anchors to extracted formula numbers and rewrite body references such as `Eq. (3)` or `식 (5)` as internal Markdown links when confidence is sufficient.
+- Validate Markdown math delimiters by counting opening and closing `$ ... $` and `$$ ... $$` pairs across each chunk.
+- Validate common LaTeX environments by checking matching `\begin{...}` and `\end{...}` names and counts.
+- If delimiter or environment validation fails, repair the closest logical location in a way that keeps Markdown rendering intact.
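
The two validation rules above might look roughly like this in Python. It is a minimal sketch, not the project's validator: the function names are hypothetical, escaped dollars (`\$`) are assumed to be literal currency, and `$` inside Markdown code spans is not handled.

```python
import re

# Matches \begin{name} and \end{name}, capturing the kind and the environment name.
ENV_TOKEN = re.compile(r"\\(begin|end)\{([^}]*)\}")


def unbalanced_math_delimiters(markdown: str) -> bool:
    """Return True if `$` or `$$` delimiters are left open in the chunk."""
    # Drop escaped dollars so currency written as \$ is not counted.
    text = re.sub(r"\\\$", "", markdown)
    block_count = text.count("$$")
    inline_count = text.replace("$$", "").count("$")
    return block_count % 2 == 1 or inline_count % 2 == 1


def mismatched_environments(markdown: str) -> list[str]:
    """Return a description of each problem found while pairing begin/end names."""
    problems = []
    stack = []
    for kind, name in ENV_TOKEN.findall(markdown):
        if kind == "begin":
            stack.append(name)
        elif not stack:
            problems.append(f"\\end{{{name}}} without \\begin")
        elif stack[-1] != name:
            problems.append(f"\\end{{{name}}} closes \\begin{{{stack[-1]}}}")
            stack.pop()
        else:
            stack.pop()
    problems.extend(f"\\begin{{{name}}} never closed" for name in stack)
    return problems
```

A stack is used instead of plain counts so that nesting order is checked as well as totals. Repair is deliberately left out, since where to insert a missing delimiter depends on the surrounding blocks.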
+
+## Tables
+- Prefer Markdown tables when structure can be represented without major loss.
+- Use limited HTML `<table>` output for tables with merged cells, multi-row headers, or structures that exceed GitHub Flavored Markdown table expressiveness.
+- Preserve table footnotes as regular text immediately below the table.
+- Preserve top or bottom captions as text and create internal links from body references such as `Table 1`.
+- If structured table extraction loses too much information, also save a screenshot of the table region as a fallback asset and link it near the structured output.
+
+## Figures And Images
+- Use deterministic image asset naming such as `{document-slug}_fig-{figure-number}.png` when a figure number is available.
+- Include chunk/page/block identifiers in names or anchors when needed to avoid collisions.
+- Place extracted image assets in the document `images/` directory.
+- Add figure captions below Markdown image links.
+- Rewrite body references such as `Fig. 2` to internal Markdown links when the figure target can be identified.
+- Deduplicate extracted images by hash and let repeated references share one asset and anchor.
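
Hash-based deduplication could be sketched as below. The policy only says "hash", so SHA-256 is an assumption, and `dedupe_assets` and the sample filenames are hypothetical.

```python
import hashlib


def dedupe_assets(images: dict[str, bytes]) -> dict[str, str]:
    """Map every candidate asset name to the first name that has identical bytes."""
    first_name_by_digest: dict[str, str] = {}
    canonical: dict[str, str] = {}
    for name, data in images.items():
        # Assumption: SHA-256 of the raw image bytes identifies duplicates.
        digest = hashlib.sha256(data).hexdigest()
        canonical[name] = first_name_by_digest.setdefault(digest, name)
    return canonical
```

The result depends on iteration order, so callers would need to pass images in a stable order (for example reading order) to keep asset names deterministic.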
+
+## Reading Order And Paragraph Flow
+- Stitch lines into paragraphs when a line does not end with terminal punctuation and the next line begins like a continuation, or when bounding-box line spacing matches intra-paragraph spacing.
+- Join hyphenated line breaks when a line-ending hyphen is followed by a lowercase continuation without whitespace.
+- Preserve hyphens for known compounds, identifiers, or proper nouns when confidence is low.
+- Use Marker bounding boxes to validate that the linearized text flow matches expected reading order in sample PDFs.
+- Detect repeated header/footer/page-number patterns in stable top/bottom page regions and exclude them from body Markdown, or separate them from the main body flow.
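
The hyphen-joining rule can be illustrated with a short regex-based sketch. It covers only the text rule (line-ending hyphen followed by a lowercase continuation), not bounding-box spacing; `join_hyphenated_breaks` and the `keep_hyphen` allow-list are hypothetical.

```python
import re

# A word fragment ending in "-" at end of line, then a line starting with a lowercase letter.
HYPHEN_BREAK = re.compile(r"(\w+)-\n([a-z]\w*)")


def join_hyphenated_breaks(text: str, keep_hyphen: frozenset[str] = frozenset()) -> str:
    """Join words split across lines, keeping the hyphen for known compounds."""

    def repl(match: re.Match) -> str:
        head, tail = match.group(1), match.group(2)
        compound = f"{head}-{tail}"
        if compound.lower() in keep_hyphen:
            # Known compound: drop only the line break, keep the hyphen.
            return compound
        return head + tail

    return HYPHEN_BREAK.sub(repl, text)
```

For example, `join_hyphenated_breaks("deforma-\ntion")` yields `"deformation"`, while a break before an uppercase continuation such as `"Navier-\nStokes"` is left untouched for a later, more careful pass.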
+
+## Chunking
+- Use 20 pages as the default chunk target.
+- Prefer logical block boundaries over strict page boundaries when a paragraph, formula, table, or figure would be cut in the middle.
+- If a block crosses a chunk boundary, keep the block intact by moving it to the previous or next chunk according to the least damaging boundary.
+- Add minimal context at the top of each chunk, including document title, page range, and chunk number.
+- Avoid sidecar metadata by default; put only core metadata in concise Markdown frontmatter.
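
One way to read the chunking rules is the page-level sketch below. It simplifies the policy considerably: blocks are reduced to a set of "unsafe" page boundaries, and the `slack` window of 3 pages is an invented parameter, not a project setting.

```python
def plan_chunks(
    page_count: int,
    unsafe_cuts: set[int],
    target: int = 20,
    slack: int = 3,
) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) page ranges.

    `unsafe_cuts` holds page numbers p where a block continues from page p onto
    page p + 1, so ending a chunk at p would split that block.
    """
    chunks = []
    start = 1
    while start <= page_count:
        ideal = min(start + target - 1, page_count)
        end = ideal
        if ideal < page_count and ideal in unsafe_cuts:
            # Look for the nearest safe boundary, trying the shorter chunk first.
            for delta in range(1, slack + 1):
                candidates = [ideal - delta, ideal + delta]
                safe = [
                    p for p in candidates
                    if start <= p <= page_count and (p == page_count or p not in unsafe_cuts)
                ]
                if safe:
                    end = safe[0]
                    break
        chunks.append((start, end))
        start = end + 1
    return chunks
```

If no safe boundary exists within the window, this sketch falls back to the strict page boundary; a real planner would more likely move the crossing block itself, as the policy describes.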
+
+## Determinism And Paths
+- Ensure the same PDF and same options produce stable output structure and filenames.
+- Use deterministic slug, anchor, asset, and chunk naming rules.
+- Prefer `pathlib` for filesystem paths.
+- Test Korean filenames, paths with spaces, and long Windows paths.
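
A deterministic slug that tolerates Korean filenames and spaces might be derived as in this sketch. The normalization choices (NFC, lowercasing, `-` as separator, `document` as the fallback) are assumptions for illustration, and `document_slug` is a hypothetical name.

```python
import re
import unicodedata
from pathlib import Path


def document_slug(pdf_path: Path) -> str:
    """Derive a stable slug from the PDF filename stem."""
    # NFC keeps Hangul syllables composed, so the same name always yields the same slug.
    stem = unicodedata.normalize("NFC", pdf_path.stem)
    # Keep Unicode letters, digits, and underscores; collapse everything else to "-".
    slug = re.sub(r"[^\w]+", "-", stem).strip("-_").lower()
    return slug or "document"
```

Two different filenames can still collapse to the same slug (for example `a b.pdf` and `a-b.pdf`), so a real implementation would need a collision rule on top of this.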
+
+## Runtime And Recovery
+- Use conservative batch sizes, usually 1 or 2, for GTX 1070 Ti 8 GB VRAM.
+- If a GPU out-of-memory error occurs, retry with a smaller batch or smaller page unit where possible.
+- If the user explicitly requests `--device cuda` or `--runtime cuda`, fail fast instead of silently switching to CPU.
+- If the user requests `--runtime auto`, warn and fall back to CPU when CUDA initialization fails.
+- Keep model cache locations explicit, preferably under a local project or user-configured model cache directory, so offline operation can reuse already-downloaded weights.
+
+## Logging And Resume
+- Show chunk-level progress and success/failure status in the CLI.
+- Print warnings and errors to stderr and a local log file.
+- Do not inject warnings or error logs into generated Markdown because they reduce document readability and integrity.
+- Support resuming failed conversions by skipping already successful chunks when a local state/cache file is available.
+- Sidecar outputs are still out of scope unless explicitly requested; a resume state file is a runtime cache, not part of the document output contract.
+
+## Quality Tests
+- Prefer focused assertions over full Markdown snapshots.
+- Validate heading structure, formula delimiter balance, LaTeX environment pairs, image links, caption matching, table parseability, and no-exception conversion.
+- Use regex and Markdown/HTML parsers where practical instead of ad hoc string checks.
+- Maintain a sample metadata mapping file for `samples/` that tags each PDF by traits such as text-layer quality, scanned pages, multi-column layout, formula density, table density, figure density, and Korean filename coverage.
+- Use engineering/mechanics PDFs with multi-column layout, formulas, graphs, and tables as the MVP acceptance corpus.
+
+## Licensing
+- Current use is personal, which lowers immediate distribution risk.
+- If redistribution or commercial use becomes relevant, revisit Marker GPL and model-weight license implications before packaging.
+- Process or service isolation can be considered as a licensing risk-mitigation strategy, but it is not a legal conclusion and should be reviewed before distribution.
+
+## UI Boundary
+- Keep the core conversion engine as a Python API/CLI package.
+- Future PyQt UI should remain a thin client over the same API and must not duplicate conversion logic.
+
@@ -0,0 +1,114 @@
+# Harness Engineering Guide
+
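+This document defines the Harness operating rules for managing long-running agent work in the PDFtoMD project. It is based on the separation of planner, generator, and evaluator, file-based handoff, sprint contracts, and an independent evaluation loop, as emphasized in Anthropic's article "Harness design for long-running application development".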
+
+## Purpose
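+- Split the long conversion-engine development into small, self-contained steps.
+- Let a new agent pick up the work by reading only `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, and the `phases/` files, without the previous conversation context.
+- Separate the implementation agent from the evaluation agent to reduce self-evaluation bias.
+- Fix each step's success conditions in a file before any code is written.
+- Keep the Harness itself simple, and put complexity only into the necessary verification criteria and step boundaries.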
+
+## Roles
+
+### Planner
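+- Reads the product goals and architecture documents, then writes phases and steps.
+- Avoids over-specifying implementation details and states the deliverables, scope of responsibility, acceptance criteria, and prohibited scope clearly.
+- Deliverables:
+  - `PLAN.md` update
+  - `phases/index.json`
+  - `phases/{phase}/index.json`
+  - `phases/{phase}/stepN.md`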
+
+### Generator
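+- Works on only one `stepN.md` at a time.
+- Reads the step's "Sprint Contract" before starting and, if anything is ambiguous, records it as a blocker in `PROGRESS.md` before implementing.
+- Writes tests first in implementation steps that require TDD.
+- Deliverables:
+  - Code, test, and documentation changes within the step's scope
+  - Step status update in `phases/{phase}/index.json`
+  - Handoff update in `PROGRESS.md`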
+
+### Evaluator
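+- Reviews the generator's results independently.
+- Does not pass the step if any one of the agreed criteria falls short of its hard threshold.
+- Records not just pass/fail but specific failure causes that the generator can act on.
+- Deliverables:
+  - Review findings or a pass record
+  - `error_message` or `blocked_reason` in `phases/{phase}/index.json` where needed
+  - Verification results in `PROGRESS.md`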
+
+## File Protocol
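+- `AGENTS.md`: repository rules that do not change.
+- `PLAN.md`: the single source of truth for the overall work plan.
+- `PROGRESS.md`: the single source of truth for current progress and handoff.
+- `docs/*.md`: knowledge about the product, architecture, decisions, toolchain, and Harness operation.
+- `phases/index.json`: the registry of executable phases.
+- `phases/{phase}/index.json`: the single source of truth for that phase's step status.
+- `phases/{phase}/stepN.md`: a ticket that a new agent can execute independently.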
+
+## Step Contract Template
+Each `stepN.md` must include the following information.
+
+````markdown
+# Step N: step-name
+
+## Read First
+- /AGENTS.md
+- /PLAN.md
+- /PROGRESS.md
+- /docs/HARNESS.md
+- /docs/ARCHITECTURE.md
+- /docs/ADR.md
+- /docs/CONVERSION_POLICY.md
+
+## Task
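+Describe concretely the deliverables this step must produce and the files it may modify.
+
+## Sprint Contract
+- Done means: a completion condition that the user can observe or that a test can confirm.
+- Hard thresholds: criteria where a single failure counts as step failure.
+- Files owned: files or directories this step may modify.
+- Dependencies: outputs of previous steps or required documents.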
+
+## Acceptance Criteria
+```powershell
+python scripts\validate_workspace.py
+```
+
+## Verification
+1. Run the tests and verification commands.
+2. Record the results and the next handoff in `PROGRESS.md`.
+3. Update the step in `phases/{phase}/index.json` to one of `completed`, `blocked`, or `error`.
+
+## Do Not
+- Do not implement features outside the step's scope.
+- Do not introduce a new parser or external API.
+- Do not widen the generated Markdown output contract arbitrarily.
+````
+
+## Evaluation Criteria
+The PDFtoMD evaluator applies the following hard thresholds first.
+
+| Area | Hard Threshold |
+| --- | --- |
+| Architecture | The responsibility boundaries of Marker, Nougat, and PyMuPDF are not broken. |
+| TDD | An implementation step adds a failing test first, or the step states why no test is needed. |
+| Determinism | The same input and options produce the same slug, asset paths, anchors, and Markdown structure. |
+| Markdown quality | Headings, math delimiters, tables, image links, captions, and chunk frontmatter can be verified. |
+| Runtime | Windows path handling, Korean filename handling, and the CUDA/CPU runtime policy are not compromised. |
+| Scope | PyQt UI, hosted API, LLM correction, and sidecar output are not pulled into the first implementation. |
+| Handoff | `PROGRESS.md` and the phase index give the next agent sufficient state. |
+
+## When To Use The Full Loop
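+- Use the full planner/generator/evaluator loop:
+  - When starting a new phase
+  - For work with a high cost of failure, such as the parser adapter, chunk planner, renderer, or quality validator
+  - For work that changes several files and documents at once, such as the sample corpus or runtime policy
+- Simple documentation typos, small command descriptions, and clear single-test fixes may be handled as ordinary Codex work. Even then, update `PROGRESS.md`.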
+
+## Simplification Rule
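+Keep a Harness component only when it actually improves quality.
+- If the same verification is repeated in two places, reduce it to one.
+- Treat hooks as an auxiliary mechanism; they do not replace a step's acceptance criteria or the evaluator's judgment.
+- Do not give an agent too much context; specify only the documents and files the step needs.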
@@ -0,0 +1,121 @@
+# Implementation Phase Plan
+
+This document is the execution plan that divides the whole PDFtoMD implementation into phases. The detailed execution tickets for each phase live in `phases/{phase}/stepN.md`.
+
+## Planning Principles
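+- The first goal is a Windows-native, local-first CLI/library conversion engine.
+- The PyQt UI is implemented as a thin client after the core API and CLI have stabilized.
+- Each phase builds on the outputs of the previous phase, and each step within a phase must be independently executable by a single agent.
+- Implementation phases use TDD by default.
+- Parser responsibility boundaries are maintained: Marker handles document structure, Nougat handles formulas, and PyMuPDF handles pre-analysis and low-level PDF operations.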
+
+## Phase Overview
+
+| Phase | Goal | Primary Output | Depends On |
+| --- | --- | --- | --- |
+| 0. Harness foundation | An executable Harness base and a minimal quality foundation | sample metadata, core models, preanalysis contract, quality gates | current docs |
+| 1. Core runtime contracts | Conversion options, input normalization, output bundle contract, path/cache policy | stable API contracts and tests | Phase 0 |
+| 2. Marker adapter | Implement the boundary between Marker execution and block normalization | Marker adapter, OCR handoff, block mapping tests | Phase 1 |
+| 3. Formula pipeline | Nougat formula-only handoff and LaTeX validation/fallback | formula detector, Nougat adapter, repair/fallback tests | Phase 2 |
+| 4. Semantic enrichment | Enrich paragraphs, reading order, header/footer handling, and reference relationships | enrichment pipeline and reference index | Phase 2, 3 |
+| 5. Markdown rendering and assets | Implement the Markdown chunk, table, figure, and asset writers | deterministic Markdown bundle writer | Phase 4 |
+| 6. CLI runtime and resume | Implement the CLI, progress/logging, runtime, OOM handling, and resume | user-facing local CLI | Phase 5 |
+| 7. MVP quality hardening | End-to-end quality verification on samples and regression stabilization | MVP acceptance suite | Phase 6 |
+| 8. Release docs and packaging | Organize installation, model cache, offline, and release documentation | local release-ready docs/scripts | Phase 7 |
+| 9. PyQt thin client | A Windows UI that calls the CLI/library | optional PyQt UI | Phase 8 |
+
+## Phase 0: Harness Foundation
+- Directory: `phases/0-harness-foundation`
+- Purpose: Before implementation, create the shared models, sample metadata, PyMuPDF pre-analysis contract, and Markdown quality gates.
+- Steps:
+  1. `sample-metadata-contract`
+  2. `core-package-skeleton`
+  3. `page-preanalysis-contract`
+  4. `markdown-quality-gates`
+
+## Phase 1: Core Runtime Contracts
+- Directory: `phases/1-core-runtime-contracts`
+- Purpose: Before parsers run, stabilize the input, options, path, and output contracts that every phase shares.
+- Steps:
+  1. `input-normalization-slug`
+  2. `conversion-options-config`
+  3. `output-bundle-contract`
+  4. `runtime-cache-policy`
+
+## Phase 2: Marker Adapter
+- Directory: `phases/2-marker-adapter`
+- Purpose: Connect Marker as the primary parser, and map the OCR/page plan and Marker's structured output to the internal block model.
+- Steps:
+  1. `marker-invocation-adapter`
+  2. `ocr-plan-handoff`
+  3. `marker-block-normalization`
+  4. `marker-failure-reporting`
+
+## Phase 3: Formula Pipeline
+- Directory: `phases/3-formula-pipeline`
+- Purpose: Connect Nougat as a formula-only parser, and stabilize formula delimiters, numbering, and fallback.
+- Steps:
+  1. `formula-block-detection`
+  2. `nougat-command-adapter`
+  3. `latex-validation-repair`
+  4. `formula-reference-links`
+
+## Phase 4: Semantic Enrichment
+- Directory: `phases/4-semantic-enrichment`
+- Purpose: Enrich Marker blocks into a logical structure suited to Markdown.
+- Steps:
+  1. `reading-order-checks`
+  2. `paragraph-stitching`
+  3. `header-footer-filtering`
+  4. `reference-indexing`
+
+## Phase 5: Markdown Rendering And Assets
+- Directory: `phases/5-markdown-rendering-assets`
+- Purpose: Generate the chunked Markdown bundle and image/table asset output deterministically.
+- Steps:
+  1. `markdown-block-renderer`
+  2. `table-renderer-fallbacks`
+  3. `figure-asset-writer`
+  4. `chunk-renderer`
+
+## Phase 6: CLI Runtime And Resume
+- Directory: `phases/6-cli-runtime-resume`
+- Purpose: Wrap the conversion engine in a CLI that users can run, and implement the runtime/recovery policy.
+- Steps:
+  1. `cli-entrypoint-options`
+  2. `progress-logging`
+  3. `resume-state`
+  4. `device-oom-policy`
+  5. `model-cache-offline`
+
+## Phase 7: MVP Quality Hardening
+- Directory: `phases/7-mvp-quality-hardening`
+- Purpose: Lock in end-to-end quality against the sample corpus and pass the MVP acceptance criteria.
+- Steps:
+  1. `sample-smoke-conversions`
+  2. `quality-metrics-report`
+  3. `regression-thresholds`
+  4. `mvp-fix-sweep`
+
+## Phase 8: Release Docs And Packaging
+- Directory: `phases/8-release-docs-packaging`
+- Purpose: Organize installation, model download, offline execution, and the release checklist for personal local use.
+- Steps:
+  1. `readme-usage-flow`
+  2. `environment-bootstrap-docs`
+  3. `license-checkpoint`
+  4. `release-checklist`
+
+## Phase 9: PyQt Thin Client
+- Directory: `phases/9-pyqt-thin-client`
+- Purpose: Build a Windows UI that does not duplicate the core engine.
+- Steps:
+  1. `ui-api-contract`
+  2. `pyqt-shell`
+  3. `ui-progress-resume`
+  4. `ui-packaging-notes`
+
+## Deferred Backlog
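+- A hosted conversion API is not included in the current phase plan.
+- LLM correction mode is not the default path; it requires a separate ADR and phase plan after the MVP.
+- If distribution or commercial use becomes realistic, Marker GPL and model weight licenses become subject to separate legal review.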
@@ -0,0 +1,88 @@
+# PRD: PDFtoMD
+
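+## Goal
+PDFtoMD is a program that converts PDF documents focused on mathematics, engineering, and mechanics into a bundle of Markdown documents that an AI Agent can easily access and read.
+
+The goal of this project is not simply to extract text from a PDF but to reconstruct it as a knowledge resource that is easy for AI to read while preserving the source document's logical structure.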
+
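+## Problem Definition
+- PDFs store text, images, formulas, tables, and captions by coordinates, so the original reading order breaks easily.
+- Papers and engineering documents frequently contain multi-column layouts, formula numbers, figure/table references, and complex tables.
+- For documents that mix scanned and text-layer pages, the OCR decision must be made per page rather than for the whole document.
+- AI Agents and RAG tools navigate logically divided Markdown chunks with linked assets more reliably than a single long PDF.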
+
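+## Users
+- Users who want to convert PDF documents to Markdown for use with AI Agents, RAG, and personal knowledge management tools
+- Users who want to read and manage papers and engineering documents that are heavy on formulas, tables, and images as Markdown
+- Users who want to split a long PDF into several Markdown files and explore it in parts
+- Users who want to convert locally in a Windows-native environment without external services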
+
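+## First MVP Scope
+- Fully local execution in a Windows-native environment
+- GPU used by default, with stable chunk processing targeted at an 8 GB VRAM environment
+- A single repo-local Python 3.11 `venv` environment
+- `Marker` as the default PDF parser engine
+- Body structure, OCR/layout, reading order, tables, figures, headings, and captions stay on the Marker path
+- Mathematical expressions and formulas use the `Nougat` parser
+- Pre-analysis of page count, text-layer quality, OCR need, and chunk plan with PyMuPDF
+- Conversion of PDF text into Markdown paragraphs and heading structure
+- Conversion of formulas in the PDF into LaTeX wrapped in Markdown math delimiters
+- When Nougat fails, the Marker source formula string is preserved as the fallback
+- Extraction of images in the PDF, linked from the Markdown
+- Figure numbers and captions of images preserved as far as possible
+- Tables in the PDF structured and output as Markdown tables
+- Tables that would lose too much as Markdown tables preserved through a limited HTML table or a table-region image fallback
+- Documents with many pages split into chunks with a 20-page target while preserving logical block boundaries
+- CLI progress, per-chunk success/failure summary, and stderr/local log recording
+- A resume option based on runtime cache/state for retrying failed chunks
+- Quality verification and regression tests based on the PDFs in `samples/`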
+
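+## Second-Stage Scope
+- A PyQt-based Windows UI
+- The UI is implemented as a thin client that calls the CLI/library layer
+- Optional external API integration is considered after the conversion engine has stabilized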
+
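+## Out of Scope
+- Making a hosted conversion API the default path
+- Making an LLM correction mode the default path
+- Separate sidecar metadata/log output distributed alongside generated documents
+- Duplicating conversion engine logic inside the PyQt UI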
+
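+## Core Features
+1. Convert a PDF document into a bundle of Markdown documents
+2. Handle text PDFs, scanned PDFs, and mixed PDFs with per-page OCR decisions
+3. Preserve formulas as LaTeX in `$ ... $` or `$$ ... $$` form
+4. Connect formula numbers and in-body formula references with internal links as far as possible
+5. Rearrange the multi-column layouts common in papers into Markdown's linear structure
+6. Extract images and link them from the Markdown
+7. Connect figure numbers, captions, and in-body figure references
+8. Structure tables and output Markdown, HTML, or a fallback image depending on table type
+9. Split a long PDF into several chunk Markdown files
+10. Support Korean filenames, long Windows paths, and paths containing spaces
+11. Control batch size and retry on OOM, targeted at a GTX 1070 Ti with 8 GB VRAM
+12. Provide an explicit model cache policy for offline execution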
+
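+## Quality Criteria
+- The original reading order must be maintained naturally in the Markdown.
+- The semantic roles of headings, body text, lists, quotations, tables, figures, captions, and formulas must be distinguished.
+- Formula delimiters and basic LaTeX structure must not break.
+- Formula numbers and body references must be connected as far as possible.
+- Images must be connected to their captions, figure numbers, and body references as far as possible.
+- Tables must be stored in the format that minimizes structural loss.
+- Chunk boundaries must not split paragraphs, tables, figures, or formulas in the middle.
+- The same input PDF with the same options must produce the same filenames, anchors, and asset structure.
+- Windows paths, Korean filenames, long documents, and GPU out-of-memory situations must be taken into account.
+- Errors and warnings must go to stderr/the local log without polluting the Markdown body.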
+
+## Acceptance Criteria
+- `python scripts/validate_workspace.py` must succeed.
+- `.\venv\python.exe -m pip check` must succeed.
+- The CUDA smoke test must succeed on the GTX 1070 Ti.
+- `.\venv\Scripts\nougat.exe --help` must succeed.
+- The sample metadata mapping file must describe the traits of each sample PDF.
+- Focused pytest tests must verify headings, formula delimiters, LaTeX environment pairs, image links, caption matching, table parseability, chunk boundaries, and no-exception conversion.
+
+## UI
+- The UI uses PyQt and is a second-stage goal.
+- The UI does not implement the conversion engine itself; it is a thin client that calls the CLI/library layer.
+- It follows a minimal, clean, standard Windows design.
@@ -0,0 +1,90 @@
+# Toolchain Notes
+
+This document summarizes the researched toolchain choices and local compatibility decisions.
+
+## Verified Environment
+- OS: Windows 10
+- GPU: NVIDIA GeForce GTX 1070 Ti
+- VRAM: 8 GB
+- NVIDIA driver: 577.00
+- `nvidia-smi` CUDA runtime capability: 12.9
+- User-installed CUDA toolkit: 12.4
+- Python: 3.11.15 in repo-local `venv`
+- Environment manager: Conda / Miniforge
+
+## Python Dependencies
+Use one repo-local `venv` and install from `requirements.txt`.
+
+Key pins:
+- `torch==2.7.1+cu126`
+- `torchvision==0.22.1+cu126`
+- `marker-pdf==1.10.2`
+- `nougat-ocr==0.1.17`
+- `transformers==4.57.6`
+- `albumentations==1.3.1`
+- `pymupdf==1.27.2.3`
+- `pandas==3.0.2`
+- `pytest==9.0.3`
+- `pypdfium2==4.30.0`
+- `opencv-python-headless==4.11.0.86`
+- `Pillow==10.4.0`
+- `fsspec==2026.2.0`
+
+## PyTorch / CUDA Decision
+- `torch==2.11.0+cu128` imports on this machine but does not support GTX 1070 Ti `sm_61` at runtime.
+- `torch==2.7.1+cu126` satisfies Marker `torch>=2.7.0` and successfully runs CUDA tensor operations on GTX 1070 Ti.
+- Keep this pin unless a newer official PyTorch wheel is verified to support `sm_61`.
+
+## Marker
+- Marker is the primary document parser.
+- It handles layout, OCR, reading order, body text, headings, tables, figures, captions, and semantic block roles.
+- It should be consumed through structured output or adapter APIs where possible, not by scraping final Markdown text.
+
+## Nougat
+- Nougat is used only for formulas and mathematical expressions.
+- `nougat-ocr==0.1.17` has loose dependency bounds, so the project pins compatible versions.
+- `transformers 5.x` breaks Nougat imports.
+- `albumentations 2.x` breaks Nougat transform initialization.
+- Nougat failure must fall back to Marker source text.
+
+## PyMuPDF
+- PyMuPDF is used for lightweight page analysis, page counts, text-layer quality checks, OCR intervention planning, chunk planning, and low-level PDF/page operations.
+- It is not the primary document parser.
+
+## Comparison Baselines
+These tools are useful for research or quality comparison but are not the primary architecture:
+- PyMuPDF4LLM
+- Docling
+- MinerU
+- MarkItDown
+
+Do not switch the primary parser without updating `docs/ADR.md`, `docs/ARCHITECTURE.md`, and `docs/CONVERSION_POLICY.md`.
+
+## Reference Links
+- Marker PyPI: https://pypi.org/project/marker-pdf/
+- Nougat GitHub: https://github.com/facebookresearch/nougat
+- PyMuPDF documentation: https://pymupdf.readthedocs.io/
+- PyTorch previous versions: https://docs.pytorch.org/get-started/previous-versions/
+- GitHub Flavored Markdown spec: https://github.github.io/gfm/
+- MathJax TeX delimiters: https://docs.mathjax.org/en/latest/input/tex/delimiters.html
+- Docling GitHub: https://github.com/docling-project/docling
+- MinerU GitHub: https://github.com/opendatalab/MinerU
+
+## Markdown And Math Rendering
+- Markdown table output should target GitHub Flavored Markdown where possible.
+- Complex tables may use limited HTML `<table>`.
+- Math output uses `$ ... $` for inline formulas and `$$ ... $$` for block formulas.
+- `$...$` can conflict with ordinary dollar signs, so delimiter validation and repair are required.
+
+## Model Cache
+- Use explicit local cache paths for Marker/Nougat/Hugging Face model downloads.
+- README should include model pre-download and offline execution instructions before the engine is released.
+- Default project-local model cache path is `.models/`.
+- `PDFTOMD_MODEL_CACHE` can override the default cache root.
+- The runtime cache policy exposes Hugging Face cache environment variables from that root without downloading models during validation.
+- Runtime logs and resume state are runtime artifacts under `output/.pdftomd-runtime/<document-slug>/`, not generated document sidecars.
+
+## Licensing Notes
+- Current user context is personal use.
+- Before redistribution or commercial use, revisit Marker GPL and model-weight license implications.
+- Process or API isolation can reduce coupling risk, but it is not a substitute for legal review.
@@ -0,0 +1,39 @@
+# UI Design Guide
+
+The UI is a second-stage goal. The first MVP stabilizes the CLI/library conversion engine first.
+
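+## Design Principles
+1. Follow a minimal UI that fits the standard Windows environment.
+2. Do not duplicate conversion engine logic in the UI.
+3. Keep the PyQt UI as a thin client that calls the core Python API or the CLI.
+4. The user must be able to see the current state while a long document is converting.
+5. Show errors and warnings in a readable way without polluting the generated Markdown.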
+
+## Main Screens
+- PDF selection
+- Output folder selection
+- Runtime selection: `cuda`, `auto`, `cpu`
+- Formula parser selection: `nougat`, `marker`
+- Chunk size display, keeping the default value
+- Progress and per-chunk status display
+- Failed chunk summary
+- Resume button
+- Open local log
+
+## Interaction Rules
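+- Defaults must match the CLI defaults.
+- When CUDA initialization fails in an explicit `cuda` run, do not fall back to CPU automatically; show the failure clearly.
+- When CUDA fails in an `auto` run, show a warning and then the CPU fallback state.
+- Cancellation must be possible during conversion.
+- Successful and failed chunks must be distinguishable.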
+
+## Visual Style
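+- Use restrained colors and spacing that suit Windows native.
+- Because this is a work-tool UI, do not use marketing heroes or decorative layouts.
+- Provide middle ellipsis or tooltips so that long filenames and Korean paths are not cut off.
+- Provide logs and result paths as copyable text.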
+
+## Boundary
+- The UI calls only the public API of the `src/` core package or the CLI.
+- The UI does not combine Marker/Nougat/PyMuPDF directly.
+- UI tests are kept separate from the core conversion quality tests.