remove files

2026-05-08 16:31:17 +09:00
parent 7e985ae94a
commit 551ab50735
135 changed files with 0 additions and 41205 deletions
@@ -1,142 +0,0 @@
-# Architecture Decision Records
-
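-## Philosophy
-Core values of the project:
- Accurate formula conversion
- Local operation
- Memory-efficient operation
- A deterministic Markdown bundle that AI agents can navigate easily
- Preservation of the source document's structure and reference relationships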
-
---
-
-## ADR-001: Marker-first document parsing
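-**Decision**: Use Marker as the default PDF parser.
-
-**Rationale**:
- Marker is well suited to tracking document structure, including layout, OCR, reading order, tables, figures, captions, and headings.
- The project goal is not plain text extraction but reconstructing the source's logical structure as Markdown.
-
-**Trade-offs**:
- Marker dependencies and model weights must be managed.
- If distribution becomes likely, the GPL and model licenses must be reviewed.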
-
---
-
-## ADR-002: Nougat as formula-only parser
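-**Decision**: Use Nougat only as a parser for formulas and mathematical expressions, not as a full-PDF parser.
-
-**Rationale**:
- Nougat is strong at converting formulas in academic documents to LaTeX.
- Marker must own the overall document structure so that reading order, tables, figures, and caption paths stay consistent.
-
-**Trade-offs**:
- A handoff/fallback layer is needed to connect Marker blocks with Nougat results.
- When Nougat fails, Marker's source string must be used as the fallback.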
-
---
-
-## ADR-003: PyMuPDF page pre-analysis and chunk planning
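-**Decision**: Use PyMuPDF for page counts, text-layer quality, OCR need detection, chunk planning, and low-level PDF operations.
-
-**Rationale**:
- Fast page-level analysis is needed before running heavy parsers.
- Mixed PDFs require a per-page decision on whether OCR should intervene.
- Long PDFs are split into chunks with a 20-page target while respecting logical block boundaries.
-
-**Trade-offs**:
- An adapter is needed to reconcile PyMuPDF analysis results with Marker layout results.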
-
---
-
-## ADR-004: Single Python 3.11 environment
-**Decision**: Use a single repo-local Python 3.11 `venv`.
-
-**Rationale**:
- It simplifies the development and execution paths.
- Marker and Nougat work together in one environment when explicit dependency pins are in place.
-
-**Verified key pins**:
- `torch==2.7.1+cu126`
- `torchvision==0.22.1+cu126`
- `marker-pdf==1.10.2`
- `nougat-ocr==0.1.17`
- `transformers==4.57.6`
- `albumentations==1.3.1`
- `pypdfium2==4.30.0`
- `opencv-python-headless==4.11.0.86`
- `Pillow==10.4.0`
- `fsspec==2026.2.0`
-
-**Trade-offs**:
- Because of Nougat's loose dependency bounds, the requirements pins must be maintained strictly.
- The latest PyTorch cannot be adopted unconditionally. `torch==2.7.1+cu126` is used to keep GTX 1070 Ti `sm_61` support.
-
---
-
-## ADR-005: Markdown bundle output without document sidecars by default
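-**Decision**: Limit the default output to chunk Markdown files and an asset directory.
-
-**Rationale**:
- Prioritize artifacts that AI agents can read and navigate easily.
- Separate sidecar artifacts stay out of scope until the user explicitly requests them.
-
-**Trade-offs**:
- Conversion diagnostics must be kept separate from document output.
- Runtime logs, state, and cache are allowed but must be distinguished from the document output contract.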
-
---
-
-## ADR-006: Focused quality assertions over full snapshots
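-**Decision**: Prefer focused assertions over full Markdown snapshot comparison.
-
-**Rationale**:
- PDF conversion results are sensitive to line breaks, spacing, and parser version.
- What matters for quality is headings, formulas, tables, images, captions, links, chunk integrity, and the absence of exceptions.
-
-**Trade-offs**:
- Test design becomes more fine-grained.
- A sample metadata mapping is required.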
-
---
-
-## ADR-007: Runtime fallback policy
-**Decision**:
- An explicit `--runtime cuda` or `--device cuda` fails fast when CUDA fails.
- `--runtime auto` may fall back to CPU after a warning.
- On GPU OOM, retry with a smaller batch/page unit where possible.
-
-**Rationale**:
- When the user explicitly requests CUDA, a silent switch to CPU causes unpredictable delays.
- Auto mode must provide flexible execution.
-
-**Trade-offs**:
- Runtime state and error reporting are required.
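The ADR-007 policy can be sketched as a small pure function. This is an illustration under stated assumptions, not the project's code: `resolve_device`, `CudaUnavailableError`, and the `cuda_available` flag are hypothetical names, and in a real run the flag would presumably come from something like `torch.cuda.is_available()`.

```python
import warnings


class CudaUnavailableError(RuntimeError):
    """Hypothetical error for an explicit CUDA request that cannot be honored."""


def resolve_device(runtime: str, cuda_available: bool) -> str:
    """Map a requested runtime ('cuda', 'auto', 'cpu') to the device to use."""
    if runtime == "cpu":
        return "cpu"
    if runtime == "cuda":
        if not cuda_available:
            # Explicit request: fail fast instead of silently switching to CPU.
            raise CudaUnavailableError("CUDA was requested explicitly but is not available")
        return "cuda"
    if runtime == "auto":
        if cuda_available:
            return "cuda"
        # Auto mode: warn, then fall back to CPU.
        warnings.warn("CUDA is unavailable; falling back to CPU", RuntimeWarning)
        return "cpu"
    raise ValueError(f"unknown runtime: {runtime!r}")
```

Passing the availability flag in as an argument keeps the policy testable on machines without a GPU.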
-
---
-
-## ADR-008: Future PyQt UI as thin client
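-**Decision**: The PyQt UI does not implement the conversion engine itself; it is a thin client that calls the CLI/library API.
-
-**Rationale**:
- The first goal is stabilizing the CLI/library engine.
- Separating UI and core engine responsibilities makes testing and maintenance easier.
-
-**Trade-offs**:
- The core API contract must be stabilized before UI design.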
-
---
-
-## ADR-009: File-based planner/generator/evaluator Harness
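-**Decision**: Manage long-running work with a Harness workflow that separates `planner -> generator -> evaluator` roles and uses file-based handoff.
-
-**Rationale**:
- The PDF conversion engine is long-running work in which parser, OCR, formulas, tables, figures, runtime, and tests are intertwined, so consistency is hard to maintain within a single conversation.
- Small self-contained phase steps make it easy for a new agent to pick up the work with fresh context.
- Separating the implementation agent from the evaluation agent reduces self-evaluation bias and enforces verification against hard thresholds.
- Handoff through `PLAN.md`, `PROGRESS.md`, and `phases/` files allows the current state to be reconstructed outside the conversation.
-
-**Trade-offs**:
- Each step incurs the cost of writing a Sprint Contract and verification criteria.
- Adding too many agents, hooks, or commands can turn the Harness itself into a maintenance burden, so follow the simplification rule in `docs/HARNESS.md`.
- Hooks are only an aid and do not replace evaluator review or acceptance criteria.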
@@ -1,152 +0,0 @@
-# Architecture
-
-## Scope
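-The current implementation target is the first goal: a Windows-native, local-first CLI/library conversion engine.
-
- Default parser: `Marker`
- Default formula parser: `Nougat`
- PDF analysis and chunk planning: `PyMuPDF`
- Output: Markdown chunk files plus assets
- Default chunk target: 20 pages
- Default runtime: CUDA
- UI, hosted API, and a default LLM correction path are outside the scope of the first goal.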
-
-## Architecture Principles
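- Keep the Marker-first architecture.
- Nougat is a formula parser, not a full-document parser.
- PyMuPDF handles fast page-level analysis and chunk planning before heavy conversion.
- Output must be a deterministic Markdown bundle that AI agents can navigate easily.
- The risk of losing complex tables, figures, and formulas is handled through fallbacks and quality validation.
- Generated Markdown must center on the source document's content and must not be polluted with warning/error logs.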
-
-## Pipeline
-1. Input normalization
-   - Normalize the PDF path with `pathlib`.
-   - Support Korean characters, spaces, and long Windows paths.
-   - Generate the document slug deterministically.
-
-2. Page pre-analysis
-   - Use PyMuPDF to check page count, text length, image count, and text-layer quality.
-   - Estimate whether each page needs OCR.
-   - For long documents, plan chunks with a 20-page target while taking logical block boundary preservation into account.
-
-3. Marker parse
-   - Marker handles layout, OCR, reading order, body text, headings, tables, figures, captions, and semantic blocks.
-   - Map the Marker Document Model, or an equivalent structured output, to the internal block model.
-
-4. Formula handoff
-   - Pass only Marker equation blocks, or blocks in which a formula pattern is detected, to Nougat.
-   - Treat Nougat results as LaTeX string candidates that must pass validation and the fallback policy.
-   - When Nougat fails, use Marker's source formula string.
-
-5. Semantic enrichment
-   - Identify formula numbers, figure numbers, table numbers, captions, and body references.
-   - When identification confidence is sufficient, connect them with internal Markdown links.
-   - Remove repeated header/footer/page-number patterns from the body flow, or separate them from it.
-
-6. Markdown rendering
-   - Render heading, paragraph, list, blockquote, table, figure, and equation blocks as Markdown.
-   - Prefer Markdown tables, but use limited HTML tables or image fallbacks for complex tables.
-   - Each chunk may carry minimal frontmatter such as document title, page range, and chunk number.
-
-7. Asset writing
-   - Save images under `images/` with deterministic filenames.
-   - When a figure number exists, prefer `{document-slug}_fig-{figure-number}.png`.
-   - On collision or when no number exists, use chunk/page/block identifiers.
-   - Reduce duplicate asset storage with hash-based deduplication.
-
-8. Validation and reporting
-   - Validate math delimiter balance, LaTeX environment pairs, table parseability, image link existence, caption matching, and chunk boundary integrity.
-   - The CLI shows a progress bar and per-chunk success/failure.
-   - Errors and warnings are written to stderr and the local log.
-
-## Planned Layout
-```text
-samples/             # regression and quality corpus
-tests/               # focused pytest coverage
-scripts/             # validation / harness helpers
-phases/              # executable Harness phase tickets
-src/                 # source package, planned
-venv/                # repo-local Windows virtual environment, ignored by git
-output/              # conversion output, ignored by git
-```
-
-## Harness Boundary
- `docs/HARNESS.md` defines the planner/generator/evaluator workflow for long-running work.
- `phases/` files are execution tickets, not architecture policy. Architecture policy remains in `docs/ARCHITECTURE.md`, `docs/CONVERSION_POLICY.md`, and `docs/ADR.md`.
- Each implementation phase must keep parser, formula, pre-analysis, renderer, runtime, and UI responsibilities separated according to this document.
- Evaluator checks should use hard thresholds from each step's Sprint Contract and the focused quality strategy below.
-
-## Output Contract
-Output is grouped under the document slug directory.
-
-```text
-output/
-└── document-slug/
-    ├── document-slug_001.md
-    ├── document-slug_002.md
-    └── images/
-        ├── document-slug_fig-001.png
-        └── document-slug_fig-003.png
-```
-
-Detailed rules:
- Chunk Markdown filenames follow `<slug>_<chunk-index:03d>.md`.
- Image assets go under `images/`.
- The same input and the same options must produce the same output paths.
- Separate document sidecar metadata/log artifacts are not part of the default output contract.
- Local logs and resume state/cache are runtime artifacts and are distinct from the document output contract.
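A minimal sketch of how the naming rules in this output contract might be expressed with `pathlib`. The function names are hypothetical, and the 1-based chunk index and three-digit figure padding are inferred from the example tree (`document-slug_001.md`, `document-slug_fig-001.png`) rather than stated as rules.

```python
from pathlib import Path


def chunk_markdown_path(output_root: Path, slug: str, chunk_index: int) -> Path:
    """Build `<output_root>/<slug>/<slug>_<chunk-index:03d>.md`."""
    if chunk_index < 1:
        # Assumption: chunk numbering starts at 1, as in the example tree.
        raise ValueError("chunk_index is assumed to be 1-based")
    return output_root / slug / f"{slug}_{chunk_index:03d}.md"


def figure_asset_path(output_root: Path, slug: str, figure_number: int) -> Path:
    """Build `<output_root>/<slug>/images/<slug>_fig-<figure-number>.png`."""
    # Assumption: figure numbers are zero-padded to three digits.
    return output_root / slug / "images" / f"{slug}_fig-{figure_number:03d}.png"
```

Because both functions depend only on their arguments, the same input and options always yield the same paths.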
-
-## Runtime Policy
- Default runtime is `cuda`.
- With an explicit `--runtime cuda` or `--device cuda`, fail fast if CUDA is not ready.
- `--runtime auto` prints a CPU fallback warning when needed.
- On a GTX 1070 Ti with 8 GB, start with a batch size of about 1 to 2.
- On GPU OOM, retry with a smaller batch/page unit where possible.
- Default formula parser is `nougat`.
- The verified PyTorch baseline is `torch==2.7.1+cu126`.
-
-## Environment
-Use a single repo-local Python 3.11 `venv`.
-
-```powershell
-conda create -p .\venv python=3.11 -y
-.\venv\python.exe -m pip install -r requirements.txt
-```
-
-Key pins:
- `torch==2.7.1+cu126`
- `torchvision==0.22.1+cu126`
- `marker-pdf==1.10.2`
- `nougat-ocr==0.1.17`
- `transformers==4.57.6`
- `albumentations==1.3.1`
- `pypdfium2==4.30.0`
- `opencv-python-headless==4.11.0.86`
- `Pillow==10.4.0`
- `fsspec==2026.2.0`
-
-## Model Cache And Offline Mode
- Model cache locations must be managed explicitly.
- After the initial download, offline runs must prefer the already downloaded weights.
- The README must separately document the model download and offline execution procedures.
-
-## Quality Strategy
- Full Markdown snapshot comparison is not used as the primary validation method.
- Focused assertions take priority.
- Validation targets:
-  - heading hierarchy
-  - math delimiter balance
-  - LaTeX `\begin` / `\end` pairs
-  - image link existence
-  - figure/table/formula caption matching
-  - table parseability
-  - chunk boundary integrity
-  - Windows path and Korean filename handling
-  - no-exception conversion
-
-## Out of Scope for the First Goal
- PyQt UI implementation
- Making a hosted conversion API the default path
- Making an LLM correction mode the default path
- Separate sidecar metadata/log artifacts shipped alongside generated documents
@@ -1,91 +0,0 @@
-# Conversion Policy
-
-This document records implementation decisions for the PDF-to-Markdown conversion engine. It is planning guidance, not implementation code.
-
-## Input Classification
- Support mixed PDFs by default: text-layer pages, scanned pages, and mixed pages can appear in the same document.
- Use PyMuPDF or equivalent lightweight page analysis before heavy parsing to estimate text-layer quality per page.
- Decide OCR intervention per page instead of treating the entire PDF as text-only or scan-only.
- Prefer Marker's OCR/layout functionality for scanned or weak text-layer pages.
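The per-page decision described above could be sketched as follows. This is only an illustration: `PageStats`, `pages_needing_ocr`, and the 50-character threshold are hypothetical, and the counts are assumed to come from a lightweight pass such as PyMuPDF's `page.get_text()` and `page.get_images()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageStats:
    """Hypothetical per-page summary produced by a lightweight pre-analysis pass."""
    number: int        # 1-based page number
    text_chars: int    # characters found in the text layer
    image_count: int   # embedded images on the page


def needs_ocr(page: PageStats, min_text_chars: int = 50) -> bool:
    """Flag pages with a weak text layer that still carry image content.

    The threshold is an assumed starting point, not a tuned value.
    """
    return page.text_chars < min_text_chars and page.image_count > 0


def pages_needing_ocr(pages: list[PageStats], min_text_chars: int = 50) -> list[int]:
    """Return page numbers that should go through OCR, decided page by page."""
    return [p.number for p in pages if needs_ocr(p, min_text_chars)]
```

A blank page (no text, no images) is deliberately not flagged, since there is nothing for OCR to recover.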
-
-## Parser Responsibilities
- Marker owns overall layout tracking, reading order, body extraction, table structure, image extraction, headings, captions, and semantic block roles.
- Nougat owns only mathematical expressions and formula block parsing.
- Do not use Nougat as the main document parser.
- Send a block to Nougat when Marker identifies it as an equation area or when text-pattern detection marks it as mathematical content.
- If Nougat conversion fails, preserve information by falling back to Marker's extracted source text.
-
-## Formula Handling
- Treat formulas embedded inside a sentence without independent line spacing as inline formulas.
- Treat formulas occupying independent line space or vertical whitespace as block formulas.
- Preserve formula numbers detected near the right or bottom side of a formula region.
- Attach anchors to extracted formula numbers and rewrite body references such as `Eq. (3)` or `식 (5)` as internal Markdown links when confidence is sufficient.
- Validate Markdown math delimiters by counting opening and closing `$ ... $` and `$$ ... $$` pairs across each chunk.
- Validate common LaTeX environments by checking matching `\begin{...}` and `\end{...}` names and counts.
- If delimiter or environment validation fails, repair the closest logical location in a way that keeps Markdown rendering intact.
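The delimiter and environment checks above might look like the sketch below. It is a simplified illustration with hypothetical function names: it ignores escaped `\$`, but does not account for dollar signs inside code spans or fenced code, which a real validator would need to exclude first.

```python
import re

_ENV_PATTERN = re.compile(r"\\(begin|end)\{([^}]*)\}")


def math_delimiters_balanced(markdown: str) -> bool:
    """Check that `$$` and `$` delimiters each occur an even number of times."""
    text = re.sub(r"\\\$", "", markdown)          # drop escaped dollar signs
    block_count = text.count("$$")                # block delimiters
    inline_count = text.replace("$$", "").count("$")  # remaining inline delimiters
    return block_count % 2 == 0 and inline_count % 2 == 0


def unmatched_environments(markdown: str) -> list[str]:
    """Report `\\begin{...}` / `\\end{...}` pairs that do not match by name and nesting."""
    stack: list[str] = []
    problems: list[str] = []
    for match in _ENV_PATTERN.finditer(markdown):
        kind, name = match.group(1), match.group(2)
        if kind == "begin":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected \\end{{{name}}}")
    problems.extend(f"unclosed \\begin{{{name}}}" for name in stack)
    return problems
```

Returning a list of problems rather than a boolean gives a later repair step something concrete to act on.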
-
-## Tables
- Prefer Markdown tables when structure can be represented without major loss.
- Use limited HTML `<table>` output for tables with merged cells, multi-row headers, or structures that exceed GitHub Flavored Markdown table expressiveness.
- Preserve table footnotes as regular text immediately below the table.
- Preserve top or bottom captions as text and create internal links from body references such as `Table 1`.
- If structured table extraction loses too much information, also save a screenshot of the table region as a fallback asset and link it near the structured output.
-
-## Figures And Images
- Use deterministic image asset naming such as `{document-slug}_fig-{figure-number}.png` when a figure number is available.
- Include chunk/page/block identifiers in names or anchors when needed to avoid collisions.
- Place extracted image assets in the document `images/` directory.
- Add figure captions below Markdown image links.
- Rewrite body references such as `Fig. 2` to internal Markdown links when the figure target can be identified.
- Deduplicate extracted images by hash and let repeated references share one asset and anchor.
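Hash-based deduplication as described above can be sketched like this. The function name and the choice of SHA-256 are assumptions; the policy only says "by hash". Note that this catches byte-identical images only, not visually similar ones.

```python
import hashlib


def deduplicate_images(images: list[tuple[str, bytes]]) -> dict[str, str]:
    """Map each candidate asset name to the first name seen with identical bytes."""
    first_name_by_digest: dict[str, str] = {}
    canonical_by_name: dict[str, str] = {}
    for name, data in images:
        digest = hashlib.sha256(data).hexdigest()
        # The first image with a given digest becomes the shared asset.
        canonical = first_name_by_digest.setdefault(digest, name)
        canonical_by_name[name] = canonical
    return canonical_by_name
```

Repeated references can then point at the canonical name, so they share one asset and one anchor.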
-
-## Reading Order And Paragraph Flow
- Stitch lines into paragraphs when a line does not end with terminal punctuation and the next line begins like a continuation, or when bounding-box line spacing matches intra-paragraph spacing.
- Join hyphenated line breaks when a line-ending hyphen is followed by a lowercase continuation without whitespace.
- Preserve hyphens for known compounds, identifiers, or proper nouns when confidence is low.
- Use Marker bounding boxes to validate that the linearized text flow matches expected reading order in sample PDFs.
- Detect repeated header/footer/page-number patterns in stable top/bottom page regions and exclude them from body Markdown, or separate them from the main body flow.
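The hyphenation rule above can be illustrated with a small sketch. `join_hyphenated_lines` and the `keep_hyphen` set of known compounds are hypothetical; a real implementation would also use bounding-box spacing, which is not modeled here.

```python
def join_hyphenated_lines(lines: list[str], keep_hyphen: frozenset[str] = frozenset()) -> str:
    """Stitch lines into one paragraph, joining words split by a line-ending hyphen."""
    text = ""
    for raw in lines:
        line = raw.rstrip()
        if not line.strip():
            continue  # skip empty lines
        if not text:
            text = line.strip()
            continue
        if text.endswith("-") and line[0].islower():
            # Candidate compound, e.g. "well-" + "known" -> "well-known".
            left = text.rsplit(None, 1)[-1]
            right = line.split(None, 1)[0].rstrip(".,;:!?)")
            if (left + right).lower() in keep_hyphen:
                text += line             # known compound: keep the hyphen
            else:
                text = text[:-1] + line  # soft hyphen: drop it and join
        else:
            # Anything else is treated as an ordinary line break.
            text += " " + line.lstrip()
    return text
```

Keeping the compound list as an explicit argument makes the low-confidence cases easy to test and extend.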
-
-## Chunking
- Use 20 pages as the default chunk target.
- Prefer logical block boundaries over strict page boundaries when a paragraph, formula, table, or figure would be cut in the middle.
- If a block crosses a chunk boundary, keep the block intact by moving it to the previous or next chunk according to the least damaging boundary.
- Add minimal context at the top of each chunk, including document title, page range, and chunk number.
- Avoid sidecar metadata by default; put only core metadata in concise Markdown frontmatter.
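A rough sketch of the chunking rule above, under stated assumptions: pages are 1-based and inclusive, `spanning_blocks` is a hypothetical list of `(first_page, last_page)` ranges for blocks that must stay together, and "least damaging boundary" is approximated as the nearest boundary that splits no block, with ties going to the earlier page.

```python
def plan_chunks(
    page_count: int,
    target: int = 20,
    spanning_blocks: tuple[tuple[int, int], ...] = (),
) -> list[tuple[int, int]]:
    """Return inclusive (first_page, last_page) ranges of about `target` pages."""

    def splits_block(last_page: int) -> bool:
        # Ending a chunk at `last_page` cuts a block that continues onto the next page.
        return any(start <= last_page < end for start, end in spanning_blocks)

    chunks: list[tuple[int, int]] = []
    first = 1
    while first <= page_count:
        ideal = min(first + target - 1, page_count)
        last = ideal
        if last < page_count and splits_block(last):
            candidates = [p for p in range(first, page_count) if not splits_block(p)]
            # Nearest safe boundary; ties resolve to the earlier page (an assumption).
            last = min(candidates, key=lambda p: (abs(p - ideal), p), default=page_count)
        chunks.append((first, last))
        first = last + 1
    return chunks
```

For example, a 45-page document with a table spanning pages 20 to 21 ends its first chunk at page 19 instead of cutting the table.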
-
-## Determinism And Paths
- Ensure the same PDF and same options produce stable output structure and filenames.
- Use deterministic slug, anchor, asset, and chunk naming rules.
- Prefer `pathlib` for filesystem paths.
- Test Korean filenames, paths with spaces, and long Windows paths.
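One way a deterministic slug could be derived from a filename is sketched below. The function name and the exact normalization choices (NFC, lowercasing, hyphen as separator, `document` as the empty fallback) are assumptions, not the project's rules. Korean characters are kept because `\w` matches Unicode letters for `str` patterns.

```python
import re
import unicodedata
from pathlib import Path


def document_slug(pdf_path: Path) -> str:
    """Derive a stable slug from the PDF filename stem."""
    # NFC keeps composed and decomposed Hangul from producing different slugs.
    stem = unicodedata.normalize("NFC", pdf_path.stem).lower()
    # Collapse every run of non-word characters (spaces, brackets, dots) into one hyphen.
    slug = re.sub(r"[^\w]+", "-", stem).strip("-_")
    return slug or "document"
```

The function reads only the path string, never the clock or the filesystem, so repeated runs give the same slug.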
-
-## Runtime And Recovery
- Use conservative batch sizes, usually 1 or 2, for GTX 1070 Ti 8 GB VRAM.
- If a GPU out-of-memory error occurs, retry with a smaller batch or smaller page unit where possible.
- If the user explicitly requests `--device cuda` or `--runtime cuda`, fail fast instead of silently switching to CPU.
- If the user requests `--runtime auto`, warn and fall back to CPU when CUDA initialization fails.
- Keep model cache locations explicit, preferably under a local project or user-configured model cache directory, so offline operation can reuse already-downloaded weights.
-
-## Logging And Resume
- Show chunk-level progress and success/failure status in the CLI.
- Print warnings and errors to stderr and a local log file.
- Do not inject warnings or error logs into generated Markdown because they reduce document readability and integrity.
- Support resuming failed conversions by skipping already successful chunks when a local state/cache file is available.
- Sidecar outputs are still out of scope unless explicitly requested; a resume state file is a runtime cache, not part of the document output contract.
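Resume support as described above might be backed by a small JSON state file. The file layout (`{"completed": [...]}`) and the function names are hypothetical; the only point illustrated is that successful chunks are recorded in a runtime cache and skipped on the next run.

```python
import json
from pathlib import Path


def load_completed_chunks(state_file: Path) -> set[int]:
    """Read the set of successfully converted chunk indices, if any."""
    if not state_file.exists():
        return set()
    data = json.loads(state_file.read_text(encoding="utf-8"))
    return {int(index) for index in data.get("completed", [])}


def mark_chunk_completed(state_file: Path, chunk_index: int) -> None:
    """Record one successful chunk in the runtime state file."""
    completed = load_completed_chunks(state_file)
    completed.add(chunk_index)
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps({"completed": sorted(completed)}), encoding="utf-8")


def chunks_to_run(all_chunks: list[int], state_file: Path) -> list[int]:
    """Return the chunks still to convert, skipping already successful ones."""
    done = load_completed_chunks(state_file)
    return [index for index in all_chunks if index not in done]
```

The state file would live with the other runtime artifacts, outside the document output directory contract.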
-
-## Quality Tests
- Prefer focused assertions over full Markdown snapshots.
- Validate heading structure, formula delimiter balance, LaTeX environment pairs, image links, caption matching, table parseability, and no-exception conversion.
- Use regex and Markdown/HTML parsers where practical instead of ad hoc string checks.
- Maintain a sample metadata mapping file for `samples/` that tags each PDF by traits such as text-layer quality, scanned pages, multi-column layout, formula density, table density, figure density, and Korean filename coverage.
- Use engineering/mechanics PDFs with multi-column layout, formulas, graphs, and tables as the MVP acceptance corpus.
-
-## Licensing
- Current use is personal, which lowers immediate distribution risk.
- If redistribution or commercial use becomes relevant, revisit Marker GPL and model-weight license implications before packaging.
- Process or service isolation can be considered as a licensing risk-mitigation strategy, but it is not a legal conclusion and should be reviewed before distribution.
-
-## UI Boundary
- Keep the core conversion engine as a Python API/CLI package.
- Future PyQt UI should remain a thin client over the same API and must not duplicate conversion logic.
-
@@ -1,114 +0,0 @@
-# Harness Engineering Guide
-
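-This document defines the Harness operating rules for managing long-running agent work in the PDFtoMD project. It is based on the separation of planner, generator, and evaluator, file-based handoff, sprint contracts, and the independent evaluation loop emphasized in Anthropic's article "Harness design for long-running application development".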
-
-## Purpose
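- Break long conversion-engine development into small self-contained steps.
- Let a new agent pick up the work by reading only `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, and the `phases/` files, without prior conversation context.
- Separate the implementation agent from the evaluation agent to reduce self-evaluation bias.
- Fix each step's success conditions in a file before any code is written.
- Keep the Harness itself simple, and put complexity only into the necessary verification criteria and step boundaries.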
-
-## Roles
-
-### Planner
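- Reads the product goals and architecture documents and writes phases and steps.
- Does not over-specify implementation details; makes deliverables, scope of responsibility, acceptance criteria, and prohibited scope explicit.
- Deliverables:
-  - `PLAN.md` update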
-  - `phases/index.json`
-  - `phases/{phase}/index.json`
-  - `phases/{phase}/stepN.md`
-
-### Generator
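- Performs only one `stepN.md` at a time.
- Reads the step's "Sprint Contract" before working and, if anything is ambiguous, records it as a blocker in `PROGRESS.md` before implementing.
- Writes tests first in implementation steps that require TDD.
- Deliverables:
-  - Code, test, and documentation changes within the step's scope
-  - `phases/{phase}/index.json` step status update
-  - `PROGRESS.md` handoff update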
-
-### Evaluator
- Independently reviews the results produced by the generator.
- Does not pass the step if any agreed criterion fails to meet its hard threshold.
- Records specific, actionable failure causes rather than only pass/fail.
- Deliverables:
-  - Review findings or a pass record
-  - Where needed, `error_message` or `blocked_reason` in `phases/{phase}/index.json`
-  - Verification results in `PROGRESS.md`
-
-## File Protocol
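- `AGENTS.md`: Repository rules that do not change.
- `PLAN.md`: Single source of truth for the overall work plan.
- `PROGRESS.md`: Single source of truth for current progress and handoff.
- `docs/*.md`: Product, architecture, decisions, toolchain, and Harness operating knowledge.
- `phases/index.json`: Registry of executable phases.
- `phases/{phase}/index.json`: Single source of truth for step status in that phase.
- `phases/{phase}/stepN.md`: A ticket that a new agent can execute independently.
-
-## Step Contract Template
-Each `stepN.md` must include the following information.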
-
-````markdown
-# Step N: step-name
-
-## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/ADR.md
- /docs/CONVERSION_POLICY.md
-
-## Task
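-Describe concretely the deliverables this step must produce and the files that may be modified.
-
-## Sprint Contract
- Done means: Completion conditions that the user can observe or that tests can confirm.
- Hard thresholds: Criteria where any single failure counts as step failure.
- Files owned: Files or directories this step may modify.
- Dependencies: Outputs of previous steps or required documents.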
-
-## Acceptance Criteria
-```powershell
-python scripts\validate_workspace.py
-```
-
-## Verification
-1. Run the tests and verification commands.
-2. Record the results and the next handoff in `PROGRESS.md`.
-3. Update the step in `phases/{phase}/index.json` to one of `completed`, `blocked`, or `error`.
-
-## Do Not
- Do not implement features outside the step's scope.
- Do not introduce new parsers or external APIs.
- Do not arbitrarily widen the generated Markdown output contract.
-````
-
-## Evaluation Criteria
-The PDFtoMD evaluator applies the following hard thresholds first.
-
-| Area | Hard Threshold |
-| --- | --- |
-| Architecture | Does not break the responsibility boundaries among Marker, Nougat, and PyMuPDF. |
-| TDD | In implementation steps, a failing test is added first, or the step states why no test is needed. |
-| Determinism | The same input and options produce the same slug, asset paths, anchors, and Markdown structure. |
-| Markdown quality | Headings, math delimiters, tables, image links, captions, and chunk frontmatter can be validated. |
-| Runtime | Does not undermine the Windows path, Korean filename, or CUDA/CPU runtime policies. |
-| Scope | Does not pull PyQt UI, hosted API, LLM correction, or sidecar output into the first implementation. |
-| Handoff | `PROGRESS.md` and the phase index give the next agent sufficient state. |
-
-## When To Use The Full Loop
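- Use the full planner/generator/evaluator loop:
-  - When starting a new phase
-  - For work with a high cost of failure, such as the parser adapter, chunk planner, renderer, or quality validator
-  - For work that changes several files and documents at once, such as the sample corpus or runtime policy
- Simple documentation typos, small command descriptions, and clear single-test fixes may be handled as ordinary Codex work. Even then, update `PROGRESS.md`.
-
-## Simplification Rule
-Keep Harness components only when they actually improve quality.
- If the same verification is repeated in two places, reduce it to one.
- Treat hooks as an aid; they do not replace a step's acceptance criteria or evaluator judgment.
- Do not give an agent too much context; specify only the documents and files the step needs.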
@@ -1,121 +0,0 @@
-# Implementation Phase Plan
-
-This document is the execution plan that divides the entire PDFtoMD implementation into phases. Detailed execution tickets for each phase live in `phases/{phase}/stepN.md`.
-
-## Planning Principles
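- The first goal is a Windows-native, local-first CLI/library conversion engine.
- The PyQt UI is implemented as a thin client after the core API and CLI have stabilized.
- Each phase builds on the outputs of the previous phase, and each step within a phase must be independently executable by a single agent.
- Implementation phases use TDD by default.
- Parser responsibility boundaries are maintained: Marker for document structure, Nougat for formulas, and PyMuPDF for pre-analysis and low-level PDF operations.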
-
-## Phase Overview
-
-| Phase | Goal | Primary Output | Depends On |
-| --- | --- | --- | --- |
-| 0. Harness foundation | Executable Harness base and minimal quality foundation | sample metadata, core models, preanalysis contract, quality gates | current docs |
-| 1. Core runtime contracts | Conversion options, input normalization, output bundle contract, path/cache policy | stable API contracts and tests | Phase 0 |
-| 2. Marker adapter | Implement Marker execution and the block normalization boundary | Marker adapter, OCR handoff, block mapping tests | Phase 1 |
-| 3. Formula pipeline | Nougat formula-only handoff and LaTeX validation/fallback | formula detector, Nougat adapter, repair/fallback tests | Phase 2 |
-| 4. Semantic enrichment | Enrich paragraphs, reading order, header/footer handling, and reference relationships | enrichment pipeline and reference index | Phase 2, 3 |
-| 5. Markdown rendering and assets | Implement the Markdown chunk, table, figure, and asset writers | deterministic Markdown bundle writer | Phase 4 |
-| 6. CLI runtime and resume | Implement CLI, progress/logging, runtime, OOM handling, and resume | user-facing local CLI | Phase 5 |
-| 7. MVP quality hardening | Sample-based end-to-end quality validation and regression stabilization | MVP acceptance suite | Phase 6 |
-| 8. Release docs and packaging | Consolidate installation, model cache, offline, and release documentation | local release-ready docs/scripts | Phase 7 |
-| 9. PyQt thin client | Windows UI that calls the CLI/library | optional PyQt UI | Phase 8 |
-
-## Phase 0: Harness Foundation
- Directory: `phases/0-harness-foundation`
- Purpose: Build the shared models, sample metadata, PyMuPDF pre-analysis contract, and Markdown quality gates before implementation.
- Steps:
-  1. `sample-metadata-contract`
-  2. `core-package-skeleton`
-  3. `page-preanalysis-contract`
-  4. `markdown-quality-gates`
-
-## Phase 1: Core Runtime Contracts
- Directory: `phases/1-core-runtime-contracts`
- Purpose: Stabilize the input, options, path, and output contracts shared by all phases before any parser runs.
- Steps:
-  1. `input-normalization-slug`
-  2. `conversion-options-config`
-  3. `output-bundle-contract`
-  4. `runtime-cache-policy`
-
-## Phase 2: Marker Adapter
- Directory: `phases/2-marker-adapter`
- Purpose: Connect Marker as the primary parser and map the OCR/page plan and Marker's structured output to the internal block model.
- Steps:
-  1. `marker-invocation-adapter`
-  2. `ocr-plan-handoff`
-  3. `marker-block-normalization`
-  4. `marker-failure-reporting`
-
-## Phase 3: Formula Pipeline
- Directory: `phases/3-formula-pipeline`
- Purpose: Connect Nougat as a formula-only parser and stabilize formula delimiters, numbering, and fallback.
- Steps:
-  1. `formula-block-detection`
-  2. `nougat-command-adapter`
-  3. `latex-validation-repair`
-  4. `formula-reference-links`
-
-## Phase 4: Semantic Enrichment
- Directory: `phases/4-semantic-enrichment`
- Purpose: Enrich Marker blocks into a logical structure suited to Markdown.
- Steps:
-  1. `reading-order-checks`
-  2. `paragraph-stitching`
-  3. `header-footer-filtering`
-  4. `reference-indexing`
-
-## Phase 5: Markdown Rendering And Assets
- Directory: `phases/5-markdown-rendering-assets`
- Purpose: Deterministically generate the chunked Markdown bundle and image/table asset output.
- Steps:
-  1. `markdown-block-renderer`
-  2. `table-renderer-fallbacks`
-  3. `figure-asset-writer`
-  4. `chunk-renderer`
-
-## Phase 6: CLI Runtime And Resume
- Directory: `phases/6-cli-runtime-resume`
- Purpose: Package the conversion engine as a user-runnable CLI and implement the runtime/recovery policy.
- Steps:
-  1. `cli-entrypoint-options`
-  2. `progress-logging`
-  3. `resume-state`
-  4. `device-oom-policy`
-  5. `model-cache-offline`
-
-## Phase 7: MVP Quality Hardening
- Directory: `phases/7-mvp-quality-hardening`
- Purpose: Lock in end-to-end quality against the sample corpus and pass the MVP acceptance criteria.
- Steps:
-  1. `sample-smoke-conversions`
-  2. `quality-metrics-report`
-  3. `regression-thresholds`
-  4. `mvp-fix-sweep`
-
-## Phase 8: Release Docs And Packaging
- Directory: `phases/8-release-docs-packaging`
- Purpose: Document installation, model download, offline execution, and the release checklist for personal local use.
- Steps:
-  1. `readme-usage-flow`
-  2. `environment-bootstrap-docs`
-  3. `license-checkpoint`
-  4. `release-checklist`
-
-## Phase 9: PyQt Thin Client
- Directory: `phases/9-pyqt-thin-client`
- Purpose: Build a Windows UI that does not duplicate the core engine.
- Steps:
-  1. `ui-api-contract`
-  2. `pyqt-shell`
-  3. `ui-progress-resume`
-  4. `ui-packaging-notes`
-
-## Deferred Backlog
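- A hosted conversion API is not included in the current phase plan.
- LLM correction mode is not a default path; it requires a separate ADR and phase plan after the MVP.
- If distribution or commercial use becomes realistic, the Marker GPL and model weight licenses become subject to separate legal review.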
@@ -1,88 +0,0 @@
-# PRD: PDFtoMD
-
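-## Goal
-PDFtoMD is a program that converts PDF documents focused on mathematics, engineering, and mechanics into a bundle of Markdown documents that AI agents can easily access and read.
-
-The goal of this project is not simply to extract text from PDFs, but to reconstruct the source into knowledge material that AI can read easily while preserving the document's logical structure.
-
-## Problem Definition
- PDFs store text, images, formulas, tables, and captions by coordinates, so the original reading order breaks easily.
- Papers and engineering documents often contain multi-column layouts, formula numbers, figure/table references, and complex tables.
- For documents that mix scanned and text-layer pages, the OCR decision must be made per page rather than for the whole document.
- AI agents and RAG tools navigate logically divided Markdown chunks with linked assets more reliably than a single long PDF.
-
-## Users
- Users who want to convert PDF documents to Markdown for use with AI agents, RAG, or personal knowledge management tools
- Users who want to read and manage papers and engineering documents with many formulas, tables, and images as Markdown
- Users who want to split long PDFs into multiple Markdown files for partial navigation
- Users who want to convert locally in a Windows-native environment without external services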
-
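-## First-Goal MVP Scope
- Fully local execution in a Windows-native environment
- GPU by default, with stable chunk processing on an 8 GB VRAM baseline
- A single repo-local Python 3.11 `venv` environment
- `Marker` as the default PDF parser engine
- Body structure, OCR/layout, reading order, tables, figures, headings, and captions stay on the Marker path
- The `Nougat` parser for mathematical expressions and formulas
- Pre-analysis with PyMuPDF of page count, text-layer quality, OCR need, and chunk plan
- Conversion of PDF text into Markdown paragraphs and heading structure
- Conversion of formulas in the PDF to LaTeX using Markdown math delimiters
- Preservation of Marker's source formula string as a fallback when Nougat fails
- Extraction of images from the PDF and linking them in Markdown
- Preservation of figure numbers and captions for images wherever possible
- Structuring of tables in the PDF and output as Markdown tables
- Preservation of tables that lose too much as Markdown tables through limited HTML tables or table-region image fallbacks
- Splitting of long documents into chunks with a 20-page target while preserving logical block boundaries
- CLI progress, per-chunk success/failure summary, and stderr/local log recording
- A resume option based on runtime cache/state for resuming failed chunks
- Support for quality validation and regression tests based on the PDFs in `samples/`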
-
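-## Second-Goal Scope
- PyQt-based Windows UI
- The UI is implemented as a thin client that calls the CLI/library layer
- Optional external API integration is considered after the conversion engine has stabilized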
-
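-## Out of Scope
- Making a hosted conversion API the default path
- Making an LLM correction mode the default path
- Separate sidecar metadata/log artifacts shipped alongside generated documents
- Duplicating conversion engine logic inside the PyQt UI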
-
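-## Core Features
-1. Convert PDF documents into a bundle of Markdown documents
-2. Handle text PDFs, scanned PDFs, and mixed PDFs with per-page OCR decisions
-3. Preserve formulas as LaTeX in `$ ... $` or `$$ ... $$` form
-4. Link formula numbers and in-body formula references as internal links wherever possible
-5. Rearrange the multi-column layouts common in papers into Markdown's linear structure
-6. Extract images and link them in Markdown
-7. Link figure numbers, captions, and in-body figure references
-8. Structure tables and output Markdown, HTML, or fallback images by table type
-9. Split long PDFs into multiple chunk Markdown files
-10. Support Korean filenames, long Windows paths, and paths containing spaces
-11. Control batch size and retry on OOM for a GTX 1070 Ti with 8 GB VRAM
-12. Provide an explicit model cache policy for offline execution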
-
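-## Quality Criteria
- The original reading order must be maintained naturally in Markdown.
- The semantic roles of headings, body text, lists, quotations, tables, figures, captions, and formulas must be distinguished.
- Formula delimiters and basic LaTeX structure must not break.
- Formula numbers and body references must be linked wherever possible.
- Images, captions, figure numbers, and body references must be linked wherever possible.
- Tables must be stored in the format that minimizes structural loss.
- Chunk boundaries must not cut through paragraphs, tables, figures, or formulas.
- The same input PDF and options must produce the same filenames, anchors, and asset structure.
- Windows paths, Korean filenames, long documents, and GPU out-of-memory situations must be accounted for.
- Errors and warnings must go to stderr/local log without polluting the Markdown body.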
-
-## Acceptance Criteria
- `python scripts/validate_workspace.py` must succeed.
- `.\venv\python.exe -m pip check` must succeed.
- The CUDA smoke test must succeed on the GTX 1070 Ti.
- `.\venv\Scripts\nougat.exe --help` must succeed.
- The sample metadata mapping file must describe the traits of each sample PDF.
- Focused pytest must validate headings, formula delimiters, LaTeX environment pairs, image links, caption matching, table parseability, chunk boundaries, and no-exception conversion.
-
-## UI
- The UI uses PyQt and is the second goal.
- The UI does not implement the conversion engine itself; it is a thin client that calls the CLI/library layer.
- It follows a minimal, clean, standard Windows design.
@@ -1,90 +0,0 @@
-# Toolchain Notes
-
-This document summarizes the researched toolchain choices and local compatibility decisions.
-
-## Verified Environment
- OS: Windows 10
- GPU: NVIDIA GeForce GTX 1070 Ti
- VRAM: 8 GB
- NVIDIA driver: 577.00
- `nvidia-smi` CUDA runtime capability: 12.9
- User-installed CUDA toolkit: 12.4
- Python: 3.11.15 in repo-local `venv`
- Environment manager: Conda / Miniforge
-
-## Python Dependencies
-Use one repo-local `venv` and install from `requirements.txt`.
-
-Key pins:
- `torch==2.7.1+cu126`
- `torchvision==0.22.1+cu126`
- `marker-pdf==1.10.2`
- `nougat-ocr==0.1.17`
- `transformers==4.57.6`
- `albumentations==1.3.1`
- `pymupdf==1.27.2.3`
- `pandas==3.0.2`
- `pytest==9.0.3`
- `pypdfium2==4.30.0`
- `opencv-python-headless==4.11.0.86`
- `Pillow==10.4.0`
- `fsspec==2026.2.0`
-
-## PyTorch / CUDA Decision
- `torch==2.11.0+cu128` imports on this machine but does not support GTX 1070 Ti `sm_61` at runtime.
- `torch==2.7.1+cu126` satisfies Marker `torch>=2.7.0` and successfully runs CUDA tensor operations on GTX 1070 Ti.
- Keep this pin unless a newer official PyTorch wheel is verified to support `sm_61`.
-
-## Marker
- Marker is the primary document parser.
- It handles layout, OCR, reading order, body text, headings, tables, figures, captions, and semantic block roles.
- It should be consumed through structured output or adapter APIs where possible, not by scraping final Markdown text.
-
-## Nougat
- Nougat is used only for formulas and mathematical expressions.
- `nougat-ocr==0.1.17` has loose dependency bounds, so the project pins compatible versions.
- `transformers 5.x` breaks Nougat imports.
- `albumentations 2.x` breaks Nougat transform initialization.
- Nougat failure must fall back to Marker source text.
-
-## PyMuPDF
- PyMuPDF is used for lightweight page analysis, page counts, text-layer quality checks, OCR intervention planning, chunk planning, and low-level PDF/page operations.
- It is not the primary document parser.
-
-## Comparison Baselines
-These tools are useful for research or quality comparison but are not the primary architecture:
- PyMuPDF4LLM
- Docling
- MinerU
- MarkItDown
-
-Do not switch the primary parser without updating `docs/ADR.md`, `docs/ARCHITECTURE.md`, and `docs/CONVERSION_POLICY.md`.
-
-## Reference Links
- Marker PyPI: https://pypi.org/project/marker-pdf/
- Nougat GitHub: https://github.com/facebookresearch/nougat
- PyMuPDF documentation: https://pymupdf.readthedocs.io/
- PyTorch previous versions: https://docs.pytorch.org/get-started/previous-versions/
- GitHub Flavored Markdown spec: https://github.github.io/gfm/
- MathJax TeX delimiters: https://docs.mathjax.org/en/latest/input/tex/delimiters.html
- Docling GitHub: https://github.com/docling-project/docling
- MinerU GitHub: https://github.com/opendatalab/MinerU
-
-## Markdown And Math Rendering
- Markdown table output should target GitHub Flavored Markdown where possible.
- Complex tables may use limited HTML `<table>`.
- Math output uses `$ ... $` for inline formulas and `$$ ... $$` for block formulas.
- `$...$` can conflict with ordinary dollar signs, so delimiter validation and repair are required.
-
-## Model Cache
- Use explicit local cache paths for Marker/Nougat/Hugging Face model downloads.
- README should include model pre-download and offline execution instructions before the engine is released.
- Default project-local model cache path is `.models/`.
- `PDFTOMD_MODEL_CACHE` can override the default cache root.
- The runtime cache policy exposes Hugging Face cache environment variables from that root without downloading models during validation.
- Runtime logs and resume state are runtime artifacts under `output/.pdftomd-runtime/<document-slug>/`, not generated document sidecars.
-
-## Licensing Notes
- Current user context is personal use.
- Before redistribution or commercial use, revisit Marker GPL and model-weight license implications.
- Process or API isolation can reduce coupling risk, but it is not a substitute for legal review.
@@ -1,39 +0,0 @@
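-# UI Design Guide
-
-The UI is the second goal. The first-goal MVP stabilizes the CLI/library conversion engine first.
-
-## Design Principles
-1. Follow a minimal UI that fits the standard Windows environment.
-2. Do not duplicate conversion engine logic in the UI.
-3. Keep the PyQt UI a thin client that calls the core Python API or the CLI.
-4. Users must be able to see the current status during long document conversions.
-5. Show errors and warnings in a readable way without polluting the generated Markdown.
-
-## Main Screens
- PDF selection
- Output folder selection
- Runtime selection: `cuda`, `auto`, `cpu`
- Formula parser selection: `nougat`, `marker`
- Chunk size display, keeping the default value
- Progress and per-chunk status display
- Failed chunk summary
- Resume button
- Open local log
-
-## Interaction Rules
- Defaults must match the CLI defaults.
- In an explicit `cuda` run, if CUDA initialization fails, do not fall back to CPU automatically; show the failure clearly.
- In an `auto` run, if CUDA fails, show a warning and then the CPU fallback state.
- Conversion must be cancellable while in progress.
- Successful and failed chunks must be distinguishable.
-
-## Visual Style
- Use restrained colors and spacing that suit Windows native.
- Because this is a work-tool UI, do not use marketing heroes or decorative layouts.
- Provide middle ellipsis or tooltips so that long filenames and Korean paths are not cut off.
- Provide logs and result paths as copyable text.
-
-## Boundary
- The UI calls only the public API of the `src/` core package or the CLI.
- The UI does not combine Marker/Nougat/PyMuPDF directly.
- UI tests are kept separate from core conversion quality tests.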