add files

2026-04-30 17:05:19 +09:00
parent f3e01b5a8c
commit 7e985ae94a
135 changed files with 41205 additions and 0 deletions
@@ -0,0 +1,151 @@
+# PDFtoMD Multi-Agent Plan
+
+## Goal
+Build a Windows-native, local-first PDF-to-Markdown conversion engine that preserves logical reading order, paragraph flow, formulas, tables, figures, and captions, and that emits chunked output for AI-agent consumption.
+
+## Current Scope
+- Primary deliverable: CLI/library conversion engine.
+- Primary PDF parser: Marker.
+- Formula parser: Nougat, isolated from the main parser path and used only for mathematical formulas.
+- PDF analysis and chunk planning: PyMuPDF.
+- Output: chunked Markdown files plus image/table assets under a document slug directory.
+- Default chunk size: 20 pages.
+- Runtime target: Windows 10, local GPU first, GTX 1070 Ti with 8 GB VRAM.
+- User context: personal use.
+- Python environment target: one repo-local Python 3.11 environment.
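
The 20-page default chunk size can be illustrated with a minimal sketch of chunk planning. The `PageChunk` type and the `plan_chunks` function are hypothetical names, not part of the repository. In the engine the page count would presumably come from PyMuPDF (for example `Document.page_count`), but here it is passed in as a plain integer so the sketch has no dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageChunk:
    """Hypothetical chunk descriptor; pages are 1-based and inclusive."""
    index: int
    first_page: int
    last_page: int


def plan_chunks(page_count: int, chunk_size: int = 20) -> list[PageChunk]:
    """Split a document of `page_count` pages into consecutive chunks."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunks = []
    for index, start in enumerate(range(1, page_count + 1, chunk_size)):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size - 1, page_count)
        chunks.append(PageChunk(index, start, end))
    return chunks


if __name__ == "__main__":
    for chunk in plan_chunks(45):
        print(chunk)
```

For a 45-page document this yields the page ranges 1-20, 21-40, and 41-45. How content that straddles a chunk boundary is handled is left to `docs/CONVERSION_POLICY.md` and is not modeled here.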
+
+## Out of Scope For Now
+- PyQt UI implementation.
+- Hosted conversion API.
+- Default LLM correction path.
+- Sidecar metadata/log output unless explicitly requested.
+- Custom agent file creation, except for agents the user approves one at a time.
+- Engine implementation outside an approved Harness phase.
+
+## Current Inputs
+- Repository instructions: `AGENTS.md`.
+- Product/design documents: `docs/PRD.md`, `docs/ARCHITECTURE.md`, `docs/ADR.md`, `docs/TOOLCHAIN.md`, `docs/UI_GUIDE.md`.
+- Conversion policy decisions: `docs/CONVERSION_POLICY.md`.
+- Harness operating guide: `docs/HARNESS.md`.
+- Full implementation roadmap: `docs/IMPLEMENTATION_PLAN.md`.
+- Executable phase registry: `phases/index.json`.
+- Sample corpus:
+  - `samples/2007쉘구조물의유한요소해석에대하여.pdf`
+  - `samples/FourNodeQuadrilateralShellElementMITC4.pdf`
+  - `samples/MITC공부.pdf`
+  - `samples/유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf`
+
+## Research Tracks
+1. Toolchain research
+   - Verify Marker, Nougat, PyMuPDF, PyTorch/CUDA, Pandas, and Markdown/math-rendering constraints.
+   - Track licensing risks. Current use is personal, but revisit if distribution or commercial use becomes relevant.
+   - Compare the Marker-first architecture against PyMuPDF4LLM, Docling, and MinerU, which serve as quality baselines only.
+   - Keep `docs/TOOLCHAIN.md` updated when dependency pins or compatibility findings change.
+
+2. Conversion architecture
+   - Define stable internal document/block types.
+   - Keep Marker document/block structure as the main source for headings, body text, reading order, figures, tables, and captions.
+   - Treat Nougat output as formula text input subject to validation and fallback policy.
+   - Keep PyMuPDF responsible for page counts, chunk planning, and low-level PDF/page operations.
+   - Follow `docs/CONVERSION_POLICY.md` for OCR decisions, parser handoff rules, fallback behavior, chunk boundary handling, logging, and resume policy.
+
+3. Quality and regression strategy
+   - Prefer focused assertions over full Markdown snapshots.
+   - Validate headings, formula delimiters, `\begin`/`\end` pairs, table shape, image links, and captions, and confirm that conversion completes without raising exceptions.
+   - Include Korean filenames and Windows paths in regression coverage.
+   - Include VRAM pressure and long-document chunking scenarios.
+
+4. Runtime strategy
+   - Use one repo-local Python environment.
+   - Use a single `venv` for Marker, PyMuPDF, Pandas, the tests, and Nougat.
+   - Use CUDA-enabled PyTorch compatible with the installed NVIDIA driver and GTX 1070 Ti.
+   - The currently verified PyTorch pin is `torch==2.7.1+cu126`, because the newer `torch==2.11.0+cu128` does not support the GTX 1070 Ti's `sm_61` compute capability.
+   - Keep Nougat dependency pins explicit inside the unified environment, especially `transformers==4.57.6`, `albumentations==1.3.1`, `pypdfium2==4.30.0`, `opencv-python-headless==4.11.0.86`, `Pillow==10.4.0`, and `fsspec==2026.2.0`.
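
One way to keep the pins listed in the runtime track honest is to compare them against what is actually installed. The sketch below is an assumption about how such a check could look, not the project's tooling. The pin values are copied from the text above, while `installed_version` and `find_pin_mismatches` are hypothetical helper names. It relies only on the standard library's `importlib.metadata`.

```python
from importlib import metadata
from typing import Optional

# Pins copied from the plan text; the checking logic itself is an assumption.
PINS = {
    "torch": "2.7.1+cu126",
    "transformers": "4.57.6",
    "albumentations": "1.3.1",
    "pypdfium2": "4.30.0",
    "opencv-python-headless": "4.11.0.86",
    "Pillow": "10.4.0",
    "fsspec": "2026.2.0",
}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_pin_mismatches(pins, lookup=installed_version):
    """Map each mismatched package name to an (expected, found) pair."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_pin_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The `lookup` parameter exists so the comparison can be exercised in tests with a plain dictionary instead of a real environment.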
+
+## Created Project Agent Roles
+The user approved creating the project-scoped Codex extensions on 2026-04-30. These read-only agents now live under `.codex/agents/`.
+
+1. `pdf_toolchain_researcher`
+   - Read-only.
+   - Owns official-doc research for Marker, Nougat, PyMuPDF, PyTorch/CUDA, Markdown math, and comparison tools.
+   - Outputs compatibility notes, licensing notes, and recommended dependency constraints.
+
+2. `conversion_architect`
+   - Read-only at first.
+   - Owns engine boundaries, internal data contracts, chunk policy, adapter interfaces, and output contract.
+   - Outputs phase-ready architecture notes and acceptance criteria.
+
+3. `quality_evaluator`
+   - Read-only at first.
+   - Owns sample-corpus classification, focused quality checks, regression fixtures, and failure taxonomy.
+   - Outputs test strategy before implementation begins.
+
+4. `formula_pipeline_specialist`
+   - Read-only at first.
+   - Owns Nougat integration assumptions, formula extraction boundaries, LaTeX delimiter validation, and fallback policy.
+
+5. `layout_table_figure_specialist`
+   - Read-only at first.
+   - Owns reading order, paragraph stitching, table rendering, figure extraction, caption linking, and cross-reference preservation.
+
+6. `sample_corpus_analyst`
+   - Read-only at first.
+   - Owns sample PDF corpus analysis, OCR-candidate identification, metadata schema suggestions, and regression implications.
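
The LaTeX delimiter validation owned by `formula_pipeline_specialist`, together with the begin/end pair check named in the quality track, could start from something like the following. This is a minimal sketch under stated assumptions: the function names are hypothetical, only `\begin{...}`/`\end{...}` environments and `$$` display delimiters are considered, and it says nothing about the validation or fallback rules in `docs/CONVERSION_POLICY.md`.

```python
import re

# Matches \begin{name} or \end{name} and captures the kind and the name.
ENV_TOKEN = re.compile(r"\\(begin|end)\{([^{}]+)\}")


def find_environment_errors(latex: str) -> list[str]:
    """Report unbalanced or mismatched \\begin{...}/\\end{...} pairs."""
    errors = []
    stack = []  # names of environments that are currently open
    for match in ENV_TOKEN.finditer(latex):
        kind, name = match.group(1), match.group(2)
        if kind == "begin":
            stack.append(name)
        elif not stack:
            errors.append(f"\\end{{{name}}} without matching \\begin")
        elif stack[-1] != name:
            errors.append(f"\\end{{{name}}} closes \\begin{{{stack[-1]}}}")
            stack.pop()
        else:
            stack.pop()
    # Anything still open at the end was never closed.
    errors.extend(f"\\begin{{{name}}} is never closed" for name in reversed(stack))
    return errors


def has_balanced_display_delimiters(markdown: str) -> bool:
    """True when unescaped `$$` markers occur an even number of times."""
    return len(re.findall(r"(?<!\\)\$\$", markdown)) % 2 == 0
```

An empty list from `find_environment_errors` means every environment opened in the formula text is closed in order. A non-empty list gives a focused assertion target, in line with the plan's preference for focused checks over full Markdown snapshots.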
+
+## Future Agent Roles
+- `marker_adapter_worker`: implementation worker for Marker adapter code, after TDD phase approval.
+- `markdown_renderer_worker`: implementation worker for Markdown renderer and output contract, after TDD phase approval.
+- `runtime_cli_worker`: implementation worker for CLI/runtime/device behavior, after TDD phase approval.
+- `test_fixture_worker`: implementation worker for sample metadata and focused pytest fixtures, after TDD phase approval.
+
+## Harness Execution Model
+This project now follows a file-based planner/generator/evaluator workflow for long-running work.
+
+1. Planner creates or updates `phases/` steps from `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, and `docs/*.md`.
+2. Generator executes one `stepN.md` at a time and stays inside that step's owned files and Do Not list.
+3. Evaluator reviews the result against the step's Sprint Contract and hard thresholds before the work is considered complete.
+4. Communication and handoff happen through files, not only through chat:
+   - `PLAN.md` for overall work plan.
+   - `PROGRESS.md` for current state and next handoff.
+   - `phases/{phase}/index.json` for step execution status.
+   - `docs/HARNESS.md` for role and contract rules.
+
+## Active Phase Plan
+Phase registry:
+- `phases/index.json`
+
+Full phase roadmap:
+1. `0-harness-foundation`: sample metadata, core models, PyMuPDF pre-analysis contract, Markdown quality gates.
+2. `1-core-runtime-contracts`: input normalization, conversion options, output bundle contract, runtime cache policy.
+3. `2-marker-adapter`: Marker invocation, OCR plan handoff, block normalization, parser failure reporting.
+4. `3-formula-pipeline`: formula candidate detection, Nougat command adapter, LaTeX validation/repair, formula reference links.
+5. `4-semantic-enrichment`: reading-order checks, paragraph stitching, header/footer filtering, figure/table/formula reference indexing.
+6. `5-markdown-rendering-assets`: block renderer, table renderer/fallbacks, figure asset writer, chunk renderer.
+7. `6-cli-runtime-resume`: CLI options, progress/logging, resume state, CUDA/OOM policy, model cache/offline support.
+8. `7-mvp-quality-hardening`: sample smoke conversions, quality metrics, regression thresholds, MVP fix sweep.
+9. `8-release-docs-packaging`: usage docs, environment bootstrap docs, license checkpoint, local release checklist.
+10. `9-pyqt-thin-client`: UI API contract, PyQt shell, UI progress/resume, UI packaging notes.
+
+Detailed phase goals and dependencies are recorded in `docs/IMPLEMENTATION_PLAN.md`. Executable step contracts live under each `phases/{phase}/stepN.md`.
+
+## Priority Order
+1. Execute `phases/0-harness-foundation/step0.md` only after the user confirms that implementation should begin.
+2. Keep each implementation step inside its Sprint Contract and TDD requirements.
+3. Review each completed phase before starting the next phase.
+4. Treat PyQt and external API work as post-MVP unless the user explicitly changes scope.
+
+## Acceptance Criteria For Planning Stage
+- `PLAN.md` and `PROGRESS.md` exist and reflect the current goal.
+- `docs/CONVERSION_POLICY.md` records parser, OCR, formula, table, figure, chunk, runtime, logging, resume, and quality-test policy decisions.
+- `docs/TOOLCHAIN.md` records verified local dependency and compatibility decisions.
+- The single-environment decision (one repo-local Python 3.11 `venv`) is recorded.
+- `requirements.txt` records the verified single-environment dependency pins.
+- `docs/HARNESS.md` records the planner/generator/evaluator workflow.
+- `docs/IMPLEMENTATION_PLAN.md` records the full phase roadmap.
+- `phases/index.json` and `phases/*/stepN.md` provide executable self-contained tickets.
+- No custom agent file is created without explicit user approval.
+- Repository validation command remains runnable:
+
+```bash
+python scripts/validate_workspace.py
+```
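
The contents of `scripts/validate_workspace.py` are not shown in this plan. As an illustration only, a check that the files named in the acceptance criteria exist might look like the sketch below. The file list is taken from the criteria above, while `find_missing_files`, `main`, and the choice to test only for existence are assumptions.

```python
from pathlib import Path

# Paths taken from the acceptance criteria; checking only that they exist
# is an assumption about what validation means here.
REQUIRED_FILES = [
    "PLAN.md",
    "PROGRESS.md",
    "requirements.txt",
    "docs/CONVERSION_POLICY.md",
    "docs/TOOLCHAIN.md",
    "docs/HARNESS.md",
    "docs/IMPLEMENTATION_PLAN.md",
    "phases/index.json",
]


def find_missing_files(root: Path, required=REQUIRED_FILES) -> list[str]:
    """Return the required relative paths that are not regular files under root."""
    return [rel for rel in required if not (root / rel).is_file()]


def main() -> int:
    """Print each missing file and return a shell-style exit code."""
    missing = find_missing_files(Path.cwd())
    for rel in missing:
        print(f"missing: {rel}")
    return 1 if missing else 0
```

A script built this way would typically end with `raise SystemExit(main())` so that a missing file fails the command.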