PDFToMD/PLAN.md
김경종 7e985ae94a add files
2026-04-30 17:05:19 +09:00


# PDFtoMD Multi-Agent Plan
## Goal
Build a Windows-native, local-first PDF-to-Markdown conversion engine that preserves logical reading order, paragraph flow, formulas, tables, figures, and captions, and that emits chunked output for AI-agent consumption.
## Current Scope
- Primary deliverable: CLI/library conversion engine.
- Primary PDF parser: Marker.
- Formula parser: Nougat, isolated from the main parser path and used only for mathematical expressions/formulas.
- PDF analysis and chunk planning: PyMuPDF.
- Output: chunked Markdown files plus image/table assets under a document slug directory.
- Default chunk size: 20 pages.
- Runtime target: Windows 10, local GPU first, GTX 1070 Ti with 8 GB VRAM.
- User context: personal use.
- Python environment target: one repo-local Python 3.11 environment.
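
The 20-page default chunk size can be illustrated with a minimal chunk-planning sketch. The names `ChunkPlan` and `plan_chunks` are hypothetical and do not come from the engine; page numbers are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkPlan:
    """Hypothetical record describing one output chunk."""
    index: int       # 1-based chunk number
    first_page: int  # 1-based, inclusive
    last_page: int   # 1-based, inclusive


def plan_chunks(page_count: int, chunk_size: int = 20) -> list[ChunkPlan]:
    """Split `page_count` pages into consecutive ranges of at most `chunk_size` pages."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    plans = []
    for index, start in enumerate(range(1, page_count + 1, chunk_size), start=1):
        end = min(start + chunk_size - 1, page_count)
        plans.append(ChunkPlan(index, start, end))
    return plans


# In the engine the page count would come from PyMuPDF (for example the
# `page_count` attribute of an opened document); a literal keeps this
# sketch free of dependencies.
for chunk in plan_chunks(45):
    print(chunk.index, chunk.first_page, chunk.last_page)
```

For a 45-page document this yields three chunks: pages 1 to 20, 21 to 40, and 41 to 45. Chunk boundary handling beyond plain page ranges is governed by `docs/CONVERSION_POLICY.md` and is not modeled here.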
## Out of Scope For Now
- PyQt UI implementation.
- Hosted conversion API.
- Default LLM correction path.
- Sidecar metadata/log output unless explicitly requested.
- Custom agent file creation until the user approves one agent at a time.
- Engine implementation outside an approved Harness phase.
## Current Inputs
- Repository instructions: `AGENTS.md`.
- Product/design documents: `docs/PRD.md`, `docs/ARCHITECTURE.md`, `docs/ADR.md`, `docs/TOOLCHAIN.md`, `docs/UI_GUIDE.md`.
- Conversion policy decisions: `docs/CONVERSION_POLICY.md`.
- Harness operating guide: `docs/HARNESS.md`.
- Full implementation roadmap: `docs/IMPLEMENTATION_PLAN.md`.
- Executable phase registry: `phases/index.json`.
- Sample corpus:
  - `samples/2007쉘구조물의유한요소해석에대하여.pdf`
  - `samples/FourNodeQuadrilateralShellElementMITC4.pdf`
  - `samples/MITC공부.pdf`
  - `samples/유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf`
## Research Tracks
1. Toolchain research
   - Verify Marker, Nougat, PyMuPDF, PyTorch/CUDA, Pandas, and Markdown/math-rendering constraints.
   - Track licensing risks. Current use is personal, but revisit if distribution or commercial use becomes relevant.
   - Compare the Marker-first architecture against PyMuPDF4LLM, Docling, and MinerU as quality baselines only.
   - Keep `docs/TOOLCHAIN.md` updated when dependency pins or compatibility findings change.
2. Conversion architecture
   - Define stable internal document/block types.
   - Keep Marker document/block structure as the main source for headings, body text, reading order, figures, tables, and captions.
   - Treat Nougat output as formula text input subject to validation and fallback policy.
   - Keep PyMuPDF responsible for page counts, chunk planning, and low-level PDF/page operations.
   - Follow `docs/CONVERSION_POLICY.md` for OCR decisions, parser handoff rules, fallback behavior, chunk boundary handling, logging, and resume policy.
3. Quality and regression strategy
   - Prefer focused assertions over full Markdown snapshots.
   - Validate headings, formula delimiters, begin/end pairs, table shape, image links, captions, and no-exception conversion.
   - Include Korean filenames and Windows paths in regression coverage.
   - Include VRAM pressure and long-document chunking scenarios.
4. Runtime strategy
   - Use one repo-local Python environment: a single `venv` that hosts Marker, PyMuPDF, Pandas, the tests, and Nougat.
   - Use a CUDA-enabled PyTorch build compatible with the installed NVIDIA driver and the GTX 1070 Ti.
   - The current verified PyTorch choice is `torch==2.7.1+cu126`, because the newer `torch==2.11.0+cu128` does not support the GTX 1070 Ti's `sm_61` compute capability.
   - Keep Nougat dependency pins explicit inside the unified environment, especially `transformers==4.57.6`, `albumentations==1.3.1`, `pypdfium2==4.30.0`, `opencv-python-headless==4.11.0.86`, `Pillow==10.4.0`, and `fsspec==2026.2.0`.
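
The focused checks named under the quality and regression strategy (formula delimiters, begin/end pairs, image links) could look roughly like the sketch below. All function names are hypothetical, and the checks are deliberately naive: escaped `\$` characters and math inside code fences are not handled.

```python
import re

# Matches \begin{name} and \end{name}, capturing the kind and the name.
BEGIN_END = re.compile(r"\\(begin|end)\{([^}]+)\}")
# Matches the target of a Markdown image link such as ![alt](path).
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


def unbalanced_environments(markdown: str) -> list[str]:
    """Report \\begin/\\end pairing problems using a simple stack."""
    problems = []
    stack = []
    for kind, name in BEGIN_END.findall(markdown):
        if kind == "begin":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected \\end{{{name}}}")
    problems.extend(f"unclosed \\begin{{{name}}}" for name in stack)
    return problems


def has_balanced_display_math(markdown: str) -> bool:
    """True when the `$$` delimiter occurs an even number of times."""
    return markdown.count("$$") % 2 == 0


def image_targets(markdown: str) -> list[str]:
    """Return the link targets of all Markdown images, in document order."""
    return IMAGE_LINK.findall(markdown)


sample = "$$\n\\begin{aligned}\na &= b\n\\end{aligned}\n$$\n\n![Figure 1](assets/fig-001.png)\n"
assert unbalanced_environments(sample) == []
assert has_balanced_display_math(sample)
assert image_targets(sample) == ["assets/fig-001.png"]
```

Small assertions of this kind stay stable when parser output changes in harmless ways, which is the stated reason for preferring them over full Markdown snapshots.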
## Created Project Agent Roles
The user approved creating the project-scoped Codex extensions on 2026-04-30. These read-only agents now live under `.codex/agents/`.
1. `pdf_toolchain_researcher`
   - Read-only.
   - Owns official-doc research for Marker, Nougat, PyMuPDF, PyTorch/CUDA, Markdown math, and comparison tools.
   - Outputs compatibility notes, licensing notes, and recommended dependency constraints.
2. `conversion_architect`
   - Read-only at first.
   - Owns engine boundaries, internal data contracts, chunk policy, adapter interfaces, and output contract.
   - Outputs phase-ready architecture notes and acceptance criteria.
3. `quality_evaluator`
   - Read-only at first.
   - Owns sample-corpus classification, focused quality checks, regression fixtures, and failure taxonomy.
   - Outputs test strategy before implementation begins.
4. `formula_pipeline_specialist`
   - Read-only at first.
   - Owns Nougat integration assumptions, formula extraction boundaries, LaTeX delimiter validation, and fallback policy.
5. `layout_table_figure_specialist`
   - Read-only at first.
   - Owns reading order, paragraph stitching, table rendering, figure extraction, caption linking, and cross-reference preservation.
6. `sample_corpus_analyst`
   - Read-only at first.
   - Owns sample PDF corpus analysis, OCR-candidate identification, metadata schema suggestions, and regression implications.
## Future Agent Roles
- `marker_adapter_worker`: implementation worker for Marker adapter code, after TDD phase approval.
- `markdown_renderer_worker`: implementation worker for Markdown renderer and output contract, after TDD phase approval.
- `runtime_cli_worker`: implementation worker for CLI/runtime/device behavior, after TDD phase approval.
- `test_fixture_worker`: implementation worker for sample metadata and focused pytest fixtures, after TDD phase approval.
## Harness Execution Model
This project now follows a file-based planner/generator/evaluator workflow for long-running work.
1. Planner creates or updates `phases/` steps from `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, and `docs/*.md`.
2. Generator executes one `stepN.md` at a time and stays inside that step's owned files and Do Not list.
3. Evaluator reviews the result against the step's Sprint Contract and hard thresholds before the work is considered complete.
4. Communication and handoff happen through files, not only chat:
   - `PLAN.md` for overall work plan.
   - `PROGRESS.md` for current state and next handoff.
   - `phases/{phase}/index.json` for step execution status.
   - `docs/HARNESS.md` for role and contract rules.
## Active Phase Plan
Phase registry: `phases/index.json`

Full phase roadmap:
1. `0-harness-foundation`: sample metadata, core models, PyMuPDF pre-analysis contract, Markdown quality gates.
2. `1-core-runtime-contracts`: input normalization, conversion options, output bundle contract, runtime cache policy.
3. `2-marker-adapter`: Marker invocation, OCR plan handoff, block normalization, parser failure reporting.
4. `3-formula-pipeline`: formula candidate detection, Nougat command adapter, LaTeX validation/repair, formula reference links.
5. `4-semantic-enrichment`: reading-order checks, paragraph stitching, header/footer filtering, figure/table/formula reference indexing.
6. `5-markdown-rendering-assets`: block renderer, table renderer/fallbacks, figure asset writer, chunk renderer.
7. `6-cli-runtime-resume`: CLI options, progress/logging, resume state, CUDA/OOM policy, model cache/offline support.
8. `7-mvp-quality-hardening`: sample smoke conversions, quality metrics, regression thresholds, MVP fix sweep.
9. `8-release-docs-packaging`: usage docs, environment bootstrap docs, license checkpoint, local release checklist.
10. `9-pyqt-thin-client`: UI API contract, PyQt shell, UI progress/resume, UI packaging notes.

Detailed phase goals and dependencies are recorded in `docs/IMPLEMENTATION_PLAN.md`. Executable step contracts live under each `phases/{phase}/stepN.md`.
## Priority Order
1. Execute `phases/0-harness-foundation/step0.md` only after the user asks for implementation to begin.
2. Keep each implementation step inside its Sprint Contract and TDD requirements.
3. Review each completed phase before starting the next phase.
4. Treat PyQt and external API work as post-MVP unless the user explicitly changes scope.
## Acceptance Criteria For Planning Stage
- `PLAN.md` and `PROGRESS.md` exist and reflect the current goal.
- `docs/CONVERSION_POLICY.md` records parser, OCR, formula, table, figure, chunk, runtime, logging, resume, and quality-test policy decisions.
- `docs/TOOLCHAIN.md` records verified local dependency and compatibility decisions.
- Environment decision is recorded.
- `requirements.txt` records the verified single-environment dependency pins.
- `docs/HARNESS.md` records the planner/generator/evaluator workflow.
- `docs/IMPLEMENTATION_PLAN.md` records the full phase roadmap.
- `phases/index.json` and `phases/*/stepN.md` provide executable self-contained tickets.
- No custom agent file is created without explicit user approval.
- Repository validation command remains runnable:
```bash
python scripts/validate_workspace.py
```