# PDFtoMD Multi-Agent Plan

## Goal
Build a Windows-native, local-first PDF-to-Markdown conversion engine that preserves logical reading order, paragraph flow, formulas, tables, figures, and captions, and that emits chunked output for AI-agent consumption.

## Current Scope
- Primary deliverable: CLI/library conversion engine.
- Primary PDF parser: Marker.
- Formula parser: Nougat, isolated from the main parser path and used only for mathematical expressions/formulas.
- PDF analysis and chunk planning: PyMuPDF.
- Output: chunked Markdown files plus image/table assets under a document slug directory.
- Default chunk size: 20 pages.
- Runtime target: Windows 10, local GPU first, GTX 1070 Ti with 8 GB VRAM.
- User context: personal use.
- Python environment target: one repo-local Python 3.11 environment.

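
The 20-page default and the PyMuPDF chunk-planning role can be illustrated with a short sketch. It is an example under stated assumptions, not the engine's planner: `plan_chunks` and `DEFAULT_CHUNK_PAGES` are hypothetical names, and the page count is passed in as a plain integer so the block runs without PyMuPDF installed.

```python
from typing import List, Tuple

DEFAULT_CHUNK_PAGES = 20  # default chunk size stated in the scope list


def plan_chunks(page_count: int, chunk_pages: int = DEFAULT_CHUNK_PAGES) -> List[Tuple[int, int]]:
    """Return half-open, zero-based page ranges [(start, stop), ...]."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if chunk_pages < 1:
        raise ValueError("chunk_pages must be at least 1")
    return [
        (start, min(start + chunk_pages, page_count))
        for start in range(0, page_count, chunk_pages)
    ]


if __name__ == "__main__":
    # With PyMuPDF installed, the page count could come from
    # `fitz.open(pdf_path).page_count`; a literal is used here so the
    # sketch runs without any third-party package.
    for start, stop in plan_chunks(45):
        print(f"pages {start + 1}-{stop}")
```

For a 45-page document this yields three ranges: pages 1-20, 21-40, and 41-45. Half-open ranges are one possible convention; they make the last, shorter chunk fall out of the same expression as the full ones.
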
## Out of Scope For Now
- PyQt UI implementation.
- Hosted conversion API.
- Default LLM correction path.
- Sidecar metadata/log output unless explicitly requested.
- Custom agent file creation, until the user approves each agent individually.
- Engine implementation outside an approved Harness phase.

## Current Inputs
- Repository instructions: `AGENTS.md`.
- Product/design documents: `docs/PRD.md`, `docs/ARCHITECTURE.md`, `docs/ADR.md`, `docs/TOOLCHAIN.md`, `docs/UI_GUIDE.md`.
- Conversion policy decisions: `docs/CONVERSION_POLICY.md`.
- Harness operating guide: `docs/HARNESS.md`.
- Full implementation roadmap: `docs/IMPLEMENTATION_PLAN.md`.
- Executable phase registry: `phases/index.json`.
- Sample corpus:
  - `samples/2007쉘구조물의유한요소해석에대하여.pdf`
  - `samples/FourNodeQuadrilateralShellElementMITC4.pdf`
  - `samples/MITC공부.pdf`
  - `samples/유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf`

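
The sample corpus mixes Korean and ASCII filenames, and output is written under a per-document slug directory. The sketch below shows one way a slug could be derived while keeping Hangul intact on Windows. The function name `document_slug` and the replacement rule are assumptions for illustration only; the actual slug policy is whatever the project documents decide.

```python
import unicodedata
from pathlib import PureWindowsPath

# Characters Windows forbids in file and directory names.
_WINDOWS_RESERVED = '<>:"/\\|?*'


def document_slug(pdf_path: str) -> str:
    """Derive an output directory name from a PDF path (hypothetical rule)."""
    # PureWindowsPath accepts both "/" and "\\" separators on any host OS.
    stem = PureWindowsPath(pdf_path).stem
    # NFC keeps Hangul syllables composed, so the same name compares equal
    # whether it arrived composed or decomposed.
    stem = unicodedata.normalize("NFC", stem)
    cleaned = "".join(
        "-" if ch in _WINDOWS_RESERVED or ch.isspace() else ch for ch in stem
    )
    # Windows also rejects names that end in a dot or a space.
    return cleaned.strip("-. ") or "document"


if __name__ == "__main__":
    print(document_slug("samples/MITC공부.pdf"))
```

Normalizing to NFC is a design choice rather than a requirement: it avoids two visually identical Korean names mapping to different directories when one of them was produced in decomposed form.
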
## Research Tracks
1. Toolchain research
   - Verify Marker, Nougat, PyMuPDF, PyTorch/CUDA, Pandas, and Markdown/math-rendering constraints.
   - Track licensing risks. Current use is personal, but revisit if distribution or commercial use becomes relevant.
   - Compare the Marker-first architecture against PyMuPDF4LLM, Docling, and MinerU, which serve as quality baselines only.
   - Keep `docs/TOOLCHAIN.md` updated when dependency pins or compatibility findings change.

2. Conversion architecture
   - Define stable internal document/block types.
   - Keep Marker document/block structure as the main source for headings, body text, reading order, figures, tables, and captions.
   - Treat Nougat output as formula text input subject to validation and fallback policy.
   - Keep PyMuPDF responsible for page counts, chunk planning, and low-level PDF/page operations.
   - Follow `docs/CONVERSION_POLICY.md` for OCR decisions, parser handoff rules, fallback behavior, chunk boundary handling, logging, and resume policy.

3. Quality and regression strategy
   - Prefer focused assertions over full Markdown snapshots.
   - Validate headings, formula delimiters, begin/end pairs, table shape, image links, captions, and no-exception conversion.
   - Include Korean filenames and Windows paths in regression coverage.
   - Include VRAM pressure and long-document chunking scenarios.

4. Runtime strategy
   - Use a repo-local Python environment.
   - Use a single `venv` for Marker, PyMuPDF, Pandas, the tests, and Nougat.
   - Use CUDA-enabled PyTorch compatible with the installed NVIDIA driver and GTX 1070 Ti.
   - The currently verified PyTorch pin is `torch==2.7.1+cu126`, because the newer `torch==2.11.0+cu128` build does not support the GTX 1070 Ti's `sm_61` architecture.
   - Keep Nougat dependency pins explicit inside the unified environment, especially `transformers==4.57.6`, `albumentations==1.3.1`, `pypdfium2==4.30.0`, `opencv-python-headless==4.11.0.86`, `Pillow==10.4.0`, and `fsspec==2026.2.0`.

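
The runtime track hinges on whether a given PyTorch build ships kernels for the GTX 1070 Ti's `sm_61` architecture. A minimal sketch of how that could be checked locally follows. The helper names `arch_tag` and `build_supports` are hypothetical; `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are the PyTorch calls assumed to supply the inputs, and the import is guarded so the block still runs where torch is absent.

```python
from typing import Iterable, Tuple


def arch_tag(capability: Tuple[int, int]) -> str:
    """Format a (major, minor) compute capability as an sm_XY tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def build_supports(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """True if the build lists native kernels for this capability.

    This ignores `compute_XX` PTX entries, which in some builds allow a
    JIT fallback, so a False result is a warning sign rather than proof.
    """
    return arch_tag(capability) in set(arch_list)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print(torch.__version__, arch_tag(cap), build_supports(cap, archs))
        else:
            print("CUDA is not available to this torch build")
```

Keeping the comparison in a pure function means the compatibility rule can be covered by a test that needs neither a GPU nor torch.
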
## Created Project Agent Roles
The user approved creating the project-scoped Codex extensions on 2026-04-30. These read-only agents now live under `.codex/agents/`.

1. `pdf_toolchain_researcher`
   - Read-only.
   - Owns official-doc research for Marker, Nougat, PyMuPDF, PyTorch/CUDA, Markdown math, and comparison tools.
   - Outputs compatibility notes, licensing notes, and recommended dependency constraints.

2. `conversion_architect`
   - Read-only at first.
   - Owns engine boundaries, internal data contracts, chunk policy, adapter interfaces, and output contract.
   - Outputs phase-ready architecture notes and acceptance criteria.

3. `quality_evaluator`
   - Read-only at first.
   - Owns sample-corpus classification, focused quality checks, regression fixtures, and failure taxonomy.
   - Outputs test strategy before implementation begins.

4. `formula_pipeline_specialist`
   - Read-only at first.
   - Owns Nougat integration assumptions, formula extraction boundaries, LaTeX delimiter validation, and fallback policy.

5. `layout_table_figure_specialist`
   - Read-only at first.
   - Owns reading order, paragraph stitching, table rendering, figure extraction, caption linking, and cross-reference preservation.

6. `sample_corpus_analyst`
   - Read-only at first.
   - Owns sample PDF corpus analysis, OCR-candidate identification, metadata schema suggestions, and regression implications.

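
The LaTeX delimiter validation owned by `formula_pipeline_specialist`, together with the focused checks owned by `quality_evaluator`, could take the shape of small pure functions. The sketch below is one possible form, not the project's validator: both function names are hypothetical, and the checks cover only `\begin`/`\end` nesting and `$$` parity.

```python
import re
from typing import List

_BEGIN_END = re.compile(r"\\(begin|end)\{([^}]*)\}")


def unbalanced_environments(latex: str) -> List[str]:
    """Return messages for mismatched \\begin{...}/\\end{...} pairs."""
    problems: List[str] = []
    stack: List[str] = []
    for kind, name in _BEGIN_END.findall(latex):
        if kind == "begin":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected \\end{{{name}}}")
    # Anything left on the stack was opened but never closed.
    problems.extend(f"unclosed \\begin{{{name}}}" for name in reversed(stack))
    return problems


def display_delimiters_balanced(markdown: str) -> bool:
    """True if unescaped $$ markers occur an even number of times."""
    return len(re.findall(r"(?<!\\)\$\$", markdown)) % 2 == 0
```

Checks of this kind fit the preference for focused assertions over full Markdown snapshots: a regression test can assert that `unbalanced_environments(chunk_text)` is empty without pinning the exact formula text.
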
## Future Agent Roles
- `marker_adapter_worker`: implementation worker for Marker adapter code, after TDD phase approval.
- `markdown_renderer_worker`: implementation worker for Markdown renderer and output contract, after TDD phase approval.
- `runtime_cli_worker`: implementation worker for CLI/runtime/device behavior, after TDD phase approval.
- `test_fixture_worker`: implementation worker for sample metadata and focused pytest fixtures, after TDD phase approval.

## Harness Execution Model
This project now follows a file-based planner/generator/evaluator workflow for long-running work.

1. Planner creates or updates `phases/` steps from `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, and `docs/*.md`.
2. Generator executes one `stepN.md` at a time and stays inside that step's owned files and Do Not list.
3. Evaluator reviews the result against the step's Sprint Contract and hard thresholds before the work is considered complete.
4. Communication and handoff happen through files, not only through chat:
   - `PLAN.md` for the overall work plan.
   - `PROGRESS.md` for current state and next handoff.
   - `phases/{phase}/index.json` for step execution status.
   - `docs/HARNESS.md` for role and contract rules.

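
Since step execution status lives in `phases/{phase}/index.json`, the generator needs some way to pick the next step to run. The sketch below shows one way that lookup might work. The JSON layout (a `steps` array whose entries carry `step` and `status` keys, with `"completed"` as the finished state) is an assumption made for illustration; the real schema is defined by the repository, not by this example.

```python
import json
from typing import Optional


def next_pending_step(index_text: str) -> Optional[dict]:
    """Return the first step whose status is not "completed", or None."""
    index = json.loads(index_text)
    for step in sorted(index.get("steps", []), key=lambda s: s["step"]):
        if step.get("status") != "completed":
            return step
    return None


if __name__ == "__main__":
    # Inline sample so the sketch runs without the repository checked out.
    sample = json.dumps({
        "phase": "0-harness-foundation",
        "steps": [
            {"step": 0, "status": "completed"},
            {"step": 1, "status": "pending"},
        ],
    })
    print(next_pending_step(sample))
```

Taking the file contents as a string keeps the function free of filesystem access, so the selection rule can be tested in isolation.
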
## Active Phase Plan
Phase registry:
- `phases/index.json`

Full phase roadmap:
1. `0-harness-foundation`: sample metadata, core models, PyMuPDF pre-analysis contract, Markdown quality gates.
2. `1-core-runtime-contracts`: input normalization, conversion options, output bundle contract, runtime cache policy.
3. `2-marker-adapter`: Marker invocation, OCR plan handoff, block normalization, parser failure reporting.
4. `3-formula-pipeline`: formula candidate detection, Nougat command adapter, LaTeX validation/repair, formula reference links.
5. `4-semantic-enrichment`: reading-order checks, paragraph stitching, header/footer filtering, figure/table/formula reference indexing.
6. `5-markdown-rendering-assets`: block renderer, table renderer/fallbacks, figure asset writer, chunk renderer.
7. `6-cli-runtime-resume`: CLI options, progress/logging, resume state, CUDA/OOM policy, model cache/offline support.
8. `7-mvp-quality-hardening`: sample smoke conversions, quality metrics, regression thresholds, MVP fix sweep.
9. `8-release-docs-packaging`: usage docs, environment bootstrap docs, license checkpoint, local release checklist.
10. `9-pyqt-thin-client`: UI API contract, PyQt shell, UI progress/resume, UI packaging notes.

Detailed phase goals and dependencies are recorded in `docs/IMPLEMENTATION_PLAN.md`. Executable step contracts live under each `phases/{phase}/stepN.md`.

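
Phase `0-harness-foundation` lists core models among its deliverables, and later phases (block normalization, rendering) depend on them. As a rough illustration of what stable internal block types might look like, here is a minimal sketch. Every name in it (`BlockKind`, `Block`, `Document`, the `source` field) is hypothetical, and the real contracts are whatever the step files and architecture documents specify.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class BlockKind(Enum):
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    FORMULA = "formula"
    TABLE = "table"
    FIGURE = "figure"
    CAPTION = "caption"


@dataclass(frozen=True)
class Block:
    kind: BlockKind
    page: int                          # zero-based source page
    text: str = ""                     # Markdown or LaTeX payload
    asset_path: Optional[str] = None   # relative path for figure/table assets
    source: str = "marker"             # which parser produced the block


@dataclass
class Document:
    slug: str
    blocks: List[Block] = field(default_factory=list)

    def pages(self, start: int, stop: int) -> List[Block]:
        """Blocks whose page falls in the half-open range [start, stop)."""
        return [b for b in self.blocks if start <= b.page < stop]
```

Recording the producing parser on each block is one way to keep Marker output and Nougat formula text distinguishable after normalization; whether the project wants that field is an open design question.
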
## Priority Order
1. Execute `phases/0-harness-foundation/step0.md` only once the user asks for implementation to begin.
2. Keep each implementation step inside its Sprint Contract and TDD requirements.
3. Review each completed phase before starting the next phase.
4. Treat PyQt and external API work as post-MVP unless the user explicitly changes scope.

## Acceptance Criteria For Planning Stage
- `PLAN.md` and `PROGRESS.md` exist and reflect the current goal.
- `docs/CONVERSION_POLICY.md` records parser, OCR, formula, table, figure, chunk, runtime, logging, resume, and quality-test policy decisions.
- `docs/TOOLCHAIN.md` records verified local dependency and compatibility decisions.
- The environment decision is recorded.
- `requirements.txt` records the verified single-environment dependency pins.
- `docs/HARNESS.md` records the planner/generator/evaluator workflow.
- `docs/IMPLEMENTATION_PLAN.md` records the full phase roadmap.
- `phases/index.json` and `phases/*/stepN.md` provide executable self-contained tickets.
- No custom agent file is created without explicit user approval.
- The repository validation command remains runnable:

```bash
python scripts/validate_workspace.py
```

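
Several of the criteria above reduce to "this file exists". The sketch below shows how such a presence check could be written. It is not the contents of `scripts/validate_workspace.py`, which this plan does not show; `missing_files` and `REQUIRED_FILES` are hypothetical names, and the file list is simply copied from the criteria.

```python
from pathlib import Path
from typing import Iterable, List

# Planning-stage files named in the acceptance criteria.
REQUIRED_FILES = (
    "PLAN.md",
    "PROGRESS.md",
    "requirements.txt",
    "docs/CONVERSION_POLICY.md",
    "docs/TOOLCHAIN.md",
    "docs/HARNESS.md",
    "docs/IMPLEMENTATION_PLAN.md",
    "phases/index.json",
)


def missing_files(root: Path, required: Iterable[str] = REQUIRED_FILES) -> List[str]:
    """Return the required relative paths that are not files under root."""
    return [rel for rel in required if not (root / rel).is_file()]
```

Called as `missing_files(Path.cwd())` from the repository root, an empty list would mean every listed planning file is present. Content-level criteria, such as whether `PLAN.md` reflects the current goal, still need human or evaluator review.
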