PDFtoMD Multi-Agent Plan
Goal
Build a Windows-native, local-first PDF-to-Markdown conversion engine that preserves logical reading order, paragraph flow, formulas, tables, figures, captions, and chunked output for AI-agent consumption.
Current Scope
- Primary deliverable: CLI/library conversion engine.
- Primary PDF parser: Marker.
- Formula parser: Nougat, isolated from the main parser path and used only for mathematical expressions/formulas.
- PDF analysis and chunk planning: PyMuPDF.
- Output: chunked Markdown files plus image/table assets under a document slug directory.
- Default chunk size: 20 pages.
- Runtime target: Windows 10, local GPU first, GTX 1070 Ti with 8 GB VRAM.
- User context: personal use.
- Python environment target: one repo-local Python 3.11 environment.
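The 20-page default chunk size can be illustrated with a small page-range planner. This is only a sketch under stated assumptions: `ChunkPlan` and `plan_chunks` are hypothetical names, not part of the repository, and the page count is assumed to come from PyMuPDF (for example `fitz.open(path).page_count`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkPlan:
    """One planned chunk (hypothetical type, for illustration only)."""
    index: int       # 1-based chunk number
    start_page: int  # 1-based, inclusive
    end_page: int    # 1-based, inclusive


def plan_chunks(page_count: int, chunk_size: int = 20) -> list[ChunkPlan]:
    """Split a document of page_count pages into consecutive page ranges.

    page_count is assumed to be read elsewhere, e.g. via PyMuPDF's
    Document.page_count; this function does no PDF access itself.
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    plans = []
    for index, start in enumerate(range(1, page_count + 1, chunk_size), start=1):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size - 1, page_count)
        plans.append(ChunkPlan(index, start, end))
    return plans
```

With the default size, a 45-page document would yield three chunks covering pages 1-20, 21-40, and 41-45. Keeping the planner free of PDF access makes it easy to test without sample files.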
Out of Scope For Now
- PyQt UI implementation.
- Hosted conversion API.
- Default LLM correction path.
- Sidecar metadata/log output unless explicitly requested.
- Custom agent file creation until the user approves one agent at a time.
- Engine implementation outside an approved Harness phase.
Current Inputs
- Repository instructions: AGENTS.md.
- Product/design documents: docs/PRD.md, docs/ARCHITECTURE.md, docs/ADR.md, docs/TOOLCHAIN.md, docs/UI_GUIDE.md.
- Conversion policy decisions: docs/CONVERSION_POLICY.md.
- Harness operating guide: docs/HARNESS.md.
- Full implementation roadmap: docs/IMPLEMENTATION_PLAN.md.
- Executable phase registry: phases/index.json.
- Sample corpus:
  - samples/2007쉘구조물의유한요소해석에대하여.pdf
  - samples/FourNodeQuadrilateralShellElementMITC4.pdf
  - samples/MITC공부.pdf
  - samples/유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf
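Because the sample corpus mixes Korean and ASCII filenames and the output lives under a document slug directory, a slug function has to keep non-ASCII letters intact. The following is a minimal sketch, assuming a hypothetical `document_slug` helper. The actual slug rules are not defined in this plan, and the lowercasing and hyphenation choices here are illustrative assumptions.

```python
import re
import unicodedata
from pathlib import PureWindowsPath


def document_slug(pdf_path: str) -> str:
    """Derive a directory-safe slug from a PDF path (hypothetical rules).

    PureWindowsPath accepts both backslash and forward-slash separators,
    so the function behaves the same on any host OS.
    """
    stem = PureWindowsPath(pdf_path).stem
    # Normalize to NFC so composed and decomposed Hangul produce the same slug.
    stem = unicodedata.normalize("NFC", stem)
    # \w matches Unicode letters and digits for str patterns, so Korean
    # characters survive; runs of anything else collapse to one hyphen.
    slug = re.sub(r"[^\w]+", "-", stem).strip("-_")
    return slug.lower() or "document"
```

For example, `document_slug(r"C:\docs\My Paper (v2).pdf")` gives `my-paper-v2`, while a Korean filename keeps its Hangul characters unchanged.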
Research Tracks
- Toolchain research
  - Verify Marker, Nougat, PyMuPDF, PyTorch/CUDA, Pandas, and Markdown/math-rendering constraints.
  - Track licensing risks. Current use is personal, but revisit if distribution or commercial use becomes relevant.
  - Compare Marker-first architecture against PyMuPDF4LLM, Docling, and MinerU as quality baselines only.
  - Keep docs/TOOLCHAIN.md updated when dependency pins or compatibility findings change.
- Conversion architecture
  - Define stable internal document/block types.
  - Keep Marker document/block structure as the main source for headings, body text, reading order, figures, tables, and captions.
  - Treat Nougat output as formula text input subject to validation and fallback policy.
  - Keep PyMuPDF responsible for page counts, chunk planning, and low-level PDF/page operations.
  - Follow docs/CONVERSION_POLICY.md for OCR decisions, parser handoff rules, fallback behavior, chunk boundary handling, logging, and resume policy.
- Quality and regression strategy
  - Prefer focused assertions over full Markdown snapshots.
  - Validate headings, formula delimiters, begin/end pairs, table shape, image links, captions, and no-exception conversion.
  - Include Korean filenames and Windows paths in regression coverage.
  - Include VRAM pressure and long-document chunking scenarios.
- Runtime strategy
  - Use repo-local Python environments.
  - Use a single venv for Marker/PyMuPDF/Pandas/tests and Nougat.
  - Use CUDA-enabled PyTorch compatible with the installed NVIDIA driver and GTX 1070 Ti.
  - Current verified PyTorch choice is torch==2.7.1+cu126, because newer torch==2.11.0+cu128 does not support GTX 1070 Ti sm_61.
  - Keep Nougat dependency pins explicit inside the unified environment, especially transformers==4.57.6, albumentations==1.3.1, pypdfium2==4.30.0, opencv-python-headless==4.11.0.86, Pillow==10.4.0, and fsspec==2026.2.0.
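The sm_61 constraint in the runtime strategy could be checked at startup along these lines. This is a simplified sketch: `arch_tag`, `build_supports_device`, and `check_local_gpu` are hypothetical names. The check is an exact match against the architectures the installed PyTorch build reports, which ignores PTX forward-compatibility and is therefore an assumption rather than a complete compatibility test.

```python
def arch_tag(capability: tuple[int, int]) -> str:
    """Convert a CUDA compute capability such as (6, 1) into 'sm_61'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def build_supports_device(capability: tuple[int, int], arch_list: list[str]) -> bool:
    """Return True if the build's architecture list names this device exactly.

    Simplification (assumption): only exact 'sm_XY' entries are considered.
    """
    return arch_tag(capability) in arch_list


def check_local_gpu() -> bool:
    """Query the installed PyTorch build for device 0 (requires torch)."""
    import torch  # imported lazily so the helpers above work without torch

    if not torch.cuda.is_available():
        return False
    capability = torch.cuda.get_device_capability(0)
    return build_supports_device(capability, torch.cuda.get_arch_list())
```

A GTX 1070 Ti reports capability `(6, 1)`, so `check_local_gpu()` would return `False` on a build whose architecture list has no `sm_61` entry.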
Created Project Agent Roles
The user approved creating the project-scoped Codex extensions on 2026-04-30. These read-only agents now live under .codex/agents/.
- pdf_toolchain_researcher
  - Read-only.
  - Owns official-doc research for Marker, Nougat, PyMuPDF, PyTorch/CUDA, Markdown math, and comparison tools.
  - Outputs compatibility notes, licensing notes, and recommended dependency constraints.
- conversion_architect
  - Read-only at first.
  - Owns engine boundaries, internal data contracts, chunk policy, adapter interfaces, and output contract.
  - Outputs phase-ready architecture notes and acceptance criteria.
- quality_evaluator
  - Read-only at first.
  - Owns sample-corpus classification, focused quality checks, regression fixtures, and failure taxonomy.
  - Outputs test strategy before implementation begins.
- formula_pipeline_specialist
  - Read-only at first.
  - Owns Nougat integration assumptions, formula extraction boundaries, LaTeX delimiter validation, and fallback policy.
- layout_table_figure_specialist
  - Read-only at first.
  - Owns reading order, paragraph stitching, table rendering, figure extraction, caption linking, and cross-reference preservation.
- sample_corpus_analyst
  - Read-only at first.
  - Owns sample PDF corpus analysis, OCR-candidate identification, metadata schema suggestions, and regression implications.
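The LaTeX delimiter validation and begin/end pair checks mentioned in this plan lend themselves to focused assertions rather than full snapshots. The sketch below shows one possible shape. The function names are hypothetical, and it assumes display math is delimited with `$$`, which the conversion policy may define differently.

```python
import re

# Matches \begin{name} and \end{name}, capturing the kind and the name.
_ENV = re.compile(r"\\(begin|end)\{([^}]+)\}")


def unbalanced_environments(markdown: str) -> list[str]:
    """Report \\begin/\\end pairs that do not nest correctly."""
    stack: list[str] = []
    problems: list[str] = []
    for match in _ENV.finditer(markdown):
        kind, name = match.groups()
        if kind == "begin":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected \\end{{{name}}}")
    # Anything left on the stack was opened but never closed.
    problems.extend(f"unclosed \\begin{{{name}}}" for name in stack)
    return problems


def display_math_delimiters_balanced(markdown: str) -> bool:
    """Check that unescaped $$ delimiters occur an even number of times."""
    return len(re.findall(r"(?<!\\)\$\$", markdown)) % 2 == 0
```

A regression test could then assert that `unbalanced_environments(chunk_text)` is empty and that the delimiter check passes for each rendered chunk, without comparing the chunk against a stored snapshot.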
Future Agent Roles
- marker_adapter_worker: implementation worker for Marker adapter code, after TDD phase approval.
- markdown_renderer_worker: implementation worker for Markdown renderer and output contract, after TDD phase approval.
- runtime_cli_worker: implementation worker for CLI/runtime/device behavior, after TDD phase approval.
- test_fixture_worker: implementation worker for sample metadata and focused pytest fixtures, after TDD phase approval.
Harness Execution Model
This project now follows a file-based planner/generator/evaluator workflow for long-running work.
- Planner creates or updates phases/ steps from AGENTS.md, PLAN.md, PROGRESS.md, and docs/*.md.
- Generator executes one stepN.md at a time and stays inside that step's owned files and Do Not list.
- Evaluator reviews the result against the step's Sprint Contract and hard thresholds before the work is considered complete.
- Communication and handoff happen through files, not only chat:
  - PLAN.md for overall work plan.
  - PROGRESS.md for current state and next handoff.
  - phases/{phase}/index.json for step execution status.
  - docs/HARNESS.md for role and contract rules.
Active Phase Plan
Phase registry:
phases/index.json
Full phase roadmap:
- 0-harness-foundation: sample metadata, core models, PyMuPDF pre-analysis contract, Markdown quality gates.
- 1-core-runtime-contracts: input normalization, conversion options, output bundle contract, runtime cache policy.
- 2-marker-adapter: Marker invocation, OCR plan handoff, block normalization, parser failure reporting.
- 3-formula-pipeline: formula candidate detection, Nougat command adapter, LaTeX validation/repair, formula reference links.
- 4-semantic-enrichment: reading-order checks, paragraph stitching, header/footer filtering, figure/table/formula reference indexing.
- 5-markdown-rendering-assets: block renderer, table renderer/fallbacks, figure asset writer, chunk renderer.
- 6-cli-runtime-resume: CLI options, progress/logging, resume state, CUDA/OOM policy, model cache/offline support.
- 7-mvp-quality-hardening: sample smoke conversions, quality metrics, regression thresholds, MVP fix sweep.
- 8-release-docs-packaging: usage docs, environment bootstrap docs, license checkpoint, local release checklist.
- 9-pyqt-thin-client: UI API contract, PyQt shell, UI progress/resume, UI packaging notes.
Detailed phase goals and dependencies are recorded in docs/IMPLEMENTATION_PLAN.md. Executable step contracts live under each phases/{phase}/stepN.md.
Priority Order
- Execute phases/0-harness-foundation/step0.md only after the user wants implementation to begin.
- Keep each implementation step inside its Sprint Contract and TDD requirements.
- Review each completed phase before starting the next phase.
- Treat PyQt and external API work as post-MVP unless the user explicitly changes scope.
Acceptance Criteria For Planning Stage
- PLAN.md and PROGRESS.md exist and reflect the current goal.
- docs/CONVERSION_POLICY.md records parser, OCR, formula, table, figure, chunk, runtime, logging, resume, and quality-test policy decisions.
- docs/TOOLCHAIN.md records verified local dependency and compatibility decisions.
- Environment decision is recorded.
- requirements.txt records the verified single-environment dependency pins.
- docs/HARNESS.md records the planner/generator/evaluator workflow.
- docs/IMPLEMENTATION_PLAN.md records the full phase roadmap.
- phases/index.json and phases/*/stepN.md provide executable self-contained tickets.
- No custom agent file is created without explicit user approval.
- Repository validation command remains runnable: python scripts/validate_workspace.py