PDFtoMD Multi-Agent Plan

Goal

Build a Windows-native, local-first PDF-to-Markdown conversion engine that preserves logical reading order, paragraph flow, formulas, tables, figures, captions, and chunked output for AI-agent consumption.

Current Scope

Primary deliverable: CLI/library conversion engine.
Primary PDF parser: Marker.
Formula parser: Nougat, isolated from the main parser path and used only for mathematical expressions/formulas.
PDF analysis and chunk planning: PyMuPDF.
Output: chunked Markdown files plus image/table assets under a document slug directory.
Default chunk size: 20 pages.
Runtime target: Windows 10, local GPU first, GTX 1070 Ti with 8 GB VRAM.
User context: personal use.
Python environment target: one repo-local Python 3.11 environment.
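
The output directory is named after a document slug, and the sample corpus includes Korean filenames. The plan does not define the slug rule, so the following is only a minimal sketch under assumptions: the function name `document_slug`, the fallback name `document`, and the choice to replace whitespace and Windows-reserved characters with hyphens are all hypothetical.

```python
import re
import unicodedata
from pathlib import PureWindowsPath

# Characters that Windows does not allow in file or directory names.
_WINDOWS_RESERVED = '<>:"/\\|?*'


def document_slug(pdf_path: str) -> str:
    """Derive a directory-safe slug from a PDF path (hypothetical rule)."""
    # PureWindowsPath accepts both "/" and "\\" separators on any OS.
    stem = PureWindowsPath(pdf_path).stem
    # Normalize to NFC so composed and decomposed Hangul compare equal.
    stem = unicodedata.normalize("NFC", stem)
    cleaned = "".join(
        "-" if ch in _WINDOWS_RESERVED or ch.isspace() else ch for ch in stem
    )
    # Collapse hyphen runs and trim characters Windows dislikes at the edges.
    cleaned = re.sub(r"-+", "-", cleaned).strip("-. ")
    return cleaned or "document"
```

Keeping non-ASCII characters rather than transliterating them is a design choice in this sketch; it keeps the slug recognizable next to the source PDF.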

Out of Scope For Now

PyQt UI implementation.
Hosted conversion API.
Default LLM correction path.
Sidecar metadata/log output unless explicitly requested.
Custom agent file creation until the user approves one agent at a time.
Engine implementation outside an approved Harness phase.

Current Inputs

Repository instructions: AGENTS.md.
Product/design documents: docs/PRD.md, docs/ARCHITECTURE.md, docs/ADR.md, docs/TOOLCHAIN.md, docs/UI_GUIDE.md.
Conversion policy decisions: docs/CONVERSION_POLICY.md.
Harness operating guide: docs/HARNESS.md.
Full implementation roadmap: docs/IMPLEMENTATION_PLAN.md.
Executable phase registry: phases/index.json.
Sample corpus:
- samples/2007쉘구조물의유한요소해석에대하여.pdf
- samples/FourNodeQuadrilateralShellElementMITC4.pdf
- samples/MITC공부.pdf
- samples/유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf

Research Tracks

Toolchain research
- Verify Marker, Nougat, PyMuPDF, PyTorch/CUDA, Pandas, and Markdown/math-rendering constraints.
- Track licensing risks. Current use is personal, but revisit if distribution or commercial use becomes relevant.
- Compare Marker-first architecture against PyMuPDF4LLM, Docling, and MinerU as quality baselines only.
- Keep docs/TOOLCHAIN.md updated when dependency pins or compatibility findings change.
Conversion architecture
- Define stable internal document/block types.
- Keep Marker document/block structure as the main source for headings, body text, reading order, figures, tables, and captions.
- Treat Nougat output as formula text input subject to validation and fallback policy.
- Keep PyMuPDF responsible for page counts, chunk planning, and low-level PDF/page operations.
- Follow docs/CONVERSION_POLICY.md for OCR decisions, parser handoff rules, fallback behavior, chunk boundary handling, logging, and resume policy.
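
As an illustration of chunk planning with the default chunk size of 20 pages, the sketch below splits a page count into inclusive page ranges. The names `ChunkPlan` and `plan_chunks` are hypothetical and not taken from the project. The page count is assumed to come from PyMuPDF (for example `Document.page_count`), and chunk boundary handling in docs/CONVERSION_POLICY.md may refine these ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkPlan:
    index: int       # 1-based chunk number
    start_page: int  # 1-based, inclusive
    end_page: int    # 1-based, inclusive


def plan_chunks(page_count: int, chunk_size: int = 20) -> list[ChunkPlan]:
    """Split page_count pages into consecutive chunks of at most chunk_size."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    plans = []
    starts = range(1, page_count + 1, chunk_size)
    for index, start in enumerate(starts, start=1):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size - 1, page_count)
        plans.append(ChunkPlan(index=index, start_page=start, end_page=end))
    return plans
```

Under these assumptions a 45-page document yields three chunks covering pages 1-20, 21-40, and 41-45.
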
Quality and regression strategy
- Prefer focused assertions over full Markdown snapshots.
- Validate headings, formula delimiters, begin/end pairs, table shape, image links, captions, and no-exception conversion.
- Include Korean filenames and Windows paths in regression coverage.
- Include VRAM pressure and long-document chunking scenarios.
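
A focused assertion on formula delimiters and begin/end pairs could look like the following sketch. It is one possible shape, not the project's quality gate: the function names are hypothetical, and the checks only cover `$$` display delimiters and `\begin{...}`/`\end{...}` nesting.

```python
import re

_ENV_RE = re.compile(r"\\(begin|end)\{([^}]+)\}")
# A "$$" that is not escaped with a backslash.
_DISPLAY_RE = re.compile(r"(?<!\\)\$\$")


def unbalanced_environments(markdown: str) -> list[str]:
    """Return a description of every unmatched \\begin or \\end."""
    stack: list[str] = []
    problems: list[str] = []
    for match in _ENV_RE.finditer(markdown):
        kind, name = match.group(1), match.group(2)
        if kind == "begin":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected \\end{{{name}}}")
    # Anything left on the stack was opened but never closed.
    problems.extend(f"unclosed \\begin{{{name}}}" for name in stack)
    return problems


def has_balanced_display_math(markdown: str) -> bool:
    """True when unescaped $$ delimiters occur an even number of times."""
    return len(_DISPLAY_RE.findall(markdown)) % 2 == 0
```

In a pytest regression, each converted chunk would be expected to give an empty problem list and balanced display delimiters, which avoids comparing full Markdown snapshots.
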
Runtime strategy
- Use a repo-local Python environment.
- Use a single venv for Marker, PyMuPDF, Pandas, tests, and Nougat.
- Use CUDA-enabled PyTorch compatible with the installed NVIDIA driver and GTX 1070 Ti.
- The current verified PyTorch choice is torch==2.7.1+cu126, because the newer torch==2.11.0+cu128 build does not support the GTX 1070 Ti (compute capability sm_61).
- Keep Nougat dependency pins explicit inside the unified environment, especially transformers==4.57.6, albumentations==1.3.1, pypdfium2==4.30.0, opencv-python-headless==4.11.0.86, Pillow==10.4.0, and fsspec==2026.2.0.
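
The sm_61 constraint can be checked at runtime. The sketch below compares the GPU's compute capability with the architectures an installed PyTorch build was compiled for, using `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`. The helper names are hypothetical, and the exact-match rule is a conservative simplification: CUDA can also run binaries built for a lower minor version of the same major architecture, or JIT-compile PTX.

```python
def arch_supported(capability: tuple[int, int], arch_list: list[str]) -> bool:
    """True when the build lists an exact sm_XY entry for this capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def local_gpu_supported(device_index: int = 0) -> bool:
    """Check the local GPU against the installed PyTorch build."""
    # Imported lazily so arch_supported stays usable without PyTorch.
    import torch

    if not torch.cuda.is_available():
        return False
    capability = torch.cuda.get_device_capability(device_index)
    return arch_supported(capability, torch.cuda.get_arch_list())
```

For the GTX 1070 Ti the capability is `(6, 1)`, so the check passes only when the build's architecture list contains `sm_61`.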

Created Project Agent Roles

The user approved creating the project-scoped Codex extensions on 2026-04-30. These read-only agents now live under .codex/agents/.

pdf_toolchain_researcher
- Read-only.
- Owns official-doc research for Marker, Nougat, PyMuPDF, PyTorch/CUDA, Markdown math, and comparison tools.
- Outputs compatibility notes, licensing notes, and recommended dependency constraints.
conversion_architect
- Read-only at first.
- Owns engine boundaries, internal data contracts, chunk policy, adapter interfaces, and output contract.
- Outputs phase-ready architecture notes and acceptance criteria.
quality_evaluator
- Read-only at first.
- Owns sample-corpus classification, focused quality checks, regression fixtures, and failure taxonomy.
- Outputs test strategy before implementation begins.
formula_pipeline_specialist
- Read-only at first.
- Owns Nougat integration assumptions, formula extraction boundaries, LaTeX delimiter validation, and fallback policy.
layout_table_figure_specialist
- Read-only at first.
- Owns reading order, paragraph stitching, table rendering, figure extraction, caption linking, and cross-reference preservation.
sample_corpus_analyst
- Read-only at first.
- Owns sample PDF corpus analysis, OCR-candidate identification, metadata schema suggestions, and regression implications.

Future Agent Roles

marker_adapter_worker: implementation worker for Marker adapter code, after TDD phase approval.
markdown_renderer_worker: implementation worker for Markdown renderer and output contract, after TDD phase approval.
runtime_cli_worker: implementation worker for CLI/runtime/device behavior, after TDD phase approval.
test_fixture_worker: implementation worker for sample metadata and focused pytest fixtures, after TDD phase approval.

Harness Execution Model

This project now follows a file-based planner/generator/evaluator workflow for long-running work.

Planner creates or updates phases/ steps from AGENTS.md, PLAN.md, PROGRESS.md, and docs/*.md.
Generator executes one stepN.md at a time and stays inside that step's owned files and Do Not list.
Evaluator reviews the result against the step's Sprint Contract and hard thresholds before the work is considered complete.
Communication and handoff happen through files, not only chat:
- PLAN.md for overall work plan.
- PROGRESS.md for current state and next handoff.
- phases/{phase}/index.json for step execution status.
- docs/HARNESS.md for role and contract rules.
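
To illustrate file-based handoff, the sketch below reads a phase index and reports the next step that is not yet complete. The schema of phases/{phase}/index.json is not specified here, so the `steps`, `step`, and `status` keys and the `completed` value are assumptions made for this example only.

```python
import json
from pathlib import Path
from typing import Optional


def load_phase_index(phase_dir: Path) -> dict:
    """Read index.json from a phase directory."""
    return json.loads((phase_dir / "index.json").read_text(encoding="utf-8"))


def next_pending_step(index: dict) -> Optional[int]:
    """Return the first step number whose status is not 'completed'."""
    # Hypothetical schema: {"steps": [{"step": 0, "status": "pending"}, ...]}
    for entry in index.get("steps", []):
        if entry.get("status") != "completed":
            return entry["step"]
    return None
```

A generator could call `next_pending_step(load_phase_index(Path("phases/0-harness-foundation")))` to decide which stepN.md to open, provided the real index follows a similar shape.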

Active Phase Plan

Phase registry:

phases/index.json

Full phase roadmap:

0-harness-foundation: sample metadata, core models, PyMuPDF pre-analysis contract, Markdown quality gates.
1-core-runtime-contracts: input normalization, conversion options, output bundle contract, runtime cache policy.
2-marker-adapter: Marker invocation, OCR plan handoff, block normalization, parser failure reporting.
3-formula-pipeline: formula candidate detection, Nougat command adapter, LaTeX validation/repair, formula reference links.
4-semantic-enrichment: reading-order checks, paragraph stitching, header/footer filtering, figure/table/formula reference indexing.
5-markdown-rendering-assets: block renderer, table renderer/fallbacks, figure asset writer, chunk renderer.
6-cli-runtime-resume: CLI options, progress/logging, resume state, CUDA/OOM policy, model cache/offline support.
7-mvp-quality-hardening: sample smoke conversions, quality metrics, regression thresholds, MVP fix sweep.
8-release-docs-packaging: usage docs, environment bootstrap docs, license checkpoint, local release checklist.
9-pyqt-thin-client: UI API contract, PyQt shell, UI progress/resume, UI packaging notes.

Detailed phase goals and dependencies are recorded in docs/IMPLEMENTATION_PLAN.md. Executable step contracts live under each phases/{phase}/stepN.md.

Priority Order

Execute phases/0-harness-foundation/step0.md only after the user confirms that implementation should begin.
Keep each implementation step inside its Sprint Contract and TDD requirements.
Review each completed phase before starting the next phase.
Treat PyQt and external API work as post-MVP unless the user explicitly changes scope.

Acceptance Criteria For Planning Stage

PLAN.md and PROGRESS.md exist and reflect the current goal.
docs/CONVERSION_POLICY.md records parser, OCR, formula, table, figure, chunk, runtime, logging, resume, and quality-test policy decisions.
docs/TOOLCHAIN.md records verified local dependency and compatibility decisions.
Environment decision is recorded.
requirements.txt records the verified single-environment dependency pins.
docs/HARNESS.md records the planner/generator/evaluator workflow.
docs/IMPLEMENTATION_PLAN.md records the full phase roadmap.
phases/index.json and phases/*/stepN.md provide executable self-contained tickets.
No custom agent file is created without explicit user approval.
Repository validation command remains runnable:

python scripts/validate_workspace.py
