PDFtoMD Progress

Current Status

Date: 2026-04-30.
Mode: Phase 1 implementation complete; ready for Phase 1 review or Phase 2 handoff.
Implementation status: Phase 0 and Phase 1 complete.
Custom agent creation: project-scoped read-only agents created after user approval.
Persistent multi-agent coordination files were approved by the user.

Completed

Read repository instructions and project documents:
- AGENTS.md
- docs/PRD.md
- docs/ARCHITECTURE.md
- docs/ADR.md
- docs/UI_GUIDE.md
Confirmed project direction:
- Windows-native local CLI/library engine first.
- Marker for document structure, reading order, tables, figures, headings, and captions.
- Nougat only for mathematical expressions/formulas.
- PyMuPDF for PDF analysis and chunk planning.
- Markdown chunk output plus image/table assets.
Recorded detailed conversion policy decisions in docs/CONVERSION_POLICY.md.
Strengthened project documentation with current research and decisions:
- AGENTS.md
- README.md
- docs/PRD.md
- docs/ARCHITECTURE.md
- docs/ADR.md
- docs/TOOLCHAIN.md
- docs/UI_GUIDE.md
Strengthened AGENTS.md multi-agent coordination rules so every new agent reads PLAN.md and PROGRESS.md first and can identify current goals, assigned scope, completed work, blockers, next work, and conflict risks.
Created project-scoped Codex extensions under .codex/:
- Agents: pdf-toolchain-researcher, sample-corpus-analyst, conversion-architect, quality-evaluator, formula-pipeline-specialist, layout-table-figure-specialist.
- Commands: status, env-check, sample-audit, quality-plan, conversion-policy-review, model-cache-check, phase-draft.
- Skills: pdf-toolchain, sample-corpus, conversion-architecture, formula-quality, markdown-quality, windows-runtime.
- Hooks: strengthened risky command guard, added handoff policy and drift policy hooks.
Validated .codex extension formats:
- Agent TOML files parsed successfully.
- .codex/hooks.json parsed successfully.
- Hook Python scripts compiled successfully.
- All .codex/skills/*/SKILL.md files passed skill-creator quick validation.
Confirmed user environment:
- Windows 10.
- NVIDIA GeForce GTX 1070 Ti.
- 8 GB VRAM.
- NVIDIA driver 577.00.
- nvidia-smi reports CUDA version 12.9 (the highest CUDA version the installed driver supports).
- User reports CUDA 12.4 installed.
- Current detected Python: Miniforge Python 3.12.7.
- Conda is available.
- uv is not available.
Created repo-local environment:
- venv: Python 3.11.15, unified Marker/PyMuPDF/Pandas/test/Nougat environment.
Removed previous experimental venv-nougat directory after unified venv validation passed.
Verified unified environment:
- torch==2.7.1+cu126
- torchvision==0.22.1+cu126
- marker-pdf==1.10.2
- nougat-ocr==0.1.17
- transformers==4.57.6
- albumentations==1.3.1
- fsspec==2026.2.0
- pymupdf==1.27.2.3
- pandas==3.0.2
- pytest==9.0.3
- Pillow==10.4.0
- pypdfium2==4.30.0
- opencv-python-headless==4.11.0.86
- pip check: passed.
- CUDA tensor operation on GTX 1070 Ti: passed.
- venv\Scripts\nougat.exe --help: passed.
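The pins above can be re-checked programmatically instead of by reading package listings. The following is a minimal sketch, not a script from the repository: the helper names `installed_version` and `find_pin_mismatches` are hypothetical, and the pin table copies only a few entries from the list above. It uses only `importlib.metadata` from the standard library.

```python
from importlib import metadata

# A few pins copied from the verified environment list; extend as needed.
EXPECTED_PINS = {
    "marker-pdf": "1.10.2",
    "nougat-ocr": "0.1.17",
    "transformers": "4.57.6",
    "albumentations": "1.3.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_pin_mismatches(pins, lookup=installed_version):
    """Return {name: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # Hypothetical usage: print one line per drifted or missing package.
    for name, (expected, found) in find_pin_mismatches(EXPECTED_PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.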
Ran earlier repository validation before default Python test discovery was added:
- python scripts/validate_workspace.py: passed at that time with no configured validation commands.
Confirmed sample PDFs:
- samples/2007쉘구조물의유한요소해석에대하여.pdf: 13 pages, first page text length 3523, first page images 0.
- samples/FourNodeQuadrilateralShellElementMITC4.pdf: 7 pages, first page text length 3269, first page images 0.
- samples/MITC공부.pdf: 13 pages, first page text length 226, first page images 2.
- samples/유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf: 76 pages, first page text length 446, first page images 10.
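Figures like the ones above can be gathered with a few PyMuPDF calls. This is a sketch under assumptions: the function names are hypothetical, and whether the recorded text lengths came from plain `get_text()` output is not stated in this log. PyMuPDF is imported lazily so the summarizing helper can be exercised with any object that exposes the same attributes.

```python
def summarize_first_page(doc):
    """Collect page count plus first-page text length and image count.

    `doc` is expected to behave like a PyMuPDF Document.
    """
    first_page = doc[0]
    return {
        "pages": doc.page_count,
        "first_page_text_length": len(first_page.get_text()),
        "first_page_images": len(first_page.get_images(full=True)),
    }


def summarize_pdf(path):
    """Open a PDF with PyMuPDF and summarize it (requires pymupdf)."""
    import pymupdf  # imported lazily so the helper above stays testable

    with pymupdf.open(path) as doc:
        return summarize_first_page(doc)
```

A short first-page text length combined with several images, as in the last two samples, is the kind of signal that page-level OCR decisions could start from.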
Strengthened the project for Anthropic-style Harness Engineering:
- Added docs/HARNESS.md with planner/generator/evaluator roles, file protocol, Sprint Contract template, evaluator hard thresholds, and simplification rules.
- Added executable phase registry phases/index.json.
- Added first self-contained phase phases/0-harness-foundation/ with four pending steps:
  - sample-metadata-contract
  - core-package-skeleton
  - page-preanalysis-contract
  - markdown-quality-gates
- Updated AGENTS.md, PLAN.md, README.md, docs/ARCHITECTURE.md, and docs/ADR.md to reference the Harness workflow.
- Added .codex/commands/sprint-contract.md.
- Strengthened Harness workflow/review skill guidance to require Sprint Contracts.
- Updated hooks for simpler Windows-friendly command paths and expanded handoff checks to include phases/, scripts/, .agents/, and plugins/.
- Made scripts/validate_workspace.py discover repo-local Python validation by default.
- Added scripts/test_validate_workspace.py and fixed scripts/test_execute.py UTF-8 fixture handling on Windows.
Established the full phase-by-phase implementation roadmap before starting engine implementation:
- Added docs/IMPLEMENTATION_PLAN.md.
- Expanded phases/index.json from Phase 0 only to Phases 0 through 9.
- Added executable pending step contracts for:
  - 1-core-runtime-contracts
  - 2-marker-adapter
  - 3-formula-pipeline
  - 4-semantic-enrichment
  - 5-markdown-rendering-assets
  - 6-cli-runtime-resume
  - 7-mvp-quality-hardening
  - 8-release-docs-packaging
  - 9-pyqt-thin-client
- Updated PLAN.md, AGENTS.md, and README.md to point new agents to the full implementation roadmap.
Implemented Phase 0 Harness foundation:
- Step 0 sample-metadata-contract: added deterministic samples/metadata.json and metadata contract tests.
- Step 1 core-package-skeleton: added pyproject.toml, importable src/pdftomd package, typed model contracts, and model tests.
- Step 2 page-preanalysis-contract: added PyMuPDF-only analyze_pdf() preanalysis, deterministic OCR candidate logic, and chunk candidate tests.
- Step 3 markdown-quality-gates: added focused Markdown quality gates and tests for math delimiters, LaTeX environments, image links, tables, chunk frontmatter, and anchors.
- Parallel work was split by disjoint write scopes: sample metadata/model contracts first, then preanalysis/quality gates.
Reviewed Phase 0 with harness-review criteria:
- No blocking findings.
- Architecture boundary remained intact: Marker and Nougat are not invoked in foundation contracts, and PyMuPDF is limited to page pre-analysis.
- python scripts\validate_workspace.py passed before Phase 1 work started.
Implemented Phase 1 core runtime contracts:
- Step 0 input-normalization-slug: added deterministic PDF path normalization, document identity creation, anchors, and output bundle path contracts.
- Step 1 conversion-options-config: added typed conversion options, runtime modes, and formula parser options without CLI parsing.
- Step 2 output-bundle-contract: added deterministic document bundle paths while keeping runtime artifacts separate from document output.
- Step 3 runtime-cache-policy: added explicit .models/ default cache policy, PDFTOMD_MODEL_CACHE override, Hugging Face offline environment mappings, and runtime artifact paths.
- Updated docs/TOOLCHAIN.md and .gitignore for model cache policy.
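The cache policy in Step 3 can be pictured as follows. This is a sketch, not the code in src/pdftomd: the function names are hypothetical, and the choice of `HF_HOME`, `HF_HUB_OFFLINE`, and `TRANSFORMERS_OFFLINE` as the offline mappings is an assumption (they are real Hugging Face environment variables, but this log does not list which ones the project sets).

```python
import os
from pathlib import Path

DEFAULT_CACHE_DIRNAME = ".models"
CACHE_ENV_VAR = "PDFTOMD_MODEL_CACHE"


def resolve_model_cache(project_root, env=None):
    """Return the model cache directory, honoring the override variable."""
    env = os.environ if env is None else env
    override = env.get(CACHE_ENV_VAR, "").strip()
    if override:
        return Path(override).expanduser()
    return Path(project_root) / DEFAULT_CACHE_DIRNAME


def offline_environment(cache_dir):
    """Build environment variables that keep Hugging Face tooling offline.

    Assumption: these three variables stand in for the project's mappings.
    """
    return {
        "HF_HOME": str(cache_dir),
        "HF_HUB_OFFLINE": "1",
        "TRANSFORMERS_OFFLINE": "1",
    }
```

Passing `env` explicitly keeps the resolution deterministic in tests, which matches the deterministic-contract style of the other Phase 1 steps.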

Web Research Notes

Marker currently supports Markdown/JSON/chunks/HTML output and includes tables, equations, inline math, image extraction, layout, and reading-order functionality.
Nougat is the intended isolated formula parser candidate; Windows GPU use depends on a correct PyTorch install.
PyMuPDF remains appropriate for page counting, PDF splitting/chunk planning, and low-level image/page operations.
PyMuPDF4LLM, Docling, and MinerU are useful comparison baselines but are not the primary parser under the current architecture.
MathJax notes that $...$ inline math can conflict with ordinary dollar signs, so delimiter validation is required.
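One way to approach delimiter validation is sketched below. It is an assumption-laden illustration, not the project's quality gate: the function name is hypothetical, and the rule is borrowed from Pandoc's `tex_math_dollars` convention (the opening `$` must be followed by a non-space character, the closing `$` must be preceded by a non-space character and must not be followed by a digit). Spans that fail the rule are reported as suspicious rather than silently treated as math.

```python
import re

# Candidate inline span: unescaped "$", content without "$", unescaped "$".
# "$$" display delimiters are deliberately not matched.
INLINE_SPAN = re.compile(r"(?<![\\$])\$([^$\n]+?)(?<!\\)\$(?!\$)")


def classify_dollar_spans(line):
    """Split candidate $...$ spans on one line into (math, suspicious) lists."""
    math, suspicious = [], []
    pos = 0
    while (m := INLINE_SPAN.search(line, pos)):
        body = m.group(1)
        next_char = line[m.end():m.end() + 1]
        looks_like_math = (
            not body[0].isspace()
            and not body[-1].isspace()
            and not next_char.isdigit()
        )
        if looks_like_math:
            math.append(body)
            pos = m.end()
        else:
            suspicious.append(body)
            # Retry from just after the opening "$" so the closing "$"
            # can still open a genuine math span later on the line.
            pos = m.start() + 1
    return math, suspicious
```

For example, a line such as `It costs $5 and $10 today.` yields no math spans and one suspicious span, which is the currency collision the MathJax note warns about.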

In Progress

None.

Blockers

None yet.

Decisions

Personal-use context lowers immediate licensing risk, but Marker GPL/model license implications must be revisited before redistribution or commercial use.
Mixed text/scanned PDFs are in scope, with page-level OCR intervention decisions based on lightweight text-layer quality analysis.
Marker owns layout, reading order, body text, headings, tables, figures, captions, and OCR/layout handling.
Nougat owns only mathematical expressions and formula blocks, with Marker text fallback on failure.
Markdown tables are preferred, but limited HTML tables and table-region screenshot fallbacks are allowed for complex tables.
Figure/table/formula numbers and body references should become internal Markdown links when confidence is sufficient.
Chunking should prefer logical block boundaries over strict 20-page boundaries when a block would be split.
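The chunking preference can be illustrated with a small planner. This is a sketch only: `plan_chunks` is a hypothetical name, blocks are simplified to inclusive page spans, and the policy of extending a chunk (rather than ending it early) when a block straddles the boundary is an assumption, since the decision above does not say which direction to move the boundary.

```python
def plan_chunks(total_pages, blocks, target_size=20):
    """Plan 1-based inclusive page ranges without splitting any block.

    `blocks` is a list of (first_page, last_page) spans that must stay whole.
    """
    chunks = []
    start = 1
    while start <= total_pages:
        end = min(start + target_size - 1, total_pages)
        # Keep extending while the chunk's last page falls inside a block.
        extended = True
        while extended:
            extended = False
            for first, last in blocks:
                last = min(last, total_pages)  # clamp blocks to the document
                if first <= end < last:
                    end = last
                    extended = True
        chunks.append((start, end))
        start = end + 1
    return chunks
```

With a 76-page document and one block spanning pages 19 to 22, the first chunk grows to pages 1 to 22 and the following chunks resume the 20-page cadence from page 23.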
Chunk Markdown may include concise frontmatter with core context, but document-output sidecars remain out of scope by default.
CLI should write warnings/errors to stderr and local logs, not into generated Markdown.
Resume support may use local runtime state/cache files to skip successful chunks.
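A resume mechanism of this kind could be as small as a JSON status file. The sketch below is hypothetical throughout: the file name, the chunk identifiers, and the `"ok"` status value are illustrative assumptions, not the project's runtime state format.

```python
import json
import tempfile
from pathlib import Path


def load_state(state_path):
    """Return the saved chunk-status mapping, or an empty one if absent."""
    path = Path(state_path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def mark_chunk(state_path, chunk_id, status):
    """Record a chunk status, writing through a temp file and a replace."""
    path = Path(state_path)
    state = load_state(path)
    state[chunk_id] = status
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state, indent=2, sort_keys=True), encoding="utf-8")
    tmp.replace(path)  # avoids leaving a half-written state file behind
    return state


def pending_chunks(chunk_ids, state):
    """Return the chunks that still need work, in their original order."""
    return [cid for cid in chunk_ids if state.get(cid) != "ok"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        state_file = Path(tmp_dir) / "resume-state.json"
        mark_chunk(state_file, "chunk-0001", "ok")
        mark_chunk(state_file, "chunk-0002", "failed")
        state = load_state(state_file)
        print(pending_chunks(["chunk-0001", "chunk-0002", "chunk-0003"], state))
```

Keeping this file under the runtime artifact area rather than in the document bundle would be consistent with the output-bundle decision recorded earlier in this log.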
Additional custom agents are created only one at a time, after explicit user approval.
Planning files are the source of truth for multi-agent coordination.
Harness phase files now exist. PLAN.md remains the overall plan, PROGRESS.md remains the handoff state, and phases/{phase}/index.json is the phase execution status.
Each future implementation step should use the docs/HARNESS.md planner/generator/evaluator workflow and include a Sprint Contract before code changes.
Full implementation sequencing is recorded in docs/IMPLEMENTATION_PLAN.md; phase files are pending tickets and should not be executed out of dependency order.
Phase 0 and Phase 1 are complete. phases/index.json marks both 0-harness-foundation and 1-core-runtime-contracts as completed.
Main and Nougat dependencies can share one environment when Nougat's loose dependencies are pinned explicitly.
torch==2.11.0+cu128 was rejected for this machine because it does not support GTX 1070 Ti sm_61.
torch==2.7.1+cu126 was selected because it satisfies Marker torch>=2.7.0 and successfully runs CUDA tensor operations on GTX 1070 Ti.
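The compatibility question behind the two torch decisions can be checked before committing to a build. The sketch below is an approximation with hypothetical helper names: it treats a build as usable when it ships an `sm_XY` binary with the same major version and a minor version no higher than the device's, and it ignores PTX (`compute_XY`) just-in-time compilation and architecture-specific suffixes. `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are real PyTorch calls; torch is imported lazily so the pure check runs without it.

```python
def build_supports_device(capability, arch_list):
    """Return True when a build's binaries cover a (major, minor) device."""
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX entries such as "compute_90"
        digits = arch[3:]
        if not digits.isdigit():
            continue  # skip suffixed names such as "sm_90a"
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if arch_major == major and arch_minor <= minor:
            return True
    return False


def check_current_device():
    """Check device 0 against the installed PyTorch build (requires torch)."""
    import torch  # lazy import keeps the helper above usable without torch

    if not torch.cuda.is_available():
        return False
    return build_supports_device(
        torch.cuda.get_device_capability(0), torch.cuda.get_arch_list()
    )
```

A passing result here is only a first filter; the CUDA tensor operation recorded in the verified environment list remains the stronger check.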
nougat-ocr==0.1.17 requires dependency pins:
- transformers==4.57.6, because transformers 5.7.0 breaks Nougat imports.
- albumentations==1.3.1, because albumentations 2.x breaks Nougat transform initialization.
- fsspec==2026.2.0, because newer fsspec conflicts with datasets.
- pypdfium2==4.30.0, opencv-python-headless==4.11.0.86, and Pillow==10.4.0, because Marker/Surya depend on these versions and Nougat can operate with them.

Next Work

Review Phase 1 output with harness-review before moving to Phase 2.
If review passes, start phases/2-marker-adapter/step0.md.
Execute phases in order unless PLAN.md and docs/IMPLEMENTATION_PLAN.md are updated with a clear dependency rationale.
Do not create new custom agents unless the user explicitly approves another agent.

Latest Validation

.\venv\python.exe -m pytest scripts\test_validate_workspace.py: passed, 7 tests.
.\venv\python.exe -m py_compile scripts\execute.py scripts\validate_workspace.py .codex\hooks\*.py: passed.
JSON parse check for phases/index.json, phases/0-harness-foundation/index.json, and .codex/hooks.json: passed.
Phase structure check for all stepN.md files: passed.
.codex/commands/*.md frontmatter check: passed.
python scripts\validate_workspace.py: passed, 103 tests after Phase 1 implementation.
.\venv\python.exe -c "import pdftomd; print(pdftomd.__name__)": passed after adding editable package metadata.
