PDFtoMD Progress
Current Status
- Date: 2026-04-30.
- Mode: Phase 1 implementation complete; ready for Phase 1 review or Phase 2 handoff.
- Implementation status: Phase 0 and Phase 1 complete.
- Custom agent creation: project-scoped read-only agents created after user approval.
- Persistent multi-agent coordination files were approved by the user.
Completed
- Read repository instructions and project documents: AGENTS.md, docs/PRD.md, docs/ARCHITECTURE.md, docs/ADR.md, docs/UI_GUIDE.md.
- Confirmed project direction:
  - Windows-native local CLI/library engine first.
  - Marker for document structure, reading order, tables, figures, headings, and captions.
  - Nougat only for mathematical expressions/formulas.
  - PyMuPDF for PDF analysis and chunk planning.
  - Markdown chunk output plus image/table assets.
- Recorded detailed conversion policy decisions in docs/CONVERSION_POLICY.md.
- Strengthened project documentation with current research and decisions: AGENTS.md, README.md, docs/PRD.md, docs/ARCHITECTURE.md, docs/ADR.md, docs/TOOLCHAIN.md, docs/UI_GUIDE.md.
- Strengthened AGENTS.md multi-agent coordination rules so every new agent reads PLAN.md and PROGRESS.md first and can identify current goals, assigned scope, completed work, blockers, next work, and conflict risks.
- Created project-scoped Codex extensions under .codex/:
  - Agents: pdf-toolchain-researcher, sample-corpus-analyst, conversion-architect, quality-evaluator, formula-pipeline-specialist, layout-table-figure-specialist.
  - Commands: status, env-check, sample-audit, quality-plan, conversion-policy-review, model-cache-check, phase-draft.
  - Skills: pdf-toolchain, sample-corpus, conversion-architecture, formula-quality, markdown-quality, windows-runtime.
  - Hooks: strengthened risky command guard, added handoff policy and drift policy hooks.
- Validated .codex extension formats:
  - Agent TOML files parsed successfully.
  - .codex/hooks.json parsed successfully.
  - Hook Python scripts compiled successfully.
  - All .codex/skills/*/SKILL.md files passed skill-creator quick validation.
- Confirmed user environment:
  - Windows 10.
  - NVIDIA GeForce GTX 1070 Ti.
  - 8 GB VRAM.
  - NVIDIA driver 577.00.
  - nvidia-smi reports CUDA runtime capability 12.9.
  - User reports CUDA 12.4 installed.
  - Current detected Python: Miniforge Python 3.12.7.
  - Conda is available.
  - uv is not available.
- Created repo-local environment:
  - venv: Python 3.11.15, unified Marker/PyMuPDF/Pandas/test/Nougat environment.
- Removed previous experimental venv-nougat directory after unified venv validation passed.
- Verified unified environment:
  - torch==2.7.1+cu126
  - torchvision==0.22.1+cu126
  - marker-pdf==1.10.2
  - nougat-ocr==0.1.17
  - transformers==4.57.6
  - albumentations==1.3.1
  - fsspec==2026.2.0
  - pymupdf==1.27.2.3
  - pandas==3.0.2
  - pytest==9.0.3
  - Pillow==10.4.0
  - pypdfium2==4.30.0
  - opencv-python-headless==4.11.0.86
  - pip check: passed.
  - CUDA tensor operation on GTX 1070 Ti: passed.
  - venv\Scripts\nougat.exe --help: passed.
- Ran earlier repository validation before default Python test discovery was added:
  - python scripts/validate_workspace.py: passed at that time with no configured validation commands.
- Confirmed sample PDFs:
  - samples/2007쉘구조물의유한요소해석에대하여.pdf: 13 pages, first page text length 3523, first page images 0.
  - samples/FourNodeQuadrilateralShellElementMITC4.pdf: 7 pages, first page text length 3269, first page images 0.
  - samples/MITC공부.pdf: 13 pages, first page text length 226, first page images 2.
  - samples/유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf: 76 pages, first page text length 446, first page images 10.
- Strengthened the project for Anthropic-style Harness Engineering:
  - Added docs/HARNESS.md with planner/generator/evaluator roles, file protocol, Sprint Contract template, evaluator hard thresholds, and simplification rules.
  - Added executable phase registry phases/index.json.
  - Added first self-contained phase phases/0-harness-foundation/ with four pending steps:
    - sample-metadata-contract
    - core-package-skeleton
    - page-preanalysis-contract
    - markdown-quality-gates
  - Updated AGENTS.md, PLAN.md, README.md, docs/ARCHITECTURE.md, and docs/ADR.md to reference the Harness workflow.
  - Added .codex/commands/sprint-contract.md.
  - Strengthened Harness workflow/review skill guidance to require Sprint Contracts.
  - Updated hooks for simpler Windows-friendly command paths and expanded handoff checks to include phases/, scripts/, .agents/, and plugins/.
  - Made scripts/validate_workspace.py discover repo-local Python validation by default.
  - Added scripts/test_validate_workspace.py and fixed scripts/test_execute.py UTF-8 fixture handling on Windows.
- Established the full phase-by-phase implementation roadmap before starting engine implementation:
  - Added docs/IMPLEMENTATION_PLAN.md.
  - Expanded phases/index.json from Phase 0 only to Phases 0 through 9.
  - Added executable pending step contracts for:
    - 1-core-runtime-contracts
    - 2-marker-adapter
    - 3-formula-pipeline
    - 4-semantic-enrichment
    - 5-markdown-rendering-assets
    - 6-cli-runtime-resume
    - 7-mvp-quality-hardening
    - 8-release-docs-packaging
    - 9-pyqt-thin-client
  - Updated PLAN.md, AGENTS.md, and README.md to point new agents to the full implementation roadmap.
- Implemented Phase 0 Harness foundation:
  - Step 0 sample-metadata-contract: added deterministic samples/metadata.json and metadata contract tests.
  - Step 1 core-package-skeleton: added pyproject.toml, importable src/pdftomd package, typed model contracts, and model tests.
  - Step 2 page-preanalysis-contract: added PyMuPDF-only analyze_pdf() preanalysis, deterministic OCR candidate logic, and chunk candidate tests.
  - Step 3 markdown-quality-gates: added focused Markdown quality gates and tests for math delimiters, LaTeX environments, image links, tables, chunk frontmatter, and anchors.
  - Parallel work was split by disjoint write scopes: sample metadata/model contracts first, then preanalysis/quality gates.
- Reviewed Phase 0 with harness-review criteria:
  - No blocking findings.
  - Architecture boundary remained intact: Marker and Nougat are not invoked in foundation contracts, and PyMuPDF is limited to page pre-analysis.
  - python scripts\validate_workspace.py passed before Phase 1 work started.
- Implemented Phase 1 core runtime contracts:
  - Step 0 input-normalization-slug: added deterministic PDF path normalization, document identity creation, anchors, and output bundle path contracts.
  - Step 1 conversion-options-config: added typed conversion options, runtime modes, and formula parser options without CLI parsing.
  - Step 2 output-bundle-contract: added deterministic document bundle paths while keeping runtime artifacts separate from document output.
  - Step 3 runtime-cache-policy: added explicit .models/ default cache policy, PDFTOMD_MODEL_CACHE override, Hugging Face offline environment mappings, and runtime artifact paths.
  - Updated docs/TOOLCHAIN.md and .gitignore for model cache policy.
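The runtime-cache-policy step above can be pictured with a minimal sketch. It assumes the override is read from the PDFTOMD_MODEL_CACHE environment variable and that the default is a .models/ directory under the repository root, as recorded above. The function names, the huggingface subdirectory, and the specific Hugging Face variables chosen (HF_HOME, HF_HUB_OFFLINE, TRANSFORMERS_OFFLINE) are illustrative assumptions, not the project's actual contract.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def resolve_model_cache(repo_root: Path, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the override from PDFTOMD_MODEL_CACHE if set, else <repo_root>/.models."""
    env = os.environ if env is None else env
    override = env.get("PDFTOMD_MODEL_CACHE", "").strip()
    if override:
        return Path(override).expanduser()
    return repo_root / ".models"


def offline_env(cache_dir: Path) -> Dict[str, str]:
    """Hypothetical mapping of the cache directory onto Hugging Face offline settings."""
    return {
        "HF_HOME": str(cache_dir / "huggingface"),  # assumed subdirectory name
        "HF_HUB_OFFLINE": "1",
        "TRANSFORMERS_OFFLINE": "1",
    }
```

A caller would typically merge offline_env(resolve_model_cache(repo_root)) into the environment before importing any model library, so that no download is attempted at conversion time.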
Web Research Notes
- Marker currently supports Markdown/JSON/chunks/HTML output and includes tables, equations, inline math, image extraction, layout, and reading-order functionality.
- Nougat is the intended isolated formula parser candidate; Windows GPU use depends on a correct PyTorch install.
- PyMuPDF remains appropriate for page counting, PDF splitting/chunk planning, and low-level image/page operations.
- PyMuPDF4LLM, Docling, and MinerU are useful comparison baselines but are not the primary parser under the current architecture.
- MathJax notes that $...$ inline math can conflict with ordinary dollar signs, so delimiter validation is required.
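The delimiter concern in the last note can be illustrated with a small check. This is only a sketch under simplifying assumptions: inline math does not span lines, an escaped dollar is written as \$, and fenced code blocks are not treated specially. The function name is hypothetical and is not the project's quality gate.

```python
import re
from typing import List

# A single "$" that is neither escaped with a backslash nor part of "$$".
_SINGLE_DOLLAR = re.compile(r"(?<![\\$])\$(?!\$)")


def lines_with_unbalanced_inline_math(markdown: str) -> List[int]:
    """Return 1-based line numbers whose count of unescaped single "$" is odd."""
    suspicious = []
    for number, line in enumerate(markdown.splitlines(), start=1):
        if len(_SINGLE_DOLLAR.findall(line)) % 2 == 1:
            suspicious.append(number)
    return suspicious
```

A line such as "The fee is $5." is flagged because its lone dollar sign would open an inline math span that never closes, while "$E = mc^2$" and display math in $$...$$ pass.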
In Progress
- None.
Blockers
- None yet.
Decisions
- Personal-use context lowers immediate licensing risk, but Marker GPL/model license implications must be revisited before redistribution or commercial use.
- Mixed text/scanned PDFs are in scope, with page-level OCR intervention decisions based on lightweight text-layer quality analysis.
- Marker owns layout, reading order, body text, headings, tables, figures, captions, and OCR/layout handling.
- Nougat owns only mathematical expressions and formula blocks, with Marker text fallback on failure.
- Markdown tables are preferred, but limited HTML tables and table-region screenshot fallbacks are allowed for complex tables.
- Figure/table/formula numbers and body references should become internal Markdown links when confidence is sufficient.
- Chunking should prefer logical block boundaries over strict 20-page boundaries when a block would be split.
- Chunk Markdown may include concise frontmatter with core context, but document-output sidecars remain out of scope by default.
- CLI should write warnings/errors to stderr and local logs, not into generated Markdown.
- Resume support may use local runtime state/cache files to skip successful chunks.
- Custom agents will be created later, only one at a time after explicit user approval.
- Planning files are the source of truth for multi-agent coordination.
- Harness phase files now exist. PLAN.md remains the overall plan, PROGRESS.md remains the handoff state, and phases/{phase}/index.json is the phase execution status.
- Each future implementation step should use the docs/HARNESS.md planner/generator/evaluator workflow and include a Sprint Contract before code changes.
- Full implementation sequencing is recorded in docs/IMPLEMENTATION_PLAN.md; phase files are pending tickets and should not be executed out of dependency order.
- Phase 0 and Phase 1 are complete. phases/index.json marks both 0-harness-foundation and 1-core-runtime-contracts as completed.
- Main and Nougat dependencies can share one environment when Nougat's loose dependencies are pinned explicitly.
  - torch==2.11.0+cu128 was rejected for this machine because it does not support GTX 1070 Ti sm_61.
  - torch==2.7.1+cu126 was selected because it satisfies Marker torch>=2.7.0 and successfully runs CUDA tensor operations on GTX 1070 Ti.
  - nougat-ocr==0.1.17 requires dependency pins:
    - transformers==4.57.6, because transformers 5.7.0 breaks Nougat imports.
    - albumentations==1.3.1, because albumentations 2.x breaks Nougat transform initialization.
    - fsspec==2026.2.0, because newer fsspec conflicts with datasets.
    - pypdfium2==4.30.0, opencv-python-headless==4.11.0.86, and Pillow==10.4.0, because Marker/Surya depend on these versions and Nougat can operate with them.
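The chunking decision above (prefer logical block boundaries over strict 20-page boundaries) can be sketched as follows. This is an illustration only: the function name, the representation of blocks as inclusive page spans, and the choice to extend a chunk rather than shorten it are all assumptions, not the planner the project implements.

```python
from typing import List, Sequence, Tuple

PageSpan = Tuple[int, int]  # inclusive (first_page, last_page), 1-based


def plan_chunks(page_count: int, blocks: Sequence[PageSpan], target: int = 20) -> List[PageSpan]:
    """Split pages 1..page_count into chunks of about `target` pages.

    When a nominal boundary would cut through a block, the chunk is
    extended to the last page of that block instead.
    """
    chunks = []
    start = 1
    while start <= page_count:
        end = min(start + target - 1, page_count)
        extended = True
        while extended:  # repeat in case the extension cuts into another block
            extended = False
            for first, last in blocks:
                last = min(last, page_count)
                if first <= end < last:
                    end = last
                    extended = True
        chunks.append((start, end))
        start = end + 1
    return chunks
```

For a 76-page document with a hypothetical block spanning pages 19 to 22, the first chunk becomes pages 1 to 22 and the following chunks start from page 23. Shortening the chunk to end before the block would be an equally valid policy; extending simply keeps the sketch short.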
Next Work
- Review Phase 1 output with harness-review before moving to Phase 2.
- If review passes, start phases/2-marker-adapter/step0.md.
- Execute phases in order unless PLAN.md and docs/IMPLEMENTATION_PLAN.md are updated with a clear dependency rationale.
- Do not create new custom agents unless the user explicitly approves another agent.
Latest Validation
- .\venv\python.exe -m pytest scripts\test_validate_workspace.py: passed, 7 tests.
- .\venv\python.exe -m py_compile scripts\execute.py scripts\validate_workspace.py .codex\hooks\*.py: passed.
- JSON parse check for phases/index.json, phases/0-harness-foundation/index.json, and .codex/hooks.json: passed.
- Phase structure check for all stepN.md files: passed.
- .codex/commands/*.md frontmatter check: passed.
- python scripts\validate_workspace.py: passed, 103 tests after Phase 1 implementation.
- .\venv\python.exe -c "import pdftomd; print(pdftomd.__name__)": passed after adding editable package metadata.
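A JSON parse check like the one listed above can be reproduced with a few lines of standard-library Python. The helper below is a hypothetical sketch, not the script that was actually run; it assumes the files are UTF-8 encoded.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Union


def json_parse_failures(paths: Iterable[Union[str, Path]]) -> Dict[str, str]:
    """Return {path: error} for files that are missing, unreadable, or not valid JSON."""
    failures = {}
    for path in map(Path, paths):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
            failures[str(path)] = f"{type(exc).__name__}: {exc}"
    return failures
```

Calling json_parse_failures(["phases/index.json", "phases/0-harness-foundation/index.json", ".codex/hooks.json"]) from the repository root returns an empty dictionary when every file parses.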