PDFToMD/PROGRESS.md
김경종 7e985ae94a add files
2026-04-30 17:05:19 +09:00

PDFtoMD Progress

Current Status

  • Date: 2026-04-30.
  • Mode: Phase 1 implementation complete; ready for Phase 1 review or Phase 2 handoff.
  • Implementation status: Phase 0 and Phase 1 complete.
  • Custom agent creation: project-scoped read-only agents created after user approval.
  • Persistent multi-agent coordination files were approved by the user.

Completed

  • Read repository instructions and project documents:
    • AGENTS.md
    • docs/PRD.md
    • docs/ARCHITECTURE.md
    • docs/ADR.md
    • docs/UI_GUIDE.md
  • Confirmed project direction:
    • Windows-native local CLI/library engine first.
    • Marker for document structure, reading order, tables, figures, headings, and captions.
    • Nougat only for mathematical expressions/formulas.
    • PyMuPDF for PDF analysis and chunk planning.
    • Markdown chunk output plus image/table assets.
  • Recorded detailed conversion policy decisions in docs/CONVERSION_POLICY.md.
  • Strengthened project documentation with current research and decisions:
    • AGENTS.md
    • README.md
    • docs/PRD.md
    • docs/ARCHITECTURE.md
    • docs/ADR.md
    • docs/TOOLCHAIN.md
    • docs/UI_GUIDE.md
  • Strengthened AGENTS.md multi-agent coordination rules so every new agent reads PLAN.md and PROGRESS.md first and can identify current goals, assigned scope, completed work, blockers, next work, and conflict risks.
  • Created project-scoped Codex extensions under .codex/:
    • Agents: pdf-toolchain-researcher, sample-corpus-analyst, conversion-architect, quality-evaluator, formula-pipeline-specialist, layout-table-figure-specialist.
    • Commands: status, env-check, sample-audit, quality-plan, conversion-policy-review, model-cache-check, phase-draft.
    • Skills: pdf-toolchain, sample-corpus, conversion-architecture, formula-quality, markdown-quality, windows-runtime.
    • Hooks: strengthened risky command guard, added handoff policy and drift policy hooks.
  • Validated .codex extension formats:
    • Agent TOML files parsed successfully.
    • .codex/hooks.json parsed successfully.
    • Hook Python scripts compiled successfully.
    • All .codex/skills/*/SKILL.md files passed skill-creator quick validation.
  • Confirmed user environment:
    • Windows 10.
    • NVIDIA GeForce GTX 1070 Ti.
    • 8 GB VRAM.
    • NVIDIA driver 577.00.
    • nvidia-smi reports CUDA version 12.9, which is the highest CUDA version the driver supports, not an installed toolkit version.
    • User reports CUDA 12.4 installed.
    • Current detected Python: Miniforge Python 3.12.7.
    • Conda is available.
    • uv is not available.
  • Created repo-local environment:
    • venv: Python 3.11.15, unified Marker/PyMuPDF/Pandas/test/Nougat environment.
  • Removed previous experimental venv-nougat directory after unified venv validation passed.
  • Verified unified environment:
    • torch==2.7.1+cu126
    • torchvision==0.22.1+cu126
    • marker-pdf==1.10.2
    • nougat-ocr==0.1.17
    • transformers==4.57.6
    • albumentations==1.3.1
    • fsspec==2026.2.0
    • pymupdf==1.27.2.3
    • pandas==3.0.2
    • pytest==9.0.3
    • Pillow==10.4.0
    • pypdfium2==4.30.0
    • opencv-python-headless==4.11.0.86
    • pip check: passed.
    • CUDA tensor operation on GTX 1070 Ti: passed.
    • venv\Scripts\nougat.exe --help: passed.
  • Ran earlier repository validation before default Python test discovery was added:
    • python scripts/validate_workspace.py: passed at that time with no configured validation commands.
  • Confirmed sample PDFs:
    • samples/2007쉘구조물의유한요소해석에대하여.pdf: 13 pages, first page text length 3523, first page images 0.
    • samples/FourNodeQuadrilateralShellElementMITC4.pdf: 7 pages, first page text length 3269, first page images 0.
    • samples/MITC공부.pdf: 13 pages, first page text length 226, first page images 2.
    • samples/유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf: 76 pages, first page text length 446, first page images 10.
  • Strengthened the project for Anthropic-style Harness Engineering:
    • Added docs/HARNESS.md with planner/generator/evaluator roles, file protocol, Sprint Contract template, evaluator hard thresholds, and simplification rules.
    • Added executable phase registry phases/index.json.
    • Added first self-contained phase phases/0-harness-foundation/ with four pending steps:
      • sample-metadata-contract
      • core-package-skeleton
      • page-preanalysis-contract
      • markdown-quality-gates
    • Updated AGENTS.md, PLAN.md, README.md, docs/ARCHITECTURE.md, and docs/ADR.md to reference the Harness workflow.
    • Added .codex/commands/sprint-contract.md.
    • Strengthened Harness workflow/review skill guidance to require Sprint Contracts.
    • Updated hooks for simpler Windows-friendly command paths and expanded handoff checks to include phases/, scripts/, .agents/, and plugins/.
    • Made scripts/validate_workspace.py discover repo-local Python validation by default.
    • Added scripts/test_validate_workspace.py and fixed scripts/test_execute.py UTF-8 fixture handling on Windows.
  • Established the full phase-by-phase implementation roadmap before starting engine implementation:
    • Added docs/IMPLEMENTATION_PLAN.md.
    • Expanded phases/index.json from Phase 0 only to Phases 0 through 9.
    • Added executable pending step contracts for:
      • 1-core-runtime-contracts
      • 2-marker-adapter
      • 3-formula-pipeline
      • 4-semantic-enrichment
      • 5-markdown-rendering-assets
      • 6-cli-runtime-resume
      • 7-mvp-quality-hardening
      • 8-release-docs-packaging
      • 9-pyqt-thin-client
    • Updated PLAN.md, AGENTS.md, and README.md to point new agents to the full implementation roadmap.
  • Implemented Phase 0 Harness foundation:
    • Step 0 sample-metadata-contract: added deterministic samples/metadata.json and metadata contract tests.
    • Step 1 core-package-skeleton: added pyproject.toml, importable src/pdftomd package, typed model contracts, and model tests.
    • Step 2 page-preanalysis-contract: added PyMuPDF-only analyze_pdf() preanalysis, deterministic OCR candidate logic, and chunk candidate tests.
    • Step 3 markdown-quality-gates: added focused Markdown quality gates and tests for math delimiters, LaTeX environments, image links, tables, chunk frontmatter, and anchors.
    • Parallel work was split by disjoint write scopes: sample metadata/model contracts first, then preanalysis/quality gates.
  • Reviewed Phase 0 with harness-review criteria:
    • No blocking findings.
    • Architecture boundary remained intact: Marker and Nougat are not invoked in foundation contracts, and PyMuPDF is limited to page pre-analysis.
    • python scripts\validate_workspace.py passed before Phase 1 work started.
  • Implemented Phase 1 core runtime contracts:
    • Step 0 input-normalization-slug: added deterministic PDF path normalization, document identity creation, anchors, and output bundle path contracts.
    • Step 1 conversion-options-config: added typed conversion options, runtime modes, and formula parser options without CLI parsing.
    • Step 2 output-bundle-contract: added deterministic document bundle paths while keeping runtime artifacts separate from document output.
    • Step 3 runtime-cache-policy: added explicit .models/ default cache policy, PDFTOMD_MODEL_CACHE override, Hugging Face offline environment mappings, and runtime artifact paths.
    • Updated docs/TOOLCHAIN.md and .gitignore for model cache policy.
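
The runtime cache policy from Phase 1 Step 3 could be sketched roughly as follows. This is a minimal illustration, not the project's actual code. Only the `.models/` default and the `PDFTOMD_MODEL_CACHE` override come from the notes above. The helper names `resolve_model_cache` and `offline_env`, and the choice of `HF_HOME`, `HF_HUB_OFFLINE`, and `TRANSFORMERS_OFFLINE` as the mapped Hugging Face variables, are assumptions.

```python
from pathlib import Path
from typing import Mapping


def resolve_model_cache(repo_root: Path, env: Mapping[str, str]) -> Path:
    """Return the model cache directory (hypothetical helper).

    PDFTOMD_MODEL_CACHE wins when it is set and non-empty; otherwise the
    repo-local .models/ directory is used.
    """
    override = env.get("PDFTOMD_MODEL_CACHE", "").strip()
    if override:
        return Path(override)
    return repo_root / ".models"


def offline_env(cache_dir: Path) -> dict[str, str]:
    """Build Hugging Face environment variables for offline runs.

    Assumption: the cache maps to HF_HOME and the two standard offline
    switches are set; the project's real mapping may differ.
    """
    return {
        "HF_HOME": str(cache_dir),
        "HF_HUB_OFFLINE": "1",
        "TRANSFORMERS_OFFLINE": "1",
    }
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the resolution deterministic and easy to test, which matches the contract-first style of the earlier phases.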

Web Research Notes

  • Marker currently supports Markdown/JSON/chunks/HTML output and includes tables, equations, inline math, image extraction, layout, and reading-order functionality.
  • Nougat is the intended isolated formula parser candidate; Windows GPU use depends on a correct PyTorch install.
  • PyMuPDF remains appropriate for page counting, PDF splitting/chunk planning, and low-level image/page operations.
  • PyMuPDF4LLM, Docling, and MinerU are useful comparison baselines but are not the primary parser under the current architecture.
  • MathJax notes that $...$ inline math can conflict with ordinary dollar signs, so delimiter validation is required.
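
The delimiter concern in the last note can be illustrated with a small check. This is a sketch under simple assumptions, not the project's quality gate: the helper name `unbalanced_inline_math` is hypothetical, `\$` is treated as an escaped literal dollar sign, and `$$` display delimiters are ignored.

```python
import re

# Matches a single "$" that is neither escaped (\$) nor part of "$$".
_INLINE_DOLLAR = re.compile(r"(?<![\\$])\$(?!\$)")


def unbalanced_inline_math(line: str) -> bool:
    """Return True when a line has an odd number of inline "$" delimiters."""
    return len(_INLINE_DOLLAR.findall(line)) % 2 == 1
```

A line such as `It costs $5.` is flagged because its single dollar sign has no partner, while `Energy is $E = mc^2$.` passes. A line that mixes a price with a formula would also be flagged, which is the conflict the MathJax note describes.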

In Progress

  • None.

Blockers

  • None yet.

Decisions

  • Personal-use context lowers immediate licensing risk, but Marker GPL/model license implications must be revisited before redistribution or commercial use.
  • Mixed text/scanned PDFs are in scope, with page-level OCR intervention decisions based on lightweight text-layer quality analysis.
  • Marker owns layout, reading order, body text, headings, tables, figures, captions, and OCR/layout handling.
  • Nougat owns only mathematical expressions and formula blocks, with Marker text fallback on failure.
  • Markdown tables are preferred, but limited HTML tables and table-region screenshot fallbacks are allowed for complex tables.
  • Figure/table/formula numbers and body references should become internal Markdown links when confidence is sufficient.
  • Chunking should prefer logical block boundaries over strict 20-page boundaries when a block would be split.
  • Chunk Markdown may include concise frontmatter with core context, but document-output sidecars remain out of scope by default.
  • CLI should write warnings/errors to stderr and local logs, not into generated Markdown.
  • Resume support may use local runtime state/cache files to skip successful chunks.
  • Custom agents are created only one at a time, each after explicit user approval.
  • Planning files are the source of truth for multi-agent coordination.
  • Harness phase files now exist. PLAN.md remains the overall plan, PROGRESS.md remains the handoff state, and phases/{phase}/index.json is the phase execution status.
  • Each future implementation step should use the docs/HARNESS.md planner/generator/evaluator workflow and include a Sprint Contract before code changes.
  • Full implementation sequencing is recorded in docs/IMPLEMENTATION_PLAN.md; phase files are pending tickets and should not be executed out of dependency order.
  • Phase 0 and Phase 1 are complete. phases/index.json marks both 0-harness-foundation and 1-core-runtime-contracts as completed.
  • Main and Nougat dependencies can share one environment when Nougat's loose dependencies are pinned explicitly.
  • torch==2.11.0+cu128 was rejected for this machine because it does not support GTX 1070 Ti sm_61.
  • torch==2.7.1+cu126 was selected because it satisfies Marker torch>=2.7.0 and successfully runs CUDA tensor operations on GTX 1070 Ti.
  • nougat-ocr==0.1.17 requires dependency pins:
    • transformers==4.57.6, because transformers 5.7.0 breaks Nougat imports.
    • albumentations==1.3.1, because albumentations 2.x breaks Nougat transform initialization.
    • fsspec==2026.2.0, because newer fsspec conflicts with datasets.
    • pypdfium2==4.30.0, opencv-python-headless==4.11.0.86, and Pillow==10.4.0, because Marker/Surya depend on these versions and Nougat can operate with them.
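
One way to guard these pins against drift is to compare them with the installed distributions. The sketch below is an illustration only: `find_pin_mismatches` and the `PINS` table are hypothetical names, and the versions are copied from this document rather than from a project lock file.

```python
from importlib import metadata
from typing import Callable

# Versions recorded in this document for the unified environment.
PINS = {
    "torch": "2.7.1+cu126",
    "marker-pdf": "1.10.2",
    "nougat-ocr": "0.1.17",
    "transformers": "4.57.6",
    "albumentations": "1.3.1",
    "fsspec": "2026.2.0",
    "pypdfium2": "4.30.0",
    "opencv-python-headless": "4.11.0.86",
    "Pillow": "10.4.0",
}


def find_pin_mismatches(
    pins: dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> dict[str, str]:
    """Map each mismatched package to the version found, or "missing"."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = "missing"
        if found != expected:
            mismatches[name] = found
    return mismatches
```

Calling `find_pin_mismatches(PINS)` inside the unified environment should return an empty dict while the pins hold. The injectable `get_version` argument exists so the check can be tested without installing any of the packages.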

Next Work

  1. Review Phase 1 output with harness-review before moving to Phase 2.
  2. If review passes, start phases/2-marker-adapter/step0.md.
  3. Execute phases in order unless PLAN.md and docs/IMPLEMENTATION_PLAN.md are updated with a clear dependency rationale.
  4. Do not create new custom agents unless the user explicitly approves another agent.

Latest Validation

  • .\venv\python.exe -m pytest scripts\test_validate_workspace.py: passed, 7 tests.
  • .\venv\python.exe -m py_compile scripts\execute.py scripts\validate_workspace.py .codex\hooks\*.py: passed.
  • JSON parse check for phases/index.json, phases/0-harness-foundation/index.json, and .codex/hooks.json: passed.
  • Phase structure check for all stepN.md files: passed.
  • .codex/commands/*.md frontmatter check: passed.
  • python scripts\validate_workspace.py: passed, 103 tests after Phase 1 implementation.
  • .\venv\python.exe -c "import pdftomd; print(pdftomd.__name__)": passed after adding editable package metadata.
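
The JSON parse check listed above could be reproduced with a few lines of standard-library Python. This is a sketch of one possible approach, not the script that produced the results; the helper name `json_parse_errors` is hypothetical.

```python
import json
from pathlib import Path


def json_parse_errors(paths: list[Path]) -> dict[str, str]:
    """Return a path-to-message mapping for files that fail to parse as JSON."""
    errors = {}
    for path in paths:
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError) as exc:
            # ValueError covers json.JSONDecodeError and UnicodeDecodeError.
            errors[str(path)] = f"{type(exc).__name__}: {exc}"
    return errors
```

Run from the repository root with the three files named in the check, an empty result corresponds to the "passed" status recorded above.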