Work Archive

Last updated: 2026-05-08

This document stores completed project work, historical sprint outcomes, environment setup results, and sample conversion evidence. PROGRESS.md should stay focused on current status, blockers, and next actions. Read this archive when a task needs past implementation context, previous verification commands, or historical handoff details.

Archive Policy

  • Keep completed sprint outcomes here after they no longer need to stay in PROGRESS.md.
  • Keep PROGRESS.md short and current.
  • Keep source-of-truth product requirements in PRD.md and architecture decisions in ARCHITECTURE.md.
  • Keep sprint contract details under docs/Sprints/.
  • Do not archive or commit sample PDFs or generated conversion outputs.

Completed Milestones

| Area | Completed Outcome | Primary References |
| --- | --- | --- |
| Core project documents | Created and iterated PRD.md, AGENTS.md, ARCHITECTURE.md, and docs/KNOWLEDGEBASE.md. | PRD.md, AGENTS.md, ARCHITECTURE.md, docs/KNOWLEDGEBASE.md |
| Engine decision | Moved from the initial MinerU 2.5-era planning to fixed MinerU 3.1.0. | PLAN.md, ARCHITECTURE.md, docs/KNOWLEDGEBASE.md |
| Strict-local policy | Redefined v1 runtime policy: allow direct local mineru CLI and its CLI-internal temporary local mineru-api; prohibit --api-url, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends. | PRD.md, ARCHITECTURE.md, AGENTS.md |
| Agent workflow | Added shared PLAN.md and PROGRESS.md workflow, project-scoped agents, commands, skills, hooks, and the planner/generator/evaluator harness guidance. | AGENTS.md, .codex/agents/, .codex/commands/, .codex/skills/, .codex/hooks.json |
| V1 implementation planning | Created docs/V1IMPLEMENTATIONPLAN.md and sprint contracts under docs/Sprints/. | docs/V1IMPLEMENTATIONPLAN.md, docs/Sprints/*.md |
| Sprint 0 | Completed source, environment, license, privacy, and contract verification. Recommendation was go-with-risks. | docs/Sprints/SPRINT0CONTRACT.md |
| Sprint 1 | Created minimal Python package scaffold, CLI placeholder, pyproject.toml, uv.lock, and fast pytest loop. | docs/Sprints/SPRINT1CONTRACT.md |
| Sprint 2 | Implemented deterministic local PDF discovery, path planning, overwrite conflict checks, duplicate output checks, and output-root escape prevention. | docs/Sprints/SPRINT2CONTRACT.md, src/pdf2md/paths.py |
| Sprint 3 | Implemented project-owned IR records, warning codes/severities, metadata construction, and metadata tests. | docs/Sprints/SPRINT3CONTRACT.md, src/pdf2md/ir.py, src/pdf2md/metadata.py |
| Sprint 4 | Implemented mocked direct local MinerU CLI adapter boundary and strict-local validation. | docs/Sprints/SPRINT4CONTRACT.md, src/pdf2md/mineru_adapter.py |
| Sprint 5 | Implemented Obsidian Markdown normalization, math delimiter handling, asset link normalization, and table fallback warnings. | docs/Sprints/SPRINT5CONTRACT.md, src/pdf2md/markdown.py |
| Sprint 6 | Implemented local quality checks and human-readable report rendering from metadata and quality results. | docs/Sprints/SPRINT6CONTRACT.md, src/pdf2md/quality.py, src/pdf2md/report.py |
| Sprint 7 | Implemented conversion orchestration, convert_pdf, convert_input, pdf2md convert, final Markdown writing, metadata/report writing, asset copying, and fake-adapter CLI/API tests. | docs/Sprints/SPRINT7CONTRACT.md, src/pdf2md/conversion.py, src/pdf2md/cli.py |
| Sprint 8 | Implemented pdf2md doctor, local setup diagnostics, GPU/CUDA/PyTorch/model/cache checks, and setup documentation. | docs/Sprints/SPRINT8CONTRACT.md, src/pdf2md/doctor.py, README.md |
| Sprint 9 | Implemented fast mocked v1 release-gate tests, optional local MinerU fixture evaluation, and docs/V1RELEASECHECKLIST.md. | docs/Sprints/SPRINT9CONTRACT.md, docs/V1RELEASECHECKLIST.md, tests/integration/ |
| GPU default/runtime setup | Made conversion default to cuda:0, mapped CUDA requests to MinerU subprocess environment variables, rebuilt .venv, installed CUDA-enabled PyTorch and MinerU 3.1.0, downloaded MinerU models, and set MINERU_MODEL_SOURCE=local. | README.md, src/pdf2md/mineru_adapter.py, src/pdf2md/conversion.py |
| MathJax checker | Planned and implemented local MathJax render checker with Node.js helper, Python wrapper, conversion integration, and doctor diagnostics. | docs/MATHJAXCHECKERPLAN.md, tools/mathjax-checker/check.mjs, src/pdf2md/math_render.py |
| Sprint 10 | Implemented opt-in pre-conversion PDF chunking with pypdf, temporary chunk PDF cleanup, --chunk-pages [PAGES], chunk metadata/report context, and mocked tests. | docs/Sprints/SPRINT10CONTRACT.md, src/pdf2md/pdf_splitter.py |
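
For readers unfamiliar with the Sprint 2 outcome, output-root escape prevention can be sketched roughly as follows. This is a minimal illustration, not the code in src/pdf2md/paths.py: the function name `plan_output_path` and its parameters are hypothetical, and the real implementation may check additional conditions.

```python
from pathlib import Path


def plan_output_path(output_root: Path, relative_target: str) -> Path:
    """Resolve a planned output path and refuse anything outside output_root.

    Hypothetical helper for illustration only.
    """
    root = output_root.resolve()
    # Joining and then resolving collapses ".." segments; an absolute
    # relative_target replaces the root entirely and is caught below.
    candidate = (root / relative_target).resolve()
    # is_relative_to (Python 3.9+) compares path components, which avoids
    # the false positives of a plain string-prefix comparison.
    if not candidate.is_relative_to(root):
        raise ValueError(f"output path escapes output root: {candidate}")
    return candidate
```

A target such as `notes/paper.md` resolves to a path under the root, while `../paper.md` raises `ValueError`.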

Runtime Setup Archive

  • OS/workspace: Windows PowerShell in D:\Work\Repos\AICoding\ConvertPDFToMD.
  • Python target: 3.12.
  • Local Python observed during Sprint 0: 3.12.7.
  • uv installed per-user at C:\Users\user\.local\bin.
  • GPU target: NVIDIA GTX 1070 Ti 8GB.
  • Local GPU observed: NVIDIA GeForce GTX 1070 Ti, driver 577.00, 8192 MiB VRAM, WDDM.
  • MinerU execution mode: direct local mineru CLI only.
  • MinerU 3.1.0 CLI-internal temporary local mineru-api is allowed when the CLI runs without --api-url.
  • GTX 1070 Ti runtime setup used torch==2.6.0+cu126, torchvision==0.21.0+cu126, and mineru[core]==3.1.0.
  • MinerU models were downloaded with uv run mineru-models-download -s huggingface -m all.
  • Runtime model loading uses MINERU_MODEL_SOURCE=local.
  • Current doctor status after setup is WARN because the GTX 1070 Ti is a Pascal (pre-Turing) GPU; the MinerU, CUDA PyTorch, local model config, MathJax checker, and strict-local checks all pass.
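
The strict-local rule in the list above (direct local mineru CLI only, no --api-url) could be enforced with a check along these lines. This is a sketch under stated assumptions, not the validation in src/pdf2md/mineru_adapter.py: the function name is hypothetical, and treating any argument containing `http-client` as an HTTP client backend is an assumption about how those backends are named.

```python
# Flag named in the project policy as prohibited.
PROHIBITED_FLAGS = ("--api-url",)
# Assumption: HTTP client backend names contain this marker.
ASSUMED_REMOTE_BACKEND_MARKER = "http-client"


def find_strict_local_violations(mineru_args: list[str]) -> list[str]:
    """Return human-readable policy violations for a planned mineru argv.

    Hypothetical helper for illustration only.
    """
    violations = []
    for arg in mineru_args:
        # Handle both "--api-url URL" and "--api-url=URL" spellings.
        flag = arg.split("=", 1)[0]
        if flag in PROHIBITED_FLAGS:
            violations.append(f"prohibited flag: {flag}")
        elif ASSUMED_REMOTE_BACKEND_MARKER in arg:
            violations.append(f"remote-capable backend value: {arg}")
    return violations
```

An adapter would typically refuse to start the subprocess whenever the returned list is non-empty, so a misconfigured call fails before any network access is possible.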

Sample Conversion Archive

Generated outputs are ignored under outputs/ and are not committed.

| Sample | Result | Output Location | Summary |
| --- | --- | --- | --- |
| samples/MITC공부.pdf | Completed after CUDA-enabled runtime setup. | outputs/MITC공부/ | 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 1 info warning at the time of that run because the local MathJax checker was unavailable. |
| samples/FourNodeQuadrilateralShellElementMITC4.pdf | Completed with default GPU request and MINERU_MODEL_SOURCE=local. | outputs/FourNodeQuadrilateralShellElementMITC4/ | Report status success: 7 pages, 22 assets, 38 inline formulas, 16 display formulas, 0 math render errors, 0 warnings. |
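
The "default GPU request and MINERU_MODEL_SOURCE=local" combination used for these runs might be passed to the MinerU subprocess roughly as shown here. This is only a sketch: the archive does not record which variables the adapter sets for the device, so the use of `MINERU_DEVICE_MODE` is an assumption, and `build_mineru_env` is a hypothetical name.

```python
import os
from typing import Mapping, Optional


def build_mineru_env(
    device: str = "cuda:0",
    base_env: Optional[Mapping[str, str]] = None,
) -> dict[str, str]:
    """Build the environment for a local mineru subprocess.

    Hypothetical helper for illustration only.
    """
    # Copy so the caller's environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    # Recorded in this archive: models are loaded from the local download.
    env["MINERU_MODEL_SOURCE"] = "local"
    # Assumption: the device request is forwarded through this variable.
    env["MINERU_DEVICE_MODE"] = device
    return env
```

The resulting mapping would be passed as the `env` argument of `subprocess.run`, leaving the parent process environment unchanged.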

Historical Verification Highlights

  • Sprint 1 scaffold: uv run pytest passed 4 tests; uv run pdf2md --version printed pdf2md 0.1.0.
  • Sprint 2 path planning: uv run pytest tests/test_paths.py passed 17 tests; full suite passed 21 tests.
  • Sprint 3 metadata/IR: targeted IR/metadata tests passed 25 tests; full suite passed 46 tests.
  • Sprint 4 MinerU adapter: adapter tests passed 26 tests; full suite passed 72 tests.
  • Sprint 5 Markdown normalization: targeted Markdown/IR tests passed 30 tests; full suite passed 89 tests.
  • Sprint 6 quality/report: targeted quality/report/metadata tests passed 26 tests; full suite passed 103 tests.
  • Sprint 7 conversion orchestration: full suite passed 119 tests after metadata math count fixes.
  • Sprint 8 doctor: full suite passed 133 tests.
  • Sprint 9 release gate: full suite passed 136 tests with 1 optional skip.
  • CUDA runtime rebuild: verified CUDA with an actual tensor operation on NVIDIA GeForce GTX 1070 Ti, compute capability 6.1; mineru --version reported 3.1.0.
  • MathJax checker: npm run mathjax-checker:health returned {"ok":true} after local npm install; full suite passed 150 tests with 1 optional skip after integration.
  • Sprint 10 chunking: targeted chunking tests passed 42 tests; full default suite passed 163 tests with 1 optional skip; git diff --check passed with line-ending warnings only.
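
As context for the Sprint 10 chunking tests, the page-range planning behind a --chunk-pages value can be sketched as below. This is an illustration under assumptions, not the logic in src/pdf2md/pdf_splitter.py: `plan_page_chunks` is a hypothetical name, the 1-based inclusive range convention is assumed, and the actual pypdf splitting step is omitted.

```python
def plan_page_chunks(total_pages: int, chunk_pages: int) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges.

    Hypothetical helper for illustration only.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if chunk_pages < 1:
        raise ValueError("chunk_pages must be at least 1")
    # Step through the document in chunk_pages strides; the final chunk is
    # clamped to total_pages and may therefore be shorter than the others.
    return [
        (start, min(start + chunk_pages - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_pages)
    ]
```

For a 13-page PDF and a chunk size of 5, this yields `[(1, 5), (6, 10), (11, 13)]`; a chunk size larger than the document yields a single range covering every page.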

Historical Blockers And Resolutions

  • uv was not available early in the project; resolved by installing uv per-user.
  • Initial real sample conversion failed when PyTorch was CPU-only; resolved by rebuilding .venv and installing CUDA-enabled PyTorch before MinerU.
  • Optional local MinerU fixture checks were originally blocked because the MinerU CLI was missing; resolved for the local environment after MinerU 3.1.0 and model setup.
  • Local MathJax render checking was originally unavailable; resolved after adding the local Node.js MathJax checker and installing local npm dependencies.
  • GTX 1070 Ti remains a Pascal/pre-Turing GPU risk. This is a warning, not a current blocker.

Archive Notes For Agents

  • Read this file when a task asks for historical implementation state, prior verification, completed sprint context, or sample conversion evidence.
  • Prefer PROGRESS.md for current state and next actions.
  • Prefer sprint contract files for detailed acceptance criteria and handoff structure.
  • Prefer docs/V1RELEASECHECKLIST.md for release readiness gates.