# Work Archive

Last updated: 2026-05-08

This document stores completed project work, historical sprint outcomes, environment setup results, and sample conversion evidence. `PROGRESS.md` should stay focused on current status, blockers, and next actions. Read this archive when a task needs past implementation context, previous verification commands, or historical handoff details.
## Archive Policy
- Keep completed sprint outcomes here after they no longer need to stay in `PROGRESS.md`.
- Keep `PROGRESS.md` short and current.
- Keep source-of-truth product requirements in `PRD.md` and architecture decisions in `ARCHITECTURE.md`.
- Keep sprint contract details under `docs/Sprints/`.
- Do not archive or commit sample PDFs or generated conversion outputs.
## Completed Milestones
| Area | Completed Outcome | Primary References |
| --- | --- | --- |
| Core project documents | Created and iterated `PRD.md`, `AGENTS.md`, `ARCHITECTURE.md`, and `docs/KNOWLEDGEBASE.md`. | `PRD.md`, `AGENTS.md`, `ARCHITECTURE.md`, `docs/KNOWLEDGEBASE.md` |
| Engine decision | Moved from the initial MinerU 2.5-era planning to MinerU 3.1.0 as the fixed (pinned) engine version. | `PLAN.md`, `ARCHITECTURE.md`, `docs/KNOWLEDGEBASE.md` |
| Strict-local policy | Redefined v1 runtime policy: allow direct local `mineru` CLI and its CLI-internal temporary local `mineru-api`; prohibit `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends. | `PRD.md`, `ARCHITECTURE.md`, `AGENTS.md` |
| Agent workflow | Added shared `PLAN.md` and `PROGRESS.md` workflow, project-scoped agents, commands, skills, hooks, and the planner/generator/evaluator harness guidance. | `AGENTS.md`, `.codex/agents/`, `.codex/commands/`, `.codex/skills/`, `.codex/hooks.json` |
| V1 implementation planning | Created `docs/V1IMPLEMENTATIONPLAN.md` and sprint contracts under `docs/Sprints/`. | `docs/V1IMPLEMENTATIONPLAN.md`, `docs/Sprints/*.md` |
| Sprint 0 | Completed source, environment, license, privacy, and contract verification. Recommendation was `go-with-risks`. | `docs/Sprints/SPRINT0CONTRACT.md` |
| Sprint 1 | Created minimal Python package scaffold, CLI placeholder, `pyproject.toml`, `uv.lock`, and fast pytest loop. | `docs/Sprints/SPRINT1CONTRACT.md` |
| Sprint 2 | Implemented deterministic local PDF discovery, path planning, overwrite conflict checks, duplicate output checks, and output-root escape prevention. | `docs/Sprints/SPRINT2CONTRACT.md`, `src/pdf2md/paths.py` |
| Sprint 3 | Implemented project-owned IR records, warning codes/severities, metadata construction, and metadata tests. | `docs/Sprints/SPRINT3CONTRACT.md`, `src/pdf2md/ir.py`, `src/pdf2md/metadata.py` |
| Sprint 4 | Implemented mocked direct local MinerU CLI adapter boundary and strict-local validation. | `docs/Sprints/SPRINT4CONTRACT.md`, `src/pdf2md/mineru_adapter.py` |
| Sprint 5 | Implemented Obsidian Markdown normalization, math delimiter handling, asset link normalization, and table fallback warnings. | `docs/Sprints/SPRINT5CONTRACT.md`, `src/pdf2md/markdown.py` |
| Sprint 6 | Implemented local quality checks and human-readable report rendering from metadata and quality results. | `docs/Sprints/SPRINT6CONTRACT.md`, `src/pdf2md/quality.py`, `src/pdf2md/report.py` |
| Sprint 7 | Implemented conversion orchestration, `convert_pdf`, `convert_input`, `pdf2md convert`, final Markdown writing, metadata/report writing, asset copying, and fake-adapter CLI/API tests. | `docs/Sprints/SPRINT7CONTRACT.md`, `src/pdf2md/conversion.py`, `src/pdf2md/cli.py` |
| Sprint 8 | Implemented `pdf2md doctor`, local setup diagnostics, GPU/CUDA/PyTorch/model/cache checks, and setup documentation. | `docs/Sprints/SPRINT8CONTRACT.md`, `src/pdf2md/doctor.py`, `README.md` |
| Sprint 9 | Implemented fast mocked v1 release-gate tests, optional local MinerU fixture evaluation, and `docs/V1RELEASECHECKLIST.md`. | `docs/Sprints/SPRINT9CONTRACT.md`, `docs/V1RELEASECHECKLIST.md`, `tests/integration/` |
| GPU default/runtime setup | Made conversion default to `cuda:0`, mapped CUDA requests to MinerU subprocess environment variables, rebuilt `.venv`, installed CUDA-enabled PyTorch and MinerU 3.1.0, downloaded MinerU models, and set `MINERU_MODEL_SOURCE=local`. | `README.md`, `src/pdf2md/mineru_adapter.py`, `src/pdf2md/conversion.py` |
| MathJax checker | Planned and implemented local MathJax render checker with Node.js helper, Python wrapper, conversion integration, and doctor diagnostics. | `docs/MATHJAXCHECKERPLAN.md`, `tools/mathjax-checker/check.mjs`, `src/pdf2md/math_render.py` |
| Sprint 10 | Implemented opt-in pre-conversion PDF chunking with `pypdf`, temporary chunk PDF cleanup, `--chunk-pages [PAGES]`, chunk metadata/report context, and mocked tests. | `docs/Sprints/SPRINT10CONTRACT.md`, `src/pdf2md/pdf_splitter.py` |
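
The Sprint 10 chunking row describes splitting a PDF into page chunks before conversion. The range-planning step can be sketched as below. This is a minimal illustration, not the code in `src/pdf2md/pdf_splitter.py`: the function name `plan_chunks` and the 0-based half-open range convention are assumptions, and the actual splitting with `pypdf` is left out.

```python
def plan_chunks(total_pages: int, chunk_pages: int) -> list[tuple[int, int]]:
    """Return half-open (start, stop) page ranges that cover the document.

    Hypothetical helper: pages are 0-based and `stop` is exclusive.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if chunk_pages < 1:
        raise ValueError("chunk_pages must be at least 1")
    return [
        (start, min(start + chunk_pages, total_pages))
        for start in range(0, total_pages, chunk_pages)
    ]


if __name__ == "__main__":
    # A 13-page document split into chunks of at most 5 pages.
    print(plan_chunks(13, 5))  # [(0, 5), (5, 10), (10, 13)]
```

Each planned range could then be written to a temporary chunk PDF with `pypdf`'s `PdfReader` and `PdfWriter`, and removed after conversion.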
## Runtime Setup Archive
- OS/workspace: Windows PowerShell in `D:\Work\Repos\AICoding\ConvertPDFToMD`.
- Python target: 3.12.
- Local Python observed during Sprint 0: 3.12.7.
- `uv` installed per-user at `C:\Users\user\.local\bin`.
- GPU target: NVIDIA GTX 1070 Ti 8GB.
- Local GPU observed: NVIDIA GeForce GTX 1070 Ti, driver 577.00, 8192 MiB VRAM, WDDM.
- MinerU execution mode: direct local `mineru` CLI only.
- MinerU 3.1.0 CLI-internal temporary local `mineru-api` is allowed when the CLI runs without `--api-url`.
- GTX 1070 Ti runtime setup used `torch==2.6.0+cu126`, `torchvision==0.21.0+cu126`, and `mineru[core]==3.1.0`.
- MinerU models were downloaded with `uv run mineru-models-download -s huggingface -m all`.
- Runtime model loading uses `MINERU_MODEL_SOURCE=local`.
- Doctor status after setup is WARN because the GTX 1070 Ti is a Pascal (pre-Turing) GPU; the MinerU, CUDA PyTorch, local model config, MathJax checker, and strict-local checks all pass.
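
The execution-mode bullets above allow the direct local `mineru` CLI only when it runs without `--api-url`. A minimal sketch of such an argument check follows. The names `StrictLocalError` and `assert_strict_local` are hypothetical, the example argument list is illustrative, and the real validation in `src/pdf2md/mineru_adapter.py` may cover more cases (backends, router mode) than this one flag.

```python
from collections.abc import Sequence


class StrictLocalError(ValueError):
    """Raised when a MinerU command line would violate the strict-local policy."""


def assert_strict_local(argv: Sequence[str]) -> None:
    """Reject `--api-url`, in both `--api-url URL` and `--api-url=URL` forms."""
    for arg in argv:
        if arg == "--api-url" or arg.startswith("--api-url="):
            raise StrictLocalError(f"strict-local policy forbids {arg!r}")


if __name__ == "__main__":
    # Illustrative command line only; the arguments are assumptions.
    assert_strict_local(["mineru", "-p", "sample.pdf", "-o", "outputs"])
    print("command line is strict-local")
```

Checking the argument list before the subprocess starts keeps the policy failure fast and independent of whatever MinerU itself would do with the flag.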
## Sample Conversion Archive
Generated outputs are ignored under `outputs/` and are not committed.
| Sample | Result | Output Location | Summary |
| --- | --- | --- | --- |
| `samples/MITC공부.pdf` | Completed after CUDA-enabled runtime setup. | `outputs/MITC공부/` | 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 1 info warning at the time of that run because the local MathJax checker was unavailable. |
| `samples/FourNodeQuadrilateralShellElementMITC4.pdf` | Completed with default GPU request and `MINERU_MODEL_SOURCE=local`. | `outputs/FourNodeQuadrilateralShellElementMITC4/` | Report status `success`: 7 pages, 22 assets, 38 inline formulas, 16 display formulas, 0 math render errors, 0 warnings. |
## Historical Verification Highlights
- Sprint 1 scaffold: `uv run pytest` passed 4 tests; `uv run pdf2md --version` printed `pdf2md 0.1.0`.
- Sprint 2 path planning: `uv run pytest tests/test_paths.py` passed 17 tests; full suite passed 21 tests.
- Sprint 3 metadata/IR: targeted IR/metadata tests passed 25 tests; full suite passed 46 tests.
- Sprint 4 MinerU adapter: adapter tests passed 26 tests; full suite passed 72 tests.
- Sprint 5 Markdown normalization: targeted Markdown/IR tests passed 30 tests; full suite passed 89 tests.
- Sprint 6 quality/report: targeted quality/report/metadata tests passed 26 tests; full suite passed 103 tests.
- Sprint 7 conversion orchestration: full suite passed 119 tests after metadata math count fixes.
- Sprint 8 doctor: full suite passed 133 tests.
- Sprint 9 release gate: full suite passed 136 tests with 1 optional skip.
- CUDA runtime rebuild: verified CUDA with an actual tensor operation on `NVIDIA GeForce GTX 1070 Ti`, compute capability 6.1; `mineru --version` reported 3.1.0.
- MathJax checker: `npm run mathjax-checker:health` returned `{"ok":true}` after local `npm install`; full suite passed 150 tests with 1 optional skip after integration.
- Sprint 10 chunking: targeted chunking tests passed 42 tests; full default suite passed 163 tests with 1 optional skip; `git diff --check` passed with line-ending warnings only.
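
The CUDA runtime rebuild entry above was verified with an actual tensor operation, not just an availability flag. A check of that kind might look like the sketch below. The function name `verify_cuda` and the shape of the returned dictionary are assumptions, not the project's doctor code.

```python
def verify_cuda(device: str = "cuda:0") -> dict:
    """Run a small computation on the GPU and report what was observed."""
    try:
        import torch
    except ImportError:
        return {"ok": False, "reason": "torch is not installed"}
    if not torch.cuda.is_available():
        return {"ok": False, "reason": "CUDA is not available to torch"}
    try:
        # Execute a real kernel: 2 * (0 + 1 + 2 + 3) should equal 12.
        x = torch.arange(4, dtype=torch.float32, device=device)
        total = float((x * 2).sum().item())
    except RuntimeError as exc:
        return {"ok": False, "reason": f"tensor operation failed: {exc}"}
    index = torch.device(device).index or 0
    return {
        "ok": total == 12.0,
        "device_name": torch.cuda.get_device_name(index),
        "compute_capability": torch.cuda.get_device_capability(index),
    }


if __name__ == "__main__":
    print(verify_cuda())
```

Running a kernel catches the CPU-only PyTorch case and driver or architecture mismatches that an import check alone would miss.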
## Historical Blockers And Resolutions
- `uv` was initially unavailable; resolved by installing `uv` per-user.
- Initial real sample conversion failed when PyTorch was CPU-only; resolved by rebuilding `.venv` and installing CUDA-enabled PyTorch before MinerU.
- Optional local MinerU fixture checks were originally blocked by missing MinerU CLI; resolved for the local environment after MinerU 3.1.0 and model setup.
- Local MathJax render checking was originally unavailable; resolved after adding the local Node.js MathJax checker and installing local npm dependencies.
- GTX 1070 Ti remains a Pascal/pre-Turing GPU risk. This is a warning, not a current blocker.
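
The Pascal/pre-Turing warning above can be expressed as a comparison on CUDA compute capability. The GTX 1070 Ti reports 6.1, and Turing GPUs report 7.5. The sketch below assumes the threshold sits exactly at Turing and uses the hypothetical name `gpu_severity`; how `pdf2md doctor` actually draws the line is not specified here.

```python
# Compute capability of the Turing generation (assumed warning threshold).
TURING = (7, 5)


def gpu_severity(capability: tuple[int, int]) -> str:
    """Return "WARN" for pre-Turing GPUs and "OK" otherwise."""
    return "WARN" if capability < TURING else "OK"


if __name__ == "__main__":
    print(gpu_severity((6, 1)))  # GTX 1070 Ti (Pascal) -> WARN
    print(gpu_severity((7, 5)))  # Turing -> OK
```

Tuple comparison orders by major version first and then minor, so `(7, 0)` (Volta) is also reported as pre-Turing.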
## Archive Notes For Agents
- Read this file when a task asks for historical implementation state, prior verification, completed sprint context, or sample conversion evidence.
- Prefer `PROGRESS.md` for current state and next actions.
- Prefer sprint contract files for detailed acceptance criteria and handoff structure.
- Prefer `docs/V1RELEASECHECKLIST.md` for release readiness gates.