modify pdftomd

This commit is contained in:
김경종
2026-05-14 10:16:59 +09:00
parent 2232b51fc9
commit dc11880140
69 changed files with 7784 additions and 1150 deletions
+42 -120
View File
@@ -4,7 +4,7 @@ This file is the shared work plan for agents. Read it before starting work, then
## Current Goal
Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 11 MathJax warning mitigation is implemented. On this PC, full local runtime setup is complete in `.venv`; Markdown quality recheck for existing outputs is implemented and now shares the same conservative MathJax repair path as fresh conversion. Next work is optional manual Obsidian quality review, additional sample validation, or broader repair rules if future samples expose new deterministic MathJax failure patterns.
Completed work through Sprint 16, the Sprint 16 SolidElement validation, and the UI direct-folder batch conversion is archived in `docs/WORKARCHIVE.md`. Sprint 17 offline installer planning has been abandoned and is retained only as historical context.
## Active Constraints
@@ -14,142 +14,64 @@ Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 11 MathJax w
- Target Python 3.12.
- Target GPU: GTX 1070 Ti 8GB.
- Default conversion device: `cuda:0`.
- Default MinerU profile: `auto`.
- Run MinerU through direct local CLI execution only.
- UI code must invoke the existing project-owned `pdf2md` CLI; it must not call MinerU directly.
- The current UI executable is a thin launcher for the installed local runtime, not a self-contained bundle of MinerU, PyTorch, CUDA, local models, Node.js, or MathJax.
- UI subprocess calls must use fixed argument lists with `shell=False` and must not expose arbitrary command execution.
- UI folder batch conversion must run direct-child PDFs sequentially through existing `pdf2md convert` commands.
- On MinerU failure, report a clear error/warning and do not silently fallback.
- Write both metadata JSON and a human-readable `.report.md` quality report for conversions.
- Current conversions write simplified Markdown/report outputs with no persisted metadata JSON; internal provenance still feeds warnings and reports.
- `pdf2md recheck` remains legacy-only for outputs that still have adjacent metadata JSON.
- Do not commit generated installer payloads, wheelhouses, Python installers, model files, Node binaries, generated installer executables, `build/`, `dist/`, `outputs/`, or `samples/`.
- Use `samples/` only as local fixture context; do not commit sample files unless explicitly requested.
## Active References
- Product requirements: `PRD.md`.
- System design: `ARCHITECTURE.md`.
- Agent workflow: `AGENTS.md`.
- Current implementation sequence: `docs/V1IMPLEMENTATIONPLAN.md`.
- Completed work archive: `docs/WORKARCHIVE.md`.
- Release gates: `docs/V1RELEASECHECKLIST.md`.
- Completed UI folder batch design and plan: `docs/superpowers/specs/2026-05-13-ui-folder-batch-conversion-design.md` and `docs/superpowers/plans/2026-05-13-ui-folder-batch-conversion.md`.
- Abandoned Sprint 17 historical plan: `docs/Sprints/SPRINT17CONTRACT.md` and `docs/superpowers/plans/2026-05-12-offline-installer.md`.
## Planned Work
1. Use `research-agent` for MinerU 3.1.0 source tracking and official-doc verification.
2. Use `requirements-guard-agent` for cross-document consistency reviews.
3. Use `mineru-integration-agent` for direct local MinerU CLI adapter planning.
4. Use `obsidian-markdown-agent` for math-heavy Obsidian Markdown output planning.
5. Use `metadata-agent` for provenance, warning, JSON metadata, and `.report.md` planning.
6. Use `evaluation-agent` for local fixture coverage and regression criteria.
7. Use `local-setup-agent` for Python 3.12, uv, CUDA, GTX 1070 Ti 8GB, and doctor-check planning.
8. Use `license-privacy-agent` for license and strict-local privacy review.
9. Use `harness-planner-agent` to turn substantial implementation requests into scoped contracts before code work starts.
10. Use `feature-generator-agent` to implement one approved contract at a time after the user explicitly requests implementation.
11. Use `evaluation-agent` as the independent contract reviewer and QA evaluator before and after each implementation chunk.
12. Follow `docs/V1IMPLEMENTATIONPLAN.md` for the v1 implementation sprint sequence.
13. Use `docs/Sprints/SPRINT10CONTRACT.md` for the implemented long-PDF pre-conversion chunking sprint.
14. Use `docs/WORKARCHIVE.md` for completed sprint history, prior verification, runtime setup evidence, and sample conversion evidence.
15. Use `docs/Sprints/SPRINT11CONTRACT.md` for the implemented MathJax warning mitigation sprint.
16. Keep the mitigation path shared by `pdf2md convert` and `pdf2md recheck` so existing Markdown outputs can be cleaned without rerunning MinerU.
1. Keep completed sprint details out of `PROGRESS.md`; use `docs/WORKARCHIVE.md` and `docs/Sprints/*.md` for history.
2. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
3. When practical, run hands-on UI smoke from `dist\pdf2md-ui.exe`: Doctor, then one small local conversion to ignored `outputs/`.
4. On a stronger NVIDIA GPU PC, run `uv run pdf2md doctor` and one optional local conversion with `--gpu auto --mineru-profile auto` to validate the auto profile.
5. Decide in a future sprint whether simplified outputs need metadata-free `pdf2md recheck`; current behavior intentionally remains legacy-only.
## Sprint 11: MathJax Warning Mitigation
## Completed Work References
Objective:
- Implemented a conservative local post-validation cleanup pass that attempts to remove only the specific math-span artifacts responsible for MathJax warnings, then reruns MathJax validation before writing final Markdown, metadata JSON, and report Markdown.
Assumptions:
- MathJax warning mitigation is best-effort and nonfatal.
- The cleanup pass must stay deterministic and local-only.
- Warning reduction must not silently erase meaningful formula content.
- The same behavior should apply to fresh conversions and `pdf2md recheck`.
Planned workflow:
1. Run the existing MathJax renderability check against normalized Markdown and keep failed `MathExpression` records, including index, display mode, Markdown span, and MathJax message.
2. Generate cleanup candidates only for failed spans. Candidate rules should start with narrow, non-semantic fixes such as trimming invisible/control artifacts, removing obvious OCR/extractor debris, normalizing accidental delimiter leftovers, and fixing whitespace/newline forms known to break MathJax.
3. Validate each candidate with the same local MathJax checker. Replace a math span only when the candidate passes and preserves the original inline/display delimiter shape.
4. Rebuild Markdown from approved span replacements and rerun the full quality check on the repaired Markdown.
5. Write metadata/report data from the final Markdown and final quality result. Record unresolved failures as `MATH_RENDER_FAILED`; record applied mitigations in a traceable form so warning counts are not reduced by hiding changes.
Touched surfaces to plan in the sprint contract:
- `src/pdf2md/quality.py`: expose failed math expression details without losing the existing warning behavior.
- `src/pdf2md/math_render.py`: keep MathJax checking local and batch-oriented; do not expose raw MathJax objects as public API.
- New focused module, likely `src/pdf2md/math_repair.py`: own candidate generation, span replacement, and repair result records.
- `src/pdf2md/conversion.py`: run mitigation between normalization and final metadata/report construction for `convert` and `recheck`.
- `src/pdf2md/ir.py`, `src/pdf2md/metadata.py`, and `src/pdf2md/report.py`: update only if the contract decides a new repair warning/info code or summary field is needed.
- Tests in `tests/test_quality.py`, a new `tests/test_math_repair.py`, and targeted conversion/recheck CLI tests.
Non-goals:
- Do not add cloud OCR, remote LLMs, remote render APIs, or external document upload paths.
- Do not add a second conversion engine or runtime engine selection.
- Do not implement a full LaTeX parser, symbolic math simplifier, or Obsidian automation.
- Do not remove whole formulas or meaningful LaTeX tokens solely to silence warnings.
- Do not add new CLI flags unless a later contract explicitly justifies them.
Verification:
- Unit tests for failed-expression capture, candidate generation, safe span replacement, and no-op behavior when no candidate passes.
- Conversion tests proving repaired Markdown is written only after candidate revalidation.
- Recheck tests proving existing output Markdown can be repaired and metadata/report regenerated without rerunning MinerU.
- Report/metadata tests proving remaining warnings and applied mitigations are visible and derived from final state.
- Run `uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py`.
- Run `uv run pytest` before marking the sprint complete.
- Optionally run `uv run pdf2md recheck outputs\MITC공부\MITC공부.md` against ignored local sample output when the user requests real-output validation.
Hard failure criteria:
- The cleanup changes math spans that did not fail MathJax validation.
- The cleanup removes an entire formula or a semantically meaningful token without an explicit trace.
- The cleanup reduces warning counts by dropping warnings instead of producing MathJax-valid Markdown.
- The cleanup makes `pdf2md convert` or `pdf2md recheck` require Node.js/MathJax when they were previously optional.
- Default tests require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- Completed sprint outcomes through Sprint 16 are summarized in `docs/WORKARCHIVE.md`.
- Detailed historical contracts remain under `docs/Sprints/SPRINT0CONTRACT.md` through `docs/Sprints/SPRINT16CONTRACT.md`.
- UI direct-folder batch conversion is archived in `docs/WORKARCHIVE.md`; its design and execution plan live under `docs/superpowers/`.
- Abandoned Sprint 17 offline installer planning is archived in `docs/WORKARCHIVE.md` and must not be treated as active planned work.
- Historical verification results and sample conversion evidence live in `docs/WORKARCHIVE.md`.
## Open Questions
- None.
- Whether metadata-free `pdf2md recheck` should be designed for simplified outputs.
- Whether a stronger NVIDIA GPU PC changes the default practical MinerU profile recommendation after real conversion validation.
## Decisions
- Use `PLAN.md` for intended work and ownership.
- Use `PROGRESS.md` for completed work, current status, blockers, and next actions.
- Use `docs/WORKARCHIVE.md` for archived completed work and historical handoff details.
- Use `PROGRESS.md` for current status, blockers, and next actions.
- Use `docs/WORKARCHIVE.md` for archived completed work, historical verification, runtime setup evidence, and sample conversion evidence.
- MinerU default local CLI execution is the only v1 execution mode.
- MinerU 3.1.0 may launch a temporary local `mineru-api` internally when `mineru` CLI runs without `--api-url`.
- Strict-local mode forbids `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends.
- No silent fallback after MinerU failure.
- Conversion output includes both metadata JSON and `<stem>.report.md`.
- Current conversion output uses `<stem>/<stem>_001.md`, shared `<stem>/images/`, and one `<stem>/<stem>_report.md`; new conversions do not persist metadata JSON.
- Local MathJax render checking is optional and nonfatal; missing Node.js or MathJax must produce a clear warning instead of blocking conversion.
- MathJax warning mitigation must run only after initial local MathJax validation identifies failed math spans.
- MathJax warning mitigation must be deterministic, local-only, and limited to failed math spans.
- Candidate math cleanup must be revalidated with the local MathJax checker before replacing Markdown.
- If no candidate passes validation, keep the original formula and retain the `MATH_RENDER_FAILED` warning.
- Successfully mitigated formulas must remain traceable in metadata/report output; warning reduction must not hide that a formula was changed.
- Sprint 11 uses `MATH_RENDER_REPAIRED` info warnings for applied repair provenance.
- Sprint 11 initial repair rules cover repeated same-direction scripts and truncated array `\end{a}` endings only.
- Project-scoped custom agents live in `.codex/agents/*.toml`.
- Project prompt commands live in `.codex/commands/*.md`.
- Project-specific skills live in `.codex/skills/*/SKILL.md`.
- Project hooks live in `.codex/hooks.json` and `.codex/hooks/*.py`.
- Agent, command, skill, and hook assets are written in English for Codex compatibility.
- Long-running implementation should use a planner/generator/evaluator harness only when the task complexity justifies the overhead.
- Each substantial implementation chunk should have a sprint contract with objective, scope, verification, failure thresholds, and handoff fields.
- Generator agents may self-check, but independent evaluation is required before marking a chunk complete.
- V1 implementation sequencing and sprint contracts live in `docs/V1IMPLEMENTATIONPLAN.md`.
- Concrete sprint contract documents live under `docs/Sprints/`.
- Sprint 2 path planning contract lives at `docs/Sprints/SPRINT2CONTRACT.md`.
- Sprint 3 domain records and metadata contract lives at `docs/Sprints/SPRINT3CONTRACT.md`.
- Sprint 4 MinerU adapter contract lives at `docs/Sprints/SPRINT4CONTRACT.md`.
- Sprint 4 fixes the v1 adapter executable to the direct `mineru` CLI; user-specified alternate executables, including `mineru-api`, are prohibited.
- Sprint 5 Obsidian Markdown normalization and asset link contract lives at `docs/Sprints/SPRINT5CONTRACT.md`.
- Sprint 5 owns Markdown normalization only; it does not write final Markdown files, copy assets, run MinerU, or connect to conversion orchestration.
- Sprint 6 quality checks and report generation contract lives at `docs/Sprints/SPRINT6CONTRACT.md`.
- Sprint 6 owns quality/report boundaries only; it does not write final files, run MinerU, or connect to conversion orchestration.
- Sprint 7 conversion orchestration, CLI, and Python API contract lives at `docs/Sprints/SPRINT7CONTRACT.md`.
- Sprint 7 will be the first implementation sprint allowed to write final Markdown, metadata JSON, report Markdown, and local copied assets as product behavior.
- Sprint 7 implemented conversion orchestration, `convert_pdf`, batch conversion, `pdf2md convert`, output writing, metadata/report writing, and fake-adapter CLI/API tests.
- Sprint 8 should cover `pdf2md doctor` and setup documentation; Sprint 7 intentionally did not add doctor behavior.
- Sprint 8 doctor and setup documentation contract lives at `docs/Sprints/SPRINT8CONTRACT.md`.
- Sprint 8 owns doctor diagnostics and setup docs only; it must not run real MinerU, download models, run sample PDFs, or add runtime remote/API paths in default tests.
- Sprint 8 implements `pdf2md doctor`, local setup diagnostics, and setup documentation without running real MinerU, downloading models, or touching `samples/` in default tests.
- Sprint 9 local fixture evaluation and v1 release gate contract lives at `docs/Sprints/SPRINT9CONTRACT.md`.
- Sprint 9 must keep default tests independent of real MinerU, GPU, models, network, Obsidian, LaTeX tooling, and `samples/`; real MinerU fixture checks must be explicit opt-in only.
- Sprint 9 implements fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and `docs/V1RELEASECHECKLIST.md`.
- `pdf2md convert` defaults to `--gpu cuda:0`.
- The MinerU adapter maps CUDA device requests to local subprocess environment variables instead of adding speculative MinerU CLI flags.
- GTX 1070 Ti local runtime uses PyTorch `2.6.0+cu126` and `torchvision 0.21.0+cu126` installed after `uv sync`, followed by `mineru[core]==3.1.0`.
- MinerU models are downloaded with `mineru-models-download -s huggingface -m all`, and runtime model loading uses `MINERU_MODEL_SOURCE=local`.
- Sprint 10 uses `pypdf` for local PDF page chunk planning and temporary chunk PDF writing.
- Sprint 10 converts chunk PDFs independently and does not merge generated Markdown outputs.
- Chunking is opt-in through `--chunk-pages`; if the option is present without a value, the CLI uses 20 pages per chunk.
- `convert_pdf()` keeps returning `ConversionResult` without chunking and returns `BatchConversionResult` when `chunk_pages` is set.
- Chunk PDFs are temporary local files and are deleted after conversion completes, including when raw MinerU output is retained.
- Chunking remains opt-in through `--chunk-pages`; if the option is present without a value, final grouped outputs use 20 source pages.
- In chunk mode, MinerU receives one source page per run and final Markdown parts are grouped by `chunk_pages`.
- `--gpu auto` selects the visible NVIDIA GPU with the largest local `nvidia-smi` VRAM report.
- `--mineru-profile auto` is the default and stays conservative on GTX 1070 Ti 8GB, low-VRAM, and pre-Turing GPUs.
- The UI launcher can convert a direct folder by running one existing `pdf2md convert` command per direct-child PDF sequentially.
- Sprint 17 offline installer planning is abandoned. Do not implement or extend offline installer work unless the user explicitly reopens that direction.