From 005f17bac198a525968266559cbab4815ec956bb Mon Sep 17 00:00:00 2001
From: NINI
Date: Mon, 11 May 2026 01:40:51 +0900
Subject: [PATCH] docs: plan MathJax warning mitigation

---
 PLAN.md     | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 PROGRESS.md | 11 +++++----
 2 files changed, 74 insertions(+), 6 deletions(-)

diff --git a/PLAN.md b/PLAN.md
index 5e9a64e..c43caa5 100644
--- a/PLAN.md
+++ b/PLAN.md
@@ -4,7 +4,7 @@ This file is the shared work plan for agents. Read it before starting work, then

 ## Current Goal

-Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conversion PDF chunking is implemented. On this PC, full local runtime setup is complete in `.venv`; Markdown quality recheck for existing outputs is implemented. Next work is optional manual Obsidian quality review, Markdown cleanup for sample warnings, or additional sample validation if requested.
+Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conversion PDF chunking is implemented. On this PC, full local runtime setup is complete in `.venv`; Markdown quality recheck for existing outputs is implemented. Next planned work is MathJax warning mitigation: after local MathJax validation, conservatively clean only warning-causing math spans, rerun validation, and preserve provenance for changed or still-failing formulas. Manual Obsidian quality review and sample validation remain optional fallback tasks.

 ## Active Constraints

@@ -35,10 +35,70 @@ Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conve
 12. Follow `docs/V1IMPLEMENTATIONPLAN.md` for the v1 implementation sprint sequence.
 13. Use `docs/Sprints/SPRINT10CONTRACT.md` for the implemented long-PDF pre-conversion chunking sprint.
 14. Use `docs/WORKARCHIVE.md` for completed sprint history, prior verification, runtime setup evidence, and sample conversion evidence.
+15. Plan Sprint 11 for MathJax warning mitigation before code changes start.
+16. Create `docs/Sprints/SPRINT11CONTRACT.md` for the mitigation sprint if implementation is requested.
+17. Keep the mitigation path shared by `pdf2md convert` and `pdf2md recheck` so existing Markdown outputs can be cleaned without rerunning MinerU.
+
+## Proposed Sprint 11: MathJax Warning Mitigation
+
+Objective:
+
+- Add a conservative local post-validation cleanup pass that attempts to remove only the specific math-span artifacts responsible for MathJax warnings, then reruns MathJax validation before writing final Markdown, metadata JSON, and report Markdown.
+
+Assumptions:
+
+- MathJax warning mitigation is best-effort and nonfatal.
+- The cleanup pass must stay deterministic and local-only.
+- Warning reduction must not silently erase meaningful formula content.
+- The same behavior should apply to fresh conversions and `pdf2md recheck`.
+
+Planned workflow:
+
+1. Run the existing MathJax renderability check against normalized Markdown and keep failed `MathExpression` records, including index, display mode, Markdown span, and MathJax message.
+2. Generate cleanup candidates only for failed spans. Candidate rules should start with narrow, non-semantic fixes such as trimming invisible/control artifacts, removing obvious OCR/extractor debris, normalizing accidental delimiter leftovers, and fixing whitespace/newline forms known to break MathJax.
+3. Validate each candidate with the same local MathJax checker. Replace a math span only when the candidate passes and preserves the original inline/display delimiter shape.
+4. Rebuild Markdown from approved span replacements and rerun the full quality check on the repaired Markdown.
+5. Write metadata/report data from the final Markdown and final quality result. Record unresolved failures as `MATH_RENDER_FAILED`; record applied mitigations in a traceable form so warning counts are not reduced by hiding changes.
+
+Touched surfaces to plan in the sprint contract:
+
+- `src/pdf2md/quality.py`: expose failed math expression details without losing the existing warning behavior.
+- `src/pdf2md/math_render.py`: keep MathJax checking local and batch-oriented; do not expose raw MathJax objects as public API.
+- New focused module, likely `src/pdf2md/math_repair.py`: own candidate generation, span replacement, and repair result records.
+- `src/pdf2md/conversion.py`: run mitigation between normalization and final metadata/report construction for `convert` and `recheck`.
+- `src/pdf2md/ir.py`, `src/pdf2md/metadata.py`, and `src/pdf2md/report.py`: update only if the contract decides a new repair warning/info code or summary field is needed.
+- Tests in `tests/test_quality.py`, a new `tests/test_math_repair.py`, and targeted conversion/recheck CLI tests.
+
+Non-goals:
+
+- Do not add cloud OCR, remote LLMs, remote render APIs, or external document upload paths.
+- Do not add a second conversion engine or runtime engine selection.
+- Do not implement a full LaTeX parser, symbolic math simplifier, or Obsidian automation.
+- Do not remove whole formulas or meaningful LaTeX tokens solely to silence warnings.
+- Do not add new CLI flags unless a later contract explicitly justifies them.
+
+Verification:
+
+- Unit tests for failed-expression capture, candidate generation, safe span replacement, and no-op behavior when no candidate passes.
+- Conversion tests proving repaired Markdown is written only after candidate revalidation.
+- Recheck tests proving existing output Markdown can be repaired and metadata/report regenerated without rerunning MinerU.
+- Report/metadata tests proving remaining warnings and applied mitigations are visible and derived from final state.
+- Run `uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py`.
+- Run `uv run pytest` before marking the sprint complete.
+- Optionally run `uv run pdf2md recheck outputs\MITC공부\MITC공부.md` against ignored local sample output when the user requests real-output validation.
+
+Hard failure criteria:
+
+- The cleanup changes math spans that did not fail MathJax validation.
+- The cleanup removes an entire formula or a semantically meaningful token without an explicit trace.
+- The cleanup reduces warning counts by dropping warnings instead of producing MathJax-valid Markdown.
+- The cleanup makes `pdf2md convert` or `pdf2md recheck` require Node.js/MathJax when they were previously optional.
+- Default tests require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.

 ## Open Questions

-- None.
+- Which exact cleanup rules should Sprint 11 allow after inspecting current MathJax failure messages? Recommendation: start with deterministic non-semantic artifacts only.
+- Should applied mitigations use a new stable warning/info code or be represented through existing metadata/report fields? Recommendation: make repair provenance visible without counting a successfully repaired expression as a render failure.

 ## Decisions

@@ -51,6 +111,11 @@ Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conve
 - No silent fallback after MinerU failure.
 - Conversion output includes both metadata JSON and `.report.md`.
 - Local MathJax render checking is optional and nonfatal; missing Node.js or MathJax must produce a clear warning instead of blocking conversion.
+- MathJax warning mitigation must run only after initial local MathJax validation identifies failed math spans.
+- MathJax warning mitigation must be deterministic, local-only, and limited to failed math spans.
+- Candidate math cleanup must be revalidated with the local MathJax checker before replacing Markdown.
+- If no candidate passes validation, keep the original formula and retain the `MATH_RENDER_FAILED` warning.
+- Successfully mitigated formulas must remain traceable in metadata/report output; warning reduction must not hide that a formula was changed.
 - Project-scoped custom agents live in `.codex/agents/*.toml`.
 - Project prompt commands live in `.codex/commands/*.md`.
 - Project-specific skills live in `.codex/skills/*/SKILL.md`.
diff --git a/PROGRESS.md b/PROGRESS.md
index 3b19f52..ae8e0a8 100644
--- a/PROGRESS.md
+++ b/PROGRESS.md
@@ -48,6 +48,7 @@ This file records current progress for agents. Read it before starting work, the
 - Added `recheck_markdown()` and `pdf2md recheck ` to rerun local quality checks for an existing generated Markdown file and rewrite the adjacent metadata JSON and `.report.md` without rerunning MinerU.
 - Verified `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`; the command regenerated metadata/report and still reported 2 warnings because the current Markdown still contains the two MathJax-invalid expressions.
 - Reconverted `samples/MITC공부.pdf` with `--overwrite` to ignored `outputs/MITC공부/`; report status remains `partial`: 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 2 MathJax render warnings, and 0 missing or invalid asset links.
+- Added a `PLAN.md` Sprint 11 proposal for conservative MathJax warning mitigation after validation; no implementation code has been started.

 ## In Progress

@@ -59,7 +60,9 @@ This file records current progress for agents. Read it before starting work, the

 ## Next Actions

-1. Manually fix the two MathJax-invalid expressions in `outputs/MITC공부/MITC공부.md` if a warning-free local report is desired, then run `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`.
-2. Review generated sample Markdown outputs in Obsidian if visual quality needs manual assessment.
-3. Run optional real local chunked conversion on a long sample only if requested.
-4. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
+1. If implementation is requested, write `docs/Sprints/SPRINT11CONTRACT.md` for MathJax warning mitigation before code changes start.
+2. Inspect the current MathJax failure messages from `outputs/MITC공부/MITC공부.md` to choose the narrow initial cleanup rule set.
+3. Manually fix the two MathJax-invalid expressions in `outputs/MITC공부/MITC공부.md` only if a warning-free local report is desired before Sprint 11 exists, then run `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`.
+4. Review generated sample Markdown outputs in Obsidian if visual quality needs manual assessment.
+5. Run optional real local chunked conversion on a long sample only if requested.
+6. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
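
Note on the "Planned workflow" in the Sprint 11 proposal: steps 2 to 4 (candidates for failed spans only, revalidation, span replacement with provenance) could be sketched roughly as below. This is only an illustration under stated assumptions, not the project's implementation. `FailedSpan`, `Repair`, `candidates`, `repair_markdown`, and the `is_renderable` callback are hypothetical names; the real `MathExpression` record and the MathJax checker in `src/pdf2md/math_render.py` may look different, and the two cleanup rules shown are placeholders until the actual failure messages have been inspected.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class FailedSpan:
    """Hypothetical stand-in for a failed math expression record."""
    index: int      # ordinal of the expression in the document
    start: int      # offset of the opening delimiter
    end: int        # offset just past the closing delimiter
    display: bool   # True for $$...$$, False for $...$
    message: str    # checker error text, kept for the report


@dataclass(frozen=True)
class Repair:
    """Provenance for one applied mitigation."""
    index: int
    original: str
    replacement: str
    rule: str


# Zero-width characters and control characters other than tab, LF, and CR.
_INVISIBLE = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff\x00-\x08\x0b\x0c\x0e-\x1f]")


def candidates(body: str) -> Iterable[tuple[str, str]]:
    """Yield (rule, candidate) pairs, narrowest rule first (assumed rule set)."""
    stripped = _INVISIBLE.sub("", body)
    if stripped != body:
        yield "strip-invisible", stripped
    collapsed = re.sub(r"\s*\n\s*", " ", stripped).strip()
    if collapsed != stripped:
        yield "collapse-newlines", collapsed


def repair_markdown(
    markdown: str,
    failed: list[FailedSpan],
    is_renderable: Callable[[str, bool], bool],
) -> tuple[str, list[Repair]]:
    """Replace only failed spans whose candidate passes revalidation."""
    out = markdown
    repairs: list[Repair] = []
    # Work from the end of the document so earlier offsets stay valid.
    for span in sorted(failed, key=lambda s: s.start, reverse=True):
        delim = "$$" if span.display else "$"
        original = markdown[span.start:span.end]
        body = original[len(delim):-len(delim)]
        for rule, candidate in candidates(body):
            if is_renderable(candidate, span.display):
                # Keep the original inline/display delimiter shape.
                replacement = f"{delim}{candidate}{delim}"
                out = out[:span.start] + replacement + out[span.end:]
                repairs.append(Repair(span.index, original, replacement, rule))
                break
        # No passing candidate: the span stays untouched and the caller
        # keeps its MATH_RENDER_FAILED warning.
    repairs.reverse()
    return out, repairs
```

Injecting the checker as a callback keeps the sketch testable without Node.js or MathJax, in line with the hard failure criterion about default tests. The caller would still rerun the full quality check on the returned Markdown before writing metadata and the report.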