docs: plan MathJax warning mitigation

NINI
2026-05-11 01:40:51 +09:00
parent c77db658e7
commit 005f17bac1
2 changed files with 74 additions and 6 deletions
+67 -2
@@ -4,7 +4,7 @@ This file is the shared work plan for agents. Read it before starting work, then
## Current Goal
- Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conversion PDF chunking is implemented. On this PC, full local runtime setup is complete in `.venv`; Markdown quality recheck for existing outputs is implemented. Next work is optional manual Obsidian quality review, Markdown cleanup for sample warnings, or additional sample validation if requested.
+ Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conversion PDF chunking is implemented. On this PC, full local runtime setup is complete in `.venv`; Markdown quality recheck for existing outputs is implemented. Next planned work is MathJax warning mitigation: after local MathJax validation, conservatively clean only warning-causing math spans, rerun validation, and preserve provenance for changed or still-failing formulas. Manual Obsidian quality review and sample validation remain optional fallback tasks.
## Active Constraints
@@ -35,10 +35,70 @@ Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conve
12. Follow `docs/V1IMPLEMENTATIONPLAN.md` for the v1 implementation sprint sequence.
13. Use `docs/Sprints/SPRINT10CONTRACT.md` for the implemented long-PDF pre-conversion chunking sprint.
14. Use `docs/WORKARCHIVE.md` for completed sprint history, prior verification, runtime setup evidence, and sample conversion evidence.
15. Plan Sprint 11 for MathJax warning mitigation before code changes start.
16. Create `docs/Sprints/SPRINT11CONTRACT.md` for the mitigation sprint if implementation is requested.
17. Keep the mitigation path shared by `pdf2md convert` and `pdf2md recheck` so existing Markdown outputs can be cleaned without rerunning MinerU.
## Proposed Sprint 11: MathJax Warning Mitigation
Objective:
- Add a conservative local post-validation cleanup pass that attempts to remove only the specific math-span artifacts responsible for MathJax warnings, then reruns MathJax validation before writing final Markdown, metadata JSON, and report Markdown.
Assumptions:
- MathJax warning mitigation is best-effort and nonfatal.
- The cleanup pass must stay deterministic and local-only.
- Warning reduction must not silently erase meaningful formula content.
- The same behavior should apply to fresh conversions and `pdf2md recheck`.
Planned workflow:
1. Run the existing MathJax renderability check against normalized Markdown and keep failed `MathExpression` records, including index, display mode, Markdown span, and MathJax message.
2. Generate cleanup candidates only for failed spans. Candidate rules should start with narrow, non-semantic fixes such as trimming invisible/control artifacts, removing obvious OCR/extractor debris, normalizing accidental delimiter leftovers, and fixing whitespace/newline forms known to break MathJax.
3. Validate each candidate with the same local MathJax checker. Replace a math span only when the candidate passes and preserves the original inline/display delimiter shape.
4. Rebuild Markdown from approved span replacements and rerun the full quality check on the repaired Markdown.
5. Write metadata/report data from the final Markdown and final quality result. Record unresolved failures as `MATH_RENDER_FAILED`; record applied mitigations in a traceable form so warning counts are not reduced by hiding changes.
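The workflow above can be sketched as a small repair pass. This is a minimal illustration under stated assumptions, not the planned implementation: `FailedMath`, `Repair`, `candidates`, and `repair_markdown` are hypothetical names, the two cleanup rules are placeholders for whatever the sprint contract allows, math spans are assumed to use `$...$` and `$$...$$` delimiters, and `is_renderable` stands in for the existing local MathJax checker.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass(frozen=True)
class FailedMath:
    """Hypothetical record for one math span that failed validation."""
    index: int
    display: bool  # True for $$...$$, False for $...$ (assumed delimiters)
    start: int     # span offsets into the normalized Markdown
    end: int
    message: str


@dataclass(frozen=True)
class Repair:
    """Hypothetical provenance record for one applied mitigation."""
    index: int
    rule: str
    original: str
    repaired: str


def candidates(body: str) -> Iterator[tuple[str, str]]:
    """Yield (rule, candidate) pairs; both rules are illustrative only."""
    # Drop control/format characters (e.g. zero-width spaces), keep \n and \t.
    cleaned = "".join(
        ch for ch in body
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    if cleaned != body:
        yield "strip-control-chars", cleaned
    # Collapse blank or whitespace-only lines inside the formula.
    collapsed = re.sub(r"\n(?:[ \t]*\n)+", "\n", cleaned)
    if collapsed != cleaned:
        yield "collapse-blank-lines", collapsed


def repair_markdown(
    markdown: str,
    failures: list[FailedMath],
    is_renderable: Callable[[str, bool], bool],
) -> tuple[str, list[Repair]]:
    """Replace only failed spans, and only with candidates that revalidate."""
    repaired = markdown
    repairs: list[Repair] = []
    # Work right to left so earlier offsets stay valid after each replacement.
    for failure in sorted(failures, key=lambda f: f.start, reverse=True):
        span = markdown[failure.start:failure.end]
        delim = "$$" if failure.display else "$"
        body = span[len(delim):-len(delim)]
        for rule, candidate in candidates(body):
            if is_renderable(candidate, failure.display):
                new_span = delim + candidate + delim  # same delimiter shape
                repaired = (
                    repaired[:failure.start] + new_span + repaired[failure.end:]
                )
                repairs.append(Repair(failure.index, rule, span, new_span))
                break  # first passing candidate wins; otherwise keep original
    repairs.reverse()  # report repairs in document order
    return repaired, repairs
```

Passing the checker in as a callable keeps the sketch testable without Node.js or MathJax. The full quality check would still be rerun on the returned Markdown, as step 4 requires.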
Touched surfaces to plan in the sprint contract:
- `src/pdf2md/quality.py`: expose failed math expression details without losing the existing warning behavior.
- `src/pdf2md/math_render.py`: keep MathJax checking local and batch-oriented; do not expose raw MathJax objects as public API.
- New focused module, likely `src/pdf2md/math_repair.py`: own candidate generation, span replacement, and repair result records.
- `src/pdf2md/conversion.py`: run mitigation between normalization and final metadata/report construction for `convert` and `recheck`.
- `src/pdf2md/ir.py`, `src/pdf2md/metadata.py`, and `src/pdf2md/report.py`: update only if the contract decides a new repair warning/info code or summary field is needed.
- Tests in `tests/test_quality.py`, a new `tests/test_math_repair.py`, and targeted conversion/recheck CLI tests.
Non-goals:
- Do not add cloud OCR, remote LLMs, remote render APIs, or external document upload paths.
- Do not add a second conversion engine or runtime engine selection.
- Do not implement a full LaTeX parser, symbolic math simplifier, or Obsidian automation.
- Do not remove whole formulas or meaningful LaTeX tokens solely to silence warnings.
- Do not add new CLI flags unless a later contract explicitly justifies them.
Verification:
- Unit tests for failed-expression capture, candidate generation, safe span replacement, and no-op behavior when no candidate passes.
- Conversion tests proving repaired Markdown is written only after candidate revalidation.
- Recheck tests proving existing output Markdown can be repaired and metadata/report regenerated without rerunning MinerU.
- Report/metadata tests proving remaining warnings and applied mitigations are visible and derived from final state.
- Run `uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py`.
- Run `uv run pytest` before marking the sprint complete.
- Optionally run `uv run pdf2md recheck outputs\MITC공부\MITC공부.md` against ignored local sample output when the user requests real-output validation.
Hard failure criteria:
- The cleanup changes math spans that did not fail MathJax validation.
- The cleanup removes an entire formula or a semantically meaningful token without an explicit trace.
- The cleanup reduces warning counts by dropping warnings instead of producing MathJax-valid Markdown.
- The cleanup makes `pdf2md convert` or `pdf2md recheck` require Node.js/MathJax when they were previously optional.
- Default tests require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
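The first criterion lends itself to a mechanical guard in tests. The sketch below is one possible check, assuming failed spans are available as `(start, end)` offsets into the original Markdown; `only_failed_spans_changed` is a hypothetical helper name, not an existing function in the project.

```python
import re


def only_failed_spans_changed(
    original: str,
    repaired: str,
    failed_spans: list[tuple[int, int]],
) -> bool:
    """Return True if `repaired` equals `original` everywhere outside
    the failed spans (non-overlapping offsets into `original`)."""
    pieces = []
    cursor = 0
    for start, end in sorted(failed_spans):
        # Text between failed spans must survive verbatim.
        pieces.append(re.escape(original[cursor:start]))
        cursor = end
    pieces.append(re.escape(original[cursor:]))
    # Each failed span may have been replaced by arbitrary text.
    pattern = "(.*?)".join(pieces)
    return re.fullmatch(pattern, repaired, flags=re.DOTALL) is not None
```

This is a coarse check: it proves that untouched regions are byte-identical, but it says nothing about whether the replacement inside a failed span preserved meaning, which still needs the traceable repair records.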
## Open Questions
- - None.
+ - Which exact cleanup rules should Sprint 11 allow after inspecting current MathJax failure messages? Recommendation: start with deterministic non-semantic artifacts only.
- Should applied mitigations use a new stable warning/info code or be represented through existing metadata/report fields? Recommendation: make repair provenance visible without counting a successfully repaired expression as a render failure.
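One way to picture the second recommendation is a summary built only from the final state, where repaired formulas appear in their own list and do not count as warnings. This is a sketch under assumptions: `summarize_math_quality` and the `math_repairs` field are hypothetical, and only the `MATH_RENDER_FAILED` code comes from the plan.

```python
def summarize_math_quality(
    final_failures: list[tuple[int, str]],
    applied_repairs: list[dict],
) -> dict:
    """Build a summary from the final quality result.

    final_failures: (expression index, MathJax message) pairs that still
    fail after mitigation.
    applied_repairs: provenance dicts for formulas that were changed.
    """
    warnings = [
        {"code": "MATH_RENDER_FAILED", "index": index, "message": message}
        for index, message in final_failures
    ]
    return {
        "warnings": warnings,
        "warning_count": len(warnings),
        # Hypothetical field: repairs stay visible but are not warnings.
        "math_repairs": [dict(repair) for repair in applied_repairs],
    }
```

Whether this ends up as a new field or a new info code is exactly what the open question leaves to the sprint contract.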
## Decisions
@@ -51,6 +111,11 @@ Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conve
- No silent fallback after MinerU failure.
- Conversion output includes both metadata JSON and `<stem>.report.md`.
- Local MathJax render checking is optional and nonfatal; missing Node.js or MathJax must produce a clear warning instead of blocking conversion.
- MathJax warning mitigation must run only after initial local MathJax validation identifies failed math spans.
- MathJax warning mitigation must be deterministic, local-only, and limited to failed math spans.
- Candidate math cleanup must be revalidated with the local MathJax checker before replacing Markdown.
- If no candidate passes validation, keep the original formula and retain the `MATH_RENDER_FAILED` warning.
- Successfully mitigated formulas must remain traceable in metadata/report output; warning reduction must not hide that a formula was changed.
- Project-scoped custom agents live in `.codex/agents/*.toml`.
- Project prompt commands live in `.codex/commands/*.md`.
- Project-specific skills live in `.codex/skills/*/SKILL.md`.
+7 -4
@@ -48,6 +48,7 @@ This file records current progress for agents. Read it before starting work, the
- Added `recheck_markdown()` and `pdf2md recheck <markdown.md>` to rerun local quality checks for an existing generated Markdown file and rewrite the adjacent metadata JSON and `.report.md` without rerunning MinerU.
- Verified `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`; the command regenerated metadata/report and still reported 2 warnings because the current Markdown still contains the two MathJax-invalid expressions.
- Reconverted `samples/MITC공부.pdf` with `--overwrite` to ignored `outputs/MITC공부/`; report status remains `partial`: 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 2 MathJax render warnings, and 0 missing or invalid asset links.
- Added a `PLAN.md` Sprint 11 proposal for conservative MathJax warning mitigation after validation; no implementation work has started.
## In Progress
@@ -59,7 +60,9 @@ This file records current progress for agents. Read it before starting work, the
## Next Actions
- 1. Manually fix the two MathJax-invalid expressions in `outputs/MITC공부/MITC공부.md` if a warning-free local report is desired, then run `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`.
- 2. Review generated sample Markdown outputs in Obsidian if visual quality needs manual assessment.
- 3. Run optional real local chunked conversion on a long sample only if requested.
- 4. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
+ 1. If implementation is requested, write `docs/Sprints/SPRINT11CONTRACT.md` for MathJax warning mitigation before code changes start.
+ 2. Inspect the current MathJax failure messages from `outputs/MITC공부/MITC공부.md` to choose the narrow initial cleanup rule set.
+ 3. Manually fix the two MathJax-invalid expressions in `outputs/MITC공부/MITC공부.md` only if a warning-free local report is desired before Sprint 11 exists, then run `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`.
+ 4. Review generated sample Markdown outputs in Obsidian if visual quality needs manual assessment.
+ 5. Run optional real local chunked conversion on a long sample only if requested.
+ 6. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.