docs: plan MathJax warning mitigation
@@ -4,7 +4,7 @@ This file is the shared work plan for agents. Read it before starting work, then
## Current Goal
Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conversion PDF chunking is implemented. On this PC, full local runtime setup is complete in `.venv`; Markdown quality recheck for existing outputs is implemented. Next work is optional manual Obsidian quality review, Markdown cleanup for sample warnings, or additional sample validation if requested.
Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conversion PDF chunking is implemented. On this PC, full local runtime setup is complete in `.venv`; Markdown quality recheck for existing outputs is implemented. Next planned work is MathJax warning mitigation: after local MathJax validation, conservatively clean only warning-causing math spans, rerun validation, and preserve provenance for changed or still-failing formulas. Manual Obsidian quality review and sample validation remain optional fallback tasks.
## Active Constraints
@@ -35,10 +35,70 @@ Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conve
12. Follow `docs/V1IMPLEMENTATIONPLAN.md` for the v1 implementation sprint sequence.
13. Use `docs/Sprints/SPRINT10CONTRACT.md` for the implemented long-PDF pre-conversion chunking sprint.
14. Use `docs/WORKARCHIVE.md` for completed sprint history, prior verification, runtime setup evidence, and sample conversion evidence.
15. Plan Sprint 11 for MathJax warning mitigation before code changes start.
16. Create `docs/Sprints/SPRINT11CONTRACT.md` for the mitigation sprint if implementation is requested.
17. Keep the mitigation path shared by `pdf2md convert` and `pdf2md recheck` so existing Markdown outputs can be cleaned without rerunning MinerU.
## Proposed Sprint 11: MathJax Warning Mitigation
Objective:
- Add a conservative local post-validation cleanup pass that attempts to remove only the specific math-span artifacts responsible for MathJax warnings, then reruns MathJax validation before writing final Markdown, metadata JSON, and report Markdown.
Assumptions:
- MathJax warning mitigation is best-effort and nonfatal.
- The cleanup pass must stay deterministic and local-only.
- Warning reduction must not silently erase meaningful formula content.
- The same behavior should apply to fresh conversions and `pdf2md recheck`.
Planned workflow:
1. Run the existing MathJax renderability check against normalized Markdown and keep failed `MathExpression` records, including index, display mode, Markdown span, and MathJax message.
2. Generate cleanup candidates only for failed spans. Candidate rules should start with narrow, non-semantic fixes such as trimming invisible/control artifacts, removing obvious OCR/extractor debris, normalizing accidental delimiter leftovers, and fixing whitespace/newline forms known to break MathJax.
3. Validate each candidate with the same local MathJax checker. Replace a math span only when the candidate passes and preserves the original inline/display delimiter shape.
4. Rebuild Markdown from approved span replacements and rerun the full quality check on the repaired Markdown.
5. Write metadata/report data from the final Markdown and final quality result. Record unresolved failures as `MATH_RENDER_FAILED`; record applied mitigations in a traceable form so warning counts are not reduced by hiding changes.
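
The workflow above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `FailedMath`, `Repair`, `repair_markdown`, and the `candidates_for` / `renders` callables are hypothetical names, and the sketch assumes each failed span carries character offsets that include its `$` or `$$` delimiters.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class FailedMath:
    """Hypothetical record for one span that failed the MathJax check."""

    index: int     # ordinal of the expression in the document
    start: int     # character offset of the span, delimiters included
    end: int       # exclusive end offset
    display: bool  # True for $$...$$, False for $...$
    message: str   # MathJax error message, kept for the report


@dataclass(frozen=True)
class Repair:
    """Provenance for one applied mitigation (hypothetical shape)."""

    index: int
    original: str
    repaired: str


def repair_markdown(
    markdown: str,
    failures: Iterable[FailedMath],
    candidates_for: Callable[[str], Iterable[str]],
    renders: Callable[[str, bool], bool],
) -> tuple[str, list[Repair], list[FailedMath]]:
    """Replace failed spans only when a candidate passes revalidation."""
    repairs: list[Repair] = []
    unresolved: list[FailedMath] = []
    # Go right to left so earlier offsets stay valid after each replacement.
    for failure in sorted(failures, key=lambda f: f.start, reverse=True):
        span = markdown[failure.start:failure.end]
        delim = "$$" if failure.display else "$"
        body = span[len(delim):-len(delim)]
        accepted = next(
            (c for c in candidates_for(body)
             if c != body and renders(c, failure.display)),
            None,
        )
        if accepted is None:
            unresolved.append(failure)  # keep the original formula untouched
            continue
        new_span = f"{delim}{accepted}{delim}"  # same delimiter shape
        markdown = markdown[:failure.start] + new_span + markdown[failure.end:]
        repairs.append(Repair(failure.index, span, new_span))
    return markdown, repairs[::-1], unresolved[::-1]
```

Passing the checker in as a callable keeps the loop deterministic and lets tests substitute a fake for the real local MathJax process. The unresolved list is what would continue to surface as `MATH_RENDER_FAILED`.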
Surfaces the sprint contract should cover:
- `src/pdf2md/quality.py`: expose failed math expression details without losing the existing warning behavior.
- `src/pdf2md/math_render.py`: keep MathJax checking local and batch-oriented; do not expose raw MathJax objects as public API.
- New focused module, likely `src/pdf2md/math_repair.py`: own candidate generation, span replacement, and repair result records.
- `src/pdf2md/conversion.py`: run mitigation between normalization and final metadata/report construction for `convert` and `recheck`.
- `src/pdf2md/ir.py`, `src/pdf2md/metadata.py`, and `src/pdf2md/report.py`: update only if the contract decides a new repair warning/info code or summary field is needed.
- Tests in `tests/test_quality.py`, a new `tests/test_math_repair.py`, and targeted conversion/recheck CLI tests.
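
As an illustration of the candidate generation that the proposed `math_repair.py` would own, here is a hedged sketch of three narrow, non-semantic rules. The rule names, the list of invisible characters, and the choice of blank-line collapsing as a "newline form" are assumptions for the example; the real rule set is meant to be chosen after inspecting actual MathJax failure messages.

```python
import re
import unicodedata

# Invisible characters that can survive PDF extraction (assumed list).
_INVISIBLE = frozenset("\u200b\u200c\u200d\u2060\ufeff\u00ad")


def strip_invisible(tex: str) -> str:
    """Drop zero-width characters and control codes, keeping newline and tab."""
    return "".join(
        ch for ch in tex
        if ch not in _INVISIBLE
        and (unicodedata.category(ch) != "Cc" or ch in "\n\t")
    )


def collapse_blank_lines(tex: str) -> str:
    """Merge blank lines inside a math body into single newlines."""
    return re.sub(r"\n\s*\n", "\n", tex)


def strip_stray_delimiters(tex: str) -> str:
    """Remove unescaped `$` left at either edge of the body."""
    return re.sub(r"^\s*\$+|(?<!\\)\$+\s*$", "", tex)


RULES = (strip_invisible, collapse_blank_lines, strip_stray_delimiters)


def generate_candidates(tex: str) -> list[str]:
    """Apply the rules cumulatively and return each distinct new form, in order."""
    candidates: list[str] = []
    current = tex
    for rule in RULES:
        current = rule(current)
        if current != tex and current not in candidates:
            candidates.append(current)
    return candidates
```

A body that none of the rules change yields an empty list, so the caller has nothing to validate and the original formula stays in place. An escaped `\$` at the end of a body is left alone because it is a meaningful token.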
Non-goals:
- Do not add cloud OCR, remote LLMs, remote render APIs, or external document upload paths.
- Do not add a second conversion engine or runtime engine selection.
- Do not implement a full LaTeX parser, symbolic math simplifier, or Obsidian automation.
- Do not remove whole formulas or meaningful LaTeX tokens solely to silence warnings.
- Do not add new CLI flags unless a later contract explicitly justifies them.
Verification:
- Unit tests for failed-expression capture, candidate generation, safe span replacement, and no-op behavior when no candidate passes.
- Conversion tests proving repaired Markdown is written only after candidate revalidation.
- Recheck tests proving existing output Markdown can be repaired and metadata/report regenerated without rerunning MinerU.
- Report/metadata tests proving remaining warnings and applied mitigations are visible and derived from final state.
- Run `uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py`.
- Run `uv run pytest` before marking the sprint complete.
- Optionally run `uv run pdf2md recheck outputs\MITC공부\MITC공부.md` against ignored local sample output when the user requests real-output validation.
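
The unit tests listed above could avoid Node.js and MathJax entirely by injecting a fake checker. The following pytest-style sketch shows the idea; `choose_replacement` and `FakeChecker` are hypothetical names used only for this example and do not describe existing code in the repository.

```python
from typing import Callable, Iterable, Optional


def choose_replacement(
    body: str,
    candidates: Iterable[str],
    renders: Callable[[str], bool],
) -> Optional[str]:
    """Return the first changed candidate that passes the checker, else None."""
    for candidate in candidates:
        if candidate != body and renders(candidate):
            return candidate
    return None


class FakeChecker:
    """Stand-in for the local MathJax checker; records every call."""

    def __init__(self, valid: Iterable[str]) -> None:
        self.valid = set(valid)
        self.calls: list[str] = []

    def __call__(self, tex: str) -> bool:
        self.calls.append(tex)
        return tex in self.valid


def test_noop_when_no_candidate_passes() -> None:
    checker = FakeChecker(valid=[])
    assert choose_replacement(r"\frac{1}{", [r"\frac{1}{ "], checker) is None
    assert checker.calls == [r"\frac{1}{ "]  # the candidate was still revalidated


def test_first_passing_candidate_is_used() -> None:
    checker = FakeChecker(valid=["x^2"])
    result = choose_replacement("x\u200b^2", ["x\u200b^2", "x ^2", "x^2"], checker)
    assert result == "x^2"
    assert checker.calls == ["x ^2", "x^2"]  # the unchanged candidate is skipped
```

Recording the calls lets a test assert that nothing is written without revalidation, which is the property the conversion tests are meant to prove.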
Hard failure criteria:
- The cleanup changes math spans that did not fail MathJax validation.
- The cleanup removes an entire formula or a semantically meaningful token without an explicit trace.
- The cleanup reduces warning counts by dropping warnings instead of producing MathJax-valid Markdown.
- The cleanup makes `pdf2md convert` or `pdf2md recheck` require Node.js/MathJax when they were previously optional.
- Default tests require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
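
The first criterion, that cleanup must not touch spans that passed validation, lends itself to a mechanical check. The sketch below is one possible guard, assuming failed spans are available as `(start, end)` character offsets into the original Markdown; the function names are hypothetical.

```python
from typing import Sequence


def outside_segments(text: str, spans: Sequence[tuple[int, int]]) -> list[str]:
    """Return the pieces of `text` that lie outside the given spans, in order."""
    segments: list[str] = []
    cursor = 0
    for start, end in sorted(spans):
        segments.append(text[cursor:start])
        cursor = end
    segments.append(text[cursor:])
    return segments


def only_failed_spans_changed(
    original: str,
    repaired: str,
    failed_spans: Sequence[tuple[int, int]],
) -> bool:
    """Check that all text outside the failed spans survives unchanged, in order."""
    if not failed_spans:
        return original == repaired
    segments = outside_segments(original, failed_spans)
    # The text before the first failed span must be an exact prefix.
    if not repaired.startswith(segments[0]):
        return False
    cursor = len(segments[0])
    # Text between failed spans must reappear in order after the cursor.
    for segment in segments[1:-1]:
        found = repaired.find(segment, cursor)
        if found < 0:
            return False
        cursor = found + len(segment)
    # The text after the last failed span must be an exact suffix.
    last = segments[-1]
    return repaired.endswith(last) and len(repaired) - len(last) >= cursor
```

A conversion or recheck test could assert this on every repaired document, so a rule that accidentally rewrites a passing formula fails loudly instead of slipping through.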
## Open Questions
- None.
- Which exact cleanup rules should Sprint 11 allow after inspecting current MathJax failure messages? Recommendation: start with deterministic non-semantic artifacts only.
- Should applied mitigations use a new stable warning/info code or be represented through existing metadata/report fields? Recommendation: make repair provenance visible without counting a successfully repaired expression as a render failure.
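
To make the second question concrete, one option is to keep repairs in their own metadata list, separate from warnings. The sketch below only illustrates that shape: `MathRepairRecord`, `build_math_summary`, and the `math_repairs` / `expression_index` keys are hypothetical, and only the `MATH_RENDER_FAILED` code comes from the plan itself.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass(frozen=True)
class MathRepairRecord:
    """Hypothetical provenance entry for one repaired expression."""

    index: int     # ordinal of the expression in the document
    rule: str      # name of the cleanup rule that produced the fix
    original: str  # span as emitted by the converter
    repaired: str  # span written to the final Markdown


def build_math_summary(
    unresolved_indices: Iterable[int],
    repairs: Iterable[MathRepairRecord],
) -> dict:
    """Keep unresolved failures as warnings and repairs as separate provenance."""
    return {
        "warnings": [
            {"code": "MATH_RENDER_FAILED", "expression_index": i}
            for i in unresolved_indices
        ],
        "math_repairs": [asdict(r) for r in repairs],
    }


summary = build_math_summary(
    unresolved_indices=[7],
    repairs=[MathRepairRecord(3, "strip_invisible", "$x\u200b$", "$x$")],
)
print(json.dumps(summary, indent=2))
```

With this layout the warning count reflects only formulas that still fail, while every changed formula remains visible in the output.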
## Decisions
@@ -51,6 +111,11 @@ Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conve
- No silent fallback after MinerU failure.
- Conversion output includes both metadata JSON and `<stem>.report.md`.
- Local MathJax render checking is optional and nonfatal; missing Node.js or MathJax must produce a clear warning instead of blocking conversion.
- MathJax warning mitigation must run only after initial local MathJax validation identifies failed math spans.
- MathJax warning mitigation must be deterministic, local-only, and limited to failed math spans.
- Candidate math cleanup must be revalidated with the local MathJax checker before replacing Markdown.
- If no candidate passes validation, keep the original formula and retain the `MATH_RENDER_FAILED` warning.
- Successfully mitigated formulas must remain traceable in metadata/report output; warning reduction must not hide that a formula was changed.
- Project-scoped custom agents live in `.codex/agents/*.toml`.
- Project prompt commands live in `.codex/commands/*.md`.
- Project-specific skills live in `.codex/skills/*/SKILL.md`.
+7
-4
@@ -48,6 +48,7 @@ This file records current progress for agents. Read it before starting work, the
- Added `recheck_markdown()` and `pdf2md recheck <markdown.md>` to rerun local quality checks for an existing generated Markdown file and rewrite the adjacent metadata JSON and `.report.md` without rerunning MinerU.
- Verified `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`; the command regenerated the metadata and report and reported 2 warnings, because the current Markdown still contains the two MathJax-invalid expressions.
- Reconverted `samples/MITC공부.pdf` with `--overwrite` to ignored `outputs/MITC공부/`; report status remains `partial`: 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 2 MathJax render warnings, and 0 missing or invalid asset links.
- Added a `PLAN.md` Sprint 11 proposal for conservative MathJax warning mitigation after validation; no implementation code has been started.
## In Progress
@@ -59,7 +60,9 @@ This file records current progress for agents. Read it before starting work, the
## Next Actions
1. Manually fix the two MathJax-invalid expressions in `outputs/MITC공부/MITC공부.md` if a warning-free local report is desired, then run `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`.
2. Review generated sample Markdown outputs in Obsidian if visual quality needs manual assessment.
3. Run optional real local chunked conversion on a long sample only if requested.
4. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
1. If implementation is requested, write `docs/Sprints/SPRINT11CONTRACT.md` for MathJax warning mitigation before code changes start.
2. Inspect the current MathJax failure messages from `outputs/MITC공부/MITC공부.md` to choose the narrow initial cleanup rule set.
3. Manually fix the two MathJax-invalid expressions in `outputs/MITC공부/MITC공부.md` only if a warning-free local report is desired before Sprint 11 exists, then run `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`.
4. Review generated sample Markdown outputs in Obsidian if visual quality needs manual assessment.
5. Run optional real local chunked conversion on a long sample only if requested.
6. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.