feat: mitigate MathJax formula warnings

NINI
2026-05-11 02:08:46 +09:00
parent 005f17bac1
commit 71e6fbcc51
12 changed files with 625 additions and 41 deletions
+181
@@ -0,0 +1,181 @@
# Sprint 11 Contract: MathJax Warning Mitigation
Status: Implemented
Last updated: 2026-05-11
## Objective
Add a conservative local cleanup pass for MathJax-invalid formulas:
1. Run the existing MathJax renderability check on normalized Markdown.
2. Build repair candidates only for expressions that failed MathJax validation.
3. Re-check each candidate with the same local checker.
4. Replace only candidates that pass.
5. Re-run final quality checks before writing Markdown, metadata JSON, and report Markdown.
The feature should reduce `MATH_RENDER_FAILED` warnings without hiding that a formula was changed.
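The first four steps can be sketched as a small loop. This is a minimal illustration, not the code in `math_repair.py`: the `Checker` contract (return `None` on success, an error message otherwise), `repair_failed_expressions`, `fake_checker`, and `insert_empty_group` are all hypothetical names introduced here. Step 5 (re-running final quality checks) is left out.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical checker contract: None means the expression renders,
# any string is the checker's error message.
Checker = Callable[[str], Optional[str]]
CandidateRule = Callable[[str], Iterable[str]]


def repair_failed_expressions(
    expressions: List[str], checker: Checker, rule: CandidateRule
) -> Tuple[List[str], List[Tuple[int, str, str]]]:
    """Return the final expressions plus (index, original, repaired) records."""
    final = list(expressions)
    applied = []
    for index, original in enumerate(expressions):
        if checker(original) is None:
            continue  # Steps 1-2: passing expressions never get candidates.
        for candidate in rule(original):
            # Step 3: the candidate must pass the same checker.
            if candidate != original and checker(candidate) is None:
                final[index] = candidate  # Step 4: replace only on success.
                applied.append((index, original, candidate))
                break
    return final, applied


def fake_checker(expression: str) -> Optional[str]:
    """Stand-in for the local MathJax checker; flags one known-bad pattern."""
    if "^ {i} ^ {t}" in expression:
        return "Double exponent: use braces to clarify"
    return None


def insert_empty_group(expression: str) -> List[str]:
    """Illustrative candidate rule for the repeated-superscript case."""
    return [expression.replace("^ {i} ^ {t}", "^ {i} {} ^ {t}")]


final, applied = repair_failed_expressions(
    ["a + b", "x ^ {i} ^ {t}"], fake_checker, insert_empty_group
)
```

Because the checker is injected, the same loop can run against a fake in default tests and against the real local checker during conversion. The `applied` records are what would feed provenance warnings.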
## Current Precondition
- `pdf2md convert` writes normalized Markdown, metadata JSON, and `<stem>.report.md`.
- `pdf2md recheck` can rerun quality checks for an existing generated Markdown file without rerunning MinerU.
- Local MathJax checking is already optional and nonfatal.
- `outputs/MITC공부/MITC공부.md` currently has two MathJax render failures:
- expression 8: `Double exponent: use braces to clarify`
- expression 83: `Unknown environment 'a'`
- `samples/MITC공부.pdf` is the requested real local validation sample.
## Touched Surfaces
Allowed during implementation:
- `src/pdf2md/quality.py`
- `src/pdf2md/math_repair.py`
- `src/pdf2md/conversion.py`
- `src/pdf2md/ir.py`
- `tests/test_quality.py`
- `tests/test_math_repair.py`
- `tests/test_conversion.py`
- `tests/test_cli.py`
- `docs/Sprints/SPRINT11CONTRACT.md`
- `PLAN.md`
- `PROGRESS.md`
Not allowed:
- Remote OCR, remote LLMs, remote render APIs, or external document upload paths.
- Alternate PDF conversion engines.
- Switchable conversion-engine behavior.
- A full LaTeX parser or symbolic math rewrite engine.
- New CLI flags unless a later user request explicitly asks for them.
- Mandatory default tests that require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- Committed files under `samples/`.
- Committed generated conversion outputs under `outputs/`.
## Product Behavior
Repair activation:
- Repair runs automatically when a local math checker is available and at least one math expression fails validation.
- If the checker is unavailable, behavior remains unchanged: conversion/recheck continues with an info-level unavailable-checker warning.
- The same repair path applies to fresh `convert` output and existing Markdown processed through `recheck`.
Initial deterministic repair rules:
- Repeated same-direction script repair:
- Convert consecutive superscripts/subscripts such as `^ {i} ^ {t}` to `^ {i} {} ^ {t}`.
- This resolves MathJax's double superscript/subscript error while preserving both script tokens.
- Truncated array environment repair:
- Convert `\end{a}` to `\end{array}` only when the `\begin{array}` and `\end{array}` counts in the expression do not match.
- This targets obvious extraction truncation, not arbitrary environment renaming.
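The two rules above could look roughly like the following string-level helpers. This is a hedged sketch: the function names are hypothetical, the script rule only handles braced arguments without nested braces, and the array rule assumes "unbalanced" means more `\begin{array}` than `\end{array}`. In the real flow, each output would still be a candidate that must pass the checker before it is written back.

```python
import re

# A script marker (^ or _), its braced argument, then the same marker again.
# Assumption: arguments are braced and contain no nested braces.
_REPEATED_SCRIPT = re.compile(r"([\^_])(\s*\{[^{}]*\})\s*(?=\1)")


def repair_repeated_scripts(body: str) -> str:
    """Insert an empty group between consecutive same-direction scripts."""
    return _REPEATED_SCRIPT.sub(r"\1\2 {} ", body)


def repair_truncated_array_end(body: str) -> str:
    """Rewrite a truncated array ending only when array begins outnumber ends."""
    begins = body.count(r"\begin{array}")
    ends = body.count(r"\end{array}")
    missing = begins - ends
    if missing <= 0:
        return body  # Balanced environments: leave the expression alone.
    # Replace at most as many truncated endings as are actually missing.
    return body.replace(r"\end{a}", r"\end{array}", missing)


scripts_fixed = repair_repeated_scripts("x ^ {i} ^ {t}")
array_fixed = repair_truncated_array_end(r"\begin{array}{c} a \end{a}")
```

Mixed scripts such as `^ {i} _ {j}` are left untouched because the lookahead requires the same marker to follow.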
Provenance:
- Applied repairs produce `MATH_RENDER_REPAIRED` info warnings.
- Successfully repaired expressions must not count toward `math_render_error_count`.
- Unrepaired expressions keep the original `MATH_RENDER_FAILED` warning behavior.
- The report remains derived from metadata and local quality checks.
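As an illustration of the counting rule, the sketch below derives the error count from warning codes alone. The `QualityWarning` record, its fields, the `"warning"` severity string, and the sample messages are assumptions made for this example, not the project's actual metadata schema.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class QualityWarning:  # Hypothetical record shape for illustration only.
    code: str
    severity: str
    message: str


def math_render_error_count(warnings: Iterable[QualityWarning]) -> int:
    """Count only unrepaired failures; repaired formulas stay visible as info."""
    return sum(1 for warning in warnings if warning.code == "MATH_RENDER_FAILED")


example_warnings = [
    QualityWarning("MATH_RENDER_REPAIRED", "info", "illustrative repaired formula"),
    QualityWarning("MATH_RENDER_FAILED", "warning", "illustrative unrepaired formula"),
]
error_count = math_render_error_count(example_warnings)
```

Keeping the repaired entry in the warning list while excluding it from the count is what lets the report show that a formula was changed without treating it as an error.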
## Architecture Plan
### WP11.1: Failed Math Detail Capture
Actions:
- Add a project-owned result type that can include failed `MathExpression` records and checker messages.
- Preserve the current `check_math_renderability()` return behavior for existing callers.
- Keep math expression extraction restricted to text outside fenced code blocks and inline code.
Expected output:
- Conversion can access failed expression spans without parsing warning message text.
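One possible shape for such a result type is sketched below. The class and field names (`FailedMath`, `MathCheckDetails`, `checked_count`) are hypothetical, and the assumption that a span covers the `$` delimiters is made only for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FailedMath:  # Hypothetical name for a failed expression record.
    body: str      # LaTeX source without delimiters
    display: bool  # True for display math, False for inline math
    start: int     # Span start offset in the Markdown text
    end: int       # Span end offset (exclusive)
    message: str   # Checker message, kept as data instead of parsed text


@dataclass(frozen=True)
class MathCheckDetails:  # Hypothetical detail result alongside plain warnings.
    checked_count: int
    failures: Tuple[FailedMath, ...] = ()


markdown = "Energy $x ^ {i} ^ {t}$ here."
failure = FailedMath(
    body="x ^ {i} ^ {t}",
    display=False,
    start=markdown.index("$"),
    end=markdown.rindex("$") + 1,
    message="Double exponent: use braces to clarify",
)
details = MathCheckDetails(checked_count=1, failures=(failure,))
failed_text = markdown[failure.start:failure.end]
```

With spans stored on the record, a caller can slice the Markdown directly instead of recovering positions from warning message text.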
### WP11.2: Repair Module
Actions:
- Add `src/pdf2md/math_repair.py`.
- Define repair result records.
- Generate candidates only for failed expressions.
- Revalidate candidates through the injected checker.
- Apply replacements from right to left so Markdown spans remain stable.
Expected output:
- Pure string-level repair behavior that is deterministic, local-only, and independently testable.
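The right-to-left ordering matters because a replacement of a different length shifts every offset after it. A minimal sketch, assuming non-overlapping `(start, end, new_text)` tuples and a hypothetical helper name `apply_replacements`:

```python
from typing import Iterable, Tuple

Replacement = Tuple[int, int, str]  # (start, end, new_text), end exclusive


def apply_replacements(markdown: str, replacements: Iterable[Replacement]) -> str:
    """Apply non-overlapping span replacements from the last span to the first."""
    ordered = sorted(replacements, key=lambda item: item[0], reverse=True)
    for start, end, new_text in ordered:
        # Editing later spans first keeps earlier offsets valid.
        markdown = markdown[:start] + new_text + markdown[end:]
    return markdown


text = "a $A$ b $B$ c"
repaired = apply_replacements(text, [(3, 4, "AAA"), (9, 10, "BB")])
```

Applying the same two edits left to right without adjusting offsets would place the second replacement two characters too early, since the first one lengthens the string.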
### WP11.3: Conversion And Recheck Integration
Actions:
- Route `convert` normalized Markdown through repair before final metadata/report construction.
- Route `recheck` Markdown through the same repair path before rewriting metadata/report.
- Re-run final quality checks after any repair.
- Preserve asset checking and strict-local behavior unchanged.
Expected output:
- Fresh conversions and rechecks both benefit from MathJax warning mitigation.
### WP11.4: Tests
Default tests:
- Repeated superscripts are repaired only when the original expression failed.
- `\end{a}` repairs to `\end{array}` only when array environments are unbalanced.
- A candidate that still fails is not written back.
- Passing expressions are not changed.
- Conversion writes repaired Markdown only after candidate revalidation.
- Recheck can repair an existing Markdown output and regenerate metadata/report.
- Existing unavailable-checker behavior remains nonfatal.
Optional local validation:
- Run `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite`.
- Confirm the generated report has `Math render error count: 0` for the requested sample, or record any remaining failures exactly.
## Acceptance Criteria
- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- `pdf2md convert` and `pdf2md recheck` share the same repair behavior.
- MathJax failed spans are repaired only after candidate revalidation succeeds.
- Successfully repaired formulas remain visible through `MATH_RENDER_REPAIRED` info warnings.
- Existing strict-local and MinerU-only constraints are unchanged.
- `samples/MITC공부.pdf` is validated locally as requested, with generated outputs under `outputs/` kept ignored and uncommitted.
## Hard Failure Criteria
- Repair changes a math span that did not fail initial MathJax validation.
- Repair drops an entire formula or removes meaningful LaTeX tokens solely to silence warnings.
- Repair claims success without re-running the local checker on the candidate.
- `convert` or `recheck` starts requiring MathJax when it was previously optional.
- Default tests require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- `samples/` or generated `outputs/` files are committed.
## Verification Commands
```powershell
uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py
uv run pytest
uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite
git diff --check
git status --short --untracked-files=all
```
## Handoff Requirements
After implementation:
- Update `PROGRESS.md` with files changed, commands run, test outcomes, sample validation outcome, known failures, and next action.
- Keep sample PDFs and generated outputs out of the commit.
- Commit the completed sprint if verification passes.
## Implementation Handoff
- Files changed: `src/pdf2md/quality.py`, `src/pdf2md/math_repair.py`, `src/pdf2md/conversion.py`, `src/pdf2md/ir.py`, tests, `ARCHITECTURE.md`, `docs/V1IMPLEMENTATIONPLAN.md`, `PLAN.md`, and `PROGRESS.md`.
- Default verification: `uv run pytest` passed 172 tests with 1 skipped.
- Targeted verification: `uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py` passed 56 tests.
- Requested sample verification: `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite` succeeded; final report shows `Math render error count: 0` and two `MATH_RENDER_REPAIRED` info warnings.
- Known failures: none.
- Residual risk: repair rules are deliberately narrow; future PDFs may expose MathJax failures that should remain warnings until a deterministic rule is added and tested.
- Next action: optional Obsidian visual review or additional sample validation.
+44 -2
@@ -4,7 +4,7 @@ Last updated: 2026-05-08
This document is the implementation plan for v1. It does not replace `PRD.md` or `ARCHITECTURE.md`; use those files as the source of product requirements and system design. This plan explains the order of work, sprint contracts, verification gates, and agent ownership for implementing the converter.
Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents.
Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents. Sprint 11 implemented conservative MathJax warning mitigation for failed math spans.
## 1. V1 Outcome
@@ -599,6 +599,48 @@ Hard failure criteria:
- Chunk outputs are merged.
- Default tests require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
### Sprint 11: MathJax Warning Mitigation
Active contract:
- `docs/Sprints/SPRINT11CONTRACT.md`
Status:
- Implemented.
Objective:
- Repair narrow MathJax-invalid formula artifacts after initial local validation and before final output writing.
Touched surfaces:
- `quality.py`
- `math_repair.py`
- `conversion.py`
- `ir.py`
- Unit tests for quality details, repair rules, conversion, and recheck behavior
Expected outputs:
- Failed math expression records expose body, display mode, span, and checker message.
- Repair candidates are generated only for failed math spans.
- Repeated same-direction scripts are disambiguated with an empty group.
- Truncated `\end{a}` array endings are repaired when array environments are unbalanced.
- `convert` and `recheck` share the same repair behavior.
- Applied repairs are recorded as `MATH_RENDER_REPAIRED` info warnings and do not count as math render errors.
Verification checks:
- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- `samples/MITC공부.pdf` validates locally with `Math render error count: 0`.
Hard failure criteria:
- Repair changes math spans that did not fail local MathJax validation.
- Repair claims success without candidate revalidation.
- Repair introduces remote services, alternate engines, or mandatory sample-dependent default tests.
## 6. Cross-Cutting Acceptance Criteria
Every implementation sprint must preserve these acceptance criteria:
@@ -645,7 +687,7 @@ Handoff fields:
- MinerU 3.1.0 install and CLI behavior are source-verified, but real local output still needs a later local probe before release.
- GTX 1070 Ti 8GB is visible locally, but it is Pascal compute capability 6.1; `doctor` and setup docs must make CUDA/PyTorch limits clear.
- `uv` is installed per-user at `C:\Users\user\.local\bin`, but a new shell may need PATH refresh before `uv` is visible.
- Formula renderability checks need a local tool choice; the implementation should start with an interface and graceful unavailable-tool warning if needed.
- Formula renderability checks and conservative warning mitigation are implemented, but formula reconstruction remains best effort and should keep warnings/provenance visible.
- Some PDFs will have tables or formulas that cannot be faithfully represented in Markdown; metadata and `.report.md` must surface this instead of hiding it.
- Redistribution license obligations must be reviewed before packaging, redistribution, or bundling model weights.