feat: mitigate MathJax formula warnings
@@ -0,0 +1,181 @@
# Sprint 11 Contract: MathJax Warning Mitigation

Status: Implemented
Last updated: 2026-05-11

## Objective

Add a conservative local cleanup pass for MathJax-invalid formulas:

1. Run the existing MathJax renderability check on normalized Markdown.
2. Build repair candidates only for expressions that failed MathJax validation.
3. Re-check each candidate with the same local checker.
4. Replace only candidates that pass.
5. Re-run final quality checks before writing Markdown, metadata JSON, and report Markdown.

The feature should reduce `MATH_RENDER_FAILED` warnings without hiding that a formula was changed.

## Current Precondition

- `pdf2md convert` writes normalized Markdown, metadata JSON, and `<stem>.report.md`.
- `pdf2md recheck` can rerun quality checks for an existing generated Markdown file without rerunning MinerU.
- Local MathJax checking is already optional and nonfatal.
- `outputs/MITC공부/MITC공부.md` currently has two MathJax render failures:
  - expression 8: `Double exponent: use braces to clarify`
  - expression 83: `Unknown environment 'a'`
- `samples/MITC공부.pdf` is the requested real local validation sample.

## Touched Surfaces

Allowed during implementation:

- `src/pdf2md/quality.py`
- `src/pdf2md/math_repair.py`
- `src/pdf2md/conversion.py`
- `src/pdf2md/ir.py`
- `tests/test_quality.py`
- `tests/test_math_repair.py`
- `tests/test_conversion.py`
- `tests/test_cli.py`
- `docs/Sprints/SPRINT11CONTRACT.md`
- `PLAN.md`
- `PROGRESS.md`

Not allowed:

- Remote OCR, remote LLMs, remote render APIs, or external document upload paths.
- Alternate PDF conversion engines.
- Switchable conversion-engine behavior.
- A full LaTeX parser or symbolic math rewrite engine.
- New CLI flags unless a later user request explicitly asks for them.
- Mandatory default tests that require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- Committed files under `samples/`.
- Committed generated conversion outputs under `outputs/`.

## Product Behavior

Repair activation:

- Repair runs automatically when a local math checker is available and at least one math expression fails validation.
- If the checker is unavailable, behavior remains unchanged: conversion/recheck continues with an info-level unavailable-checker warning.
- The same repair path applies to fresh `convert` output and existing Markdown processed through `recheck`.

Initial deterministic repair rules:

- Repeated same-direction script repair:
  - Convert consecutive superscripts/subscripts such as `^ {i} ^ {t}` to `^ {i} {} ^ {t}`.
  - This resolves MathJax double-super/subscript syntax while preserving both script tokens.
- Truncated array environment repair:
  - Convert `\end{a}` to `\end{array}` only when the expression has unmatched `\begin{array}` / `\end{array}` counts.
  - This targets obvious extraction truncation, not arbitrary environment renaming.
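
The two rules above could look roughly like the following. This is a minimal sketch, not the code in `math_repair.py`: the helper names `repair_repeated_scripts` and `repair_truncated_array_end` are hypothetical, the script rule only handles flat (non-nested) brace groups, and "unmatched counts" is interpreted here as more `\begin{array}` than `\end{array}`.

```python
import re

# Hypothetical helpers for illustration; not the project's actual API.

# A script marker (^ or _), a flat brace group, optional whitespace,
# and then the same marker again (checked with a lookahead).
_REPEATED_SCRIPT = re.compile(r"([\^_])(\s*\{[^{}]*\})(\s*)(?=\1)")


def repair_repeated_scripts(body: str) -> str:
    """Insert an empty group between consecutive same-direction scripts."""
    return _REPEATED_SCRIPT.sub(lambda m: m.group(0) + "{} ", body)


def repair_truncated_array_end(body: str) -> str:
    """Rewrite \\end{a} to \\end{array} only when array environments are unbalanced."""
    begins = body.count(r"\begin{array}")
    ends = body.count(r"\end{array}")
    # Assumption: "unmatched" means at least one \begin{array} lacks its \end{array}.
    if begins <= ends or r"\end{a}" not in body:
        return body
    return body.replace(r"\end{a}", r"\end{array}")


print(repair_repeated_scripts("x ^ {i} ^ {t}"))
print(repair_truncated_array_end(r"\begin{array}{c} a \end{a}"))
```

Both helpers only produce candidates. Under this contract a candidate is written back only if the original expression failed validation and the candidate then passes the same checker.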

Provenance:

- Applied repairs produce `MATH_RENDER_REPAIRED` info warnings.
- Successfully repaired expressions must not count toward `math_render_error_count`.
- Unrepaired expressions keep the original `MATH_RENDER_FAILED` warning behavior.
- The report remains derived from metadata and local quality checks.
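
As a small illustration of the provenance rules, the error count can be derived from warning codes alone. The warning codes come from this contract; the `QualityWarning` record shape, the `"warning"` severity for failures, and the example messages are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Sequence

MATH_RENDER_FAILED = "MATH_RENDER_FAILED"
MATH_RENDER_REPAIRED = "MATH_RENDER_REPAIRED"


@dataclass(frozen=True)
class QualityWarning:
    # Hypothetical record shape; the project's own warning type may differ.
    code: str
    severity: str
    message: str


def math_render_error_count(warnings: Sequence[QualityWarning]) -> int:
    """Count only expressions that still fail; repaired ones stay visible as info."""
    return sum(1 for w in warnings if w.code == MATH_RENDER_FAILED)


# Made-up example messages, for illustration only.
warnings = [
    QualityWarning(MATH_RENDER_REPAIRED, "info", "repaired repeated scripts"),
    QualityWarning(MATH_RENDER_FAILED, "warning", "formula still fails to render"),
]
print(math_render_error_count(warnings))  # prints 1
```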

## Architecture Plan

### WP11.1: Failed Math Detail Capture

Actions:

- Add a project-owned result type that can include failed `MathExpression` records and checker messages.
- Preserve the current `check_math_renderability()` return behavior for existing callers.
- Keep expression extraction outside fenced code and inline code.
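
One way to keep extraction outside fenced and inline code is to mask code regions with same-length whitespace before scanning for math delimiters, so offsets still line up with the original text. The sketch below assumes `$...$` and `$$...$$` delimiters and uses a hypothetical `find_math_spans` helper; the project's extractor may work differently, and this version does not try to tell currency dollar signs from math.

```python
import re
from typing import List, Tuple

_TICKS = "`" * 3
# A fence opener at line start, lazily up to the same fence at a later line start.
_FENCE = re.compile(r"^(" + _TICKS + r"|~~~).*?^\1[^\n]*$", re.DOTALL | re.MULTILINE)
_INLINE_CODE = re.compile(r"`[^`\n]*`")
# Display math first, then single-line inline math.
_MATH = re.compile(r"\$\$(.+?)\$\$|\$([^$\n]+?)\$", re.DOTALL)


def _mask(match: "re.Match[str]") -> str:
    # Same length as the matched code, so later offsets are unchanged.
    return " " * len(match.group(0))


def find_math_spans(markdown: str) -> List[Tuple[int, int, str, bool]]:
    """Return (start, end, body, is_display) for math found outside code."""
    masked = _FENCE.sub(_mask, markdown)
    masked = _INLINE_CODE.sub(_mask, masked)
    spans = []
    for m in _MATH.finditer(masked):
        group = 1 if m.group(1) is not None else 2
        start, end = m.span(group)
        spans.append((start, end, markdown[start:end], group == 1))
    return spans
```

Because the mask preserves length, each returned `(start, end)` pair indexes directly into the original Markdown.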

Expected output:

- Conversion can access failed expression spans without parsing warning message text.
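
A result type along these lines would let callers read failed spans directly. `MathExpression` is named in this contract, but every field name below, along with `FailedMath` and `MathCheckResult`, is a hypothetical choice for this sketch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MathExpression:
    # Assumed fields: formula body, display mode, and span offsets in the Markdown.
    body: str
    display: bool
    start: int
    end: int


@dataclass(frozen=True)
class FailedMath:
    expression: MathExpression
    message: str  # checker message, kept as data rather than embedded in warning text


@dataclass(frozen=True)
class MathCheckResult:
    checked_count: int
    failures: Tuple[FailedMath, ...] = ()

    @property
    def error_count(self) -> int:
        return len(self.failures)
```

Existing callers that only need a count could keep using a thin wrapper around `error_count`, which is one way to preserve the current `check_math_renderability()` return behavior.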

### WP11.2: Repair Module

Actions:

- Add `src/pdf2md/math_repair.py`.
- Define repair result records.
- Generate candidates only for failed expressions.
- Revalidate candidates through the injected checker.
- Apply replacements from right to left so that the offsets of earlier Markdown spans remain valid.

Expected output:

- Pure string-level repair behavior that is deterministic, local-only, and independently testable.
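
The repair pass described above can be sketched as a pure function over strings. The names `FailedSpan`, `Checker`, `Rule`, and `apply_repairs` are hypothetical, and the sketch assumes spans cover the formula body only and do not overlap.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical shapes for illustration; the project's own records may differ.
Checker = Callable[[str], Optional[str]]  # error message, or None if renderable
Rule = Callable[[str], str]               # candidate body (unchanged if not applicable)


@dataclass(frozen=True)
class FailedSpan:
    start: int  # offset of the formula body in the Markdown text
    end: int    # exclusive end offset
    body: str


def apply_repairs(
    markdown: str,
    failed: Sequence[FailedSpan],
    rules: Sequence[Rule],
    checker: Checker,
) -> Tuple[str, List[Tuple[FailedSpan, str]]]:
    repaired = markdown
    applied: List[Tuple[FailedSpan, str]] = []
    # Work from the last span to the first so earlier offsets stay valid.
    for span in sorted(failed, key=lambda s: s.start, reverse=True):
        for rule in rules:
            candidate = rule(span.body)
            if candidate == span.body:
                continue  # rule did not apply to this expression
            if checker(candidate) is None:  # candidate must pass revalidation
                repaired = repaired[: span.start] + candidate + repaired[span.end :]
                applied.append((span, candidate))
                break
    return repaired, applied
```

Injecting the checker keeps the function testable with a fake checker, so default tests need neither Node.js nor MathJax.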

### WP11.3: Conversion And Recheck Integration

Actions:

- Route `convert` normalized Markdown through repair before final metadata/report construction.
- Route `recheck` Markdown through the same repair path before rewriting metadata/report.
- Re-run final quality checks after any repair.
- Preserve asset checking and strict-local behavior unchanged.

Expected output:

- Fresh conversions and rechecks both benefit from MathJax warning mitigation.

### WP11.4: Tests

Default tests:

- Repeated superscripts are repaired only when the original expression failed.
- `\end{a}` repairs to `\end{array}` only when array environments are unbalanced.
- A candidate that still fails is not written back.
- Passing expressions are not changed.
- Conversion writes repaired Markdown only after candidate revalidation.
- Recheck can repair an existing Markdown output and regenerate metadata/report.
- Existing unavailable-checker behavior remains nonfatal.

Optional local validation:

- Run `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite`.
- Confirm the generated report has `Math render error count: 0` for the requested sample, or record any remaining failures exactly.

## Acceptance Criteria

- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- `pdf2md convert` and `pdf2md recheck` share the same repair behavior.
- Spans that fail MathJax validation are repaired only after candidate revalidation succeeds.
- Successfully repaired formulas remain visible through `MATH_RENDER_REPAIRED` info warnings.
- Existing strict-local and MinerU-only constraints are unchanged.
- `samples/MITC공부.pdf` is validated locally as requested, with generated outputs kept ignored under `outputs/`.

## Hard Failure Criteria

- Repair changes a math span that did not fail initial MathJax validation.
- Repair drops an entire formula or removes meaningful LaTeX tokens solely to silence warnings.
- Repair claims success without re-running the local checker on the candidate.
- `convert` or `recheck` starts requiring MathJax when it was previously optional.
- Default tests require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- `samples/` or generated `outputs/` files are committed.

## Verification Commands

```powershell
uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py
uv run pytest
uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite
git diff --check
git status --short --untracked-files=all
```

## Handoff Requirements

After implementation:

- Update `PROGRESS.md` with files changed, commands run, test outcomes, sample validation outcome, known failures, and next action.
- Keep sample PDFs and generated outputs out of the commit.
- Commit the completed sprint if verification passes.

## Implementation Handoff

- Files changed: `src/pdf2md/quality.py`, `src/pdf2md/math_repair.py`, `src/pdf2md/conversion.py`, `src/pdf2md/ir.py`, tests, `ARCHITECTURE.md`, `docs/V1IMPLEMENTATIONPLAN.md`, `PLAN.md`, and `PROGRESS.md`.
- Default verification: `uv run pytest` passed 172 tests with 1 skipped.
- Targeted verification: `uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py` passed 56 tests.
- Requested sample verification: `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite` succeeded; final report shows `Math render error count: 0` and two `MATH_RENDER_REPAIRED` info warnings.
- Known failures: none.
- Residual risk: repair rules are deliberately narrow; future PDFs may expose MathJax failures that should remain warnings until a deterministic rule is added and tested.
- Next action: optional Obsidian visual review or additional sample validation.
@@ -4,7 +4,7 @@ Last updated: 2026-05-08

This document is the implementation plan for v1. It does not replace `PRD.md` or `ARCHITECTURE.md`; use those files as the source of product requirements and system design. This plan explains the order of work, sprint contracts, verification gates, and agent ownership for implementing the converter.

Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents.
Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents. Sprint 11 implemented conservative MathJax warning mitigation for failed math spans.

## 1. V1 Outcome

@@ -599,6 +599,48 @@ Hard failure criteria:
- Chunk outputs are merged.
- Default tests require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.

### Sprint 11: MathJax Warning Mitigation

Active contract:

- `docs/Sprints/SPRINT11CONTRACT.md`

Status:

- Implemented.

Objective:

- Repair narrow MathJax-invalid formula artifacts after initial local validation and before final output writing.

Touched surfaces:

- `quality.py`
- `math_repair.py`
- `conversion.py`
- `ir.py`
- Unit tests for quality details, repair rules, conversion, and recheck behavior

Expected outputs:

- Failed math expression records expose body, display mode, span, and checker message.
- Repair candidates are generated only for failed math spans.
- Repeated same-direction scripts are disambiguated with an empty group.
- Truncated `\end{a}` array endings are repaired when array environments are unbalanced.
- `convert` and `recheck` share the same repair behavior.
- Applied repairs are recorded as `MATH_RENDER_REPAIRED` info warnings and do not count as math render errors.

Verification checks:

- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- `samples/MITC공부.pdf` validates locally with `Math render error count: 0`.

Hard failure criteria:

- Repair changes math spans that did not fail local MathJax validation.
- Repair claims success without candidate revalidation.
- Repair introduces remote services, alternate engines, or mandatory sample-dependent default tests.

## 6. Cross-Cutting Acceptance Criteria

Every implementation sprint must preserve these acceptance criteria:
@@ -645,7 +687,7 @@ Handoff fields:
- MinerU 3.1.0 install and CLI behavior are source-verified, but real local output still needs a later local probe before release.
- GTX 1070 Ti 8GB is visible locally, but it is Pascal compute capability 6.1; `doctor` and setup docs must make CUDA/PyTorch limits clear.
- `uv` is installed per-user at `C:\Users\user\.local\bin`, but a new shell may need PATH refresh before `uv` is visible.
- Formula renderability checks need a local tool choice; the implementation should start with an interface and graceful unavailable-tool warning if needed.
- Formula renderability checks and conservative warning mitigation are implemented, but formula reconstruction remains best effort and should keep warnings/provenance visible.
- Some PDFs will have tables or formulas that cannot be faithfully represented in Markdown; metadata and `.report.md` must surface this instead of hiding it.
- Redistribution license obligations must be reviewed before packaging, redistribution, or bundling model weights.
