# Sprint 11 Contract: MathJax Warning Mitigation

Status: Implemented

Last updated: 2026-05-11

## Objective

Add a conservative local cleanup pass for MathJax-invalid formulas:

1. Run the existing MathJax renderability check on normalized Markdown.
2. Build repair candidates only for expressions that failed MathJax validation.
3. Re-check each candidate with the same local checker.
4. Replace only candidates that pass.
5. Re-run final quality checks before writing Markdown, metadata JSON, and report Markdown.

The feature should reduce `MATH_RENDER_FAILED` warnings without hiding that a formula was changed.

## Current Precondition

- `pdf2md convert` writes normalized Markdown, metadata JSON, and `.report.md`.
- `pdf2md recheck` can rerun quality checks for an existing generated Markdown file without rerunning MinerU.
- Local MathJax checking is already optional and nonfatal.
- `outputs/MITC공부/MITC공부.md` currently has two MathJax render failures:
  - expression 8: `Double exponent: use braces to clarify`
  - expression 83: `Unknown environment 'a'`
- `samples/MITC공부.pdf` is the requested real local validation sample.

## Touched Surfaces

Allowed during implementation:

- `src/pdf2md/quality.py`
- `src/pdf2md/math_repair.py`
- `src/pdf2md/conversion.py`
- `src/pdf2md/ir.py`
- `tests/test_quality.py`
- `tests/test_math_repair.py`
- `tests/test_conversion.py`
- `tests/test_cli.py`
- `docs/Sprints/SPRINT11CONTRACT.md`
- `PLAN.md`
- `PROGRESS.md`

Not allowed:

- Remote OCR, remote LLMs, remote render APIs, or external document upload paths.
- Alternate PDF conversion engines.
- Switchable conversion-engine behavior.
- A full LaTeX parser or symbolic math rewrite engine.
- New CLI flags unless a later user request explicitly asks for them.
- Mandatory default tests that require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- Committed files under `samples/`.
- Committed generated conversion outputs under `outputs/`.

## Product Behavior

Repair activation:

- Repair runs automatically when a local math checker is available and at least one math expression fails validation.
- If the checker is unavailable, behavior remains unchanged: conversion/recheck continues with an info-level unavailable-checker warning.
- The same repair path applies to fresh `convert` output and existing Markdown processed through `recheck`.

Initial deterministic repair rules:

- Repeated same-direction script repair:
  - Convert consecutive superscripts/subscripts such as `^ {i} ^ {t}` to `^ {i} {} ^ {t}`.
  - This resolves MathJax double-super/subscript syntax while preserving both script tokens.
- Truncated array environment repair:
  - Convert `\end{a}` to `\end{array}` only when the expression has unmatched `\begin{array}` / `\end{array}` counts.
  - This targets obvious extraction truncation, not arbitrary environment renaming.

Provenance:

- Applied repairs produce `MATH_RENDER_REPAIRED` info warnings.
- Successfully repaired expressions must not count as `math_render_error_count`.
- Unrepaired expressions keep the original `MATH_RENDER_FAILED` warning behavior.
- The report remains derived from metadata and local quality checks.

## Architecture Plan

### WP11.1: Failed Math Detail Capture

Actions:

- Add a project-owned result type that can include failed `MathExpression` records and checker messages.
- Preserve the current `check_math_renderability()` return behavior for existing callers.
- Keep expression extraction outside fenced code and inline code.

Expected output:

- Conversion can access failed expression spans without parsing warning message text.

### WP11.2: Repair Module

Actions:

- Add `src/pdf2md/math_repair.py`.
- Define repair result records.
- Generate candidates only for failed expressions.
- Revalidate candidates through the injected checker.
- Apply replacements from right to left so Markdown spans remain stable.

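
The two repair rules and the right-to-left replacement step can be sketched roughly as follows. This is a minimal illustration, not the contents of `src/pdf2md/math_repair.py`: the names `FailedExpression`, `build_candidates`, and `repair_markdown`, and the checker signature (a callable returning an error message, or `None` when the expression renders), are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical checker shape: returns an error message, or None when renderable.
Checker = Callable[[str], Optional[str]]


@dataclass(frozen=True)
class FailedExpression:
    """Hypothetical record for a span that failed validation."""
    start: int  # offset of the LaTeX body inside the Markdown string
    end: int    # exclusive end offset
    latex: str


# A script group immediately followed by another script of the same kind.
_REPEATED_SCRIPT = re.compile(r"([\^_])(\s*\{[^{}]*\}\s*)(?=\1)")


def repair_repeated_scripts(latex: str) -> Optional[str]:
    # Insert an empty group so the second script attaches to a new base.
    fixed = _REPEATED_SCRIPT.sub(lambda m: m.group(0) + "{} ", latex)
    return fixed if fixed != latex else None


def repair_truncated_array(latex: str) -> Optional[str]:
    begins = latex.count(r"\begin{array}")
    ends = latex.count(r"\end{array}")
    # Only touch \end{a} when the array environments are unbalanced.
    if begins == ends or r"\end{a}" not in latex:
        return None
    return latex.replace(r"\end{a}", r"\end{array}")


def build_candidates(latex: str) -> Iterator[Tuple[str, str]]:
    rules = (("repeated_script", repair_repeated_scripts),
             ("truncated_array", repair_truncated_array))
    for name, rule in rules:
        candidate = rule(latex)
        if candidate is not None:
            yield name, candidate


def repair_markdown(markdown: str, failed: List[FailedExpression],
                    checker: Checker) -> Tuple[str, List[Tuple[str, str, str]]]:
    applied = []
    # Right to left: editing a later span never shifts earlier offsets.
    for expr in sorted(failed, key=lambda e: e.start, reverse=True):
        for rule, candidate in build_candidates(expr.latex):
            if checker(candidate) is None:  # write back only revalidated candidates
                markdown = markdown[:expr.start] + candidate + markdown[expr.end:]
                applied.append((rule, expr.latex, candidate))
                break
    return markdown, applied
```

Because the checker is injected, a test can pass a small fake in place of MathJax, and each entry in `applied` carries enough detail (rule name, original, replacement) to back a `MATH_RENDER_REPAIRED` info warning.
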

Expected output:

- Pure string-level repair behavior that is deterministic, local-only, and independently testable.

### WP11.3: Conversion And Recheck Integration

Actions:

- Route `convert` normalized Markdown through repair before final metadata/report construction.
- Route `recheck` Markdown through the same repair path before rewriting metadata/report.
- Re-run final quality checks after any repair.
- Preserve asset checking and strict-local behavior unchanged.

Expected output:

- Fresh conversions and rechecks both benefit from MathJax warning mitigation.

### WP11.4: Tests

Default tests:

- Repeated superscripts are repaired only when the original expression failed.
- `\end{a}` repairs to `\end{array}` only when array environments are unbalanced.
- A candidate that still fails is not written back.
- Passing expressions are not changed.
- Conversion writes repaired Markdown only after candidate revalidation.
- Recheck can repair an existing Markdown output and regenerate metadata/report.
- Existing unavailable-checker behavior remains nonfatal.

Optional local validation:

- Run `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite`.
- Confirm the generated report has `Math render error count: 0` for the requested sample, or record any remaining failures exactly.

## Acceptance Criteria

- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- `pdf2md convert` and `pdf2md recheck` share the same repair behavior.
- MathJax failed spans are repaired only after candidate revalidation succeeds.
- Successfully repaired formulas remain visible through `MATH_RENDER_REPAIRED` info warnings.
- Existing strict-local and MinerU-only constraints are unchanged.
- `samples/MITC공부.pdf` is validated locally as requested, with generated outputs kept ignored under `outputs/`.

## Hard Failure Criteria

- Repair changes a math span that did not fail initial MathJax validation.
- Repair drops an entire formula or removes meaningful LaTeX tokens solely to silence warnings.
- Repair claims success without re-running the local checker on the candidate.
- `convert` or `recheck` starts requiring MathJax when it was previously optional.
- Default tests require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- `samples/` or generated `outputs/` files are committed.

## Verification Commands

```powershell
uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py
uv run pytest
uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite
git diff --check
git status --short --untracked-files=all
```

## Handoff Requirements

After implementation:

- Update `PROGRESS.md` with files changed, commands run, test outcomes, sample validation outcome, known failures, and next action.
- Keep sample PDFs and generated outputs out of the commit.
- Commit the completed sprint if verification passes.

## Implementation Handoff

- Files changed: `src/pdf2md/quality.py`, `src/pdf2md/math_repair.py`, `src/pdf2md/conversion.py`, `src/pdf2md/ir.py`, tests, `ARCHITECTURE.md`, `docs/V1IMPLEMENTATIONPLAN.md`, `PLAN.md`, and `PROGRESS.md`.
- Default verification: `uv run pytest` passed 172 tests with 1 skipped.
- Targeted verification: `uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py` passed 56 tests.
- Requested sample verification: `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite` succeeded; final report shows `Math render error count: 0` and two `MATH_RENDER_REPAIRED` info warnings.
- Known failures: none.
- Residual risk: repair rules are deliberately narrow; future PDFs may expose MathJax failures that should remain warnings until a deterministic rule is added and tested.
- Next action: optional Obsidian visual review or additional sample validation.