Sprint 11 Contract: MathJax Warning Mitigation
Status: Implemented
Last updated: 2026-05-11
Objective
Add a conservative local cleanup pass for MathJax-invalid formulas:
- Run the existing MathJax renderability check on normalized Markdown.
- Build repair candidates only for expressions that failed MathJax validation.
- Re-check each candidate with the same local checker.
- Replace only candidates that pass.
- Re-run final quality checks before writing Markdown, metadata JSON, and report Markdown.
The feature should reduce MATH_RENDER_FAILED warnings without hiding that a formula was changed.
Current Precondition
- pdf2md convert writes normalized Markdown, metadata JSON, and <stem>.report.md.
- pdf2md recheck can rerun quality checks for an existing generated Markdown file without rerunning MinerU.
- Local MathJax checking is already optional and nonfatal.
- outputs/MITC공부/MITC공부.md currently has two MathJax render failures:
  - expression 8: Double exponent: use braces to clarify
  - expression 83: Unknown environment 'a'
- samples/MITC공부.pdf is the requested real local validation sample.
Touched Surfaces
Allowed during implementation:
- src/pdf2md/quality.py
- src/pdf2md/math_repair.py
- src/pdf2md/conversion.py
- src/pdf2md/ir.py
- tests/test_quality.py
- tests/test_math_repair.py
- tests/test_conversion.py
- tests/test_cli.py
- docs/Sprints/SPRINT11CONTRACT.md
- PLAN.md
- PROGRESS.md
Not allowed:
- Remote OCR, remote LLMs, remote render APIs, or external document upload paths.
- Alternate PDF conversion engines.
- Switchable conversion-engine behavior.
- A full LaTeX parser or symbolic math rewrite engine.
- New CLI flags unless a later user request explicitly asks for them.
- Mandatory default tests that require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or samples/.
- Committed files under samples/.
- Committed generated conversion outputs under outputs/.
Product Behavior
Repair activation:
- Repair runs automatically when a local math checker is available and at least one math expression fails validation.
- If the checker is unavailable, behavior remains unchanged: conversion/recheck continues with an info-level unavailable-checker warning.
- The same repair path applies to fresh convert output and existing Markdown processed through recheck.
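The activation rule above can be sketched as a small selection step. This is a minimal illustration, not the project's code: the function name select_repair_targets, the Checker signature, and the wording of the info message are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Assumed checker shape: one pass/fail flag per expression, in order.
Checker = Callable[[Sequence[str]], Sequence[bool]]


def select_repair_targets(
    expressions: Sequence[str],
    checker: Optional[Checker],
) -> Tuple[List[int], List[str]]:
    """Return (indices of failed expressions, info messages)."""
    if checker is None:
        # Checker unavailable: nothing is repaired and processing continues.
        return [], ["info: local math checker unavailable; repair skipped"]
    results = checker(expressions)
    # Only expressions that failed validation become repair targets.
    failed = [i for i, ok in enumerate(results) if not ok]
    return failed, []
```

An empty target list means the Markdown is left untouched, which keeps the unavailable-checker path identical to the pre-repair behavior.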
Initial deterministic repair rules:
- Repeated same-direction script repair:
  - Convert consecutive superscripts/subscripts such as ^ {i} ^ {t} to ^ {i} {} ^ {t}.
  - This resolves MathJax double-super/subscript syntax while preserving both script tokens.
- Truncated array environment repair:
  - Convert \end{a} to \end{array} only when the expression has unmatched \begin{array}/\end{array} counts.
  - This targets obvious extraction truncation, not arbitrary environment renaming.
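The two rules can be illustrated with plain string operations. The sketch below is an assumption about how such rules might look, not the contents of src/pdf2md/math_repair.py; the function names and the regular expression are hypothetical, and the script rule only handles braced scripts without nested braces.

```python
import re

# Hypothetical pattern: a braced script ("^ {i}" or "_ {i}") that is followed
# directly by another script of the same direction.
_REPEATED_SCRIPT = re.compile(r"([\^_])\s*(\{[^{}]*\})\s*(?=\1\s*\{)")


def repair_repeated_scripts(expr: str) -> str:
    """Insert an empty group between consecutive same-direction scripts."""
    # Keep the matched text as-is and append "{} " so both script tokens survive.
    return _REPEATED_SCRIPT.sub(lambda m: m.group(0) + "{} ", expr)


def repair_truncated_array_end(expr: str) -> str:
    """Rename \\end{a} to \\end{array} only when array counts are unmatched."""
    begins = expr.count(r"\begin{array}")
    ends = expr.count(r"\end{array}")
    if begins == ends or r"\end{a}" not in expr:
        return expr  # balanced or nothing truncated: leave the expression alone
    return expr.replace(r"\end{a}", r"\end{array}")


print(repair_repeated_scripts("x ^ {i} ^ {t}"))
# x ^ {i} {} ^ {t}
print(repair_truncated_array_end(r"\begin{array}{cc} a & b \end{a}"))
# \begin{array}{cc} a & b \end{array}
```

Either function only proposes a candidate. Under this contract the candidate still has to pass the local checker before it replaces anything.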
Provenance:
- Applied repairs produce MATH_RENDER_REPAIRED info warnings.
- Successfully repaired expressions must not count toward math_render_error_count.
- Unrepaired expressions keep the original MATH_RENDER_FAILED warning behavior.
- The report remains derived from metadata and local quality checks.
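The provenance rules amount to a simple bookkeeping step, sketched below. The QualityWarning record, the summarize_math_results function, the "warning" severity string, and the message wording are hypothetical; only the two warning codes and the meaning of math_render_error_count come from this contract.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class QualityWarning:
    # Hypothetical record; the project's own warning type may differ.
    code: str
    severity: str
    message: str


def summarize_math_results(
    failures: Dict[int, str],  # expression index -> original checker message
    repaired: Set[int],        # indices whose candidate passed revalidation
) -> Tuple[List[QualityWarning], int]:
    warnings: List[QualityWarning] = []
    for index in sorted(failures):
        if index in repaired:
            # A repair stays visible as an info-level warning.
            warnings.append(QualityWarning(
                "MATH_RENDER_REPAIRED", "info",
                f"expression {index} repaired after: {failures[index]}"))
        else:
            warnings.append(QualityWarning(
                "MATH_RENDER_FAILED", "warning",
                f"expression {index}: {failures[index]}"))
    # Only unrepaired expressions count as render errors.
    error_count = sum(1 for w in warnings if w.code == "MATH_RENDER_FAILED")
    return warnings, error_count
```

With the two failures listed under Current Precondition both repaired, this sketch yields an error count of 0 and two MATH_RENDER_REPAIRED entries.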
Architecture Plan
WP11.1: Failed Math Detail Capture
Actions:
- Add a project-owned result type that can include failed MathExpression records and checker messages.
- Preserve the current check_math_renderability() return behavior for existing callers.
- Keep expression extraction outside fenced code and inline code.
Expected output:
- Conversion can access failed expression spans without parsing warning message text.
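One way to obtain expression spans while skipping fenced and inline code is to mask code regions before searching for math delimiters. The sketch below is an assumption, not the project's extractor: find_math_spans and the three patterns are hypothetical, and it only recognizes $...$ and $$...$$ delimiters and properly closed fences.

```python
import re
from typing import List, Tuple

_FENCE = re.compile(r"^(```|~~~).*?^\1[^\n]*$", re.MULTILINE | re.DOTALL)
_INLINE_CODE = re.compile(r"`[^`\n]+`")
_MATH = re.compile(r"\$\$(.+?)\$\$|\$([^$\n]+?)\$", re.DOTALL)


def _blank(match: "re.Match[str]") -> str:
    # Replace everything except newlines so offsets and line breaks survive.
    return re.sub(r"[^\n]", " ", match.group(0))


def find_math_spans(markdown: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, body) for math bodies outside code regions."""
    # Mask fenced blocks first, then inline code; lengths stay unchanged.
    masked = _INLINE_CODE.sub(_blank, _FENCE.sub(_blank, markdown))
    spans = []
    for m in _MATH.finditer(masked):
        group = 1 if m.group(1) is not None else 2
        start, end = m.span(group)
        # Offsets are valid in the original text because masking kept lengths.
        spans.append((start, end, markdown[start:end]))
    return spans
```

Because each span carries offsets into the original Markdown, a failed expression can be located directly instead of being recovered from warning message text.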
WP11.2: Repair Module
Actions:
- Add src/pdf2md/math_repair.py.
- Define repair result records.
- Generate candidates only for failed expressions.
- Revalidate candidates through the injected checker.
- Apply replacements from right to left so Markdown spans remain stable.
Expected output:
- Pure string-level repair behavior that is deterministic, local-only, and independently testable.
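The actions above (candidates for failed spans only, revalidation through an injected checker, right-to-left replacement) can be sketched as follows. RepairRecord, repair_failed_spans, and the propose/passes callables are hypothetical names chosen for the example; the real module's types may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class RepairRecord:
    start: int      # offset of the expression body in the original Markdown
    end: int        # exclusive end offset
    original: str
    repaired: str


def repair_failed_spans(
    markdown: str,
    failed_spans: Sequence[Tuple[int, int]],
    propose: Callable[[str], Optional[str]],  # builds a repair candidate
    passes: Callable[[str], bool],            # injected local checker
) -> Tuple[str, List[RepairRecord]]:
    records: List[RepairRecord] = []
    for start, end in failed_spans:
        original = markdown[start:end]
        candidate = propose(original)
        if candidate is None or candidate == original:
            continue  # no rule applied
        if not passes(candidate):
            continue  # candidate still fails: never written back
        records.append(RepairRecord(start, end, original, candidate))
    repaired = markdown
    # Apply from right to left so earlier offsets stay valid.
    for rec in sorted(records, key=lambda r: r.start, reverse=True):
        repaired = repaired[:rec.start] + rec.repaired + repaired[rec.end:]
    return repaired, records
```

Injecting the checker as a callable lets default tests pass a fake that needs neither Node.js nor MathJax, while convert and recheck can share the same function.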
WP11.3: Conversion And Recheck Integration
Actions:
- Route convert normalized Markdown through repair before final metadata/report construction.
- Route recheck Markdown through the same repair path before rewriting metadata/report.
- Re-run final quality checks after any repair.
- Preserve asset checking and strict-local behavior unchanged.
Expected output:
- Fresh conversions and rechecks both benefit from MathJax warning mitigation.
WP11.4: Tests
Default tests:
- Repeated superscripts are repaired only when the original expression failed.
- \end{a} repairs to \end{array} only when array environments are unbalanced.
- A candidate that still fails is not written back.
- Passing expressions are not changed.
- Conversion writes repaired Markdown only after candidate revalidation.
- Recheck can repair an existing Markdown output and regenerate metadata/report.
- Existing unavailable-checker behavior remains nonfatal.
Optional local validation:
- Run uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite.
- Confirm the generated report has Math render error count: 0 for the requested sample, or record any remaining failures exactly.
Acceptance Criteria
- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or samples/.
- pdf2md convert and pdf2md recheck share the same repair behavior.
- MathJax failed spans are repaired only after candidate revalidation succeeds.
- Successfully repaired formulas remain visible through MATH_RENDER_REPAIRED info warnings.
- Existing strict-local and MinerU-only constraints are unchanged.
- samples/MITC공부.pdf is validated locally as requested, with generated outputs kept ignored under outputs/.
Hard Failure Criteria
- Repair changes a math span that did not fail initial MathJax validation.
- Repair drops an entire formula or removes meaningful LaTeX tokens solely to silence warnings.
- Repair claims success without re-running the local checker on the candidate.
- convert or recheck starts requiring MathJax when it was previously optional.
- Default tests require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or samples/.
- samples/ or generated outputs/ files are committed.
Verification Commands
uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py
uv run pytest
uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite
git diff --check
git status --short --untracked-files=all
Handoff Requirements
After implementation:
- Update PROGRESS.md with files changed, commands run, test outcomes, sample validation outcome, known failures, and next action.
- Keep sample PDFs and generated outputs out of the commit.
- Commit the completed sprint if verification passes.
Implementation Handoff
- Files changed: src/pdf2md/quality.py, src/pdf2md/math_repair.py, src/pdf2md/conversion.py, src/pdf2md/ir.py, tests, ARCHITECTURE.md, docs/V1IMPLEMENTATIONPLAN.md, PLAN.md, and PROGRESS.md.
- Default verification: uv run pytest passed 172 tests with 1 skipped.
- Targeted verification: uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py passed 56 tests.
- Requested sample verification: uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite succeeded; final report shows Math render error count: 0 and two MATH_RENDER_REPAIRED info warnings.
- Known failures: none.
- Residual risk: repair rules are deliberately narrow; future PDFs may expose MathJax failures that should remain warnings until a deterministic rule is added and tested.
- Next action: optional Obsidian visual review or additional sample validation.