Sprint 11 Contract: MathJax Warning Mitigation

Status: Implemented
Last updated: 2026-05-11

Objective

Add a conservative local cleanup pass for MathJax-invalid formulas:

  1. Run the existing MathJax renderability check on normalized Markdown.
  2. Build repair candidates only for expressions that failed MathJax validation.
  3. Re-check each candidate with the same local checker.
  4. Replace only candidates that pass.
  5. Re-run final quality checks before writing Markdown, metadata JSON, and report Markdown.

The feature should reduce MATH_RENDER_FAILED warnings without hiding that a formula was changed.
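
The first four steps can be sketched as a small loop. This is a minimal illustration, not the project's implementation: it assumes a checker callable that returns an error message for a failing expression and None for a passing one, and the names `mitigate`, `Checker`, and `propose` are hypothetical.

```python
from typing import Callable, Iterable, Optional

# Assumed checker shape: error message for a failing expression, None when it renders.
Checker = Callable[[str], Optional[str]]


def mitigate(
    expressions: list[str],
    check: Checker,
    propose: Callable[[str], Iterable[str]],
) -> tuple[list[str], list[int]]:
    """Return the (possibly repaired) expressions and the indexes that changed."""
    result = list(expressions)
    repaired: list[int] = []
    for index, expr in enumerate(expressions):
        if check(expr) is None:
            continue  # step 1: expressions that pass are never touched
        for candidate in propose(expr):  # step 2: candidates only for failures
            if candidate != expr and check(candidate) is None:  # step 3: same checker
                result[index] = candidate  # step 4: replace only on a pass
                repaired.append(index)
                break
    return result, repaired
```

Step 5 is left to the caller: after `mitigate` returns, the final quality checks run again on the rewritten Markdown before anything is written to disk.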

Current Precondition

  • pdf2md convert writes normalized Markdown, metadata JSON, and <stem>.report.md.
  • pdf2md recheck can rerun quality checks for an existing generated Markdown file without rerunning MinerU.
  • Local MathJax checking is already optional and nonfatal.
  • outputs/MITC공부/MITC공부.md currently has two MathJax render failures:
    • expression 8: Double exponent: use braces to clarify
    • expression 83: Unknown environment 'a'
  • samples/MITC공부.pdf is the requested real local validation sample.

Touched Surfaces

Allowed during implementation:

  • src/pdf2md/quality.py
  • src/pdf2md/math_repair.py
  • src/pdf2md/conversion.py
  • src/pdf2md/ir.py
  • tests/test_quality.py
  • tests/test_math_repair.py
  • tests/test_conversion.py
  • tests/test_cli.py
  • docs/Sprints/SPRINT11CONTRACT.md
  • PLAN.md
  • PROGRESS.md

Not allowed:

  • Remote OCR, remote LLMs, remote render APIs, or external document upload paths.
  • Alternate PDF conversion engines.
  • Switchable conversion-engine behavior.
  • A full LaTeX parser or symbolic math rewrite engine.
  • New CLI flags unless a later user request explicitly asks for them.
  • Mandatory default tests that require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or samples/.
  • Committed files under samples/.
  • Committed generated conversion outputs under outputs/.

Product Behavior

Repair activation:

  • Repair runs automatically when a local math checker is available and at least one math expression fails validation.
  • If the checker is unavailable, behavior remains unchanged: conversion/recheck continues with an info-level unavailable-checker warning.
  • The same repair path applies to fresh convert output and existing Markdown processed through recheck.
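
The activation rule can be read as a small guard. The sketch below assumes the checker is passed in as an optional callable and that warnings are plain dictionaries; the warning code `MATH_CHECKER_UNAVAILABLE` and the function name `plan_repair` are placeholders, not identifiers confirmed by this contract.

```python
from typing import Callable, Optional

Checker = Callable[[str], Optional[str]]  # assumed: error message or None


def plan_repair(
    expressions: list[str], check: Optional[Checker]
) -> tuple[list[int], list[dict]]:
    """Return indexes of expressions to repair plus any activation warnings."""
    if check is None:
        # Checker unavailable: nonfatal, info-level, and no repair is attempted.
        warning = {"code": "MATH_CHECKER_UNAVAILABLE", "severity": "info"}  # hypothetical code
        return [], [warning]
    failed = [i for i, expr in enumerate(expressions) if check(expr) is not None]
    return failed, []  # an empty index list means the repair pass does not run
```

Because the guard takes only expressions and a checker, convert and recheck could call it in the same way.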

Initial deterministic repair rules:

  • Repeated same-direction script repair:
    • Convert consecutive superscripts/subscripts such as ^ {i} ^ {t} to ^ {i} {} ^ {t}.
    • This resolves MathJax double-super/subscript syntax while preserving both script tokens.
  • Truncated array environment repair:
    • Convert \end{a} to \end{array} only when the expression's \begin{array} and \end{array} counts do not match.
    • This targets obvious extraction truncation, not arbitrary environment renaming.
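
Both rules can be expressed as small string transforms. The following is a sketch under stated assumptions rather than the project's code: the function names are hypothetical, the script rule handles only braced arguments without nested braces, and "unmatched" is read here as more \begin{array} than \end{array} occurrences.

```python
import re

# A braced script (^ {..} or _ {..}) immediately followed by another script
# of the same direction. Nested braces are deliberately not handled here.
_REPEATED_SCRIPT = re.compile(r"(([\^_])\s*\{[^{}]*\})\s*(?=\2)")


def split_repeated_scripts(expr: str) -> str:
    """Insert an empty group between consecutive same-direction scripts."""
    return _REPEATED_SCRIPT.sub(lambda m: m.group(1) + " {} ", expr)


def complete_truncated_array_end(expr: str) -> str:
    r"""Rewrite \end{a} to \end{array} only when array environments are unbalanced."""
    missing = expr.count(r"\begin{array}") - expr.count(r"\end{array}")
    if missing <= 0:
        return expr  # balanced: \end{a} is left alone
    return expr.replace(r"\end{a}", r"\end{array}", missing)
```

Neither function decides whether its output is acceptable. Each only produces a candidate, which still has to pass the local checker before it replaces the original expression.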

Provenance:

  • Applied repairs produce MATH_RENDER_REPAIRED info warnings.
  • Successfully repaired expressions must not count toward math_render_error_count.
  • Unrepaired expressions keep the original MATH_RENDER_FAILED warning behavior.
  • The report remains derived from metadata and local quality checks.
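
One way to keep repaired formulas visible while excluding them from the error count is to derive the count from warning codes. In this sketch only the codes MATH_RENDER_REPAIRED and MATH_RENDER_FAILED and the name math_render_error_count come from this contract; the `QualityWarning` fields, the `repaired_warning` helper, and the "warning" severity used for failures are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityWarning:  # hypothetical record shape
    code: str
    severity: str
    message: str


def repaired_warning(index: int, original: str, repaired: str) -> QualityWarning:
    """Record what changed so the repair stays visible in metadata and the report."""
    return QualityWarning(
        code="MATH_RENDER_REPAIRED",
        severity="info",
        message=f"expression {index}: {original!r} -> {repaired!r}",
    )


def math_render_error_count(warnings: list[QualityWarning]) -> int:
    """Count only unrepaired failures; repaired expressions are excluded."""
    return sum(1 for w in warnings if w.code == "MATH_RENDER_FAILED")
```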

Architecture Plan

WP11.1: Failed Math Detail Capture

Actions:

  • Add a project-owned result type that can include failed MathExpression records and checker messages.
  • Preserve the current check_math_renderability() return behavior for existing callers.
  • Keep expression extraction outside fenced code and inline code.

Expected output:

  • Conversion can access failed expression spans without parsing warning message text.
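
A sketch of how failed expressions might be captured together with their spans, so callers never need to parse warning text. The field names on `MathExpression` and `FailedMath`, the `$...$` and `$$...$$` delimiters, and the masking approach are assumptions for illustration; the project's actual types may differ.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class MathExpression:  # assumed fields
    index: int
    start: int  # offset of the opening delimiter in the Markdown
    end: int    # offset just past the closing delimiter
    body: str


@dataclass(frozen=True)
class FailedMath:  # hypothetical pairing of an expression with the checker message
    expression: MathExpression
    message: str


_FENCED = re.compile(r"^```.*?^```[ \t]*$", re.M | re.S)
_INLINE_CODE = re.compile(r"`[^`\n]+`")
_MATH = re.compile(r"\$\$(.+?)\$\$|\$([^$\n]+?)\$", re.S)


def _mask(match: re.Match) -> str:
    return " " * len(match.group())  # same length keeps offsets valid


def extract_math(markdown: str) -> list[MathExpression]:
    """Find math spans while ignoring fenced code blocks and inline code."""
    masked = _INLINE_CODE.sub(_mask, _FENCED.sub(_mask, markdown))
    return [
        MathExpression(i, m.start(), m.end(), m.group(1) or m.group(2))
        for i, m in enumerate(_MATH.finditer(masked))
    ]


def failed_math(
    markdown: str, check: Callable[[str], Optional[str]]
) -> list[FailedMath]:
    """Return only the expressions the checker rejects, with their messages."""
    failures = []
    for expr in extract_math(markdown):
        message = check(expr.body)
        if message is not None:
            failures.append(FailedMath(expr, message))
    return failures
```

Masking code regions with spaces of equal length, rather than deleting them, means every recorded offset still points into the original Markdown.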

WP11.2: Repair Module

Actions:

  • Add src/pdf2md/math_repair.py.
  • Define repair result records.
  • Generate candidates only for failed expressions.
  • Revalidate candidates through the injected checker.
  • Apply replacements from right to left so Markdown spans remain stable.

Expected output:

  • Pure string-level repair behavior that is deterministic, local-only, and independently testable.
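
The right-to-left rule matters because a repair usually changes the length of an expression. A minimal sketch, assuming replacements arrive as non-overlapping `(start, end, new_text)` tuples measured against the original Markdown; `apply_replacements` is a hypothetical name.

```python
def apply_replacements(
    markdown: str, replacements: list[tuple[int, int, str]]
) -> str:
    """Apply (start, end, new_text) edits given as offsets into the original text."""
    # Highest start first: editing the tail never shifts the offsets of earlier spans.
    for start, end, new_text in sorted(replacements, key=lambda r: r[0], reverse=True):
        markdown = markdown[:start] + new_text + markdown[end:]
    return markdown
```

Applying the same edits left to right would require adjusting every later offset by the accumulated length difference, which is an easy source of off-by-N bugs.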

WP11.3: Conversion And Recheck Integration

Actions:

  • Route convert normalized Markdown through repair before final metadata/report construction.
  • Route recheck Markdown through the same repair path before rewriting metadata/report.
  • Re-run final quality checks after any repair.
  • Preserve asset checking and strict-local behavior unchanged.

Expected output:

  • Fresh conversions and rechecks both benefit from MathJax warning mitigation.

WP11.4: Tests

Default tests:

  • Repeated superscripts are repaired only when the original expression failed.
  • \end{a} repairs to \end{array} only when array environments are unbalanced.
  • A candidate that still fails is not written back.
  • Passing expressions are not changed.
  • Conversion writes repaired Markdown only after candidate revalidation.
  • Recheck can repair an existing Markdown output and regenerate metadata/report.
  • Existing unavailable-checker behavior remains nonfatal.
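
Default tests can stay free of Node.js and MathJax by injecting a fake checker. The pytest-style sketch below uses a hypothetical `try_repair` helper and `FakeChecker` class as stand-ins for the real module; it covers the case where a candidate still fails and the case where a candidate is rechecked before replacing the original.

```python
from typing import Optional


class FakeChecker:
    """Test double: fails any expression containing a marker and records calls."""

    def __init__(self, failing_marker: str) -> None:
        self.failing_marker = failing_marker
        self.calls: list[str] = []

    def __call__(self, expr: str) -> Optional[str]:
        self.calls.append(expr)
        return "render failed" if self.failing_marker in expr else None


def try_repair(expr: str, candidate: str, check) -> str:  # hypothetical helper
    """Return the candidate only if it passes the injected checker."""
    return candidate if check(candidate) is None else expr


def test_failing_candidate_is_not_written_back() -> None:
    check = FakeChecker(failing_marker=r"\end{a}")
    original = r"\begin{array}{c} x \end{a}"
    assert try_repair(original, original + " ", check) == original


def test_candidate_is_rechecked_before_replacing() -> None:
    check = FakeChecker(failing_marker=r"\end{a}")
    candidate = r"\begin{array}{c} x \end{array}"
    assert try_repair(r"\begin{array}{c} x \end{a}", candidate, check) == candidate
    assert check.calls == [candidate]
```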

Optional local validation:

  • Run uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite.
  • Confirm the generated report has Math render error count: 0 for the requested sample, or record any remaining failures exactly.

Acceptance Criteria

  • Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or samples/.
  • pdf2md convert and pdf2md recheck share the same repair behavior.
  • Spans that fail MathJax validation are repaired only after candidate revalidation succeeds.
  • Successfully repaired formulas remain visible through MATH_RENDER_REPAIRED info warnings.
  • Existing strict-local and MinerU-only constraints are unchanged.
  • samples/MITC공부.pdf is validated locally as requested, with generated outputs kept ignored under outputs/.

Hard Failure Criteria

  • Repair changes a math span that did not fail initial MathJax validation.
  • Repair drops an entire formula or removes meaningful LaTeX tokens solely to silence warnings.
  • Repair claims success without re-running the local checker on the candidate.
  • convert or recheck starts requiring MathJax when it was previously optional.
  • Default tests require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or samples/.
  • samples/ or generated outputs/ files are committed.

Verification Commands

uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py
uv run pytest
uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite
git diff --check
git status --short --untracked-files=all

Handoff Requirements

After implementation:

  • Update PROGRESS.md with files changed, commands run, test outcomes, sample validation outcome, known failures, and next action.
  • Keep sample PDFs and generated outputs out of the commit.
  • Commit the completed sprint if verification passes.

Implementation Handoff

  • Files changed: src/pdf2md/quality.py, src/pdf2md/math_repair.py, src/pdf2md/conversion.py, src/pdf2md/ir.py, tests, ARCHITECTURE.md, docs/V1IMPLEMENTATIONPLAN.md, PLAN.md, and PROGRESS.md.
  • Default verification: uv run pytest passed 172 tests with 1 skipped.
  • Targeted verification: uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py passed 56 tests.
  • Requested sample verification: uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite succeeded; final report shows Math render error count: 0 and two MATH_RENDER_REPAIRED info warnings.
  • Known failures: none.
  • Residual risk: repair rules are deliberately narrow; future PDFs may expose MathJax failures that should remain warnings until a deterministic rule is added and tested.
  • Next action: optional Obsidian visual review or additional sample validation.