Files
PDFToMD/docs/MATHJAXCHECKERPLAN.md
T
2026-05-08 16:42:19 +09:00

11 KiB

MathJax Local Render Checker Implementation Plan

Purpose

Add a local MathJax-based render checker so the converter can validate whether extracted LaTeX formulas are likely to render in Obsidian. The checker must remain a quality signal only: failed formulas produce structured warnings, metadata counts, and report entries, but they do not stop conversion when Markdown output can still be produced.

This plan is implementation planning only. It does not add a second PDF conversion engine, cloud service, remote API, or manual review workflow.

Implementation status: implemented on 2026-05-08 with a local Node.js helper, Python MathJax wrapper, conversion integration, doctor diagnostics, setup documentation, and mocked default tests. Real checker execution uses the official local mathjax package and requires npm install to populate local node_modules/.

Product Context

The project already normalizes inline math to $...$ and display math to $$...$$. src/pdf2md/quality.py already has a math renderability boundary through check_math_renderability() and MathCheckerUnavailable, but current conversions record an info warning when no checker is injected.

The next implementation should replace that unavailable-checker path with a real local MathJax check when local Node.js and MathJax are available.

Relevant existing behavior:

  • Conversion remains local-only.
  • MinerU 3.1.0 remains the only PDF conversion engine.
  • Quality warnings are non-fatal unless no usable output can be produced.
  • Metadata and .report.md already include math_render_error_count.
  • Default tests must not require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or sample PDFs.

References

Design Decisions

  1. Use MathJax, not KaTeX, as the primary checker.

    • Obsidian compatibility is the output standard.
    • Obsidian uses MathJax for math rendering.
    • KaTeX can remain a future fast preflight option, but it should not define v1 pass/fail behavior.
  2. Run MathJax locally through Node.js.

    • Do not use a CDN.
    • Do not fetch packages at conversion time.
    • Do not call remote render APIs.
  3. Batch formulas in one Node process per conversion.

    • Spawning one process per formula would be too slow for math-heavy papers.
    • samples/MITC공부.pdf produced 126 math expressions, so batch checking is the practical baseline.
  4. Treat unavailable tooling differently from invalid math.

    • Missing Node.js, missing MathJax, bad helper path, timeout, or invalid helper JSON should produce an info-level unavailable-checker warning.
    • A MathJax parse/render failure for a specific expression should produce a warning-level MATH_RENDER_FAILED record and increment math_render_error_count.
  5. Preserve conversion continuity.

    • Math render failures never remove formulas from the Markdown.
    • Math render failures do not trigger fallback engines.
    • The generated report remains derived from metadata and local checks.

Proposed Touched Surfaces

  • src/pdf2md/quality.py

    • Replace body-only math iteration with expression records carrying body, display mode, index, and Markdown span.
    • Keep code fence and inline code protection.
    • Keep unavailable-checker behavior non-fatal.
  • src/pdf2md/math_render.py

    • Add the Python wrapper for local MathJax checking.
    • Probe Node.js availability.
    • Execute the Node helper with JSON stdin.
    • Parse JSON stdout into project-owned check results.
    • Convert helper failures into MathCheckerUnavailable.
  • tools/mathjax-checker/check.mjs

    • Add the Node helper.
    • Load local MathJax components.
    • Accept JSON input with expressions.
    • Return JSON results only.
  • package.json and lockfile, or equivalent local setup documentation

    • Add a local MathJax Node dependency if the project chooses a committed Node package manifest.
    • The implementation should not install npm dependencies during conversion.
  • src/pdf2md/conversion.py

    • Wire the default local checker when available, while preserving dependency injection for tests.
    • Keep the public Python API compatible unless a later sprint explicitly changes it.
  • src/pdf2md/doctor.py

    • Add diagnostic checks for Node.js and local MathJax checker availability.
    • Report missing MathJax as a warning, not a hard failure, unless the project later decides the checker is mandatory.
  • README.md or setup documentation

    • Document local MathJax checker setup.
    • Explain that missing MathJax does not block conversion but leaves renderability unvalidated.
  • PROGRESS.md

    • Record the implementation and verification outcome after completion.

Data Model Plan

Add a small expression record for quality checking:

@dataclass(frozen=True)
class MathExpression:
    index: int
    body: str
    display: bool
    markdown_span: tuple[int, int]

The checker output should be project-owned and independent of MathJax internals:

@dataclass(frozen=True)
class MathCheckResult:
    ok: bool
    message: str = ""

If per-expression metadata is later needed, extend warning messages first. Do not expose raw MathJax objects in metadata or public API return values.

Node Helper Contract

Input over stdin:

{
  "expressions": [
    {"index": 0, "body": "x^2", "display": false},
    {"index": 1, "body": "\\frac{1}{2}", "display": true}
  ]
}

Output over stdout:

{
  "results": [
    {"index": 0, "ok": true},
    {"index": 1, "ok": false, "message": "MathJax error message"}
  ]
}

stderr is reserved for diagnostics only. The Python wrapper should not depend on stderr format.

Python Wrapper Behavior

The wrapper should:

  1. Locate node.
  2. Locate the helper script.
  3. Send all expressions through JSON stdin.
  4. Set a deterministic timeout.
  5. Require valid JSON stdout.
  6. Map each result by expression index.
  7. Return MathCheckResult values for expression failures.
  8. Raise MathCheckerUnavailable for tool-level failures.

Recommended default timeout policy:

  • Start with one conversion-level MathJax timeout.
  • Use a conservative default such as 60 seconds for a document batch.
  • Make the timeout test-injectable.
  • Do not add CLI flags for timeout unless the user explicitly asks for configuration.

Integration Plan

  1. Refactor math extraction in quality.py.

    • Add expression records.
    • Preserve existing code-block exclusions.
    • Preserve inline/display count behavior.
    • Add tests for display mode and spans.
  2. Add mocked MathJax wrapper tests.

    • Fake successful Node JSON response.
    • Fake per-expression failure.
    • Fake missing node.
    • Fake timeout.
    • Fake invalid JSON.
    • Fake mismatched expression indexes.
  3. Add the Node helper.

    • Keep stdout as JSON only.
    • Ensure local package resolution.
    • Avoid remote imports or CDN URLs.
  4. Wire the checker into conversion.

    • If a test injects math_checker, use the injected checker.
    • Otherwise, build a default local MathJax checker when available.
    • If unavailable, keep the current info warning behavior.
  5. Extend doctor.

    • Report Node.js availability.
    • Report local MathJax package/helper availability.
    • Keep missing MathJax as WARN.
  6. Update setup docs.

    • Explain how to install local MathJax dependencies.
    • Explain expected report behavior with and without MathJax.
  7. Run optional real fixture validation.

    • Re-run samples/MITC공부.pdf only under an explicit local fixture gate or direct user request.
    • Confirm MATH_RENDER_FAILED unavailable-checker warning disappears when MathJax is installed.

Test Plan

Default fast tests, no real Node or MathJax required:

  • uv run pytest tests/test_quality.py
  • uv run pytest tests/test_conversion.py
  • uv run pytest tests/test_doctor.py tests/test_cli.py
  • uv run pytest

New required tests:

  • Extract inline and display expressions with correct display values.
  • Ignore math-like text inside fenced code and inline code.
  • Count failures from injected checker results.
  • Preserve conversion success when some formulas fail render checks.
  • Preserve info warning when the checker is unavailable.
  • Validate Python wrapper command construction and JSON stdin/stdout handling with a fake runner.
  • Validate timeout and invalid JSON handling.
  • Validate doctor warning output when Node.js or MathJax is missing.

Optional local tests:

  • Run the Node helper against a small expression list.
  • Run the converter on samples/MITC공부.pdf.
  • Confirm report fields:
    • Math render error count is actual failure count.
    • Missing checker info warning is absent when MathJax is available.
    • Asset link counts remain 0 missing and 0 invalid for the sample.

Acceptance Criteria

  • Default test suite passes without Node.js or MathJax.
  • Local-only policy is preserved: no CDN, remote API, or document upload path.
  • pdf2md doctor reports MathJax checker availability clearly.
  • Conversion still succeeds when MathJax is unavailable, with an info warning.
  • Conversion still succeeds when individual formulas fail, with warning records.
  • .metadata.json and .report.md show actual math render failure counts when MathJax is available.
  • The generated Markdown is not changed by the checker.

Hard Failure Criteria

  • The checker blocks conversion when Markdown output exists.
  • The checker uses a remote service or CDN at runtime.
  • Default tests require Node.js, MathJax, MinerU, GPU, network, Obsidian, or sample PDFs.
  • Raw MathJax output objects become public API return types.
  • The report records renderability as successful when the checker did not actually run.

Open Decisions Before Implementation

  1. Dependency packaging:

    • Use committed package.json and lockfile for a local MathJax package, or document a manual local npm setup.
    • Recommended: commit package.json and lockfile so setup is reproducible.
  2. Default checker activation:

    • Recommended: auto-attempt the local checker when available; otherwise emit the existing unavailable-checker info warning.
  3. Timeout value:

    • Recommended initial default: 60 seconds per document batch, test-injectable and documented.
  4. Doctor severity:

    • Recommended: missing MathJax checker is WARN, not FAIL, because conversion can still produce useful output.
  5. Real fixture gate:

    • Recommended: keep real sample conversion explicit and opt-in for tests, but allow direct user-requested runs.

Suggested Implementation Contract

Objective:

  • Implement a local MathJax render checker that validates normalized Markdown math expressions and records failures in metadata/report output without changing conversion continuity.

Expected outputs:

  • Python MathJax checker wrapper.
  • Node MathJax helper.
  • Updated quality extraction for display/inline expression records.
  • Doctor warning coverage for missing checker dependencies.
  • Setup documentation.
  • Fast mocked tests and optional real local checker validation.

Non-goals:

  • No cloud rendering.
  • No Obsidian app automation.
  • No full LaTeX engine.
  • No manual review queue.
  • No runtime engine selection.
  • No correction or rewriting of failed formulas.

Verification:

  • uv run pytest
  • git diff --check
  • Optional local Node helper smoke test.
  • Optional samples/MITC공부.pdf conversion after local MathJax setup.