Files
PDFToMD/docs/MATHJAXCHECKERPLAN.md
T
2026-05-08 16:42:19 +09:00

299 lines
11 KiB
Markdown

# MathJax Local Render Checker Implementation Plan
## Purpose
Add a local MathJax-based render checker so the converter can validate whether extracted LaTeX formulas are likely to render in Obsidian. The checker must remain a quality signal only: failed formulas produce structured warnings, metadata counts, and report entries, but they do not stop conversion when Markdown output can still be produced.
This plan is implementation planning only. It does not add a second PDF conversion engine, cloud service, remote API, or manual review workflow.
Implementation status: implemented on 2026-05-08 with a local Node.js helper, Python MathJax wrapper, conversion integration, doctor diagnostics, setup documentation, and mocked default tests. Real checker execution uses the official local `mathjax` package and requires `npm install` to populate local `node_modules/`.
## Product Context
The project already normalizes inline math to `$...$` and display math to `$$...$$`. `src/pdf2md/quality.py` already has a math renderability boundary through `check_math_renderability()` and `MathCheckerUnavailable`, but current conversions record an info warning when no checker is injected.
The next implementation should replace that unavailable-checker path with a real local MathJax check when local Node.js and MathJax are available.
Relevant existing behavior:
- Conversion remains local-only.
- MinerU 3.1.0 remains the only PDF conversion engine.
- Quality warnings are non-fatal unless no usable output can be produced.
- Metadata and `.report.md` already include `math_render_error_count`.
- Default tests must not require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or sample PDFs.
## References
- Obsidian documents math expressions as MathJax/LaTeX notation: https://help.obsidian.md/advanced-syntax
- MathJax supports Node/server-side use through components: https://docs.mathjax.org/en/v4.1/server/components.html
- MathJax can convert TeX strings to SVG, including `tex2svgPromise()`: https://docs.mathjax.org/en/latest/web/convert.html
## Design Decisions
1. Use MathJax, not KaTeX, as the primary checker.
- Obsidian compatibility is the output standard.
- Obsidian uses MathJax for math rendering.
- KaTeX can remain a future fast preflight option, but it should not define v1 pass/fail behavior.
2. Run MathJax locally through Node.js.
- Do not use a CDN.
- Do not fetch packages at conversion time.
- Do not call remote render APIs.
3. Batch formulas in one Node process per conversion.
- Spawning one process per formula would be too slow for math-heavy papers.
- `samples/MITC공부.pdf` produced 126 math expressions, so batch checking is the practical baseline.
4. Treat unavailable tooling differently from invalid math.
- Missing Node.js, missing MathJax, bad helper path, timeout, or invalid helper JSON should produce an info-level unavailable-checker warning.
- A MathJax parse/render failure for a specific expression should produce a warning-level `MATH_RENDER_FAILED` record and increment `math_render_error_count`.
5. Preserve conversion continuity.
- Math render failures never remove formulas from the Markdown.
- Math render failures do not trigger fallback engines.
- The generated report remains derived from metadata and local checks.
## Proposed Touched Surfaces
- `src/pdf2md/quality.py`
- Replace body-only math iteration with expression records carrying body, display mode, index, and Markdown span.
- Keep code fence and inline code protection.
- Keep unavailable-checker behavior non-fatal.
- `src/pdf2md/math_render.py`
- Add the Python wrapper for local MathJax checking.
- Probe Node.js availability.
- Execute the Node helper with JSON stdin.
- Parse JSON stdout into project-owned check results.
- Convert helper failures into `MathCheckerUnavailable`.
- `tools/mathjax-checker/check.mjs`
- Add the Node helper.
- Load local MathJax components.
- Accept JSON input with expressions.
- Return JSON results only.
- `package.json` and lockfile, or equivalent local setup documentation
- Add a local MathJax Node dependency if the project chooses a committed Node package manifest.
- The implementation should not install npm dependencies during conversion.
- `src/pdf2md/conversion.py`
- Wire the default local checker when available, while preserving dependency injection for tests.
- Keep the public Python API compatible unless a later sprint explicitly changes it.
- `src/pdf2md/doctor.py`
- Add diagnostic checks for Node.js and local MathJax checker availability.
- Report missing MathJax as a warning, not a hard failure, unless the project later decides the checker is mandatory.
- `README.md` or setup documentation
- Document local MathJax checker setup.
- Explain that missing MathJax does not block conversion but leaves renderability unvalidated.
- `PROGRESS.md`
- Record the implementation and verification outcome after completion.
## Data Model Plan
Add a small expression record for quality checking:
```python
@dataclass(frozen=True)
class MathExpression:
index: int
body: str
display: bool
markdown_span: tuple[int, int]
```
The checker output should be project-owned and independent of MathJax internals:
```python
@dataclass(frozen=True)
class MathCheckResult:
ok: bool
message: str = ""
```
If per-expression metadata is later needed, extend warning messages first. Do not expose raw MathJax objects in metadata or public API return values.
## Node Helper Contract
Input over stdin:
```json
{
"expressions": [
{"index": 0, "body": "x^2", "display": false},
{"index": 1, "body": "\\frac{1}{2}", "display": true}
]
}
```
Output over stdout:
```json
{
"results": [
{"index": 0, "ok": true},
{"index": 1, "ok": false, "message": "MathJax error message"}
]
}
```
stderr is reserved for diagnostics only. The Python wrapper should not depend on stderr format.
## Python Wrapper Behavior
The wrapper should:
1. Locate `node`.
2. Locate the helper script.
3. Send all expressions through JSON stdin.
4. Set a deterministic timeout.
5. Require valid JSON stdout.
6. Map each result by expression index.
7. Return `MathCheckResult` values for expression failures.
8. Raise `MathCheckerUnavailable` for tool-level failures.
Recommended default timeout policy:
- Start with one conversion-level MathJax timeout.
- Use a conservative default such as 60 seconds for a document batch.
- Make the timeout test-injectable.
- Do not add CLI flags for timeout unless the user explicitly asks for configuration.
## Integration Plan
1. Refactor math extraction in `quality.py`.
- Add expression records.
- Preserve existing code-block exclusions.
- Preserve inline/display count behavior.
- Add tests for display mode and spans.
2. Add mocked MathJax wrapper tests.
- Fake successful Node JSON response.
- Fake per-expression failure.
- Fake missing `node`.
- Fake timeout.
- Fake invalid JSON.
- Fake mismatched expression indexes.
3. Add the Node helper.
- Keep stdout as JSON only.
- Ensure local package resolution.
- Avoid remote imports or CDN URLs.
4. Wire the checker into conversion.
- If a test injects `math_checker`, use the injected checker.
- Otherwise, build a default local MathJax checker when available.
- If unavailable, keep the current info warning behavior.
5. Extend doctor.
- Report Node.js availability.
- Report local MathJax package/helper availability.
- Keep missing MathJax as WARN.
6. Update setup docs.
- Explain how to install local MathJax dependencies.
- Explain expected report behavior with and without MathJax.
7. Run optional real fixture validation.
- Re-run `samples/MITC공부.pdf` only under an explicit local fixture gate or direct user request.
- Confirm `MATH_RENDER_FAILED` unavailable-checker warning disappears when MathJax is installed.
## Test Plan
Default fast tests, no real Node or MathJax required:
- `uv run pytest tests/test_quality.py`
- `uv run pytest tests/test_conversion.py`
- `uv run pytest tests/test_doctor.py tests/test_cli.py`
- `uv run pytest`
New required tests:
- Extract inline and display expressions with correct `display` values.
- Ignore math-like text inside fenced code and inline code.
- Count failures from injected checker results.
- Preserve conversion success when some formulas fail render checks.
- Preserve info warning when the checker is unavailable.
- Validate Python wrapper command construction and JSON stdin/stdout handling with a fake runner.
- Validate timeout and invalid JSON handling.
- Validate doctor warning output when Node.js or MathJax is missing.
Optional local tests:
- Run the Node helper against a small expression list.
- Run the converter on `samples/MITC공부.pdf`.
- Confirm report fields:
- `Math render error count` is actual failure count.
- Missing checker info warning is absent when MathJax is available.
- Asset link counts remain 0 missing and 0 invalid for the sample.
## Acceptance Criteria
- Default test suite passes without Node.js or MathJax.
- Local-only policy is preserved: no CDN, remote API, or document upload path.
- `pdf2md doctor` reports MathJax checker availability clearly.
- Conversion still succeeds when MathJax is unavailable, with an info warning.
- Conversion still succeeds when individual formulas fail, with warning records.
- `.metadata.json` and `.report.md` show actual math render failure counts when MathJax is available.
- The generated Markdown is not changed by the checker.
## Hard Failure Criteria
- The checker blocks conversion when Markdown output exists.
- The checker uses a remote service or CDN at runtime.
- Default tests require Node.js, MathJax, MinerU, GPU, network, Obsidian, or sample PDFs.
- Raw MathJax output objects become public API return types.
- The report records renderability as successful when the checker did not actually run.
## Open Decisions Before Implementation
1. Dependency packaging:
- Use committed `package.json` and lockfile for a local MathJax package, or document a manual local npm setup.
- Recommended: commit `package.json` and lockfile so setup is reproducible.
2. Default checker activation:
- Recommended: auto-attempt the local checker when available; otherwise emit the existing unavailable-checker info warning.
3. Timeout value:
- Recommended initial default: 60 seconds per document batch, test-injectable and documented.
4. Doctor severity:
- Recommended: missing MathJax checker is WARN, not FAIL, because conversion can still produce useful output.
5. Real fixture gate:
- Recommended: keep real sample conversion explicit and opt-in for tests, but allow direct user-requested runs.
## Suggested Implementation Contract
Objective:
- Implement a local MathJax render checker that validates normalized Markdown math expressions and records failures in metadata/report output without changing conversion continuity.
Expected outputs:
- Python MathJax checker wrapper.
- Node MathJax helper.
- Updated quality extraction for display/inline expression records.
- Doctor warning coverage for missing checker dependencies.
- Setup documentation.
- Fast mocked tests and optional real local checker validation.
Non-goals:
- No cloud rendering.
- No Obsidian app automation.
- No full LaTeX engine.
- No manual review queue.
- No runtime engine selection.
- No correction or rewriting of failed formulas.
Verification:
- `uv run pytest`
- `git diff --check`
- Optional local Node helper smoke test.
- Optional `samples/MITC공부.pdf` conversion after local MathJax setup.