# MathJax Local Render Checker Implementation Plan

## Purpose

Add a local MathJax-based render checker so the converter can validate whether extracted LaTeX formulas are likely to render in Obsidian. The checker must remain a quality signal only: failed formulas produce structured warnings, metadata counts, and report entries, but they do not stop conversion when Markdown output can still be produced.

This plan is implementation planning only. It does not add a second PDF conversion engine, cloud service, remote API, or manual review workflow.

Implementation status: implemented on 2026-05-08 with a local Node.js helper, Python MathJax wrapper, conversion integration, doctor diagnostics, setup documentation, and mocked default tests. Real checker execution uses the official local `mathjax` package and requires `npm install` to populate local `node_modules/`.

## Product Context

The project already normalizes inline math to `$...$` and display math to `$$...$$`. `src/pdf2md/quality.py` already has a math renderability boundary through `check_math_renderability()` and `MathCheckerUnavailable`, but current conversions record an info warning when no checker is injected.

The next implementation should replace that unavailable-checker path with a real local MathJax check when local Node.js and MathJax are available.

Relevant existing behavior:

- Conversion remains local-only.
- MinerU 3.1.0 remains the only PDF conversion engine.
- Quality warnings are non-fatal unless no usable output can be produced.
- Metadata and `.report.md` already include `math_render_error_count`.
- Default tests must not require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or sample PDFs.

## References

- Obsidian documents math expressions as MathJax/LaTeX notation: https://help.obsidian.md/advanced-syntax
- MathJax supports Node/server-side use through components: https://docs.mathjax.org/en/v4.1/server/components.html
- MathJax can convert TeX strings to SVG, including `tex2svgPromise()`: https://docs.mathjax.org/en/latest/web/convert.html

## Design Decisions

1. Use MathJax, not KaTeX, as the primary checker.
   - Obsidian compatibility is the output standard.
   - Obsidian uses MathJax for math rendering.
   - KaTeX can remain a future fast preflight option, but it should not define v1 pass/fail behavior.

2. Run MathJax locally through Node.js.
   - Do not use a CDN.
   - Do not fetch packages at conversion time.
   - Do not call remote render APIs.

3. Batch formulas in one Node process per conversion.
   - Spawning one process per formula would be too slow for math-heavy papers.
   - `samples/MITC공부.pdf` produced 126 math expressions, so batch checking is the practical baseline.

4. Treat unavailable tooling differently from invalid math.
   - Missing Node.js, missing MathJax, bad helper path, timeout, or invalid helper JSON should produce an info-level unavailable-checker warning.
   - A MathJax parse/render failure for a specific expression should produce a warning-level `MATH_RENDER_FAILED` record and increment `math_render_error_count`.

5. Preserve conversion continuity.
   - Math render failures never remove formulas from the Markdown.
   - Math render failures do not trigger fallback engines.
   - The generated report remains derived from metadata and local checks.

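Decisions 4 and 5 can be illustrated with a minimal sketch. It assumes a checker callable that returns one `(ok, message)` pair per formula body and raises `MathCheckerUnavailable` for tool-level problems. The `QualityWarning` record, the `collect_math_warnings` function, the `MATH_CHECKER_UNAVAILABLE` code name, and the `checker_ran` flag are hypothetical names chosen for illustration; only `MATH_RENDER_FAILED` and `MathCheckerUnavailable` come from this plan.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


class MathCheckerUnavailable(Exception):
    """Stand-in for the project's exception; signals a tool-level failure."""


@dataclass(frozen=True)
class QualityWarning:
    """Hypothetical warning record; the real project type may differ."""
    level: str  # "info" or "warning"
    code: str
    message: str


# A checker takes formula bodies and returns one (ok, message) pair per body.
Checker = Callable[[Sequence[str]], List[Tuple[bool, str]]]


def collect_math_warnings(bodies: Sequence[str], checker: Checker):
    """Return (warnings, error_count, checker_ran) without aborting conversion."""
    try:
        results = checker(bodies)
    except MathCheckerUnavailable as exc:
        # Tooling problem: one info-level record, and no failure is counted.
        # "MATH_CHECKER_UNAVAILABLE" is an assumed code name.
        note = QualityWarning("info", "MATH_CHECKER_UNAVAILABLE", str(exc))
        return [note], 0, False

    # Invalid math: one warning-level record per failed expression.
    warnings = [
        QualityWarning("warning", "MATH_RENDER_FAILED", f"expression {i}: {message}")
        for i, (ok, message) in enumerate(results)
        if not ok
    ]
    return warnings, len(warnings), True
```

Returning a separate `checker_ran` flag is one possible way to keep a zero failure count from being read as "all formulas render" when the checker never ran.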
## Proposed Touched Surfaces

- `src/pdf2md/quality.py`
  - Replace body-only math iteration with expression records carrying body, display mode, index, and Markdown span.
  - Keep code fence and inline code protection.
  - Keep unavailable-checker behavior non-fatal.

- `src/pdf2md/math_render.py`
  - Add the Python wrapper for local MathJax checking.
  - Probe Node.js availability.
  - Execute the Node helper with JSON stdin.
  - Parse JSON stdout into project-owned check results.
  - Convert helper failures into `MathCheckerUnavailable`.

- `tools/mathjax-checker/check.mjs`
  - Add the Node helper.
  - Load local MathJax components.
  - Accept JSON input with expressions.
  - Return JSON results only.

- `package.json` and lockfile, or equivalent local setup documentation
  - Add a local MathJax Node dependency if the project chooses a committed Node package manifest.
  - The implementation should not install npm dependencies during conversion.

- `src/pdf2md/conversion.py`
  - Wire the default local checker when available, while preserving dependency injection for tests.
  - Keep the public Python API compatible unless a later sprint explicitly changes it.

- `src/pdf2md/doctor.py`
  - Add diagnostic checks for Node.js and local MathJax checker availability.
  - Report missing MathJax as a warning, not a hard failure, unless the project later decides the checker is mandatory.

- `README.md` or setup documentation
  - Document local MathJax checker setup.
  - Explain that missing MathJax does not block conversion but leaves renderability unvalidated.

- `PROGRESS.md`
  - Record the implementation and verification outcome after completion.

## Data Model Plan

Add a small expression record for quality checking:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MathExpression:
    index: int  # position of the expression in document order
    body: str  # LaTeX source of the formula
    display: bool  # True for display math ($$...$$), False for inline ($...$)
    markdown_span: tuple[int, int]  # (start, end) location in the Markdown text
```

The checker output should be project-owned and independent of MathJax internals:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MathCheckResult:
    ok: bool  # True when MathJax rendered the expression
    message: str = ""  # MathJax error text when ok is False
```

If per-expression metadata is later needed, extend warning messages first. Do not expose raw MathJax objects in metadata or public API return values.

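As an illustration of how such records could be produced while keeping code-fence and inline-code protection, here is a minimal sketch. It is not the project's extractor. It assumes spans are character offsets that include the `$` delimiters, handles only backtick fences that start at column 0, and does not handle escaped `\$` or currency amounts. The function name `extract_math` is hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class MathExpression:
    index: int
    body: str
    display: bool
    markdown_span: tuple[int, int]


# Fenced blocks opened and closed by three backticks at the start of a line.
_FENCE = re.compile(r"^`{3}.*?^`{3}[^\n]*$", re.DOTALL | re.MULTILINE)
_INLINE_CODE = re.compile(r"`[^`\n]+`")
# Display math first, so "$$" is never read as two inline delimiters.
_MATH = re.compile(r"\$\$(.+?)\$\$|\$([^$\n]+?)\$", re.DOTALL)


def _blank(match: re.Match) -> str:
    # Replace code with spaces of equal length so offsets stay valid.
    return " " * len(match.group(0))


def extract_math(markdown: str) -> list[MathExpression]:
    """Collect inline and display math outside of code, in document order."""
    masked = _INLINE_CODE.sub(_blank, _FENCE.sub(_blank, markdown))
    found: list[MathExpression] = []
    for match in _MATH.finditer(masked):
        display = match.group(1) is not None
        group = 1 if display else 2
        body = markdown[match.start(group):match.end(group)].strip()
        found.append(MathExpression(len(found), body, display, match.span()))
    return found
```

Masking code with same-length whitespace, instead of deleting it, is what keeps each `markdown_span` usable against the original text.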
## Node Helper Contract

Input over stdin:

```json
{
  "expressions": [
    {"index": 0, "body": "x^2", "display": false},
    {"index": 1, "body": "\\frac{1}{2}", "display": true}
  ]
}
```

Output over stdout:

```json
{
  "results": [
    {"index": 0, "ok": true},
    {"index": 1, "ok": false, "message": "MathJax error message"}
  ]
}
```

stderr is reserved for diagnostics only. The Python wrapper should not depend on stderr format.

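The Python side of this contract can be sketched as a pair of small functions: one that serializes expressions into the stdin payload and one that validates the stdout payload. This is a minimal sketch under the JSON shapes shown above; `build_request` and `parse_response` are hypothetical names, and the choice to raise `ValueError` for any malformed or mismatched response is an assumption that a caller would translate into `MathCheckerUnavailable`.

```python
import json
from typing import Dict, Sequence, Tuple


def build_request(expressions: Sequence[Tuple[str, bool]]) -> str:
    """Serialize (body, display) pairs into the helper's stdin payload."""
    payload = {
        "expressions": [
            {"index": i, "body": body, "display": display}
            for i, (body, display) in enumerate(expressions)
        ]
    }
    # ensure_ascii=False keeps non-ASCII formula text readable in diagnostics.
    return json.dumps(payload, ensure_ascii=False)


def parse_response(stdout: str, expected_count: int) -> Dict[int, Tuple[bool, str]]:
    """Map each expression index to (ok, message); raise ValueError if malformed."""
    data = json.loads(stdout)  # json.JSONDecodeError is a ValueError subclass
    try:
        by_index = {
            int(item["index"]): (bool(item["ok"]), str(item.get("message", "")))
            for item in data["results"]
        }
    except (KeyError, TypeError, AttributeError) as exc:
        raise ValueError("helper output does not match the contract") from exc
    # Every submitted index must come back exactly once.
    if set(by_index) != set(range(expected_count)):
        raise ValueError("helper results do not match the submitted indexes")
    return by_index
```

Checking the returned index set against the submitted one catches a helper that silently drops or invents expressions.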
## Python Wrapper Behavior

The wrapper should:

1. Locate `node`.
2. Locate the helper script.
3. Send all expressions through JSON stdin.
4. Set a deterministic timeout.
5. Require valid JSON stdout.
6. Map each result by expression index.
7. Return `MathCheckResult` values, with `ok=False` and a message for expressions that fail.
8. Raise `MathCheckerUnavailable` for tool-level failures.

Recommended default timeout policy:

- Start with one conversion-level MathJax timeout.
- Use a conservative default such as 60 seconds for a document batch.
- Make the timeout test-injectable.
- Do not add CLI flags for timeout unless the user explicitly asks for configuration.

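The process-level steps above (locating `node`, running the helper once with a timeout, and turning every tool-level problem into `MathCheckerUnavailable`) can be sketched as follows. This is a minimal sketch, not the project's wrapper: `run_helper` is a hypothetical name, and the `runner` and `which` parameters are assumed injection points so tests can avoid a real Node.js process.

```python
import json
import shutil
import subprocess
from pathlib import Path


class MathCheckerUnavailable(Exception):
    """Stand-in for the project's exception; signals a tool-level failure."""


def run_helper(payload: str, helper: Path, timeout: float = 60.0,
               runner=subprocess.run, which=shutil.which) -> dict:
    """Run the Node helper once for the whole batch and return parsed stdout."""
    node = which("node")
    if node is None:
        raise MathCheckerUnavailable("node not found on PATH")
    if not helper.is_file():
        raise MathCheckerUnavailable(f"helper script not found: {helper}")
    try:
        completed = runner(
            [node, str(helper)],
            input=payload,          # JSON request goes in through stdin
            capture_output=True,    # stderr is captured but never parsed
            encoding="utf-8",
            timeout=timeout,
            check=False,
        )
    except subprocess.TimeoutExpired as exc:
        raise MathCheckerUnavailable(f"helper timed out after {timeout}s") from exc
    except OSError as exc:
        raise MathCheckerUnavailable(f"could not start helper: {exc}") from exc
    if completed.returncode != 0:
        raise MathCheckerUnavailable(f"helper exited with code {completed.returncode}")
    try:
        data = json.loads(completed.stdout)
    except json.JSONDecodeError as exc:
        raise MathCheckerUnavailable("helper printed invalid JSON") from exc
    if not isinstance(data, dict):
        raise MathCheckerUnavailable("helper output is not a JSON object")
    return data
```

Because the timeout and the runner are plain parameters, a test can pass a fake runner that raises `subprocess.TimeoutExpired` without waiting for a real timeout.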
## Integration Plan

1. Refactor math extraction in `quality.py`.
   - Add expression records.
   - Preserve existing code-block exclusions.
   - Preserve inline/display count behavior.
   - Add tests for display mode and spans.

2. Add mocked MathJax wrapper tests.
   - Fake successful Node JSON response.
   - Fake per-expression failure.
   - Fake missing `node`.
   - Fake timeout.
   - Fake invalid JSON.
   - Fake mismatched expression indexes.

3. Add the Node helper.
   - Keep stdout as JSON only.
   - Ensure local package resolution.
   - Avoid remote imports or CDN URLs.

4. Wire the checker into conversion.
   - If a test injects `math_checker`, use the injected checker.
   - Otherwise, build a default local MathJax checker when available.
   - If unavailable, keep the current info warning behavior.

5. Extend doctor.
   - Report Node.js availability.
   - Report local MathJax package/helper availability.
   - Keep missing MathJax as WARN.

6. Update setup docs.
   - Explain how to install local MathJax dependencies.
   - Explain expected report behavior with and without MathJax.

7. Run optional real fixture validation.
   - Re-run `samples/MITC공부.pdf` only under an explicit local fixture gate or direct user request.
   - Confirm the unavailable-checker info warning disappears when MathJax is installed. Any remaining `MATH_RENDER_FAILED` warnings should then reflect real per-expression failures.

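The selection order in step 4 can be sketched in a few lines. This is a minimal sketch: `resolve_checker` and `build_default` are hypothetical names, and it assumes the default checker's factory raises `MathCheckerUnavailable` when Node.js or MathJax cannot be found.

```python
from typing import Callable, Optional


class MathCheckerUnavailable(Exception):
    """Stand-in for the project's exception; signals a tool-level failure."""


def resolve_checker(math_checker: Optional[Callable],
                    build_default: Callable[[], Callable]) -> Optional[Callable]:
    """Pick the checker for one conversion: injected, then local default, then none."""
    if math_checker is not None:
        # Tests and callers that inject a checker always win.
        return math_checker
    try:
        return build_default()
    except MathCheckerUnavailable:
        # No local tooling: the caller records the existing info warning.
        return None
```

Keeping the injected checker first preserves dependency injection for tests, so the default suite never needs to probe for Node.js.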
## Test Plan

Default fast tests, no real Node or MathJax required:

- `uv run pytest tests/test_quality.py`
- `uv run pytest tests/test_conversion.py`
- `uv run pytest tests/test_doctor.py tests/test_cli.py`
- `uv run pytest`

New required tests:

- Extract inline and display expressions with correct `display` values.
- Ignore math-like text inside fenced code and inline code.
- Count failures from injected checker results.
- Preserve conversion success when some formulas fail render checks.
- Preserve info warning when the checker is unavailable.
- Validate Python wrapper command construction and JSON stdin/stdout handling with a fake runner.
- Validate timeout and invalid JSON handling.
- Validate doctor warning output when Node.js or MathJax is missing.

Optional local tests:

- Run the Node helper against a small expression list.
- Run the converter on `samples/MITC공부.pdf`.
- Confirm report fields:
  - `Math render error count` is the actual failure count.
  - Missing checker info warning is absent when MathJax is available.
  - Asset link counts remain 0 missing and 0 invalid for the sample.

## Acceptance Criteria

- Default test suite passes without Node.js or MathJax.
- Local-only policy is preserved: no CDN, remote API, or document upload path.
- `pdf2md doctor` reports MathJax checker availability clearly.
- Conversion still succeeds when MathJax is unavailable, with an info warning.
- Conversion still succeeds when individual formulas fail, with warning records.
- `.metadata.json` and `.report.md` show actual math render failure counts when MathJax is available.
- The generated Markdown is not changed by the checker.

## Hard Failure Criteria

- The checker blocks conversion when Markdown output exists.
- The checker uses a remote service or CDN at runtime.
- Default tests require Node.js, MathJax, MinerU, GPU, network, Obsidian, or sample PDFs.
- Raw MathJax output objects become public API return types.
- The report records renderability as successful when the checker did not actually run.

## Open Decisions Before Implementation

1. Dependency packaging:
   - Use committed `package.json` and lockfile for a local MathJax package, or document a manual local npm setup.
   - Recommended: commit `package.json` and lockfile so setup is reproducible.

2. Default checker activation:
   - Recommended: auto-attempt the local checker when available; otherwise emit the existing unavailable-checker info warning.

3. Timeout value:
   - Recommended initial default: 60 seconds per document batch, test-injectable and documented.

4. Doctor severity:
   - Recommended: missing MathJax checker is WARN, not FAIL, because conversion can still produce useful output.

5. Real fixture gate:
   - Recommended: keep real sample conversion explicit and opt-in for tests, but allow direct user-requested runs.

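The recommended doctor severity (decision 4) could look like the sketch below. It is illustrative only: `DoctorCheck` and `check_mathjax_tooling` are hypothetical names, the row format is not the project's actual doctor output, and the `node_modules/mathjax` location assumes `npm install` was run in the project root.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class DoctorCheck:
    """Hypothetical doctor result row."""
    name: str
    status: str  # "OK" or "WARN"; never "FAIL" for the optional checker
    detail: str


def check_mathjax_tooling(project_root: Path, which=shutil.which) -> List[DoctorCheck]:
    """Report Node.js, helper, and MathJax package availability as OK or WARN."""
    node = which("node")
    helper = project_root / "tools" / "mathjax-checker" / "check.mjs"
    # Assumed install location after running `npm install` in the project root.
    package = project_root / "node_modules" / "mathjax" / "package.json"

    def row(name: str, ok: bool, found: str, missing: str) -> DoctorCheck:
        return DoctorCheck(name, "OK" if ok else "WARN", found if ok else missing)

    return [
        row("node", node is not None, str(node), "node not found on PATH"),
        row("mathjax-helper", helper.is_file(), str(helper), f"missing {helper}"),
        row("mathjax-package", package.is_file(), str(package),
            "run `npm install` to enable render checks"),
    ]
```

Because a missing piece only downgrades a row to WARN, the doctor command can still exit successfully on machines without Node.js.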
## Suggested Implementation Contract

Objective:

- Implement a local MathJax render checker that validates normalized Markdown math expressions and records failures in metadata/report output without changing conversion continuity.

Expected outputs:

- Python MathJax checker wrapper.
- Node MathJax helper.
- Updated quality extraction for display/inline expression records.
- Doctor warning coverage for missing checker dependencies.
- Setup documentation.
- Fast mocked tests and optional real local checker validation.

Non-goals:

- No cloud rendering.
- No Obsidian app automation.
- No full LaTeX engine.
- No manual review queue.
- No runtime engine selection.
- No correction or rewriting of failed formulas.

Verification:

- `uv run pytest`
- `git diff --check`
- Optional local Node helper smoke test.
- Optional `samples/MITC공부.pdf` conversion after local MathJax setup.