feat: mitigate MathJax formula warnings

NINI
2026-05-11 02:08:46 +09:00
parent 005f17bac1
commit 71e6fbcc51
12 changed files with 625 additions and 41 deletions
+1
@@ -200,6 +200,7 @@ Stable warning code examples:
- `GPU_UNAVAILABLE`
- `LOW_CONFIDENCE_FORMULA`
- `MATH_RENDER_FAILED`
- `MATH_RENDER_REPAIRED`
- `ASSET_LINK_MISSING`
- `READING_ORDER_UNCERTAIN`
- `STRICT_LOCAL_VIOLATION`
+8 -8
@@ -4,7 +4,7 @@ This file is the shared work plan for agents. Read it before starting work, then
## Current Goal
Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conversion PDF chunking is implemented. On this PC, full local runtime setup is complete in `.venv`; Markdown quality recheck for existing outputs is implemented. Next planned work is MathJax warning mitigation: after local MathJax validation, conservatively clean only warning-causing math spans, rerun validation, and preserve provenance for changed or still-failing formulas. Manual Obsidian quality review and sample validation remain optional fallback tasks.
Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 11 MathJax warning mitigation is implemented. On this PC, full local runtime setup is complete in `.venv`; Markdown quality recheck for existing outputs is implemented and now shares the same conservative MathJax repair path as fresh conversion. Next work is optional manual Obsidian quality review, additional sample validation, or broader repair rules if future samples expose new deterministic MathJax failure patterns.
## Active Constraints
@@ -35,15 +35,14 @@ Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conve
12. Follow `docs/V1IMPLEMENTATIONPLAN.md` for the v1 implementation sprint sequence.
13. Use `docs/Sprints/SPRINT10CONTRACT.md` for the implemented long-PDF pre-conversion chunking sprint.
14. Use `docs/WORKARCHIVE.md` for completed sprint history, prior verification, runtime setup evidence, and sample conversion evidence.
15. Plan Sprint 11 for MathJax warning mitigation before code changes start.
16. Create `docs/Sprints/SPRINT11CONTRACT.md` for the mitigation sprint if implementation is requested.
17. Keep the mitigation path shared by `pdf2md convert` and `pdf2md recheck` so existing Markdown outputs can be cleaned without rerunning MinerU.
15. Use `docs/Sprints/SPRINT11CONTRACT.md` for the implemented MathJax warning mitigation sprint.
16. Keep the mitigation path shared by `pdf2md convert` and `pdf2md recheck` so existing Markdown outputs can be cleaned without rerunning MinerU.
## Proposed Sprint 11: MathJax Warning Mitigation
## Sprint 11: MathJax Warning Mitigation
Objective:
- Add a conservative local post-validation cleanup pass that attempts to remove only the specific math-span artifacts responsible for MathJax warnings, then reruns MathJax validation before writing final Markdown, metadata JSON, and report Markdown.
- Implemented a conservative local post-validation cleanup pass that attempts to remove only the specific math-span artifacts responsible for MathJax warnings, then reruns MathJax validation before writing final Markdown, metadata JSON, and report Markdown.
Assumptions:
@@ -97,8 +96,7 @@ Hard failure criteria:
## Open Questions
- Which exact cleanup rules should Sprint 11 allow after inspecting current MathJax failure messages? Recommendation: start with deterministic non-semantic artifacts only.
- Should applied mitigations use a new stable warning/info code or be represented through existing metadata/report fields? Recommendation: make repair provenance visible without counting a successfully repaired expression as a render failure.
- None.
## Decisions
@@ -116,6 +114,8 @@ Hard failure criteria:
- Candidate math cleanup must be revalidated with the local MathJax checker before replacing Markdown.
- If no candidate passes validation, keep the original formula and retain the `MATH_RENDER_FAILED` warning.
- Successfully mitigated formulas must remain traceable in metadata/report output; warning reduction must not hide that a formula was changed.
- Sprint 11 uses `MATH_RENDER_REPAIRED` info warnings for applied repair provenance.
- Sprint 11 initial repair rules cover repeated same-direction scripts and truncated array `\end{a}` endings only.
- Project-scoped custom agents live in `.codex/agents/*.toml`.
- Project prompt commands live in `.codex/commands/*.md`.
- Project-specific skills live in `.codex/skills/*/SKILL.md`.
+9 -9
@@ -6,9 +6,9 @@ This file records current progress for agents. Read it before starting work, the
- Project direction is documented in `PRD.md`, `ARCHITECTURE.md`, `AGENTS.md`, and `docs/KNOWLEDGEBASE.md`.
- MinerU 3.1.0 is fixed as the only conversion engine.
- The converter currently includes path planning, project-owned records, metadata, direct local MinerU adapter boundary, Obsidian Markdown normalization, local quality checks, report rendering, conversion orchestration, `pdf2md convert`, `pdf2md recheck`, `pdf2md doctor`, local MathJax render checking, release-gate tests, and opt-in pre-conversion PDF chunking.
- The converter currently includes path planning, project-owned records, metadata, direct local MinerU adapter boundary, Obsidian Markdown normalization, local quality checks, report rendering, conversion orchestration, `pdf2md convert`, `pdf2md recheck`, `pdf2md doctor`, local MathJax render checking, conservative MathJax warning mitigation, release-gate tests, and opt-in pre-conversion PDF chunking.
- `docs/V1IMPLEMENTATIONPLAN.md` defines the v1 implementation sequence.
- `docs/Sprints/` contains completed sprint contracts through Sprint 10.
- `docs/Sprints/` contains completed sprint contracts through Sprint 11.
- `docs/WORKARCHIVE.md` contains completed sprint history, historical verification results, runtime setup notes, and sample conversion evidence.
- `samples/` exists locally as fixture context.
- `outputs/` is ignored and contains local generated conversion outputs.
@@ -48,7 +48,9 @@ This file records current progress for agents. Read it before starting work, the
- Added `recheck_markdown()` and `pdf2md recheck <markdown.md>` to rerun local quality checks for an existing generated Markdown file and rewrite the adjacent metadata JSON and `.report.md` without rerunning MinerU.
- Verified `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`; the command regenerated metadata/report and still reported 2 warnings because the current Markdown still contains the two MathJax-invalid expressions.
- Reconverted `samples/MITC공부.pdf` with `--overwrite` to ignored `outputs/MITC공부/`; report status remains `partial`: 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 2 MathJax render warnings, and 0 missing or invalid asset links.
- Added a `PLAN.md` Sprint 11 proposal for conservative MathJax warning mitigation after validation; no implementation code has been started.
- Sprint 11 implemented conservative MathJax warning mitigation with failed-expression details, `src/pdf2md/math_repair.py`, shared `convert`/`recheck` repair integration, and `MATH_RENDER_REPAIRED` info warnings.
- Verified default fast suite: `uv run pytest` passed 172 tests with 1 skipped.
- Verified requested real sample: `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite` succeeded with 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 0 MathJax render errors, and 2 `MATH_RENDER_REPAIRED` info warnings.
## In Progress
@@ -60,9 +62,7 @@ This file records current progress for agents. Read it before starting work, the
## Next Actions
1. If implementation is requested, write `docs/Sprints/SPRINT11CONTRACT.md` for MathJax warning mitigation before code changes start.
2. Inspect the current MathJax failure messages from `outputs/MITC공부/MITC공부.md` to choose the narrow initial cleanup rule set.
3. Manually fix the two MathJax-invalid expressions in `outputs/MITC공부/MITC공부.md` only if a warning-free local report is desired before Sprint 11 exists, then run `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`.
4. Review generated sample Markdown outputs in Obsidian if visual quality needs manual assessment.
5. Run optional real local chunked conversion on a long sample only if requested.
6. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
1. Review generated sample Markdown outputs in Obsidian if visual quality needs manual assessment.
2. Run additional real local sample validation only if requested, especially for new MathJax failure messages not covered by Sprint 11's narrow repair rules.
3. Run optional real local chunked conversion on a long sample only if requested.
4. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
+181
@@ -0,0 +1,181 @@
# Sprint 11 Contract: MathJax Warning Mitigation
Status: Implemented
Last updated: 2026-05-11
## Objective
Add a conservative local cleanup pass for MathJax-invalid formulas:
1. Run the existing MathJax renderability check on normalized Markdown.
2. Build repair candidates only for expressions that failed MathJax validation.
3. Re-check each candidate with the same local checker.
4. Replace only candidates that pass.
5. Re-run final quality checks before writing Markdown, metadata JSON, and report Markdown.
The feature should reduce `MATH_RENDER_FAILED` warnings without hiding that a formula was changed.
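
A minimal sketch of the five steps above, under stated assumptions: `repair_failed`, `toy_check`, and `toy_candidates` are hypothetical names used only for illustration, not the project's API, and the stand-in checker rejects one literal pattern instead of calling MathJax.

```python
from typing import Callable, Iterable

# Hypothetical stand-in for a local MathJax checker: True means "renders".
Checker = Callable[[str], bool]


def repair_failed(
    expressions: list[str],
    check: Checker,
    candidates_for: Callable[[str], Iterable[str]],
) -> tuple[list[str], list[int]]:
    """Return (expressions after repair, indexes that were repaired)."""
    result = list(expressions)
    repaired: list[int] = []
    for index, body in enumerate(expressions):
        if check(body):
            continue  # passing expressions are never touched
        for candidate in candidates_for(body):
            # Re-check the candidate with the same checker before accepting it.
            if candidate != body and check(candidate):
                result[index] = candidate
                repaired.append(index)
                break
    return result, repaired


def toy_check(body: str) -> bool:
    # Assumption: models only the "double exponent" failure.
    return "^ {i} ^ {t}" not in body


def toy_candidates(body: str) -> list[str]:
    return [body.replace("^ {i} ^ {t}", "^ {i} {} ^ {t}")]


fixed, changed = repair_failed(["a + b", "x ^ {i} ^ {t}"], toy_check, toy_candidates)
print(fixed, changed)
```

The list of repaired indexes is what lets a caller record provenance for each change instead of silently dropping the warning.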
## Current Precondition
- `pdf2md convert` writes normalized Markdown, metadata JSON, and `<stem>.report.md`.
- `pdf2md recheck` can rerun quality checks for an existing generated Markdown file without rerunning MinerU.
- Local MathJax checking is already optional and nonfatal.
- `outputs/MITC공부/MITC공부.md` currently has two MathJax render failures:
- expression 8: `Double exponent: use braces to clarify`
- expression 83: `Unknown environment 'a'`
- `samples/MITC공부.pdf` is the requested real local validation sample.
## Touched Surfaces
Allowed during implementation:
- `src/pdf2md/quality.py`
- `src/pdf2md/math_repair.py`
- `src/pdf2md/conversion.py`
- `src/pdf2md/ir.py`
- `tests/test_quality.py`
- `tests/test_math_repair.py`
- `tests/test_conversion.py`
- `tests/test_cli.py`
- `docs/Sprints/SPRINT11CONTRACT.md`
- `PLAN.md`
- `PROGRESS.md`
Not allowed:
- Remote OCR, remote LLMs, remote render APIs, or external document upload paths.
- Alternate PDF conversion engines.
- Switchable conversion-engine behavior.
- A full LaTeX parser or symbolic math rewrite engine.
- New CLI flags unless a later user request explicitly asks for them.
- Mandatory default tests that require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- Committed files under `samples/`.
- Committed generated conversion outputs under `outputs/`.
## Product Behavior
Repair activation:
- Repair runs automatically when a local math checker is available and at least one math expression fails validation.
- If the checker is unavailable, behavior remains unchanged: conversion/recheck continues with an info-level unavailable-checker warning.
- The same repair path applies to fresh `convert` output and existing Markdown processed through `recheck`.
Initial deterministic repair rules:
- Repeated same-direction script repair:
- Convert consecutive superscripts/subscripts such as `^ {i} ^ {t}` to `^ {i} {} ^ {t}`.
- This resolves MathJax's double superscript/subscript error (`Double exponent: use braces to clarify`) while preserving both script tokens.
- Truncated array environment repair:
- Convert `\end{a}` to `\end{array}` only when the expression contains more `\begin{array}` than `\end{array}` occurrences.
- This targets obvious extraction truncation, not arbitrary environment renaming.
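
The two rules can be illustrated with a short standalone sketch. The regular expression and the count guard follow the ones in this commit's `src/pdf2md/math_repair.py`; the public function names and sample inputs here are illustrative assumptions.

```python
import re

# A script marker, a braced argument, whitespace, then the same marker
# opening another braced group (same shape as the pattern in math_repair.py).
SCRIPT_RE = re.compile(
    r"(?P<script>[\^_])(?P<first_arg>\s*\{[^{}]*\})(?P<space>\s+)(?P=script)(?P<second_arg>\s*\{)"
)


def repair_repeated_scripts(body: str) -> str:
    """Insert an empty group between two same-direction scripts."""
    def replace(match: re.Match[str]) -> str:
        script = match.group("script")
        return (
            f"{script}{match.group('first_arg')}"
            f"{match.group('space')}{{}} {script}{match.group('second_arg')}"
        )

    return SCRIPT_RE.sub(replace, body)


def repair_truncated_array_end(body: str) -> str:
    """Rewrite a truncated array ending only when an array is left unclosed."""
    if r"\end{a}" not in body:
        return body
    if body.count(r"\begin{array}") <= body.count(r"\end{array}"):
        return body
    return body.replace(r"\end{a}", r"\end{array}")


print(repair_repeated_scripts(r"x ^ {i} ^ {t}"))
print(repair_repeated_scripts(r"x ^ {i} _ {t}"))  # mixed directions: unchanged
print(repair_truncated_array_end(r"\begin{array}{c} a \\ b \end{a}"))
print(repair_truncated_array_end(r"\begin{a} x \end{a}"))  # no open array: unchanged
```

The backreference `(?P=script)` is what restricts the first rule to same-direction pairs, so a valid `^ {i} _ {t}` is left alone.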
Provenance:
- Applied repairs produce `MATH_RENDER_REPAIRED` info warnings.
- Successfully repaired expressions must not count as `math_render_error_count`.
- Unrepaired expressions keep the original `MATH_RENDER_FAILED` warning behavior.
- The report remains derived from metadata and local quality checks.
## Architecture Plan
### WP11.1: Failed Math Detail Capture
Actions:
- Add a project-owned result type that can include failed `MathExpression` records and checker messages.
- Preserve the current `check_math_renderability()` return behavior for existing callers.
- Keep expression extraction outside fenced code and inline code.
Expected output:
- Conversion can access failed expression spans without parsing warning message text.
### WP11.2: Repair Module
Actions:
- Add `src/pdf2md/math_repair.py`.
- Define repair result records.
- Generate candidates only for failed expressions.
- Revalidate candidates through the injected checker.
- Apply replacements from right to left so that the recorded spans of earlier expressions stay valid when a later replacement changes the text length.
Expected output:
- Pure string-level repair behavior that is deterministic, local-only, and independently testable.
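
Why right-to-left ordering matters can be shown with a small sketch. `apply_replacements` is a hypothetical helper, not the module's API, and it assumes the spans are non-overlapping `(start, end)` offsets into the original text.

```python
Span = tuple[int, int]


def apply_replacements(text: str, replacements: list[tuple[Span, str]]) -> str:
    """Apply (span, replacement) pairs, rightmost span first."""
    # Editing the rightmost span first never shifts the offsets of spans
    # that start earlier in the text, so no offset bookkeeping is needed.
    ordered = sorted(replacements, key=lambda item: item[0][0], reverse=True)
    for (start, end), new in ordered:
        text = text[:start] + new + text[end:]
    return text


markdown = "a $x$ b $y$ c"
edits = [((2, 5), "$x_1$"), ((8, 11), "$y_{long}$")]
print(apply_replacements(markdown, edits))
```

Applying the same edits left to right would require adjusting every later span by the length difference of each earlier replacement.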
### WP11.3: Conversion And Recheck Integration
Actions:
- Route `convert` normalized Markdown through repair before final metadata/report construction.
- Route `recheck` Markdown through the same repair path before rewriting metadata/report.
- Re-run final quality checks after any repair.
- Preserve asset checking and strict-local behavior unchanged.
Expected output:
- Fresh conversions and rechecks both benefit from MathJax warning mitigation.
### WP11.4: Tests
Default tests:
- Repeated superscripts are repaired only when the original expression failed.
- `\end{a}` repairs to `\end{array}` only when array environments are unbalanced.
- A candidate that still fails is not written back.
- Passing expressions are not changed.
- Conversion writes repaired Markdown only after candidate revalidation.
- Recheck can repair an existing Markdown output and regenerate metadata/report.
- Existing unavailable-checker behavior remains nonfatal.
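
Because the checker is injected, default tests can use a stub in place of Node.js and MathJax. The following is a sketch only: `write_back`, `always_fail`, and `always_pass` are hypothetical helpers, and the real tests in `tests/test_math_repair.py` may be structured differently.

```python
from typing import Callable


def always_fail(body: str) -> bool:
    """Stub checker: every expression is reported as invalid."""
    return False


def always_pass(body: str) -> bool:
    """Stub checker: every expression is reported as valid."""
    return True


def write_back(original: str, candidate: str, checker: Callable[[str], bool]) -> str:
    """Keep the original unless the candidate passes the checker."""
    return candidate if checker(candidate) else original


def test_failing_candidate_is_not_written_back() -> None:
    assert write_back("x ^ {i} ^ {t}", "x ^ {i} {} ^ {t}", always_fail) == "x ^ {i} ^ {t}"


def test_passing_candidate_is_written_back() -> None:
    assert write_back("x ^ {i} ^ {t}", "x ^ {i} {} ^ {t}", always_pass) == "x ^ {i} {} ^ {t}"
```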
Optional local validation:
- Run `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite`.
- Confirm the generated report has `Math render error count: 0` for the requested sample, or record any remaining failures exactly.
## Acceptance Criteria
- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- `pdf2md convert` and `pdf2md recheck` share the same repair behavior.
- Spans that fail MathJax validation are repaired only after candidate revalidation succeeds.
- Successfully repaired formulas remain visible through `MATH_RENDER_REPAIRED` info warnings.
- Existing strict-local and MinerU-only constraints are unchanged.
- `samples/MITC공부.pdf` is validated locally as requested, with generated outputs kept ignored under `outputs/`.
## Hard Failure Criteria
- Repair changes a math span that did not fail initial MathJax validation.
- Repair drops an entire formula or removes meaningful LaTeX tokens solely to silence warnings.
- Repair claims success without re-running the local checker on the candidate.
- `convert` or `recheck` starts requiring MathJax when it was previously optional.
- Default tests require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- `samples/` or generated `outputs/` files are committed.
## Verification Commands
```powershell
uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py
uv run pytest
uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite
git diff --check
git status --short --untracked-files=all
```
## Handoff Requirements
After implementation:
- Update `PROGRESS.md` with files changed, commands run, test outcomes, sample validation outcome, known failures, and next action.
- Keep sample PDFs and generated outputs out of the commit.
- Commit the completed sprint if verification passes.
## Implementation Handoff
- Files changed: `src/pdf2md/quality.py`, `src/pdf2md/math_repair.py`, `src/pdf2md/conversion.py`, `src/pdf2md/ir.py`, tests, `ARCHITECTURE.md`, `docs/V1IMPLEMENTATIONPLAN.md`, `PLAN.md`, and `PROGRESS.md`.
- Default verification: `uv run pytest` passed 172 tests with 1 skipped.
- Targeted verification: `uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py` passed 56 tests.
- Requested sample verification: `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite` succeeded; final report shows `Math render error count: 0` and two `MATH_RENDER_REPAIRED` info warnings.
- Known failures: none.
- Residual risk: repair rules are deliberately narrow; future PDFs may expose MathJax failures that should remain warnings until a deterministic rule is added and tested.
- Next action: optional Obsidian visual review or additional sample validation.
+44 -2
@@ -4,7 +4,7 @@ Last updated: 2026-05-08
This document is the implementation plan for v1. It does not replace `PRD.md` or `ARCHITECTURE.md`; use those files as the source of product requirements and system design. This plan explains the order of work, sprint contracts, verification gates, and agent ownership for implementing the converter.
Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents.
Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents. Sprint 11 implemented conservative MathJax warning mitigation for failed math spans.
## 1. V1 Outcome
@@ -599,6 +599,48 @@ Hard failure criteria:
- Chunk outputs are merged.
- Default tests require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
### Sprint 11: MathJax Warning Mitigation
Active contract:
- `docs/Sprints/SPRINT11CONTRACT.md`
Status:
- Implemented.
Objective:
- Repair narrow MathJax-invalid formula artifacts after initial local validation and before final output writing.
Touched surfaces:
- `quality.py`
- `math_repair.py`
- `conversion.py`
- `ir.py`
- Unit tests for quality details, repair rules, conversion, and recheck behavior
Expected outputs:
- Failed math expression records expose body, display mode, span, and checker message.
- Repair candidates are generated only for failed math spans.
- Repeated same-direction scripts are disambiguated with an empty group.
- Truncated `\end{a}` array endings are repaired when array environments are unbalanced.
- `convert` and `recheck` share the same repair behavior.
- Applied repairs are recorded as `MATH_RENDER_REPAIRED` info warnings and do not count as math render errors.
Verification checks:
- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- `samples/MITC공부.pdf` validates locally with `Math render error count: 0`.
Hard failure criteria:
- Repair changes math spans that did not fail local MathJax validation.
- Repair claims success without candidate revalidation.
- Repair introduces remote services, alternate engines, or mandatory sample-dependent default tests.
## 6. Cross-Cutting Acceptance Criteria
Every implementation sprint must preserve these acceptance criteria:
@@ -645,7 +687,7 @@ Handoff fields:
- MinerU 3.1.0 install and CLI behavior are source-verified, but real local output still needs a later local probe before release.
- GTX 1070 Ti 8GB is visible locally, but it is Pascal compute capability 6.1; `doctor` and setup docs must make CUDA/PyTorch limits clear.
- `uv` is installed per-user at `C:\Users\user\.local\bin`, but a new shell may need PATH refresh before `uv` is visible.
- Formula renderability checks need a local tool choice; the implementation should start with an interface and graceful unavailable-tool warning if needed.
- Formula renderability checks and conservative warning mitigation are implemented, but formula reconstruction remains best effort and should keep warnings/provenance visible.
- Some PDFs will have tables or formulas that cannot be faithfully represented in Markdown; metadata and `.report.md` must surface this instead of hiding it.
- Redistribution license obligations must be reviewed before packaging, redistribution, or bundling model weights.
+52 -6
@@ -25,6 +25,7 @@ from pdf2md.ir import (
)
from pdf2md.markdown import normalize_markdown
from pdf2md.math_render import create_default_math_checker
from pdf2md.math_repair import repair_math_render_failures
from pdf2md.metadata import build_metadata
from pdf2md.mineru_adapter import (
ENGINE_NAME,
@@ -35,7 +36,7 @@ from pdf2md.mineru_adapter import (
)
from pdf2md.paths import DiscoveredPdf, PathLike, PlannedOutput, discover_pdfs, plan_outputs
from pdf2md.pdf_splitter import PdfChunkPlan, plan_pdf_chunks, write_pdf_chunk
from pdf2md.quality import MathChecker, QualityResult, check_asset_links, check_math_renderability, merge_quality_results
from pdf2md.quality import MathChecker, QualityResult, check_asset_links, check_math_renderability_details, merge_quality_results
from pdf2md.report import FinalStatus, determine_final_status, render_report
@@ -101,12 +102,19 @@ class _ConversionTask:
original_source_sha256: str | None = None
@dataclass(frozen=True)
class _PreparedMarkdown:
    markdown: str
    quality: QualityResult
_IMAGE_LINK_RE = re.compile(r"!\[(?P<alt>[^\]\n]*)\]\((?P<target>[^)\n]+)\)")
_DISPLAY_MATH_RE = re.compile(r"(?<!\\)\$\$(?P<body>.*?)(?<!\\)\$\$", re.DOTALL)
_INLINE_MATH_RE = re.compile(r"(?<!\\)\$(?P<body>[^\n$]+?)(?<!\\)\$")
_RECHECKED_WARNING_CODES = frozenset(
{
WarningCode.MATH_RENDER_FAILED,
WarningCode.MATH_RENDER_REPAIRED,
WarningCode.ASSET_LINK_MISSING,
WarningCode.ASSET_LINK_INVALID,
}
@@ -240,12 +248,14 @@ def recheck_markdown(
markdown = markdown_file.read_text(encoding="utf-8")
assets_dir = markdown_file.with_suffix(".assets")
assets = _assets_from_metadata(existing_metadata)
quality = _run_quality_checks(
prepared = _prepare_markdown_for_output(
markdown,
markdown_dir=markdown_file.parent,
asset_root=assets_dir,
math_checker=math_checker,
)
markdown = prepared.markdown
quality = prepared.quality
warnings = _preserved_metadata_warnings(existing_metadata) + quality.warnings
document = _build_document(
source_pdf=Path(_metadata_text(existing_metadata, "source_pdf")),
@@ -276,6 +286,7 @@ def recheck_markdown(
)
final_status = determine_final_status(metadata_data, report_quality)
_write_text(markdown_file, markdown)
_write_text(metadata_path, json.dumps(metadata_data, indent=2, ensure_ascii=False, sort_keys=True) + "\n")
_write_text(report_path, report_text)
@@ -641,16 +652,17 @@ def _convert_in_work_dir(
asset_root=plan.assets_dir,
check_assets=False,
)
quality = _run_quality_checks(
prepared = _prepare_markdown_for_output(
normalized.markdown,
markdown_dir=plan.markdown_path.parent,
asset_root=plan.assets_dir,
math_checker=math_checker,
)
quality = prepared.quality
warnings = adapter_result.warnings + assets.warnings + normalized.warnings + quality.warnings
document = _build_document(
source_pdf=metadata_source,
markdown=normalized.markdown,
markdown=prepared.markdown,
assets=assets.records,
warnings=warnings,
raw_structured=adapter_result.raw_structured,
@@ -679,7 +691,7 @@ def _convert_in_work_dir(
)
final_status = determine_final_status(metadata_data, report_quality)
_write_text(plan.markdown_path, normalized.markdown)
_write_text(plan.markdown_path, prepared.markdown)
if metadata_enabled and plan.metadata_path is not None:
_write_text(plan.metadata_path, json.dumps(metadata_data, indent=2, ensure_ascii=False, sort_keys=True) + "\n")
_write_text(plan.report_path, report_text)
@@ -824,10 +836,44 @@ def _run_quality_checks(
return asset_quality
if math_checker is None:
math_checker = create_default_math_checker()
math_quality = check_math_renderability(markdown, math_checker)
math_quality = check_math_renderability_details(markdown, math_checker).quality
return merge_quality_results(asset_quality, math_quality)
def _prepare_markdown_for_output(
    markdown: str,
    *,
    markdown_dir: Path,
    asset_root: Path,
    math_checker: MathChecker | None,
) -> _PreparedMarkdown:
    asset_quality = check_asset_links(markdown, markdown_dir=markdown_dir, asset_root=asset_root)
    if not _has_math(markdown):
        return _PreparedMarkdown(markdown=markdown, quality=asset_quality)

    checker = math_checker if math_checker is not None else create_default_math_checker()
    math_details = check_math_renderability_details(markdown, checker)
    initial_quality = merge_quality_results(asset_quality, math_details.quality)
    if checker is None or not math_details.failures:
        return _PreparedMarkdown(markdown=markdown, quality=initial_quality)

    repair_result = repair_math_render_failures(markdown, math_details.failures, checker)
    if not repair_result.repairs:
        return _PreparedMarkdown(markdown=markdown, quality=initial_quality)

    repaired_quality = _run_quality_checks(
        repair_result.markdown,
        markdown_dir=markdown_dir,
        asset_root=asset_root,
        math_checker=checker,
    )
    repair_quality = QualityResult(warnings=repair_result.warnings)
    return _PreparedMarkdown(
        markdown=repair_result.markdown,
        quality=merge_quality_results(repaired_quality, repair_quality),
    )


def _has_math(markdown: str) -> bool:
    return _DISPLAY_MATH_RE.search(markdown) is not None or _INLINE_MATH_RE.search(markdown) is not None
+1
@@ -33,6 +33,7 @@ class WarningCode(StrEnum):
GPU_UNAVAILABLE = "GPU_UNAVAILABLE"
LOW_CONFIDENCE_FORMULA = "LOW_CONFIDENCE_FORMULA"
MATH_RENDER_FAILED = "MATH_RENDER_FAILED"
MATH_RENDER_REPAIRED = "MATH_RENDER_REPAIRED"
ASSET_LINK_MISSING = "ASSET_LINK_MISSING"
ASSET_LINK_INVALID = "ASSET_LINK_INVALID"
READING_ORDER_UNCERTAIN = "READING_ORDER_UNCERTAIN"
+165
@@ -0,0 +1,165 @@
"""Conservative repairs for MathJax-invalid Markdown math spans."""

from __future__ import annotations

import re
from dataclasses import dataclass

from pdf2md.ir import WarningCode, WarningRecord, WarningSeverity
from pdf2md.quality import (
    MathChecker,
    MathCheckerUnavailable,
    MathCheckResult,
    MathExpression,
    MathRenderFailure,
)

@dataclass(frozen=True)
class MathRepair:
    expression_index: int
    rule: str
    original_body: str
    repaired_body: str
    markdown_span: tuple[int, int]


@dataclass(frozen=True)
class MathRepairResult:
    markdown: str
    repairs: tuple[MathRepair, ...] = ()
    warnings: tuple[WarningRecord, ...] = ()


@dataclass(frozen=True)
class _Candidate:
    body: str
    rule: str


_SCRIPT_RE = re.compile(
    r"(?P<script>[\^_])(?P<first_arg>\s*\{[^{}]*\})(?P<space>\s+)(?P=script)(?P<second_arg>\s*\{)"
)

def repair_math_render_failures(
    markdown: str,
    failures: tuple[MathRenderFailure, ...],
    checker: MathChecker,
) -> MathRepairResult:
    """Repair failed math spans only when a candidate passes the same checker."""

    if not failures:
        return MathRepairResult(markdown)

    replacements: list[tuple[tuple[int, int], str]] = []
    repairs: list[MathRepair] = []
    warnings: list[WarningRecord] = []
    for failure in sorted(failures, key=lambda item: item.expression.markdown_span[0], reverse=True):
        expression = failure.expression
        candidate = _first_valid_candidate(expression, checker)
        if candidate is None:
            continue
        replacements.append((expression.markdown_span, _format_math_span(candidate.body, expression.display)))
        repair = MathRepair(
            expression_index=expression.index,
            rule=candidate.rule,
            original_body=expression.body,
            repaired_body=candidate.body,
            markdown_span=expression.markdown_span,
        )
        repairs.append(repair)
        warnings.append(
            WarningRecord(
                WarningCode.MATH_RENDER_REPAIRED,
                WarningSeverity.INFO,
                f"Math expression {expression.index} was repaired by {candidate.rule}.",
            )
        )

    repaired = markdown
    for span, replacement in replacements:
        start, end = span
        repaired = repaired[:start] + replacement + repaired[end:]
    return MathRepairResult(markdown=repaired, repairs=tuple(reversed(repairs)), warnings=tuple(reversed(warnings)))

def _first_valid_candidate(expression: MathExpression, checker: MathChecker) -> _Candidate | None:
    for candidate in _repair_candidates(expression.body):
        if candidate.body != expression.body and _candidate_passes(candidate.body, expression.display, checker):
            return candidate
    return None


def _repair_candidates(body: str) -> tuple[_Candidate, ...]:
    candidates: list[_Candidate] = []
    seen: set[str] = {body}

    repeated_script = _repair_repeated_scripts(body)
    _append_candidate(candidates, seen, repeated_script, "repeated_script")

    truncated_array = _repair_truncated_array_end(body)
    _append_candidate(candidates, seen, truncated_array, "truncated_array_end")

    combined = _repair_truncated_array_end(repeated_script)
    _append_candidate(candidates, seen, combined, "combined")
    return tuple(candidates)


def _append_candidate(candidates: list[_Candidate], seen: set[str], body: str, rule: str) -> None:
    if body not in seen:
        candidates.append(_Candidate(body=body, rule=rule))
        seen.add(body)

def _repair_repeated_scripts(body: str) -> str:
    def replace(match: re.Match[str]) -> str:
        script = match.group("script")
        return (
            f"{script}{match.group('first_arg')}"
            f"{match.group('space')}{{}} {script}{match.group('second_arg')}"
        )

    return _SCRIPT_RE.sub(replace, body)


def _repair_truncated_array_end(body: str) -> str:
    if r"\end{a}" not in body:
        return body
    if body.count(r"\begin{array}") <= body.count(r"\end{array}"):
        return body
    return body.replace(r"\end{a}", r"\end{array}")

def _candidate_passes(body: str, display: bool, checker: MathChecker) -> bool:
    expression = MathExpression(index=0, body=body, display=display, markdown_span=(0, 0))
    try:
        batch_checker = getattr(checker, "check_expressions", None)
        if callable(batch_checker):
            raw_results = batch_checker((expression,))
            if not isinstance(raw_results, tuple | list) or len(raw_results) != 1:
                return False
            result = _coerce_result(raw_results[0])
        else:
            result = _coerce_result(checker(body))
    except MathCheckerUnavailable:
        return False
    return result.ok


def _coerce_result(value: bool | MathCheckResult) -> MathCheckResult:
    if isinstance(value, bool):
        return MathCheckResult(ok=value)
    if isinstance(value, MathCheckResult):
        return value
    return MathCheckResult(ok=False)


def _format_math_span(body: str, display: bool) -> str:
    if display:
        return f"$$\n{body.strip()}\n$$"
    return f"${body.strip()}$"
+43 -16
@@ -24,6 +24,12 @@ class MathExpression:
markdown_span: tuple[int, int]
@dataclass(frozen=True)
class MathRenderFailure:
    expression: MathExpression
    message: str = ""
MathChecker = Callable[[str], bool | MathCheckResult]
@@ -39,6 +45,12 @@ class QualityResult:
return self.missing_asset_link_count + self.invalid_asset_link_count + self.math_render_error_count
@dataclass(frozen=True)
class MathRenderabilityResult:
    quality: QualityResult
    failures: tuple[MathRenderFailure, ...] = ()
class MathCheckerUnavailable(RuntimeError):
"""Raised by a local math checker when renderability cannot be checked."""
@@ -95,25 +107,34 @@ def check_asset_links(


def check_math_renderability(markdown: str, checker: MathChecker | None = None) -> QualityResult:
    """Check math renderability through an injected local checker."""
    return check_math_renderability_details(markdown, checker).quality


def check_math_renderability_details(markdown: str, checker: MathChecker | None = None) -> MathRenderabilityResult:
    """Check math renderability and return failed expression records."""
    if not isinstance(markdown, str):
        raise TypeError("markdown must be a string")

    expressions = extract_math_expressions(markdown)
    if not expressions:
        return QualityResult()
        return MathRenderabilityResult(QualityResult())
    if checker is None:
        return QualityResult(
            warnings=(
                WarningRecord(
                    WarningCode.MATH_RENDER_FAILED,
                    WarningSeverity.INFO,
                    "Math render checker is unavailable; renderability was not validated.",
                ),
        return MathRenderabilityResult(
            QualityResult(
                warnings=(
                    WarningRecord(
                        WarningCode.MATH_RENDER_FAILED,
                        WarningSeverity.INFO,
                        "Math render checker is unavailable; renderability was not validated.",
                    ),
                )
            )
        )

    warnings: list[WarningRecord] = []
    failures: list[MathRenderFailure] = []
    failure_count = 0
    try:
        results = _check_expressions(expressions, checker)
@@ -122,6 +143,7 @@ def check_math_renderability(markdown: str, checker: MathChecker | None = None)
            message = result.message
            if not ok:
                failure_count += 1
                failures.append(MathRenderFailure(expression=expression, message=message))
                details = f": {message}" if message else ""
                kind = "display" if expression.display else "inline"
                warnings.append(
@@ -131,17 +153,22 @@ def check_math_renderability(markdown: str, checker: MathChecker | None = None)
                    )
                )
    except MathCheckerUnavailable as error:
        return QualityResult(
            warnings=(
                WarningRecord(
                    WarningCode.MATH_RENDER_FAILED,
                    WarningSeverity.INFO,
                    f"Math render checker is unavailable: {error}",
                ),
        return MathRenderabilityResult(
            QualityResult(
                warnings=(
                    WarningRecord(
                        WarningCode.MATH_RENDER_FAILED,
                        WarningSeverity.INFO,
                        f"Math render checker is unavailable: {error}",
                    ),
                )
            )
        )

    return QualityResult(math_render_error_count=failure_count, warnings=tuple(warnings))
    return MathRenderabilityResult(
        QualityResult(math_render_error_count=failure_count, warnings=tuple(warnings)),
        failures=tuple(failures),
    )


def merge_quality_results(*results: QualityResult) -> QualityResult:
+41
@@ -13,6 +13,7 @@ from pdf2md.conversion import BatchConversionResult, convert_input, convert_pdf,
from pdf2md.ir import WarningCode, WarningRecord, WarningSeverity
from pdf2md.mineru_adapter import MinerUAdapterResult, StrictLocalViolationError
from pdf2md.paths import OutputConflictError
from pdf2md.quality import MathCheckResult


class FakeAdapter:

@@ -230,6 +231,27 @@ def test_convert_pdf_records_math_checker_failures_in_metadata_and_report(tmp_pa
    assert "`MATH_RENDER_FAILED`" in report


def test_convert_pdf_repairs_math_render_failure_before_writing_outputs(tmp_path: Path) -> None:
    class RepairAwareChecker:
        def check_expressions(self, expressions):
            return tuple(MathCheckResult(ok="{} ^ {t}" in expression.body) for expression in expressions)

    pdf = make_pdf(tmp_path)
    adapter = FakeAdapter(raw_markdown="\\[x ^ {i} ^ {t}\\]\n")

    result = convert_pdf(pdf, tmp_path / "out", adapter=adapter, math_checker=RepairAwareChecker(), clock=fixed_clock)

    assert result.final_status == "partial"
    assert result.markdown_path.read_text(encoding="utf-8") == "$$\nx ^ {i} {} ^ {t}\n$$"
    assert [warning.code for warning in result.warnings] == [WarningCode.MATH_RENDER_REPAIRED]
    metadata = json.loads(result.metadata_path.read_text(encoding="utf-8"))
    assert metadata["summary"]["math_render_error_count"] == 0
    assert metadata["warnings"][0]["code"] == "MATH_RENDER_REPAIRED"
    report = result.report_path.read_text(encoding="utf-8")
    assert "- Math render error count: 0" in report
    assert "`MATH_RENDER_REPAIRED`" in report


def test_recheck_markdown_regenerates_metadata_and_report_from_current_markdown(tmp_path: Path) -> None:
    pdf = make_pdf(tmp_path)
    adapter = FakeAdapter(raw_markdown="Inline \\(bad_math\\)\n")

@@ -257,6 +279,25 @@ def test_recheck_markdown_regenerates_metadata_and_report_from_current_markdown(
    assert "- None" in report


def test_recheck_markdown_repairs_math_render_failure(tmp_path: Path) -> None:
    class RepairAwareChecker:
        def check_expressions(self, expressions):
            return tuple(MathCheckResult(ok="{} ^ {t}" in expression.body) for expression in expressions)

    pdf = make_pdf(tmp_path)
    adapter = FakeAdapter(raw_markdown="No formulas.\n")
    result = convert_pdf(pdf, tmp_path / "out", adapter=adapter, math_checker=lambda _: True, clock=fixed_clock)
    result.markdown_path.write_text("$$\nx ^ {i} ^ {t}\n$$\n", encoding="utf-8")

    rechecked = recheck_markdown(result.markdown_path, math_checker=RepairAwareChecker(), clock=fixed_clock)

    assert rechecked.markdown_path.read_text(encoding="utf-8") == "$$\nx ^ {i} {} ^ {t}\n$$\n"
    assert [warning.code for warning in rechecked.warnings] == [WarningCode.MATH_RENDER_REPAIRED]
    metadata = json.loads(result.metadata_path.read_text(encoding="utf-8"))
    assert metadata["summary"]["math_render_error_count"] == 0
    assert metadata["warnings"][0]["code"] == "MATH_RENDER_REPAIRED"


def test_convert_pdf_records_unavailable_math_checker_for_math_output(tmp_path: Path, monkeypatch) -> None:
    pdf = make_pdf(tmp_path)
    adapter = FakeAdapter(raw_markdown="Inline \\(x\\)\n")
+65
@@ -0,0 +1,65 @@
from __future__ import annotations

from pdf2md.ir import WarningCode, WarningSeverity
from pdf2md.math_repair import repair_math_render_failures
from pdf2md.quality import MathCheckResult, MathRenderFailure, extract_math_expressions


class BodyChecker:
    def __init__(self, passing_fragment: str) -> None:
        self.passing_fragment = passing_fragment
        self.checked_bodies: list[str] = []

    def check_expressions(self, expressions):
        self.checked_bodies.extend(expression.body for expression in expressions)
        return tuple(MathCheckResult(ok=self.passing_fragment in expression.body) for expression in expressions)


def test_repair_math_render_failures_disambiguates_repeated_superscripts() -> None:
    markdown = "$$\nx ^ {i} ^ {t}\n$$\n"
    expression = extract_math_expressions(markdown)[0]
    failure = MathRenderFailure(expression=expression, message="Double exponent: use braces to clarify")
    checker = BodyChecker("{} ^ {t}")

    result = repair_math_render_failures(markdown, (failure,), checker)

    assert result.markdown == "$$\nx ^ {i} {} ^ {t}\n$$\n"
    assert result.repairs[0].rule == "repeated_script"
    assert result.warnings[0].code == WarningCode.MATH_RENDER_REPAIRED
    assert result.warnings[0].severity == WarningSeverity.INFO


def test_repair_math_render_failures_repairs_truncated_array_environment() -> None:
    markdown = "$$\n\\begin{array}{c} x \\end{a}\n$$\n"
    expression = extract_math_expressions(markdown)[0]
    failure = MathRenderFailure(expression=expression, message="Unknown environment 'a'")
    checker = BodyChecker("\\end{array}")

    result = repair_math_render_failures(markdown, (failure,), checker)

    assert result.markdown == "$$\n\\begin{array}{c} x \\end{array}\n$$\n"
    assert result.repairs[0].rule == "truncated_array_end"


def test_repair_math_render_failures_leaves_markdown_unchanged_when_candidate_fails() -> None:
    markdown = "$$\nx ^ {i} ^ {t}\n$$\n"
    expression = extract_math_expressions(markdown)[0]
    failure = MathRenderFailure(expression=expression, message="Double exponent: use braces to clarify")
    checker = BodyChecker("never-passes")

    result = repair_math_render_failures(markdown, (failure,), checker)

    assert result.markdown == markdown
    assert result.repairs == ()
    assert result.warnings == ()


def test_repair_math_render_failures_only_changes_failed_spans() -> None:
    markdown = "$a ^ {b} ^ {c}$ and $unchanged ^ {ok}$\n"
    expressions = extract_math_expressions(markdown)
    failure = MathRenderFailure(expression=expressions[0], message="Double exponent: use braces to clarify")
    checker = BodyChecker("{} ^ {c}")

    result = repair_math_render_failures(markdown, (failure,), checker)

    assert result.markdown == "$a ^ {b} {} ^ {c}$ and $unchanged ^ {ok}$\n"
+15
@@ -6,6 +6,7 @@ from pdf2md.ir import WarningCode, WarningSeverity
from pdf2md.quality import (
    MathCheckerUnavailable,
    MathCheckResult,
    check_math_renderability_details,
    check_asset_links,
    check_math_renderability,
    extract_math_expressions,

@@ -71,6 +72,20 @@ def test_math_render_failures_are_aggregated_with_fake_checker() -> None:
    assert "bad_math failed" in result.warnings[0].message


def test_math_renderability_details_include_failed_expression_records() -> None:
    def checker(body: str) -> MathCheckResult:
        return MathCheckResult(ok="bad" not in body, message=f"{body} failed")

    result = check_math_renderability_details("$x_i$\n\n$$\nbad_math\n$$", checker)

    assert result.quality.math_render_error_count == 1
    assert len(result.failures) == 1
    assert result.failures[0].expression.index == 1
    assert result.failures[0].expression.body == "bad_math"
    assert result.failures[0].expression.display is True
    assert result.failures[0].message == "bad_math failed"


def test_math_extraction_records_display_mode_and_markdown_spans() -> None:
    markdown = "Inline $x_i^2$ before\n\n$$\n\\frac{1}{2}\n$$\n"