feat: mitigate MathJax formula warnings
@@ -200,6 +200,7 @@ Stable warning code examples:
|
|||||||
- `GPU_UNAVAILABLE`
|
- `GPU_UNAVAILABLE`
|
||||||
- `LOW_CONFIDENCE_FORMULA`
|
- `LOW_CONFIDENCE_FORMULA`
|
||||||
- `MATH_RENDER_FAILED`
|
- `MATH_RENDER_FAILED`
|
||||||
|
- `MATH_RENDER_REPAIRED`
|
||||||
- `ASSET_LINK_MISSING`
|
- `ASSET_LINK_MISSING`
|
||||||
- `READING_ORDER_UNCERTAIN`
|
- `READING_ORDER_UNCERTAIN`
|
||||||
- `STRICT_LOCAL_VIOLATION`
|
- `STRICT_LOCAL_VIOLATION`
|
||||||
|
|||||||
@@ -4,7 +4,7 @@ This file is the shared work plan for agents. Read it before starting work, then
|
|||||||
|
|
||||||
## Current Goal
|
## Current Goal
|
||||||
|
|
||||||
Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conversion PDF chunking is implemented. On this PC, full local runtime setup is complete in `.venv`; Markdown quality recheck for existing outputs is implemented. Next planned work is MathJax warning mitigation: after local MathJax validation, conservatively clean only warning-causing math spans, rerun validation, and preserve provenance for changed or still-failing formulas. Manual Obsidian quality review and sample validation remain optional fallback tasks.
|
Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 11 MathJax warning mitigation is implemented. On this PC, full local runtime setup is complete in `.venv`; Markdown quality recheck for existing outputs is implemented and now shares the same conservative MathJax repair path as fresh conversion. Next work is optional manual Obsidian quality review, additional sample validation, or broader repair rules if future samples expose new deterministic MathJax failure patterns.
|
||||||
|
|
||||||
## Active Constraints
|
## Active Constraints
|
||||||
|
|
||||||
@@ -35,15 +35,14 @@ Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 10 pre-conve
|
|||||||
12. Follow `docs/V1IMPLEMENTATIONPLAN.md` for the v1 implementation sprint sequence.
|
12. Follow `docs/V1IMPLEMENTATIONPLAN.md` for the v1 implementation sprint sequence.
|
||||||
13. Use `docs/Sprints/SPRINT10CONTRACT.md` for the implemented long-PDF pre-conversion chunking sprint.
|
13. Use `docs/Sprints/SPRINT10CONTRACT.md` for the implemented long-PDF pre-conversion chunking sprint.
|
||||||
14. Use `docs/WORKARCHIVE.md` for completed sprint history, prior verification, runtime setup evidence, and sample conversion evidence.
|
14. Use `docs/WORKARCHIVE.md` for completed sprint history, prior verification, runtime setup evidence, and sample conversion evidence.
|
||||||
15. Plan Sprint 11 for MathJax warning mitigation before code changes start.
|
15. Use `docs/Sprints/SPRINT11CONTRACT.md` for the implemented MathJax warning mitigation sprint.
|
||||||
16. Create `docs/Sprints/SPRINT11CONTRACT.md` for the mitigation sprint if implementation is requested.
|
16. Keep the mitigation path shared by `pdf2md convert` and `pdf2md recheck` so existing Markdown outputs can be cleaned without rerunning MinerU.
|
||||||
17. Keep the mitigation path shared by `pdf2md convert` and `pdf2md recheck` so existing Markdown outputs can be cleaned without rerunning MinerU.
|
|
||||||
|
|
||||||
## Proposed Sprint 11: MathJax Warning Mitigation
|
## Sprint 11: MathJax Warning Mitigation
|
||||||
|
|
||||||
Objective:
|
Objective:
|
||||||
|
|
||||||
- Add a conservative local post-validation cleanup pass that attempts to remove only the specific math-span artifacts responsible for MathJax warnings, then reruns MathJax validation before writing final Markdown, metadata JSON, and report Markdown.
|
- Implemented a conservative local post-validation cleanup pass that attempts to remove only the specific math-span artifacts responsible for MathJax warnings, then reruns MathJax validation before writing final Markdown, metadata JSON, and report Markdown.
|
||||||
|
|
||||||
Assumptions:
|
Assumptions:
|
||||||
|
|
||||||
@@ -97,8 +96,7 @@ Hard failure criteria:
|
|||||||
|
|
||||||
## Open Questions
|
## Open Questions
|
||||||
|
|
||||||
- Which exact cleanup rules should Sprint 11 allow after inspecting current MathJax failure messages? Recommendation: start with deterministic non-semantic artifacts only.
|
- None.
|
||||||
- Should applied mitigations use a new stable warning/info code or be represented through existing metadata/report fields? Recommendation: make repair provenance visible without counting a successfully repaired expression as a render failure.
|
|
||||||
|
|
||||||
## Decisions
|
## Decisions
|
||||||
|
|
||||||
@@ -116,6 +114,8 @@ Hard failure criteria:
|
|||||||
- Candidate math cleanup must be revalidated with the local MathJax checker before replacing Markdown.
|
- Candidate math cleanup must be revalidated with the local MathJax checker before replacing Markdown.
|
||||||
- If no candidate passes validation, keep the original formula and retain the `MATH_RENDER_FAILED` warning.
|
- If no candidate passes validation, keep the original formula and retain the `MATH_RENDER_FAILED` warning.
|
||||||
- Successfully mitigated formulas must remain traceable in metadata/report output; warning reduction must not hide that a formula was changed.
|
- Successfully mitigated formulas must remain traceable in metadata/report output; warning reduction must not hide that a formula was changed.
|
||||||
|
- Sprint 11 uses `MATH_RENDER_REPAIRED` info warnings for applied repair provenance.
|
||||||
|
- Sprint 11 initial repair rules cover repeated same-direction scripts and truncated array `\end{a}` endings only.
|
||||||
- Project-scoped custom agents live in `.codex/agents/*.toml`.
|
- Project-scoped custom agents live in `.codex/agents/*.toml`.
|
||||||
- Project prompt commands live in `.codex/commands/*.md`.
|
- Project prompt commands live in `.codex/commands/*.md`.
|
||||||
- Project-specific skills live in `.codex/skills/*/SKILL.md`.
|
- Project-specific skills live in `.codex/skills/*/SKILL.md`.
|
||||||
|
|||||||
@@ -6,9 +6,9 @@ This file records current progress for agents. Read it before starting work, the
|
|||||||
|
|
||||||
- Project direction is documented in `PRD.md`, `ARCHITECTURE.md`, `AGENTS.md`, and `docs/KNOWLEDGEBASE.md`.
|
- Project direction is documented in `PRD.md`, `ARCHITECTURE.md`, `AGENTS.md`, and `docs/KNOWLEDGEBASE.md`.
|
||||||
- MinerU 3.1.0 is fixed as the only conversion engine.
|
- MinerU 3.1.0 is fixed as the only conversion engine.
|
||||||
- The converter currently includes path planning, project-owned records, metadata, direct local MinerU adapter boundary, Obsidian Markdown normalization, local quality checks, report rendering, conversion orchestration, `pdf2md convert`, `pdf2md recheck`, `pdf2md doctor`, local MathJax render checking, release-gate tests, and opt-in pre-conversion PDF chunking.
|
- The converter currently includes path planning, project-owned records, metadata, direct local MinerU adapter boundary, Obsidian Markdown normalization, local quality checks, report rendering, conversion orchestration, `pdf2md convert`, `pdf2md recheck`, `pdf2md doctor`, local MathJax render checking, conservative MathJax warning mitigation, release-gate tests, and opt-in pre-conversion PDF chunking.
|
||||||
- `docs/V1IMPLEMENTATIONPLAN.md` defines the v1 implementation sequence.
|
- `docs/V1IMPLEMENTATIONPLAN.md` defines the v1 implementation sequence.
|
||||||
- `docs/Sprints/` contains completed sprint contracts through Sprint 10.
|
- `docs/Sprints/` contains completed sprint contracts through Sprint 11.
|
||||||
- `docs/WORKARCHIVE.md` contains completed sprint history, historical verification results, runtime setup notes, and sample conversion evidence.
|
- `docs/WORKARCHIVE.md` contains completed sprint history, historical verification results, runtime setup notes, and sample conversion evidence.
|
||||||
- `samples/` exists locally as fixture context.
|
- `samples/` exists locally as fixture context.
|
||||||
- `outputs/` is ignored and contains local generated conversion outputs.
|
- `outputs/` is ignored and contains local generated conversion outputs.
|
||||||
@@ -48,7 +48,9 @@ This file records current progress for agents. Read it before starting work, the
|
|||||||
- Added `recheck_markdown()` and `pdf2md recheck <markdown.md>` to rerun local quality checks for an existing generated Markdown file and rewrite the adjacent metadata JSON and `.report.md` without rerunning MinerU.
|
- Added `recheck_markdown()` and `pdf2md recheck <markdown.md>` to rerun local quality checks for an existing generated Markdown file and rewrite the adjacent metadata JSON and `.report.md` without rerunning MinerU.
|
||||||
- Verified `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`; the command regenerated metadata/report and still reported 2 warnings because the current Markdown still contains the two MathJax-invalid expressions.
|
- Verified `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`; the command regenerated metadata/report and still reported 2 warnings because the current Markdown still contains the two MathJax-invalid expressions.
|
||||||
- Reconverted `samples/MITC공부.pdf` with `--overwrite` to ignored `outputs/MITC공부/`; report status remains `partial`: 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 2 MathJax render warnings, and 0 missing or invalid asset links.
|
- Reconverted `samples/MITC공부.pdf` with `--overwrite` to ignored `outputs/MITC공부/`; report status remains `partial`: 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 2 MathJax render warnings, and 0 missing or invalid asset links.
|
||||||
- Added a `PLAN.md` Sprint 11 proposal for conservative MathJax warning mitigation after validation; no implementation code has been started.
|
- Sprint 11 implemented conservative MathJax warning mitigation with failed-expression details, `src/pdf2md/math_repair.py`, shared `convert`/`recheck` repair integration, and `MATH_RENDER_REPAIRED` info warnings.
|
||||||
|
- Verified default fast suite: `uv run pytest` passed 172 tests with 1 skipped.
|
||||||
|
- Verified requested real sample: `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite` succeeded with 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 0 MathJax render errors, and 2 `MATH_RENDER_REPAIRED` info warnings.
|
||||||
|
|
||||||
## In Progress
|
## In Progress
|
||||||
|
|
||||||
@@ -60,9 +62,7 @@ This file records current progress for agents. Read it before starting work, the
|
|||||||
|
|
||||||
## Next Actions
|
## Next Actions
|
||||||
|
|
||||||
1. If implementation is requested, write `docs/Sprints/SPRINT11CONTRACT.md` for MathJax warning mitigation before code changes start.
|
1. Review generated sample Markdown outputs in Obsidian if visual quality needs manual assessment.
|
||||||
2. Inspect the current MathJax failure messages from `outputs/MITC공부/MITC공부.md` to choose the narrow initial cleanup rule set.
|
2. Run additional real local sample validation only if requested, especially for new MathJax failure messages not covered by Sprint 11's narrow repair rules.
|
||||||
3. Manually fix the two MathJax-invalid expressions in `outputs/MITC공부/MITC공부.md` only if a warning-free local report is desired before Sprint 11 exists, then run `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`.
|
3. Run optional real local chunked conversion on a long sample only if requested.
|
||||||
4. Review generated sample Markdown outputs in Obsidian if visual quality needs manual assessment.
|
4. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
|
||||||
5. Run optional real local chunked conversion on a long sample only if requested.
|
|
||||||
6. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
|
|
||||||
|
|||||||
@@ -0,0 +1,181 @@
|
|||||||
|
# Sprint 11 Contract: MathJax Warning Mitigation
|
||||||
|
|
||||||
|
Status: Implemented
|
||||||
|
Last updated: 2026-05-11
|
||||||
|
|
||||||
|
## Objective
|
||||||
|
|
||||||
|
Add a conservative local cleanup pass for MathJax-invalid formulas:
|
||||||
|
|
||||||
|
1. Run the existing MathJax renderability check on normalized Markdown.
|
||||||
|
2. Build repair candidates only for expressions that failed MathJax validation.
|
||||||
|
3. Re-check each candidate with the same local checker.
|
||||||
|
4. Replace only candidates that pass.
|
||||||
|
5. Re-run final quality checks before writing Markdown, metadata JSON, and report Markdown.
|
||||||
|
|
||||||
|
The feature should reduce `MATH_RENDER_FAILED` warnings without hiding that a formula was changed.
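The five steps above can be sketched as a per-expression loop. This is a minimal illustration, not the code in `src/pdf2md/math_repair.py`: the `Checker` and `CandidateRule` signatures, `repair_expression`, `fake_check`, and `fix_truncated_end` are all hypothetical names chosen for the example, and the checker is assumed to return an error message on failure and `None` on success.

```python
from typing import Callable, Iterable, Optional, Tuple

# Assumed checker shape: error message on failure, None when the body renders.
Checker = Callable[[str], Optional[str]]
# Assumed rule shape: a candidate body, or None when the rule does not apply.
CandidateRule = Callable[[str], Optional[str]]


def repair_expression(
    body: str, check: Checker, rules: Iterable[CandidateRule]
) -> Tuple[str, bool]:
    """Return (body_to_write, was_repaired) for one math expression body."""
    if check(body) is None:
        return body, False  # passing expressions are never touched
    for rule in rules:
        candidate = rule(body)
        if candidate is None or candidate == body:
            continue
        if check(candidate) is None:  # re-check with the same checker
            return candidate, True
    return body, False  # no candidate passed: keep the original formula


def fake_check(body: str) -> Optional[str]:
    # Stand-in for the local MathJax checker, for illustration only.
    return "Unknown environment 'a'" if "\\end{a}" in body else None


def fix_truncated_end(body: str) -> Optional[str]:
    return body.replace("\\end{a}", "\\end{array}")


broken = "\\begin{array}{c} x \\end{a}"
result = repair_expression(broken, fake_check, [fix_truncated_end])
```

Injecting the checker keeps the loop testable without Node.js or MathJax, which matches the default-test constraint in this contract.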
|
||||||
|
|
||||||
|
## Current Precondition
|
||||||
|
|
||||||
|
- `pdf2md convert` writes normalized Markdown, metadata JSON, and `<stem>.report.md`.
|
||||||
|
- `pdf2md recheck` can rerun quality checks for an existing generated Markdown file without rerunning MinerU.
|
||||||
|
- Local MathJax checking is already optional and nonfatal.
|
||||||
|
- `outputs/MITC공부/MITC공부.md` currently has two MathJax render failures:
|
||||||
|
- expression 8: `Double exponent: use braces to clarify`
|
||||||
|
- expression 83: `Unknown environment 'a'`
|
||||||
|
- `samples/MITC공부.pdf` is the requested real local validation sample.
|
||||||
|
|
||||||
|
## Touched Surfaces
|
||||||
|
|
||||||
|
Allowed during implementation:
|
||||||
|
|
||||||
|
- `src/pdf2md/quality.py`
|
||||||
|
- `src/pdf2md/math_repair.py`
|
||||||
|
- `src/pdf2md/conversion.py`
|
||||||
|
- `src/pdf2md/ir.py`
|
||||||
|
- `tests/test_quality.py`
|
||||||
|
- `tests/test_math_repair.py`
|
||||||
|
- `tests/test_conversion.py`
|
||||||
|
- `tests/test_cli.py`
|
||||||
|
- `docs/Sprints/SPRINT11CONTRACT.md`
|
||||||
|
- `PLAN.md`
|
||||||
|
- `PROGRESS.md`
|
||||||
|
|
||||||
|
Not allowed:
|
||||||
|
|
||||||
|
- Remote OCR, remote LLMs, remote render APIs, or external document upload paths.
|
||||||
|
- Alternate PDF conversion engines.
|
||||||
|
- Switchable conversion-engine behavior.
|
||||||
|
- A full LaTeX parser or symbolic math rewrite engine.
|
||||||
|
- New CLI flags unless a later user request explicitly asks for them.
|
||||||
|
- Mandatory default tests that require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
|
||||||
|
- Committed files under `samples/`.
|
||||||
|
- Committed generated conversion outputs under `outputs/`.
|
||||||
|
|
||||||
|
## Product Behavior
|
||||||
|
|
||||||
|
Repair activation:
|
||||||
|
|
||||||
|
- Repair runs automatically when a local math checker is available and at least one math expression fails validation.
|
||||||
|
- If the checker is unavailable, behavior remains unchanged: conversion/recheck continues with an info-level unavailable-checker warning.
|
||||||
|
- The same repair path applies to fresh `convert` output and existing Markdown processed through `recheck`.
|
||||||
|
|
||||||
|
Initial deterministic repair rules:
|
||||||
|
|
||||||
|
- Repeated same-direction script repair:
|
||||||
|
- Convert consecutive superscripts/subscripts such as `^ {i} ^ {t}` to `^ {i} {} ^ {t}`.
|
||||||
|
- This resolves MathJax's double superscript/subscript error while preserving both script tokens.
|
||||||
|
- Truncated array environment repair:
|
||||||
|
- Convert `\end{a}` to `\end{array}` only when the expression's `\begin{array}` and `\end{array}` counts do not match.
|
||||||
|
- This targets obvious extraction truncation, not arbitrary environment renaming.
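As a rough sketch of the two rules above, the following string-level functions show one way the candidates could be produced. The function names and the regular expression are assumptions made for illustration, not the shipped implementation. The script rule only handles braced, non-nested script groups, and limiting the array rule to the number of missing `\end{array}` closers is a design choice of this sketch.

```python
import re

# A braced script group followed (after optional whitespace) by the same operator.
_REPEATED_SCRIPT_RE = re.compile(r"([\^_])(\s*\{[^{}]*\})\s*(?=\1)")


def disambiguate_repeated_scripts(body: str) -> str:
    """Insert an empty group between consecutive same-direction scripts."""
    return _REPEATED_SCRIPT_RE.sub(r"\1\2 {} ", body)


def repair_truncated_array_end(body: str) -> str:
    """Rewrite truncated array closers only when array environments are unbalanced."""
    begins = body.count(r"\begin{array}")
    ends = body.count(r"\end{array}")
    if begins <= ends or r"\end{a}" not in body:
        return body  # balanced, or nothing that looks truncated
    # Assumption: replace at most as many closers as are missing.
    return body.replace(r"\end{a}", r"\end{array}", begins - ends)


scripts_fixed = disambiguate_repeated_scripts("x ^ {i} ^ {t}")
array_fixed = repair_truncated_array_end(r"\begin{array}{c} a \end{a}")
```

Both functions only produce candidates; under this contract a candidate is written back only after the local checker accepts it.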
|
||||||
|
|
||||||
|
Provenance:
|
||||||
|
|
||||||
|
- Applied repairs produce `MATH_RENDER_REPAIRED` info warnings.
|
||||||
|
- Successfully repaired expressions must not count toward `math_render_error_count`.
|
||||||
|
- Unrepaired expressions keep the original `MATH_RENDER_FAILED` warning behavior.
|
||||||
|
- The report remains derived from metadata and local quality checks.
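The provenance rules can be illustrated with a small counting sketch. The `WarningCode` enum and `QualityWarning` record below are simplified stand-ins, not the project's real types in `ir.py`, and the `"warning"` severity for failed expressions is an assumption. The point is only that repaired expressions stay visible as info entries while the error count reflects unrepaired failures.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class WarningCode(str, Enum):
    # Stand-in for the project's enum; only the two codes used here.
    MATH_RENDER_FAILED = "MATH_RENDER_FAILED"
    MATH_RENDER_REPAIRED = "MATH_RENDER_REPAIRED"


@dataclass(frozen=True)
class QualityWarning:  # hypothetical record shape
    code: WarningCode
    severity: str
    message: str


def math_render_error_count(warnings: Iterable[QualityWarning]) -> int:
    """Count only expressions that still fail after repair."""
    return sum(1 for w in warnings if w.code is WarningCode.MATH_RENDER_FAILED)


example_warnings = [
    QualityWarning(WarningCode.MATH_RENDER_REPAIRED, "info", "repaired one expression"),
    QualityWarning(WarningCode.MATH_RENDER_FAILED, "warning", "one expression still fails"),
]
error_count = math_render_error_count(example_warnings)
repaired_count = sum(
    1 for w in example_warnings if w.code is WarningCode.MATH_RENDER_REPAIRED
)
```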
|
||||||
|
|
||||||
|
## Architecture Plan
|
||||||
|
|
||||||
|
### WP11.1: Failed Math Detail Capture
|
||||||
|
|
||||||
|
Actions:
|
||||||
|
|
||||||
|
- Add a project-owned result type that can include failed `MathExpression` records and checker messages.
|
||||||
|
- Preserve the current `check_math_renderability()` return behavior for existing callers.
|
||||||
|
- Keep expression extraction outside fenced code and inline code.
|
||||||
|
|
||||||
|
Expected output:
|
||||||
|
|
||||||
|
- Conversion can access failed expression spans without parsing warning message text.
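A sketch of what such a detail record and extraction pass might look like follows. The two math patterns are copied from `conversion.py` in this diff. `FailedMath`, `find_failed_math`, `_CODE_RE`, and the masking approach are assumptions for illustration, and the checker is again assumed to return a message on failure and `None` on success.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

_DISPLAY_MATH_RE = re.compile(r"(?<!\\)\$\$(?P<body>.*?)(?<!\\)\$\$", re.DOTALL)
_INLINE_MATH_RE = re.compile(r"(?<!\\)\$(?P<body>[^\n$]+?)(?<!\\)\$")
# Assumed simplification: three-backtick fences and single-backtick inline code.
_CODE_RE = re.compile(r"`{3}.*?`{3}|`[^`\n]*`", re.DOTALL)


@dataclass(frozen=True)
class FailedMath:  # hypothetical record name
    body: str
    display: bool
    start: int  # span of the full expression, delimiters included
    end: int
    message: str


def _mask(text: str, pattern: "re.Pattern[str]") -> str:
    # Same-length blanks keep every offset valid in the original text.
    return pattern.sub(lambda m: " " * len(m.group(0)), text)


def find_failed_math(
    markdown: str, check: Callable[[str], Optional[str]]
) -> List[FailedMath]:
    masked = _mask(markdown, _CODE_RE)  # never look inside code
    failures: List[FailedMath] = []
    for display, pattern in ((True, _DISPLAY_MATH_RE), (False, _INLINE_MATH_RE)):
        for m in pattern.finditer(masked):
            body = markdown[m.start("body") : m.end("body")]
            message = check(body)
            if message is not None:
                failures.append(FailedMath(body, display, m.start(), m.end(), message))
        masked = _mask(masked, pattern)  # hide display spans before the inline pass
    return sorted(failures, key=lambda f: f.start)
```

Because each record carries its own span and message, a repair step can work from these records directly instead of parsing warning text.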
|
||||||
|
|
||||||
|
### WP11.2: Repair Module
|
||||||
|
|
||||||
|
Actions:
|
||||||
|
|
||||||
|
- Add `src/pdf2md/math_repair.py`.
|
||||||
|
- Define repair result records.
|
||||||
|
- Generate candidates only for failed expressions.
|
||||||
|
- Revalidate candidates through the injected checker.
|
||||||
|
- Apply replacements from right to left so Markdown spans remain stable.
|
||||||
|
|
||||||
|
Expected output:
|
||||||
|
|
||||||
|
- Pure string-level repair behavior that is deterministic, local-only, and independently testable.
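The right-to-left rule is easiest to see in a small sketch. The `Replacement` record and `apply_replacements` function are hypothetical names, and the sketch assumes the spans do not overlap. Applying the rightmost span first means an edit never shifts the offsets of spans that still have to be applied to its left.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Replacement:  # hypothetical record name
    start: int
    end: int
    text: str


def apply_replacements(markdown: str, replacements: List[Replacement]) -> str:
    """Apply non-overlapping span replacements, rightmost span first."""
    for item in sorted(replacements, key=lambda r: r.start, reverse=True):
        markdown = markdown[: item.start] + item.text + markdown[item.end :]
    return markdown


source = "$a^{i}^{t}$ and $b^{j}^{k}$"
first = source.index("^{i}^{t}")
second = source.index("^{j}^{k}")
repaired = apply_replacements(
    source,
    [
        Replacement(first, first + len("^{i}^{t}"), "^{i}{}^{t}"),
        Replacement(second, second + len("^{j}^{k}"), "^{j}{}^{k}"),
    ],
)
```

Applying the same replacements left to right with the original offsets would corrupt the second span, because the first edit lengthens the text by two characters.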
|
||||||
|
|
||||||
|
### WP11.3: Conversion And Recheck Integration
|
||||||
|
|
||||||
|
Actions:
|
||||||
|
|
||||||
|
- Route `convert` normalized Markdown through repair before final metadata/report construction.
|
||||||
|
- Route `recheck` Markdown through the same repair path before rewriting metadata/report.
|
||||||
|
- Re-run final quality checks after any repair.
|
||||||
|
- Preserve asset checking and strict-local behavior unchanged.
|
||||||
|
|
||||||
|
Expected output:
|
||||||
|
|
||||||
|
- Fresh conversions and rechecks both benefit from MathJax warning mitigation.
|
||||||
|
|
||||||
|
### WP11.4: Tests
|
||||||
|
|
||||||
|
Default tests:
|
||||||
|
|
||||||
|
- Repeated superscripts are repaired only when the original expression failed.
|
||||||
|
- `\end{a}` repairs to `\end{array}` only when array environments are unbalanced.
|
||||||
|
- A candidate that still fails is not written back.
|
||||||
|
- Passing expressions are not changed.
|
||||||
|
- Conversion writes repaired Markdown only after candidate revalidation.
|
||||||
|
- Recheck can repair an existing Markdown output and regenerate metadata/report.
|
||||||
|
- Existing unavailable-checker behavior remains nonfatal.
|
||||||
|
|
||||||
|
Optional local validation:
|
||||||
|
|
||||||
|
- Run `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite`.
|
||||||
|
- Confirm the generated report has `Math render error count: 0` for the requested sample, or record any remaining failures exactly.
|
||||||
|
|
||||||
|
## Acceptance Criteria
|
||||||
|
|
||||||
|
- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
|
||||||
|
- `pdf2md convert` and `pdf2md recheck` share the same repair behavior.
|
||||||
|
- MathJax failed spans are repaired only after candidate revalidation succeeds.
|
||||||
|
- Successfully repaired formulas remain visible through `MATH_RENDER_REPAIRED` info warnings.
|
||||||
|
- Existing strict-local and MinerU-only constraints are unchanged.
|
||||||
|
- `samples/MITC공부.pdf` is validated locally as requested, with generated outputs kept ignored under `outputs/`.
|
||||||
|
|
||||||
|
## Hard Failure Criteria
|
||||||
|
|
||||||
|
- Repair changes a math span that did not fail initial MathJax validation.
|
||||||
|
- Repair drops an entire formula or removes meaningful LaTeX tokens solely to silence warnings.
|
||||||
|
- Repair claims success without re-running the local checker on the candidate.
|
||||||
|
- `convert` or `recheck` starts requiring MathJax when it was previously optional.
|
||||||
|
- Default tests require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
|
||||||
|
- `samples/` or generated `outputs/` files are committed.
|
||||||
|
|
||||||
|
## Verification Commands
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py
|
||||||
|
uv run pytest
|
||||||
|
uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite
|
||||||
|
git diff --check
|
||||||
|
git status --short --untracked-files=all
|
||||||
|
```
|
||||||
|
|
||||||
|
## Handoff Requirements
|
||||||
|
|
||||||
|
After implementation:
|
||||||
|
|
||||||
|
- Update `PROGRESS.md` with files changed, commands run, test outcomes, sample validation outcome, known failures, and next action.
|
||||||
|
- Keep sample PDFs and generated outputs out of the commit.
|
||||||
|
- Commit the completed sprint if verification passes.
|
||||||
|
|
||||||
|
## Implementation Handoff
|
||||||
|
|
||||||
|
- Files changed: `src/pdf2md/quality.py`, `src/pdf2md/math_repair.py`, `src/pdf2md/conversion.py`, `src/pdf2md/ir.py`, tests, `ARCHITECTURE.md`, `docs/V1IMPLEMENTATIONPLAN.md`, `PLAN.md`, and `PROGRESS.md`.
|
||||||
|
- Default verification: `uv run pytest` passed 172 tests with 1 skipped.
|
||||||
|
- Targeted verification: `uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py` passed 56 tests.
|
||||||
|
- Requested sample verification: `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite` succeeded; final report shows `Math render error count: 0` and two `MATH_RENDER_REPAIRED` info warnings.
|
||||||
|
- Known failures: none.
|
||||||
|
- Residual risk: repair rules are deliberately narrow; future PDFs may expose MathJax failures that should remain warnings until a deterministic rule is added and tested.
|
||||||
|
- Next action: optional Obsidian visual review or additional sample validation.
|
||||||
@@ -4,7 +4,7 @@ Last updated: 2026-05-08
|
|||||||
|
|
||||||
This document is the implementation plan for v1. It does not replace `PRD.md` or `ARCHITECTURE.md`; use those files as the source of product requirements and system design. This plan explains the order of work, sprint contracts, verification gates, and agent ownership for implementing the converter.
|
This document is the implementation plan for v1. It does not replace `PRD.md` or `ARCHITECTURE.md`; use those files as the source of product requirements and system design. This plan explains the order of work, sprint contracts, verification gates, and agent ownership for implementing the converter.
|
||||||
|
|
||||||
Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents.
|
Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents. Sprint 11 implemented conservative MathJax warning mitigation for failed math spans.
|
||||||
|
|
||||||
## 1. V1 Outcome
|
## 1. V1 Outcome
|
||||||
|
|
||||||
@@ -599,6 +599,48 @@ Hard failure criteria:
|
|||||||
- Chunk outputs are merged.
|
- Chunk outputs are merged.
|
||||||
- Default tests require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
- Default tests require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||||
|
|
||||||
|
### Sprint 11: MathJax Warning Mitigation
|
||||||
|
|
||||||
|
Active contract:
|
||||||
|
|
||||||
|
- `docs/Sprints/SPRINT11CONTRACT.md`
|
||||||
|
|
||||||
|
Status:
|
||||||
|
|
||||||
|
- Implemented.
|
||||||
|
|
||||||
|
Objective:
|
||||||
|
|
||||||
|
- Repair narrow MathJax-invalid formula artifacts after initial local validation and before final output writing.
|
||||||
|
|
||||||
|
Touched surfaces:
|
||||||
|
|
||||||
|
- `quality.py`
|
||||||
|
- `math_repair.py`
|
||||||
|
- `conversion.py`
|
||||||
|
- `ir.py`
|
||||||
|
- Unit tests for quality details, repair rules, conversion, and recheck behavior
|
||||||
|
|
||||||
|
Expected outputs:
|
||||||
|
|
||||||
|
- Failed math expression records expose body, display mode, span, and checker message.
|
||||||
|
- Repair candidates are generated only for failed math spans.
|
||||||
|
- Repeated same-direction scripts are disambiguated with an empty group.
|
||||||
|
- Truncated `\end{a}` array endings are repaired when array environments are unbalanced.
|
||||||
|
- `convert` and `recheck` share the same repair behavior.
|
||||||
|
- Applied repairs are recorded as `MATH_RENDER_REPAIRED` info warnings and do not count as math render errors.
|
||||||
|
|
||||||
|
Verification checks:
|
||||||
|
|
||||||
|
- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
|
||||||
|
- `samples/MITC공부.pdf` validates locally with `Math render error count: 0`.
|
||||||
|
|
||||||
|
Hard failure criteria:
|
||||||
|
|
||||||
|
- Repair changes math spans that did not fail local MathJax validation.
|
||||||
|
- Repair claims success without candidate revalidation.
|
||||||
|
- Repair introduces remote services, alternate engines, or mandatory sample-dependent default tests.
|
||||||
|
|
||||||
## 6. Cross-Cutting Acceptance Criteria
|
## 6. Cross-Cutting Acceptance Criteria
|
||||||
|
|
||||||
Every implementation sprint must preserve these acceptance criteria:
|
Every implementation sprint must preserve these acceptance criteria:
|
||||||
@@ -645,7 +687,7 @@ Handoff fields:
|
|||||||
- MinerU 3.1.0 install and CLI behavior are source-verified, but real local output still needs a later local probe before release.
|
- MinerU 3.1.0 install and CLI behavior are source-verified, but real local output still needs a later local probe before release.
|
||||||
- GTX 1070 Ti 8GB is visible locally, but it is Pascal compute capability 6.1; `doctor` and setup docs must make CUDA/PyTorch limits clear.
|
- GTX 1070 Ti 8GB is visible locally, but it is Pascal compute capability 6.1; `doctor` and setup docs must make CUDA/PyTorch limits clear.
|
||||||
- `uv` is installed per-user at `C:\Users\user\.local\bin`, but a new shell may need PATH refresh before `uv` is visible.
|
- `uv` is installed per-user at `C:\Users\user\.local\bin`, but a new shell may need PATH refresh before `uv` is visible.
|
||||||
- Formula renderability checks need a local tool choice; the implementation should start with an interface and graceful unavailable-tool warning if needed.
|
- Formula renderability checks and conservative warning mitigation are implemented, but formula reconstruction remains best effort and should keep warnings/provenance visible.
|
||||||
- Some PDFs will have tables or formulas that cannot be faithfully represented in Markdown; metadata and `.report.md` must surface this instead of hiding it.
|
- Some PDFs will have tables or formulas that cannot be faithfully represented in Markdown; metadata and `.report.md` must surface this instead of hiding it.
|
||||||
- Redistribution license obligations must be reviewed before packaging, redistribution, or bundling model weights.
|
- Redistribution license obligations must be reviewed before packaging, redistribution, or bundling model weights.
|
||||||
|
|
||||||
|
|||||||
@@ -25,6 +25,7 @@ from pdf2md.ir import (
|
|||||||
)
|
)
|
||||||
from pdf2md.markdown import normalize_markdown
|
from pdf2md.markdown import normalize_markdown
|
||||||
from pdf2md.math_render import create_default_math_checker
|
from pdf2md.math_render import create_default_math_checker
|
||||||
|
from pdf2md.math_repair import repair_math_render_failures
|
||||||
from pdf2md.metadata import build_metadata
|
from pdf2md.metadata import build_metadata
|
||||||
from pdf2md.mineru_adapter import (
|
from pdf2md.mineru_adapter import (
|
||||||
ENGINE_NAME,
|
ENGINE_NAME,
|
||||||
@@ -35,7 +36,7 @@ from pdf2md.mineru_adapter import (
|
|||||||
)
|
)
|
||||||
from pdf2md.paths import DiscoveredPdf, PathLike, PlannedOutput, discover_pdfs, plan_outputs
|
from pdf2md.paths import DiscoveredPdf, PathLike, PlannedOutput, discover_pdfs, plan_outputs
|
||||||
from pdf2md.pdf_splitter import PdfChunkPlan, plan_pdf_chunks, write_pdf_chunk
|
from pdf2md.pdf_splitter import PdfChunkPlan, plan_pdf_chunks, write_pdf_chunk
|
||||||
from pdf2md.quality import MathChecker, QualityResult, check_asset_links, check_math_renderability, merge_quality_results
|
from pdf2md.quality import MathChecker, QualityResult, check_asset_links, check_math_renderability_details, merge_quality_results
|
||||||
from pdf2md.report import FinalStatus, determine_final_status, render_report
|
from pdf2md.report import FinalStatus, determine_final_status, render_report
|
||||||
|
|
||||||
|
|
||||||
@@ -101,12 +102,19 @@ class _ConversionTask:
|
|||||||
original_source_sha256: str | None = None
|
original_source_sha256: str | None = None
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
|
||||||
|
class _PreparedMarkdown:
|
||||||
|
markdown: str
|
||||||
|
quality: QualityResult
|
||||||
|
|
||||||
|
|
||||||
_IMAGE_LINK_RE = re.compile(r"!\[(?P<alt>[^\]\n]*)\]\((?P<target>[^)\n]+)\)")
|
_IMAGE_LINK_RE = re.compile(r"!\[(?P<alt>[^\]\n]*)\]\((?P<target>[^)\n]+)\)")
|
||||||
_DISPLAY_MATH_RE = re.compile(r"(?<!\\)\$\$(?P<body>.*?)(?<!\\)\$\$", re.DOTALL)
|
_DISPLAY_MATH_RE = re.compile(r"(?<!\\)\$\$(?P<body>.*?)(?<!\\)\$\$", re.DOTALL)
|
||||||
_INLINE_MATH_RE = re.compile(r"(?<!\\)\$(?P<body>[^\n$]+?)(?<!\\)\$")
|
_INLINE_MATH_RE = re.compile(r"(?<!\\)\$(?P<body>[^\n$]+?)(?<!\\)\$")
|
||||||
_RECHECKED_WARNING_CODES = frozenset(
|
_RECHECKED_WARNING_CODES = frozenset(
|
||||||
{
|
{
|
||||||
WarningCode.MATH_RENDER_FAILED,
|
WarningCode.MATH_RENDER_FAILED,
|
||||||
|
WarningCode.MATH_RENDER_REPAIRED,
|
||||||
WarningCode.ASSET_LINK_MISSING,
|
WarningCode.ASSET_LINK_MISSING,
|
||||||
WarningCode.ASSET_LINK_INVALID,
|
WarningCode.ASSET_LINK_INVALID,
|
||||||
}
|
}
|
||||||
@@ -240,12 +248,14 @@ def recheck_markdown(
|
|||||||
markdown = markdown_file.read_text(encoding="utf-8")
|
markdown = markdown_file.read_text(encoding="utf-8")
|
||||||
assets_dir = markdown_file.with_suffix(".assets")
|
assets_dir = markdown_file.with_suffix(".assets")
|
||||||
assets = _assets_from_metadata(existing_metadata)
|
assets = _assets_from_metadata(existing_metadata)
|
||||||
quality = _run_quality_checks(
|
prepared = _prepare_markdown_for_output(
|
||||||
markdown,
|
markdown,
|
||||||
markdown_dir=markdown_file.parent,
|
markdown_dir=markdown_file.parent,
|
||||||
asset_root=assets_dir,
|
asset_root=assets_dir,
|
||||||
math_checker=math_checker,
|
math_checker=math_checker,
|
||||||
)
|
)
|
||||||
|
markdown = prepared.markdown
|
||||||
|
quality = prepared.quality
|
||||||
warnings = _preserved_metadata_warnings(existing_metadata) + quality.warnings
|
warnings = _preserved_metadata_warnings(existing_metadata) + quality.warnings
|
||||||
document = _build_document(
|
document = _build_document(
|
||||||
source_pdf=Path(_metadata_text(existing_metadata, "source_pdf")),
|
source_pdf=Path(_metadata_text(existing_metadata, "source_pdf")),
|
||||||
@@ -276,6 +286,7 @@ def recheck_markdown(
|
|||||||
)
|
)
|
||||||
final_status = determine_final_status(metadata_data, report_quality)
|
final_status = determine_final_status(metadata_data, report_quality)
|
||||||
|
|
||||||
|
_write_text(markdown_file, markdown)
|
||||||
_write_text(metadata_path, json.dumps(metadata_data, indent=2, ensure_ascii=False, sort_keys=True) + "\n")
|
_write_text(metadata_path, json.dumps(metadata_data, indent=2, ensure_ascii=False, sort_keys=True) + "\n")
|
||||||
_write_text(report_path, report_text)
|
_write_text(report_path, report_text)
|
||||||
|
|
||||||
@@ -641,16 +652,17 @@ def _convert_in_work_dir(
|
|||||||
asset_root=plan.assets_dir,
|
asset_root=plan.assets_dir,
|
||||||
check_assets=False,
|
check_assets=False,
|
||||||
)
|
)
|
||||||
quality = _run_quality_checks(
|
prepared = _prepare_markdown_for_output(
|
||||||
normalized.markdown,
|
normalized.markdown,
|
||||||
markdown_dir=plan.markdown_path.parent,
|
markdown_dir=plan.markdown_path.parent,
|
||||||
asset_root=plan.assets_dir,
|
asset_root=plan.assets_dir,
|
||||||
math_checker=math_checker,
|
math_checker=math_checker,
|
||||||
)
|
)
|
||||||
|
quality = prepared.quality
|
||||||
warnings = adapter_result.warnings + assets.warnings + normalized.warnings + quality.warnings
|
warnings = adapter_result.warnings + assets.warnings + normalized.warnings + quality.warnings
|
||||||
document = _build_document(
|
document = _build_document(
|
||||||
source_pdf=metadata_source,
|
source_pdf=metadata_source,
|
||||||
markdown=normalized.markdown,
|
markdown=prepared.markdown,
|
||||||
assets=assets.records,
|
assets=assets.records,
|
||||||
warnings=warnings,
|
warnings=warnings,
|
||||||
raw_structured=adapter_result.raw_structured,
|
raw_structured=adapter_result.raw_structured,
|
||||||
@@ -679,7 +691,7 @@ def _convert_in_work_dir(
|
|||||||
)
|
)
|
||||||
final_status = determine_final_status(metadata_data, report_quality)
|
final_status = determine_final_status(metadata_data, report_quality)
|
||||||
|
|
||||||
_write_text(plan.markdown_path, normalized.markdown)
|
_write_text(plan.markdown_path, prepared.markdown)
|
||||||
if metadata_enabled and plan.metadata_path is not None:
|
if metadata_enabled and plan.metadata_path is not None:
|
||||||
_write_text(plan.metadata_path, json.dumps(metadata_data, indent=2, ensure_ascii=False, sort_keys=True) + "\n")
|
_write_text(plan.metadata_path, json.dumps(metadata_data, indent=2, ensure_ascii=False, sort_keys=True) + "\n")
|
||||||
_write_text(plan.report_path, report_text)
|
_write_text(plan.report_path, report_text)
|
||||||
@@ -824,10 +836,44 @@ def _run_quality_checks(
|
|||||||
return asset_quality
|
return asset_quality
|
||||||
if math_checker is None:
|
if math_checker is None:
|
||||||
math_checker = create_default_math_checker()
|
math_checker = create_default_math_checker()
|
||||||
math_quality = check_math_renderability(markdown, math_checker)
|
math_quality = check_math_renderability_details(markdown, math_checker).quality
|
||||||
return merge_quality_results(asset_quality, math_quality)
|
return merge_quality_results(asset_quality, math_quality)
|
||||||
|
|
||||||
|
|
||||||
|
def _prepare_markdown_for_output(
    markdown: str,
    *,
    markdown_dir: Path,
    asset_root: Path,
    math_checker: MathChecker | None,
) -> _PreparedMarkdown:
    asset_quality = check_asset_links(markdown, markdown_dir=markdown_dir, asset_root=asset_root)
    if not _has_math(markdown):
        return _PreparedMarkdown(markdown=markdown, quality=asset_quality)

    checker = math_checker if math_checker is not None else create_default_math_checker()
    math_details = check_math_renderability_details(markdown, checker)
    initial_quality = merge_quality_results(asset_quality, math_details.quality)
    if checker is None or not math_details.failures:
        return _PreparedMarkdown(markdown=markdown, quality=initial_quality)

    repair_result = repair_math_render_failures(markdown, math_details.failures, checker)
    if not repair_result.repairs:
        return _PreparedMarkdown(markdown=markdown, quality=initial_quality)

    repaired_quality = _run_quality_checks(
        repair_result.markdown,
        markdown_dir=markdown_dir,
        asset_root=asset_root,
        math_checker=checker,
    )
    repair_quality = QualityResult(warnings=repair_result.warnings)
    return _PreparedMarkdown(
        markdown=repair_result.markdown,
        quality=merge_quality_results(repaired_quality, repair_quality),
    )
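
Review note: the control flow of `_prepare_markdown_for_output` above can be read as check, repair, then re-check. The sketch below is a simplified illustration of that shape only. The names `prepare`, `check`, and `repair` are hypothetical stand-ins and not part of `pdf2md`; failures are modelled as plain strings rather than `MathRenderFailure` records.

```python
from typing import Callable

# Hypothetical stand-in types: a check returns the failing items,
# a repair returns the (possibly changed) text and how many repairs it made.
Check = Callable[[str], list[str]]
Repair = Callable[[str, list[str]], tuple[str, int]]


def prepare(text: str, check: Check, repair: Repair) -> tuple[str, list[str]]:
    """Return (text to write, remaining failures)."""
    failures = check(text)
    if not failures:
        # Nothing failed, so the text is written unchanged.
        return text, failures
    repaired, repair_count = repair(text, failures)
    if repair_count == 0:
        # No candidate was accepted: keep the original text and its failures.
        return text, failures
    # Something changed, so validation is rerun on the repaired text.
    return repaired, check(repaired)
```

The point of the final re-check is that the reported failure count describes the text that is actually written, not the text before repair.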
|
||||||
|
|
||||||
|
|
||||||
def _has_math(markdown: str) -> bool:
|
def _has_math(markdown: str) -> bool:
|
||||||
return _DISPLAY_MATH_RE.search(markdown) is not None or _INLINE_MATH_RE.search(markdown) is not None
|
return _DISPLAY_MATH_RE.search(markdown) is not None or _INLINE_MATH_RE.search(markdown) is not None
|
||||||
|
|
||||||
|
|||||||
@@ -33,6 +33,7 @@ class WarningCode(StrEnum):
|
|||||||
GPU_UNAVAILABLE = "GPU_UNAVAILABLE"
|
GPU_UNAVAILABLE = "GPU_UNAVAILABLE"
|
||||||
LOW_CONFIDENCE_FORMULA = "LOW_CONFIDENCE_FORMULA"
|
LOW_CONFIDENCE_FORMULA = "LOW_CONFIDENCE_FORMULA"
|
||||||
MATH_RENDER_FAILED = "MATH_RENDER_FAILED"
|
MATH_RENDER_FAILED = "MATH_RENDER_FAILED"
|
||||||
|
MATH_RENDER_REPAIRED = "MATH_RENDER_REPAIRED"
|
||||||
ASSET_LINK_MISSING = "ASSET_LINK_MISSING"
|
ASSET_LINK_MISSING = "ASSET_LINK_MISSING"
|
||||||
ASSET_LINK_INVALID = "ASSET_LINK_INVALID"
|
ASSET_LINK_INVALID = "ASSET_LINK_INVALID"
|
||||||
READING_ORDER_UNCERTAIN = "READING_ORDER_UNCERTAIN"
|
READING_ORDER_UNCERTAIN = "READING_ORDER_UNCERTAIN"
|
||||||
|
|||||||
@@ -0,0 +1,165 @@
|
|||||||
"""Conservative repairs for MathJax-invalid Markdown math spans."""

from __future__ import annotations

import re
from dataclasses import dataclass

from pdf2md.ir import WarningCode, WarningRecord, WarningSeverity
from pdf2md.quality import (
    MathChecker,
    MathCheckerUnavailable,
    MathCheckResult,
    MathExpression,
    MathRenderFailure,
)


@dataclass(frozen=True)
class MathRepair:
    expression_index: int
    rule: str
    original_body: str
    repaired_body: str
    markdown_span: tuple[int, int]


@dataclass(frozen=True)
class MathRepairResult:
    markdown: str
    repairs: tuple[MathRepair, ...] = ()
    warnings: tuple[WarningRecord, ...] = ()


@dataclass(frozen=True)
class _Candidate:
    body: str
    rule: str


_SCRIPT_RE = re.compile(
    r"(?P<script>[\^_])(?P<first_arg>\s*\{[^{}]*\})(?P<space>\s+)(?P=script)(?P<second_arg>\s*\{)"
)


def repair_math_render_failures(
    markdown: str,
    failures: tuple[MathRenderFailure, ...],
    checker: MathChecker,
) -> MathRepairResult:
    """Repair failed math spans only when a candidate passes the same checker."""

    if not failures:
        return MathRepairResult(markdown)

    replacements: list[tuple[tuple[int, int], str]] = []
    repairs: list[MathRepair] = []
    warnings: list[WarningRecord] = []

    for failure in sorted(failures, key=lambda item: item.expression.markdown_span[0], reverse=True):
        expression = failure.expression
        candidate = _first_valid_candidate(expression, checker)
        if candidate is None:
            continue

        replacements.append((expression.markdown_span, _format_math_span(candidate.body, expression.display)))
        repair = MathRepair(
            expression_index=expression.index,
            rule=candidate.rule,
            original_body=expression.body,
            repaired_body=candidate.body,
            markdown_span=expression.markdown_span,
        )
        repairs.append(repair)
        warnings.append(
            WarningRecord(
                WarningCode.MATH_RENDER_REPAIRED,
                WarningSeverity.INFO,
                f"Math expression {expression.index} was repaired by {candidate.rule}.",
            )
        )

    repaired = markdown
    for span, replacement in replacements:
        start, end = span
        repaired = repaired[:start] + replacement + repaired[end:]

    return MathRepairResult(markdown=repaired, repairs=tuple(reversed(repairs)), warnings=tuple(reversed(warnings)))
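
Review note: `repair_math_render_failures` sorts failures by span start in descending order before splicing. A minimal sketch of why that ordering matters follows. The helper name `apply_span_replacements` and the sample strings are illustrative assumptions, not code from this commit.

```python
def apply_span_replacements(
    text: str,
    replacements: list[tuple[tuple[int, int], str]],
) -> str:
    """Splice replacements into text, rightmost span first.

    Spans are (start, end) offsets into the ORIGINAL text. Editing from the
    end of the string backwards means earlier offsets are never shifted by a
    replacement whose length differs from the span it replaces.
    """
    ordered = sorted(replacements, key=lambda item: item[0][0], reverse=True)
    for (start, end), new_text in ordered:
        text = text[:start] + new_text + text[end:]
    return text


source = "$a$ and $bb$"
edits = [((0, 3), "$alpha$"), ((8, 12), "$beta$")]
result = apply_span_replacements(source, edits)
print(result)
```

If the same edits were applied left to right, the first replacement would lengthen the string and the second span would point at the wrong characters.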


def _first_valid_candidate(expression: MathExpression, checker: MathChecker) -> _Candidate | None:
    for candidate in _repair_candidates(expression.body):
        if candidate.body != expression.body and _candidate_passes(candidate.body, expression.display, checker):
            return candidate
    return None


def _repair_candidates(body: str) -> tuple[_Candidate, ...]:
    candidates: list[_Candidate] = []
    seen: set[str] = {body}

    repeated_script = _repair_repeated_scripts(body)
    _append_candidate(candidates, seen, repeated_script, "repeated_script")

    truncated_array = _repair_truncated_array_end(body)
    _append_candidate(candidates, seen, truncated_array, "truncated_array_end")

    combined = _repair_truncated_array_end(repeated_script)
    _append_candidate(candidates, seen, combined, "combined")

    return tuple(candidates)


def _append_candidate(candidates: list[_Candidate], seen: set[str], body: str, rule: str) -> None:
    if body not in seen:
        candidates.append(_Candidate(body=body, rule=rule))
        seen.add(body)


def _repair_repeated_scripts(body: str) -> str:
    def replace(match: re.Match[str]) -> str:
        script = match.group("script")
        return (
            f"{script}{match.group('first_arg')}"
            f"{match.group('space')}{{}} {script}{match.group('second_arg')}"
        )

    return _SCRIPT_RE.sub(replace, body)
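
Review note: the repeated-script rule can be tried in isolation. The snippet below copies the pattern from `_SCRIPT_RE` into a standalone function (the name `repair_repeated_scripts` without the leading underscore is just for this sketch) to show which inputs it touches. One observation, based only on reading the pattern: it requires whitespace between the first braced argument and the repeated `^` or `_`, so a compact form such as `x^{i}^{t}` is left alone.

```python
import re

# Same pattern as _SCRIPT_RE in the diff: a script character, one braced
# argument, whitespace, the SAME script character again, then an opening brace.
SCRIPT_RE = re.compile(
    r"(?P<script>[\^_])(?P<first_arg>\s*\{[^{}]*\})(?P<space>\s+)(?P=script)(?P<second_arg>\s*\{)"
)


def repair_repeated_scripts(body: str) -> str:
    """Insert an empty group `{}` so the second script attaches to a new base."""

    def replace(match: re.Match) -> str:
        script = match.group("script")
        return (
            f"{script}{match.group('first_arg')}"
            f"{match.group('space')}{{}} {script}{match.group('second_arg')}"
        )

    return SCRIPT_RE.sub(replace, body)


print(repair_repeated_scripts("x ^ {i} ^ {t}"))
print(repair_repeated_scripts("x^{i}^{t}"))
print(repair_repeated_scripts("x _ {a} ^ {b}"))
```

Because the rewrite changes how the formula is parsed, gating it on the checker (as `_first_valid_candidate` does) and recording a `MATH_RENDER_REPAIRED` warning seems like a sensible safeguard.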


def _repair_truncated_array_end(body: str) -> str:
    if r"\end{a}" not in body:
        return body
    if body.count(r"\begin{array}") <= body.count(r"\end{array}"):
        return body
    return body.replace(r"\end{a}", r"\end{array}")
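
Review note: the two guards in `_repair_truncated_array_end` are easy to miss. The standalone copy below (renamed `repair_truncated_array_end` for this sketch) shows that the rewrite only fires when a literal `\end{a}` is present and there are more `\begin{array}` openers than `\end{array}` closers.

```python
def repair_truncated_array_end(body: str) -> str:
    """Expand a truncated `\\end{a}` to `\\end{array}` when an array is unclosed."""
    if r"\end{a}" not in body:
        return body
    # Only act if at least one \begin{array} is still waiting for its \end{array}.
    if body.count(r"\begin{array}") <= body.count(r"\end{array}"):
        return body
    return body.replace(r"\end{a}", r"\end{array}")


truncated = r"\begin{array}{c} x \end{a}"
balanced = r"\begin{array}{c} x \end{array} \end{a}"
print(repair_truncated_array_end(truncated))
print(repair_truncated_array_end(balanced))
```

Note that `\end{a}` is not a substring of `\end{array}`, so an already well-formed body never passes the first guard.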


def _candidate_passes(body: str, display: bool, checker: MathChecker) -> bool:
    expression = MathExpression(index=0, body=body, display=display, markdown_span=(0, 0))
    try:
        batch_checker = getattr(checker, "check_expressions", None)
        if callable(batch_checker):
            raw_results = batch_checker((expression,))
            if not isinstance(raw_results, tuple | list) or len(raw_results) != 1:
                return False
            result = _coerce_result(raw_results[0])
        else:
            result = _coerce_result(checker(body))
    except MathCheckerUnavailable:
        return False
    return result.ok
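
Review note: `_candidate_passes` accepts two checker shapes, an object with a batch `check_expressions` method or a plain callable taking the body. The sketch below illustrates that duck-typed dispatch in a self-contained form. `CheckResult`, `coerce`, `candidate_passes`, and `BatchChecker` are stand-ins written for this note, and the batch method here receives plain strings rather than `MathExpression` objects, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Stand-in for MathCheckResult (assumed shape: ok flag plus message)."""

    ok: bool
    message: str = ""


def coerce(value: object) -> CheckResult:
    # Anything that is neither a bool nor a CheckResult is treated as a failure.
    if isinstance(value, bool):
        return CheckResult(ok=value)
    if isinstance(value, CheckResult):
        return value
    return CheckResult(ok=False)


def candidate_passes(body: str, checker: object) -> bool:
    batch = getattr(checker, "check_expressions", None)
    if callable(batch):
        results = batch((body,))
        # A batch checker must return exactly one result for one input.
        if not isinstance(results, (tuple, list)) or len(results) != 1:
            return False
        return coerce(results[0]).ok
    return coerce(checker(body)).ok


class BatchChecker:
    def check_expressions(self, bodies):
        return tuple(CheckResult(ok="{}" in body) for body in bodies)
```

Treating malformed or unexpected checker output as a failed candidate keeps the repair conservative: the original span is left untouched unless the checker clearly approves the replacement.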


def _coerce_result(value: bool | MathCheckResult) -> MathCheckResult:
    if isinstance(value, bool):
        return MathCheckResult(ok=value)
    if isinstance(value, MathCheckResult):
        return value
    return MathCheckResult(ok=False)


def _format_math_span(body: str, display: bool) -> str:
    if display:
        return f"$$\n{body.strip()}\n$$"
    return f"${body.strip()}$"
|
||||||
+43
-16
@@ -24,6 +24,12 @@ class MathExpression:
|
|||||||
markdown_span: tuple[int, int]
|
markdown_span: tuple[int, int]
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
class MathRenderFailure:
    expression: MathExpression
    message: str = ""
|
||||||
|
|
||||||
|
|
||||||
MathChecker = Callable[[str], bool | MathCheckResult]
|
MathChecker = Callable[[str], bool | MathCheckResult]
|
||||||
|
|
||||||
|
|
||||||
@@ -39,6 +45,12 @@ class QualityResult:
|
|||||||
return self.missing_asset_link_count + self.invalid_asset_link_count + self.math_render_error_count
|
return self.missing_asset_link_count + self.invalid_asset_link_count + self.math_render_error_count
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
class MathRenderabilityResult:
    quality: QualityResult
    failures: tuple[MathRenderFailure, ...] = ()
|
||||||
|
|
||||||
|
|
||||||
class MathCheckerUnavailable(RuntimeError):
|
class MathCheckerUnavailable(RuntimeError):
|
||||||
"""Raised by a local math checker when renderability cannot be checked."""
|
"""Raised by a local math checker when renderability cannot be checked."""
|
||||||
|
|
||||||
@@ -95,25 +107,34 @@ def check_asset_links(
|
|||||||
def check_math_renderability(markdown: str, checker: MathChecker | None = None) -> QualityResult:
|
def check_math_renderability(markdown: str, checker: MathChecker | None = None) -> QualityResult:
|
||||||
"""Check math renderability through an injected local checker."""
|
"""Check math renderability through an injected local checker."""
|
||||||
|
|
||||||
|
return check_math_renderability_details(markdown, checker).quality
|
||||||
|
|
||||||
|
|
||||||
|
def check_math_renderability_details(markdown: str, checker: MathChecker | None = None) -> MathRenderabilityResult:
|
||||||
|
"""Check math renderability and return failed expression records."""
|
||||||
|
|
||||||
if not isinstance(markdown, str):
|
if not isinstance(markdown, str):
|
||||||
raise TypeError("markdown must be a string")
|
raise TypeError("markdown must be a string")
|
||||||
|
|
||||||
expressions = extract_math_expressions(markdown)
|
expressions = extract_math_expressions(markdown)
|
||||||
if not expressions:
|
if not expressions:
|
||||||
return QualityResult()
|
return MathRenderabilityResult(QualityResult())
|
||||||
|
|
||||||
if checker is None:
|
if checker is None:
|
||||||
return QualityResult(
|
return MathRenderabilityResult(
|
||||||
warnings=(
|
QualityResult(
|
||||||
WarningRecord(
|
warnings=(
|
||||||
WarningCode.MATH_RENDER_FAILED,
|
WarningRecord(
|
||||||
WarningSeverity.INFO,
|
WarningCode.MATH_RENDER_FAILED,
|
||||||
"Math render checker is unavailable; renderability was not validated.",
|
WarningSeverity.INFO,
|
||||||
),
|
"Math render checker is unavailable; renderability was not validated.",
|
||||||
|
),
|
||||||
|
)
|
||||||
)
|
)
|
||||||
)
|
)
|
||||||
|
|
||||||
warnings: list[WarningRecord] = []
|
warnings: list[WarningRecord] = []
|
||||||
|
failures: list[MathRenderFailure] = []
|
||||||
failure_count = 0
|
failure_count = 0
|
||||||
try:
|
try:
|
||||||
results = _check_expressions(expressions, checker)
|
results = _check_expressions(expressions, checker)
|
||||||
@@ -122,6 +143,7 @@ def check_math_renderability(markdown: str, checker: MathChecker | None = None)
|
|||||||
message = result.message
|
message = result.message
|
||||||
if not ok:
|
if not ok:
|
||||||
failure_count += 1
|
failure_count += 1
|
||||||
|
failures.append(MathRenderFailure(expression=expression, message=message))
|
||||||
details = f": {message}" if message else ""
|
details = f": {message}" if message else ""
|
||||||
kind = "display" if expression.display else "inline"
|
kind = "display" if expression.display else "inline"
|
||||||
warnings.append(
|
warnings.append(
|
||||||
@@ -131,17 +153,22 @@ def check_math_renderability(markdown: str, checker: MathChecker | None = None)
|
|||||||
)
|
)
|
||||||
)
|
)
|
||||||
except MathCheckerUnavailable as error:
|
except MathCheckerUnavailable as error:
|
||||||
return QualityResult(
|
return MathRenderabilityResult(
|
||||||
warnings=(
|
QualityResult(
|
||||||
WarningRecord(
|
warnings=(
|
||||||
WarningCode.MATH_RENDER_FAILED,
|
WarningRecord(
|
||||||
WarningSeverity.INFO,
|
WarningCode.MATH_RENDER_FAILED,
|
||||||
f"Math render checker is unavailable: {error}",
|
WarningSeverity.INFO,
|
||||||
),
|
f"Math render checker is unavailable: {error}",
|
||||||
|
),
|
||||||
|
)
|
||||||
)
|
)
|
||||||
)
|
)
|
||||||
|
|
||||||
return QualityResult(math_render_error_count=failure_count, warnings=tuple(warnings))
|
return MathRenderabilityResult(
|
||||||
|
QualityResult(math_render_error_count=failure_count, warnings=tuple(warnings)),
|
||||||
|
failures=tuple(failures),
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
def merge_quality_results(*results: QualityResult) -> QualityResult:
|
def merge_quality_results(*results: QualityResult) -> QualityResult:
|
||||||
|
|||||||
@@ -13,6 +13,7 @@ from pdf2md.conversion import BatchConversionResult, convert_input, convert_pdf,
|
|||||||
from pdf2md.ir import WarningCode, WarningRecord, WarningSeverity
|
from pdf2md.ir import WarningCode, WarningRecord, WarningSeverity
|
||||||
from pdf2md.mineru_adapter import MinerUAdapterResult, StrictLocalViolationError
|
from pdf2md.mineru_adapter import MinerUAdapterResult, StrictLocalViolationError
|
||||||
from pdf2md.paths import OutputConflictError
|
from pdf2md.paths import OutputConflictError
|
||||||
|
from pdf2md.quality import MathCheckResult
|
||||||
|
|
||||||
|
|
||||||
class FakeAdapter:
|
class FakeAdapter:
|
||||||
@@ -230,6 +231,27 @@ def test_convert_pdf_records_math_checker_failures_in_metadata_and_report(tmp_pa
|
|||||||
assert "`MATH_RENDER_FAILED`" in report
|
assert "`MATH_RENDER_FAILED`" in report
|
||||||
|
|
||||||
|
|
||||||
|
def test_convert_pdf_repairs_math_render_failure_before_writing_outputs(tmp_path: Path) -> None:
    class RepairAwareChecker:
        def check_expressions(self, expressions):
            return tuple(MathCheckResult(ok="{} ^ {t}" in expression.body) for expression in expressions)

    pdf = make_pdf(tmp_path)
    adapter = FakeAdapter(raw_markdown="\\[x ^ {i} ^ {t}\\]\n")

    result = convert_pdf(pdf, tmp_path / "out", adapter=adapter, math_checker=RepairAwareChecker(), clock=fixed_clock)

    assert result.final_status == "partial"
    assert result.markdown_path.read_text(encoding="utf-8") == "$$\nx ^ {i} {} ^ {t}\n$$"
    assert [warning.code for warning in result.warnings] == [WarningCode.MATH_RENDER_REPAIRED]
    metadata = json.loads(result.metadata_path.read_text(encoding="utf-8"))
    assert metadata["summary"]["math_render_error_count"] == 0
    assert metadata["warnings"][0]["code"] == "MATH_RENDER_REPAIRED"
    report = result.report_path.read_text(encoding="utf-8")
    assert "- Math render error count: 0" in report
    assert "`MATH_RENDER_REPAIRED`" in report
|
||||||
|
|
||||||
|
|
||||||
def test_recheck_markdown_regenerates_metadata_and_report_from_current_markdown(tmp_path: Path) -> None:
|
def test_recheck_markdown_regenerates_metadata_and_report_from_current_markdown(tmp_path: Path) -> None:
|
||||||
pdf = make_pdf(tmp_path)
|
pdf = make_pdf(tmp_path)
|
||||||
adapter = FakeAdapter(raw_markdown="Inline \\(bad_math\\)\n")
|
adapter = FakeAdapter(raw_markdown="Inline \\(bad_math\\)\n")
|
||||||
@@ -257,6 +279,25 @@ def test_recheck_markdown_regenerates_metadata_and_report_from_current_markdown(
|
|||||||
assert "- None" in report
|
assert "- None" in report
|
||||||
|
|
||||||
|
|
||||||
|
def test_recheck_markdown_repairs_math_render_failure(tmp_path: Path) -> None:
    class RepairAwareChecker:
        def check_expressions(self, expressions):
            return tuple(MathCheckResult(ok="{} ^ {t}" in expression.body) for expression in expressions)

    pdf = make_pdf(tmp_path)
    adapter = FakeAdapter(raw_markdown="No formulas.\n")
    result = convert_pdf(pdf, tmp_path / "out", adapter=adapter, math_checker=lambda _: True, clock=fixed_clock)
    result.markdown_path.write_text("$$\nx ^ {i} ^ {t}\n$$\n", encoding="utf-8")

    rechecked = recheck_markdown(result.markdown_path, math_checker=RepairAwareChecker(), clock=fixed_clock)

    assert rechecked.markdown_path.read_text(encoding="utf-8") == "$$\nx ^ {i} {} ^ {t}\n$$\n"
    assert [warning.code for warning in rechecked.warnings] == [WarningCode.MATH_RENDER_REPAIRED]
    metadata = json.loads(result.metadata_path.read_text(encoding="utf-8"))
    assert metadata["summary"]["math_render_error_count"] == 0
    assert metadata["warnings"][0]["code"] == "MATH_RENDER_REPAIRED"
|
||||||
|
|
||||||
|
|
||||||
def test_convert_pdf_records_unavailable_math_checker_for_math_output(tmp_path: Path, monkeypatch) -> None:
|
def test_convert_pdf_records_unavailable_math_checker_for_math_output(tmp_path: Path, monkeypatch) -> None:
|
||||||
pdf = make_pdf(tmp_path)
|
pdf = make_pdf(tmp_path)
|
||||||
adapter = FakeAdapter(raw_markdown="Inline \\(x\\)\n")
|
adapter = FakeAdapter(raw_markdown="Inline \\(x\\)\n")
|
||||||
|
|||||||
@@ -0,0 +1,65 @@
|
|||||||
|
from __future__ import annotations

from pdf2md.ir import WarningCode, WarningSeverity
from pdf2md.math_repair import repair_math_render_failures
from pdf2md.quality import MathCheckResult, MathRenderFailure, extract_math_expressions


class BodyChecker:
    def __init__(self, passing_fragment: str) -> None:
        self.passing_fragment = passing_fragment
        self.checked_bodies: list[str] = []

    def check_expressions(self, expressions):
        self.checked_bodies.extend(expression.body for expression in expressions)
        return tuple(MathCheckResult(ok=self.passing_fragment in expression.body) for expression in expressions)


def test_repair_math_render_failures_disambiguates_repeated_superscripts() -> None:
    markdown = "$$\nx ^ {i} ^ {t}\n$$\n"
    expression = extract_math_expressions(markdown)[0]
    failure = MathRenderFailure(expression=expression, message="Double exponent: use braces to clarify")
    checker = BodyChecker("{} ^ {t}")

    result = repair_math_render_failures(markdown, (failure,), checker)

    assert result.markdown == "$$\nx ^ {i} {} ^ {t}\n$$\n"
    assert result.repairs[0].rule == "repeated_script"
    assert result.warnings[0].code == WarningCode.MATH_RENDER_REPAIRED
    assert result.warnings[0].severity == WarningSeverity.INFO


def test_repair_math_render_failures_repairs_truncated_array_environment() -> None:
    markdown = "$$\n\\begin{array}{c} x \\end{a}\n$$\n"
    expression = extract_math_expressions(markdown)[0]
    failure = MathRenderFailure(expression=expression, message="Unknown environment 'a'")
    checker = BodyChecker("\\end{array}")

    result = repair_math_render_failures(markdown, (failure,), checker)

    assert result.markdown == "$$\n\\begin{array}{c} x \\end{array}\n$$\n"
    assert result.repairs[0].rule == "truncated_array_end"


def test_repair_math_render_failures_leaves_markdown_unchanged_when_candidate_fails() -> None:
    markdown = "$$\nx ^ {i} ^ {t}\n$$\n"
    expression = extract_math_expressions(markdown)[0]
    failure = MathRenderFailure(expression=expression, message="Double exponent: use braces to clarify")
    checker = BodyChecker("never-passes")

    result = repair_math_render_failures(markdown, (failure,), checker)

    assert result.markdown == markdown
    assert result.repairs == ()
    assert result.warnings == ()


def test_repair_math_render_failures_only_changes_failed_spans() -> None:
    markdown = "$a ^ {b} ^ {c}$ and $unchanged ^ {ok}$\n"
    expressions = extract_math_expressions(markdown)
    failure = MathRenderFailure(expression=expressions[0], message="Double exponent: use braces to clarify")
    checker = BodyChecker("{} ^ {c}")

    result = repair_math_render_failures(markdown, (failure,), checker)

    assert result.markdown == "$a ^ {b} {} ^ {c}$ and $unchanged ^ {ok}$\n"
|
||||||
@@ -6,6 +6,7 @@ from pdf2md.ir import WarningCode, WarningSeverity
|
|||||||
from pdf2md.quality import (
|
from pdf2md.quality import (
|
||||||
MathCheckerUnavailable,
|
MathCheckerUnavailable,
|
||||||
MathCheckResult,
|
MathCheckResult,
|
||||||
|
check_math_renderability_details,
|
||||||
check_asset_links,
|
check_asset_links,
|
||||||
check_math_renderability,
|
check_math_renderability,
|
||||||
extract_math_expressions,
|
extract_math_expressions,
|
||||||
@@ -71,6 +72,20 @@ def test_math_render_failures_are_aggregated_with_fake_checker() -> None:
|
|||||||
assert "bad_math failed" in result.warnings[0].message
|
assert "bad_math failed" in result.warnings[0].message
|
||||||
|
|
||||||
|
|
||||||
|
def test_math_renderability_details_include_failed_expression_records() -> None:
    def checker(body: str) -> MathCheckResult:
        return MathCheckResult(ok="bad" not in body, message=f"{body} failed")

    result = check_math_renderability_details("$x_i$\n\n$$\nbad_math\n$$", checker)

    assert result.quality.math_render_error_count == 1
    assert len(result.failures) == 1
    assert result.failures[0].expression.index == 1
    assert result.failures[0].expression.body == "bad_math"
    assert result.failures[0].expression.display is True
    assert result.failures[0].message == "bad_math failed"
|
||||||
|
|
||||||
|
|
||||||
def test_math_extraction_records_display_mode_and_markdown_spans() -> None:
|
def test_math_extraction_records_display_mode_and_markdown_spans() -> None:
|
||||||
markdown = "Inline $x_i^2$ before\n\n$$\n\\frac{1}{2}\n$$\n"
|
markdown = "Inline $x_i^2$ before\n\n$$\n\\frac{1}{2}\n$$\n"
|
||||||
|
|
||||||
|
|||||||