Compare commits
8 Commits
a4dcfbdedc...2232b51fc9
| Author | SHA1 | Date | |
|---|---|---|---|
| | 2232b51fc9 | | |
| | 71e6fbcc51 | | |
| | 005f17bac1 | | |
| | c77db658e7 | | |
| | 03927a26a1 | | |
| | b69c03c206 | | |
| | 80fda47163 | | |
| | 4b316ebd0b | | |
@@ -200,6 +200,7 @@ Stable warning code examples:
|
||||
- `GPU_UNAVAILABLE`
|
||||
- `LOW_CONFIDENCE_FORMULA`
|
||||
- `MATH_RENDER_FAILED`
|
||||
- `MATH_RENDER_REPAIRED`
|
||||
- `ASSET_LINK_MISSING`
|
||||
- `READING_ORDER_UNCERTAIN`
|
||||
- `STRICT_LOCAL_VIOLATION`
|
||||
|
||||
@@ -4,7 +4,7 @@ This file is the shared work plan for agents. Read it before starting work, then
|
||||
|
||||
## Current Goal
|
||||
|
||||
Completed work history is archived in `docs/WORKARCHIVE.md`. CUDA-enabled PyTorch and MinerU 3.1.0 runtime setup is complete in the project `.venv`; Sprint 10 pre-conversion PDF chunking is implemented; next work is optional real local sample validation only if requested.
|
||||
Completed work history is archived in `docs/WORKARCHIVE.md`. Sprint 11 MathJax warning mitigation is implemented. On this PC, full local runtime setup is complete in `.venv`; Markdown quality recheck for existing outputs is implemented and now shares the same conservative MathJax repair path as fresh conversion. Next work is optional manual Obsidian quality review, additional sample validation, or broader repair rules if future samples expose new deterministic MathJax failure patterns.
|
||||
|
||||
## Active Constraints
|
||||
|
||||
@@ -35,6 +35,64 @@ Completed work history is archived in `docs/WORKARCHIVE.md`. CUDA-enabled PyTorc
|
||||
12. Follow `docs/V1IMPLEMENTATIONPLAN.md` for the v1 implementation sprint sequence.
|
||||
13. Use `docs/Sprints/SPRINT10CONTRACT.md` for the implemented long-PDF pre-conversion chunking sprint.
|
||||
14. Use `docs/WORKARCHIVE.md` for completed sprint history, prior verification, runtime setup evidence, and sample conversion evidence.
|
||||
15. Use `docs/Sprints/SPRINT11CONTRACT.md` for the implemented MathJax warning mitigation sprint.
|
||||
16. Keep the mitigation path shared by `pdf2md convert` and `pdf2md recheck` so existing Markdown outputs can be cleaned without rerunning MinerU.
|
||||
|
||||
## Sprint 11: MathJax Warning Mitigation
|
||||
|
||||
Objective:
|
||||
|
||||
- Implemented a conservative local post-validation cleanup pass that attempts to remove only the specific math-span artifacts responsible for MathJax warnings, then reruns MathJax validation before writing final Markdown, metadata JSON, and report Markdown.
|
||||
|
||||
Assumptions:
|
||||
|
||||
- MathJax warning mitigation is best-effort and nonfatal.
|
||||
- The cleanup pass must stay deterministic and local-only.
|
||||
- Warning reduction must not silently erase meaningful formula content.
|
||||
- The same behavior should apply to fresh conversions and `pdf2md recheck`.
|
||||
|
||||
Planned workflow:
|
||||
|
||||
1. Run the existing MathJax renderability check against normalized Markdown and keep failed `MathExpression` records, including index, display mode, Markdown span, and MathJax message.
|
||||
2. Generate cleanup candidates only for failed spans. Candidate rules should start with narrow, non-semantic fixes such as trimming invisible/control artifacts, removing obvious OCR/extractor debris, normalizing accidental delimiter leftovers, and fixing whitespace/newline forms known to break MathJax.
|
||||
3. Validate each candidate with the same local MathJax checker. Replace a math span only when the candidate passes and preserves the original inline/display delimiter shape.
|
||||
4. Rebuild Markdown from approved span replacements and rerun the full quality check on the repaired Markdown.
|
||||
5. Write metadata/report data from the final Markdown and final quality result. Record unresolved failures as `MATH_RENDER_FAILED`; record applied mitigations in a traceable form so warning counts are not reduced by hiding changes.
|
||||
|
||||
Touched surfaces to plan in the sprint contract:
|
||||
|
||||
- `src/pdf2md/quality.py`: expose failed math expression details without losing the existing warning behavior.
|
||||
- `src/pdf2md/math_render.py`: keep MathJax checking local and batch-oriented; do not expose raw MathJax objects as public API.
|
||||
- New focused module, likely `src/pdf2md/math_repair.py`: own candidate generation, span replacement, and repair result records.
|
||||
- `src/pdf2md/conversion.py`: run mitigation between normalization and final metadata/report construction for `convert` and `recheck`.
|
||||
- `src/pdf2md/ir.py`, `src/pdf2md/metadata.py`, and `src/pdf2md/report.py`: update only if the contract decides a new repair warning/info code or summary field is needed.
|
||||
- Tests in `tests/test_quality.py`, a new `tests/test_math_repair.py`, and targeted conversion/recheck CLI tests.
|
||||
|
||||
Non-goals:
|
||||
|
||||
- Do not add cloud OCR, remote LLMs, remote render APIs, or external document upload paths.
|
||||
- Do not add a second conversion engine or runtime engine selection.
|
||||
- Do not implement a full LaTeX parser, symbolic math simplifier, or Obsidian automation.
|
||||
- Do not remove whole formulas or meaningful LaTeX tokens solely to silence warnings.
|
||||
- Do not add new CLI flags unless a later contract explicitly justifies them.
|
||||
|
||||
Verification:
|
||||
|
||||
- Unit tests for failed-expression capture, candidate generation, safe span replacement, and no-op behavior when no candidate passes.
|
||||
- Conversion tests proving repaired Markdown is written only after candidate revalidation.
|
||||
- Recheck tests proving existing output Markdown can be repaired and metadata/report regenerated without rerunning MinerU.
|
||||
- Report/metadata tests proving remaining warnings and applied mitigations are visible and derived from final state.
|
||||
- Run `uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py`.
|
||||
- Run `uv run pytest` before marking the sprint complete.
|
||||
- Optionally run `uv run pdf2md recheck outputs\MITC공부\MITC공부.md` against ignored local sample output when the user requests real-output validation.
|
||||
|
||||
Hard failure criteria:
|
||||
|
||||
- The cleanup changes math spans that did not fail MathJax validation.
|
||||
- The cleanup removes an entire formula or a semantically meaningful token without an explicit trace.
|
||||
- The cleanup reduces warning counts by dropping warnings instead of producing MathJax-valid Markdown.
|
||||
- The cleanup makes `pdf2md convert` or `pdf2md recheck` require Node.js/MathJax when they were previously optional.
|
||||
- Default tests require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
|
||||
|
||||
## Open Questions
|
||||
|
||||
@@ -51,6 +109,13 @@ Completed work history is archived in `docs/WORKARCHIVE.md`. CUDA-enabled PyTorc
|
||||
- No silent fallback after MinerU failure.
|
||||
- Conversion output includes both metadata JSON and `<stem>.report.md`.
|
||||
- Local MathJax render checking is optional and nonfatal; missing Node.js or MathJax must produce a clear warning instead of blocking conversion.
|
||||
- MathJax warning mitigation must run only after initial local MathJax validation identifies failed math spans.
|
||||
- MathJax warning mitigation must be deterministic, local-only, and limited to failed math spans.
|
||||
- Candidate math cleanup must be revalidated with the local MathJax checker before replacing Markdown.
|
||||
- If no candidate passes validation, keep the original formula and retain the `MATH_RENDER_FAILED` warning.
|
||||
- Successfully mitigated formulas must remain traceable in metadata/report output; warning reduction must not hide that a formula was changed.
|
||||
- Sprint 11 uses `MATH_RENDER_REPAIRED` info warnings for applied repair provenance.
|
||||
- Sprint 11 initial repair rules cover repeated same-direction scripts and truncated array `\end{a}` endings only.
|
||||
- Project-scoped custom agents live in `.codex/agents/*.toml`.
|
||||
- Project prompt commands live in `.codex/commands/*.md`.
|
||||
- Project-specific skills live in `.codex/skills/*/SKILL.md`.
|
||||
|
||||
+28
-12
@@ -6,26 +6,28 @@ This file records current progress for agents. Read it before starting work, the
|
||||
|
||||
- Project direction is documented in `PRD.md`, `ARCHITECTURE.md`, `AGENTS.md`, and `docs/KNOWLEDGEBASE.md`.
|
||||
- MinerU 3.1.0 is fixed as the only conversion engine.
|
||||
- The converter currently includes path planning, project-owned records, metadata, direct local MinerU adapter boundary, Obsidian Markdown normalization, local quality checks, report rendering, conversion orchestration, `pdf2md convert`, `pdf2md doctor`, local MathJax render checking, release-gate tests, and opt-in pre-conversion PDF chunking.
|
||||
- The converter currently includes path planning, project-owned records, metadata, direct local MinerU adapter boundary, Obsidian Markdown normalization, local quality checks, report rendering, conversion orchestration, `pdf2md convert`, `pdf2md recheck`, `pdf2md doctor`, local MathJax render checking, conservative MathJax warning mitigation, release-gate tests, and opt-in pre-conversion PDF chunking.
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md` defines the v1 implementation sequence.
|
||||
- `docs/Sprints/` contains completed sprint contracts through Sprint 10.
|
||||
- `docs/Sprints/` contains completed sprint contracts through Sprint 11.
|
||||
- `docs/WORKARCHIVE.md` contains completed sprint history, historical verification results, runtime setup notes, and sample conversion evidence.
|
||||
- `samples/` exists locally and is untracked by git.
|
||||
- `samples/` exists locally as fixture context.
|
||||
- `outputs/` is ignored and contains local generated conversion outputs.
|
||||
|
||||
## Environment Notes
|
||||
|
||||
- OS/workspace: Windows PowerShell in `D:\Work\Repos\AICoding\ConvertPDFToMD`.
|
||||
- OS/workspace: Windows PowerShell in `C:\git\PDFToMD`.
|
||||
- Python target: 3.12.
|
||||
- Local Python observed: 3.12.7.
|
||||
- `uv` is installed per-user at `C:\Users\user\.local\bin`.
|
||||
- Target GPU: NVIDIA GTX 1070 Ti 8GB.
|
||||
- Local project Python observed: 3.12.13 in `.venv`.
|
||||
- `uv` is installed per-user at `C:\Users\baram\.local\bin`.
|
||||
- Target GPU documented for the original project setup: NVIDIA GTX 1070 Ti 8GB.
|
||||
- Current PC GPU observed by `doctor`: NVIDIA GeForce RTX 4080 SUPER 16GB.
|
||||
- Default conversion device: `cuda:0`.
|
||||
- MinerU execution mode: direct local `mineru` CLI only.
|
||||
- Strict-local allows MinerU 3.1.0's CLI-internal temporary local `mineru-api` when the CLI runs without `--api-url`.
|
||||
- Strict-local prohibits `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends.
|
||||
- Current local runtime has CUDA-enabled PyTorch `2.6.0+cu126`, `torchvision 0.21.0+cu126`, `mineru[core]==3.1.0`, local MinerU models, and `MINERU_MODEL_SOURCE=local`.
|
||||
- Current `pdf2md doctor` status is WARN only because GTX 1070 Ti is Pascal/pre-Turing; MinerU, CUDA PyTorch, local model config, MathJax, and strict-local checks pass.
|
||||
- Current `.venv` has project fast-test dependencies, CUDA-enabled PyTorch `2.6.0+cu126`, `torchvision 0.21.0+cu126`, and `mineru[core]==3.1.0`.
|
||||
- Current `pdf2md doctor` status is PASS. MinerU, RTX 4080 SUPER CUDA PyTorch, local model config, MathJax, and strict-local checks pass.
|
||||
- MinerU models were downloaded from Hugging Face by explicit setup command. Runtime model loading uses `MINERU_MODEL_SOURCE=local`.
|
||||
|
||||
## Recent Completed Work
|
||||
|
||||
@@ -36,6 +38,20 @@ This file records current progress for agents. Read it before starting work, the
|
||||
- `convert_pdf()` returns `BatchConversionResult` when `chunk_pages` is set and keeps returning `ConversionResult` when chunking is unset.
|
||||
- Converted `samples/FourNodeQuadrilateralShellElementMITC4.pdf` with `MINERU_MODEL_SOURCE=local` and default `--gpu cuda:0`; output was written to ignored `outputs/FourNodeQuadrilateralShellElementMITC4/`.
|
||||
- The FourNode sample conversion report status was `success`: 7 pages, 22 assets, 38 inline formulas, 16 display formulas, 0 math render errors, and 0 warnings.
|
||||
- Installed uv `0.11.12` at `C:\Users\baram\.local\bin`, installed uv-managed CPython `3.12.13`, created `.venv`, and ran `uv sync`.
|
||||
- Verified base project environment with `uv run pytest`: 163 passed, 1 skipped.
|
||||
- Installed runtime dependencies on this PC: CUDA PyTorch `2.6.0+cu126`, `torchvision 0.21.0+cu126`, `mineru[core]==3.1.0`, local MathJax npm dependencies, and local MinerU models.
|
||||
- Set user environment variable `MINERU_MODEL_SOURCE=local`.
|
||||
- Verified full local runtime with `uv run pdf2md doctor`: PASS.
|
||||
- Verified real local sample conversion: `samples/FourNodeQuadrilateralShellElementMITC4.pdf` to ignored `outputs/runtime-smoke/`, status `success`, 7 pages, 22 assets, 38 inline formulas, 16 display formulas, 0 math render errors, and 0 warnings.
|
||||
- Converted `samples/MITC공부.pdf` to ignored `outputs/MITC공부/`; report status was `partial`: 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 2 MathJax render warnings, and 0 missing or invalid asset links.
|
||||
- Added `recheck_markdown()` and `pdf2md recheck <markdown.md>` to rerun local quality checks for an existing generated Markdown file and rewrite the adjacent metadata JSON and `.report.md` without rerunning MinerU.
|
||||
- Verified `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`; the command regenerated metadata/report and still reported 2 warnings because the current Markdown still contains the two MathJax-invalid expressions.
|
||||
- Reconverted `samples/MITC공부.pdf` with `--overwrite` to ignored `outputs/MITC공부/`; report status remains `partial`: 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 2 MathJax render warnings, and 0 missing or invalid asset links.
|
||||
- Sprint 11 implemented conservative MathJax warning mitigation with failed-expression details, `src/pdf2md/math_repair.py`, shared `convert`/`recheck` repair integration, and `MATH_RENDER_REPAIRED` info warnings.
|
||||
- Verified default fast suite: `uv run pytest` passed 172 tests with 1 skipped.
|
||||
- Verified requested real sample: `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite` succeeded with 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 0 MathJax render errors, and 2 `MATH_RENDER_REPAIRED` info warnings.
|
||||
- Reconverted `samples/MITC공부.pdf` to ignored `outputs/MITC공부/` with Sprint 11 mitigation; report status is `partial` because of 2 `MATH_RENDER_REPAIRED` info warnings, with 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 0 MathJax render errors, and 0 missing or invalid asset links.
|
||||
|
||||
## In Progress
|
||||
|
||||
@@ -44,10 +60,10 @@ This file records current progress for agents. Read it before starting work, the
|
||||
## Blockers
|
||||
|
||||
- No active blocker.
|
||||
- GTX 1070 Ti remains an 8GB Pascal GPU; larger PDFs may still hit VRAM or model compatibility limits.
|
||||
|
||||
## Next Actions
|
||||
|
||||
1. Review generated sample Markdown outputs in Obsidian if visual quality needs manual assessment.
|
||||
2. Run optional real local chunked conversion on a long sample only if requested.
|
||||
3. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
|
||||
2. Run additional real local sample validation only if requested, especially for new MathJax failure messages not covered by Sprint 11's narrow repair rules.
|
||||
3. Run optional real local chunked conversion on a long sample only if requested.
|
||||
4. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
|
||||
|
||||
@@ -4,7 +4,7 @@ Local-only PDF-to-Markdown converter for math-heavy digital documents.
|
||||
|
||||
## Status
|
||||
|
||||
The project currently provides a Python package, `pdf2md convert`, metadata/report output, mocked MinerU adapter tests, `pdf2md doctor` setup diagnostics, and Sprint 9 release-gate documentation. Real local MinerU sample validation remains optional and may be blocked until MinerU 3.1.0 and local model/cache setup are available.
|
||||
The project currently provides a Python package, `pdf2md convert`, Markdown recheck via `pdf2md recheck`, metadata/report output, mocked MinerU adapter tests, `pdf2md doctor` setup diagnostics, and Sprint 9 release-gate documentation. Real local MinerU sample validation remains optional and may be blocked until MinerU 3.1.0 and local model/cache setup are available.
|
||||
|
||||
## Setup
|
||||
|
||||
@@ -76,6 +76,16 @@ The model/cache check looks for these environment variables when present:
|
||||
|
||||
It also checks for `%USERPROFILE%\mineru.json`, which MinerU documents as its default user config location. Missing model/cache paths are warnings because model download and cache population must be explicit setup actions.
|
||||
|
||||
## Rechecking Markdown
|
||||
|
||||
After editing a generated Markdown file, rerun local quality checks and regenerate the adjacent metadata/report files:
|
||||
|
||||
```powershell
uv run pdf2md recheck outputs/MITC공부/MITC공부.md
```
|
||||
|
||||
`recheck` reads the existing `<stem>.metadata.json` for source PDF, engine, page, and asset provenance. It replaces quality warnings that can be recalculated from the current Markdown, including MathJax render failures and local asset-link warnings, then rewrites `<stem>.metadata.json` and `<stem>.report.md`.
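As a rough illustration of the "adjacent files" convention described above, the two sibling paths can be derived from the Markdown path alone. The helper name `adjacent_output_paths` is hypothetical; only the `<stem>.metadata.json` and `<stem>.report.md` naming comes from this README.

```python
from pathlib import Path


def adjacent_output_paths(markdown_path: Path) -> tuple[Path, Path]:
    """Return (<stem>.metadata.json, <stem>.report.md) next to the Markdown file."""
    stem = markdown_path.stem
    parent = markdown_path.parent
    return parent / f"{stem}.metadata.json", parent / f"{stem}.report.md"
```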
|
||||
|
||||
## Runtime Policy
|
||||
|
||||
Runtime conversion is strict-local. Allowed: direct `mineru` CLI execution and the CLI-internal temporary local `mineru-api` that MinerU starts when `--api-url` is omitted. Prohibited: `--api-url`, remote APIs, router mode, HTTP client backends, remote OpenAI-compatible backends, hosted renderers, and cloud fallbacks.
|
||||
|
||||
@@ -0,0 +1,181 @@
|
||||
# Sprint 11 Contract: MathJax Warning Mitigation
|
||||
|
||||
Status: Implemented
|
||||
Last updated: 2026-05-11
|
||||
|
||||
## Objective
|
||||
|
||||
Add a conservative local cleanup pass for MathJax-invalid formulas:
|
||||
|
||||
1. Run the existing MathJax renderability check on normalized Markdown.
|
||||
2. Build repair candidates only for expressions that failed MathJax validation.
|
||||
3. Re-check each candidate with the same local checker.
|
||||
4. Replace only candidates that pass.
|
||||
5. Re-run final quality checks before writing Markdown, metadata JSON, and report Markdown.
|
||||
|
||||
The feature should reduce `MATH_RENDER_FAILED` warnings without hiding that a formula was changed.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
- `pdf2md convert` writes normalized Markdown, metadata JSON, and `<stem>.report.md`.
|
||||
- `pdf2md recheck` can rerun quality checks for an existing generated Markdown file without rerunning MinerU.
|
||||
- Local MathJax checking is already optional and nonfatal.
|
||||
- `outputs/MITC공부/MITC공부.md` currently has two MathJax render failures:
  - expression 8: `Double exponent: use braces to clarify`
  - expression 83: `Unknown environment 'a'`
|
||||
- `samples/MITC공부.pdf` is the requested real local validation sample.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed during implementation:
|
||||
|
||||
- `src/pdf2md/quality.py`
|
||||
- `src/pdf2md/math_repair.py`
|
||||
- `src/pdf2md/conversion.py`
|
||||
- `src/pdf2md/ir.py`
|
||||
- `tests/test_quality.py`
|
||||
- `tests/test_math_repair.py`
|
||||
- `tests/test_conversion.py`
|
||||
- `tests/test_cli.py`
|
||||
- `docs/Sprints/SPRINT11CONTRACT.md`
|
||||
- `PLAN.md`
|
||||
- `PROGRESS.md`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- Remote OCR, remote LLMs, remote render APIs, or external document upload paths.
|
||||
- Alternate PDF conversion engines.
|
||||
- Switchable conversion-engine behavior.
|
||||
- A full LaTeX parser or symbolic math rewrite engine.
|
||||
- New CLI flags unless a later user request explicitly asks for them.
|
||||
- Mandatory default tests that require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
|
||||
- Committed files under `samples/`.
|
||||
- Committed generated conversion outputs under `outputs/`.
|
||||
|
||||
## Product Behavior
|
||||
|
||||
Repair activation:
|
||||
|
||||
- Repair runs automatically when a local math checker is available and at least one math expression fails validation.
|
||||
- If the checker is unavailable, behavior remains unchanged: conversion/recheck continues with an info-level unavailable-checker warning.
|
||||
- The same repair path applies to fresh `convert` output and existing Markdown processed through `recheck`.
|
||||
|
||||
Initial deterministic repair rules:
|
||||
|
||||
- Repeated same-direction script repair:
  - Convert consecutive superscripts/subscripts such as `^ {i} ^ {t}` to `^ {i} {} ^ {t}`.
  - This resolves MathJax's double superscript/subscript error while preserving both script tokens.
- Truncated array environment repair:
  - Convert `\end{a}` to `\end{array}` only when the expression has unmatched `\begin{array}` / `\end{array}` counts.
  - This targets obvious extraction truncation, not arbitrary environment renaming.
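The two rules can be sketched at string level as follows. This is a minimal sketch, not the contents of `src/pdf2md/math_repair.py`: the function names are hypothetical, the script rule assumes flat (non-nested) brace groups, and the array rule assumes "unmatched" means more `\begin{array}` than `\end{array}`.

```python
import re

# A script operator with a flat brace group, followed by the same operator
# and another flat brace group, e.g. "^ {i} ^ {t}" or "_ {a} _ {b}".
_REPEATED_SCRIPT = re.compile(r"([\^_])\s*(\{[^{}]*\})\s*\1\s*(\{[^{}]*\})")


def repair_repeated_scripts(body: str) -> str:
    """Insert an empty group between consecutive same-direction scripts."""
    previous = None
    while previous != body:  # repeat until stable to handle chains of three or more
        previous = body
        body = _REPEATED_SCRIPT.sub(r"\1 \2 {} \1 \3", body)
    return body


def repair_truncated_array_end(body: str) -> str:
    """Rewrite a truncated array end tag only when begin/end counts differ."""
    begins = body.count(r"\begin{array}")
    ends = body.count(r"\end{array}")
    if begins <= ends or r"\end{a}" not in body:
        return body  # balanced (or nothing to fix): leave the expression alone
    # Replace at most as many truncated endings as there are unmatched begins.
    return body.replace(r"\end{a}", r"\end{array}", begins - ends)
```

Neither function validates its own output; under the contract a candidate produced this way would still have to pass the local checker before it replaces anything.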
|
||||
|
||||
Provenance:
|
||||
|
||||
- Applied repairs produce `MATH_RENDER_REPAIRED` info warnings.
|
||||
- Successfully repaired expressions must not count as `math_render_error_count`.
|
||||
- Unrepaired expressions keep the original `MATH_RENDER_FAILED` warning behavior.
|
||||
- The report remains derived from metadata and local quality checks.
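One way to picture the provenance rules is a summary that counts the two warning codes separately. The `QualityWarning` record, its `severity` values, and the `math_render_repaired_count` key are assumptions made for illustration; only the warning codes and `math_render_error_count` appear in this contract.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityWarning:
    """Hypothetical warning record; the project's real type may differ."""

    code: str
    severity: str  # assumed values, e.g. "warning" or "info"


def summarize_math_warnings(warnings: list[QualityWarning]) -> dict[str, int]:
    counts = Counter(w.code for w in warnings)
    return {
        # Only unresolved failures count as render errors.
        "math_render_error_count": counts["MATH_RENDER_FAILED"],
        # Repairs keep their own tally, so a changed formula stays visible.
        "math_render_repaired_count": counts["MATH_RENDER_REPAIRED"],
    }
```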
|
||||
|
||||
## Architecture Plan
|
||||
|
||||
### WP11.1: Failed Math Detail Capture
|
||||
|
||||
Actions:
|
||||
|
||||
- Add a project-owned result type that can include failed `MathExpression` records and checker messages.
|
||||
- Preserve the current `check_math_renderability()` return behavior for existing callers.
|
||||
- Keep expression extraction outside fenced code and inline code.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Conversion can access failed expression spans without parsing warning message text.
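Keeping extraction outside fenced and inline code could look roughly like the sketch below, which masks code with spaces so span offsets still refer to the original Markdown. It is deliberately simplified and is not the project's extractor: `find_math_spans` is a hypothetical name, only backtick fences are handled, and escaped dollar signs are ignored.

```python
import re

FENCE_MARK = "`" * 3  # built programmatically to keep this example fence-safe
_FENCE = re.compile(
    rf"^{FENCE_MARK}.*?^{FENCE_MARK}[^\n]*$", re.DOTALL | re.MULTILINE
)
_INLINE_CODE = re.compile(r"`[^`\n]+`")
# Display math first, so "$$...$$" is not read as two inline delimiters.
_MATH = re.compile(r"\$\$(.+?)\$\$|\$([^$\n]+)\$", re.DOTALL)


def _mask(match: re.Match[str]) -> str:
    # Same-length blanks keep every later offset aligned with the original text.
    return " " * len(match.group(0))


def find_math_spans(markdown: str) -> list[tuple[int, int, bool]]:
    """Return (start, end, is_display) for math spans outside code."""
    masked = _INLINE_CODE.sub(_mask, _FENCE.sub(_mask, markdown))
    return [
        (m.start(), m.end(), m.group(1) is not None)
        for m in _MATH.finditer(masked)
    ]
```

Because the offsets index the original string, a failed expression can be reported by span rather than by parsing warning text.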
|
||||
|
||||
### WP11.2: Repair Module
|
||||
|
||||
Actions:
|
||||
|
||||
- Add `src/pdf2md/math_repair.py`.
|
||||
- Define repair result records.
|
||||
- Generate candidates only for failed expressions.
|
||||
- Revalidate candidates through the injected checker.
|
||||
- Apply replacements from right to left so Markdown spans remain stable.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Pure string-level repair behavior that is deterministic, local-only, and independently testable.
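The right-to-left rule from the actions above can be sketched like this. `SpanReplacement` and `apply_replacements` are hypothetical names used for illustration, and the sketch assumes the spans do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanReplacement:
    """Hypothetical record: replace markdown[start:end] with `text`."""

    start: int
    end: int
    text: str


def apply_replacements(markdown: str, replacements: list[SpanReplacement]) -> str:
    # Work from the highest start offset down, so an edit never shifts the
    # offsets of spans that are still waiting to be replaced.
    for item in sorted(replacements, key=lambda r: r.start, reverse=True):
        markdown = markdown[: item.start] + item.text + markdown[item.end :]
    return markdown
```

Applying the same replacements left to right would require adjusting every later offset by the length difference of each earlier edit; processing from the end avoids that bookkeeping.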
|
||||
|
||||
### WP11.3: Conversion And Recheck Integration
|
||||
|
||||
Actions:
|
||||
|
||||
- Route `convert` normalized Markdown through repair before final metadata/report construction.
|
||||
- Route `recheck` Markdown through the same repair path before rewriting metadata/report.
|
||||
- Re-run final quality checks after any repair.
|
||||
- Preserve asset checking and strict-local behavior unchanged.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Fresh conversions and rechecks both benefit from MathJax warning mitigation.
|
||||
|
||||
### WP11.4: Tests
|
||||
|
||||
Default tests:
|
||||
|
||||
- Repeated superscripts are repaired only when the original expression failed.
|
||||
- `\end{a}` repairs to `\end{array}` only when array environments are unbalanced.
|
||||
- A candidate that still fails is not written back.
|
||||
- Passing expressions are not changed.
|
||||
- Conversion writes repaired Markdown only after candidate revalidation.
|
||||
- Recheck can repair an existing Markdown output and regenerate metadata/report.
|
||||
- Existing unavailable-checker behavior remains nonfatal.
|
||||
|
||||
Optional local validation:
|
||||
|
||||
- Run `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite`.
|
||||
- Confirm the generated report has `Math render error count: 0` for the requested sample, or record any remaining failures exactly.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
|
||||
- `pdf2md convert` and `pdf2md recheck` share the same repair behavior.
|
||||
- MathJax failed spans are repaired only after candidate revalidation succeeds.
|
||||
- Successfully repaired formulas remain visible through `MATH_RENDER_REPAIRED` info warnings.
|
||||
- Existing strict-local and MinerU-only constraints are unchanged.
|
||||
- `samples/MITC공부.pdf` is validated locally as requested, with generated outputs kept ignored under `outputs/`.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
- Repair changes a math span that did not fail initial MathJax validation.
|
||||
- Repair drops an entire formula or removes meaningful LaTeX tokens solely to silence warnings.
|
||||
- Repair claims success without re-running the local checker on the candidate.
|
||||
- `convert` or `recheck` starts requiring MathJax when it was previously optional.
|
||||
- Default tests require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
|
||||
- `samples/` or generated `outputs/` files are committed.
|
||||
|
||||
## Verification Commands
|
||||
|
||||
```powershell
uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py
uv run pytest
uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite
git diff --check
git status --short --untracked-files=all
```
|
||||
|
||||
## Handoff Requirements
|
||||
|
||||
After implementation:
|
||||
|
||||
- Update `PROGRESS.md` with files changed, commands run, test outcomes, sample validation outcome, known failures, and next action.
|
||||
- Keep sample PDFs and generated outputs out of the commit.
|
||||
- Commit the completed sprint if verification passes.
|
||||
|
||||
## Implementation Handoff
|
||||
|
||||
- Files changed: `src/pdf2md/quality.py`, `src/pdf2md/math_repair.py`, `src/pdf2md/conversion.py`, `src/pdf2md/ir.py`, tests, `ARCHITECTURE.md`, `docs/V1IMPLEMENTATIONPLAN.md`, `PLAN.md`, and `PROGRESS.md`.
|
||||
- Default verification: `uv run pytest` passed 172 tests with 1 skipped.
|
||||
- Targeted verification: `uv run pytest tests/test_quality.py tests/test_math_repair.py tests/test_conversion.py tests/test_cli.py tests/test_report.py` passed 56 tests.
|
||||
- Requested sample verification: `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite` succeeded; final report shows `Math render error count: 0` and two `MATH_RENDER_REPAIRED` info warnings.
|
||||
- Known failures: none.
|
||||
- Residual risk: repair rules are deliberately narrow; future PDFs may expose MathJax failures that should remain warnings until a deterministic rule is added and tested.
|
||||
- Next action: optional Obsidian visual review or additional sample validation.
|
||||
@@ -4,7 +4,7 @@ Last updated: 2026-05-08
|
||||
|
||||
This document is the implementation plan for v1. It does not replace `PRD.md` or `ARCHITECTURE.md`; use those files as the source of product requirements and system design. This plan explains the order of work, sprint contracts, verification gates, and agent ownership for implementing the converter.
|
||||
|
||||
Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents.
|
||||
Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents. Sprint 11 implemented conservative MathJax warning mitigation for failed math spans.
|
||||
|
||||
## 1. V1 Outcome
|
||||
|
||||
@@ -599,6 +599,48 @@ Hard failure criteria:
|
||||
- Chunk outputs are merged.
|
||||
- Default tests require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
|
||||
### Sprint 11: MathJax Warning Mitigation
|
||||
|
||||
Active contract:
|
||||
|
||||
- `docs/Sprints/SPRINT11CONTRACT.md`
|
||||
|
||||
Status:
|
||||
|
||||
- Implemented.
|
||||
|
||||
Objective:
|
||||
|
||||
- Repair narrow MathJax-invalid formula artifacts after initial local validation and before final output writing.
|
||||
|
||||
Touched surfaces:
|
||||
|
||||
- `quality.py`
|
||||
- `math_repair.py`
|
||||
- `conversion.py`
|
||||
- `ir.py`
|
||||
- Unit tests for quality details, repair rules, conversion, and recheck behavior
|
||||
|
||||
Expected outputs:
|
||||
|
||||
- Failed math expression records expose body, display mode, span, and checker message.
|
||||
- Repair candidates are generated only for failed math spans.
|
||||
- Repeated same-direction scripts are disambiguated with an empty group.
|
||||
- Truncated `\end{a}` array endings are repaired when array environments are unbalanced.
|
||||
- `convert` and `recheck` share the same repair behavior.
|
||||
- Applied repairs are recorded as `MATH_RENDER_REPAIRED` info warnings and do not count as math render errors.
|
||||
|
||||
Verification checks:
|
||||
|
||||
- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
|
||||
- `samples/MITC공부.pdf` validates locally with `Math render error count: 0`.
|
||||
|
||||
Hard failure criteria:
|
||||
|
||||
- Repair changes math spans that did not fail local MathJax validation.
|
||||
- Repair claims success without candidate revalidation.
|
||||
- Repair introduces remote services, alternate engines, or mandatory sample-dependent default tests.
|
||||
|
||||
## 6. Cross-Cutting Acceptance Criteria
|
||||
|
||||
Every implementation sprint must preserve these acceptance criteria:
|
@@ -645,7 +687,7 @@ Handoff fields:
 - MinerU 3.1.0 install and CLI behavior are source-verified, but real local output still needs a later local probe before release.
 - GTX 1070 Ti 8GB is visible locally, but it is Pascal compute capability 6.1; `doctor` and setup docs must make CUDA/PyTorch limits clear.
 - `uv` is installed per-user at `C:\Users\user\.local\bin`, but a new shell may need PATH refresh before `uv` is visible.
-- Formula renderability checks need a local tool choice; the implementation should start with an interface and graceful unavailable-tool warning if needed.
+- Formula renderability checks and conservative warning mitigation are implemented, but formula reconstruction remains best effort and should keep warnings/provenance visible.
 - Some PDFs will have tables or formulas that cannot be faithfully represented in Markdown; metadata and `.report.md` must surface this instead of hiding it.
 - Redistribution license obligations must be reviewed before packaging, redistribution, or bundling model weights.

@@ -7,6 +7,7 @@ from pdf2md.conversion import (
     ConversionResult,
     convert_input,
     convert_pdf,
+    recheck_markdown,
 )

 __version__ = "0.1.0"
@@ -19,4 +20,5 @@ __all__ = [
     "__version__",
     "convert_input",
     "convert_pdf",
+    "recheck_markdown",
 ]
@@ -7,7 +7,7 @@ import sys
 from collections.abc import Sequence

 from pdf2md import __version__
-from pdf2md.conversion import DEFAULT_CHUNK_PAGES, DEFAULT_GPU_DEVICE, ConversionAdapter, convert_input
+from pdf2md.conversion import DEFAULT_CHUNK_PAGES, DEFAULT_GPU_DEVICE, ConversionAdapter, convert_input, recheck_markdown
 from pdf2md.doctor import DoctorReport, format_doctor_report, run_doctor
 from pdf2md.mineru_adapter import StrictLocalViolationError
 from pdf2md.paths import PathPlanningError
@@ -17,6 +17,7 @@ def main(
     argv: Sequence[str] | None = None,
     *,
     adapter: ConversionAdapter | None = None,
+    math_checker=None,
     clock=None,
     doctor_runner=None,
 ) -> int:
@@ -61,6 +62,8 @@ def main(
         default=True,
         help="Keep strict-local conversion policy enabled. Enabled by default.",
     )
+    recheck_parser = subparsers.add_parser("recheck", help="Re-run quality checks for an existing Markdown output.")
+    recheck_parser.add_argument("markdown", help="Existing Markdown output from pdf2md convert.")
     args = parser.parse_args(argv)

     if args.version:
@@ -72,6 +75,21 @@ def main(
         print(format_doctor_report(report))
         return report.exit_code

+    if args.command == "recheck":
+        try:
+            result = recheck_markdown(args.markdown, math_checker=math_checker, clock=clock)
+        except ValueError as error:
+            print(f"error: {error}", file=sys.stderr)
+            return 2
+        print(
+            "rechecked: "
+            f"{result.markdown_path} -> {result.metadata_path}, {result.report_path} "
+            f"({result.warning_count} warnings)"
+        )
+        print(f"status: {result.final_status}")
+        print(f"warnings: {result.warning_count}")
+        return 1 if not result.succeeded else 0
+
     if args.command != "convert":
         parser.print_help()
         return 0
@@ -88,6 +106,7 @@ def main(
             gpu=args.gpu,
             strict_local=args.strict_local,
             adapter=adapter,
+            math_checker=math_checker,
             clock=clock,
         )
     except (PathPlanningError, StrictLocalViolationError, ValueError) as error:
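
The `recheck` branch above maps outcomes to exit codes: 0 when the recheck succeeded, 1 when it completed but did not succeed, and 2 when a `ValueError` is raised (for example a missing Markdown file). The following sketch reproduces only that mapping with `argparse`; `RecheckOutcome`, `run_cli`, and the injected `recheck` callable are hypothetical stand-ins, not the project's API.

```python
from __future__ import annotations

import argparse
import sys
from collections.abc import Callable, Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class RecheckOutcome:
    """Hypothetical stand-in for the result object returned by a recheck."""

    succeeded: bool
    warning_count: int


def run_cli(argv: Sequence[str], recheck: Callable[[str], RecheckOutcome]) -> int:
    parser = argparse.ArgumentParser(prog="pdf2md-sketch")
    subparsers = parser.add_subparsers(dest="command")
    recheck_parser = subparsers.add_parser("recheck")
    recheck_parser.add_argument("markdown")
    args = parser.parse_args(argv)

    if args.command != "recheck":
        parser.print_help()
        return 0
    try:
        outcome = recheck(args.markdown)
    except ValueError as error:
        # Invalid input is reported on stderr with a distinct exit code.
        print(f"error: {error}", file=sys.stderr)
        return 2
    print(f"warnings: {outcome.warning_count}")
    return 0 if outcome.succeeded else 1
```

Injecting the `recheck` callable mirrors how `main` accepts `math_checker` and `clock` keyword arguments so that tests can run without real tooling.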
@@ -11,7 +11,7 @@ from collections.abc import Callable
 from dataclasses import dataclass, replace
 from datetime import datetime, timezone
 from pathlib import Path, PurePosixPath
-from typing import Protocol
+from typing import Any, Protocol

 from pdf2md.ir import (
     AssetRecord,
@@ -25,6 +25,7 @@ from pdf2md.ir import (
 )
 from pdf2md.markdown import normalize_markdown
 from pdf2md.math_render import create_default_math_checker
+from pdf2md.math_repair import repair_math_render_failures
 from pdf2md.metadata import build_metadata
 from pdf2md.mineru_adapter import (
     ENGINE_NAME,
@@ -35,7 +36,7 @@ from pdf2md.mineru_adapter import (
 )
 from pdf2md.paths import DiscoveredPdf, PathLike, PlannedOutput, discover_pdfs, plan_outputs
 from pdf2md.pdf_splitter import PdfChunkPlan, plan_pdf_chunks, write_pdf_chunk
-from pdf2md.quality import MathChecker, QualityResult, check_asset_links, check_math_renderability, merge_quality_results
+from pdf2md.quality import MathChecker, QualityResult, check_asset_links, check_math_renderability_details, merge_quality_results
 from pdf2md.report import FinalStatus, determine_final_status, render_report


@@ -101,9 +102,23 @@ class _ConversionTask:
     original_source_sha256: str | None = None


+@dataclass(frozen=True)
+class _PreparedMarkdown:
+    markdown: str
+    quality: QualityResult
+
+
 _IMAGE_LINK_RE = re.compile(r"!\[(?P<alt>[^\]\n]*)\]\((?P<target>[^)\n]+)\)")
 _DISPLAY_MATH_RE = re.compile(r"(?<!\\)\$\$(?P<body>.*?)(?<!\\)\$\$", re.DOTALL)
 _INLINE_MATH_RE = re.compile(r"(?<!\\)\$(?P<body>[^\n$]+?)(?<!\\)\$")
+_RECHECKED_WARNING_CODES = frozenset(
+    {
+        WarningCode.MATH_RENDER_FAILED,
+        WarningCode.MATH_RENDER_REPAIRED,
+        WarningCode.ASSET_LINK_MISSING,
+        WarningCode.ASSET_LINK_INVALID,
+    }
+)


 def convert_pdf(
@@ -212,6 +227,178 @@ def convert_input(
     )


+def recheck_markdown(
+    markdown_path: PathLike,
+    *,
+    math_checker: MathChecker | None = None,
+    clock: Clock | None = None,
+) -> ConversionResult:
+    """Re-run local quality checks for an existing Markdown output and rewrite metadata/report."""
+
+    markdown_file = Path(markdown_path).expanduser().resolve()
+    if not markdown_file.is_file():
+        raise ValueError(f"Markdown output does not exist: {markdown_file}")
+
+    metadata_path = markdown_file.with_suffix(".metadata.json")
+    report_path = markdown_file.with_suffix(".report.md")
+    if not metadata_path.is_file():
+        raise ValueError(f"Existing metadata JSON is required for recheck: {metadata_path}")
+
+    existing_metadata = _read_metadata_json(metadata_path)
+    markdown = markdown_file.read_text(encoding="utf-8")
+    assets_dir = markdown_file.with_suffix(".assets")
+    assets = _assets_from_metadata(existing_metadata)
+    prepared = _prepare_markdown_for_output(
+        markdown,
+        markdown_dir=markdown_file.parent,
+        asset_root=assets_dir,
+        math_checker=math_checker,
+    )
+    markdown = prepared.markdown
+    quality = prepared.quality
+    warnings = _preserved_metadata_warnings(existing_metadata) + quality.warnings
+    document = _build_document(
+        source_pdf=Path(_metadata_text(existing_metadata, "source_pdf")),
+        markdown=markdown,
+        assets=assets,
+        warnings=warnings,
+        raw_structured={"pages": [None] * _metadata_page_count(existing_metadata)},
+    )
+    now = clock or _utc_now
+    metadata_data = build_metadata(
+        document=document,
+        source_sha256=_metadata_text(existing_metadata, "source_sha256"),
+        created_at=_format_timestamp(now()),
+        engine=_metadata_text(existing_metadata, "engine"),
+        engine_version=_metadata_text(existing_metadata, "engine_version"),
+        engine_options=_metadata_engine_options(existing_metadata),
+    )
+    report_quality = QualityResult(
+        missing_asset_link_count=quality.missing_asset_link_count,
+        invalid_asset_link_count=quality.invalid_asset_link_count,
+    )
+    report_text = render_report(
+        metadata_data,
+        quality=report_quality,
+        markdown_path=markdown_file,
+        metadata_path=metadata_path,
+        report_path=report_path,
+    )
+    final_status = determine_final_status(metadata_data, report_quality)
+
+    _write_text(markdown_file, markdown)
+    _write_text(metadata_path, json.dumps(metadata_data, indent=2, ensure_ascii=False, sort_keys=True) + "\n")
+    _write_text(report_path, report_text)
+
+    return ConversionResult(
+        source_pdf=Path(_metadata_text(metadata_data, "source_pdf")),
+        markdown_path=markdown_file,
+        metadata_path=metadata_path,
+        report_path=report_path,
+        assets_dir=assets_dir,
+        raw_dir=None,
+        engine=_metadata_text(metadata_data, "engine"),
+        engine_version=_metadata_text(metadata_data, "engine_version"),
+        final_status=final_status,
+        warning_count=len(warnings),
+        warnings=warnings,
+        pages_processed=int(metadata_data["summary"]["pages_processed"]),
+    )
+
+
+def _read_metadata_json(path: Path) -> dict[str, Any]:
+    data = json.loads(path.read_text(encoding="utf-8"))
+    if not isinstance(data, dict):
+        raise ValueError(f"metadata JSON must contain an object: {path}")
+    return data
+
+
+def _assets_from_metadata(metadata: dict[str, Any]) -> tuple[AssetRecord, ...]:
+    raw_assets = metadata.get("assets", ())
+    if not isinstance(raw_assets, list):
+        return ()
+    assets: list[AssetRecord] = []
+    for item in raw_assets:
+        if not isinstance(item, dict):
+            continue
+        relative_path = item.get("relative_path")
+        if not isinstance(relative_path, str) or not relative_path:
+            continue
+        assets.append(
+            AssetRecord(
+                relative_path,
+                page_index=_optional_page_index(item.get("page_index")),
+                bbox=_optional_bbox(item.get("bbox")),
+            )
+        )
+    return tuple(assets)
+
+
+def _preserved_metadata_warnings(metadata: dict[str, Any]) -> tuple[WarningRecord, ...]:
+    raw_warnings = metadata.get("warnings", ())
+    if not isinstance(raw_warnings, list):
+        return ()
+    warnings: list[WarningRecord] = []
+    for item in raw_warnings:
+        if not isinstance(item, dict):
+            continue
+        warning = _warning_from_metadata(item)
+        if warning is not None and warning.code not in _RECHECKED_WARNING_CODES:
+            warnings.append(warning)
+    return tuple(warnings)
+
+
+def _warning_from_metadata(item: dict[str, Any]) -> WarningRecord | None:
+    code = item.get("code")
+    severity = item.get("severity")
+    message = item.get("message")
+    if not isinstance(code, str) or not isinstance(severity, str) or not isinstance(message, str) or not message:
+        return None
+    return WarningRecord(
+        WarningCode(code),
+        WarningSeverity(severity),
+        message,
+        page_index=_optional_page_index(item.get("page_index")),
+        bbox=_optional_bbox(item.get("bbox")),
+    )
+
+
+def _metadata_text(metadata: dict[str, Any], field_name: str) -> str:
+    value = metadata.get(field_name)
+    if not isinstance(value, str) or not value:
+        raise ValueError(f"metadata field is required: {field_name}")
+    return value
+
+
+def _metadata_engine_options(metadata: dict[str, Any]) -> dict[str, Any]:
+    value = metadata.get("engine_options", {})
+    return dict(value) if isinstance(value, dict) else {}
+
+
+def _metadata_page_count(metadata: dict[str, Any]) -> int:
+    pages = metadata.get("pages")
+    if isinstance(pages, list) and pages:
+        return len(pages)
+    summary = metadata.get("summary")
+    if isinstance(summary, dict):
+        pages_processed = summary.get("pages_processed")
+        if isinstance(pages_processed, int) and pages_processed > 0:
+            return pages_processed
+    return 1
+
+
+def _optional_page_index(value: object) -> int | None:
+    return value if isinstance(value, int) and value >= 0 else None
+
+
+def _optional_bbox(value: object) -> tuple[float, float, float, float] | None:
+    if not isinstance(value, list | tuple) or len(value) != 4:
+        return None
+    if not all(isinstance(part, int | float) for part in value):
+        return None
+    return tuple(float(part) for part in value)
+
+
 def _plan_conversion_tasks(
     discovered: tuple[DiscoveredPdf, ...],
     output_dir: PathLike,
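
`recheck_markdown` locates the metadata, report, and assets paths next to the Markdown file with `Path.with_suffix`. A short sketch of that derivation is shown here; `sibling_outputs` is a hypothetical helper name, and only the standard `pathlib` behavior is assumed.

```python
from pathlib import Path


def sibling_outputs(markdown_path: str) -> dict[str, Path]:
    """Derive the adjacent output paths from a Markdown path."""
    markdown_file = Path(markdown_path)
    return {
        # with_suffix replaces only the final suffix (".md" here).
        "metadata": markdown_file.with_suffix(".metadata.json"),
        "report": markdown_file.with_suffix(".report.md"),
        "assets": markdown_file.with_suffix(".assets"),
    }
```

Because only the last suffix is replaced, a stem that itself contains dots keeps them: `notes.v2.md` maps to `notes.v2.report.md`.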
@@ -465,16 +652,17 @@ def _convert_in_work_dir(
         asset_root=plan.assets_dir,
         check_assets=False,
     )
-    quality = _run_quality_checks(
+    prepared = _prepare_markdown_for_output(
         normalized.markdown,
         markdown_dir=plan.markdown_path.parent,
         asset_root=plan.assets_dir,
         math_checker=math_checker,
     )
+    quality = prepared.quality
     warnings = adapter_result.warnings + assets.warnings + normalized.warnings + quality.warnings
     document = _build_document(
         source_pdf=metadata_source,
-        markdown=normalized.markdown,
+        markdown=prepared.markdown,
         assets=assets.records,
         warnings=warnings,
         raw_structured=adapter_result.raw_structured,
@@ -503,7 +691,7 @@ def _convert_in_work_dir(
     )
     final_status = determine_final_status(metadata_data, report_quality)

-    _write_text(plan.markdown_path, normalized.markdown)
+    _write_text(plan.markdown_path, prepared.markdown)
     if metadata_enabled and plan.metadata_path is not None:
         _write_text(plan.metadata_path, json.dumps(metadata_data, indent=2, ensure_ascii=False, sort_keys=True) + "\n")
     _write_text(plan.report_path, report_text)
@@ -648,10 +836,44 @@ def _run_quality_checks(
         return asset_quality
     if math_checker is None:
         math_checker = create_default_math_checker()
-    math_quality = check_math_renderability(markdown, math_checker)
+    math_quality = check_math_renderability_details(markdown, math_checker).quality
     return merge_quality_results(asset_quality, math_quality)


+def _prepare_markdown_for_output(
+    markdown: str,
+    *,
+    markdown_dir: Path,
+    asset_root: Path,
+    math_checker: MathChecker | None,
+) -> _PreparedMarkdown:
+    asset_quality = check_asset_links(markdown, markdown_dir=markdown_dir, asset_root=asset_root)
+    if not _has_math(markdown):
+        return _PreparedMarkdown(markdown=markdown, quality=asset_quality)
+
+    checker = math_checker if math_checker is not None else create_default_math_checker()
+    math_details = check_math_renderability_details(markdown, checker)
+    initial_quality = merge_quality_results(asset_quality, math_details.quality)
+    if checker is None or not math_details.failures:
+        return _PreparedMarkdown(markdown=markdown, quality=initial_quality)
+
+    repair_result = repair_math_render_failures(markdown, math_details.failures, checker)
+    if not repair_result.repairs:
+        return _PreparedMarkdown(markdown=markdown, quality=initial_quality)
+
+    repaired_quality = _run_quality_checks(
+        repair_result.markdown,
+        markdown_dir=markdown_dir,
+        asset_root=asset_root,
+        math_checker=checker,
+    )
+    repair_quality = QualityResult(warnings=repair_result.warnings)
+    return _PreparedMarkdown(
+        markdown=repair_result.markdown,
+        quality=merge_quality_results(repaired_quality, repair_quality),
+    )
+
+
 def _has_math(markdown: str) -> bool:
     return _DISPLAY_MATH_RE.search(markdown) is not None or _INLINE_MATH_RE.search(markdown) is not None

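
`_prepare_markdown_for_output` skips the math checker entirely when `_has_math` finds no math span. The sketch below isolates that detection using the same two patterns shown in this diff; the public names `DISPLAY_MATH_RE`, `INLINE_MATH_RE`, and `has_math` are illustrative. The negative lookbehind `(?<!\\)` means escaped dollar signs do not count as delimiters.

```python
import re

# Same patterns as _DISPLAY_MATH_RE and _INLINE_MATH_RE in the diff above.
DISPLAY_MATH_RE = re.compile(r"(?<!\\)\$\$(?P<body>.*?)(?<!\\)\$\$", re.DOTALL)
INLINE_MATH_RE = re.compile(r"(?<!\\)\$(?P<body>[^\n$]+?)(?<!\\)\$")


def has_math(markdown: str) -> bool:
    """Return True when the text contains a display or inline math span."""
    return DISPLAY_MATH_RE.search(markdown) is not None or INLINE_MATH_RE.search(markdown) is not None
```

Text such as `Costs \$5 and \$6` is therefore treated as math-free, so currency amounts written with escaped dollar signs would not trigger a checker run under this detection.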
@@ -33,6 +33,7 @@ class WarningCode(StrEnum):
     GPU_UNAVAILABLE = "GPU_UNAVAILABLE"
     LOW_CONFIDENCE_FORMULA = "LOW_CONFIDENCE_FORMULA"
     MATH_RENDER_FAILED = "MATH_RENDER_FAILED"
+    MATH_RENDER_REPAIRED = "MATH_RENDER_REPAIRED"
     ASSET_LINK_MISSING = "ASSET_LINK_MISSING"
     ASSET_LINK_INVALID = "ASSET_LINK_INVALID"
     READING_ORDER_UNCERTAIN = "READING_ORDER_UNCERTAIN"
@@ -0,0 +1,165 @@
+"""Conservative repairs for MathJax-invalid Markdown math spans."""
+
+from __future__ import annotations
+
+import re
+from dataclasses import dataclass
+
+from pdf2md.ir import WarningCode, WarningRecord, WarningSeverity
+from pdf2md.quality import (
+    MathChecker,
+    MathCheckerUnavailable,
+    MathCheckResult,
+    MathExpression,
+    MathRenderFailure,
+)
+
+
+@dataclass(frozen=True)
+class MathRepair:
+    expression_index: int
+    rule: str
+    original_body: str
+    repaired_body: str
+    markdown_span: tuple[int, int]
+
+
+@dataclass(frozen=True)
+class MathRepairResult:
+    markdown: str
+    repairs: tuple[MathRepair, ...] = ()
+    warnings: tuple[WarningRecord, ...] = ()
+
+
+@dataclass(frozen=True)
+class _Candidate:
+    body: str
+    rule: str
+
+
+_SCRIPT_RE = re.compile(
+    r"(?P<script>[\^_])(?P<first_arg>\s*\{[^{}]*\})(?P<space>\s+)(?P=script)(?P<second_arg>\s*\{)"
+)
+
+
+def repair_math_render_failures(
+    markdown: str,
+    failures: tuple[MathRenderFailure, ...],
+    checker: MathChecker,
+) -> MathRepairResult:
+    """Repair failed math spans only when a candidate passes the same checker."""
+
+    if not failures:
+        return MathRepairResult(markdown)
+
+    replacements: list[tuple[tuple[int, int], str]] = []
+    repairs: list[MathRepair] = []
+    warnings: list[WarningRecord] = []
+
+    for failure in sorted(failures, key=lambda item: item.expression.markdown_span[0], reverse=True):
+        expression = failure.expression
+        candidate = _first_valid_candidate(expression, checker)
+        if candidate is None:
+            continue
+
+        replacements.append((expression.markdown_span, _format_math_span(candidate.body, expression.display)))
+        repair = MathRepair(
+            expression_index=expression.index,
+            rule=candidate.rule,
+            original_body=expression.body,
+            repaired_body=candidate.body,
+            markdown_span=expression.markdown_span,
+        )
+        repairs.append(repair)
+        warnings.append(
+            WarningRecord(
+                WarningCode.MATH_RENDER_REPAIRED,
+                WarningSeverity.INFO,
+                f"Math expression {expression.index} was repaired by {candidate.rule}.",
+            )
+        )
+
+    repaired = markdown
+    for span, replacement in replacements:
+        start, end = span
+        repaired = repaired[:start] + replacement + repaired[end:]
+
+    return MathRepairResult(markdown=repaired, repairs=tuple(reversed(repairs)), warnings=tuple(reversed(warnings)))
+
+
+def _first_valid_candidate(expression: MathExpression, checker: MathChecker) -> _Candidate | None:
+    for candidate in _repair_candidates(expression.body):
+        if candidate.body != expression.body and _candidate_passes(candidate.body, expression.display, checker):
+            return candidate
+    return None
+
+
+def _repair_candidates(body: str) -> tuple[_Candidate, ...]:
+    candidates: list[_Candidate] = []
+    seen: set[str] = {body}
+
+    repeated_script = _repair_repeated_scripts(body)
+    _append_candidate(candidates, seen, repeated_script, "repeated_script")
+
+    truncated_array = _repair_truncated_array_end(body)
+    _append_candidate(candidates, seen, truncated_array, "truncated_array_end")
+
+    combined = _repair_truncated_array_end(repeated_script)
+    _append_candidate(candidates, seen, combined, "combined")
+
+    return tuple(candidates)
+
+
+def _append_candidate(candidates: list[_Candidate], seen: set[str], body: str, rule: str) -> None:
+    if body not in seen:
+        candidates.append(_Candidate(body=body, rule=rule))
+        seen.add(body)
+
+
+def _repair_repeated_scripts(body: str) -> str:
+    def replace(match: re.Match[str]) -> str:
+        script = match.group("script")
+        return (
+            f"{script}{match.group('first_arg')}"
+            f"{match.group('space')}{{}} {script}{match.group('second_arg')}"
+        )
+
+    return _SCRIPT_RE.sub(replace, body)
+
+
+def _repair_truncated_array_end(body: str) -> str:
+    if r"\end{a}" not in body:
+        return body
+    if body.count(r"\begin{array}") <= body.count(r"\end{array}"):
+        return body
+    return body.replace(r"\end{a}", r"\end{array}")
+
+
+def _candidate_passes(body: str, display: bool, checker: MathChecker) -> bool:
+    expression = MathExpression(index=0, body=body, display=display, markdown_span=(0, 0))
+    try:
+        batch_checker = getattr(checker, "check_expressions", None)
+        if callable(batch_checker):
+            raw_results = batch_checker((expression,))
+            if not isinstance(raw_results, tuple | list) or len(raw_results) != 1:
+                return False
+            result = _coerce_result(raw_results[0])
+        else:
+            result = _coerce_result(checker(body))
+    except MathCheckerUnavailable:
+        return False
+    return result.ok
+
+
+def _coerce_result(value: bool | MathCheckResult) -> MathCheckResult:
+    if isinstance(value, bool):
+        return MathCheckResult(ok=value)
+    if isinstance(value, MathCheckResult):
+        return value
+    return MathCheckResult(ok=False)
+
+
+def _format_math_span(body: str, display: bool) -> str:
+    if display:
+        return f"$$\n{body.strip()}\n$$"
+    return f"${body.strip()}$"
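
`repair_math_render_failures` walks failures from the last span to the first before splicing replacements into the Markdown. The reason is that a replacement can change the text length, which would shift every offset to its right; working right to left keeps the remaining, earlier spans valid. A minimal sketch of that idea follows, assuming non-overlapping `(start, end)` spans; `apply_replacements` is a hypothetical name.

```python
def apply_replacements(text: str, replacements: list[tuple[tuple[int, int], str]]) -> str:
    """Apply (span, replacement) pairs from the rightmost span to the leftmost."""
    result = text
    ordered = sorted(replacements, key=lambda item: item[0][0], reverse=True)
    for (start, end), replacement in ordered:
        # Only text at or after `start` changes, so smaller offsets stay correct.
        result = result[:start] + replacement + result[end:]
    return result
```

For example, replacing the spans `(2, 5)` and `(8, 11)` in `a $x$ b $y$ c` gives the expected result even when the first replacement is longer than the text it replaces.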
@@ -24,6 +24,12 @@ class MathExpression:
     markdown_span: tuple[int, int]


+@dataclass(frozen=True)
+class MathRenderFailure:
+    expression: MathExpression
+    message: str = ""
+
+
 MathChecker = Callable[[str], bool | MathCheckResult]


@@ -39,6 +45,12 @@ class QualityResult:
         return self.missing_asset_link_count + self.invalid_asset_link_count + self.math_render_error_count


+@dataclass(frozen=True)
+class MathRenderabilityResult:
+    quality: QualityResult
+    failures: tuple[MathRenderFailure, ...] = ()
+
+
 class MathCheckerUnavailable(RuntimeError):
     """Raised by a local math checker when renderability cannot be checked."""

@@ -95,15 +107,22 @@ def check_asset_links(
 def check_math_renderability(markdown: str, checker: MathChecker | None = None) -> QualityResult:
     """Check math renderability through an injected local checker."""

+    return check_math_renderability_details(markdown, checker).quality
+
+
+def check_math_renderability_details(markdown: str, checker: MathChecker | None = None) -> MathRenderabilityResult:
+    """Check math renderability and return failed expression records."""
+
     if not isinstance(markdown, str):
         raise TypeError("markdown must be a string")

     expressions = extract_math_expressions(markdown)
     if not expressions:
-        return QualityResult()
+        return MathRenderabilityResult(QualityResult())

     if checker is None:
-        return QualityResult(
+        return MathRenderabilityResult(
+            QualityResult(
             warnings=(
                 WarningRecord(
                     WarningCode.MATH_RENDER_FAILED,
@@ -112,8 +131,10 @@ def check_math_renderability(markdown: str, checker: MathChecker | None = None)
                 ),
             )
+            )
         )

     warnings: list[WarningRecord] = []
+    failures: list[MathRenderFailure] = []
     failure_count = 0
     try:
         results = _check_expressions(expressions, checker)
@@ -122,6 +143,7 @@ def check_math_renderability(markdown: str, checker: MathChecker | None = None)
             message = result.message
             if not ok:
                 failure_count += 1
+                failures.append(MathRenderFailure(expression=expression, message=message))
                 details = f": {message}" if message else ""
                 kind = "display" if expression.display else "inline"
                 warnings.append(
@@ -131,7 +153,8 @@ def check_math_renderability(markdown: str, checker: MathChecker | None = None)
                     )
                 )
     except MathCheckerUnavailable as error:
-        return QualityResult(
+        return MathRenderabilityResult(
+            QualityResult(
             warnings=(
                 WarningRecord(
                     WarningCode.MATH_RENDER_FAILED,
@@ -140,8 +163,12 @@ def check_math_renderability(markdown: str, checker: MathChecker | None = None)
                 ),
             )
+            )
         )

-    return QualityResult(math_render_error_count=failure_count, warnings=tuple(warnings))
+    return MathRenderabilityResult(
+        QualityResult(math_render_error_count=failure_count, warnings=tuple(warnings)),
+        failures=tuple(failures),
+    )


 def merge_quality_results(*results: QualityResult) -> QualityResult:
@@ -16,8 +16,9 @@ from pdf2md.mineru_adapter import MinerUAdapterResult


 class FakeAdapter:
-    def __init__(self, *, succeeded: bool = True) -> None:
+    def __init__(self, *, succeeded: bool = True, raw_markdown: str | None = None) -> None:
         self.succeeded = succeeded
+        self.raw_markdown = raw_markdown
         self.calls: list[Path] = []
         self.options: list[object] = []

@@ -33,7 +34,7 @@ class FakeAdapter:
             command=("mineru", "-p", str(input_path), "-o", str(output_dir)),
             input_pdf=input_path,
             work_dir=output_dir,
-            raw_markdown=f"# {input_path.stem}\n" if self.succeeded else None,
+            raw_markdown=(self.raw_markdown or f"# {input_path.stem}\n") if self.succeeded else None,
             raw_structured={"pages": 1},
             asset_paths=(),
             warnings=() if self.succeeded else (warning,),
@@ -188,6 +189,32 @@ def test_cli_failure_summary_returns_nonzero(tmp_path: Path, capsys) -> None:
     assert not (tmp_path / "out" / "paper.md").exists()


+def test_cli_recheck_markdown_regenerates_adjacent_metadata_and_report(tmp_path: Path, capsys) -> None:
+    pdf = make_pdf(tmp_path, "paper.pdf")
+    out = tmp_path / "out"
+    adapter = FakeAdapter(raw_markdown="Inline \\(bad_math\\)\n")
+    assert (
+        main(
+            ["convert", str(pdf), "--out", str(out)],
+            adapter=adapter,
+            clock=fixed_clock,
+            math_checker=lambda _: False,
+        )
+        == 0
+    )
+    capsys.readouterr()
+
+    markdown_path = out / "paper.md"
+    markdown_path.write_text("Inline $x_i$\n", encoding="utf-8")
+    exit_code = main(["recheck", str(markdown_path)], clock=fixed_clock, math_checker=lambda _: True)
+
+    captured = capsys.readouterr()
+    assert exit_code == 0
+    assert "rechecked:" in captured.out
+    assert "warnings: 0" in captured.out
+    assert "- Final status: `success`" in (out / "paper.report.md").read_text(encoding="utf-8")
+
+
 def test_cli_preflight_conflict_fails_before_conversion(tmp_path: Path, capsys) -> None:
     pdf = make_pdf(tmp_path, "paper.pdf")
     out = tmp_path / "out"
@@ -9,10 +9,11 @@ import pytest
 from pypdf import PdfWriter

 import pdf2md.conversion as conversion_module
-from pdf2md.conversion import BatchConversionResult, convert_input, convert_pdf
+from pdf2md.conversion import BatchConversionResult, convert_input, convert_pdf, recheck_markdown
 from pdf2md.ir import WarningCode, WarningRecord, WarningSeverity
 from pdf2md.mineru_adapter import MinerUAdapterResult, StrictLocalViolationError
 from pdf2md.paths import OutputConflictError
+from pdf2md.quality import MathCheckResult


 class FakeAdapter:
@@ -230,6 +231,73 @@ def test_convert_pdf_records_math_checker_failures_in_metadata_and_report(tmp_pa
     assert "`MATH_RENDER_FAILED`" in report


+def test_convert_pdf_repairs_math_render_failure_before_writing_outputs(tmp_path: Path) -> None:
+    class RepairAwareChecker:
+        def check_expressions(self, expressions):
+            return tuple(MathCheckResult(ok="{} ^ {t}" in expression.body) for expression in expressions)
+
+    pdf = make_pdf(tmp_path)
+    adapter = FakeAdapter(raw_markdown="\\[x ^ {i} ^ {t}\\]\n")
+
+    result = convert_pdf(pdf, tmp_path / "out", adapter=adapter, math_checker=RepairAwareChecker(), clock=fixed_clock)
+
+    assert result.final_status == "partial"
+    assert result.markdown_path.read_text(encoding="utf-8") == "$$\nx ^ {i} {} ^ {t}\n$$"
+    assert [warning.code for warning in result.warnings] == [WarningCode.MATH_RENDER_REPAIRED]
+    metadata = json.loads(result.metadata_path.read_text(encoding="utf-8"))
+    assert metadata["summary"]["math_render_error_count"] == 0
+    assert metadata["warnings"][0]["code"] == "MATH_RENDER_REPAIRED"
+    report = result.report_path.read_text(encoding="utf-8")
+    assert "- Math render error count: 0" in report
+    assert "`MATH_RENDER_REPAIRED`" in report
+
+
+def test_recheck_markdown_regenerates_metadata_and_report_from_current_markdown(tmp_path: Path) -> None:
+    pdf = make_pdf(tmp_path)
+    adapter = FakeAdapter(raw_markdown="Inline \\(bad_math\\)\n")
+    result = convert_pdf(pdf, tmp_path / "out", adapter=adapter, math_checker=lambda _: False, clock=fixed_clock)
+
+    result.markdown_path.write_text("Inline $x_i$\n", encoding="utf-8")
+    rechecked = recheck_markdown(result.markdown_path, math_checker=lambda _: True, clock=fixed_clock)
+
+    assert rechecked.final_status == "success"
+    assert rechecked.warning_count == 0
+    assert rechecked.markdown_path == result.markdown_path
+    assert rechecked.metadata_path == result.metadata_path
+    assert rechecked.report_path == result.report_path
+    metadata = json.loads(result.metadata_path.read_text(encoding="utf-8"))
+    assert metadata["source_sha256"] == hashlib.sha256(pdf.read_bytes()).hexdigest()
+    assert metadata["created_at"] == "2026-05-08T00:00:00Z"
+    assert metadata["summary"]["pages_processed"] == 1
+    assert metadata["summary"]["inline_formula_count"] == 1
+    assert metadata["summary"]["math_render_error_count"] == 0
+    assert metadata["summary"]["warning_count"] == 0
+    assert metadata["warnings"] == []
+    report = result.report_path.read_text(encoding="utf-8")
+    assert "- Final status: `success`" in report
+    assert "- Math render error count: 0" in report
+    assert "- None" in report
+
+
+def test_recheck_markdown_repairs_math_render_failure(tmp_path: Path) -> None:
+    class RepairAwareChecker:
+        def check_expressions(self, expressions):
+            return tuple(MathCheckResult(ok="{} ^ {t}" in expression.body) for expression in expressions)
+
+    pdf = make_pdf(tmp_path)
+    adapter = FakeAdapter(raw_markdown="No formulas.\n")
+    result = convert_pdf(pdf, tmp_path / "out", adapter=adapter, math_checker=lambda _: True, clock=fixed_clock)
+    result.markdown_path.write_text("$$\nx ^ {i} ^ {t}\n$$\n", encoding="utf-8")
+
+    rechecked = recheck_markdown(result.markdown_path, math_checker=RepairAwareChecker(), clock=fixed_clock)
+
+    assert rechecked.markdown_path.read_text(encoding="utf-8") == "$$\nx ^ {i} {} ^ {t}\n$$\n"
+    assert [warning.code for warning in rechecked.warnings] == [WarningCode.MATH_RENDER_REPAIRED]
+    metadata = json.loads(result.metadata_path.read_text(encoding="utf-8"))
+    assert metadata["summary"]["math_render_error_count"] == 0
+    assert metadata["warnings"][0]["code"] == "MATH_RENDER_REPAIRED"
+
+
 def test_convert_pdf_records_unavailable_math_checker_for_math_output(tmp_path: Path, monkeypatch) -> None:
     pdf = make_pdf(tmp_path)
     adapter = FakeAdapter(raw_markdown="Inline \\(x\\)\n")
||||
@@ -0,0 +1,65 @@
from __future__ import annotations

from pdf2md.ir import WarningCode, WarningSeverity
from pdf2md.math_repair import repair_math_render_failures
from pdf2md.quality import MathCheckResult, MathRenderFailure, extract_math_expressions


class BodyChecker:
    def __init__(self, passing_fragment: str) -> None:
        self.passing_fragment = passing_fragment
        self.checked_bodies: list[str] = []

    def check_expressions(self, expressions):
        self.checked_bodies.extend(expression.body for expression in expressions)
        return tuple(MathCheckResult(ok=self.passing_fragment in expression.body) for expression in expressions)


def test_repair_math_render_failures_disambiguates_repeated_superscripts() -> None:
    markdown = "$$\nx ^ {i} ^ {t}\n$$\n"
    expression = extract_math_expressions(markdown)[0]
    failure = MathRenderFailure(expression=expression, message="Double exponent: use braces to clarify")
    checker = BodyChecker("{} ^ {t}")

    result = repair_math_render_failures(markdown, (failure,), checker)

    assert result.markdown == "$$\nx ^ {i} {} ^ {t}\n$$\n"
    assert result.repairs[0].rule == "repeated_script"
    assert result.warnings[0].code == WarningCode.MATH_RENDER_REPAIRED
    assert result.warnings[0].severity == WarningSeverity.INFO


def test_repair_math_render_failures_repairs_truncated_array_environment() -> None:
    markdown = "$$\n\\begin{array}{c} x \\end{a}\n$$\n"
    expression = extract_math_expressions(markdown)[0]
    failure = MathRenderFailure(expression=expression, message="Unknown environment 'a'")
    checker = BodyChecker("\\end{array}")

    result = repair_math_render_failures(markdown, (failure,), checker)

    assert result.markdown == "$$\n\\begin{array}{c} x \\end{array}\n$$\n"
    assert result.repairs[0].rule == "truncated_array_end"


def test_repair_math_render_failures_leaves_markdown_unchanged_when_candidate_fails() -> None:
    markdown = "$$\nx ^ {i} ^ {t}\n$$\n"
    expression = extract_math_expressions(markdown)[0]
    failure = MathRenderFailure(expression=expression, message="Double exponent: use braces to clarify")
    checker = BodyChecker("never-passes")

    result = repair_math_render_failures(markdown, (failure,), checker)

    assert result.markdown == markdown
    assert result.repairs == ()
    assert result.warnings == ()


def test_repair_math_render_failures_only_changes_failed_spans() -> None:
    markdown = "$a ^ {b} ^ {c}$ and $unchanged ^ {ok}$\n"
    expressions = extract_math_expressions(markdown)
    failure = MathRenderFailure(expression=expressions[0], message="Double exponent: use braces to clarify")
    checker = BodyChecker("{} ^ {c}")

    result = repair_math_render_failures(markdown, (failure,), checker)

    assert result.markdown == "$a ^ {b} {} ^ {c}$ and $unchanged ^ {ok}$\n"
@@ -6,3 +6,4 @@ import pdf2md
def test_package_imports() -> None:
    assert pdf2md.__version__ == "0.1.0"
    assert callable(pdf2md.convert_pdf)
    assert callable(pdf2md.recheck_markdown)
@@ -6,6 +6,7 @@ from pdf2md.ir import WarningCode, WarningSeverity
from pdf2md.quality import (
    MathCheckerUnavailable,
    MathCheckResult,
    check_math_renderability_details,
    check_asset_links,
    check_math_renderability,
    extract_math_expressions,
@@ -71,6 +72,20 @@ def test_math_render_failures_are_aggregated_with_fake_checker() -> None:
    assert "bad_math failed" in result.warnings[0].message


def test_math_renderability_details_include_failed_expression_records() -> None:
    def checker(body: str) -> MathCheckResult:
        return MathCheckResult(ok="bad" not in body, message=f"{body} failed")

    result = check_math_renderability_details("$x_i$\n\n$$\nbad_math\n$$", checker)

    assert result.quality.math_render_error_count == 1
    assert len(result.failures) == 1
    assert result.failures[0].expression.index == 1
    assert result.failures[0].expression.body == "bad_math"
    assert result.failures[0].expression.display is True
    assert result.failures[0].message == "bad_math failed"


def test_math_extraction_records_display_mode_and_markdown_spans() -> None:
    markdown = "Inline $x_i^2$ before\n\n$$\n\\frac{1}{2}\n$$\n"
