modify pdftomd
@@ -0,0 +1,292 @@
# Sprint 13 Contract: Text Layer Fidelity Diagnostics

Status: Implemented
Last updated: 2026-05-11

## Objective

Add a local pypdf-based text fidelity diagnostic pass that compares source PDF text-layer extraction with MinerU-generated Markdown text on a per-page basis where page mapping is available.

The first priority is diagnosis, not automatic body-text replacement. This sprint should record enough evidence in metadata JSON and `<stem>.report.md` to identify pages where MinerU likely misrecognized Korean body text, especially missing Hangul syllables, unexpected CJK ideographs, and abnormal spacing. It may mark pypdf text as a future replacement candidate, but it must not replace Markdown body text in this sprint.

## Current Precondition

- MinerU 3.1.0 remains the only conversion engine.
- Conversion runs through direct local `mineru` CLI execution only.
- `pypdf` is already used by the project for local PDF chunk planning.
- `pdf2md convert` writes Markdown, metadata JSON, and `<stem>.report.md`.
- `pdf2md recheck` can regenerate metadata/report from an existing Markdown file.
- Chunked conversion records original source page ranges in metadata `engine_options.chunk`.
- The 2007 Korean shell-structure sample showed clear text fidelity problems:
  - pypdf can extract more accurate Hangul from the digital text layer.
  - MinerU Markdown can omit Hangul syllables or misrecognize headings/body text as unrelated CJK characters.
  - The source text layer itself can contain abnormal spacing between Hangul syllables.

## Touched Surfaces

Allowed during implementation:

- `src/pdf2md/text_fidelity.py`
- `src/pdf2md/ir.py`
- `src/pdf2md/metadata.py`
- `src/pdf2md/report.py`
- `src/pdf2md/conversion.py`
- `tests/test_text_fidelity.py`
- `tests/test_metadata.py`
- `tests/test_report.py`
- `tests/test_conversion.py`
- `docs/V1IMPLEMENTATIONPLAN.md`
- `PLAN.md`
- `PROGRESS.md`
- `docs/WORKARCHIVE.md` after completion

Allowed only if needed for CLI/API wiring:

- `src/pdf2md/cli.py`
- `tests/test_cli.py`
- `README.md`

Not allowed:

- Replacing Markdown body text with pypdf text in this sprint.
- Adding a second conversion engine or engine selector.
- Adding remote OCR, hosted LLM/VLM, remote document parsing, `--api-url`, router mode, HTTP client backends, or remote OpenAI-compatible endpoints.
- Mandatory default tests that require real MinerU, GPU, model files, network, Obsidian, or committed `samples/`.
- Committing sample PDFs or generated `outputs/`.

## Product Behavior

Text fidelity diagnostics should run automatically after MinerU Markdown normalization and local quality checks have produced the final Markdown candidate.

For each page that can be compared, metadata should record a compact diagnostic object with at least:

- `page_index`: zero-based output page index.
- `source_page_number`: one-based original PDF page number when known.
- `pypdf_text_available`: whether pypdf extracted non-empty source text.
- `markdown_text_available`: whether comparable Markdown text exists for the page.
- `pypdf_hangul_count`: Hangul syllable count from pypdf text.
- `markdown_hangul_count`: Hangul syllable count from Markdown text.
- `hangul_count_delta`: `markdown_hangul_count - pypdf_hangul_count`.
- `hangul_count_ratio`: Markdown Hangul count divided by pypdf Hangul count, or `null` when unavailable.
- `unexpected_cjk_count`: count of CJK Unified Ideographs in Markdown that are suspicious in a page with Korean source text.
- `pypdf_hangul_spacing_anomaly_ratio`: ratio of Hangul-to-Hangul whitespace breaks in pypdf text.
- `markdown_hangul_spacing_anomaly_ratio`: ratio of Hangul-to-Hangul whitespace breaks in Markdown text.
- `text_similarity`: normalized text similarity between pypdf text and Markdown text.
- `replacement_candidate`: `true` only when pypdf text appears more reliable than Markdown text under conservative thresholds.
- `comparison_status`: one of `checked`, `source_text_missing`, `markdown_page_unavailable`, or `page_mapping_uncertain`.
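
The per-page diagnostic object above could be modeled as an immutable record. The following is a minimal sketch, not the project's actual class: the name `PageTextFidelity` and the `to_metadata()` helper are hypothetical, and the two derived fields are computed from the stored counts so they cannot drift out of sync.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Status values taken from the contract's `comparison_status` field.
ALLOWED_STATUSES = frozenset(
    {"checked", "source_text_missing", "markdown_page_unavailable", "page_mapping_uncertain"}
)


@dataclass(frozen=True)
class PageTextFidelity:  # hypothetical name, for illustration only
    page_index: int                      # zero-based output page index
    source_page_number: Optional[int]    # one-based original PDF page, if known
    pypdf_text_available: bool
    markdown_text_available: bool
    pypdf_hangul_count: int
    markdown_hangul_count: int
    unexpected_cjk_count: int
    pypdf_hangul_spacing_anomaly_ratio: float
    markdown_hangul_spacing_anomaly_ratio: float
    text_similarity: Optional[float]
    replacement_candidate: bool
    comparison_status: str

    def __post_init__(self) -> None:
        if self.comparison_status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown comparison_status: {self.comparison_status!r}")

    @property
    def hangul_count_delta(self) -> int:
        return self.markdown_hangul_count - self.pypdf_hangul_count

    @property
    def hangul_count_ratio(self) -> Optional[float]:
        # None serializes to JSON null when the pypdf count is zero.
        if self.pypdf_hangul_count == 0:
            return None
        return self.markdown_hangul_count / self.pypdf_hangul_count

    def to_metadata(self) -> dict:
        """Return a JSON-serializable dict including the derived fields."""
        data = asdict(self)
        data["hangul_count_delta"] = self.hangul_count_delta
        data["hangul_count_ratio"] = self.hangul_count_ratio
        return data
```

Deriving `hangul_count_delta` and `hangul_count_ratio` as properties is one design choice among several; storing them as plain fields would work equally well if the record is only ever built in one place.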

Metadata summary should include:

- `text_fidelity_checked_page_count`.
- `text_fidelity_low_page_count`.
- `text_fidelity_unexpected_cjk_count`.
- `text_fidelity_replacement_candidate_page_count`.
- `text_fidelity_page_mapping_uncertain_count`.

Report Markdown should add a dedicated `## Text Fidelity` section showing:

- checked page count and low-fidelity page count.
- total unexpected CJK count.
- replacement candidate page count.
- pages with low similarity.
- pages with high unexpected CJK count.
- pages where page-level comparison could not be trusted.

Warning behavior:

- Add `TEXT_LAYER_AVAILABLE` as an info warning when pypdf source text is available and diagnostics run.
- Add `TEXT_FIDELITY_LOW` as a warning for pages below the fidelity threshold.
- Add `UNEXPECTED_CJK_IN_KOREAN_TEXT` as a warning when suspicious CJK ideographs appear in Markdown for pages with Korean source text.
- Add `HANGUL_SPACING_SUSPECT` as an info or warning-level signal when pypdf or Markdown text has a high Hangul spacing anomaly ratio.
- Add `TEXT_PAGE_MAPPING_UNCERTAIN` as an info warning when page-level Markdown mapping is not reliable enough for per-page metrics.

Replacement candidate policy:

- `replacement_candidate` is a diagnostic marker only.
- It must not change Markdown output.
- It should be `true` only when:
  - pypdf source text is available,
  - pypdf Hangul count is materially higher than Markdown Hangul count or Markdown has suspicious CJK ideographs,
  - pypdf spacing anomalies are not so severe that the source text layer is clearly unusable,
  - the page's `comparison_status` is `checked`.
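
The four conditions of the replacement candidate policy can be expressed as a single pure predicate. This is a sketch under stated assumptions: the function name and both threshold defaults (`min_hangul_gain`, `max_source_spacing_anomaly`) are hypothetical placeholders, since the contract only says "materially higher" and "not so severe" without fixing numbers.

```python
def is_replacement_candidate(
    *,
    comparison_status: str,
    pypdf_text_available: bool,
    pypdf_hangul_count: int,
    markdown_hangul_count: int,
    unexpected_cjk_count: int,
    pypdf_spacing_anomaly_ratio: float,
    min_hangul_gain: float = 1.2,             # assumed threshold, not from the contract
    max_source_spacing_anomaly: float = 0.5,  # assumed threshold, not from the contract
) -> bool:
    """Return True only when every policy condition holds. Never edits Markdown."""
    # Page mapping must be credible and source text must exist.
    if comparison_status != "checked" or not pypdf_text_available:
        return False
    # A badly spaced source text layer is not a trustworthy replacement.
    if pypdf_spacing_anomaly_ratio > max_source_spacing_anomaly:
        return False
    hangul_materially_higher = (
        pypdf_hangul_count > markdown_hangul_count
        and pypdf_hangul_count >= markdown_hangul_count * min_hangul_gain
    )
    return hangul_materially_higher or unexpected_cjk_count > 0
```

Keeping the predicate free of I/O makes it straightforward to tune the thresholds against real samples later without touching the conversion path.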

## Architecture Plan

### WP13.1: Text Fidelity Module

Actions:

- Add `src/pdf2md/text_fidelity.py`.
- Use `pypdf.PdfReader` to extract source page text locally.
- Define immutable result records for per-page metrics and summary metrics.
- Strip Markdown syntax, image links, fenced code, inline code, and math spans before text comparison.
- Normalize text for comparison without mutating the output Markdown:
  - Unicode NFKC normalization for comparison strings only.
  - collapse whitespace for similarity only.
  - keep raw-count metrics independent enough to expose spacing anomalies.
- Count Hangul syllables with the Hangul syllable block.
- Count suspicious CJK ideographs with CJK Unified Ideograph ranges, excluding Hangul ranges.
- Compute similarity with a deterministic standard-library algorithm such as `difflib.SequenceMatcher`.

Expected output:

- Pure local helper functions that are independently testable and do not call MinerU, network services, or the filesystem except for reading the source PDF.
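
The counting and similarity helpers described above might look like the following sketch. All function names are hypothetical. The spacing anomaly definition used here (whitespace-separated Hangul pairs divided by all neighbouring Hangul pairs) is an assumption, because the contract does not fix the denominator; likewise, this CJK counter counts every ideograph, whereas a real implementation may discount ideographs that also appear in the pypdf text.

```python
import difflib
import re
import unicodedata

HANGUL_SYLLABLES = (0xAC00, 0xD7A3)  # Unicode Hangul Syllables block
CJK_IDEOGRAPH_RANGES = (
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)


def is_hangul_syllable(ch: str) -> bool:
    return HANGUL_SYLLABLES[0] <= ord(ch) <= HANGUL_SYLLABLES[1]


def count_hangul(text: str) -> int:
    return sum(1 for ch in text if is_hangul_syllable(ch))


def count_cjk_ideographs(text: str) -> int:
    return sum(
        1 for ch in text
        if any(lo <= ord(ch) <= hi for lo, hi in CJK_IDEOGRAPH_RANGES)
    )


def hangul_spacing_anomaly_ratio(text: str) -> float:
    """Share of neighbouring Hangul pairs that are split by whitespace."""
    breaks = joins = 0
    prev_is_hangul = False  # whether the last non-space character was Hangul
    gap = False             # whether whitespace was seen since that character
    for ch in text:
        if ch.isspace():
            gap = True
            continue
        cur_is_hangul = is_hangul_syllable(ch)
        if prev_is_hangul and cur_is_hangul:
            if gap:
                breaks += 1
            else:
                joins += 1
        prev_is_hangul = cur_is_hangul
        gap = False
    total = breaks + joins
    return breaks / total if total else 0.0


def text_similarity(source_text: str, markdown_text: str) -> float:
    """Deterministic similarity on NFKC-normalized, whitespace-collapsed copies."""
    def normalize(text: str) -> str:
        return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", text)).strip()

    matcher = difflib.SequenceMatcher(
        None, normalize(source_text), normalize(markdown_text), autojunk=False
    )
    return matcher.ratio()
```

Normalization happens on local copies only, so the raw-count helpers still see the original spacing and the output Markdown is never mutated.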

### WP13.2: Page Mapping Boundary

Actions:

- Derive source page numbers from `engine_options.chunk` when chunking is active.
- Use project page records and any reliable raw structured page count to decide whether page-level comparison is possible.
- If Markdown cannot be mapped to pages reliably, produce `TEXT_PAGE_MAPPING_UNCERTAIN` and avoid pretending per-page metrics are exact.
- For the initial implementation, allow a conservative fallback for single-page mocked outputs and chunk outputs where one Markdown file corresponds to a known source page range.

Expected output:

- Page-level diagnostics are only marked `checked` when the mapping is credible.
- Ambiguous cases are visible in metadata/report instead of producing misleading page metrics.
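
One way to keep the mapping boundary honest is to mark a chunk's pages `checked` only when the number of pages recoverable from the Markdown equals the size of the known source range. The sketch below is illustrative only: the contract does not spell out the shape of `engine_options.chunk`, so the parameters `source_start_page` and `source_end_page` (one-based, inclusive) and the key `chunk_page_offset` are assumptions.

```python
from typing import Dict, List, Union

PageMapping = Dict[str, Union[int, str]]


def map_chunk_pages(
    source_start_page: int,     # assumed one-based, inclusive
    source_end_page: int,       # assumed one-based, inclusive
    markdown_page_count: int,   # pages recoverable from the Markdown side
) -> List[PageMapping]:
    """Map each page of a chunk back to its original source page number."""
    expected = source_end_page - source_start_page + 1
    if source_start_page < 1 or expected < 1:
        raise ValueError("invalid source page range")
    # Only claim a credible mapping when both sides agree on the page count.
    status = "checked" if markdown_page_count == expected else "page_mapping_uncertain"
    return [
        {
            "chunk_page_offset": offset,
            "source_page_number": source_start_page + offset,
            "comparison_status": status,
        }
        for offset in range(expected)
    ]
```

When the counts disagree, every page in the range is reported as uncertain rather than guessing which pages line up.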

### WP13.3: Metadata And Warning Integration

Actions:

- Add warning codes in `src/pdf2md/ir.py`.
- Add text fidelity fields to metadata without changing existing top-level fields used by current tests.
- Extend `build_summary()` to include text fidelity summary counts when diagnostics are present.
- Ensure warnings retain `page_index` where available.
- Preserve JSON serializability and deterministic key ordering on write.

Expected output:

- Metadata contains compact page-level text fidelity diagnostics and summary counts.
- Existing metadata consumers remain compatible.
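
The summary counts can be aggregated from the per-page dicts and serialized with sorted keys to keep writes deterministic. This is a hedged sketch rather than the project's `build_summary()`: the helper names are hypothetical, and `low_similarity_threshold=0.6` is an assumed placeholder because the contract does not state the fidelity threshold.

```python
import json
from typing import Dict, Iterable, List


def summarize_text_fidelity(
    pages: Iterable[dict],
    low_similarity_threshold: float = 0.6,  # assumed value, not from the contract
) -> Dict[str, int]:
    """Aggregate per-page diagnostics into the summary fields named in the contract."""
    all_pages: List[dict] = list(pages)
    checked = [p for p in all_pages if p["comparison_status"] == "checked"]
    return {
        "text_fidelity_checked_page_count": len(checked),
        "text_fidelity_low_page_count": sum(
            1 for p in checked
            if p["text_similarity"] is not None
            and p["text_similarity"] < low_similarity_threshold
        ),
        "text_fidelity_unexpected_cjk_count": sum(p["unexpected_cjk_count"] for p in checked),
        "text_fidelity_replacement_candidate_page_count": sum(
            1 for p in checked if p["replacement_candidate"]
        ),
        "text_fidelity_page_mapping_uncertain_count": sum(
            1 for p in all_pages if p["comparison_status"] == "page_mapping_uncertain"
        ),
    }


def dump_metadata(metadata: dict) -> str:
    """Serialize with sorted keys so repeated writes produce identical output."""
    return json.dumps(metadata, ensure_ascii=False, sort_keys=True, indent=2)
```

`ensure_ascii=False` keeps any Korean strings in the metadata readable in the written file; `sort_keys=True` is what provides the deterministic key ordering.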

### WP13.4: Report Integration

Actions:

- Extend `render_report()` to render a `## Text Fidelity` section when diagnostics exist.
- Keep the report derived from metadata and quality results.
- Include low-fidelity pages and replacement candidate pages in human-readable form.
- Do not include full extracted page text in the report.

Expected output:

- A human can identify which pages need attention without opening metadata JSON first.
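
A report section derived purely from metadata might be rendered as in the sketch below. It is not the project's `render_report()`: the function name, the bullet wording, and the `low_similarity_threshold` default are assumptions. Only page identifiers and counts are emitted, never extracted page text.

```python
from typing import Iterable, List


def _page_label(page: dict) -> str:
    # Prefer the one-based source page number; fall back to the output index.
    number = page.get("source_page_number")
    if number is not None:
        return f"page {number}"
    return f"output page index {page['page_index']}"


def render_text_fidelity_section(
    summary: dict,
    pages: Iterable[dict],
    low_similarity_threshold: float = 0.6,  # assumed value, not from the contract
) -> str:
    """Render a Markdown section listing pages that need human attention."""
    all_pages: List[dict] = list(pages)
    checked = [p for p in all_pages if p["comparison_status"] == "checked"]
    low = [
        p for p in checked
        if p["text_similarity"] is not None
        and p["text_similarity"] < low_similarity_threshold
    ]
    candidates = [p for p in checked if p["replacement_candidate"]]
    uncertain = [p for p in all_pages if p["comparison_status"] == "page_mapping_uncertain"]

    def listing(items: List[dict]) -> str:
        return ", ".join(_page_label(p) for p in items) or "none"

    lines = [
        "## Text Fidelity",
        "",
        f"- Checked pages: {summary['text_fidelity_checked_page_count']}",
        f"- Low-fidelity pages: {summary['text_fidelity_low_page_count']}",
        f"- Unexpected CJK ideographs: {summary['text_fidelity_unexpected_cjk_count']}",
        f"- Low-similarity pages: {listing(low)}",
        f"- Replacement candidate pages: {listing(candidates)}",
        f"- Page mapping uncertain: {listing(uncertain)}",
    ]
    return "\n".join(lines) + "\n"
```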

### WP13.5: Conversion And Recheck Integration

Actions:

- Run text fidelity diagnostics during `convert` after final Markdown preparation and before metadata/report writing.
- Run the same diagnostics during `recheck` when the original source PDF path still exists.
- If the source PDF is missing during `recheck`, preserve existing behavior and add a clear nonfatal warning or omit diagnostics.
- Keep chunked conversion page ranges tied to original source page numbers.

Expected output:

- Fresh conversions and rechecks can produce text fidelity diagnostics without rerunning MinerU.

### WP13.6: Tests

Default fast tests:

- pypdf extraction boundary handles generated local PDFs without requiring real MinerU or sample files.
- Hangul count, unexpected CJK count, and spacing anomaly ratio helpers use direct Korean/CJK strings.
- Markdown text stripping ignores math, image links, fenced code, and inline code.
- Similarity score is deterministic for equivalent and degraded text.
- Metadata contains text fidelity summary fields when diagnostics are present.
- Report contains `## Text Fidelity` and page-level warning summaries.
- Conversion with a fake adapter records `TEXT_FIDELITY_LOW` when Markdown omits Hangul from a source-text PDF.
- Recheck reruns diagnostics when source PDF exists.
- Missing source PDF during recheck remains nonfatal.
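
As an illustration of the Markdown stripping test in the list above, the sketch below pairs a regex-based stripper with a pytest-style test that uses direct Korean strings. Both `strip_markdown_for_comparison` and the test name are hypothetical, and the regexes cover only common cases (they are not a full Markdown parser).

```python
import re

_FENCED_CODE = re.compile(r"^`{3}.*?^`{3}[ \t]*$", re.DOTALL | re.MULTILINE)
_DISPLAY_MATH = re.compile(r"\$\$.*?\$\$", re.DOTALL)
_INLINE_MATH = re.compile(r"\$[^$\n]+\$")
_INLINE_CODE = re.compile(r"`[^`\n]+`")
_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
_LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")
_BLOCK_MARKER = re.compile(r"^[ \t]*(?:#{1,6}|[-*+]|>)[ \t]+", re.MULTILINE)


def strip_markdown_for_comparison(markdown: str) -> str:
    """Return a comparison-only copy; the input Markdown is left untouched."""
    text = _FENCED_CODE.sub(" ", markdown)   # fenced code first, it may contain $ or `
    text = _DISPLAY_MATH.sub(" ", text)
    text = _INLINE_MATH.sub(" ", text)
    text = _INLINE_CODE.sub(" ", text)
    text = _IMAGE.sub(" ", text)             # drop images including alt text
    text = _LINK.sub(r"\1", text)            # keep link text, drop the target
    text = _BLOCK_MARKER.sub("", text)       # heading, list, and quote markers
    return text


def test_stripping_ignores_math_images_and_code() -> None:
    fence = "`" * 3  # built at runtime to keep this example fence-safe
    markdown = "\n".join([
        "# 제목",
        "",
        "본문 `inline_code` 와 $x^2$ 수식.",
        "",
        "![그림](assets/a.png)",
        "",
        fence + "python",
        "print('hidden')",
        fence,
        "",
        "[링크 텍스트](https://example.invalid)",
    ])
    stripped = strip_markdown_for_comparison(markdown)
    for kept in ("제목", "본문", "수식", "링크 텍스트"):
        assert kept in stripped
    for dropped in ("inline_code", "x^2", "그림", "a.png", "hidden", "example.invalid", "#"):
        assert dropped not in stripped
```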

Optional local validation:

- Convert the local 2007 Korean shell-structure sample with chunking to ignored `outputs\`.
- Confirm the report flags the pages where the previous output had missing Hangul and unexpected CJK characters.
- Do not commit sample PDFs or generated outputs.

## Acceptance Criteria

- Default tests pass without real MinerU, GPU, model files, network, Obsidian, or `samples/`.
- Diagnostics are local-only and use pypdf source text only from the local PDF.
- Metadata JSON records page-level text fidelity metrics where page mapping is credible.
- Metadata summary records aggregate text fidelity counts.
- `<stem>.report.md` includes a text fidelity section when diagnostics exist.
- Suspicious Korean text loss produces structured warnings with page provenance where available.
- Replacement candidate markers are recorded only as diagnostics and do not alter Markdown content.
- Existing math, asset, table, chunk, strict-local, and UI behavior remains unchanged.

## Hard Failure Criteria

- Markdown body text is replaced automatically in this sprint.
- Page-level metrics are reported as exact when page mapping is uncertain.
- Diagnostics upload PDFs, page text, Markdown, or extracted text to any remote service.
- Default tests require MinerU, CUDA/GPU, model files, network, Obsidian, or `samples/`.
- Existing output schema fields are removed or renamed.
- `samples/`, generated `outputs/`, or `dist/pdf2md-ui.exe` are committed.

## Verification Commands

```powershell
uv run pytest tests/test_text_fidelity.py tests/test_metadata.py tests/test_report.py tests/test_conversion.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```

Optional local validation:

```powershell
$env:MINERU_MODEL_SOURCE='local'
$pdf = (Get-ChildItem samples -Filter '2007*.pdf' | Select-Object -First 1).FullName
uv run pdf2md convert $pdf --out outputs\sprint13-2007-text-fidelity --overwrite --chunk-pages 5
```

## Handoff Requirements

After implementation:

- Update `PROGRESS.md` with files changed, commands run, test outcomes, optional sample validation outcome, known failures, residual risks, and next action.
- Archive completed implementation details in `docs/WORKARCHIVE.md` after verification.
- Keep sample PDFs, generated outputs, and build artifacts out of the commit.
- Record whether page-level mapping was exact, approximate, or unavailable for the validated sample.

## Implementation Handoff

Files changed:

- `src/pdf2md/text_fidelity.py`
- `src/pdf2md/ir.py`
- `src/pdf2md/metadata.py`
- `src/pdf2md/report.py`
- `src/pdf2md/conversion.py`
- `tests/test_text_fidelity.py`
- `tests/test_metadata.py`
- `tests/test_report.py`
- `tests/test_conversion.py`
- `ARCHITECTURE.md`
- `PLAN.md`
- `PROGRESS.md`
- `docs/WORKARCHIVE.md`
- `docs/V1IMPLEMENTATIONPLAN.md`

Verification:

- `uv run pytest tests/test_text_fidelity.py tests/test_metadata.py tests/test_report.py tests/test_conversion.py`: passed 49 tests.
- `uv run pytest`: passed 198 tests with 1 optional skip.

Known failures:

- None in the default fast test suite.

Residual risks:

- Page-level Markdown mapping is only scored when credible. Multi-page Markdown without reliable page boundaries is reported as `TEXT_PAGE_MAPPING_UNCERTAIN` rather than guessed.
- Automatic body-text replacement remains out of scope and is not implemented.
- Optional real MinerU validation on the local 2007 Korean shell-structure sample was not run during implementation to avoid a long GPU conversion.

## Future Sprint Boundary

A later sprint may implement controlled body-text replacement from pypdf text after Sprint 13 diagnostics show reliable thresholds. That future sprint must have its own contract and must preserve math, tables, figures, asset links, and Markdown structure from MinerU unless explicitly redesigned.