Sprint 13 Contract: Text Layer Fidelity Diagnostics
Status: Implemented
Last updated: 2026-05-11
Objective
Add a local pypdf-based text fidelity diagnostic pass that compares source PDF text-layer extraction with MinerU-generated Markdown text on a per-page basis where page mapping is available.
The first priority is diagnosis, not automatic body-text replacement. This sprint should record enough evidence in metadata JSON and <stem>.report.md to identify pages where MinerU likely misrecognized Korean body text, especially missing Hangul syllables, unexpected CJK ideographs, and abnormal spacing. It may mark pypdf text as a future replacement candidate, but it must not replace Markdown body text in this sprint.
Current Precondition
- MinerU 3.1.0 remains the only conversion engine.
- Conversion runs through direct local mineru CLI execution only.
- pypdf is already used by the project for local PDF chunk planning.
- pdf2md convert writes Markdown, metadata JSON, and <stem>.report.md.
- pdf2md recheck can regenerate metadata/report from an existing Markdown file.
- Chunked conversion records original source page ranges in metadata engine_options.chunk.
- The 2007 Korean shell-structure sample showed clear text fidelity problems:
  - pypdf can extract more accurate Hangul from the digital text layer.
  - MinerU Markdown can omit Hangul syllables or misrecognize headings/body text as unrelated CJK characters.
  - The source text layer itself can contain abnormal spacing between Hangul syllables.
Touched Surfaces
Allowed during implementation:
- src/pdf2md/text_fidelity.py
- src/pdf2md/ir.py
- src/pdf2md/metadata.py
- src/pdf2md/report.py
- src/pdf2md/conversion.py
- tests/test_text_fidelity.py
- tests/test_metadata.py
- tests/test_report.py
- tests/test_conversion.py
- docs/V1IMPLEMENTATIONPLAN.md
- PLAN.md
- PROGRESS.md
- docs/WORKARCHIVE.md after completion
Allowed only if needed for CLI/API wiring:
- src/pdf2md/cli.py
- tests/test_cli.py
- README.md
Not allowed:
- Replacing Markdown body text with pypdf text in this sprint.
- Adding a second conversion engine or engine selector.
- Adding remote OCR, hosted LLM/VLM, remote document parsing, --api-url, router mode, HTTP client backends, or remote OpenAI-compatible endpoints.
- Mandatory default tests that require real MinerU, GPU, model files, network, Obsidian, or committed samples/.
- Committing sample PDFs or generated outputs/.
Product Behavior
Text fidelity diagnostics should run automatically after MinerU Markdown normalization and local quality checks have produced the final Markdown candidate.
For each page that can be compared, metadata should record a compact diagnostic object with at least:
- page_index: zero-based output page index.
- source_page_number: one-based original PDF page number when known.
- pypdf_text_available: whether pypdf extracted non-empty source text.
- markdown_text_available: whether comparable Markdown text exists for the page.
- pypdf_hangul_count: Hangul syllable count from pypdf text.
- markdown_hangul_count: Hangul syllable count from Markdown text.
- hangul_count_delta: markdown_hangul_count - pypdf_hangul_count.
- hangul_count_ratio: Markdown Hangul count divided by pypdf Hangul count, or null when unavailable.
- unexpected_cjk_count: count of CJK Unified Ideographs in Markdown that are suspicious in a page with Korean source text.
- pypdf_hangul_spacing_anomaly_ratio: ratio of Hangul-to-Hangul whitespace breaks in pypdf text.
- markdown_hangul_spacing_anomaly_ratio: ratio of Hangul-to-Hangul whitespace breaks in Markdown text.
- text_similarity: normalized text similarity between pypdf text and Markdown text.
- replacement_candidate: true only when pypdf text appears more reliable than Markdown text under conservative thresholds.
- comparison_status: one of checked, source_text_missing, markdown_page_unavailable, or page_mapping_uncertain.
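As an illustration only, the per-page diagnostic object described above could be modeled as a frozen dataclass. The class name PageTextFidelity and the to_metadata() helper are hypothetical, and treating a zero pypdf Hangul count as the "unavailable" case for hangul_count_ratio is an assumption, not something the contract specifies.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class PageTextFidelity:
    """Hypothetical immutable record mirroring the per-page fields listed above."""

    page_index: int
    source_page_number: Optional[int]
    pypdf_text_available: bool
    markdown_text_available: bool
    pypdf_hangul_count: int
    markdown_hangul_count: int
    unexpected_cjk_count: int
    pypdf_hangul_spacing_anomaly_ratio: float
    markdown_hangul_spacing_anomaly_ratio: float
    text_similarity: Optional[float]
    replacement_candidate: bool
    comparison_status: str

    @property
    def hangul_count_delta(self) -> int:
        # Negative values mean the Markdown has fewer Hangul syllables than the source.
        return self.markdown_hangul_count - self.pypdf_hangul_count

    @property
    def hangul_count_ratio(self) -> Optional[float]:
        # Assumption: "unavailable" means there is no pypdf Hangul to divide by.
        if self.pypdf_hangul_count == 0:
            return None
        return self.markdown_hangul_count / self.pypdf_hangul_count

    def to_metadata(self) -> dict:
        # asdict() covers the stored fields; derived values are added explicitly.
        data = asdict(self)
        data["hangul_count_delta"] = self.hangul_count_delta
        data["hangul_count_ratio"] = self.hangul_count_ratio
        return data
```

Deriving the delta and ratio from the two counts, rather than storing them, keeps the record from ever holding inconsistent values.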
Metadata summary should include:
- text_fidelity_checked_page_count.
- text_fidelity_low_page_count.
- text_fidelity_unexpected_cjk_count.
- text_fidelity_replacement_candidate_page_count.
- text_fidelity_page_mapping_uncertain_count.
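A minimal sketch of how these summary counts might be aggregated from per-page diagnostic dictionaries. The function name summarize_text_fidelity and the 0.6 similarity threshold are hypothetical; the contract does not state the actual threshold.

```python
from typing import Iterable, Mapping

# Hypothetical threshold; the contract only says "below the fidelity threshold".
LOW_SIMILARITY_THRESHOLD = 0.6


def summarize_text_fidelity(
    pages: Iterable[Mapping], low_threshold: float = LOW_SIMILARITY_THRESHOLD
) -> dict:
    """Aggregate per-page diagnostics into the summary fields listed above."""
    pages = list(pages)
    # Only pages with a credible mapping contribute to fidelity counts.
    checked = [p for p in pages if p["comparison_status"] == "checked"]
    return {
        "text_fidelity_checked_page_count": len(checked),
        "text_fidelity_low_page_count": sum(
            1
            for p in checked
            if p["text_similarity"] is not None and p["text_similarity"] < low_threshold
        ),
        "text_fidelity_unexpected_cjk_count": sum(p["unexpected_cjk_count"] for p in checked),
        "text_fidelity_replacement_candidate_page_count": sum(
            1 for p in checked if p["replacement_candidate"]
        ),
        "text_fidelity_page_mapping_uncertain_count": sum(
            1 for p in pages if p["comparison_status"] == "page_mapping_uncertain"
        ),
    }
```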
Report Markdown should add a dedicated ## Text Fidelity section showing:
- checked page count and low-fidelity page count.
- total unexpected CJK count.
- replacement candidate page count.
- pages with low similarity.
- pages with high unexpected CJK count.
- pages where page-level comparison could not be trusted.
Warning behavior:
- Add TEXT_LAYER_AVAILABLE as an info warning when pypdf source text is available and diagnostics run.
- Add TEXT_FIDELITY_LOW as a warning for pages below the fidelity threshold.
- Add UNEXPECTED_CJK_IN_KOREAN_TEXT as a warning when suspicious CJK ideographs appear in Markdown for pages with Korean source text.
- Add HANGUL_SPACING_SUSPECT as an info or warning-level signal when pypdf or Markdown has a high Hangul spacing anomaly ratio.
- Add TEXT_PAGE_MAPPING_UNCERTAIN as an info warning when page-level Markdown mapping is not reliable enough for per-page metrics.
Replacement candidate policy:
- replacement_candidate is a diagnostic marker only.
- It must not change Markdown output.
- It should be true only when:
  - pypdf source text is available,
  - pypdf Hangul count is materially higher than Markdown Hangul count or Markdown has suspicious CJK ideographs,
  - pypdf spacing anomalies are not so severe that the source text layer is clearly unusable,
  - page mapping is checked.
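The policy above can be read as a single conservative predicate. The sketch below is one possible reading; the function name and both thresholds (min_hangul_gain_ratio, max_source_spacing_anomaly) are hypothetical values chosen for illustration, not thresholds taken from the implementation.

```python
def is_replacement_candidate(
    *,
    comparison_status: str,
    pypdf_text_available: bool,
    pypdf_hangul_count: int,
    markdown_hangul_count: int,
    unexpected_cjk_count: int,
    pypdf_spacing_anomaly_ratio: float,
    min_hangul_gain_ratio: float = 1.2,  # hypothetical "materially higher" factor
    max_source_spacing_anomaly: float = 0.5,  # hypothetical "clearly unusable" cutoff
) -> bool:
    """Return True only when every condition of the candidate policy holds."""
    # Page mapping must be credible and the source text layer must exist.
    if comparison_status != "checked" or not pypdf_text_available:
        return False
    # A badly spaced source layer is not a trustworthy replacement.
    if pypdf_spacing_anomaly_ratio > max_source_spacing_anomaly:
        return False
    materially_more_hangul = (
        pypdf_hangul_count > markdown_hangul_count
        and pypdf_hangul_count >= markdown_hangul_count * min_hangul_gain_ratio
    )
    return materially_more_hangul or unexpected_cjk_count > 0
```

The result is only a diagnostic flag; nothing in this function touches the Markdown output.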
Architecture Plan
WP13.1: Text Fidelity Module
Actions:
- Add src/pdf2md/text_fidelity.py.
- Use pypdf.PdfReader to extract source page text locally.
- Define immutable result records for per-page metrics and summary metrics.
- Strip Markdown syntax, image links, fenced code, inline code, and math spans before text comparison.
- Normalize text for comparison without mutating the output Markdown:
  - Unicode NFKC normalization for comparison strings only.
  - collapse whitespace for similarity only.
  - keep raw-count metrics independent enough to expose spacing anomalies.
- Count Hangul syllables with the Hangul syllable block.
- Count suspicious CJK ideographs with CJK Unified Ideograph ranges, excluding Hangul ranges.
- Compute similarity with a deterministic standard-library algorithm such as difflib.SequenceMatcher.
Expected output:
- Pure local helper functions that are independently testable and do not call MinerU, network services, or the filesystem except for reading the source PDF.
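The counting and similarity helpers in WP13.1 can be sketched with the standard library alone. The function names are hypothetical. Two choices here are assumptions: the spacing anomaly ratio is defined as the share of adjacent Hangul pairs separated by whitespace, and "suspicious" ideographs are limited to the base CJK Unified Ideographs block plus Extension A.

```python
import unicodedata
from difflib import SequenceMatcher

# Hangul Syllables block: U+AC00..U+D7A3.
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3
# Assumption: base CJK Unified Ideographs plus Extension A; neither overlaps Hangul.
CJK_RANGES = ((0x4E00, 0x9FFF), (0x3400, 0x4DBF))


def is_hangul_syllable(ch: str) -> bool:
    return HANGUL_FIRST <= ord(ch) <= HANGUL_LAST


def count_hangul(text: str) -> int:
    return sum(1 for ch in text if is_hangul_syllable(ch))


def count_cjk_ideographs(text: str) -> int:
    return sum(1 for ch in text if any(lo <= ord(ch) <= hi for lo, hi in CJK_RANGES))


def hangul_spacing_anomaly_ratio(text: str) -> float:
    """Share of neighbouring Hangul pairs that have whitespace between them."""
    pairs = breaks = 0
    prev_hangul = gap = False
    for ch in text:
        if is_hangul_syllable(ch):
            if prev_hangul:
                pairs += 1
                breaks += 1 if gap else 0
            prev_hangul, gap = True, False
        elif ch.isspace():
            gap = gap or prev_hangul
        else:
            # Any other character ends the current Hangul run.
            prev_hangul = gap = False
    return breaks / pairs if pairs else 0.0


def text_similarity(a: str, b: str) -> float:
    """Deterministic similarity on NFKC-normalized, whitespace-collapsed copies."""
    def norm(s: str) -> str:
        return " ".join(unicodedata.normalize("NFKC", s).split())

    return SequenceMatcher(None, norm(a), norm(b), autojunk=False).ratio()
```

The raw-count helpers work on the unnormalized text, so whitespace collapsing in text_similarity does not hide spacing anomalies. Ordinary Korean prose has a nonzero spacing ratio because words are separated by spaces; only values near 1.0 suggest syllable-by-syllable spacing.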
WP13.2: Page Mapping Boundary
Actions:
- Derive source page numbers from engine_options.chunk when chunking is active.
- Use project page records and any reliable raw structured page count to decide whether page-level comparison is possible.
- If Markdown cannot be mapped to pages reliably, produce TEXT_PAGE_MAPPING_UNCERTAIN and avoid pretending per-page metrics are exact.
- For the initial implementation, allow a conservative fallback for single-page mocked outputs and chunk outputs where one Markdown file corresponds to a known source page range.
Expected output:
- Page-level diagnostics are only marked checked when the mapping is credible.
- Ambiguous cases are visible in metadata/report instead of producing misleading page metrics.
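A simplified sketch of the mapping boundary, assuming a chunk is described by a one-based inclusive start and end page. The actual layout of engine_options.chunk is not shown in this contract, so the parameter names below are hypothetical, and the rule "page counts must match" is a stricter stand-in for the conservative fallback described above.

```python
from typing import List, Optional, Tuple


def map_chunk_pages(
    chunk_start_page: int,
    chunk_end_page: int,
    markdown_page_count: Optional[int],
) -> Tuple[str, List[int]]:
    """Return (comparison_status, one-based source page numbers) for one chunk."""
    source_pages = list(range(chunk_start_page, chunk_end_page + 1))
    # Without a reliable Markdown page count that matches the chunk range,
    # per-page metrics would be guesses, so the status says so explicitly.
    if markdown_page_count is None or markdown_page_count != len(source_pages):
        return "page_mapping_uncertain", source_pages
    return "checked", source_pages
```

For example, a chunk covering source pages 6 to 10 with five Markdown pages maps page_index 0 to source page 6, while the same chunk with an unknown Markdown page count is reported as uncertain.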
WP13.3: Metadata And Warning Integration
Actions:
- Add warning codes in src/pdf2md/ir.py.
- Add text fidelity fields to metadata without changing existing top-level fields used by current tests.
- Extend build_summary() to include text fidelity summary counts when diagnostics are present.
- Ensure warnings retain page_index where available.
- Preserve JSON serializability and deterministic key ordering on write.
Expected output:
- Metadata contains compact page-level text fidelity diagnostics and summary counts.
- Existing metadata consumers remain compatible.
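Deterministic key ordering on write can be obtained from the standard json module. This is a minimal sketch; dump_metadata is a hypothetical name, and the project's writer may use different options.

```python
import json


def dump_metadata(metadata: dict) -> str:
    """Serialize metadata with stable key order and readable Korean text."""
    # sort_keys makes output independent of dict insertion order (nested dicts too);
    # ensure_ascii=False keeps Hangul as-is instead of \uXXXX escapes.
    return json.dumps(metadata, ensure_ascii=False, sort_keys=True, indent=2) + "\n"
```

Two metadata dictionaries with the same content but different insertion order then produce byte-identical output, which keeps diffs between reruns quiet.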
WP13.4: Report Integration
Actions:
- Extend render_report() to render a ## Text Fidelity section when diagnostics exist.
- Keep the report derived from metadata and quality results.
- Include low-fidelity pages and replacement candidate pages in human-readable form.
- Do not include full extracted page text in the report.
Expected output:
- A human can identify which pages need attention without opening metadata JSON first.
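One way the report section might be rendered from metadata alone is sketched below. The function name, the exact wording of each line, and the 0.6 threshold are hypothetical; only the summary keys and per-page field names come from this contract.

```python
from typing import Iterable, Mapping


def render_text_fidelity_section(
    summary: Mapping, pages: Iterable[Mapping], low_threshold: float = 0.6
) -> str:
    """Render a '## Text Fidelity' section without including any page text."""
    def label(page: Mapping) -> str:
        number = page.get("source_page_number")
        return f"page {number}" if number is not None else f"page index {page['page_index']}"

    pages = list(pages)
    checked = [p for p in pages if p["comparison_status"] == "checked"]
    low = [label(p) for p in checked if p["text_similarity"] < low_threshold]
    candidates = [label(p) for p in checked if p["replacement_candidate"]]
    uncertain = [label(p) for p in pages if p["comparison_status"] == "page_mapping_uncertain"]

    lines = [
        "## Text Fidelity",
        "",
        f"- Checked pages: {summary['text_fidelity_checked_page_count']}",
        f"- Low-fidelity pages: {summary['text_fidelity_low_page_count']}",
        f"- Unexpected CJK characters: {summary['text_fidelity_unexpected_cjk_count']}",
        f"- Low similarity: {', '.join(low) or 'none'}",
        f"- Replacement candidates: {', '.join(candidates) or 'none'}",
        f"- Page mapping uncertain: {', '.join(uncertain) or 'none'}",
    ]
    return "\n".join(lines) + "\n"
```

Only counts and page labels are emitted, so the report stays readable and never carries extracted page text.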
WP13.5: Conversion And Recheck Integration
Actions:
- Run text fidelity diagnostics during convert after final Markdown preparation and before metadata/report writing.
- Run the same diagnostics during recheck when the original source PDF path still exists.
- If the source PDF is missing during recheck, preserve existing behavior and add a clear nonfatal warning or omit diagnostics.
- Keep chunked conversion page ranges tied to original source page numbers.
Expected output:
- Fresh conversions and rechecks can produce text fidelity diagnostics without rerunning MinerU.
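The "source PDF may be missing during recheck" rule can be kept nonfatal by returning None instead of raising. In this sketch, load_source_page_texts and the open_pages callback are hypothetical; the callback is injected so the helper can be tested without pypdf or real PDFs.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional, Union


def load_source_page_texts(
    source_pdf: Union[str, Path],
    open_pages: Callable[[Path], Iterable],
) -> Optional[List[str]]:
    """Return one text string per source page, or None when the PDF is gone."""
    path = Path(source_pdf)
    if not path.is_file():
        # Nonfatal: the caller can omit diagnostics or record a warning.
        return None
    # Each page only needs an extract_text() method; empty results become "".
    return [(page.extract_text() or "") for page in open_pages(path)]


# With pypdf installed, the callback could be:
#   from pypdf import PdfReader
#   texts = load_source_page_texts(pdf_path, lambda p: PdfReader(p).pages)
```

Both convert and recheck could call the same helper, so the diagnostics run identically in either path and never require MinerU.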
WP13.6: Tests
Default fast tests:
- pypdf extraction boundary handles generated local PDFs without requiring real MinerU or sample files.
- Hangul count, unexpected CJK count, and spacing anomaly ratio helpers use direct Korean/CJK strings.
- Markdown text stripping ignores math, image links, fenced code, and inline code.
- Similarity score is deterministic for equivalent and degraded text.
- Metadata contains text fidelity summary fields when diagnostics are present.
- Report contains ## Text Fidelity and page-level warning summaries.
- Conversion with a fake adapter records TEXT_FIDELITY_LOW when Markdown omits Hangul from a source-text PDF.
- Recheck reruns diagnostics when source PDF exists.
- Missing source PDF during recheck remains nonfatal.
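For the Markdown-stripping test in the list above, a regex-based helper along these lines could serve as the unit under test. This is a rough sketch with a hypothetical name, not a full Markdown parser: inline math detection will also swallow text between two dollar signs on one line, and literal asterisks are dropped along with emphasis markers.

```python
import re

# Fenced code uses a backreference so the closing fence matches the opening one.
_FENCED_CODE = re.compile(r"(`{3}|~{3}).*?\1", re.DOTALL)
_DISPLAY_MATH = re.compile(r"\$\$.*?\$\$|\\\[.*?\\\]", re.DOTALL)
_INLINE_MATH = re.compile(r"\$[^$\n]+\$|\\\(.*?\\\)")
_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
_INLINE_CODE = re.compile(r"`[^`\n]*`")
_LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")
_HTML_TAG = re.compile(r"<[^>\n]+>")
_LINE_MARKER = re.compile(r"^[ \t]*(#{1,6}|[-*+]|\d+\.|>)[ \t]+", re.MULTILINE)


def strip_markdown_for_comparison(markdown: str) -> str:
    """Return comparison-only plain text; the input Markdown is never modified."""
    text = _FENCED_CODE.sub(" ", markdown)
    text = _DISPLAY_MATH.sub(" ", text)
    text = _INLINE_MATH.sub(" ", text)
    text = _IMAGE.sub(" ", text)  # images before links: both use [..](..) syntax
    text = _INLINE_CODE.sub(" ", text)
    text = _LINK.sub(r"\1", text)  # keep the visible link text
    text = _HTML_TAG.sub(" ", text)
    text = _LINE_MARKER.sub("", text)  # headings, list bullets, block quotes
    text = text.replace("*", "")  # emphasis markers
    return " ".join(text.split())
```

A default fast test can then feed direct Korean strings through the helper and assert that code, math, and image paths are gone while body text survives.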
Optional local validation:
- Convert the local 2007 Korean shell-structure sample with chunking to ignored outputs\.
- Confirm the report flags the pages where the previous output had missing Hangul and unexpected CJK characters.
- Do not commit sample PDFs or generated outputs.
Acceptance Criteria
- Default tests pass without real MinerU, GPU, model files, network, Obsidian, or samples/.
- Diagnostics are local-only and use pypdf source text only from the local PDF.
- Metadata JSON records page-level text fidelity metrics where page mapping is credible.
- Metadata summary records aggregate text fidelity counts.
- <stem>.report.md includes a text fidelity section when diagnostics exist.
- Suspicious Korean text loss produces structured warnings with page provenance where available.
- Replacement candidate markers are recorded only as diagnostics and do not alter Markdown content.
- Existing math, asset, table, chunk, strict-local, and UI behavior remains unchanged.
Hard Failure Criteria
- Markdown body text is replaced automatically in this sprint.
- Page-level metrics are reported as exact when page mapping is uncertain.
- Diagnostics upload PDFs, page text, Markdown, or extracted text to any remote service.
- Default tests require MinerU, CUDA/GPU, model files, network, Obsidian, or samples/.
- Existing output schema fields are removed or renamed.
- samples/, generated outputs/, or dist/pdf2md-ui.exe are committed.
Verification Commands
uv run pytest tests/test_text_fidelity.py tests/test_metadata.py tests/test_report.py tests/test_conversion.py
uv run pytest
git diff --check
git status --short --untracked-files=all
Optional local validation:
$env:MINERU_MODEL_SOURCE='local'
$pdf = (Get-ChildItem samples -Filter '2007*.pdf' | Select-Object -First 1).FullName
uv run pdf2md convert $pdf --out outputs\sprint13-2007-text-fidelity --overwrite --chunk-pages 5
Handoff Requirements
After implementation:
- Update PROGRESS.md with files changed, commands run, test outcomes, optional sample validation outcome, known failures, residual risks, and next action.
- Archive completed implementation details in docs/WORKARCHIVE.md after verification.
- Keep sample PDFs, generated outputs, and build artifacts out of the commit.
- Record whether page-level mapping was exact, approximate, or unavailable for the validated sample.
Implementation Handoff
Files changed:
- src/pdf2md/text_fidelity.py
- src/pdf2md/ir.py
- src/pdf2md/metadata.py
- src/pdf2md/report.py
- src/pdf2md/conversion.py
- tests/test_text_fidelity.py
- tests/test_metadata.py
- tests/test_report.py
- tests/test_conversion.py
- ARCHITECTURE.md
- PLAN.md
- PROGRESS.md
- docs/WORKARCHIVE.md
- docs/V1IMPLEMENTATIONPLAN.md
Verification:
- uv run pytest tests/test_text_fidelity.py tests/test_metadata.py tests/test_report.py tests/test_conversion.py: passed 49 tests.
- uv run pytest: passed 198 tests with 1 optional skip.
Known failures:
- None in the default fast test suite.
Residual risks:
- Page-level Markdown mapping is only scored when credible. Multi-page Markdown without reliable page boundaries is reported as TEXT_PAGE_MAPPING_UNCERTAIN rather than guessed.
- Automatic body-text replacement remains out of scope and is not implemented.
- Optional real MinerU validation on the local 2007 Korean shell-structure sample was not run during implementation to avoid a long GPU conversion.
Future Sprint Boundary
A later sprint may implement controlled body-text replacement from pypdf text after Sprint 13 diagnostics show reliable thresholds. That future sprint must have its own contract and must preserve math, tables, figures, asset links, and Markdown structure from MinerU unless explicitly redesigned.