baram2584/PDFToMD

김경종 dc11880140 modify pdftomd

2026-05-14 10:16:59 +09:00

Sprint 13 Contract: Text Layer Fidelity Diagnostics

Status: Implemented
Last updated: 2026-05-11

Objective

Add a local pypdf-based text fidelity diagnostic pass that compares source PDF text-layer extraction with MinerU-generated Markdown text on a per-page basis where page mapping is available.

The first priority is diagnosis, not automatic body-text replacement. This sprint should record enough evidence in metadata JSON and <stem>.report.md to identify pages where MinerU likely misrecognized Korean body text, especially missing Hangul syllables, unexpected CJK ideographs, and abnormal spacing. It may mark pypdf text as a future replacement candidate, but it must not replace Markdown body text in this sprint.

Current Precondition

MinerU 3.1.0 remains the only conversion engine.
Conversion runs through direct local mineru CLI execution only.
pypdf is already used by the project for local PDF chunk planning.
pdf2md convert writes Markdown, metadata JSON, and <stem>.report.md.
pdf2md recheck can regenerate metadata/report from an existing Markdown file.
Chunked conversion records original source page ranges in metadata engine_options.chunk.
The 2007 Korean shell-structure sample showed clear text fidelity problems:
- pypdf can extract more accurate Hangul from the digital text layer.
- MinerU Markdown can omit Hangul syllables or misrecognize headings/body text as unrelated CJK characters.
- The source text layer itself can contain abnormal spacing between Hangul syllables.

Touched Surfaces

Allowed during implementation:

src/pdf2md/text_fidelity.py
src/pdf2md/ir.py
src/pdf2md/metadata.py
src/pdf2md/report.py
src/pdf2md/conversion.py
tests/test_text_fidelity.py
tests/test_metadata.py
tests/test_report.py
tests/test_conversion.py
docs/V1IMPLEMENTATIONPLAN.md
PLAN.md
PROGRESS.md
docs/WORKARCHIVE.md after completion

Allowed only if needed for CLI/API wiring:

src/pdf2md/cli.py
tests/test_cli.py
README.md

Not allowed:

Replacing Markdown body text with pypdf text in this sprint.
Adding a second conversion engine or engine selector.
Adding remote OCR, hosted LLM/VLM, remote document parsing, --api-url, router mode, HTTP client backends, or remote OpenAI-compatible endpoints.
Mandatory default tests that require real MinerU, GPU, model files, network, Obsidian, or committed samples/.
Committing sample PDFs or generated outputs/.

Product Behavior

Text fidelity diagnostics should run automatically after MinerU Markdown normalization and local quality checks have produced the final Markdown candidate.

For each page that can be compared, metadata should record a compact diagnostic object with at least:

page_index: zero-based output page index.
source_page_number: one-based original PDF page number when known.
pypdf_text_available: whether pypdf extracted non-empty source text.
markdown_text_available: whether comparable Markdown text exists for the page.
pypdf_hangul_count: Hangul syllable count from pypdf text.
markdown_hangul_count: Hangul syllable count from Markdown text.
hangul_count_delta: markdown_hangul_count - pypdf_hangul_count.
hangul_count_ratio: Markdown Hangul count divided by pypdf Hangul count, or null when unavailable.
unexpected_cjk_count: count of CJK Unified Ideographs in Markdown that are suspicious in a page with Korean source text.
pypdf_hangul_spacing_anomaly_ratio: ratio of Hangul-to-Hangul whitespace breaks in pypdf text.
markdown_hangul_spacing_anomaly_ratio: ratio of Hangul-to-Hangul whitespace breaks in Markdown text.
text_similarity: normalized text similarity between pypdf text and Markdown text.
replacement_candidate: true only when pypdf text appears more reliable than Markdown text under conservative thresholds.
comparison_status: one of checked, source_text_missing, markdown_page_unavailable, or page_mapping_uncertain.
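The field list above could be modeled as an immutable record. The following is a minimal sketch, not the project's actual definition: the class name `PageTextFidelity`, the `to_metadata` helper, and the choice to derive `hangul_count_delta` and `hangul_count_ratio` as properties are all assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Status values listed in the contract.
ALLOWED_STATUSES = {
    "checked",
    "source_text_missing",
    "markdown_page_unavailable",
    "page_mapping_uncertain",
}


@dataclass(frozen=True)
class PageTextFidelity:
    """Hypothetical per-page record; field names follow the contract."""

    page_index: int                    # zero-based output page index
    source_page_number: Optional[int]  # one-based, None when unknown
    pypdf_text_available: bool
    markdown_text_available: bool
    pypdf_hangul_count: int
    markdown_hangul_count: int
    unexpected_cjk_count: int
    pypdf_hangul_spacing_anomaly_ratio: float
    markdown_hangul_spacing_anomaly_ratio: float
    text_similarity: Optional[float]
    replacement_candidate: bool
    comparison_status: str

    def __post_init__(self) -> None:
        if self.comparison_status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown comparison_status: {self.comparison_status}")

    @property
    def hangul_count_delta(self) -> int:
        return self.markdown_hangul_count - self.pypdf_hangul_count

    @property
    def hangul_count_ratio(self) -> Optional[float]:
        # null in JSON when the source page has no Hangul to compare against
        if self.pypdf_hangul_count == 0:
            return None
        return self.markdown_hangul_count / self.pypdf_hangul_count

    def to_metadata(self) -> dict:
        """Return a JSON-serializable dict including the derived fields."""
        data = asdict(self)
        data["hangul_count_delta"] = self.hangul_count_delta
        data["hangul_count_ratio"] = self.hangul_count_ratio
        return data
```

Deriving the delta and ratio from the two counts keeps the stored values from drifting out of sync; a real implementation may prefer to store them explicitly.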

Metadata summary should include:

text_fidelity_checked_page_count.
text_fidelity_low_page_count.
text_fidelity_unexpected_cjk_count.
text_fidelity_replacement_candidate_page_count.
text_fidelity_page_mapping_uncertain_count.

Report Markdown should add a dedicated ## Text Fidelity section showing:

checked page count and low-fidelity page count.
total unexpected CJK count.
replacement candidate page count.
pages with low similarity.
pages with high unexpected CJK count.
pages where page-level comparison could not be trusted.

Warning behavior:

Add TEXT_LAYER_AVAILABLE as an info warning when pypdf source text is available and diagnostics run.
Add TEXT_FIDELITY_LOW as a warning for pages below the fidelity threshold.
Add UNEXPECTED_CJK_IN_KOREAN_TEXT as a warning when suspicious CJK ideographs appear in Markdown for pages with Korean source text.
Add HANGUL_SPACING_SUSPECT as an info-level or warning-level signal when the pypdf text or the Markdown text has a high Hangul spacing anomaly ratio.
Add TEXT_PAGE_MAPPING_UNCERTAIN as an info warning when page-level Markdown mapping is not reliable enough for per-page metrics.
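One way to turn a per-page diagnostic into the warnings listed above is sketched below. The warning codes come from this contract; the function name `page_warnings`, the dict shape of a warning, and both numeric thresholds are assumptions, since the contract does not fix threshold values. The sketch also emits TEXT_LAYER_AVAILABLE per page, which may differ from how the project scopes that warning.

```python
from typing import Dict, List

# Hypothetical thresholds; the contract does not specify numbers.
LOW_SIMILARITY = 0.6
HIGH_SPACING_ANOMALY = 0.5


def page_warnings(page: Dict) -> List[Dict]:
    """Map one page diagnostic (a dict with the contract's field names) to warnings."""
    index = page["page_index"]

    def warn(code: str, severity: str) -> Dict:
        return {"code": code, "severity": severity, "page_index": index}

    # Uncertain mapping: report it and do not pretend page metrics are exact.
    if page["comparison_status"] == "page_mapping_uncertain":
        return [warn("TEXT_PAGE_MAPPING_UNCERTAIN", "info")]
    if not page["pypdf_text_available"]:
        return []

    warnings = [warn("TEXT_LAYER_AVAILABLE", "info")]
    similarity = page.get("text_similarity")
    if similarity is not None and similarity < LOW_SIMILARITY:
        warnings.append(warn("TEXT_FIDELITY_LOW", "warning"))
    # CJK ideographs are only suspicious when the source page has Korean text.
    if page["pypdf_hangul_count"] > 0 and page["unexpected_cjk_count"] > 0:
        warnings.append(warn("UNEXPECTED_CJK_IN_KOREAN_TEXT", "warning"))
    worst_spacing = max(
        page["pypdf_hangul_spacing_anomaly_ratio"],
        page["markdown_hangul_spacing_anomaly_ratio"],
    )
    if worst_spacing >= HIGH_SPACING_ANOMALY:
        warnings.append(warn("HANGUL_SPACING_SUSPECT", "info"))
    return warnings
```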

Replacement candidate policy:

replacement_candidate is a diagnostic marker only.
It must not change Markdown output.
It should be true only when:
- pypdf source text is available,
- pypdf Hangul count is materially higher than Markdown Hangul count or Markdown has suspicious CJK ideographs,
- pypdf spacing anomalies are not so severe that the source text layer is clearly unusable,
- page mapping is credible, that is, comparison_status is checked.
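The four conditions above can be read as a single conservative predicate. This is a sketch under stated assumptions: the function name and both keyword thresholds (`min_hangul_gain`, `max_source_spacing_anomaly`) are hypothetical, and "materially higher" is interpreted here as a simple multiplicative margin.

```python
from typing import Dict


def is_replacement_candidate(
    page: Dict,
    *,
    min_hangul_gain: float = 1.2,             # hypothetical "materially higher" margin
    max_source_spacing_anomaly: float = 0.5,  # hypothetical usability cutoff
) -> bool:
    """Diagnostic marker only; callers must never rewrite Markdown based on it."""
    # Page mapping must be credible.
    if page["comparison_status"] != "checked":
        return False
    # pypdf source text must be available.
    if not page["pypdf_text_available"]:
        return False
    # A source layer with severe spacing anomalies is treated as unusable.
    if page["pypdf_hangul_spacing_anomaly_ratio"] > max_source_spacing_anomaly:
        return False
    materially_more_hangul = (
        page["pypdf_hangul_count"] > page["markdown_hangul_count"] * min_hangul_gain
    )
    has_suspicious_cjk = page["unexpected_cjk_count"] > 0
    return materially_more_hangul or has_suspicious_cjk
```

Every early return errs toward `False`, which matches the policy that the marker should be true only under conservative thresholds.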

Architecture Plan

WP13.1: Text Fidelity Module

Actions:

Add src/pdf2md/text_fidelity.py.
Use pypdf.PdfReader to extract source page text locally.
Define immutable result records for per-page metrics and summary metrics.
Strip Markdown syntax, image links, fenced code, inline code, and math spans before text comparison.
Normalize text for comparison without mutating the output Markdown:
- Unicode NFKC normalization for comparison strings only.
- collapse whitespace for similarity only.
- keep raw-count metrics independent enough to expose spacing anomalies.
Count Hangul syllables with the Hangul syllable block.
Count suspicious CJK ideographs with CJK Unified Ideograph ranges, excluding Hangul ranges.
Compute similarity with a deterministic standard-library algorithm such as difflib.SequenceMatcher.
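The counting and similarity actions above can be sketched with the standard library alone. The helper names are illustrative. Two points are assumptions rather than contract text: the CJK pattern covers the main CJK Unified Ideographs block plus Extension A, and the spacing anomaly ratio is defined as whitespace breaks between two Hangul syllables divided by the number of adjacent syllable positions.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Hangul Syllables block: U+AC00..U+D7A3.
HANGUL_SYLLABLE = re.compile(r"[\uAC00-\uD7A3]")
# CJK Unified Ideographs (U+4E00..U+9FFF) plus Extension A (U+3400..U+4DBF).
# Which ranges count as "suspicious" is an assumption of this sketch.
CJK_IDEOGRAPH = re.compile(r"[\u3400-\u4DBF\u4E00-\u9FFF]")
# Whitespace run that sits directly between two Hangul syllables.
HANGUL_GAP = re.compile(r"(?<=[\uAC00-\uD7A3])\s+(?=[\uAC00-\uD7A3])")


def count_hangul(text: str) -> int:
    return len(HANGUL_SYLLABLE.findall(text))


def count_cjk_ideographs(text: str) -> int:
    return len(CJK_IDEOGRAPH.findall(text))


def hangul_spacing_anomaly_ratio(text: str) -> float:
    """Hangul-to-Hangul whitespace breaks per adjacent syllable position (raw text)."""
    hangul = count_hangul(text)
    if hangul < 2:
        return 0.0
    return len(HANGUL_GAP.findall(text)) / (hangul - 1)


def normalize_for_comparison(text: str) -> str:
    """NFKC-normalize and collapse whitespace; used for similarity only."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def text_similarity(source_text: str, markdown_text: str) -> float:
    a = normalize_for_comparison(source_text)
    b = normalize_for_comparison(markdown_text)
    if not a and not b:
        return 1.0
    # autojunk=False keeps the score independent of the popularity heuristic.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()
```

Under this definition, ordinary Korean prose has a nonzero ratio because of normal word spacing, so any threshold would need to sit well above the typical value. For example, `hangul_spacing_anomaly_ratio("구 조 물")` is 1.0, while `hangul_spacing_anomaly_ratio("구조물 해석")` is 0.25.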

Expected output:

Pure local helper functions that are independently testable and do not call MinerU, network services, or the filesystem except for reading the source PDF.
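The Markdown stripping step from the actions above might look like the following regex-based sketch. It is an approximation, not a Markdown parser, and the function name `strip_markdown_for_comparison` is hypothetical. Known limitations: a dollar sign used as currency can be mistaken for inline math, and underscores inside words are removed along with emphasis markers. The result is used for comparison only and never written back to the output Markdown.

```python
import re

FENCED_CODE = re.compile(r"^```.*?^```[ \t]*$", re.DOTALL | re.MULTILINE)
DISPLAY_MATH = re.compile(r"\$\$.*?\$\$", re.DOTALL)
INLINE_MATH = re.compile(r"\$[^$\n]+\$")
INLINE_CODE = re.compile(r"`[^`\n]+`")
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\([^)]*\)")
TEXT_LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")
# Heading, list, and blockquote markers at the start of a line.
LINE_MARKER = re.compile(r"^[ \t]{0,3}(?:#{1,6}|[-*+]|\d+\.|>)[ \t]+", re.MULTILINE)
EMPHASIS = re.compile(r"[*_]+")


def strip_markdown_for_comparison(markdown: str) -> str:
    """Return comparable body text; order matters (code and math go first)."""
    text = FENCED_CODE.sub(" ", markdown)
    text = DISPLAY_MATH.sub(" ", text)
    text = INLINE_MATH.sub(" ", text)
    text = INLINE_CODE.sub(" ", text)
    text = IMAGE_LINK.sub(" ", text)      # drop images entirely, including alt text
    text = TEXT_LINK.sub(r"\1", text)     # keep the visible link label
    text = LINE_MARKER.sub("", text)
    text = EMPHASIS.sub("", text)
    return text.replace("|", " ")         # table cell separators
```

Whitespace is deliberately left uncollapsed here so that the spacing anomaly metric can still see the raw gaps.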

WP13.2: Page Mapping Boundary

Actions:

Derive source page numbers from engine_options.chunk when chunking is active.
Use project page records and any reliable raw structured page count to decide whether page-level comparison is possible.
If Markdown cannot be mapped to pages reliably, produce TEXT_PAGE_MAPPING_UNCERTAIN and avoid pretending per-page metrics are exact.
For the initial implementation, allow a conservative fallback for single-page mocked outputs and chunk outputs where one Markdown file corresponds to a known source page range.

Expected output:

Page-level diagnostics are only marked checked when the mapping is credible.
Ambiguous cases are visible in metadata/report instead of producing misleading page metrics.
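The conservative mapping rule can be sketched as follows. The shape of `engine_options.chunk` is not shown in this contract, so the keys `start_page` and `end_page` (one-based, inclusive) are assumptions, as is the function name. The sketch only covers the case where the number of output pages can be compared with a known source range.

```python
from typing import Dict, List, Optional, Tuple


def map_source_pages(
    chunk: Optional[Dict], output_page_count: int
) -> Tuple[List[Optional[int]], str]:
    """Return (one-based source page numbers, comparison_status).

    `chunk` stands in for metadata engine_options.chunk; its keys
    "start_page" and "end_page" are hypothetical.
    """
    if output_page_count <= 0:
        return [], "markdown_page_unavailable"
    if chunk is None:
        # No chunking: output page i corresponds to source page i + 1.
        return list(range(1, output_page_count + 1)), "checked"
    start, end = chunk["start_page"], chunk["end_page"]
    if end - start + 1 != output_page_count:
        # Counts disagree: surface the uncertainty instead of guessing.
        return [None] * output_page_count, "page_mapping_uncertain"
    return list(range(start, end + 1)), "checked"
```

For a chunk covering source pages 6 to 10 and five output pages, this yields `[6, 7, 8, 9, 10]` with status `checked`; with three output pages it yields three `None` entries and `page_mapping_uncertain`.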

WP13.3: Metadata And Warning Integration

Actions:

Add warning codes in src/pdf2md/ir.py.
Add text fidelity fields to metadata without changing existing top-level fields used by current tests.
Extend build_summary() to include text fidelity summary counts when diagnostics are present.
Ensure warnings retain page_index where available.
Preserve JSON serializability and deterministic key ordering on write.

Expected output:

Metadata contains compact page-level text fidelity diagnostics and summary counts.
Existing metadata consumers remain compatible.
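A minimal sketch of the summary aggregation and deterministic serialization follows. The summary key names are the ones listed in this contract; the function name, the `low_similarity` threshold, and the decision to count only `checked` pages toward the low-fidelity and CJK totals are assumptions.

```python
import json
from typing import Dict, List


def text_fidelity_summary(pages: List[Dict], low_similarity: float = 0.6) -> Dict[str, int]:
    """Aggregate per-page diagnostics (dicts with the contract's field names)."""
    checked = [p for p in pages if p["comparison_status"] == "checked"]
    return {
        "text_fidelity_checked_page_count": len(checked),
        "text_fidelity_low_page_count": sum(
            1
            for p in checked
            if p["text_similarity"] is not None and p["text_similarity"] < low_similarity
        ),
        "text_fidelity_unexpected_cjk_count": sum(p["unexpected_cjk_count"] for p in checked),
        "text_fidelity_replacement_candidate_page_count": sum(
            1 for p in checked if p["replacement_candidate"]
        ),
        "text_fidelity_page_mapping_uncertain_count": sum(
            1 for p in pages if p["comparison_status"] == "page_mapping_uncertain"
        ),
    }


def dump_metadata(metadata: Dict) -> str:
    # sort_keys gives deterministic key ordering; ensure_ascii=False keeps
    # Hangul readable in the written JSON.
    return json.dumps(metadata, ensure_ascii=False, sort_keys=True, indent=2)
```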

WP13.4: Report Integration

Actions:

Extend render_report() to render a ## Text Fidelity section when diagnostics exist.
Keep the report derived from metadata and quality results.
Include low-fidelity pages and replacement candidate pages in human-readable form.
Do not include full extracted page text in the report.

Expected output:

A human can identify which pages need attention without opening metadata JSON first.
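A report section derived only from metadata could be rendered along these lines. This is a sketch, not the project's `render_report()`: the function name, the bullet wording, and the `low_similarity` and `high_cjk` thresholds are assumptions. Only page numbers and counts are emitted, never extracted page text.

```python
from typing import Dict, List


def render_text_fidelity_section(
    summary: Dict,
    pages: List[Dict],
    low_similarity: float = 0.6,  # hypothetical threshold
    high_cjk: int = 5,            # hypothetical threshold
) -> str:
    def label(page: Dict) -> str:
        number = page.get("source_page_number")
        if number is not None:
            return f"page {number}"
        return f"output page index {page['page_index']}"

    def joined(items: List[str]) -> str:
        return ", ".join(items) if items else "none"

    checked = [p for p in pages if p["comparison_status"] == "checked"]
    low = [
        label(p)
        for p in checked
        if p["text_similarity"] is not None and p["text_similarity"] < low_similarity
    ]
    cjk = [label(p) for p in checked if p["unexpected_cjk_count"] >= high_cjk]
    candidates = [label(p) for p in checked if p["replacement_candidate"]]
    uncertain = [label(p) for p in pages if p["comparison_status"] == "page_mapping_uncertain"]

    lines = [
        "## Text Fidelity",
        "",
        f"- Checked pages: {summary['text_fidelity_checked_page_count']}",
        f"- Low-fidelity pages: {summary['text_fidelity_low_page_count']}",
        f"- Unexpected CJK characters: {summary['text_fidelity_unexpected_cjk_count']}",
        f"- Replacement candidate pages: {joined(candidates)}",
        f"- Pages with low similarity: {joined(low)}",
        f"- Pages with high unexpected CJK count: {joined(cjk)}",
        f"- Pages with uncertain mapping: {joined(uncertain)}",
    ]
    return "\n".join(lines) + "\n"
```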

WP13.5: Conversion And Recheck Integration

Actions:

Run text fidelity diagnostics during convert after final Markdown preparation and before metadata/report writing.
Run the same diagnostics during recheck when the original source PDF path still exists.
If the source PDF is missing during recheck, preserve existing behavior and add a clear nonfatal warning or omit diagnostics.
Keep chunked conversion page ranges tied to original source page numbers.

Expected output:

Fresh conversions and rechecks can produce text fidelity diagnostics without rerunning MinerU.
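The nonfatal handling of a missing source PDF during recheck could be sketched as below. The function name, the injected `diagnose` callable, and the warning code `TEXT_FIDELITY_SOURCE_MISSING` are all hypothetical; this contract allows either a clear nonfatal warning or omitting diagnostics, and the sketch does both by returning `None` for the diagnostics.

```python
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple


def recheck_text_fidelity(
    source_pdf: Optional[str],
    diagnose: Callable[[Path], List[Dict]],
) -> Tuple[Optional[List[Dict]], List[Dict]]:
    """Return (page diagnostics or None, extra warnings). Never raises for a missing PDF."""
    if not source_pdf or not Path(source_pdf).is_file():
        warning = {
            "code": "TEXT_FIDELITY_SOURCE_MISSING",  # hypothetical code, not in the contract
            "severity": "info",
            "message": "Source PDF not found; text fidelity diagnostics skipped.",
        }
        return None, [warning]
    # Same diagnostic pass as convert; MinerU is not rerun.
    return diagnose(Path(source_pdf)), []
```

Passing the diagnostic pass in as a callable keeps this boundary testable with a fake, in line with the default fast tests described for this sprint.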

WP13.6: Tests

Default fast tests:

pypdf extraction boundary handles generated local PDFs without requiring real MinerU or sample files.
Hangul count, unexpected CJK count, and spacing anomaly ratio helpers use direct Korean/CJK strings.
Markdown text stripping ignores math, image links, fenced code, and inline code.
Similarity score is deterministic for equivalent and degraded text.
Metadata contains text fidelity summary fields when diagnostics are present.
Report contains ## Text Fidelity and page-level warning summaries.
Conversion with a fake adapter records TEXT_FIDELITY_LOW when Markdown omits Hangul from a source-text PDF.
Recheck reruns diagnostics when source PDF exists.
Missing source PDF during recheck remains nonfatal.

Optional local validation:

Convert the local 2007 Korean shell-structure sample with chunking into the git-ignored outputs\ directory.
Confirm the report flags the pages where the previous output had missing Hangul and unexpected CJK characters.
Do not commit sample PDFs or generated outputs.

Acceptance Criteria

Default tests pass without real MinerU, GPU, model files, network, Obsidian, or samples/.
Diagnostics are local-only and use pypdf source text only from the local PDF.
Metadata JSON records page-level text fidelity metrics where page mapping is credible.
Metadata summary records aggregate text fidelity counts.
<stem>.report.md includes a text fidelity section when diagnostics exist.
Suspicious Korean text loss produces structured warnings with page provenance where available.
Replacement candidate markers are recorded only as diagnostics and do not alter Markdown content.
Existing math, asset, table, chunk, strict-local, and UI behavior remains unchanged.

Hard Failure Criteria

Markdown body text is replaced automatically in this sprint.
Page-level metrics are reported as exact when page mapping is uncertain.
Diagnostics upload PDFs, page text, Markdown, or extracted text to any remote service.
Default tests require MinerU, CUDA/GPU, model files, network, Obsidian, or samples/.
Existing output schema fields are removed or renamed.
samples/, generated outputs/, or dist/pdf2md-ui.exe are committed.

Verification Commands

uv run pytest tests/test_text_fidelity.py tests/test_metadata.py tests/test_report.py tests/test_conversion.py
uv run pytest
git diff --check
git status --short --untracked-files=all

Optional local validation:

$env:MINERU_MODEL_SOURCE='local'
$pdf = (Get-ChildItem samples -Filter '2007*.pdf' | Select-Object -First 1).FullName
uv run pdf2md convert $pdf --out outputs\sprint13-2007-text-fidelity --overwrite --chunk-pages 5

Handoff Requirements

After implementation:

Update PROGRESS.md with files changed, commands run, test outcomes, optional sample validation outcome, known failures, residual risks, and next action.
Archive completed implementation details in docs/WORKARCHIVE.md after verification.
Keep sample PDFs, generated outputs, and build artifacts out of the commit.
Record whether page-level mapping was exact, approximate, or unavailable for the validated sample.

Implementation Handoff

Files changed:

src/pdf2md/text_fidelity.py
src/pdf2md/ir.py
src/pdf2md/metadata.py
src/pdf2md/report.py
src/pdf2md/conversion.py
tests/test_text_fidelity.py
tests/test_metadata.py
tests/test_report.py
tests/test_conversion.py
ARCHITECTURE.md
PLAN.md
PROGRESS.md
docs/WORKARCHIVE.md
docs/V1IMPLEMENTATIONPLAN.md

Verification:

uv run pytest tests/test_text_fidelity.py tests/test_metadata.py tests/test_report.py tests/test_conversion.py: passed 49 tests.
uv run pytest: passed 198 tests with 1 optional skip.

Known failures:

None in the default fast test suite.

Residual risks:

Page-level Markdown mapping is only scored when credible. Multi-page Markdown without reliable page boundaries is reported as TEXT_PAGE_MAPPING_UNCERTAIN rather than guessed.
Automatic body-text replacement remains out of scope and is not implemented.
Optional real MinerU validation on the local 2007 Korean shell-structure sample was not run during implementation to avoid a long GPU conversion.

Future Sprint Boundary

A later sprint may implement controlled body-text replacement from pypdf text after Sprint 13 diagnostics show reliable thresholds. That future sprint must have its own contract and must preserve math, tables, figures, asset links, and Markdown structure from MinerU unless explicitly redesigned.
