PDFToMD/phases/2-marker-adapter/step1.md
김경종 7e985ae94a add files
2026-04-30 17:05:19 +09:00

Step 1: ocr-plan-handoff

Read First

  • /AGENTS.md
  • /PLAN.md
  • /PROGRESS.md
  • /docs/HARNESS.md
  • /docs/IMPLEMENTATION_PLAN.md
  • /docs/CONVERSION_POLICY.md
  • /phases/0-harness-foundation/step2.md
  • /phases/2-marker-adapter/step0.md

Task

Connect PyMuPDF page pre-analysis results to the Marker adapter as an OCR/layout handoff plan.

The goal is to preserve page-level OCR decisions without making the entire document scan-only or text-only.
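One possible shape for such a handoff plan, as a minimal sketch. It assumes the pre-analysis can report an extracted character count per page; `PageMode`, `PageDecision`, `OcrPlan`, `build_plan`, and the `min_chars` threshold are hypothetical names and values for illustration, not existing project code.

```python
from dataclasses import dataclass, field
from enum import Enum


class PageMode(Enum):
    TEXT = "text"  # page has a usable text layer
    OCR = "ocr"    # page looks scanned and is an OCR candidate


@dataclass(frozen=True)
class PageDecision:
    page_index: int  # 0-based page index
    mode: PageMode


@dataclass
class OcrPlan:
    """Page-level OCR decisions; hypothetical structure, not project code."""

    decisions: list[PageDecision] = field(default_factory=list)

    def ocr_pages(self) -> list[int]:
        # Only the pages flagged for OCR, in page order.
        return sorted(d.page_index for d in self.decisions if d.mode is PageMode.OCR)

    def is_mixed(self) -> bool:
        # True when the document contains both text and OCR pages.
        return len({d.mode for d in self.decisions}) > 1


def build_plan(page_char_counts: list[int], min_chars: int = 20) -> OcrPlan:
    """Classify each page independently from its extracted character count.

    `min_chars` is an assumed threshold; a real pre-analysis may use
    richer signals such as image coverage.
    """
    decisions = [
        PageDecision(i, PageMode.TEXT if count >= min_chars else PageMode.OCR)
        for i, count in enumerate(page_char_counts)
    ]
    return OcrPlan(decisions)
```

With counts `[500, 0, 3, 900]`, `build_plan(...).ocr_pages()` returns `[1, 2]` and `is_mixed()` is `True`, so the decision stays per page instead of collapsing to one document-wide mode.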

Sprint Contract

  • Done means: the adapter accepts page-level OCR candidates and passes the relevant intent into Marker configuration or records an explicit unsupported-path fallback.
  • Hard thresholds: OCR decisions stay page-aware; PyMuPDF remains pre-analysis only; no OCR logs are inserted into Markdown.
  • Files owned: src/pdftomd/marker_adapter.py, src/pdftomd/preanalysis.py if needed, tests, PROGRESS.md, phase index.
  • Dependencies: Phase 0 pre-analysis and Step 0 Marker adapter.
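The "done" condition above could be sketched as follows. This is an illustration under stated assumptions: `HandoffResult`, `plan_to_handoff`, the `supports_page_level_ocr` flag, and the `ocr_page_indices` config key are all hypothetical and do not correspond to a confirmed Marker option. Fallback notes are kept in a separate field so they never reach the Markdown output.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffResult:
    marker_config: dict                                       # options the adapter would pass on
    fallback_notes: list[str] = field(default_factory=list)   # side channel, never Markdown


def plan_to_handoff(
    ocr_pages: list[int],
    page_count: int,
    supports_page_level_ocr: bool,
) -> HandoffResult:
    """Translate page-level OCR candidates into adapter configuration.

    Hypothetical sketch: `ocr_page_indices` is a placeholder key,
    not a documented Marker setting.
    """
    pages = sorted(set(ocr_pages))
    out_of_range = [p for p in pages if not 0 <= p < page_count]
    if out_of_range:
        raise ValueError(f"OCR candidates outside document: {out_of_range}")

    if not pages:
        # Nothing to hand off; leave the configuration untouched.
        return HandoffResult(marker_config={})

    if supports_page_level_ocr:
        # Pass the intent through page by page, never document-wide.
        return HandoffResult(marker_config={"ocr_page_indices": pages})

    # Explicit unsupported-path fallback: record it instead of silently
    # forcing OCR on every page.
    note = f"page-level OCR unsupported; OCR intent for pages {pages} not applied"
    return HandoffResult(marker_config={}, fallback_notes=[note])
```

Keeping the fallback as recorded data rather than a log line makes it straightforward to assert on in tests and to keep out of the converted document.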

Acceptance Criteria

python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests

Verification

  1. Run the acceptance commands.
  2. Confirm the tests cover a sample that mixes text-layer pages and scanned pages.
  3. Update PROGRESS.md and this phase index.
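A mixed-sample check for the verification steps above might look like the pytest-style sketch below. `select_ocr_pages` and `render_markdown` are hypothetical stand-ins for the real pre-analysis and conversion output, and the sample is simulated with a list of booleans instead of an actual PDF fixture.

```python
def select_ocr_pages(has_text_layer: list[bool]) -> list[int]:
    """Hypothetical stand-in: pages without a text layer become OCR candidates."""
    return [i for i, has_text in enumerate(has_text_layer) if not has_text]


def render_markdown(page_texts: list[str]) -> str:
    """Hypothetical stand-in for conversion output: page text only, no diagnostics."""
    return "\n\n".join(text for text in page_texts if text)


def test_mixed_document_stays_page_aware():
    # Simulated sample: pages 0 and 2 have text, pages 1 and 3 are scans.
    has_text_layer = [True, False, True, False]
    ocr_pages = select_ocr_pages(has_text_layer)
    assert ocr_pages == [1, 3]
    # Some pages need OCR, but not all of them: no document-wide OCR.
    assert 0 < len(ocr_pages) < len(has_text_layer)


def test_ocr_notes_stay_out_of_markdown():
    notes = ["page 1: OCR candidate"]
    markdown = render_markdown(["Intro text", "", "Body text"])
    assert all(note not in markdown for note in notes)
```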

Do Not

  • Do not force document-wide OCR when only selected pages need OCR.
  • Do not implement reading-order fixes here.
  • Do not add a second primary parser.