PDFToMD/phases/0-harness-foundation/step2.md
김경종 7e985ae94a add files
2026-04-30 17:05:19 +09:00

Step 2: page-preanalysis-contract

Read First

  • /AGENTS.md
  • /PLAN.md
  • /PROGRESS.md
  • /docs/HARNESS.md
  • /docs/ARCHITECTURE.md
  • /docs/CONVERSION_POLICY.md
  • /docs/ADR.md
  • /docs/TOOLCHAIN.md
  • /phases/0-harness-foundation/step0.md
  • /phases/0-harness-foundation/step1.md
  • /phases/0-harness-foundation/index.json

Task

Implement the lightweight page pre-analysis contract that decides what later conversion steps need to know before Marker runs.

This step should use PyMuPDF alone for fast document and page inspection, collecting:

  • page count
  • text length or text density per page
  • image count per page
  • OCR candidate flag per page
  • basic long-document chunk candidates using the 20-page target

The output should be typed using the models from Step 1.
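
As a rough illustration of the page-level facts listed above, the following sketch collects them with PyMuPDF. The `PageFacts` dataclass is a hypothetical stand-in for the Step 1 models, which are not shown here, and `MIN_TEXT_CHARS`, `is_ocr_candidate`, and `inspect_pages` are assumed names, not the project's API. The OCR rule (images present, almost no extractable text) is one possible deterministic heuristic, not the required one.

```python
from dataclasses import dataclass
from pathlib import Path

# Assumed threshold for "almost no text"; the real value belongs in the tests.
MIN_TEXT_CHARS = 50


@dataclass(frozen=True)
class PageFacts:
    """Hypothetical stand-in for the Step 1 page model."""

    page_number: int     # 1-based page number
    text_chars: int      # length of extracted text after stripping whitespace
    image_count: int     # number of images referenced by the page
    ocr_candidate: bool  # True when the page probably needs OCR later


def is_ocr_candidate(text_chars: int, image_count: int) -> bool:
    """Flag pages that carry images but almost no extractable text."""
    return image_count > 0 and text_chars < MIN_TEXT_CHARS


def inspect_pages(pdf_path: Path) -> list[PageFacts]:
    """Collect page-level facts with PyMuPDF only (no Marker, OCR, or GPU)."""
    try:
        import pymupdf  # import name used by newer PyMuPDF releases
    except ImportError:
        import fitz as pymupdf  # legacy import name for PyMuPDF

    facts = []
    with pymupdf.open(str(pdf_path)) as doc:
        for index, page in enumerate(doc):
            text_chars = len(page.get_text("text").strip())
            image_count = len(page.get_images(full=True))
            facts.append(
                PageFacts(
                    page_number=index + 1,
                    text_chars=text_chars,
                    image_count=image_count,
                    ocr_candidate=is_ocr_candidate(text_chars, image_count),
                )
            )
    return facts
```

Keeping the OCR decision in a pure function means tests can pin its behavior without opening a PDF, and the page count is simply `len(inspect_pages(path))`.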

Sprint Contract

  • Done means: given a PDF path, the pre-analysis API returns deterministic page-level facts and chunk candidates without running Marker, Nougat, OCR, or GPU code.
  • Hard thresholds:
    • Tests cover at least one text-heavy sample and one mixed/scanned-risk sample from samples/metadata.json.
    • Tests cover Korean path handling through pathlib.
    • OCR candidate logic is deterministic and documented by tests.
    • Chunk candidates never exceed the document page count.
    • Explicit conversion or Markdown rendering is not implemented here.
  • Files owned:
    • src/pdftomd/preanalysis.py
    • model additions in src/pdftomd/models.py only if required
    • tests/test_preanalysis.py
    • PROGRESS.md
    • phases/0-harness-foundation/index.json
  • Dependencies:
    • Step 0 sample metadata
    • Step 1 package skeleton and models
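
The chunk-candidate threshold above can be sketched as a pure function. This is a minimal sketch that assumes fixed-size splitting at the 20-page target and 1-based inclusive page ranges. The name `chunk_candidates` and the tuple return type are hypothetical; the real output would use the Step 1 models.

```python
def chunk_candidates(page_count: int, target: int = 20) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) ranges of at most `target` pages.

    The last range is clamped to `page_count`, so no candidate ever
    refers to a page beyond the end of the document.
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if target < 1:
        raise ValueError("target must be at least 1")
    return [
        (start, min(start + target - 1, page_count))
        for start in range(1, page_count + 1, target)
    ]
```

For a 45-page document this yields `[(1, 20), (21, 40), (41, 45)]`. Whether a document shorter than the target should produce a single candidate or none at all is a policy choice this sketch does not settle.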

Acceptance Criteria

python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_preanalysis.py

Verification

  1. Run the acceptance commands.
  2. Confirm PyMuPDF is the only PDF inspection dependency used in this step.
  3. Confirm the sample metadata traits and test expectations are consistent.
  4. Update PROGRESS.md with completed work, validation output, and next handoff.
  5. Update this step's entry in phases/0-harness-foundation/index.json to completed with a one-line summary, or to blocked/error with a concrete reason.

Do Not

  • Do not call Marker, Nougat, Surya, torch, or OCR.
  • Do not write conversion output under output/.
  • Do not create resume cache or runtime state files.
  • Do not implement reading-order reconstruction in this step.
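
One way to check the first prohibition mechanically is to scan a module's source for forbidden imports. The sketch below uses the standard `ast` module. The names in `FORBIDDEN_MODULES` are assumed top-level import names for Marker, Nougat, Surya, and torch, and `forbidden_imports` is a hypothetical helper. It only sees static absolute `import` statements, so dynamic imports and OCR invoked through other packages are out of its reach.

```python
import ast

# Assumed top-level import names for the tools this step must not call.
FORBIDDEN_MODULES = frozenset({"marker", "nougat", "surya", "torch"})


def forbidden_imports(source: str) -> set[str]:
    """Return the forbidden top-level modules imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x.y import z"; relative imports are skipped.
            roots = [node.module.split(".")[0]]
        else:
            continue
        found.update(root for root in roots if root in FORBIDDEN_MODULES)
    return found
```

A test could read `src/pdftomd/preanalysis.py` with `Path.read_text(encoding="utf-8")` and assert that `forbidden_imports` returns an empty set.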