Step 2: page-preanalysis-contract
Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md
- /phases/0-harness-foundation/step0.md
- /phases/0-harness-foundation/step1.md
- /phases/0-harness-foundation/index.json
Task
Implement the lightweight page pre-analysis contract that defines what later conversion steps need to know before Marker runs.
This step should use PyMuPDF only for fast document/page inspection:
- page count
- text length or text density per page
- image count per page
- OCR candidate flag per page
- basic long-document chunk candidates using the 20-page target
The output should be typed using the models from Step 1.
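The inspection listed above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: `PageFacts`, `is_ocr_candidate`, `analyze_pdf`, and the `MIN_TEXT_CHARS` threshold are hypothetical names and values, and the real output types should come from the Step 1 models. PyMuPDF is imported inside the function so the pure decision logic can be tested without opening a PDF.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

# Hypothetical threshold: pages with fewer extractable characters than this
# count as having "little text". The real value is a project decision.
MIN_TEXT_CHARS = 50


@dataclass(frozen=True)
class PageFacts:
    """Stand-in for a Step 1 model; the field names are assumptions."""

    page_number: int  # 1-based
    text_length: int
    image_count: int
    ocr_candidate: bool


def is_ocr_candidate(text_length: int, image_count: int) -> bool:
    """Assumed deterministic rule: little text plus at least one image."""
    return text_length < MIN_TEXT_CHARS and image_count > 0


def analyze_pdf(pdf_path: Path) -> list[PageFacts]:
    """Collect per-page facts with PyMuPDF only; no OCR, Marker, or GPU code."""
    import fitz  # PyMuPDF, imported lazily so the helpers above stay dependency-free

    facts: list[PageFacts] = []
    # str() keeps non-ASCII (for example Korean) pathlib paths intact.
    with fitz.open(str(pdf_path)) as doc:
        for index in range(doc.page_count):
            page = doc.load_page(index)
            text_length = len(page.get_text("text").strip())
            image_count = len(page.get_images(full=True))
            facts.append(
                PageFacts(
                    page_number=index + 1,
                    text_length=text_length,
                    image_count=image_count,
                    ocr_candidate=is_ocr_candidate(text_length, image_count),
                )
            )
    return facts
```

Under this assumed rule a blank page (no text, no images) is not flagged, since there is nothing for OCR to recover. Whether that matches the sample metadata traits is something the tests should pin down.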
Sprint Contract
- Done means: given a PDF path, the pre-analysis API returns deterministic page-level facts and chunk candidates without running Marker, Nougat, OCR, or GPU code.
- Hard thresholds:
- Tests cover at least one text-heavy sample and one mixed/scanned-risk sample from samples/metadata.json.
- Tests cover Korean path handling through pathlib.
- OCR candidate logic is deterministic and documented by tests.
- Chunk candidates never exceed the document page count.
- Explicit conversion or Markdown rendering is not implemented here.
- Files owned:
- src/pdftomd/preanalysis.py
- model additions in src/pdftomd/models.py only if required
- tests/test_preanalysis.py
- PROGRESS.md
- phases/0-harness-foundation/index.json
- Dependencies:
- Step 0 sample metadata
- Step 1 package skeleton and models
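The chunk-candidate threshold in this contract (candidates never exceed the document page count, using the 20-page target) can be illustrated with a small sketch. `ChunkCandidate` and `chunk_candidates` are hypothetical names, and the 1-based inclusive page ranges are an assumption; the actual shape should follow the Step 1 models.

```python
from __future__ import annotations

from dataclasses import dataclass

TARGET_CHUNK_PAGES = 20  # the 20-page target named in the task


@dataclass(frozen=True)
class ChunkCandidate:
    """Hypothetical chunk model: 1-based, inclusive page range."""

    start_page: int
    end_page: int


def chunk_candidates(
    page_count: int, target: int = TARGET_CHUNK_PAGES
) -> list[ChunkCandidate]:
    """Split pages 1..page_count into consecutive ranges of at most `target` pages."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if target < 1:
        raise ValueError("target must be at least 1")

    chunks: list[ChunkCandidate] = []
    for start in range(1, page_count + 1, target):
        # Clamp the last chunk so it never runs past the final page.
        end = min(start + target - 1, page_count)
        chunks.append(ChunkCandidate(start_page=start, end_page=end))
    return chunks
```

For a 45-page document this yields the ranges 1-20, 21-40, and 41-45. The function depends only on the page count, so the same input always produces the same candidates.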
Acceptance Criteria
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_preanalysis.py
Verification
- Run the acceptance commands.
- Confirm PyMuPDF is the only PDF inspection dependency used in this step.
- Confirm the sample metadata traits and test expectations are consistent.
- Update PROGRESS.md with completed work, validation output, and next handoff.
- Update this phase index step to completed with a one-line summary, or to blocked/error with a concrete reason.
Do Not
- Do not call Marker, Nougat, Surya, torch, or OCR.
- Do not write conversion output under output/.
- Do not create resume cache or runtime state files.
- Do not implement reading-order reconstruction in this step.
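One way to keep the first prohibition above checkable is a static import guard that a test can run against the module source. This is only a sketch: `forbidden_imports` is a hypothetical helper, and the top-level module names in `FORBIDDEN_MODULES` are assumptions about how those packages are imported, so the set should be adjusted to the real import names (including any OCR library the project wants to exclude).

```python
from __future__ import annotations

import ast

# Assumed top-level import names for the tools this step must not call.
FORBIDDEN_MODULES = frozenset({"marker", "nougat", "surya", "torch"})


def forbidden_imports(source: str) -> set[str]:
    """Return the forbidden top-level modules imported anywhere in `source`."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports; they refer to this package, not a third party.
            names = [node.module] if node.module and node.level == 0 else []
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in FORBIDDEN_MODULES:
                found.add(top_level)
    return found
```

A test could read src/pdftomd/preanalysis.py with `Path.read_text(encoding="utf-8")` and assert that the result is an empty set. The check only covers static imports, so dynamic imports would need a separate review.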