# Step 2: page-preanalysis-contract

## Read First

- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md
- /phases/0-harness-foundation/step0.md
- /phases/0-harness-foundation/step1.md
- /phases/0-harness-foundation/index.json

## Task

Implement the lightweight page pre-analysis contract that decides what later conversion steps need to know before Marker runs.

This step should use PyMuPDF only for fast document/page inspection:

- page count
- text length or text density per page
- image count per page
- OCR candidate flag per page
- basic long-document chunk candidates using the 20-page target

The output should be typed using the models from Step 1.

## Sprint Contract

- Done means: given a PDF path, the pre-analysis API returns deterministic page-level facts and chunk candidates without running Marker, Nougat, OCR, or GPU code.
- Hard thresholds:
  - Tests cover at least one text-heavy sample and one mixed/scanned-risk sample from `samples/metadata.json`.
  - Tests cover Korean path handling through `pathlib`.
  - OCR candidate logic is deterministic and documented by tests.
  - Chunk candidates never exceed the document page count.
  - Explicit conversion or Markdown rendering is not implemented here.
- Files owned:
  - `src/pdftomd/preanalysis.py`
  - model additions in `src/pdftomd/models.py` only if required
  - `tests/test_preanalysis.py`
  - `PROGRESS.md`
  - `phases/0-harness-foundation/index.json`
- Dependencies:
  - Step 0 sample metadata
  - Step 1 package skeleton and models

## Acceptance Criteria

```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_preanalysis.py
```

## Verification

1. Run the acceptance commands.
2. Confirm PyMuPDF is the only PDF inspection dependency used in this step.
3. Confirm the sample metadata traits and test expectations are consistent.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.

## Do Not

- Do not call Marker, Nougat, Surya, torch, or OCR.
- Do not write conversion output under `output/`.
- Do not create resume cache or runtime state files.
- Do not implement reading-order reconstruction in this step.
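
## Example Sketch (non-normative)

The page facts, OCR candidate flag, and 20-page chunk candidates described under Task could be sketched roughly as follows. This is only an illustration under stated assumptions, not the required implementation: `PageFacts`, `chunk_candidates`, `inspect_pdf`, and the `MIN_TEXT_CHARS` threshold are hypothetical names and values, the real types should come from the Step 1 models, and the OCR rule shown (very little extractable text plus at least one image) is one possible deterministic rule that the tests would need to pin down.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

CHUNK_TARGET_PAGES = 20  # the 20-page target named in the Task section
MIN_TEXT_CHARS = 50      # ASSUMPTION: illustrative threshold, not from the spec


@dataclass(frozen=True)
class PageFacts:
    """Hypothetical stand-in for the page model defined in Step 1."""

    index: int        # 0-based page index
    text_length: int  # characters of extractable text on the page
    image_count: int  # images referenced by the page

    @property
    def ocr_candidate(self) -> bool:
        # ASSUMPTION: flag pages with almost no text but at least one image.
        return self.text_length < MIN_TEXT_CHARS and self.image_count > 0


def chunk_candidates(
    page_count: int, target: int = CHUNK_TARGET_PAGES
) -> list[tuple[int, int]]:
    """Return (start, end) ranges, 0-based with an exclusive end.

    The last range is clamped so no chunk exceeds the document page count.
    """
    if page_count < 0 or target < 1:
        raise ValueError("page_count must be >= 0 and target must be >= 1")
    return [
        (start, min(start + target, page_count))
        for start in range(0, page_count, target)
    ]


def inspect_pdf(path: Path) -> list[PageFacts]:
    """Collect page facts with PyMuPDF only (no Marker, OCR, or GPU code)."""
    import fitz  # PyMuPDF; imported lazily so the pure helpers work without it

    # Passing a pathlib.Path keeps non-ASCII (e.g. Korean) paths intact.
    with fitz.open(path) as doc:
        return [
            PageFacts(
                index=page.number,
                text_length=len(page.get_text().strip()),
                image_count=len(page.get_images(full=True)),
            )
            for page in doc
        ]
```

Keeping `chunk_candidates` and `ocr_candidate` free of any PDF dependency means `tests/test_preanalysis.py` can check their determinism with plain values, while `inspect_pdf` is exercised only against the samples listed in `samples/metadata.json`.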