add files

2026-04-30 17:05:19 +09:00
parent f3e01b5a8c
commit 7e985ae94a
135 changed files with 41205 additions and 0 deletions
@@ -0,0 +1,63 @@
+# Step 2: page-preanalysis-contract
+
+## Read First
+- /AGENTS.md
+- /PLAN.md
+- /PROGRESS.md
+- /docs/HARNESS.md
+- /docs/ARCHITECTURE.md
+- /docs/CONVERSION_POLICY.md
+- /docs/ADR.md
+- /docs/TOOLCHAIN.md
+- /phases/0-harness-foundation/step0.md
+- /phases/0-harness-foundation/step1.md
+- /phases/0-harness-foundation/index.json
+
+## Task
+Implement the lightweight page pre-analysis contract that records what later conversion steps need to know before Marker runs.
+
+This step should use PyMuPDF only for fast document/page inspection:
+- page count
+- text length or text density per page
+- image count per page
+- OCR candidate flag per page
+- basic long-document chunk candidates using the 20-page target
+
+The output should be typed using the models from Step 1.
+
+## Sprint Contract
+- Done means: given a PDF path, the pre-analysis API returns deterministic page-level facts and chunk candidates without running Marker, Nougat, OCR, or GPU code.
+- Hard thresholds:
+  - Tests cover at least one text-heavy sample and one mixed/scanned-risk sample from `samples/metadata.json`.
+  - Tests cover Korean path handling through `pathlib`.
+  - OCR candidate logic is deterministic and documented by tests.
+  - Chunk candidates never exceed the document page count.
+  - Explicit conversion or Markdown rendering is not implemented here.
+- Files owned:
+  - `src/pdftomd/preanalysis.py`
+  - model additions in `src/pdftomd/models.py` only if required
+  - `tests/test_preanalysis.py`
+  - `PROGRESS.md`
+  - `phases/0-harness-foundation/index.json`
+- Dependencies:
+  - Step 0 sample metadata
+  - Step 1 package skeleton and models
+
+## Acceptance Criteria
+```powershell
+python scripts\validate_workspace.py
+.\venv\python.exe -m pytest tests\test_preanalysis.py
+```
+
+## Verification
+1. Run the acceptance commands.
+2. Confirm PyMuPDF is the only PDF inspection dependency used in this step.
+3. Confirm the sample metadata traits and test expectations are consistent.
+4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
+5. Update this step's entry in `phases/0-harness-foundation/index.json` to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
+
+## Do Not
+- Do not call Marker, Nougat, Surya, torch, or OCR.
+- Do not write conversion output under `output/`.
+- Do not create resume cache or runtime state files.
+- Do not implement reading-order reconstruction in this step.
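
The Task section of the step file above can be pictured with a minimal sketch. It is not the contents of `src/pdftomd/preanalysis.py`, and it does not use the Step 1 models, which are not shown in this diff. The names `PageFacts`, `is_ocr_candidate`, `chunk_candidates`, and `inspect_pdf` are hypothetical, and so are the `MIN_TEXT_CHARS` threshold and the rule "little text plus at least one image means OCR candidate". Only the 20-page target comes from the spec. PyMuPDF is imported lazily, so the pure decision logic can be tested without opening a PDF.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

CHUNK_TARGET_PAGES = 20  # the 20-page target named in the Task section
MIN_TEXT_CHARS = 50      # ASSUMPTION: hypothetical threshold, not from the spec


@dataclass(frozen=True)
class PageFacts:
    """Hypothetical stand-in for the Step 1 page model."""
    index: int           # 0-based page index
    text_length: int     # characters of extractable text
    image_count: int     # embedded images referenced by the page
    ocr_candidate: bool  # result of the deterministic rule below


def is_ocr_candidate(text_length: int, image_count: int,
                     min_chars: int = MIN_TEXT_CHARS) -> bool:
    # ASSUMPTION: a page with almost no text but at least one image is
    # treated as scanned-risk; a page with neither is treated as blank.
    return text_length < min_chars and image_count > 0


def chunk_candidates(page_count: int,
                     target: int = CHUNK_TARGET_PAGES) -> list[tuple[int, int]]:
    """Return half-open (start, end) page ranges of at most `target` pages."""
    if page_count < 0 or target < 1:
        raise ValueError("page_count must be >= 0 and target must be >= 1")
    # min() clamps the last range so no chunk exceeds the page count.
    return [(start, min(start + target, page_count))
            for start in range(0, page_count, target)]


def inspect_pdf(pdf_path: Path) -> list[PageFacts]:
    """Collect page-level facts with PyMuPDF only (no Marker, OCR, or GPU)."""
    import fitz  # PyMuPDF, imported lazily so the helpers above stay standalone

    facts: list[PageFacts] = []
    # str() at the boundary keeps pathlib handling (e.g. Korean paths) in one place.
    with fitz.open(str(pdf_path)) as doc:
        for page in doc:
            text_length = len(page.get_text().strip())
            image_count = len(page.get_images(full=True))
            facts.append(PageFacts(
                index=page.number,
                text_length=text_length,
                image_count=image_count,
                ocr_candidate=is_ocr_candidate(text_length, image_count),
            ))
    return facts
```

Under these assumptions, a 45-page document splits into the ranges `(0, 20)`, `(20, 40)`, and `(40, 45)`. The last range is clamped to the page count, which is one way to satisfy the "never exceed the document page count" threshold. Keeping `is_ocr_candidate` a pure function of two integers makes the rule straightforward to pin down in `tests/test_preanalysis.py`.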