add files

김경종
2026-04-30 17:05:19 +09:00
parent f3e01b5a8c
commit 7e985ae94a
135 changed files with 41205 additions and 0 deletions
@@ -0,0 +1,63 @@
# Step 2: page-preanalysis-contract
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md
- /phases/0-harness-foundation/step0.md
- /phases/0-harness-foundation/step1.md
- /phases/0-harness-foundation/index.json
## Task
Implement the lightweight page pre-analysis contract: the page-level facts that later conversion steps need to know before Marker runs.
This step should use PyMuPDF only for fast document/page inspection:
- page count
- text length or text density per page
- image count per page
- OCR candidate flag per page
- basic long-document chunk candidates using the 20-page target
The output should be typed using the models from Step 1.
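
A minimal sketch of what this inspection could look like, assuming a hypothetical `PageFacts` model and a placeholder `MIN_TEXT_CHARS` threshold (neither is defined in this spec; the real types come from Step 1 and the real threshold should be pinned by tests). PyMuPDF is imported lazily so the OCR-candidate rule can be tested without opening a PDF.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

# Hypothetical threshold (assumption): pages with fewer extractable
# characters than this, and at least one image, are flagged for OCR.
MIN_TEXT_CHARS = 50


@dataclass(frozen=True)
class PageFacts:
    """Illustrative stand-in for a Step 1 model (name and fields assumed)."""

    page_index: int
    text_length: int
    image_count: int

    @property
    def ocr_candidate(self) -> bool:
        # Deterministic rule: images present but almost no text layer.
        return self.image_count > 0 and self.text_length < MIN_TEXT_CHARS


def inspect_pdf(pdf_path: Path) -> list[PageFacts]:
    """Collect page-level facts with PyMuPDF only (no Marker, OCR, or GPU)."""
    import fitz  # PyMuPDF, imported lazily

    facts: list[PageFacts] = []
    # str(pdf_path) keeps non-ASCII (for example Korean) paths intact.
    with fitz.open(str(pdf_path)) as doc:
        for index, page in enumerate(doc):
            text = page.get_text("text")
            images = page.get_images(full=True)
            facts.append(PageFacts(index, len(text.strip()), len(images)))
    return facts
```

The page count falls out as `len(inspect_pdf(path))`. Whether a blank page with no images should count as an OCR candidate is a policy choice this sketch leaves to the tests.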
## Sprint Contract
- Done means: given a PDF path, the pre-analysis API returns deterministic page-level facts and chunk candidates without running Marker, Nougat, OCR, or GPU code.
- Hard thresholds:
  - Tests cover at least one text-heavy sample and one mixed/scanned-risk sample from `samples/metadata.json`.
  - Tests cover Korean path handling through `pathlib`.
  - OCR candidate logic is deterministic and documented by tests.
  - Chunk candidates never exceed the document page count.
  - Explicit conversion or Markdown rendering is not implemented here.
- Files owned:
  - `src/pdftomd/preanalysis.py`
  - model additions in `src/pdftomd/models.py` only if required
  - `tests/test_preanalysis.py`
  - `PROGRESS.md`
  - `phases/0-harness-foundation/index.json`
- Dependencies:
  - Step 0 sample metadata
  - Step 1 package skeleton and models
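
The chunk-candidate threshold above can be sketched as a pure function. `ChunkCandidate` and `chunk_candidates` are hypothetical names, and the half-open `[start, stop)` page range is an assumption; only the 20-page target and the "never exceed the page count" rule come from this spec.

```python
from dataclasses import dataclass

TARGET_CHUNK_PAGES = 20  # the 20-page target named in the task


@dataclass(frozen=True)
class ChunkCandidate:
    """Hypothetical model: a half-open page range [start, stop)."""

    start: int
    stop: int


def chunk_candidates(
    page_count: int, target: int = TARGET_CHUNK_PAGES
) -> list[ChunkCandidate]:
    """Split a document into consecutive chunks of at most `target` pages."""
    if page_count < 0 or target <= 0:
        raise ValueError("page_count must be >= 0 and target must be > 0")
    # min() clamps the last chunk so it never runs past the final page.
    return [
        ChunkCandidate(start, min(start + target, page_count))
        for start in range(0, page_count, target)
    ]
```

For a 45-page document this yields three candidates, with the last one covering pages 40 to 44. Because the function depends only on the page count, the same input always produces the same chunks.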
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_preanalysis.py
```
## Verification
1. Run the acceptance commands.
2. Confirm PyMuPDF is the only PDF inspection dependency used in this step.
3. Confirm the sample metadata traits and test expectations are consistent.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this step's entry in the phase index (`phases/0-harness-foundation/index.json`) to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
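
One way to support verification item 2 is a static import scan of the module source. This is a sketch, not part of the spec: `forbidden_imports` is a hypothetical helper, and the top-level module names in `FORBIDDEN` are assumptions about how those packages are imported.

```python
import ast

# Assumed top-level import names for the tools this step must not touch.
FORBIDDEN = {"marker", "nougat", "surya", "torch"}


def forbidden_imports(source: str, forbidden: set[str] = FORBIDDEN) -> set[str]:
    """Return the forbidden top-level packages imported by `source`."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as `from . import x` have no module name.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in forbidden:
                found.add(root)
    return found
```

A test could read `src/pdftomd/preanalysis.py` with `pathlib.Path.read_text()` and assert that the result is an empty set. The scan catches only static `import` statements, so dynamic imports would need a separate check.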
## Do Not
- Do not call Marker, Nougat, Surya, torch, or OCR.
- Do not write conversion output under `output/`.
- Do not create resume cache or runtime state files.
- Do not implement reading-order reconstruction in this step.