# Step 2: page-preanalysis-contract

## Read First

- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md
- /phases/0-harness-foundation/step0.md
- /phases/0-harness-foundation/step1.md
- /phases/0-harness-foundation/index.json

## Task

Implement the lightweight page pre-analysis contract that decides what later conversion steps need to know before Marker runs.

This step should use PyMuPDF only for fast document/page inspection:

- page count
- text length or text density per page
- image count per page
- OCR candidate flag per page
- basic long-document chunk candidates using the 20-page target

The output should be typed using the models from Step 1.

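
The inspection described above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `PageFacts`, `MIN_TEXT_CHARS`, `is_ocr_candidate`, and `inspect_pdf` are hypothetical names, the real types should come from the Step 1 models, and the OCR rule and its threshold are assumptions. PyMuPDF is imported under its long-standing `fitz` name.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PageFacts:
    """Hypothetical stand-in for the Step 1 page model."""

    page_number: int  # 1-based
    text_length: int
    image_count: int
    ocr_candidate: bool


# Assumed threshold: a page that carries an image but fewer extractable
# characters than this is flagged. The real value is a project decision.
MIN_TEXT_CHARS = 50


def is_ocr_candidate(text_length: int, image_count: int) -> bool:
    """Pure, deterministic rule: little extractable text plus at least one image."""
    return image_count > 0 and text_length < MIN_TEXT_CHARS


def inspect_pdf(pdf_path: Path) -> list[PageFacts]:
    """Collect page-level facts with PyMuPDF only; no OCR or conversion runs."""
    import fitz  # PyMuPDF, imported lazily so the pure helpers stay importable

    facts: list[PageFacts] = []
    with fitz.open(str(pdf_path)) as doc:
        for index, page in enumerate(doc):
            text_length = len(page.get_text().strip())
            image_count = len(page.get_images(full=True))
            facts.append(
                PageFacts(
                    page_number=index + 1,
                    text_length=text_length,
                    image_count=image_count,
                    ocr_candidate=is_ocr_candidate(text_length, image_count),
                )
            )
    return facts
```

Keeping the OCR rule in a pure function means it can be tested with plain integers, without opening a PDF.
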
## Sprint Contract

- Done means: given a PDF path, the pre-analysis API returns deterministic page-level facts and chunk candidates without running Marker, Nougat, OCR, or GPU code.
- Hard thresholds:
  - Tests cover at least one text-heavy sample and one mixed/scanned-risk sample from `samples/metadata.json`.
  - Tests cover Korean path handling through `pathlib`.
  - OCR candidate logic is deterministic and documented by tests.
  - Chunk candidates never exceed the document page count.
  - Explicit conversion or Markdown rendering is not implemented here.
- Files owned:
  - `src/pdftomd/preanalysis.py`
  - model additions in `src/pdftomd/models.py` only if required
  - `tests/test_preanalysis.py`
  - `PROGRESS.md`
  - `phases/0-harness-foundation/index.json`
- Dependencies:
  - Step 0 sample metadata
  - Step 1 package skeleton and models

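
One way to satisfy the threshold that chunk candidates never exceed the document page count is to clamp every range to the last page. The sketch below is an assumption-laden illustration: `chunk_candidates` is a hypothetical name, ranges are 1-based and inclusive, and the 20-page target is treated as an upper bound per chunk, which the contract does not state explicitly.

```python
def chunk_candidates(page_count: int, target: int = 20) -> list[tuple[int, int]]:
    """Split pages 1..page_count into inclusive (start, end) ranges.

    Each range holds at most `target` pages, and the final range is
    clamped so that no end value passes the last page.
    """
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    if target < 1:
        raise ValueError("target must be at least 1")
    return [
        (start, min(start + target - 1, page_count))
        for start in range(1, page_count + 1, target)
    ]


# A 45-page document yields two full chunks and one short tail.
print(chunk_candidates(45))  # [(1, 20), (21, 40), (41, 45)]
```

Because the function depends only on two integers, the same input always produces the same ranges, which fits the determinism requirement.
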
## Acceptance Criteria

```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_preanalysis.py
```

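
As an illustration of the kind of test the pytest command above might collect, here is a hedged sketch of the Korean path check. `resolve_pdf_path` is a hypothetical helper, the directory and file names are invented for the example, and the bytes written are a placeholder header rather than a valid PDF.

```python
import tempfile
from pathlib import Path
from typing import Union


def resolve_pdf_path(raw: Union[str, Path]) -> Path:
    """Normalize a user-supplied path with pathlib and check it points at a PDF file."""
    path = Path(raw).expanduser().resolve()
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF path: {path}")
    if not path.is_file():
        raise FileNotFoundError(path)
    return path


def test_korean_path_round_trip() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        # Hangul in both the directory name and the file name.
        pdf = Path(tmp) / "샘플 문서" / "보고서.pdf"
        pdf.parent.mkdir(parents=True)
        pdf.write_bytes(b"%PDF-1.4\n")  # placeholder bytes, not a real PDF

        resolved = resolve_pdf_path(str(pdf))

        assert resolved.name == "보고서.pdf"
        assert resolved.parent.name == "샘플 문서"
        assert resolved.read_bytes().startswith(b"%PDF")
```

A real test would likely pass the resolved path on to the pre-analysis API and use one of the samples listed in `samples/metadata.json`.
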
## Verification

1. Run the acceptance commands.
2. Confirm PyMuPDF is the only PDF inspection dependency used in this step.
3. Confirm the sample metadata traits and test expectations are consistent.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.

## Do Not

- Do not call Marker, Nougat, Surya, torch, or OCR.
- Do not write conversion output under `output/`.
- Do not create resume cache or runtime state files.
- Do not implement reading-order reconstruction in this step.
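
The first rule above can be partly checked mechanically by scanning a module's static imports. The sketch below is one possible approach, not a project requirement: `forbidden_imports` is a hypothetical helper, and the module names in `FORBIDDEN` are assumptions about how the named tools are imported (`pytesseract` and `easyocr` stand in for "OCR" as examples only).

```python
import ast

# Assumed top-level module names for the tools this step must not call.
FORBIDDEN = frozenset({"marker", "nougat", "surya", "torch", "pytesseract", "easyocr"})


def forbidden_imports(source: str, forbidden: frozenset = FORBIDDEN) -> set[str]:
    """Return the forbidden top-level modules that `source` imports statically."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]  # absolute "from x import y" only
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in forbidden:
                found.add(top_level)
    return found
```

A test could read `src/pdftomd/preanalysis.py` and assert that the result is empty. The scan sees only `import` statements, so dynamic imports and subprocess calls would still need review by hand.
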