add files
@@ -0,0 +1,63 @@
# Step 0: sample-metadata-contract

## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/PRD.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md

## Task
Create the first sample corpus metadata contract without implementing the conversion engine.

The metadata must classify every PDF currently under `samples/` by traits that future regression tests can use:
- text layer quality
- scanned or mixed scanned/text pages
- multi-column or complex layout risk
- formula density
- table density
- figure density
- Korean filename/path coverage
- target regression focus

Use deterministic JSON so future agents can update it with minimal diff noise.
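One way to keep diffs small is to fix the serialization choices up front. The sketch below is illustrative only: the top-level `samples` key, every field name, the `"unknown"` default, and the helpers `make_entry` and `dump_metadata` are assumptions rather than part of this contract, and the filenames in the usage note are placeholders.

```python
import json
import unicodedata


def make_entry(path, **traits):
    """Build one metadata record. All field names are hypothetical."""
    # NFC composes decomposed jamo so the range check below sees whole syllables.
    normalized = unicodedata.normalize("NFC", path)
    entry = {
        "path": path,                     # exact relative path, forward slashes
        "text_layer_quality": "unknown",
        "scan_type": "unknown",           # e.g. "text", "scanned", "mixed"
        "layout_risk": "unknown",
        "formula_density": "unknown",
        "table_density": "unknown",
        "figure_density": "unknown",
        "korean_path": any("\uac00" <= ch <= "\ud7a3" for ch in normalized),
        "regression_focus": [],
    }
    unexpected = set(traits) - set(entry)
    if unexpected:
        raise KeyError(f"unexpected fields: {sorted(unexpected)}")
    entry.update(traits)
    return entry


def dump_metadata(entries):
    """Serialize entries so the same data always yields the same text."""
    ordered = sorted(entries, key=lambda e: e["path"])  # stable entry order
    text = json.dumps(
        {"samples": ordered},
        ensure_ascii=False,  # keep Korean characters literal, not \uXXXX
        indent=2,            # one value per line, so diffs stay line-local
        sort_keys=True,      # stable key order inside each entry
    )
    return text + "\n"       # single trailing newline
```

For example, `dump_metadata([make_entry("samples/b.pdf"), make_entry("samples/예제.pdf")])` returns the same string regardless of the order of the input list. When writing the result to disk, opening the file with `encoding="utf-8"` and `newline="\n"` avoids platform-dependent line endings.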

## Sprint Contract
- Done means: `samples/metadata.json` exists, includes every current PDF by exact relative path, and has enough structured fields for future tests to select OCR, layout, formula, table, figure, and Korean-path cases.
- Hard thresholds:
  - Every current `samples/*.pdf` appears exactly once.
  - Metadata is valid UTF-8 JSON.
  - Tests fail if a sample PDF is added without metadata.
  - Tests fail if duplicate sample paths exist in the metadata.
  - No conversion engine code is introduced in this step.
- Files owned:
  - `samples/metadata.json`
  - `tests/test_sample_metadata.py`
  - `PROGRESS.md`
  - `phases/0-harness-foundation/index.json`
- Dependencies:
  - Existing sample PDFs under `samples/`
  - PyMuPDF may be used only for lightweight page count/text/image inspection if needed.
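The hard thresholds above could be checked along the following lines. This is a minimal sketch, assuming the metadata file holds a top-level `samples` list whose entries each carry a `path` key; that layout and all helper names are hypothetical. The `stale` check goes slightly beyond the thresholds and reflects the later requirement that metadata paths match `samples/*.pdf`.

```python
import json
from collections import Counter
from pathlib import Path


def current_pdf_paths(repo_root):
    """List samples/*.pdf as sorted POSIX-style paths relative to repo_root."""
    root = Path(repo_root)
    return sorted(p.relative_to(root).as_posix() for p in (root / "samples").glob("*.pdf"))


def metadata_paths(metadata_file):
    """Decode strictly as UTF-8, parse the JSON, and return every recorded path."""
    raw = Path(metadata_file).read_bytes().decode("utf-8")  # raises on invalid UTF-8
    return [entry["path"] for entry in json.loads(raw)["samples"]]


def coverage_problems(pdf_paths, recorded_paths):
    """Compare PDFs on disk with recorded paths and report every mismatch."""
    counts = Counter(recorded_paths)
    return {
        "missing": sorted(set(pdf_paths) - set(counts)),             # PDF without metadata
        "duplicates": sorted(p for p, n in counts.items() if n > 1),  # listed more than once
        "stale": sorted(set(counts) - set(pdf_paths)),                # metadata without PDF
    }


def assert_metadata_complete(repo_root):
    """Fail with the full problem report if any check is violated."""
    root = Path(repo_root)
    problems = coverage_problems(
        current_pdf_paths(root),
        metadata_paths(root / "samples" / "metadata.json"),
    )
    assert not any(problems.values()), problems
```

A test in `tests/test_sample_metadata.py` could call `assert_metadata_complete` with the repository root, so adding a PDF without a matching entry makes the suite fail.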

## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_sample_metadata.py
```

## Verification
1. Run the acceptance commands.
2. Confirm `samples/metadata.json` paths match `samples/*.pdf`.
3. Confirm Korean filenames remain readable in JSON.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this step in the phase index (`phases/0-harness-foundation/index.json`) to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
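The readability check in step 3 can be automated. The sketch below takes "readable" to mean that each path appears verbatim in the raw file rather than as `\uXXXX` escapes, which is an interpretation and not something this step states. The `samples` and `path` keys and the helper name `unreadable_paths` are likewise assumptions, and the filename in the usage note is a placeholder.

```python
import json


def unreadable_paths(raw_text):
    """Return recorded paths that do not appear verbatim in the raw JSON text.

    A path written with escapes (for example by a serializer that forces
    ASCII output) parses to the right string but is not found literally.
    """
    data = json.loads(raw_text)
    return [entry["path"] for entry in data["samples"] if entry["path"] not in raw_text]
```

For a file produced with `json.dumps(..., ensure_ascii=False)` the result is an empty list. The same data dumped with Python's default `ensure_ascii=True` would report a path such as `samples/예제.pdf` as unreadable.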

## Do Not
- Do not create `src/` or conversion engine modules in this step.
- Do not rename, delete, compress, or rewrite sample PDFs.
- Do not add sidecar output files for converted documents.
- Do not add a new custom agent.