# Step 0: sample-metadata-contract

## Read First

- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/PRD.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md

## Task

Create the first sample corpus metadata contract without implementing the conversion engine.

The metadata must classify every PDF currently under `samples/` by traits that future regression tests can use:

- text layer quality
- scanned or mixed scanned/text pages
- multi-column or complex layout risk
- formula density
- table density
- figure density
- Korean filename/path coverage
- target regression focus

Use deterministic JSON so future agents can update it with minimal diff noise.

## Sprint Contract

- Done means: `samples/metadata.json` exists, includes every current PDF by exact relative path, and has enough structured fields for future tests to select OCR, layout, formula, table, figure, and Korean-path cases.
- Hard thresholds:
  - Every current `samples/*.pdf` appears exactly once.
  - Metadata is valid UTF-8 JSON.
  - Tests fail if a sample PDF is added without metadata.
  - Tests fail if duplicate sample paths exist in the metadata.
  - No conversion engine code is introduced in this step.
- Files owned:
  - `samples/metadata.json`
  - `tests/test_sample_metadata.py`
  - `PROGRESS.md`
  - `phases/0-harness-foundation/index.json`
- Dependencies:
  - Existing sample PDFs under `samples/`
  - PyMuPDF may be used only for lightweight page count/text/image inspection if needed.

## Acceptance Criteria

```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_sample_metadata.py
```

## Verification

1. Run the acceptance commands.
2. Confirm `samples/metadata.json` paths match `samples/*.pdf`.
3. Confirm Korean filenames remain readable in JSON.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.

## Do Not

- Do not create `src/` or conversion engine modules in this step.
- Do not rename, delete, compress, or rewrite sample PDFs.
- Do not add sidecar output files for converted documents.
- Do not add a new custom agent.
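
## Illustrative Sketch

The hard thresholds in the Sprint Contract (every PDF listed exactly once, readable UTF-8, minimal diff noise) could be checked roughly as follows. This is only a sketch under stated assumptions: the top-level `samples` key, the per-entry `path` field, and the helper names `dump_metadata`, `discover_pdfs`, and `check_coverage` are hypothetical and not defined anywhere in this step. The real schema is for the implementing agent to decide.

```python
import json
from collections import Counter
from pathlib import Path


def dump_metadata(entries):
    """Serialize metadata entries deterministically.

    Assumed (hypothetical) layout: {"samples": [{"path": ..., ...}, ...]}.
    Entries are sorted by path and keys are sorted, so the output does not
    depend on input order. ensure_ascii=False keeps Korean filenames
    readable instead of turning them into \\uXXXX escapes.
    """
    ordered = sorted(entries, key=lambda entry: entry["path"])
    payload = {"samples": ordered}
    return json.dumps(payload, ensure_ascii=False, indent=2, sort_keys=True) + "\n"


def discover_pdfs(samples_dir):
    """List top-level *.pdf files as POSIX-style paths such as 'samples/a.pdf'."""
    root = Path(samples_dir)
    return sorted(p.relative_to(root.parent).as_posix() for p in root.glob("*.pdf"))


def check_coverage(pdf_paths, entries):
    """Compare PDFs on disk against metadata entries.

    Returns three sorted lists: PDFs with no metadata, metadata paths with
    no PDF, and paths that occur more than once in the metadata.
    """
    counts = Counter(entry["path"] for entry in entries)
    on_disk = set(pdf_paths)
    listed = set(counts)
    missing = sorted(on_disk - listed)
    extra = sorted(listed - on_disk)
    duplicates = sorted(path for path, n in counts.items() if n > 1)
    return missing, extra, duplicates
```

A test in `tests/test_sample_metadata.py` might load `samples/metadata.json` with `encoding="utf-8"`, call `check_coverage(discover_pdfs("samples"), entries)`, and assert that all three returned lists are empty. Writing the file with an explicit `encoding="utf-8"` and `newline="\n"` would help keep diffs stable across Windows and other platforms.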