Step 0: sample-metadata-contract
Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/PRD.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md
Task
Create the first sample corpus metadata contract without implementing the conversion engine.
The metadata must classify every PDF currently under samples/ by traits that future regression tests can use:
- text layer quality
- scanned or mixed scanned/text pages
- multi-column or complex layout risk
- formula density
- table density
- figure density
- Korean filename/path coverage
- target regression focus
Use deterministic JSON so future agents can update it with minimal diff noise.
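One way to get deterministic output is sketched below. It is a minimal illustration, not the required implementation: the top-level `samples` key, the field names (`path`, `text_layer`, `korean_path`), and the example filenames are all hypothetical placeholders, since this step does not fix a schema. The idea is to sort entries by path, sort keys, keep non-ASCII characters unescaped, and always end with a single newline.

```python
import json


def dump_metadata(entries):
    """Serialize sample metadata entries to a stable JSON string.

    Assumes each entry is a dict with a "path" key (hypothetical schema).
    """
    # Sort entries by path so insertion order never changes the output.
    ordered = sorted(entries, key=lambda entry: entry["path"])
    payload = {"samples": ordered}
    # sort_keys fixes key order inside each object; ensure_ascii=False keeps
    # Korean filenames readable; indent=2 yields one field per line.
    text = json.dumps(payload, ensure_ascii=False, indent=2, sort_keys=True)
    # A single trailing newline avoids "no newline at end of file" diffs.
    return text + "\n"


# Hypothetical entries, for illustration only.
entries = [
    {"path": "samples/보고서.pdf", "text_layer": "good", "korean_path": True},
    {"path": "samples/a.pdf", "korean_path": False, "text_layer": "none"},
]
```

With this approach, adding or editing one sample should touch only the lines for that entry, regardless of the order in which an agent built the list.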
Sprint Contract
- Done means: samples/metadata.json exists, includes every current PDF by exact relative path, and has enough structured fields for future tests to select OCR, layout, formula, table, figure, and Korean-path cases.
- Hard thresholds:
  - Every current samples/*.pdf appears exactly once.
  - Metadata is valid UTF-8 JSON.
  - Tests fail if a sample PDF is added without metadata.
  - Tests fail if duplicate sample paths exist in the metadata.
  - No conversion engine code is introduced in this step.
- Files owned:
  - samples/metadata.json
  - tests/test_sample_metadata.py
  - PROGRESS.md
  - phases/0-harness-foundation/index.json
- Dependencies:
  - Existing sample PDFs under samples/
  - PyMuPDF may be used only for lightweight page count/text/image inspection if needed.
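The "appears exactly once" and "tests fail" thresholds in this contract could be checked with a comparison like the sketch below. It is only an illustration: the helper names `check_coverage` and `list_sample_pdfs` are hypothetical, and the real test in tests/test_sample_metadata.py may be structured differently.

```python
from collections import Counter
from pathlib import Path


def list_sample_pdfs(root):
    """Return repo-relative POSIX paths for every PDF directly under samples/."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.glob("samples/*.pdf"))


def check_coverage(pdf_paths, metadata_paths):
    """Compare PDFs on disk with paths listed in the metadata.

    Returns three sorted lists: paths listed more than once, PDFs with no
    metadata entry, and metadata entries with no matching PDF.
    """
    counts = Counter(metadata_paths)
    duplicates = sorted(path for path, n in counts.items() if n > 1)
    missing = sorted(set(pdf_paths) - set(counts))
    stale = sorted(set(counts) - set(pdf_paths))
    return {"duplicates": duplicates, "missing": missing, "stale": stale}
```

A test could assert that all three lists are empty, so that adding a PDF without metadata or listing a path twice fails with a message naming the offending paths.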
Acceptance Criteria
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_sample_metadata.py
Verification
- Run the acceptance commands.
- Confirm samples/metadata.json paths match samples/*.pdf.
- Confirm Korean filenames remain readable in JSON.
- Update PROGRESS.md with completed work, validation output, and next handoff.
- Update this phase index step to completed with a one-line summary, or to blocked/error with a concrete reason.
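The "Korean filenames remain readable" check can be partly automated. The sketch below assumes a hypothetical layout in which the metadata has a top-level `samples` list whose entries carry a `path` string; the function name and the example filename are likewise illustrative. It treats a filename as readable when it appears literally in the file text rather than as `\uXXXX` escapes.

```python
import json


def korean_paths_readable(raw: bytes) -> bool:
    """Return True if every non-ASCII path is stored unescaped in the JSON text.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8 and
    json.JSONDecodeError if the text is not valid JSON.
    """
    text = raw.decode("utf-8")
    data = json.loads(text)
    # Hypothetical schema: {"samples": [{"path": "..."}]}
    paths = [entry["path"] for entry in data["samples"]]
    # An escaped path (\uXXXX sequences) will not occur literally in the text.
    return all(path in text for path in paths if not path.isascii())
```

Reading the file as bytes rather than text keeps the UTF-8 check explicit and independent of the platform's default encoding.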
Do Not
- Do not create src/ or conversion engine modules in this step.
- Do not rename, delete, compress, or rewrite sample PDFs.
- Do not add sidecar output files for converted documents.
- Do not add a new custom agent.