PDFToMD/phases/0-harness-foundation/step0.md
김경종 7e985ae94a add files
2026-04-30 17:05:19 +09:00

Step 0: sample-metadata-contract

Read First

  • /AGENTS.md
  • /PLAN.md
  • /PROGRESS.md
  • /docs/HARNESS.md
  • /docs/PRD.md
  • /docs/ARCHITECTURE.md
  • /docs/CONVERSION_POLICY.md
  • /docs/ADR.md
  • /docs/TOOLCHAIN.md

Task

Create the first sample corpus metadata contract without implementing the conversion engine.

The metadata must classify every PDF currently under samples/ by traits that future regression tests can use:

  • text layer quality
  • scanned or mixed scanned/text pages
  • multi-column or complex layout risk
  • formula density
  • table density
  • figure density
  • Korean filename/path coverage
  • target regression focus

Use deterministic JSON so future agents can update it with minimal diff noise.
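
A minimal sketch of what "deterministic" could mean here, assuming sorted keys, entries ordered by path, fixed indentation, unescaped non-ASCII, and a trailing newline. The top-level "samples" key and every field name below are hypothetical; this step only fixes the traits, not the key names.

```python
import json


def dump_metadata(samples: list[dict]) -> str:
    """Serialize sample entries so identical data always yields identical bytes."""
    # Order entries by path so adding one PDF changes one contiguous block.
    ordered = sorted(samples, key=lambda entry: entry["path"])
    text = json.dumps(
        {"samples": ordered},
        ensure_ascii=False,  # keep Korean filenames readable instead of \uXXXX
        indent=2,
        sort_keys=True,      # stable key order inside each entry
    )
    return text + "\n"       # trailing newline avoids end-of-file diff noise


# Hypothetical entries; the filenames and trait values are illustrative only.
example = [
    {
        "path": "samples/한글_보고서.pdf",
        "text_layer_quality": "good",
        "scan_type": "text",
        "layout_risk": "multi-column",
        "formula_density": "low",
        "table_density": "high",
        "figure_density": "medium",
        "korean_path": True,
        "regression_focus": ["table", "korean-path"],
    },
    {
        "path": "samples/a_scanned.pdf",
        "text_layer_quality": "none",
        "scan_type": "scanned",
        "layout_risk": "single-column",
        "formula_density": "none",
        "table_density": "low",
        "figure_density": "low",
        "korean_path": False,
        "regression_focus": ["ocr"],
    },
]

if __name__ == "__main__":
    print(dump_metadata(example), end="")
```

Because ordering and formatting are fixed, an agent that regenerates the file without changing any data should produce a byte-identical result.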

Sprint Contract

  • Done means: samples/metadata.json exists, includes every current PDF by exact relative path, and has enough structured fields for future tests to select OCR, layout, formula, table, figure, and Korean-path cases.
  • Hard thresholds:
    • Every current samples/*.pdf appears exactly once.
    • Metadata is valid UTF-8 JSON.
    • Tests fail if a sample PDF is added without metadata.
    • Tests fail if duplicate sample paths exist in the metadata.
    • No conversion engine code is introduced in this step.
  • Files owned:
    • samples/metadata.json
    • tests/test_sample_metadata.py
    • PROGRESS.md
    • phases/0-harness-foundation/index.json
  • Dependencies:
    • Existing sample PDFs under samples/
    • PyMuPDF may be used only for lightweight page count/text/image inspection if needed.
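
The coverage thresholds above could be checked roughly as follows. This is a sketch, not the required contents of tests/test_sample_metadata.py; it assumes a hypothetical layout in which the metadata holds a top-level "samples" list whose entries carry a "path" field, and check_coverage is an illustrative helper name.

```python
import json
from collections import Counter
from pathlib import Path


def check_coverage(samples_dir: Path, metadata_path: Path) -> dict[str, list[str]]:
    """Compare PDFs on disk with paths listed in the metadata file."""
    # Relative paths in the form "samples/<name>.pdf", using forward slashes.
    on_disk = {f"{samples_dir.name}/{p.name}" for p in samples_dir.glob("*.pdf")}

    # Reading with an explicit encoding fails loudly on invalid UTF-8.
    data = json.loads(metadata_path.read_text(encoding="utf-8"))
    listed = [entry["path"] for entry in data["samples"]]  # assumed schema

    counts = Counter(listed)
    return {
        "missing": sorted(on_disk - set(listed)),     # PDF added without metadata
        "stale": sorted(set(listed) - on_disk),       # metadata without a PDF
        "duplicates": sorted(p for p, n in counts.items() if n > 1),
    }


def assert_contract(samples_dir: Path, metadata_path: Path) -> None:
    """Raise AssertionError if any coverage threshold is violated."""
    report = check_coverage(samples_dir, metadata_path)
    assert report == {"missing": [], "stale": [], "duplicates": []}, report
```

A pytest test could call assert_contract(Path("samples"), Path("samples/metadata.json")), so that adding a PDF without metadata, or listing a path twice, fails the run.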

Acceptance Criteria

python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_sample_metadata.py

Verification

  1. Run the acceptance commands.
  2. Confirm samples/metadata.json paths match samples/*.pdf.
  3. Confirm Korean filenames remain readable in JSON.
  4. Update PROGRESS.md with completed work, validation output, and next handoff.
  5. Mark this step in the phase index as completed with a one-line summary, or as blocked/error with a concrete reason.
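
Step 3 can be partly automated. The sketch below assumes "readable" means that non-ASCII characters are stored literally rather than as \uXXXX escapes, and it again assumes a hypothetical "samples" list with "path" fields; unreadable_paths is an illustrative helper name.

```python
import json


def unreadable_paths(raw: bytes) -> list[str]:
    """Return paths that are not stored literally in the raw JSON text."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    data = json.loads(text)
    flagged = []
    for entry in data["samples"]:  # assumed schema
        # The literal, unescaped form of the path as it should appear on disk.
        literal = json.dumps(entry["path"], ensure_ascii=False)
        if literal not in text:
            flagged.append(entry["path"])
    return flagged
```

An empty result suggests every path, Korean or otherwise, appears in the file exactly as a human would read it.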

Do Not

  • Do not create src/ or conversion engine modules in this step.
  • Do not rename, delete, compress, or rewrite sample PDFs.
  • Do not add sidecar output files for converted documents.
  • Do not add a new custom agent.