---
name: sample-corpus
description: Analyze and maintain the PDFtoMD samples corpus. Use when Codex needs to classify samples/ PDFs, design samples/metadata.json, identify OCR candidates, or connect corpus traits to focused regression tests.
---

# Sample Corpus

## Workflow

1. Read `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, `docs/PRD.md`, and `docs/CONVERSION_POLICY.md`.
2. Inspect PDFs with PyMuPDF before proposing tests.
3. Track these traits per PDF:
   - page count
   - text-layer quality
   - scanned or mixed pages
   - multi-column layout
   - formula density
   - table density
   - figure density
   - Korean filename/path coverage
4. If writing metadata, store it in `samples/metadata.json` and record the change in `PROGRESS.md`.
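
Steps 2 to 4 can be sketched roughly as follows. This is a minimal illustration, not the project's actual tooling: the function names (`collect_traits`, `dump_metadata`, `write_metadata`) and the JSON layout are assumptions, since the schema of `samples/metadata.json` is still to be designed. Only page count, per-page text length, and per-page image count are gathered here; the other traits would need their own heuristics.

```python
import json
from pathlib import Path


def collect_traits(pdf_path):
    """Gather a few per-PDF traits with PyMuPDF (hypothetical helper)."""
    import fitz  # PyMuPDF, imported lazily so the JSON helpers work without it

    with fitz.open(pdf_path) as doc:
        chars = [len(page.get_text("text").strip()) for page in doc]
        images = [len(page.get_images(full=True)) for page in doc]
    return {
        "page_count": len(chars),
        "chars_per_page": chars,    # rough proxy for text-layer quality
        "images_per_page": images,  # rough proxy for figure density
    }


def dump_metadata(entries):
    """Serialize metadata; ensure_ascii=False keeps Korean filenames readable."""
    return json.dumps(entries, ensure_ascii=False, indent=2, sort_keys=True)


def write_metadata(entries, out_path="samples/metadata.json"):
    """Write the metadata file as UTF-8 (layout is an assumed schema)."""
    Path(out_path).write_text(dump_metadata(entries) + "\n", encoding="utf-8")


def build_metadata(samples_dir="samples"):
    """Map each PDF filename to its traits without touching the PDFs themselves."""
    return {
        pdf.name: collect_traits(pdf)
        for pdf in sorted(Path(samples_dir).glob("*.pdf"))
    }
```

Keying entries by the unmodified filename keeps Korean names intact, and the PDFs are only opened for reading.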

## Guardrails

- Preserve original sample PDFs.
- Do not rename Korean sample files unless the user explicitly asks.
- Do not treat first-page text length as the only OCR signal; mixed PDFs can have a clean first page followed by scanned pages.
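
One way to honor the last guardrail is to classify the text layer from every page rather than the first one. The sketch below is an assumption-laden illustration: `classify_text_layer`, its labels, and the `min_chars` threshold of 50 are hypothetical, and the input is a list of per-page extracted character counts (for example, from PyMuPDF's `page.get_text()`).

```python
def classify_text_layer(chars_per_page, min_chars=50):
    """Label a PDF from per-page text lengths (hypothetical heuristic).

    A page counts as "sparse" when it has fewer than min_chars extracted
    characters. The threshold is an assumed value and should be tuned
    against the real corpus.
    """
    if not chars_per_page:
        return "empty"
    sparse = [count < min_chars for count in chars_per_page]
    if all(sparse):
        return "scanned"  # no usable text layer anywhere: OCR candidate
    if any(sparse):
        return "mixed"    # some pages lack text: partial OCR candidate
    return "text"


# A text-rich first page does not hide the scanned pages that follow it.
print(classify_text_layer([1200, 0, 0]))
```

Character count alone is still a coarse signal; image coverage or garbled-glyph checks could be combined with it.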