| name | description |
|---|---|
| sample-corpus | Analyze and maintain the PDFtoMD samples corpus. Use when Codex needs to classify samples/ PDFs, design samples/metadata.json, identify OCR candidates, or connect corpus traits to focused regression tests. |
Sample Corpus
Workflow
- Read `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, `docs/PRD.md`, and `docs/CONVERSION_POLICY.md`.
- Inspect PDFs with PyMuPDF before proposing tests.
- Track these traits per PDF:
  - page count
  - text-layer quality
  - scanned or mixed pages
  - multi-column layout
  - formula density
  - table density
  - figure density
  - Korean filename/path coverage
- If writing metadata, use `samples/metadata.json` and update `PROGRESS.md`.
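
As one possible starting point for `samples/metadata.json`, the sketch below turns the trait list above into a per-PDF record. The schema is not defined anywhere in this skill, so every field name, the `"samples"` wrapper key, and the density vocabulary (`"none"`, `"low"`, `"high"`) are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SampleTraits:
    """One record per PDF. Field names are hypothetical and mirror the trait list."""
    path: str                 # path relative to samples/, kept exactly as on disk
    page_count: int
    text_layer_quality: str   # assumed vocabulary: "good", "poor", "none"
    scanned_pages: list[int]  # 0-based indices of pages that look scanned
    multi_column: bool
    formula_density: str      # assumed vocabulary: "none", "low", "high"
    table_density: str
    figure_density: str
    korean_filename: bool


def has_hangul(name: str) -> bool:
    """True if the name contains a Hangul syllable, jamo, or compatibility jamo."""
    return any(
        "\uac00" <= ch <= "\ud7a3"
        or "\u1100" <= ch <= "\u11ff"
        or "\u3130" <= ch <= "\u318f"
        for ch in name
    )


def render_metadata(records: list[SampleTraits]) -> str:
    """Serialize records sorted by path so diffs stay stable.

    ensure_ascii=False keeps Korean filenames readable instead of \\u escapes.
    """
    ordered = sorted(records, key=lambda r: r.path)
    payload = {"samples": [asdict(r) for r in ordered]}
    return json.dumps(payload, ensure_ascii=False, indent=2) + "\n"
```

The returned string can be written to `samples/metadata.json` with UTF-8 encoding. Keeping serialization separate from file I/O makes the record shape easy to cover with a focused regression test.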
Guardrails
- Preserve original sample PDFs.
- Do not rename Korean sample files unless the user explicitly asks.
- Do not treat first-page text length as the only OCR signal.
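
The last guardrail can be illustrated with a per-page check rather than a first-page check. This is a minimal sketch, assuming a simple heuristic that is not taken from the project: a page counts as scanned-like when it has fewer than `min_chars` extracted characters and references at least one image. `PageStats`, `scanned_like_pages`, `ocr_signal`, and the threshold of 50 are hypothetical names and values. `collect_page_stats` uses PyMuPDF's `fitz.open`, `page.get_text()`, and `page.get_images()`.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    text_chars: int   # length of the stripped extracted text
    image_count: int  # number of images referenced by the page


def collect_page_stats(pdf_path: str) -> list[PageStats]:
    """Gather stats for every page, not just the first. Requires PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so the pure helpers work without it

    with fitz.open(pdf_path) as doc:
        return [
            PageStats(len(page.get_text().strip()), len(page.get_images(full=True)))
            for page in doc
        ]


def scanned_like_pages(stats: list[PageStats], min_chars: int = 50) -> list[int]:
    """0-based indices of pages with little text but at least one image."""
    return [
        i for i, s in enumerate(stats)
        if s.text_chars < min_chars and s.image_count > 0
    ]


def ocr_signal(stats: list[PageStats], min_chars: int = 50) -> str:
    """Classify a document as "text", "mixed", or "scanned" from all pages."""
    scanned = scanned_like_pages(stats, min_chars)
    if not scanned:
        return "text"
    return "scanned" if len(scanned) == len(stats) else "mixed"
```

A document whose first page has a clean text layer can still come out as `"mixed"` here, which is the case a first-page-only check would miss. Text length and image count are only two signals; the threshold should be tuned against the actual corpus.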