---
name: sample-corpus
description: Analyze and maintain the PDFtoMD samples corpus. Use when Codex needs to classify samples/ PDFs, design samples/metadata.json, identify OCR candidates, or connect corpus traits to focused regression tests.
---

# Sample Corpus

## Workflow

1. Read `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, `docs/PRD.md`, and `docs/CONVERSION_POLICY.md`.
2. Inspect PDFs with PyMuPDF before proposing tests.
3. Track these traits per PDF:
   - page count
   - text-layer quality
   - scanned or mixed pages
   - multi-column layout
   - formula density
   - table density
   - figure density
   - Korean filename/path coverage
4. If writing metadata, use `samples/metadata.json` and update `PROGRESS.md`.

## Guardrails

- Preserve original sample PDFs.
- Do not rename Korean sample files unless the user explicitly asks.
- Do not treat first-page text length as the only OCR signal.
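
The OCR guardrail and the per-PDF traits from the workflow can be illustrated with a minimal sketch. It assumes PyMuPDF is installed and reads every page rather than only the first. The function names, the `min_chars` threshold of 50, the metadata key names, and the sample filename are all hypothetical illustrations, not the project's actual schema or settings.

```python
import json


def page_char_counts(pdf_path):
    """Return the stripped text length of every page, using PyMuPDF."""
    # Imported lazily so the classifier below can run without PyMuPDF.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [len(page.get_text().strip()) for page in doc]


def classify_text_layer(char_counts, min_chars=50):
    """Classify a PDF from per-page text lengths.

    min_chars is an assumed threshold, not a project setting.
    """
    # Pages with almost no extractable text are treated as likely scans.
    scanned = [i + 1 for i, n in enumerate(char_counts) if n < min_chars]
    if not scanned:
        layer = "text"
    elif len(scanned) == len(char_counts):
        layer = "scanned"
    else:
        layer = "mixed"
    return {
        "page_count": len(char_counts),
        "text_layer": layer,
        "scanned_pages": scanned,  # 1-based page numbers
        "ocr_candidate": bool(scanned),
    }


def metadata_entry(filename, char_counts):
    """Build one illustrative metadata record; the key names are hypothetical."""
    return {"file": filename, **classify_text_layer(char_counts)}


if __name__ == "__main__":
    # Hypothetical sample: page 1 has a healthy text layer, pages 2 and 3 do not.
    entry = metadata_entry("샘플_논문.pdf", [1200, 0, 3, 900])
    # ensure_ascii=False keeps Korean filenames readable in the JSON output.
    print(json.dumps(entry, ensure_ascii=False, indent=2))
```

In this sketch a PDF whose first page extracts cleanly is still flagged as an OCR candidate when later pages have little or no text. A low character count can also mean a blank or figure-only page, so the result is a prompt for manual inspection rather than a final classification.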