PDFToMD/.codex/skills/sample-corpus/SKILL.md
김경종 7e985ae94a add files
2026-04-30 17:05:19 +09:00

name: sample-corpus
description: Analyze and maintain the PDFtoMD samples corpus. Use when Codex needs to classify samples/ PDFs, design samples/metadata.json, identify OCR candidates, or connect corpus traits to focused regression tests.

Sample Corpus

Workflow

  1. Read AGENTS.md, PLAN.md, PROGRESS.md, docs/PRD.md, and docs/CONVERSION_POLICY.md.
  2. Inspect PDFs with PyMuPDF before proposing tests.
  3. Track these traits per PDF:
    • page count
    • text-layer quality
    • scanned or mixed pages
    • multi-column layout
    • formula density
    • table density
    • figure density
    • Korean filename/path coverage
  4. If writing metadata, use samples/metadata.json and update PROGRESS.md.
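
Steps 2 to 4 could be sketched roughly as follows. This is a minimal sketch, not the project's implementation: this skill does not define a schema for samples/metadata.json, so the `SampleTraits` fields, the `inspect_pdf` and `write_metadata` helpers, and the JSON layout are all hypothetical. Only standard PyMuPDF calls (`fitz.open`, `Page.get_text`, `Page.get_images`) are used, and PyMuPDF is imported lazily so the rest of the sketch runs without it. Layout traits such as multi-column pages or formula density are left to a free-text `notes` field, since they usually need manual review.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class SampleTraits:
    """Hypothetical per-PDF record; field names are illustrative only."""
    path: str                    # relative path, Korean names kept as-is
    page_count: int
    chars_per_page: list[int]    # extractable text length for every page
    images_per_page: list[int]   # embedded image count for every page
    notes: str = ""              # manual observations: columns, formulas, tables


def inspect_pdf(pdf_path: Path) -> SampleTraits:
    """Collect cheap per-page signals with PyMuPDF; the PDF is only read."""
    import fitz  # PyMuPDF, imported here so the other helpers work without it

    with fitz.open(str(pdf_path)) as doc:
        chars = [len(page.get_text().strip()) for page in doc]
        images = [len(page.get_images(full=True)) for page in doc]
        return SampleTraits(pdf_path.as_posix(), doc.page_count, chars, images)


def write_metadata(records: list[SampleTraits], out_path: Path) -> None:
    """Write records keyed by path; ensure_ascii=False keeps Hangul readable."""
    payload = {r.path: asdict(r) for r in sorted(records, key=lambda r: r.path)}
    out_path.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2) + "\n",
        encoding="utf-8",
    )
```

A call such as `write_metadata([inspect_pdf(p) for p in Path("samples").glob("*.pdf")], Path("samples/metadata.json"))` would cover the whole corpus without touching or renaming the source files.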

Guardrails

  • Preserve original sample PDFs.
  • Do not rename Korean sample files unless the user explicitly asks.
  • Do not treat first-page text length as the only OCR signal.
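
The last guardrail can be made concrete by looking at every page instead of only the first. The following is a minimal sketch, assuming per-page character counts and image counts have already been collected; the `classify_text_layer` name and the 50-character threshold are illustrative assumptions, not values taken from the project.

```python
def classify_text_layer(chars_per_page, images_per_page, min_chars=50):
    """Label a PDF 'text', 'scanned', or 'mixed' from per-page signals.

    A page is treated as a likely scan when it has almost no extractable
    text but carries at least one image. min_chars is an arbitrary,
    illustrative threshold. Returns the label and the 0-based indices of
    the suspect pages.
    """
    if len(chars_per_page) != len(images_per_page):
        raise ValueError("per-page lists must have the same length")

    suspect = [
        i
        for i, (chars, images) in enumerate(zip(chars_per_page, images_per_page))
        if chars < min_chars and images > 0
    ]

    if not suspect:
        label = "text"
    elif len(suspect) == len(chars_per_page):
        label = "scanned"
    else:
        label = "mixed"
    return label, suspect


# A typed cover page followed by scanned body pages: a first-page-only
# check would miss the OCR candidates on pages 1 and 2.
print(classify_text_layer([900, 0, 3], [0, 1, 1]))
```

Blank pages (no text and no images) are deliberately not flagged, so a separator page does not turn a born-digital PDF into an OCR candidate.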