add files

2026-04-30 17:05:19 +09:00
parent f3e01b5a8c
commit 7e985ae94a
135 changed files with 41205 additions and 0 deletions
@@ -0,0 +1,27 @@
+---
+name: sample-corpus
+description: Analyze and maintain the PDFtoMD samples corpus. Use when Codex needs to classify samples/ PDFs, design samples/metadata.json, identify OCR candidates, or connect corpus traits to focused regression tests.
+---
+
+# Sample Corpus
+
+## Workflow
+
+1. Read `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, `docs/PRD.md`, and `docs/CONVERSION_POLICY.md`.
+2. Inspect PDFs with PyMuPDF before proposing tests.
+3. Track these traits per PDF:
+   - page count
+   - text-layer quality
+   - scanned or mixed pages
+   - multi-column layout
+   - formula density
+   - table density
+   - figure density
+   - Korean filename/path coverage
+4. When recording metadata, write it to `samples/metadata.json` and update `PROGRESS.md`.
+
+## Guardrails
+
+- Preserve original sample PDFs.
+- Do not rename Korean sample files unless the user explicitly asks.
+- Do not rely on first-page text length alone when deciding whether a PDF is an OCR candidate.
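
Note on the last guardrail: a minimal sketch of how a per-page check could look, assuming PyMuPDF is installed and importable as `fitz`. The names `PageStat`, `read_page_stats`, and `classify_text_layer`, and the `min_chars=50` threshold, are hypothetical and not part of this commit. The idea is only that every page contributes to the "text-layer quality" and "scanned or mixed pages" traits, so a PDF with a clean first page and scanned later pages is still flagged.

```python
from dataclasses import dataclass


@dataclass
class PageStat:
    text_chars: int   # characters in the page's extracted text layer
    image_count: int  # embedded images referenced by the page


def read_page_stats(pdf_path):
    """Collect per-page stats with PyMuPDF (imported lazily so the
    classifier below can be exercised without the library)."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [
            PageStat(len(page.get_text().strip()), len(page.get_images()))
            for page in doc
        ]


def classify_text_layer(stats, min_chars=50):
    """Label a PDF as 'text', 'scanned', 'mixed', or 'empty'.

    A page counts as scan-like when it has almost no text but does
    carry at least one image. min_chars is an assumed threshold.
    """
    if not stats:
        return "empty"
    scan_like = [s.text_chars < min_chars and s.image_count > 0 for s in stats]
    if all(scan_like):
        return "scanned"
    if any(scan_like):
        return "mixed"
    return "text"


# First page has plenty of text, later pages are image-only.
example = [PageStat(1200, 0), PageStat(3, 1), PageStat(0, 1)]
print(classify_text_layer(example))
```

A first-page-only check would label the example above as a text PDF. Looking at all pages marks it as mixed, which is the case the guardrail is meant to catch. The threshold and the image-count heuristic would need tuning against the actual samples.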