add files
@@ -0,0 +1,27 @@
---
name: sample-corpus
description: Analyze and maintain the PDFtoMD samples corpus. Use when Codex needs to classify samples/ PDFs, design samples/metadata.json, identify OCR candidates, or connect corpus traits to focused regression tests.
---

# Sample Corpus

## Workflow

1. Read `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, `docs/PRD.md`, and `docs/CONVERSION_POLICY.md`.
2. Inspect PDFs with PyMuPDF before proposing tests.
3. Track these traits per PDF:
   - page count
   - text-layer quality
   - scanned or mixed pages
   - multi-column layout
   - formula density
   - table density
   - figure density
   - Korean filename/path coverage
4. If writing metadata, use `samples/metadata.json` and update `PROGRESS.md`.
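
A per-PDF trait record and its serialization could be sketched as follows. The schema is an assumption: the class `SampleTraits`, its field names, the density vocabularies, and the list-of-records layout of `samples/metadata.json` are hypothetical and not defined by this skill; only the trait list and the file path come from the workflow above. `ensure_ascii=False` keeps Korean filenames readable in the JSON instead of escaping them.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class SampleTraits:
    """Hypothetical record for one sample PDF; field names are illustrative."""

    path: str                  # path relative to the repo root, exactly as on disk
    page_count: int
    text_layer_quality: str    # e.g. "good", "poor", "none" (assumed vocabulary)
    scanned_pages: list[int]   # 1-based page numbers that look scanned
    multi_column: bool
    formula_density: str       # e.g. "none", "low", "high" (assumed vocabulary)
    table_density: str
    figure_density: str
    korean_path: bool          # True if the path contains Hangul


def has_hangul(text: str) -> bool:
    """Return True if any character is a Hangul syllable or jamo.

    The jamo ranges also catch decomposed (NFD) filenames.
    """
    return any(
        "\uac00" <= ch <= "\ud7a3"      # Hangul syllables
        or "\u1100" <= ch <= "\u11ff"   # Hangul jamo
        or "\u3130" <= ch <= "\u318f"   # Hangul compatibility jamo
        for ch in text
    )


def dump_metadata(records: list[SampleTraits], out_path: Path) -> None:
    """Write records as UTF-8 JSON without escaping non-ASCII filenames."""
    payload = [asdict(r) for r in records]
    out_path.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2) + "\n",
        encoding="utf-8",
    )
```

Calling `dump_metadata(records, Path("samples/metadata.json"))` would write the file named in step 4; the sample PDFs themselves are only read, never modified.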

## Guardrails

- Preserve original sample PDFs.
- Do not rename Korean sample files unless the user explicitly asks.
- Do not treat first-page text length as the only OCR signal.
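
The last guardrail can be illustrated with a check that looks at every page rather than only the first. This is a minimal sketch, assuming PyMuPDF is importable as `fitz`; the function names, the `min_chars=50` threshold, and the labels `"text"`, `"mixed"`, `"scanned"`, and `"empty"` are hypothetical choices, not part of this skill.

```python
from __future__ import annotations


def classify_ocr_need(page_stats: list[tuple[int, int]], min_chars: int = 50) -> str:
    """Classify a PDF from per-page (char_count, image_count) pairs.

    A page counts as scan-like when it has little extractable text but at
    least one embedded image. min_chars is an assumed, untuned threshold.
    """
    if not page_stats:
        return "empty"
    scan_like = [chars < min_chars and images > 0 for chars, images in page_stats]
    if all(scan_like):
        return "scanned"
    if any(scan_like):
        return "mixed"
    return "text"


def collect_page_stats(pdf_path: str) -> list[tuple[int, int]]:
    """Read every page with PyMuPDF and return (char_count, image_count) pairs."""
    import fitz  # PyMuPDF; imported lazily so the classifier works without it

    with fitz.open(pdf_path) as doc:
        return [
            (len(page.get_text().strip()), len(page.get_images(full=True)))
            for page in doc
        ]
```

For example, the page stats `[(1200, 0), (5, 1), (900, 2)]` classify as `"mixed"`, although the first page alone looks like clean text. Text length and image count are still only two signals; a garbled text layer with plenty of characters would need a separate quality check.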