baram2584/PDFToMD

김경종 7e985ae94a add files

2026-04-30 17:05:19 +09:00

name	description
sample-corpus	Analyze and maintain the PDFtoMD samples corpus. Use when Codex needs to classify samples/ PDFs, design samples/metadata.json, identify OCR candidates, or connect corpus traits to focused regression tests.

Sample Corpus

Workflow

1. Read AGENTS.md, PLAN.md, PROGRESS.md, docs/PRD.md, and docs/CONVERSION_POLICY.md.
2. Inspect PDFs with PyMuPDF before proposing tests.
3. Track these traits per PDF:
   - page count
   - text-layer quality
   - scanned or mixed pages
   - multi-column layout
   - formula density
   - table density
   - figure density
   - Korean filename/path coverage
4. If writing metadata, use samples/metadata.json and update PROGRESS.md.
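
The inspection and metadata steps above could be sketched roughly as follows. This is an illustration only, not the project's actual tooling: the helper names (`inspect_pdf`, `summarize`, `write_metadata`), the entry fields (`file`, `page_count`, `scanned_pages`, `text_layer`), and the `min_chars` threshold are all assumptions, since the skill does not define a schema for samples/metadata.json. It covers only a few of the traits listed above (page count, text-layer quality, scanned or mixed pages).

```python
import json
from pathlib import Path


def inspect_pdf(pdf_path):
    """Collect raw per-page signals with PyMuPDF. Opens the PDF read-only."""
    import fitz  # PyMuPDF; imported lazily so the helpers below work without it

    pages = []
    with fitz.open(str(pdf_path)) as doc:
        for page in doc:
            pages.append({
                "chars": len(page.get_text("text").strip()),
                "images": len(page.get_images(full=True)),
            })
    return pages


def summarize(name, pages, min_chars=50):
    """Reduce per-page signals to one metadata entry.

    Field names and the min_chars threshold are hypothetical.
    """
    # A page counts as scanned here if it has almost no text but holds an image.
    scanned = [i + 1 for i, p in enumerate(pages)
               if p["chars"] < min_chars and p["images"] > 0]
    if not scanned:
        text_layer = "text"
    elif len(scanned) == len(pages):
        text_layer = "scanned"
    else:
        text_layer = "mixed"
    return {
        "file": name,
        "page_count": len(pages),
        "scanned_pages": scanned,  # 1-based page numbers
        "text_layer": text_layer,
    }


def write_metadata(entries, path):
    """Write entries as UTF-8 JSON; ensure_ascii=False keeps Korean filenames readable."""
    Path(path).write_text(
        json.dumps(entries, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

A run over the corpus might then look like `write_metadata([summarize(p.name, inspect_pdf(p)) for p in sorted(Path("samples").glob("*.pdf"))], "samples/metadata.json")`, followed by a note in PROGRESS.md. The filename is stored exactly as found on disk, so Korean names are never rewritten.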

Guardrails

Preserve original sample PDFs.
Do not rename Korean sample files unless the user explicitly asks.
Do not treat first-page text length as the only OCR signal.
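
To illustrate the last guardrail, here is a minimal sketch of an OCR signal that looks at every page rather than only the first. The function name `ocr_signal` and both thresholds (`min_chars`, `min_low_fraction`) are assumptions chosen for the example, not values taken from the project. The input is a list of per-page extracted character counts, which could come from PyMuPDF's `page.get_text("text")`.

```python
def ocr_signal(page_chars, min_chars=50, min_low_fraction=0.2):
    """Flag a PDF as an OCR candidate from per-page text lengths.

    page_chars: extracted character count for each page, in page order.
    Thresholds are hypothetical and would need tuning against the corpus.
    """
    if not page_chars:
        return {"low_text_pages": [], "low_text_fraction": 0.0, "ocr_candidate": False}
    # 1-based numbers of pages whose text layer is nearly empty
    low = [i + 1 for i, n in enumerate(page_chars) if n < min_chars]
    fraction = len(low) / len(page_chars)
    return {
        "low_text_pages": low,
        "low_text_fraction": fraction,
        "ocr_candidate": fraction >= min_low_fraction,
    }


# A typed cover page followed by scanned body pages: the first page alone
# looks fine, but four of five pages have almost no text.
cover_then_scans = ocr_signal([800, 0, 4, 0, 12])

# An image-only cover followed by normal text pages: the first page alone
# looks like a scan, but the document as a whole does not need OCR.
blank_cover_then_text = ocr_signal([0, 900, 1100, 950, 1000, 870])
```

Both examples are cases where a first-page-only check gives the wrong answer, which is why the per-page fraction is used here. Text length is still only one signal. Image coverage or garbled-character ratios may be worth adding.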