add pdftomd

2026-05-08 16:42:19 +09:00
parent 551ab50735
commit 88d6b92283
99 changed files with 47332 additions and 0 deletions
@@ -0,0 +1,30 @@
+---
+name: fixture-evaluation
+description: Plan local fixture-based quality checks for this MinerU PDF-to-Markdown converter using samples/ without committing sample PDFs. Use when Codex needs to define sample coverage, quality metrics, regression checks, JSON metadata assertions, or human-readable .report.md expectations.
+---
+
+# Fixture Evaluation
+
+## Overview
+
+Use this skill to turn local sample PDFs into a small, repeatable quality plan. Keep samples local and untracked unless the user explicitly asks to commit them.
+
+## Workflow
+
+1. Read `PLAN.md` and `PROGRESS.md` first.
+2. Inspect `samples/` only as far as needed to understand fixture categories and filenames.
+3. Map each fixture to risks: math, tables, multi-column reading order, figures/assets, Korean filenames, and metadata coverage.
+4. Separate fast checks using mocked MinerU outputs from optional checks that require MinerU models, GPU, or long execution.
+5. Define metrics for both JSON metadata and `<stem>.report.md`.
+6. Update `PROGRESS.md` with fixture coverage and gaps.
+
+## Guardrails
+
+- Do not commit sample PDFs.
+- Do not copy samples into tracked fixtures without explicit user permission.
+- Do not make GPU/model-dependent checks mandatory for the default fast loop.
+- Do not grade on plain-text edit distance alone; also cover math, tables, reading order, assets, metadata, and renderability.
+
+## Reference
+
+Read `references/evaluation-metrics.md` when defining fixture coverage, regression criteria, or report fields.
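
Reviewer sketch for workflow step 3 (mapping fixtures to risks). This is a minimal illustration, not part of the commit. It assumes risks can be guessed from filename keywords, which may not hold for real samples. `RISK_KEYWORDS`, `FixturePlan`, `plan_fixture`, and `coverage_gaps` are hypothetical names.

```python
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical keyword-to-risk table; real fixtures may need manual tagging.
RISK_KEYWORDS = {
    "math": "math",
    "table": "tables",
    "column": "reading-order",
    "figure": "figures/assets",
}


@dataclass
class FixturePlan:
    path: Path
    risks: set = field(default_factory=set)


def plan_fixture(path: Path) -> FixturePlan:
    """Tag one sample PDF with risks inferred from its filename."""
    # Assumption: every fixture exercises metadata coverage.
    plan = FixturePlan(path=path, risks={"metadata"})
    stem = path.stem.lower()
    for keyword, risk in RISK_KEYWORDS.items():
        if keyword in stem:
            plan.risks.add(risk)
    # Non-ASCII filenames (for example Korean) exercise path handling.
    if not path.name.isascii():
        plan.risks.add("non-ascii-filename")
    return plan


def coverage_gaps(plans, required):
    """Return the required risks that no planned fixture covers."""
    covered = set()
    for plan in plans:
        covered |= plan.risks
    return sorted(set(required) - covered)
```

The result of `coverage_gaps` is the kind of list that step 6 would record in `PROGRESS.md`. The function only reads filenames, so it never opens or copies a sample PDF.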
@@ -0,0 +1,4 @@
+interface:
+  display_name: "Fixture Evaluation"
+  short_description: "Plan fixture quality checks locally"
+  default_prompt: "Use $fixture-evaluation to plan sample coverage, quality metrics, regression checks, and report expectations without committing sample files."
@@ -0,0 +1,37 @@
+# Evaluation Metrics
+
+Use these metrics for local fixture plans and future tests.
+
+## Fixture Categories
+
+- Simple digital PDF with text layer.
+- Math-heavy paper or chapter.
+- Multi-column paper.
+- Table with formulas.
+- Figure with caption and asset extraction.
+- Korean filename/path handling.
+
+## Fast Checks
+
+- Output files are planned at deterministic paths.
+- Metadata JSON includes source PDF, page count, engine, warnings, and output paths.
+- `.report.md` can be generated from metadata without re-running MinerU.
+- Markdown math delimiter normalization is deterministic.
+- Asset links resolve relative to the Markdown file.
+
+## Optional MinerU Checks
+
+- MinerU CLI execution succeeds or produces a clear failure warning.
+- Page coverage equals source PDF page count.
+- Math renderability failures are counted.
+- Table degradation warnings are counted.
+- Reading-order uncertainty is surfaced.
+
+## Report Sections
+
+- Summary: source file, pages, output files, engine, start/end time.
+- Warnings: grouped by severity and code.
+- Math: counts for inline, display, low-confidence, and render failures.
+- Assets: counts for extracted, missing, and broken links.
+- Tables: counts for extracted, degraded, and fallback tables.
+- Environment: Python, uv, MinerU version, GPU visibility when available.
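
Reviewer sketch for two of the fast checks above: generating a report from metadata without re-running MinerU, and resolving asset links relative to the Markdown file. This is an illustration under stated assumptions, not part of the commit. The metadata keys (`source_pdf`, `page_count`, `engine`, `warnings`, `severity`, `code`) and the function names are hypothetical, since the diff lists the fields but not their schema.

```python
import re
from collections import Counter
from pathlib import Path


def render_report(meta: dict) -> str:
    """Build report Markdown from metadata alone (no MinerU run)."""
    lines = [
        "# Conversion Report",
        "",
        "## Summary",
        f"- Source: {meta['source_pdf']}",
        f"- Pages: {meta['page_count']}",
        f"- Engine: {meta['engine']}",
        "",
        "## Warnings",
    ]
    # Group warnings by (severity, code); sorting keeps the output deterministic.
    counts = Counter((w["severity"], w["code"]) for w in meta.get("warnings", []))
    if not counts:
        lines.append("- None")
    for (severity, code), n in sorted(counts.items()):
        lines.append(f"- {severity} / {code}: {n}")
    return "\n".join(lines) + "\n"


# Matches Markdown image syntax and captures the link target.
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def broken_asset_links(markdown: str, markdown_path: Path) -> list:
    """Return relative image targets that do not exist next to the Markdown file."""
    base = markdown_path.parent
    broken = []
    for target in IMAGE_LINK.findall(markdown):
        if "://" in target:
            continue  # remote URLs are out of scope for this local check
        if not (base / target).exists():
            broken.append(target)
    return broken
```

Both functions take plain data and paths, so a test can feed them mocked MinerU output and stay inside the default fast loop.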