add pdftomd

김경종
2026-05-08 16:42:19 +09:00
parent 551ab50735
commit 88d6b92283
99 changed files with 47332 additions and 0 deletions
@@ -0,0 +1,30 @@
---
name: fixture-evaluation
description: Plan local fixture-based quality checks for this MinerU PDF-to-Markdown converter using samples/ without committing sample PDFs. Use when Codex needs to define sample coverage, quality metrics, regression checks, JSON metadata assertions, or human-readable .report.md expectations.
---
# Fixture Evaluation
## Overview
Use this skill to turn local sample PDFs into a small, repeatable quality plan. Keep samples local and untracked unless the user explicitly asks to commit them.
## Workflow
1. Read `PLAN.md` and `PROGRESS.md` first.
2. Inspect `samples/` only as far as needed to understand fixture categories and filenames.
3. Map each fixture to risks: math, tables, multi-column reading order, figures/assets, Korean filenames, and metadata coverage.
4. Separate fast checks using mocked MinerU outputs from optional checks that require MinerU models, GPU, or long execution.
5. Define metrics for both JSON metadata and `<stem>.report.md`.
6. Update `PROGRESS.md` with fixture coverage and gaps.
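
Steps 3 and 4 can be sketched as a small planning structure. This is a minimal illustration, not part of the converter: `FixturePlan`, `RISKS`, `uncovered_risks`, `split_by_cost`, and the sample filenames are hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Risk labels mirror step 3; the identifiers themselves are assumptions.
RISKS = frozenset(
    {"math", "tables", "reading_order", "assets", "korean_filename", "metadata"}
)


@dataclass(frozen=True)
class FixturePlan:
    """One local sample PDF and the risks it is meant to exercise."""

    filename: str
    risks: frozenset
    needs_mineru: bool = False  # True when the check needs models, GPU, or a long run


def uncovered_risks(plans):
    """Return the risks that no planned fixture exercises, in sorted order."""
    covered = set()
    for plan in plans:
        covered |= plan.risks
    return sorted(RISKS - covered)


def split_by_cost(plans):
    """Separate fast (mocked-output) fixtures from optional MinerU-backed ones."""
    fast = [p for p in plans if not p.needs_mineru]
    optional = [p for p in plans if p.needs_mineru]
    return fast, optional
```

The list returned by `uncovered_risks` is the kind of gap that step 6 asks to record in `PROGRESS.md`.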
## Guardrails
- Do not commit sample PDFs.
- Do not copy samples into tracked fixtures without explicit user permission.
- Do not make GPU/model-dependent checks mandatory for the default fast loop.
- Do not grade on plain-text edit distance alone; include math, tables, reading order, assets, metadata, and renderability.
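
The first two guardrails could be backed by a small check over the list of tracked paths, for example the output of `git ls-files`. This is a sketch under assumptions: `tracked_sample_pdfs` is a hypothetical helper, and it assumes samples live in a top-level `samples/` directory.

```python
from pathlib import PurePosixPath


def tracked_sample_pdfs(tracked_paths, samples_dir="samples"):
    """Return tracked paths that are PDFs directly under the top-level samples dir.

    `tracked_paths` is an iterable of repo-relative POSIX paths, e.g. the lines
    printed by `git ls-files`. The function itself does not call git.
    """
    hits = []
    for raw in tracked_paths:
        path = PurePosixPath(raw)
        in_samples = path.parts[:1] == (samples_dir,)
        if in_samples and path.suffix.lower() == ".pdf":
            hits.append(raw)
    return hits
```

An empty result means no sample PDF has been committed; any hit should be reviewed with the user before proceeding.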
## Reference
Read `references/evaluation-metrics.md` when defining fixture coverage, regression criteria, or report fields.
@@ -0,0 +1,4 @@
interface:
display_name: "Fixture Evaluation"
short_description: "Plan fixture quality checks locally"
default_prompt: "Use $fixture-evaluation to plan sample coverage, quality metrics, regression checks, and report expectations without committing sample files."
@@ -0,0 +1,37 @@
# Evaluation Metrics
Use these metrics for local fixture plans and future tests.
## Fixture Categories
- Simple digital PDF with text layer.
- Math-heavy paper or chapter.
- Multi-column paper.
- Table with formulas.
- Figure with caption and asset extraction.
- Korean filename/path handling.
## Fast Checks
- Output files are planned at deterministic paths.
- Metadata JSON includes source PDF, page count, engine, warnings, and output paths.
- `.report.md` can be generated from metadata without re-running MinerU.
- Markdown math delimiter normalization is deterministic.
- Asset links resolve relative to the Markdown file.
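
Two of the fast checks above can be sketched without running MinerU. The metadata key names (`source_pdf`, `page_count`, `engine`, `warnings`, `output_paths`) and both helper names are assumptions made for this example; the real schema may differ.

```python
import re
from pathlib import Path

# Hypothetical key names for the fields listed in the metadata check.
REQUIRED_KEYS = ("source_pdf", "page_count", "engine", "warnings", "output_paths")

# Matches simple Markdown images such as ![alt](assets/fig.png).
# Links with titles or spaces in the target are not handled in this sketch.
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def missing_metadata_keys(metadata):
    """Return required keys that are absent from a metadata dict."""
    return [key for key in REQUIRED_KEYS if key not in metadata]


def broken_asset_links(markdown_path):
    """Return image targets that do not resolve relative to the Markdown file."""
    md = Path(markdown_path)
    text = md.read_text(encoding="utf-8")
    broken = []
    for target in IMAGE_LINK.findall(text):
        if "://" in target:
            continue  # remote URLs are outside the scope of a local check
        if not (md.parent / target).exists():
            broken.append(target)
    return broken
```

Both helpers read only files that already exist on disk, so they fit the fast loop with mocked MinerU outputs.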
## Optional MinerU Checks
- MinerU CLI execution succeeds or produces a clear failure warning.
- Page coverage equals source PDF page count.
- Math renderability failures are counted.
- Table degradation warnings are counted.
- Reading-order uncertainty is surfaced.
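
The page coverage and counting checks above might look like the following. This is a sketch only: the warning records are assumed to be dicts with a `code` field, and codes such as `math_render_failure` or `table_degraded` are hypothetical labels, not MinerU output.

```python
from collections import Counter


def missing_pages(source_page_count, converted_pages):
    """Return 1-based page numbers of the source PDF that have no converted output."""
    expected = set(range(1, source_page_count + 1))
    return sorted(expected - set(converted_pages))


def count_warning_codes(warnings):
    """Count warnings per code, e.g. math render failures or degraded tables."""
    return Counter(w["code"] for w in warnings)
```

Page coverage holds when `missing_pages` returns an empty list.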
## Report Sections
- Summary: source file, pages, output files, engine, start/end time.
- Warnings: grouped by severity and code.
- Math: counts for inline, display, low-confidence, and render failures.
- Assets: extracted, missing, broken links.
- Tables: extracted, degraded, fallback count.
- Environment: Python, uv, MinerU version, GPU visibility when available.
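
As an illustration of generating `.report.md` from metadata alone, the sketch below renders only the Summary and Warnings sections. The function name `render_report`, the report title, and the metadata field names (`source_pdf`, `page_count`, `engine`, and per-warning `severity` and `code`) are assumptions for this example.

```python
from collections import defaultdict


def render_report(metadata):
    """Render a partial report from a metadata dict without re-running MinerU."""
    lines = [
        "# Conversion Report",
        "## Summary",
        f"- Source: {metadata['source_pdf']}",
        f"- Pages: {metadata['page_count']}",
        f"- Engine: {metadata['engine']}",
        "## Warnings",
    ]
    # Group warnings by (severity, code) as the Warnings section describes.
    grouped = defaultdict(int)
    for warning in metadata.get("warnings", []):
        grouped[(warning["severity"], warning["code"])] += 1
    if not grouped:
        lines.append("- None")
    # Sorting keeps the output identical across runs for the same metadata.
    for (severity, code), count in sorted(grouped.items()):
        lines.append(f"- {severity} / {code}: {count}")
    return "\n".join(lines) + "\n"
```

Because the output depends only on the metadata dict, the same input always yields the same report text, which makes it usable as a regression fixture.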