add pdftomd
@@ -0,0 +1,30 @@
---
name: fixture-evaluation
description: Plan local fixture-based quality checks for this MinerU PDF-to-Markdown converter using samples/ without committing sample PDFs. Use when Codex needs to define sample coverage, quality metrics, regression checks, JSON metadata assertions, or human-readable .report.md expectations.
---

# Fixture Evaluation

## Overview

Use this skill to turn local sample PDFs into a small, repeatable quality plan. Keep samples local and untracked unless the user explicitly asks to commit them.

## Workflow

1. Read `PLAN.md` and `PROGRESS.md` first.
2. Inspect `samples/` only enough to understand fixture categories and filenames.
3. Map each fixture to risks: math, tables, multi-column reading order, figures/assets, Korean filenames, and metadata coverage.
4. Separate fast checks using mocked MinerU outputs from optional checks that require MinerU models, GPU, or long execution.
5. Define metrics for both JSON metadata and `<stem>.report.md`.
6. Update `PROGRESS.md` with fixture coverage and gaps.

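Steps 3 and 4 can be sketched as a small data structure. This is only an illustration, not the project's code: `FixturePlan`, `RISKS`, `coverage_gaps`, and `split_fast_optional` are hypothetical names, and the risk labels simply restate the list in step 3.

```python
from dataclasses import dataclass

# Risk labels taken from workflow step 3; the identifiers are assumptions.
RISKS = frozenset(
    {"math", "tables", "reading_order", "assets", "korean_filename", "metadata"}
)


@dataclass(frozen=True)
class FixturePlan:
    """One local sample and the risks it is meant to exercise (hypothetical)."""

    filename: str
    risks: frozenset
    needs_mineru: bool = False  # True if the check needs models, GPU, or a long run


def coverage_gaps(plans):
    """Return the risks that no planned fixture covers, sorted for stable output."""
    covered = set()
    for plan in plans:
        unknown = plan.risks - RISKS
        if unknown:
            raise ValueError(f"{plan.filename}: unknown risks {sorted(unknown)}")
        covered |= plan.risks
    return sorted(RISKS - covered)


def split_fast_optional(plans):
    """Separate plans that can run on mocked output from those needing MinerU."""
    fast = [p for p in plans if not p.needs_mineru]
    optional = [p for p in plans if p.needs_mineru]
    return fast, optional
```

The sorted result of `coverage_gaps` could be pasted into `PROGRESS.md` as the list of gaps in step 6.
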
## Guardrails

- Do not commit sample PDFs.
- Do not copy samples into tracked fixtures without explicit user permission.
- Do not make GPU/model-dependent checks mandatory for the default fast loop.
- Do not grade only plain-text edit distance; include math, tables, reading order, assets, metadata, and renderability.

## Reference

Read `references/evaluation-metrics.md` when defining fixture coverage, regression criteria, or report fields.
@@ -0,0 +1,4 @@
interface:
  display_name: "Fixture Evaluation"
  short_description: "Plan fixture quality checks locally"
  default_prompt: "Use $fixture-evaluation to plan sample coverage, quality metrics, regression checks, and report expectations without committing sample files."
@@ -0,0 +1,37 @@
# Evaluation Metrics

Use these metrics for local fixture plans and future tests.

## Fixture Categories

- Simple digital PDF with text layer.
- Math-heavy paper or chapter.
- Multi-column paper.
- Table with formulas.
- Figure with caption and asset extraction.
- Korean filename/path handling.

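For the Korean filename category, one pitfall worth planning for is Unicode normalization: the same Hangul name can arrive composed (NFC) or decomposed (NFD) depending on the filesystem or archive tool, and the two forms compare unequal as strings. The sketch below assumes the converter would want NFC stems; `normalized_stem` is a hypothetical helper, not part of this repository.

```python
import unicodedata
from pathlib import PurePath


def normalized_stem(pdf_path):
    """Return the file stem in NFC form so output names stay stable.

    Assumption: NFC is the desired canonical form for derived output names.
    """
    return unicodedata.normalize("NFC", PurePath(pdf_path).stem)


# The same visible name in two normalization forms.
composed = unicodedata.normalize("NFC", "보고서.pdf")
decomposed = unicodedata.normalize("NFD", "보고서.pdf")
```

A fast check could assert that both forms map to the same stem, without touching any real sample file.
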
## Fast Checks

- Output files are planned at deterministic paths.
- Metadata JSON includes source PDF, page count, engine, warnings, and output paths.
- `.report.md` can be generated from metadata without re-running MinerU.
- Markdown math delimiter normalization is deterministic.
- Asset links resolve relative to the Markdown file.

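Two of the fast checks above, deterministic paths and asset link resolution, can be sketched without MinerU. This is a minimal sketch under stated assumptions: only the `<stem>.report.md` name comes from this skill, while the `.md` and `.json` names, the `planned_outputs` and `broken_asset_links` helpers, and the simple image-link regex are illustrative.

```python
import re
from pathlib import Path

# Matches Markdown image links such as ![alt](images/fig1.png).
# Simplification: titles, angle-bracket targets, and URL-encoding are ignored.
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def planned_outputs(pdf_path, out_dir):
    """Derive output paths from the PDF stem only, so repeated runs agree."""
    stem = Path(pdf_path).stem
    out = Path(out_dir)
    return {
        "markdown": out / f"{stem}.md",        # assumed name
        "metadata": out / f"{stem}.json",      # assumed name
        "report": out / f"{stem}.report.md",   # name given by this skill
    }


def broken_asset_links(markdown_path):
    """List relative image targets that do not exist next to the Markdown file."""
    md = Path(markdown_path)
    text = md.read_text(encoding="utf-8")
    broken = []
    for target in IMAGE_LINK.findall(text):
        if "://" in target:
            continue  # remote URLs are out of scope for a local check
        if not (md.parent / target).exists():
            broken.append(target)
    return broken
```

Both helpers work on mocked output directories, which keeps them inside the default fast loop.
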
## Optional MinerU Checks

- MinerU CLI execution succeeds or produces a clear failure warning.
- Page coverage equals source PDF page count.
- Math renderability failures are counted.
- Table degradation warnings are counted.
- Reading-order uncertainty is surfaced.

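The page coverage check could look roughly like the following. It assumes the converter can report which 1-based page numbers appear in its output; how those numbers are obtained from MinerU is not specified here, and the warning codes and severities are hypothetical.

```python
def page_coverage_warnings(source_page_count, converted_pages):
    """Compare converted page numbers (1-based) against the source page count.

    Returns a list of warning dicts; an empty list means full coverage.
    The "code" and "severity" values are illustrative, not a fixed schema.
    """
    expected = set(range(1, source_page_count + 1))
    seen = set(converted_pages)
    warnings = []

    missing = sorted(expected - seen)
    if missing:
        warnings.append(
            {"code": "PAGES_MISSING", "severity": "error", "pages": missing}
        )

    unexpected = sorted(seen - expected)
    if unexpected:
        warnings.append(
            {"code": "PAGES_UNEXPECTED", "severity": "warning", "pages": unexpected}
        )

    return warnings
```

Returning warnings instead of raising keeps a partial conversion reportable, which matches the "succeeds or produces a clear failure warning" expectation.
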
## Report Sections

- Summary: source file, pages, output files, engine, start/end time.
- Warnings: grouped by severity and code.
- Math: counts for inline, display, low-confidence, and render failures.
- Assets: extracted, missing, broken links.
- Tables: extracted, degraded, fallback count.
- Environment: Python, uv, MinerU version, GPU visibility when available.

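As a rough illustration of generating the report from metadata alone, the sketch below renders only the Summary and Warnings sections. The metadata keys (`source_pdf`, `page_count`, `engine`, `warnings` with `severity` and `code`) are assumptions about a schema this document does not define, and `render_report` is a hypothetical name.

```python
from collections import Counter


def render_report(meta):
    """Render a partial report from a metadata dict without re-running MinerU."""
    lines = ["# Conversion Report", "", "## Summary", ""]
    lines.append(f"- Source: {meta['source_pdf']}")
    lines.append(f"- Pages: {meta['page_count']}")
    lines.append(f"- Engine: {meta['engine']}")

    lines += ["", "## Warnings", ""]
    # Group by (severity, code) and sort so the output is deterministic.
    counts = Counter((w["severity"], w["code"]) for w in meta.get("warnings", []))
    if not counts:
        lines.append("- None")
    for (severity, code), n in sorted(counts.items()):
        lines.append(f"- {severity} / {code}: {n}")

    return "\n".join(lines) + "\n"
```

Because the function is pure, a regression test can compare its output for a stored metadata file against an expected `.report.md` snapshot.
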