modify pdftomd
@@ -1,6 +1,6 @@
 ---
 name: fixture-evaluation
-description: Plan local fixture-based quality checks for this MinerU PDF-to-Markdown converter using samples/ without committing sample PDFs. Use when Codex needs to define sample coverage, quality metrics, regression checks, JSON metadata assertions, or human-readable .report.md expectations.
+description: Plan local fixture-based quality checks for this MinerU PDF-to-Markdown converter using samples/ without committing sample PDFs. Use when Codex needs to define sample coverage, quality metrics, regression checks, internal provenance assertions, or human-readable _report.md expectations.
 ---

 # Fixture Evaluation
@@ -14,9 +14,9 @@ Use this skill to turn local sample PDFs into a small, repeatable quality plan.
 1. Read `PLAN.md` and `PROGRESS.md` first.
 2. Read `docs/WORKARCHIVE.md` when prior fixture coverage, verification, or sample conversion evidence is needed.
 3. Inspect `samples/` only enough to understand fixture categories and filenames.
-4. Map each fixture to risks: math, tables, multi-column reading order, figures/assets, Korean filenames, and metadata coverage.
+4. Map each fixture to risks: math, tables, multi-column reading order, figures/assets, Korean filenames, and report/provenance coverage.
 5. Separate fast checks using mocked MinerU outputs from optional checks that require MinerU models, GPU, or long execution.
-6. Define metrics for both JSON metadata and `<stem>.report.md`.
+6. Define metrics for internal provenance and `<stem>_report.md`.
 7. Update `PROGRESS.md` with fixture coverage and gaps.

 ## Guardrails
@@ -24,7 +24,7 @@ Use this skill to turn local sample PDFs into a small, repeatable quality plan.
 - Do not commit sample PDFs.
 - Do not copy samples into tracked fixtures without explicit user permission.
 - Do not make GPU/model-dependent checks mandatory for the default fast loop.
-- Do not grade only plain-text edit distance; include math, tables, reading order, assets, metadata, and renderability.
+- Do not grade only plain-text edit distance; include math, tables, reading order, assets, report provenance, and renderability.

 ## Reference

@@ -14,8 +14,8 @@ Use these metrics for local fixture plans and future tests.
 ## Fast Checks

 - Output files are planned at deterministic paths.
-- Metadata JSON includes source PDF, page count, engine, warnings, and output paths.
-- `.report.md` can be generated from metadata without re-running MinerU.
+- Internal provenance includes source PDF, page count, engine, warnings, and output paths.
+- `_report.md` can be generated from internal provenance without re-running MinerU.
 - Markdown math delimiter normalization is deterministic.
 - Asset links resolve relative to the Markdown file.

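The fast check that `_report.md` can be produced from internal provenance alone might be exercised with a sketch like the one below. The `Provenance` class, its field names, and `render_report` are hypothetical illustrations built only from the fields this hunk lists (source PDF, page count, engine, warnings, output paths); they are not the project's actual API or report layout.

```python
from dataclasses import dataclass, field


@dataclass
class Provenance:
    """Hypothetical in-memory provenance record; field names are assumptions."""
    source_pdf: str
    page_count: int
    engine: str
    warnings: list[str] = field(default_factory=list)
    output_paths: list[str] = field(default_factory=list)


def render_report(prov: Provenance) -> str:
    """Build report text purely from the provenance record (no MinerU call)."""
    lines = [
        f"# Conversion report: {prov.source_pdf}",
        "",
        f"- Pages: {prov.page_count}",
        f"- Engine: {prov.engine}",
        f"- Warnings: {len(prov.warnings)}",
        "",
        "## Outputs",
        *[f"- `{path}`" for path in prov.output_paths],
    ]
    if prov.warnings:
        # Only emit the section when there is something to show.
        lines += ["", "## Warnings", *[f"- {w}" for w in prov.warnings]]
    return "\n".join(lines) + "\n"
```

Because the function depends only on its argument, a test can build a record by hand (as a mocked MinerU output would) and assert that rendering it twice yields identical text.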
@@ -13,11 +13,11 @@ Use this skill when Markdown output quality matters more than raw text extractio

 1. Read `PLAN.md` and `PROGRESS.md` first.
 2. Read `docs/WORKARCHIVE.md` when prior Markdown output, MathJax, or sample conversion evidence is needed.
-3. Read `PRD.md` and `ARCHITECTURE.md` when output behavior, metadata, or reporting is affected.
+3. Read `PRD.md` and `ARCHITECTURE.md` when output behavior, internal provenance, or reporting is affected.
 4. Preserve project delimiter policy: inline math uses `$...$`; display math uses `$$...$$`.
 5. Check asset links, table fallback behavior, heading/list interactions, and page boundary markers against Obsidian rendering assumptions.
 6. Define warnings for low-confidence math, non-renderable LaTeX, broken asset links, table degradation, and reading-order uncertainty.
-7. Ensure `.report.md` content is derived from metadata, not separate manual state.
+7. Ensure `_report.md` content is derived from internal provenance, not separate manual state.

 ## Checks

@@ -25,7 +25,7 @@ Use this skill when Markdown output quality matters more than raw text extractio
 - Display math should be separated from surrounding paragraphs by blank lines.
 - Asset paths should be stable, relative to the Markdown file, and safe for Obsidian vaults.
 - Tables with formulas should prefer readable Markdown when reliable and warn when downgraded.
-- Every renderability failure should be countable in metadata and visible in `.report.md`.
+- Every renderability failure should be countable in internal provenance and visible in `_report.md`.

 ## Reference

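The display-math check in this hunk could be made countable with a small scanner such as the following. This is a sketch under a stated assumption: display math is written with `$$` alone on its own line. The function name is hypothetical, and single-line `$$...$$` blocks are not handled.

```python
def find_unseparated_display_math(markdown: str) -> list[int]:
    """Return 1-based line numbers of `$$` fences not isolated by blank lines.

    Assumes each display-math fence is a line containing only `$$`.
    """
    lines = markdown.splitlines()
    problems = []
    inside = False
    for i, line in enumerate(lines):
        if line.strip() != "$$":
            continue
        if not inside:
            # Opening fence: the previous line must be blank (or start of file).
            if i > 0 and lines[i - 1].strip():
                problems.append(i + 1)
        else:
            # Closing fence: the next line must be blank (or end of file).
            if i + 1 < len(lines) and lines[i + 1].strip():
                problems.append(i + 1)
        inside = not inside
    return problems
```

The length of the returned list is one candidate for a renderability-failure count that can be stored in provenance and surfaced in `_report.md`.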
@@ -12,7 +12,7 @@ Use these checks when designing or reviewing Markdown output.

 ## Assets

-- Store images under a deterministic asset directory next to the Markdown output.
+- Store images under the deterministic shared `images/` directory next to the Markdown output parts.
 - Use relative Markdown links that remain valid when the output directory is moved as a unit.
 - Record asset source page, bbox if available, generated file path, and missing-link warnings.

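A missing-link warning of the kind described in the Assets list can be derived by resolving each image link against the directory of the Markdown file. The sketch below is illustrative only: `missing_assets` and the regular expression are assumptions, the pattern covers plain `![alt](path)` links without titles, and URL-encoded targets (for example, percent-encoded Korean filenames) would need decoding first.

```python
import re
from pathlib import Path

# Matches ![alt](target) where target has no spaces or title (an assumption).
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def missing_assets(markdown_path: Path) -> list[str]:
    """Return image targets that do not resolve relative to the Markdown file."""
    text = markdown_path.read_text(encoding="utf-8")
    missing = []
    for target in IMAGE_LINK.findall(text):
        if "://" in target:
            continue  # Remote URLs are outside the scope of this check.
        # Absolute paths break when the output directory is moved as a unit,
        # so they are reported alongside targets that do not exist.
        if Path(target).is_absolute() or not (markdown_path.parent / target).is_file():
            missing.append(target)
    return missing
```

Running this against a temporary directory that contains a Markdown file and an `images/` folder keeps the check inside the fast loop, since no MinerU models are involved.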
@@ -20,7 +20,7 @@ Use these checks when designing or reviewing Markdown output.

 - Prefer Markdown tables only when cell boundaries and reading order are reliable.
 - If formulas or merged cells make Markdown tables misleading, use a readable fallback and emit a table warning.
-- Keep table warnings visible in both JSON metadata and `.report.md`.
+- Keep table warnings visible in internal provenance and `_report.md`.

 ## Report Signals
