add pdftomd

This commit is contained in:
김경종
2026-05-08 16:42:19 +09:00
parent 551ab50735
commit 88d6b92283
99 changed files with 47332 additions and 0 deletions
+30
View File
@@ -0,0 +1,30 @@
---
name: fixture-evaluation
description: Plan local fixture-based quality checks for this MinerU PDF-to-Markdown converter using samples/ without committing sample PDFs. Use when Codex needs to define sample coverage, quality metrics, regression checks, JSON metadata assertions, or human-readable .report.md expectations.
---
# Fixture Evaluation
## Overview
Use this skill to turn local sample PDFs into a small, repeatable quality plan. Keep samples local and untracked unless the user explicitly asks to commit them.
## Workflow
1. Read `PLAN.md` and `PROGRESS.md` first.
2. Inspect `samples/` only enough to understand fixture categories and filenames.
3. Map each fixture to risks: math, tables, multi-column reading order, figures/assets, Korean filenames, and metadata coverage.
4. Separate fast checks using mocked MinerU outputs from optional checks that require MinerU models, GPU, or long execution.
5. Define metrics for both JSON metadata and `<stem>.report.md`.
6. Update `PROGRESS.md` with fixture coverage and gaps.
## Guardrails
- Do not commit sample PDFs.
- Do not copy samples into tracked fixtures without explicit user permission.
- Do not make GPU/model-dependent checks mandatory for the default fast loop.
- Do not grade only plain-text edit distance; include math, tables, reading order, assets, metadata, and renderability.
## Reference
Read `references/evaluation-metrics.md` when defining fixture coverage, regression criteria, or report fields.
@@ -0,0 +1,4 @@
interface:
display_name: "Fixture Evaluation"
short_description: "Plan fixture quality checks locally"
default_prompt: "Use $fixture-evaluation to plan sample coverage, quality metrics, regression checks, and report expectations without committing sample files."
@@ -0,0 +1,37 @@
# Evaluation Metrics
Use these metrics for local fixture plans and future tests.
## Fixture Categories
- Simple digital PDF with text layer.
- Math-heavy paper or chapter.
- Multi-column paper.
- Table with formulas.
- Figure with caption and asset extraction.
- Korean filename/path handling.
## Fast Checks
- Output files are planned at deterministic paths.
- Metadata JSON includes source PDF, page count, engine, warnings, and output paths.
- `.report.md` can be generated from metadata without re-running MinerU.
- Markdown math delimiter normalization is deterministic.
- Asset links resolve relative to the Markdown file.
## Optional MinerU Checks
- MinerU CLI execution succeeds or produces a clear failure warning.
- Page coverage equals source PDF page count.
- Math renderability failures are counted.
- Table degradation warnings are counted.
- Reading-order uncertainty is surfaced.
## Report Sections
- Summary: source file, pages, output files, engine, start/end time.
- Warnings: grouped by severity and code.
- Math: counts for inline, display, low-confidence, and render failures.
- Assets: extracted, missing, broken links.
- Tables: extracted, degraded, fallback count.
- Environment: Python, uv, MinerU version, GPU visibility when available.
@@ -0,0 +1,31 @@
---
name: math-markdown-review
description: Review and design Obsidian-friendly Markdown normalization for math-heavy PDF conversion, including LaTeX delimiters, display math spacing, asset links, tables, and quality report warnings. Use when Codex needs to check Markdown output assumptions, design post-processing rules, or define renderability checks for formulas and assets.
---
# Math Markdown Review
## Overview
Use this skill when Markdown output quality matters more than raw text extraction. The goal is best-effort automatic conversion with explicit warnings and provenance for failures.
## Workflow
1. Read `PLAN.md` and `PROGRESS.md` first.
2. Read `PRD.md` and `ARCHITECTURE.md` when output behavior, metadata, or reporting is affected.
3. Preserve project delimiter policy: inline math uses `$...$`; display math uses `$$...$$`.
4. Check asset links, table fallback behavior, heading/list interactions, and page boundary markers against Obsidian rendering assumptions.
5. Define warnings for low-confidence math, non-renderable LaTeX, broken asset links, table degradation, and reading-order uncertainty.
6. Ensure `.report.md` content is derived from metadata, not separate manual state.
## Checks
- Inline math should not contain unescaped newlines or surrounding spaces that break rendering.
- Display math should be separated from surrounding paragraphs by blank lines.
- Asset paths should be stable, relative to the Markdown file, and safe for Obsidian vaults.
- Tables with formulas should prefer readable Markdown when reliable and warn when downgraded.
- Every renderability failure should be countable in metadata and visible in `.report.md`.
## Reference
Read `references/obsidian-output-checks.md` for concrete normalization and report-signal guidance.
@@ -0,0 +1,4 @@
interface:
display_name: "Math Markdown Review"
short_description: "Check Obsidian math Markdown output"
default_prompt: "Use $math-markdown-review to design or check Obsidian-friendly Markdown normalization, math delimiters, asset paths, tables, and quality report signals."
@@ -0,0 +1,31 @@
# Obsidian Output Checks
Use these checks when designing or reviewing Markdown output.
## Math
- Inline math: `$...$`, no line breaks inside the delimiter pair.
- Display math: `$$...$$`, with blank lines before and after the block.
- Preserve source provenance for formulas: page index, bbox if available, engine, confidence, and warning codes.
- Record render failures separately from extraction confidence.
- Avoid rewriting LaTeX semantics unless the rule is deterministic and tested.
## Assets
- Store images under a deterministic asset directory next to the Markdown output.
- Use relative Markdown links that remain valid when the output directory is moved as a unit.
- Record asset source page, bbox if available, generated file path, and missing-link warnings.
## Tables
- Prefer Markdown tables only when cell boundaries and reading order are reliable.
- If formulas or merged cells make Markdown tables misleading, use a readable fallback and emit a table warning.
- Keep table warnings visible in both JSON metadata and `.report.md`.
## Report Signals
- Total pages processed and pages with warnings.
- Math block count, inline math count, and non-renderable math count.
- Broken asset links and missing assets.
- Table degradation count.
- Reading-order uncertainty count.
+32
View File
@@ -0,0 +1,32 @@
---
name: mineru-research
description: Research MinerU 3.1.0 setup, CLI behavior, output formats, model/runtime requirements, licensing, and local-only integration constraints for this PDF-to-Markdown project. Use when Codex needs to update project knowledge, verify MinerU facts, plan the MinerU adapter, or resolve uncertainty about installation, execution, or output behavior without adding alternate engines.
---
# MinerU Research
## Overview
Use this skill to verify MinerU 3.1.0 facts before changing project docs or plans. Keep the scope narrow: MinerU 3.1.0 is the only conversion engine and direct local CLI execution is the only v1 execution mode.
## Workflow
1. Read `PLAN.md` and `PROGRESS.md` first.
2. Read `PRD.md`, `ARCHITECTURE.md`, and `docs/KNOWLEDGEBASE.md` when the change affects product or architecture decisions.
3. Prefer official MinerU documentation, the MinerU GitHub repository, release notes, primary papers, and official dependency docs.
4. Verify time-sensitive facts with web research before updating docs.
5. Record source URLs and access dates in durable docs when the finding affects future implementation.
6. Update `PROGRESS.md` with the verified fact, unresolved uncertainty, and next action.
## Constraints
- Do not reintroduce candidate engine comparisons.
- Allow only direct `mineru` CLI execution and the CLI-internal temporary local `mineru-api` process.
- Do not add cloud OCR, remote LLM, `--api-url`, remote API, router, HTTP client backend, or remote OpenAI-compatible backend paths.
- Do not imply perfect LaTeX reconstruction.
- Do not implement converter code unless the user explicitly requests implementation.
- Treat GTX 1070 Ti 8GB, Python 3.12, uv, and Windows PowerShell as active project constraints.
## Reference
Read `references/source-checklist.md` when planning a research pass or updating source-backed documentation.
@@ -0,0 +1,4 @@
interface:
display_name: "MinerU Research"
short_description: "Verify MinerU local integration facts"
default_prompt: "Use $mineru-research to verify MinerU 2.5 setup, CLI behavior, outputs, licensing, and local-only integration constraints against official sources."
@@ -0,0 +1,29 @@
# MinerU Research Source Checklist
Use this checklist before changing project docs or plans based on MinerU facts.
## Sources
- MinerU GitHub repository for install instructions, CLI examples, output behavior, and license files.
- MinerU official documentation for current setup and execution modes.
- MinerU release notes or tags for version-specific changes.
- Primary papers for model capability claims.
- Official Python, uv, CUDA, PyTorch, or dependency docs for environment compatibility.
## Facts To Verify
- Supported Python versions and package manager expectations.
- Whether MinerU 3.1.0 supports the required local CLI path on Windows.
- Whether MinerU 3.1.0's CLI-internal temporary local `mineru-api` behavior stays local and avoids `--api-url`.
- Required model download/cache behavior and offline reuse assumptions.
- GPU/CPU execution options and expected memory pressure for GTX 1070 Ti 8GB.
- Output directory structure, Markdown output, image asset output, JSON/intermediate output, and page/block metadata availability.
- Exit codes, error messages, logging behavior, and partial-output behavior.
- License obligations for MinerU, bundled models, and transitive runtime packages.
## Recording Rules
- Record source URL and access date for durable claims.
- Distinguish official fact from inference.
- Keep alternate engine names out of project docs unless the user explicitly asks for a separate historical note.
- If a source conflicts with a fixed product decision, record the conflict and ask for a user decision.