add pdftomd
@@ -0,0 +1,37 @@
# Evaluation Metrics

Use these metrics for local fixture plans and future tests.

## Fixture Categories

- Simple digital PDF with text layer.
- Math-heavy paper or chapter.
- Multi-column paper.
- Table with formulas.
- Figure with caption and asset extraction.
- Korean filename/path handling.
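
For the Korean filename fixture, one thing worth checking is Unicode normalization: some filesystems report Hangul in decomposed (NFD) form, so the same name can compare unequal as a raw string. The sketch below is only an illustration of that check. The helper name `normalized_stem` and the sample path are hypothetical and not part of the project.

```python
import unicodedata
from pathlib import PurePosixPath


def normalized_stem(pdf_path: str) -> str:
    """Return the file stem in NFC form so composed and decomposed Hangul compare equal."""
    return unicodedata.normalize("NFC", PurePosixPath(pdf_path).stem)


# Hypothetical fixture path; the NFD variant mimics what some filesystems return.
composed = unicodedata.normalize("NFC", "논문/수학 노트.pdf")
decomposed = unicodedata.normalize("NFD", composed)

# The raw strings differ, but the normalized stems agree.
assert composed != decomposed
assert normalized_stem(composed) == normalized_stem(decomposed)
```

Deriving output names from the normalized stem would keep planned paths stable regardless of which form the filesystem hands back.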

## Fast Checks

- Output files are planned at deterministic paths.
- Metadata JSON includes source PDF, page count, engine, warnings, and output paths.
- `.report.md` can be generated from metadata without re-running MinerU.
- Markdown math delimiter normalization is deterministic.
- Asset links resolve relative to the Markdown file.
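
A minimal sketch of how three of these fast checks could be exercised without MinerU. Everything here is an assumption made for illustration: the function names, the `.meta.json` and `_assets` naming, and the choice to normalize `\(...\)` and `\[...\]` to dollar delimiters are not taken from the project.

```python
import re
import tempfile
from pathlib import Path


def plan_outputs(pdf: Path, out_dir: Path) -> dict[str, Path]:
    """Derive every output path from the PDF stem alone (no timestamps or random IDs)."""
    stem = pdf.stem
    return {
        "markdown": out_dir / f"{stem}.md",
        "metadata": out_dir / f"{stem}.meta.json",  # assumed naming
        "report": out_dir / f"{stem}.report.md",
        "assets": out_dir / f"{stem}_assets",  # assumed naming
    }


def normalize_math(md: str) -> str:
    r"""Rewrite \(...\) as $...$ and \[...\] as $$...$$; other text is left untouched."""
    md = re.sub(r"\\\((.+?)\\\)", lambda m: f"${m.group(1).strip()}$", md, flags=re.S)
    return re.sub(r"\\\[(.+?)\\\]", lambda m: f"$${m.group(1).strip()}$$", md, flags=re.S)


def broken_asset_links(md_file: Path) -> list[str]:
    """List image targets that do not exist relative to the Markdown file's directory."""
    text = md_file.read_text(encoding="utf-8")
    targets = re.findall(r"!\[[^\]]*\]\(([^)\s]+)\)", text)
    return [t for t in targets if "://" not in t and not (md_file.parent / t).exists()]


# Same inputs must always give the same plan.
pdf, out = Path("papers/sample.pdf"), Path("out")
assert plan_outputs(pdf, out) == plan_outputs(pdf, out)

# Normalization should be idempotent: a second pass changes nothing.
once = normalize_math(r"Euler: \(e^{i\pi} + 1 = 0\) and \[ a^2 + b^2 = c^2 \]")
assert normalize_math(once) == once

# Asset links are resolved against the Markdown file's own directory.
with tempfile.TemporaryDirectory() as tmp:
    plan = plan_outputs(pdf, Path(tmp))
    plan["assets"].mkdir()
    (plan["assets"] / "fig1.png").write_bytes(b"")
    plan["markdown"].write_text(
        "![ok](sample_assets/fig1.png)\n![gone](sample_assets/fig2.png)\n",
        encoding="utf-8",
    )
    missing = broken_asset_links(plan["markdown"])
```

The replacement uses a function rather than a template string so that backslashes inside the math (for example `\pi`) are copied through literally instead of being read as regex escapes.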

## Optional MinerU Checks

- MinerU CLI execution succeeds or produces a clear failure warning.
- Page coverage equals source PDF page count.
- Math renderability failures are counted.
- Table degradation warnings are counted.
- Reading-order uncertainty is surfaced.
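
The coverage and counting checks can be expressed over plain data once a MinerU run has been summarized. This is a sketch under stated assumptions: the warning codes, the record shape, and both function names are hypothetical, and no MinerU API is called.

```python
from collections import Counter


def page_coverage(source_pages: int, converted_pages: list[int]) -> dict:
    """Compare 1-based page numbers seen in the output against the source page count."""
    expected = set(range(1, source_pages + 1))
    seen = set(converted_pages)
    return {
        "covered": len(expected & seen),
        "missing": sorted(expected - seen),
        "complete": expected <= seen,
    }


def count_warnings(warnings: list[dict]) -> Counter:
    """Count warnings per code, such as math render failures or degraded tables."""
    return Counter(w["code"] for w in warnings)


# Hypothetical warning records; the code strings are placeholders.
warnings = [
    {"code": "MATH_RENDER_FAIL", "page": 2},
    {"code": "TABLE_DEGRADED", "page": 3},
    {"code": "MATH_RENDER_FAIL", "page": 5},
    {"code": "READING_ORDER_UNCERTAIN", "page": 4},
]
coverage = page_coverage(5, [1, 2, 3, 5])
counts = count_warnings(warnings)
```

Reporting the missing page numbers, not just a pass/fail flag, makes a coverage failure easier to trace back to a specific page of the fixture.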

## Report Sections

- Summary: source file, pages, output files, engine, start/end time.
- Warnings: grouped by severity and code.
- Math: counts for inline, display, low-confidence, and render failures.
- Assets: extracted, missing, broken links.
- Tables: extracted, degraded, fallback count.
- Environment: Python, uv, MinerU version, GPU visibility when available.
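
As a rough illustration of building the report purely from saved metadata, the sketch below renders the Summary and Warnings sections. The metadata keys (`source_pdf`, `page_count`, `engine`, `warnings`) and the `severity`/`code` fields are assumed for the example; the real schema may differ.

```python
import json
from collections import Counter


def render_report(meta: dict) -> str:
    """Build the report Markdown from already-saved metadata; no conversion is re-run."""
    lines = [
        "# Conversion Report",
        "",
        "## Summary",
        f"- Source: {meta['source_pdf']}",
        f"- Pages: {meta['page_count']}",
        f"- Engine: {meta['engine']}",
        "",
        "## Warnings",
    ]
    # Group by (severity, code) and sort so the output order is stable.
    grouped = Counter((w["severity"], w["code"]) for w in meta["warnings"])
    for (severity, code), n in sorted(grouped.items()):
        lines.append(f"- {severity} / {code}: {n}")
    if not grouped:
        lines.append("- none")
    return "\n".join(lines) + "\n"


# Hypothetical metadata, as it might be loaded from the metadata JSON file.
meta = json.loads("""
{"source_pdf": "sample.pdf", "page_count": 3, "engine": "mineru",
 "warnings": [{"severity": "warn", "code": "TABLE_DEGRADED"},
              {"severity": "error", "code": "MATH_RENDER_FAIL"},
              {"severity": "warn", "code": "TABLE_DEGRADED"}]}
""")
report = render_report(meta)
```

The remaining sections (Math, Assets, Tables, Environment) would follow the same pattern, each reading only from the metadata dictionary.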