modify pdftomd
@@ -1,27 +1,29 @@
# PRD: Local PDF-to-Markdown Converter

Last updated: 2026-05-07
Last updated: 2026-05-13

## 1. Summary

Build a local-only CLI and Python library that converts math-heavy digital PDFs into Obsidian-friendly Markdown. The product prioritizes accurate LaTeX reconstruction for equations, preservation of document structure, stable asset links, and traceable page-level metadata.
Build a local-only CLI and Python library that converts math-heavy digital PDFs into Obsidian-friendly Markdown. The product prioritizes accurate LaTeX reconstruction for equations, preservation of document structure, stable asset links, and traceable page-level provenance in the human-readable report.

The first version is for personal/research use, targets NVIDIA GPU machines, and uses MinerU 3.1.0 as the fixed conversion engine. It should process digital PDFs with existing text layers first. Scanned books, cloud OCR APIs, web UI, and manual review workflows are out of scope for v1.
The first version is for personal/research use, targets NVIDIA GPU machines, and uses MinerU 3.1.0 as the fixed conversion engine. It should process digital PDFs with existing text layers first. Scanned books, cloud OCR APIs, hosted web apps, and manual review workflows are out of scope for v1. A thin local Windows desktop launcher exists as a convenience wrapper over the existing `pdf2md` CLI.

## 2. Goals

- Convert a single PDF into one Markdown file plus assets, metadata JSON, and a human-readable quality report.
- Convert a single PDF into a PDF-stem output folder containing Markdown part files, shared assets, and one human-readable quality report.
- Convert a folder of PDFs in batch mode.
- Allow the thin local Windows UI launcher to convert direct-child PDFs in a selected folder by sequentially invoking existing CLI commands.
- Preserve inline math as `$...$` and display math as `$$...$$`.
- Produce Markdown that opens cleanly in Obsidian.
- Use MinerU 3.1.0 locally.
- Keep enough metadata to diagnose formula, layout, and reading-order errors.
- Keep enough internal provenance to diagnose formula, layout, and reading-order errors through warnings and the report.
- Continue conversion automatically when a page or formula is low-confidence, while logging warnings.

## 3. Non-Goals

- No cloud OCR, cloud LLM, or third-party document upload in v1.
- No web app or GUI in v1.
- No hosted web app, manual review UI, or alternate GUI conversion pipeline in v1.
- A thin local desktop launcher is allowed only when it invokes the existing `pdf2md` CLI and preserves strict-local behavior.
- No manual review queue in v1.
- No optimization for low-quality scanned books in v1.
- No guaranteed perfect LaTeX reconstruction.
@@ -62,10 +64,9 @@ Out of scope for v1 optimization:

For each input PDF, the converter writes:

- A normalized Markdown file.
- An assets directory when MinerU extracts images or other media.
- A metadata JSON file.
- A human-readable quality report named `<stem>.report.md`.
- One or more normalized Markdown part files named `<stem>_001.md`, `<stem>_002.md`, and so on.
- A shared `images/` directory when MinerU extracts images or other media.
- A human-readable quality report named `<stem>_report.md`.
- Optional raw MinerU outputs for debugging.

Markdown rules:
@@ -74,8 +75,8 @@ Markdown rules:
- Display equations use `$$...$$` on separate lines.
- Simple tables use Markdown pipe tables.
- Complex tables may use HTML when Markdown would lose structure.
- Images use relative links to the generated assets directory.
- Visible page markers should be avoided by default; page provenance belongs in metadata.
- Images use relative links to `images/...` under the PDF output folder.
- Visible page markers should be avoided by default; grouped page conversion may use invisible HTML comments for page provenance.
- Obsidian compatibility is the output standard.

Detailed Markdown normalization rules are defined in `ARCHITECTURE.md`.
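
Review note: the simplified output layout introduced by this change (`<stem>_001.md` part files, a shared `images/` directory, and `<stem>_report.md`, all under a PDF-stem folder) can be sketched as a small path helper. This is a minimal illustration, not the project's implementation. The function name `expected_outputs` and the `part_count` parameter are hypothetical, the three-digit zero padding is inferred from the `_001` example, and the PRD does not say how a document is split into parts.

```python
from pathlib import Path


def expected_outputs(pdf_path, out_dir, part_count=1):
    """Return the paths the simplified layout is expected to contain.

    Hypothetical helper: part_count is an assumption, since the PRD
    does not specify how many part files a PDF produces.
    """
    stem = Path(pdf_path).stem
    folder = Path(out_dir) / stem  # one output folder per PDF stem
    return {
        "folder": folder,
        # Part files are numbered from 001 with three-digit padding.
        "parts": [folder / f"{stem}_{i:03d}.md" for i in range(1, part_count + 1)],
        # Shared image directory, created only when media is extracted.
        "images_dir": folder / "images",
        "report": folder / f"{stem}_report.md",
    }


layout = expected_outputs("papers/paper.pdf", "out", part_count=2)
for part in layout["parts"]:
    print(part)
print(layout["report"])
```

A helper like this could back both the CLI defaults and the acceptance checks, so the naming rules live in one place.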
@@ -96,18 +97,20 @@ pdf2md doctor
- If `INPUT` is a PDF, convert that file.
- If `INPUT` is a directory, convert PDFs in that directory.
- Directory conversion requires `--recursive` to descend into subdirectories.
- Output filenames default to the source PDF stem plus `.md`.
- Asset directories default to `<stem>.assets`.
- Output folders default to `<output>/<stem>/`.
- Markdown part filenames default to `<stem>_001.md`, `<stem>_002.md`, and so on.
- Asset directories default to `<output>/<stem>/images/`.
- Existing outputs are not overwritten unless `--overwrite` is passed.

Required `convert` options:

- `--out PATH`: output directory.
- `--metadata`: write metadata JSON. Enabled by default in v1.
- `--metadata`: accepted for compatibility; no metadata JSON is written in the simplified output layout.
- `--keep-raw`: keep raw MinerU output for debugging.
- `--recursive`: recursively process directory inputs.
- `--overwrite`: replace existing outputs.
- `--gpu DEVICE`: select CUDA device. Default: `cuda:0`.
- `--gpu DEVICE`: select CUDA device. Default: `cuda:0`; `auto` selects the visible NVIDIA GPU with the most VRAM.
- `--mineru-profile {auto,safe,performance}`: select MinerU runtime tuning. Default: `auto`.
- `--strict-local`: forbid remote network/cloud execution during conversion. Default: true.

`doctor` behavior:
@@ -115,11 +118,18 @@ Required `convert` options:
- Report Python version.
- Report `uv` availability.
- Report CUDA/PyTorch GPU availability when detectable.
- Report visible NVIDIA GPU index, VRAM, driver version, `--gpu auto` recommendation, and recommended MinerU profile.
- Report MinerU availability.
- Report local model/cache paths when detectable.
- Warn if no NVIDIA GPU is available.
- Fail if required v1 runtime dependencies are missing.

UI launcher behavior:

- The UI is a local convenience wrapper over the existing `pdf2md` CLI.
- The UI may convert a selected folder by discovering direct-child PDFs only and running one `pdf2md convert` command per PDF sequentially.
- The UI must not invoke MinerU directly, add recursive folder conversion outside the existing CLI behavior, run conversions in parallel by default, or expose remote/API options.

## 8. Python Library Requirements

The library should expose a stable API suitable for scripts and tests.
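
Review note: the UI launcher rules added in this hunk (direct-child PDFs only, one `pdf2md convert` command per PDF, run sequentially) could look roughly like the sketch below. It is an illustration under stated assumptions rather than the launcher's actual code. The function names are hypothetical, the case-insensitive `.pdf` match is an assumption, and only the `convert` subcommand and `--out` option named in this PRD are used.

```python
import subprocess
from pathlib import Path


def build_convert_commands(folder, out_dir):
    """Build one `pdf2md convert` command per direct-child PDF.

    Hypothetical helper: subdirectories are never entered, matching
    the "direct-child PDFs only" rule.
    """
    pdfs = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"  # assumption: match .PDF too
    )
    return [["pdf2md", "convert", str(p), "--out", str(out_dir)] for p in pdfs]


def run_sequentially(commands):
    """Run the commands one at a time and collect their exit codes."""
    exit_codes = []
    for cmd in commands:
        # Each call blocks until the CLI exits, so nothing runs in parallel.
        exit_codes.append(subprocess.run(cmd).returncode)
    return exit_codes
```

Keeping command construction separate from execution makes the discovery rule testable without MinerU or a GPU, and the launcher never touches MinerU directly.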
@@ -139,17 +149,17 @@ result = convert_pdf(
Required return fields:

- `markdown_path`
- `metadata_path`
- `metadata_path`, which is `None` for new simplified outputs.
- `assets_dir`
- `warnings`
- `engine`
- `pages_processed`

The public API should not expose raw MinerU objects as required return types. MinerU-specific data may be stored under optional metadata fields.
The public API should not expose raw MinerU objects as required return types. MinerU-specific data may be stored in internal report/provenance structures.

## 9. Metadata Requirements
## 9. Provenance Requirements

When `--metadata` is enabled, write `<stem>.metadata.json`.
New conversions must not write a public metadata JSON sidecar. Internal metadata-like records may still be built in memory to derive reports, warnings, counts, and `ConversionResult` fields.

Required top-level fields:

@@ -175,16 +185,16 @@ Required summary fields:

Warnings must be non-fatal unless the source file cannot be read or no output can be produced.

Detailed metadata fields, block types, and warning codes are defined in `ARCHITECTURE.md`.
Detailed internal provenance fields, block types, and warning codes are defined in `ARCHITECTURE.md`.

## 10. Quality Report Requirements

For every conversion, write `<stem>.report.md`.
For every conversion, write `<stem>/<stem>_report.md`.

The report must be readable without opening the JSON metadata and include:
The report must be readable as the primary human-facing quality artifact and include:

- Source PDF path.
- Output Markdown path.
- Output folder path and Markdown part paths.
- MinerU version.
- Page count.
- Warning count.
@@ -201,7 +211,7 @@ The product is fully automatic in v1.

- Low-confidence formulas are included in the output as best effort.
- Low-confidence pages are included in the output as best effort.
- The converter logs warnings and metadata records.
- The converter logs warnings and internal provenance records.
- Conversion uses MinerU's default local CLI execution. If MinerU cannot run or fails, the converter must emit a clear error/warning instead of silently falling back to another backend.
- Conversion fails only when the input cannot be opened, MinerU cannot run, output cannot be written, no usable output can be produced, or local-only policy is violated.

@@ -254,25 +264,22 @@ uv sync
uv run pdf2md doctor
```

MinerU/model setup may require additional scripts, for example:

```bash
uv run scripts/install-mineru.ps1
uv run scripts/install-models.py
```
MinerU/model setup requires explicit user-initiated local setup commands documented in `README.md`. Do not reference setup helper scripts unless they actually exist in the repository.

The project should document NVIDIA GPU/CUDA expectations and provide clear errors when GPU acceleration is unavailable.

The default MinerU profile must be conservative on GTX 1070 Ti 8GB and other weak or pre-Turing GPUs. Stronger profile settings are allowed only through local environment tuning on selected 16GB+ Turing-or-newer NVIDIA GPUs.

## 14. Test Requirements

Required test categories:

- Unit tests for Markdown math delimiter normalization.
- Unit tests for asset path normalization.
- Unit tests for metadata schema creation.
- Unit tests for internal metadata/provenance schema creation.
- Unit tests for warning aggregation.
- MinerU adapter contract tests with mocked outputs.
- CLI tests for single PDF, directory input, overwrite behavior, and metadata output.
- CLI tests for single PDF, directory input, overwrite behavior, and simplified output layout.

Fixture categories:

@@ -285,7 +292,7 @@ Fixture categories:
Acceptance checks:

- Markdown exists after conversion.
- Metadata exists when requested.
- No metadata JSON is written for new conversions.
- Quality report exists after conversion.
- Asset links resolve.
- Inline/display math delimiters match Obsidian expectations.
@@ -298,10 +305,10 @@ Acceptance checks:

v1 is acceptable when:

- `pdf2md convert paper.pdf --out out --metadata` works on a representative digital academic PDF.
- `pdf2md convert pdfs --out out --recursive --metadata` works on a small folder.
- `pdf2md convert paper.pdf --out out` works on a representative digital academic PDF.
- `pdf2md convert pdfs --out out --recursive` works on a small folder.
- `pdf2md doctor` reports MinerU/GPU status clearly.
- The default output opens in Obsidian with math blocks rendered.
- Metadata links pages, blocks, warnings, and assets to the source PDF.
- `<stem>.report.md` summarizes warnings, formulas, assets, and render/link check results.
- The report links pages, warnings, output parts, and assets to the source PDF.
- `<stem>/<stem>_report.md` summarizes warnings, formulas, assets, and render/link check results.
- The README or setup docs explain local-only behavior and GPU expectations.
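
Review note: the revised acceptance checks (Markdown part files exist, the quality report exists, no metadata JSON is written) lend themselves to a small automated check. The sketch below is one possible shape, not the project's test suite. The function name `check_output_folder` is hypothetical, and it assumes a stray sidecar would use the old `<stem>.metadata.json` name and sit directly in the PDF-stem output folder.

```python
import re
from pathlib import Path


def check_output_folder(folder):
    """Return a list of problems found in a PDF-stem output folder.

    Hypothetical check: an empty list means the folder matches the
    simplified layout as described in this PRD.
    """
    folder = Path(folder)
    stem = folder.name  # the folder is named after the PDF stem
    part_pattern = re.compile(re.escape(stem) + r"_\d{3,}\.md")
    names = [p.name for p in folder.iterdir() if p.is_file()]

    problems = []
    if not any(part_pattern.fullmatch(name) for name in names):
        problems.append("no Markdown part files")
    if f"{stem}_report.md" not in names:
        problems.append("missing quality report")
    # Assumption: the old sidecar name from the previous layout.
    if f"{stem}.metadata.json" in names:
        problems.append("unexpected metadata JSON sidecar")
    return problems
```

Called on `out/paper` after `pdf2md convert paper.pdf --out out`, an empty result would indicate that the first three acceptance checks pass. Asset-link and math-delimiter checks would need to parse the Markdown itself and are not covered here.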