modify pdftomd
@@ -0,0 +1,218 @@
|
||||
# Sprint 12 Contract: Minimal Windows UI Launcher
|
||||
|
||||
Status: Implemented with residual conversion-smoke risk
|
||||
Last updated: 2026-05-11
|
||||
|
||||
## Objective
|
||||
|
||||
Build a minimal Windows desktop launcher for the existing `pdf2md` CLI and package the launcher itself as `dist/pdf2md-ui.exe`.
|
||||
|
||||
The UI must remain a thin local launcher. It must not become a second conversion engine, a hosted app, a manual review workflow, or a bundled redistribution of MinerU, CUDA PyTorch, model weights, Node.js, or MathJax.
|
||||
|
||||
## Research Basis
|
||||
|
||||
- Primary research document: `docs/UI_RESEARCH.md`.
|
||||
- The recommended implementation path is `tkinter`/`ttk`, a subprocess runner around `pdf2md` or `uv run pdf2md`, and PyInstaller for the Windows executable.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
- `pdf2md doctor`, `pdf2md convert`, and `pdf2md recheck` are implemented.
|
||||
- Conversion remains strict-local and MinerU-only.
|
||||
- Current CLI output is coarse during MinerU execution because the adapter captures MinerU subprocess output internally.
|
||||
- UI research is complete.
|
||||
- UI implementation exists under `src/pdf2md_ui/`.
|
||||
- `dist\pdf2md-ui.exe` can be built with PyInstaller.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed during implementation:
|
||||
|
||||
- `src/pdf2md_ui/__init__.py`
|
||||
- `src/pdf2md_ui/app.py`
|
||||
- `src/pdf2md_ui/runner.py`
|
||||
- `tests/test_ui_runner.py`
|
||||
- `pyproject.toml`
|
||||
- `uv.lock`
|
||||
- `README.md`
|
||||
- `PLAN.md`
|
||||
- `PROGRESS.md`
|
||||
- `docs/WORKARCHIVE.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md`
|
||||
|
||||
Generated but not committed unless explicitly requested:
|
||||
|
||||
- `build/`
|
||||
- `dist/`
|
||||
- `*.spec`
|
||||
- generated conversion outputs under `outputs/`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- Runtime document upload paths.
|
||||
- Remote OCR, hosted LLM/VLM, hosted renderers, or remote document parsing APIs.
|
||||
- `--api-url`, router mode, HTTP client backends, remote OpenAI-compatible endpoints, or runtime engine selection.
|
||||
- Direct UI calls to `mineru`; the UI must call the project-owned `pdf2md` CLI.
|
||||
- Bundling MinerU, CUDA PyTorch, local model weights, Node.js, or MathJax into the first UI executable.
|
||||
- Batch queues, drag/drop, PDF preview, Markdown preview, Obsidian automation, installer generation, or code signing in this sprint.
|
||||
- Mandatory default tests that require real MinerU, GPU, model files, network, Obsidian, or `samples/`.
|
||||
|
||||
## Product Behavior
|
||||
|
||||
The first UI is a single-window launcher:
|
||||
|
||||
- Select one input PDF.
|
||||
- Select an output root, defaulting to `outputs`; the current CLI creates the final `<stem>\` folder inside it.
|
||||
- Configure only existing CLI options:
|
||||
- overwrite
|
||||
- keep raw output
|
||||
- optional grouped-page output (`--chunk-pages`) with a default group size of `20`
|
||||
- GPU device with default `cuda:0`, including `auto` when supported by the CLI
|
||||
- MinerU profile `auto|safe|performance` with default `auto`
|
||||
- Run `Doctor`.
|
||||
- Run `Convert`.
|
||||
- Run `Recheck` for an existing Markdown output.
|
||||
- Cancel a running subprocess.
|
||||
- Open the output directory after completion.
|
||||
- Show a read-only log and indeterminate progress while a command is running.
|
||||
|
||||
Command resolution:
|
||||
|
||||
1. Use a configured command if present.
|
||||
2. Else use `pdf2md` from `PATH`.
|
||||
3. Else use `uv run pdf2md` from a configured project root containing `pyproject.toml`.
|
||||
4. Else report a setup error and direct the user to run `pdf2md doctor`.
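The resolution order above can be sketched as a small pure function. This is a minimal illustration, not the code in `src/pdf2md_ui/runner.py`: `resolve_command`, `SetupError`, and the injectable `which` parameter are hypothetical names, and returning a working directory for the `uv run` case is an assumption about how the project root would be applied.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional, Sequence, Tuple


class SetupError(RuntimeError):
    """Hypothetical error raised when no usable pdf2md command is found."""


def resolve_command(
    configured: Optional[Sequence[str]] = None,
    project_root: Optional[Path] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Tuple[List[str], Optional[Path]]:
    """Return (argv prefix, working directory) following the four-step order."""
    # 1. A configured command wins when present.
    if configured:
        return list(configured), None
    # 2. Otherwise use pdf2md from PATH.
    found = which("pdf2md")
    if found:
        return [found], None
    # 3. Otherwise use `uv run pdf2md` from a project root with pyproject.toml.
    if project_root is not None and (project_root / "pyproject.toml").is_file():
        return ["uv", "run", "pdf2md"], project_root
    # 4. Otherwise report a setup error that points at the doctor command.
    raise SetupError("pdf2md was not found; run `pdf2md doctor` to check the setup")
```

Passing `which` as a parameter keeps the function testable without a real `pdf2md` on `PATH`.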
|
||||
|
||||
## Architecture Plan
|
||||
|
||||
### WP12.1: CLI Runner
|
||||
|
||||
Actions:
|
||||
|
||||
- Add a runner module that builds fixed argument lists for `doctor`, `convert`, and `recheck`.
|
||||
- Use `subprocess.Popen` with `shell=False`.
|
||||
- Set `MINERU_MODEL_SOURCE=local` in the child environment unless already set.
|
||||
- Merge stderr into stdout for a single UI log stream.
|
||||
- Read subprocess output on a worker thread and report status events to the UI.
|
||||
- Add a Windows process-tree cancellation helper that uses `taskkill /pid <pid> /t /f` only after normal termination does not finish promptly.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Testable command-construction and process-management code that never accepts arbitrary shell text from the UI.
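A rough sketch of the runner shape described in WP12.1, under stated assumptions: the function names are hypothetical, only CLI flags that appear in these contracts (`--out`, `--overwrite`, `--chunk-pages`, `--mineru-profile`) are used, and the five-second grace period is an illustrative value rather than the project's setting.

```python
import os
import subprocess
import sys
from typing import Dict, List, Mapping, Optional, Sequence

PROFILES = ("auto", "safe", "performance")


def build_convert_args(
    base: Sequence[str],
    pdf: str,
    out_dir: str,
    *,
    overwrite: bool = False,
    chunk_pages: Optional[int] = None,
    mineru_profile: str = "auto",
) -> List[str]:
    """Build a fixed argument list; no shell text is ever accepted."""
    if mineru_profile not in PROFILES:
        raise ValueError(f"unsupported profile: {mineru_profile!r}")
    args = [*base, "convert", pdf, "--out", out_dir]
    if overwrite:
        args.append("--overwrite")
    if chunk_pages is not None:
        args += ["--chunk-pages", str(int(chunk_pages))]
    args += ["--mineru-profile", mineru_profile]
    return args


def child_env(parent: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and default MINERU_MODEL_SOURCE to local."""
    env = dict(os.environ if parent is None else parent)
    env.setdefault("MINERU_MODEL_SOURCE", "local")
    return env


def start(args: Sequence[str]) -> subprocess.Popen:
    """Start the CLI with shell=False and stderr merged into stdout."""
    return subprocess.Popen(
        list(args),
        shell=False,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        env=child_env(),
    )


def cancel(proc: subprocess.Popen, grace_seconds: float = 5.0) -> None:
    """Try normal termination first; escalate only if it does not finish."""
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        if sys.platform == "win32":
            # Kill the whole process tree on Windows.
            subprocess.run(["taskkill", "/pid", str(proc.pid), "/t", "/f"], check=False)
        else:
            proc.kill()
```

Because the argument list is built from typed values only, the UI cannot inject arbitrary shell text into the command.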
|
||||
|
||||
### WP12.2: Minimal Tk UI
|
||||
|
||||
Actions:
|
||||
|
||||
- Add a `tkinter`/`ttk` app with file and directory pickers, option controls, command buttons, progress indicator, and log pane.
|
||||
- Keep long-running work off Tk's event handler thread.
|
||||
- Disable conflicting controls while a command is running.
|
||||
- Surface non-zero exit codes clearly.
|
||||
|
||||
Expected output:
|
||||
|
||||
- A simple local GUI for existing CLI workflows.
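One common way to keep long-running work off the Tk thread is a worker thread that posts events to a `queue.Queue`, which the UI drains on a timer. The sketch below shows only that hand-off and does not import `tkinter`; `run_in_worker`, `drain`, and the event tuple shape are hypothetical names chosen for illustration.

```python
import queue
import threading
from typing import Callable, Iterable, List, Tuple

Event = Tuple[str, object]  # ("line", text) or ("exit", return code)


def run_in_worker(
    produce_lines: Callable[[], Iterable[str]],
    events: "queue.Queue[Event]",
) -> threading.Thread:
    """Run produce_lines() on a worker thread and post events to the queue."""

    def work() -> None:
        code = 0
        try:
            for line in produce_lines():
                events.put(("line", line))
        except Exception as exc:  # surface failures in the log instead of hiding them
            events.put(("line", f"error: {exc}"))
            code = 1
        events.put(("exit", code))

    thread = threading.Thread(target=work, daemon=True)
    thread.start()
    return thread


def drain(events: "queue.Queue[Event]", limit: int = 100) -> List[Event]:
    """Collect pending events without blocking; meant for the UI thread."""
    batch: List[Event] = []
    for _ in range(limit):
        try:
            batch.append(events.get_nowait())
        except queue.Empty:
            break
    return batch
```

In a Tk app, a callback scheduled with `root.after(100, poll)` would call `drain()`, append lines to the log pane, and re-enable controls when an `("exit", code)` event arrives.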
|
||||
|
||||
### WP12.3: Build
|
||||
|
||||
Actions:
|
||||
|
||||
- Add PyInstaller only to a build dependency group such as `ui-build`.
|
||||
- Build the executable with:
|
||||
|
||||
```powershell
uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py
```
|
||||
|
||||
Expected output:
|
||||
|
||||
- `dist\pdf2md-ui.exe` exists after the build.
|
||||
|
||||
## Verification Checks
|
||||
|
||||
Default checks:
|
||||
|
||||
- `uv run pytest tests/test_ui_runner.py`
|
||||
- `uv run pytest tests/test_cli.py` if shared CLI behavior changes
|
||||
- `git diff --check`
|
||||
- `git status --short --untracked-files=all`
|
||||
|
||||
Build check:
|
||||
|
||||
```powershell
uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py
Test-Path dist\pdf2md-ui.exe
```
|
||||
|
||||
Manual smoke:
|
||||
|
||||
1. Launch `dist\pdf2md-ui.exe`.
|
||||
2. Run Doctor from the UI.
|
||||
3. Convert one small local sample into an ignored `outputs/` directory.
|
||||
4. Confirm Markdown, report Markdown, and assets are produced as expected for the active output layout.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- The UI invokes `pdf2md` or `uv run pdf2md`; it never invokes `mineru` directly.
|
||||
- Commands are fixed argument lists and run with `shell=False`.
|
||||
- The UI remains responsive while a conversion is running.
|
||||
- Cancel attempts to stop the process tree on Windows.
|
||||
- Doctor and conversion exit codes are visible in the UI.
|
||||
- PyInstaller produces `dist\pdf2md-ui.exe`.
|
||||
- Default tests stay independent of real MinerU, GPU, model files, network, Obsidian, and `samples/`.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
- UI code exposes arbitrary shell command execution.
|
||||
- UI exposes remote/API options or weakens strict-local policy.
|
||||
- UI claims conversion success without checking the CLI exit code.
|
||||
- UI freezes during a long conversion because the CLI runs on Tk's event handler thread.
|
||||
- The first UI executable bundles MinerU, CUDA PyTorch, model weights, Node.js, or MathJax.
|
||||
- Build outputs, generated conversion outputs, local models, or sample PDFs are committed.
|
||||
|
||||
## Handoff Requirements
|
||||
|
||||
After implementation:
|
||||
|
||||
- Update `PROGRESS.md` with files changed, commands run, test outcomes, build outcome, known failures, residual risks, and next action.
|
||||
- Move completed implementation details to `docs/WORKARCHIVE.md` after verification.
|
||||
- Keep sample PDFs and generated outputs out of the commit.
|
||||
|
||||
## Implementation Handoff
|
||||
|
||||
Files changed:
|
||||
|
||||
- `src/pdf2md_ui/__init__.py`
|
||||
- `src/pdf2md_ui/app.py`
|
||||
- `src/pdf2md_ui/runner.py`
|
||||
- `tests/test_ui_runner.py`
|
||||
- `pyproject.toml`
|
||||
- `uv.lock`
|
||||
- `README.md`
|
||||
- `PLAN.md`
|
||||
- `PROGRESS.md`
|
||||
- `docs/WORKARCHIVE.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md`
|
||||
|
||||
Verification:
|
||||
|
||||
- `uv run pytest tests\test_ui_runner.py`: passed 16 tests.
|
||||
- `uv run pytest`: passed 188 tests with 1 optional skip.
|
||||
- `uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py`: passed.
|
||||
- `Test-Path dist\pdf2md-ui.exe`: returned `True`.
|
||||
- `uv run pdf2md doctor`: returned WARN only for the documented GTX 1070 Ti/Pascal compatibility risk.
|
||||
- Launch smoke for `dist\pdf2md-ui.exe`: process started and was then terminated by the smoke script.
|
||||
|
||||
Follow-up refresh on 2026-05-12:
|
||||
|
||||
- Updated the UI command builder and form controls for the Sprint 15 `--mineru-profile auto|safe|performance` CLI option.
|
||||
- Rebuilt `dist\pdf2md-ui.exe` after Sprint 16 simplified output layout and Sprint 15 profile changes.
|
||||
- `uv run pytest tests\test_ui_runner.py`: passed 17 tests.
|
||||
- Launch smoke for the rebuilt `dist\pdf2md-ui.exe`: process started and was then terminated by the smoke script.
|
||||
|
||||
Known failure:
|
||||
|
||||
- A CLI conversion smoke using `samples\FourNodeQuadrilateralShellElementMITC4.pdf` and the same command shape used by the UI did not finish within the 15-minute timeout. The spawned process tree was terminated with `taskkill`.
|
||||
|
||||
Residual risk:
|
||||
|
||||
- A hands-on UI Doctor click and UI conversion click should still be run when the local MinerU runtime is expected to complete within an acceptable time.
|
||||
@@ -0,0 +1,292 @@
|
||||
# Sprint 13 Contract: Text Layer Fidelity Diagnostics
|
||||
|
||||
Status: Implemented
|
||||
Last updated: 2026-05-11
|
||||
|
||||
## Objective
|
||||
|
||||
Add a local pypdf-based text fidelity diagnostic pass that compares source PDF text-layer extraction with MinerU-generated Markdown text on a per-page basis where page mapping is available.
|
||||
|
||||
The first priority is diagnosis, not automatic body-text replacement. This sprint should record enough evidence in metadata JSON and `<stem>.report.md` to identify pages where MinerU likely misrecognized Korean body text, especially missing Hangul syllables, unexpected CJK ideographs, and abnormal spacing. It may mark pypdf text as a future replacement candidate, but it must not replace Markdown body text in this sprint.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
- MinerU 3.1.0 remains the only conversion engine.
|
||||
- Conversion runs through direct local `mineru` CLI execution only.
|
||||
- `pypdf` is already used by the project for local PDF chunk planning.
|
||||
- `pdf2md convert` writes Markdown, metadata JSON, and `<stem>.report.md`.
|
||||
- `pdf2md recheck` can regenerate metadata/report from an existing Markdown file.
|
||||
- Chunked conversion records original source page ranges in metadata `engine_options.chunk`.
|
||||
- The 2007 Korean shell-structure sample showed clear text fidelity problems:
|
||||
- pypdf can extract more accurate Hangul from the digital text layer.
|
||||
- MinerU Markdown can omit Hangul syllables or misrecognize headings/body text as unrelated CJK characters.
|
||||
- The source text layer itself can contain abnormal spacing between Hangul syllables.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed during implementation:
|
||||
|
||||
- `src/pdf2md/text_fidelity.py`
|
||||
- `src/pdf2md/ir.py`
|
||||
- `src/pdf2md/metadata.py`
|
||||
- `src/pdf2md/report.py`
|
||||
- `src/pdf2md/conversion.py`
|
||||
- `tests/test_text_fidelity.py`
|
||||
- `tests/test_metadata.py`
|
||||
- `tests/test_report.py`
|
||||
- `tests/test_conversion.py`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md`
|
||||
- `PLAN.md`
|
||||
- `PROGRESS.md`
|
||||
- `docs/WORKARCHIVE.md` after completion
|
||||
|
||||
Allowed only if needed for CLI/API wiring:
|
||||
|
||||
- `src/pdf2md/cli.py`
|
||||
- `tests/test_cli.py`
|
||||
- `README.md`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- Replacing Markdown body text with pypdf text in this sprint.
|
||||
- Adding a second conversion engine or engine selector.
|
||||
- Adding remote OCR, hosted LLM/VLM, remote document parsing, `--api-url`, router mode, HTTP client backends, or remote OpenAI-compatible endpoints.
|
||||
- Mandatory default tests that require real MinerU, GPU, model files, network, Obsidian, or committed `samples/`.
|
||||
- Committing sample PDFs or generated `outputs/`.
|
||||
|
||||
## Product Behavior
|
||||
|
||||
Text fidelity diagnostics should run automatically after MinerU Markdown normalization and local quality checks have produced the final Markdown candidate.
|
||||
|
||||
For each page that can be compared, metadata should record a compact diagnostic object with at least:
|
||||
|
||||
- `page_index`: zero-based output page index.
|
||||
- `source_page_number`: one-based original PDF page number when known.
|
||||
- `pypdf_text_available`: whether pypdf extracted non-empty source text.
|
||||
- `markdown_text_available`: whether comparable Markdown text exists for the page.
|
||||
- `pypdf_hangul_count`: Hangul syllable count from pypdf text.
|
||||
- `markdown_hangul_count`: Hangul syllable count from Markdown text.
|
||||
- `hangul_count_delta`: `markdown_hangul_count - pypdf_hangul_count`.
|
||||
- `hangul_count_ratio`: Markdown Hangul count divided by pypdf Hangul count, or `null` when unavailable.
|
||||
- `unexpected_cjk_count`: count of CJK Unified Ideographs in Markdown that are suspicious in a page with Korean source text.
|
||||
- `pypdf_hangul_spacing_anomaly_ratio`: ratio of Hangul-to-Hangul whitespace breaks in pypdf text.
|
||||
- `markdown_hangul_spacing_anomaly_ratio`: ratio of Hangul-to-Hangul whitespace breaks in Markdown text.
|
||||
- `text_similarity`: normalized text similarity between pypdf text and Markdown text.
|
||||
- `replacement_candidate`: `true` only when pypdf text appears more reliable than Markdown text under conservative thresholds.
|
||||
- `comparison_status`: one of `checked`, `source_text_missing`, `markdown_page_unavailable`, or `page_mapping_uncertain`.
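The count and ratio fields above could be computed with helpers along these lines. This is a sketch under assumptions: the function names are hypothetical, "suspicious CJK" is approximated by the basic CJK Unified Ideographs block only, and the spacing anomaly ratio is interpreted as the share of adjacent Hangul pairs that are separated by whitespace, which the contract does not define precisely.

```python
from typing import Optional


def is_hangul_syllable(ch: str) -> bool:
    # Hangul Syllables block: U+AC00 through U+D7A3.
    return "\uac00" <= ch <= "\ud7a3"


def is_cjk_ideograph(ch: str) -> bool:
    # Basic CJK Unified Ideographs block: U+4E00 through U+9FFF.
    return "\u4e00" <= ch <= "\u9fff"


def count_hangul(text: str) -> int:
    return sum(1 for ch in text if is_hangul_syllable(ch))


def count_cjk(text: str) -> int:
    return sum(1 for ch in text if is_cjk_ideograph(ch))


def hangul_count_ratio(markdown_count: int, pypdf_count: int) -> Optional[float]:
    """Markdown Hangul count divided by pypdf Hangul count, or None."""
    return markdown_count / pypdf_count if pypdf_count else None


def hangul_spacing_anomaly_ratio(text: str) -> Optional[float]:
    """Share of adjacent Hangul pairs that are split by whitespace."""
    joined = broken = 0
    prev_hangul = False
    gap = False
    for ch in text:
        if is_hangul_syllable(ch):
            if prev_hangul:
                if gap:
                    broken += 1
                else:
                    joined += 1
            prev_hangul, gap = True, False
        elif ch.isspace():
            gap = True
        else:
            # Any other character ends the current Hangul run.
            prev_hangul, gap = False, False
    total = joined + broken
    return broken / total if total else None
```

Ordinary Korean prose has word spaces, so this ratio is rarely zero; only values well above the normal range would point at a damaged text layer.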
|
||||
|
||||
Metadata summary should include:
|
||||
|
||||
- `text_fidelity_checked_page_count`.
|
||||
- `text_fidelity_low_page_count`.
|
||||
- `text_fidelity_unexpected_cjk_count`.
|
||||
- `text_fidelity_replacement_candidate_page_count`.
|
||||
- `text_fidelity_page_mapping_uncertain_count`.
|
||||
|
||||
Report Markdown should add a dedicated `## Text Fidelity` section showing:
|
||||
|
||||
- checked page count and low-fidelity page count.
|
||||
- total unexpected CJK count.
|
||||
- replacement candidate page count.
|
||||
- pages with low similarity.
|
||||
- pages with high unexpected CJK count.
|
||||
- pages where page-level comparison could not be trusted.
|
||||
|
||||
Warning behavior:
|
||||
|
||||
- Add `TEXT_LAYER_AVAILABLE` as an info warning when pypdf source text is available and diagnostics run.
|
||||
- Add `TEXT_FIDELITY_LOW` as a warning for pages below the fidelity threshold.
|
||||
- Add `UNEXPECTED_CJK_IN_KOREAN_TEXT` as a warning when suspicious CJK ideographs appear in Markdown for pages with Korean source text.
|
||||
- Add `HANGUL_SPACING_SUSPECT` as an info or warning-level signal when pypdf or Markdown text has a high Hangul spacing anomaly ratio.
|
||||
- Add `TEXT_PAGE_MAPPING_UNCERTAIN` as an info warning when page-level Markdown mapping is not reliable enough for per-page metrics.
|
||||
|
||||
Replacement candidate policy:
|
||||
|
||||
- `replacement_candidate` is a diagnostic marker only.
|
||||
- It must not change Markdown output.
|
||||
- It should be `true` only when:
|
||||
- pypdf source text is available,
|
||||
- pypdf Hangul count is materially higher than Markdown Hangul count or Markdown has suspicious CJK ideographs,
|
||||
- pypdf spacing anomalies are not so severe that the source text layer is clearly unusable,
|
||||
- page mapping is `checked`.
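The policy conditions can be read as a single predicate. The sketch below is illustrative only: `is_replacement_candidate` is a hypothetical name, and both thresholds (`min_hangul_gap_ratio`, `max_source_spacing_ratio`) are placeholder values, since the contract says "conservative thresholds" without fixing numbers.

```python
from typing import Optional


def is_replacement_candidate(
    *,
    comparison_status: str,
    pypdf_text_available: bool,
    pypdf_hangul_count: int,
    markdown_hangul_count: int,
    unexpected_cjk_count: int,
    pypdf_spacing_anomaly_ratio: Optional[float],
    min_hangul_gap_ratio: float = 0.1,      # assumed threshold
    max_source_spacing_ratio: float = 0.6,  # assumed threshold
) -> bool:
    """Diagnostic marker only; it never changes Markdown output."""
    # Page mapping must be credible and source text must exist.
    if comparison_status != "checked" or not pypdf_text_available:
        return False
    # A source text layer with severe spacing damage is not trusted.
    if (
        pypdf_spacing_anomaly_ratio is not None
        and pypdf_spacing_anomaly_ratio > max_source_spacing_ratio
    ):
        return False
    missing = pypdf_hangul_count - markdown_hangul_count
    materially_higher = (
        pypdf_hangul_count > 0
        and missing / pypdf_hangul_count >= min_hangul_gap_ratio
    )
    return materially_higher or unexpected_cjk_count > 0
```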
|
||||
|
||||
## Architecture Plan
|
||||
|
||||
### WP13.1: Text Fidelity Module
|
||||
|
||||
Actions:
|
||||
|
||||
- Add `src/pdf2md/text_fidelity.py`.
|
||||
- Use `pypdf.PdfReader` to extract source page text locally.
|
||||
- Define immutable result records for per-page metrics and summary metrics.
|
||||
- Strip Markdown syntax, image links, fenced code, inline code, and math spans before text comparison.
|
||||
- Normalize text for comparison without mutating the output Markdown:
|
||||
- Unicode NFKC normalization for comparison strings only.
|
||||
- collapse whitespace for similarity only.
|
||||
- keep raw-count metrics independent enough to expose spacing anomalies.
|
||||
- Count Hangul syllables using the Unicode Hangul Syllables block (U+AC00 through U+D7A3).
|
||||
- Count suspicious CJK ideographs with CJK Unified Ideograph ranges, excluding Hangul ranges.
|
||||
- Compute similarity with a deterministic standard-library algorithm such as `difflib.SequenceMatcher`.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Pure local helper functions that are independently testable and do not call MinerU, network services, or the filesystem except for reading the source PDF.
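A minimal sketch of the comparison path in WP13.1, using only the standard library. The regular expressions are a deliberately simple approximation of Markdown stripping, not a full parser, and `strip_markdown`, `normalize_for_comparison`, and `text_similarity` are hypothetical names rather than the module's actual API.

```python
import difflib
import re
import unicodedata

# Order matters: fenced code before inline code, display math before inline math.
_PATTERNS = [
    (re.compile(r"```.*?```", re.S), " "),          # fenced code
    (re.compile(r"\$\$.*?\$\$", re.S), " "),        # display math
    (re.compile(r"`[^`\n]*`"), " "),                # inline code
    (re.compile(r"\$[^$\n]+\$"), " "),              # inline math
    (re.compile(r"!\[[^\]]*\]\([^)]*\)"), " "),     # image links
    (re.compile(r"\[([^\]]*)\]\([^)]*\)"), r"\1"),  # links keep their text
    (re.compile(r"^[ \t]{0,3}(?:#{1,6}|[-*+]|>)[ \t]+", re.M), ""),  # block markers
    (re.compile(r"[*_]{1,3}"), ""),                 # emphasis markers
]


def strip_markdown(markdown: str) -> str:
    """Remove Markdown syntax, code, images, and math spans for comparison."""
    text = markdown
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


def normalize_for_comparison(text: str) -> str:
    """NFKC-normalize and collapse whitespace; the input is never mutated."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def text_similarity(source_text: str, markdown: str) -> float:
    """Deterministic similarity in [0, 1] between source text and Markdown."""
    a = normalize_for_comparison(source_text)
    b = normalize_for_comparison(strip_markdown(markdown))
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
```

Setting `autojunk=False` avoids `SequenceMatcher`'s popularity heuristic, which can otherwise lower scores on long pages with frequently repeated characters.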
|
||||
|
||||
### WP13.2: Page Mapping Boundary
|
||||
|
||||
Actions:
|
||||
|
||||
- Derive source page numbers from `engine_options.chunk` when chunking is active.
|
||||
- Use project page records and any reliable raw structured page count to decide whether page-level comparison is possible.
|
||||
- If Markdown cannot be mapped to pages reliably, produce `TEXT_PAGE_MAPPING_UNCERTAIN` and avoid pretending per-page metrics are exact.
|
||||
- For the initial implementation, allow a conservative fallback for single-page mocked outputs and chunk outputs where one Markdown file corresponds to a known source page range.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Page-level diagnostics are only marked `checked` when the mapping is credible.
|
||||
- Ambiguous cases are visible in metadata/report instead of producing misleading page metrics.
|
||||
|
||||
### WP13.3: Metadata And Warning Integration
|
||||
|
||||
Actions:
|
||||
|
||||
- Add warning codes in `src/pdf2md/ir.py`.
|
||||
- Add text fidelity fields to metadata without changing existing top-level fields used by current tests.
|
||||
- Extend `build_summary()` to include text fidelity summary counts when diagnostics are present.
|
||||
- Ensure warnings retain `page_index` where available.
|
||||
- Preserve JSON serializability and deterministic key ordering on write.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Metadata contains compact page-level text fidelity diagnostics and summary counts.
|
||||
- Existing metadata consumers remain compatible.
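As a rough illustration of the summary counts and deterministic write described in WP13.3: the summary keys below come from this contract, while `summarize_text_fidelity`, `dumps_metadata`, and the `low_threshold` value are assumptions made for the sketch and are not the project's `build_summary()`.

```python
import json
from typing import Any, Dict, Iterable, List


def summarize_text_fidelity(
    pages: Iterable[Dict[str, Any]],
    low_threshold: float = 0.8,  # assumed fidelity threshold
) -> Dict[str, int]:
    """Aggregate per-page diagnostics into the summary counts."""
    all_pages: List[Dict[str, Any]] = list(pages)
    checked = [p for p in all_pages if p["comparison_status"] == "checked"]
    return {
        "text_fidelity_checked_page_count": len(checked),
        "text_fidelity_low_page_count": sum(
            1
            for p in checked
            if p["text_similarity"] is not None
            and p["text_similarity"] < low_threshold
        ),
        "text_fidelity_unexpected_cjk_count": sum(
            p["unexpected_cjk_count"] for p in checked
        ),
        "text_fidelity_replacement_candidate_page_count": sum(
            1 for p in checked if p["replacement_candidate"]
        ),
        "text_fidelity_page_mapping_uncertain_count": sum(
            1 for p in all_pages
            if p["comparison_status"] == "page_mapping_uncertain"
        ),
    }


def dumps_metadata(metadata: Dict[str, Any]) -> str:
    """Serialize with sorted keys so repeated writes are byte-identical."""
    return json.dumps(metadata, ensure_ascii=False, sort_keys=True, indent=2) + "\n"
```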
|
||||
|
||||
### WP13.4: Report Integration
|
||||
|
||||
Actions:
|
||||
|
||||
- Extend `render_report()` to render a `## Text Fidelity` section when diagnostics exist.
|
||||
- Keep the report derived from metadata and quality results.
|
||||
- Include low-fidelity pages and replacement candidate pages in human-readable form.
|
||||
- Do not include full extracted page text in the report.
|
||||
|
||||
Expected output:
|
||||
|
||||
- A human can identify which pages need attention without opening metadata JSON first.
|
||||
|
||||
### WP13.5: Conversion And Recheck Integration
|
||||
|
||||
Actions:
|
||||
|
||||
- Run text fidelity diagnostics during `convert` after final Markdown preparation and before metadata/report writing.
|
||||
- Run the same diagnostics during `recheck` when the original source PDF path still exists.
|
||||
- If the source PDF is missing during `recheck`, preserve existing behavior and add a clear nonfatal warning or omit diagnostics.
|
||||
- Keep chunked conversion page ranges tied to original source page numbers.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Fresh conversions and rechecks can produce text fidelity diagnostics without rerunning MinerU.
|
||||
|
||||
### WP13.6: Tests
|
||||
|
||||
Default fast tests:
|
||||
|
||||
- pypdf extraction boundary handles generated local PDFs without requiring real MinerU or sample files.
|
||||
- Hangul count, unexpected CJK count, and spacing anomaly ratio helpers use direct Korean/CJK strings.
|
||||
- Markdown text stripping ignores math, image links, fenced code, and inline code.
|
||||
- Similarity score is deterministic for equivalent and degraded text.
|
||||
- Metadata contains text fidelity summary fields when diagnostics are present.
|
||||
- Report contains `## Text Fidelity` and page-level warning summaries.
|
||||
- Conversion with a fake adapter records `TEXT_FIDELITY_LOW` when Markdown omits Hangul from a source-text PDF.
|
||||
- Recheck reruns diagnostics when source PDF exists.
|
||||
- Missing source PDF during recheck remains nonfatal.
|
||||
|
||||
Optional local validation:
|
||||
|
||||
- Convert the local 2007 Korean shell-structure sample with chunking to ignored `outputs\`.
|
||||
- Confirm the report flags the pages where the previous output had missing Hangul and unexpected CJK characters.
|
||||
- Do not commit sample PDFs or generated outputs.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- Default tests pass without real MinerU, GPU, model files, network, Obsidian, or `samples/`.
|
||||
- Diagnostics are local-only and use pypdf source text only from the local PDF.
|
||||
- Metadata JSON records page-level text fidelity metrics where page mapping is credible.
|
||||
- Metadata summary records aggregate text fidelity counts.
|
||||
- `<stem>.report.md` includes a text fidelity section when diagnostics exist.
|
||||
- Suspicious Korean text loss produces structured warnings with page provenance where available.
|
||||
- Replacement candidate markers are recorded only as diagnostics and do not alter Markdown content.
|
||||
- Existing math, asset, table, chunk, strict-local, and UI behavior remains unchanged.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
- Markdown body text is replaced automatically in this sprint.
|
||||
- Page-level metrics are reported as exact when page mapping is uncertain.
|
||||
- Diagnostics upload PDFs, page text, Markdown, or extracted text to any remote service.
|
||||
- Default tests require MinerU, CUDA/GPU, model files, network, Obsidian, or `samples/`.
|
||||
- Existing output schema fields are removed or renamed.
|
||||
- `samples/`, generated `outputs/`, or `dist/pdf2md-ui.exe` are committed.
|
||||
|
||||
## Verification Commands
|
||||
|
||||
```powershell
uv run pytest tests/test_text_fidelity.py tests/test_metadata.py tests/test_report.py tests/test_conversion.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```
|
||||
|
||||
Optional local validation:
|
||||
|
||||
```powershell
$env:MINERU_MODEL_SOURCE='local'
$pdf = (Get-ChildItem samples -Filter '2007*.pdf' | Select-Object -First 1).FullName
uv run pdf2md convert $pdf --out outputs\sprint13-2007-text-fidelity --overwrite --chunk-pages 5
```
|
||||
|
||||
## Handoff Requirements
|
||||
|
||||
After implementation:
|
||||
|
||||
- Update `PROGRESS.md` with files changed, commands run, test outcomes, optional sample validation outcome, known failures, residual risks, and next action.
|
||||
- Archive completed implementation details in `docs/WORKARCHIVE.md` after verification.
|
||||
- Keep sample PDFs, generated outputs, and build artifacts out of the commit.
|
||||
- Record whether page-level mapping was exact, approximate, or unavailable for the validated sample.
|
||||
|
||||
## Implementation Handoff
|
||||
|
||||
Files changed:
|
||||
|
||||
- `src/pdf2md/text_fidelity.py`
|
||||
- `src/pdf2md/ir.py`
|
||||
- `src/pdf2md/metadata.py`
|
||||
- `src/pdf2md/report.py`
|
||||
- `src/pdf2md/conversion.py`
|
||||
- `tests/test_text_fidelity.py`
|
||||
- `tests/test_metadata.py`
|
||||
- `tests/test_report.py`
|
||||
- `tests/test_conversion.py`
|
||||
- `ARCHITECTURE.md`
|
||||
- `PLAN.md`
|
||||
- `PROGRESS.md`
|
||||
- `docs/WORKARCHIVE.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md`
|
||||
|
||||
Verification:
|
||||
|
||||
- `uv run pytest tests/test_text_fidelity.py tests/test_metadata.py tests/test_report.py tests/test_conversion.py`: passed 49 tests.
|
||||
- `uv run pytest`: passed 198 tests with 1 optional skip.
|
||||
|
||||
Known failures:
|
||||
|
||||
- None in the default fast test suite.
|
||||
|
||||
Residual risks:
|
||||
|
||||
- Page-level Markdown mapping is only scored when credible. Multi-page Markdown without reliable page boundaries is reported as `TEXT_PAGE_MAPPING_UNCERTAIN` rather than guessed.
|
||||
- Automatic body-text replacement remains out of scope and is not implemented.
|
||||
- Optional real MinerU validation on the local 2007 Korean shell-structure sample was not run during implementation to avoid a long GPU conversion.
|
||||
|
||||
## Future Sprint Boundary
|
||||
|
||||
A later sprint may implement controlled body-text replacement from pypdf text after Sprint 13 diagnostics show reliable thresholds. That future sprint must have its own contract and must preserve math, tables, figures, asset links, and Markdown structure from MinerU unless explicitly redesigned.
|
||||
@@ -0,0 +1,378 @@
|
||||
# Sprint 14 Contract: Single-Page Conversion With Grouped Outputs
|
||||
|
||||
Status: Implemented
|
||||
Last updated: 2026-05-11
|
||||
|
||||
## Objective
|
||||
|
||||
Replace the current fixed-size pre-conversion chunking behavior with a safer long-PDF workflow:
|
||||
|
||||
1. When chunk mode is active, split the source PDF into one-page temporary PDFs.
|
||||
2. Convert each one-page PDF sequentially through the existing local MinerU CLI adapter.
|
||||
3. Merge successfully converted page Markdown into grouped output files, one file per configured output group of pages.
|
||||
4. Keep the default output group size at 20 pages when `--chunk-pages` is supplied without a value.
|
||||
|
||||
This sprint is motivated by local evidence from `samples/2007쉘구조물의유한요소해석에대하여.pdf`: a 5-page MinerU input chunk stalled on a GTX 1070 Ti (8 GB), while one-page conversion completed all 13 pages.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
- MinerU 3.1.0 remains the only conversion engine.
|
||||
- Conversion runs through direct local `mineru` CLI execution only.
|
||||
- Strict-local allows only the direct CLI and MinerU CLI-internal temporary local `mineru-api`; remote API/backend paths remain prohibited.
|
||||
- `pypdf` is already available and used for local PDF chunk planning and temporary chunk PDF writing.
|
||||
- `pdf2md convert` currently supports `--chunk-pages [PAGES]`.
|
||||
- Chunk mode currently treats `chunk_pages` as the MinerU input PDF page count and writes one final Markdown file per input chunk.
|
||||
- `convert_pdf(..., chunk_pages=N)` currently returns `BatchConversionResult` in chunk mode.
|
||||
- Sprint 13 text fidelity diagnostics are most accurate when each MinerU Markdown output maps to exactly one source page.
|
||||
|
||||
## Contract Assumptions
|
||||
|
||||
- Keep chunk mode opt-in for this sprint. If `chunk_pages` is `None`, the existing non-chunked full-PDF conversion path remains unchanged.
|
||||
- Keep the public option name `--chunk-pages` for CLI/API compatibility, but redefine its behavior in chunk mode as the output group size, not the MinerU input size.
|
||||
- If `--chunk-pages` is present without a value, use `DEFAULT_CHUNK_PAGES` (20) as the output group size.
|
||||
- In chunk mode, even a PDF with fewer than `chunk_pages` pages is converted internally one page at a time and emitted as one grouped output file.
|
||||
- Final grouped outputs are the public conversion results. Temporary per-page Markdown, metadata, reports, assets, and one-page PDFs are not retained unless a later sprint explicitly adds debug retention.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed during implementation:
|
||||
|
||||
- `src/pdf2md/pdf_splitter.py`
|
||||
- `src/pdf2md/conversion.py`
|
||||
- `src/pdf2md/paths.py`
|
||||
- `src/pdf2md/metadata.py`
|
||||
- `src/pdf2md/report.py`
|
||||
- `src/pdf2md/cli.py`
|
||||
- `src/pdf2md_ui/app.py`
|
||||
- `src/pdf2md_ui/runner.py`
|
||||
- `tests/test_pdf_splitter.py`
|
||||
- `tests/test_conversion.py`
|
||||
- `tests/test_cli.py`
|
||||
- `tests/test_paths.py`
|
||||
- `tests/test_metadata.py`
|
||||
- `tests/test_report.py`
|
||||
- `tests/test_ui_runner.py`
|
||||
- `README.md`
|
||||
- `ARCHITECTURE.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md`
|
||||
- `PLAN.md`
|
||||
- `PROGRESS.md`
|
||||
- `docs/WORKARCHIVE.md` after implementation
|
||||
|
||||
Allowed if a focused helper boundary keeps `conversion.py` simpler:
|
||||
|
||||
- Create `src/pdf2md/page_grouping.py`
|
||||
- Create `tests/test_page_grouping.py`
|
||||
|
||||
Not allowed:
|
||||
|
||||
- Adding another conversion engine or runtime engine selector.
|
||||
- Running page conversions in parallel by default. GTX 1070 Ti 8GB memory pressure makes sequential conversion the safe default.
|
||||
- Adding cloud OCR, hosted LLM/VLM, remote document parsing, `--api-url`, router mode, HTTP client backends, or remote OpenAI-compatible endpoints.
|
||||
- Making default tests depend on real MinerU, GPU, model files, network, Obsidian, MathJax, or `samples/`.
|
||||
- Committing sample PDFs, generated `outputs/`, retained temporary page outputs, or `dist/pdf2md-ui.exe`.
|
||||
|
||||
## Product Behavior
|
||||
|
||||
### Activation
|
||||
|
||||
Existing non-chunked conversion remains unchanged:
|
||||
|
||||
```powershell
uv run pdf2md convert paper.pdf --out outputs
```
|
||||
|
||||
Grouped page conversion is enabled by `--chunk-pages`:
|
||||
|
||||
```powershell
uv run pdf2md convert paper.pdf --out outputs --chunk-pages
uv run pdf2md convert paper.pdf --out outputs --chunk-pages 20
uv run pdf2md convert paper.pdf --out outputs --chunk-pages 1
```
|
||||
|
||||
Behavior:
|
||||
|
||||
- `--chunk-pages` means output group size.
|
||||
- `--chunk-pages 20` converts pages 1, 2, 3, ... as independent one-page MinerU jobs, then emits grouped outputs covering pages 1-20, 21-40, and so on.
|
||||
- `--chunk-pages 1` emits one final output file per source page.
|
||||
- `convert_pdf(..., chunk_pages=N)` still returns `BatchConversionResult`; each `ConversionResult` represents one final grouped output file, not each internal one-page MinerU run.
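The grouping rule above reduces to simple range arithmetic over one-based source page numbers. A minimal sketch, assuming a hypothetical helper name `plan_output_groups`:

```python
from typing import List, Tuple


def plan_output_groups(page_count: int, chunk_pages: int = 20) -> List[Tuple[int, int]]:
    """Return inclusive one-based (first_page, last_page) ranges per output group."""
    if page_count < 1 or chunk_pages < 1:
        raise ValueError("page_count and chunk_pages must be positive")
    groups: List[Tuple[int, int]] = []
    for first in range(1, page_count + 1, chunk_pages):
        # The last group is shorter when page_count is not a multiple of chunk_pages.
        last = min(first + chunk_pages - 1, page_count)
        groups.append((first, last))
    return groups
```

Each page inside a range would still be converted as its own one-page MinerU job; the ranges only decide which pages end up in the same output file.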
|
||||
|
||||
### Output Naming
|
||||
|
||||
Use the existing part/page-range naming shape for grouped outputs:
|
||||
|
||||
```text
<stem>.part-001.pages-001-020.md
<stem>.part-001.pages-001-020.metadata.json
<stem>.part-001.pages-001-020.report.md
<stem>.part-001.pages-001-020.assets/

<stem>.part-002.pages-021-040.md
...
```
|
||||
|
||||
If a 13-page PDF is converted with `--chunk-pages 20`, it emits:
|
||||
|
||||
```text
<stem>.part-001.pages-001-013.md
<stem>.part-001.pages-001-013.metadata.json
<stem>.part-001.pages-001-013.report.md
<stem>.part-001.pages-001-013.assets/
```
|
||||
|
||||
This is an intentional behavior change from Sprint 10: short PDFs in chunk mode no longer bypass chunk mode and no longer write `<stem>.md`.
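The naming shape shown above can be produced with a small formatter. In this sketch `grouped_output_names` is a hypothetical helper, and the three-digit zero padding is inferred from the examples; how the project handles part or page numbers above 999 is not stated here.

```python
from typing import Dict


def grouped_output_names(
    stem: str, part_number: int, first_page: int, last_page: int
) -> Dict[str, str]:
    """Build the file and directory names for one grouped output."""
    base = f"{stem}.part-{part_number:03d}.pages-{first_page:03d}-{last_page:03d}"
    return {
        "markdown": f"{base}.md",
        "metadata": f"{base}.metadata.json",
        "report": f"{base}.report.md",
        "assets": f"{base}.assets",
    }
```

For a 13-page `paper.pdf` with `--chunk-pages 20`, `grouped_output_names("paper", 1, 1, 13)["markdown"]` evaluates to `paper.part-001.pages-001-013.md`.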
|
||||
|
||||
### Internal Page Conversion
|
||||
|
||||
For every source page in chunk mode:
|
||||
|
||||
- Write a one-page temporary PDF with pypdf.
|
||||
- Run the existing local MinerU adapter against that one-page PDF.
|
||||
- Normalize Markdown, copy page assets into a temporary page assets directory, run MathJax checks/repair, and run Sprint 13 text fidelity diagnostics against the original source page.
|
||||
- Delete the one-page temporary PDF and temporary per-page final files after grouped output generation.
|
||||
|
||||
The implementation should reuse existing conversion primitives where practical, but it must avoid writing final public files for every page before grouping.
|
||||
|
||||
### Markdown Grouping
|
||||
|
||||
For each output group:
|
||||
|
||||
- Concatenate successful page Markdown in source page order.
|
||||
- Separate pages with blank lines and an HTML comment that is invisible in Obsidian preview:
|
||||
|
||||
```markdown
|
||||
<!-- source-page: 7 -->
|
||||
```
|
||||
|
||||
- Do not add visible page headings or instructional text.
|
||||
- If a page conversion fails, do not invent Markdown for that page. Add an invisible comment at the page boundary:
|
||||
|
||||
```markdown
|
||||
<!-- source-page: 7 conversion failed; see report -->
|
||||
```
|
||||
|
||||
- Preserve Obsidian-friendly math delimiters and display math spacing after concatenation.
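
The grouping rules above can be sketched as a small pure function. This is only an illustration under stated assumptions: the function name `merge_page_markdown` and the `(page_number, markdown_or_None)` input shape are hypothetical, and placing a boundary comment before every page (including the first) is one reading of the contract, not a confirmed detail of the implementation.

```python
from typing import Optional


def merge_page_markdown(pages: list[tuple[int, Optional[str]]]) -> str:
    """Join per-page Markdown in source order with invisible page markers.

    `pages` holds (source_page_number, markdown) pairs; markdown is None
    when that page's conversion failed (hypothetical input shape).
    """
    blocks: list[str] = []
    for page_number, markdown in pages:
        if markdown is None:
            # Never invent content for a failed page; only mark the boundary.
            blocks.append(
                f"<!-- source-page: {page_number} conversion failed; see report -->"
            )
            continue
        blocks.append(f"<!-- source-page: {page_number} -->")
        body = markdown.strip("\n")
        if body:
            blocks.append(body)
    # Blank lines between blocks keep display math separated from the markers.
    return "\n\n".join(blocks) + "\n"
```

A helper like this adds no visible headings, so the grouped file should render in Obsidian the same way the individual pages would.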
|
||||
|
||||
### Asset Grouping
|
||||
|
||||
Assets from temporary per-page outputs must be copied into the grouped assets directory with collision-proof names.
|
||||
|
||||
Recommended destination layout:
|
||||
|
||||
```text
|
||||
<stem>.part-001.pages-001-020.assets/page-001/<asset-name>
|
||||
<stem>.part-001.pages-001-020.assets/page-002/<asset-name>
|
||||
```
|
||||
|
||||
Markdown image links must be rewritten to the grouped assets directory. This keeps repeated MinerU asset filenames from different pages from overwriting each other.
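
A minimal sketch of the link rewrite, assuming each temporary page output keeps its images under a relative directory such as `images/` (that directory name, the function name, and the regular expression are illustrative assumptions, not the project's actual helper):

```python
import re

# Matches inline Markdown images: ![alt](target)
_IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def rewrite_asset_links(
    markdown: str,
    page_number: int,
    page_assets_dir: str,
    group_assets_dir: str,
) -> str:
    """Point per-page image links at <group assets>/page-NNN/<asset-name>."""
    prefix = page_assets_dir.rstrip("/") + "/"

    def _replace(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        if not target.startswith(prefix):
            # Leave links that are not per-page assets untouched.
            return match.group(0)
        name = target[len(prefix):]
        return f"![{alt}]({group_assets_dir}/page-{page_number:03d}/{name})"

    return _IMAGE_LINK.sub(_replace, markdown)
```

Because the page number is part of the destination path, two pages that both produce an asset with the same filename end up in different subdirectories.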
|
||||
|
||||
### Metadata And Report Grouping
|
||||
|
||||
Grouped metadata must be derived from per-page conversion records plus group-level checks.
|
||||
|
||||
Required metadata behavior:
|
||||
|
||||
- `source_pdf` remains the original source PDF path.
|
||||
- `source_sha256` remains the original source PDF hash.
|
||||
- `pages` contains one page record per source page in the group.
|
||||
- Page indexes in grouped metadata are group-local zero-based indexes.
|
||||
- Original source page numbers remain visible in chunk/page conversion provenance.
|
||||
- Warnings from per-page conversions are preserved with adjusted group-local page indexes.
|
||||
- Warnings for failed page conversions are added with original source page context.
|
||||
- `text_fidelity` records are carried from one-page checks and keep exact `source_page_number` values.
|
||||
- Summary counts are aggregated from the grouped metadata and grouped Markdown.
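
The index rule above (group-local zero-based indexes, original source page numbers preserved) comes down to one subtraction. The sketch below assumes warnings are plain dictionaries with a `page_index` key; the real project uses its own record types, so every field name here is hypothetical.

```python
def regroup_warning(
    warning: dict, source_page_number: int, group_page_start: int
) -> dict:
    """Return a copy of a one-page warning re-indexed for its output group.

    A one-page conversion always reports page_index 0. In the grouped
    output the index becomes the page's zero-based offset within the group,
    while the 1-based source page number is kept for provenance.
    """
    adjusted = dict(warning)  # do not mutate the per-page record
    adjusted["page_index"] = source_page_number - group_page_start
    adjusted["source_page_number"] = source_page_number
    return adjusted
```

For example, source page 27 in the group covering pages 21-40 would get group-local index 6.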
|
||||
|
||||
Required `engine_options` shape:
|
||||
|
||||
```json
|
||||
{
|
||||
"chunk": {
|
||||
"original_source_pdf": "...",
|
||||
"chunk_index": 1,
|
||||
"total_chunks": 3,
|
||||
"source_page_start": 1,
|
||||
"source_page_end": 20,
|
||||
"chunk_page_count": 20
|
||||
},
|
||||
"page_conversion": {
|
||||
"mode": "single_page",
|
||||
"mineru_input_page_count": 1,
|
||||
"output_group_page_count": 20,
|
||||
"failed_source_pages": []
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Report Markdown must continue to include the existing chunk context line and should add a concise page-conversion line, for example:
|
||||
|
||||
```text
|
||||
- Page conversion mode: single-page MinerU inputs, grouped output size: 20
|
||||
```
|
||||
|
||||
## Failure Policy
|
||||
|
||||
- Convert pages sequentially.
|
||||
- If a page fails, continue with later pages.
|
||||
- If at least one page in a group succeeds, write the grouped Markdown/metadata/report and mark final status `partial`.
|
||||
- If every page in a group fails, return a failed `ConversionResult` for that grouped output and do not write Markdown for that group.
|
||||
- Failed pages must be visible in metadata/report warnings.
|
||||
- There is no silent fallback and no retry loop in this sprint.
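
As a rough sketch of this policy, the loop below converts pages one at a time, records failures, and derives the group status afterwards. The names `convert_group` and `convert_page` and the status string `"success"` are assumptions for illustration; only `partial` and the failed outcome are named in this contract.

```python
from typing import Callable


def convert_group(
    page_numbers: list[int], convert_page: Callable[[int], str]
) -> tuple[str, dict[int, str], list[int]]:
    """Convert pages sequentially and classify the grouped result."""
    if not page_numbers:
        raise ValueError("a group must contain at least one page")
    converted: dict[int, str] = {}
    failed: list[int] = []
    for page_number in page_numbers:  # strictly sequential, source-page order
        try:
            converted[page_number] = convert_page(page_number)
        except Exception:
            # One attempt only: no retry and no fallback engine.
            failed.append(page_number)
    if not failed:
        status = "success"  # assumed name for the fully successful case
    elif converted:
        status = "partial"  # at least one page succeeded: outputs are written
    else:
        status = "failed"   # every page failed: no Markdown for this group
    return status, converted, failed
```

The `failed` list is what would feed `failed_source_pages` and the metadata/report warnings.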
|
||||
|
||||
## Architecture Plan
|
||||
|
||||
### WP14.1: Page And Group Planning
|
||||
|
||||
Actions:
|
||||
|
||||
- Extend `pdf_splitter.py` or add `page_grouping.py` with project-owned records for:
|
||||
- one-page MinerU input plans,
|
||||
- final output group plans,
|
||||
- original source page ranges,
|
||||
- deterministic output stems.
|
||||
- Keep pypdf page extraction local and temporary.
|
||||
- Validate output group size as a positive integer.
|
||||
- Plan output groups before conversion starts so overwrite/conflict behavior remains deterministic.
|
||||
|
||||
Expected output:
|
||||
|
||||
- A 41-page PDF with group size 20 plans 41 one-page MinerU inputs and 3 final grouped outputs.
|
||||
- A 13-page PDF with group size 20 plans 13 one-page MinerU inputs and 1 final grouped output.
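
The planning arithmetic behind these expectations can be sketched as follows. `GroupPlan` and `plan_groups` are hypothetical names used only for this example; the stem format mirrors the part/page-range naming shown under Output Naming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupPlan:
    """One final grouped output (illustrative record, not the project's)."""

    part_index: int  # 1-based part number
    page_start: int  # 1-based first source page in the group
    page_end: int    # 1-based last source page in the group, inclusive

    def output_stem(self, stem: str) -> str:
        return (
            f"{stem}.part-{self.part_index:03d}"
            f".pages-{self.page_start:03d}-{self.page_end:03d}"
        )


def plan_groups(total_pages: int, group_size: int) -> list[GroupPlan]:
    """Plan grouped outputs; MinerU inputs are always one page each."""
    if total_pages < 1:
        raise ValueError("total_pages must be a positive integer")
    if group_size < 1:
        raise ValueError("group size must be a positive integer")
    plans = []
    starts = range(1, total_pages + 1, group_size)
    for part_index, start in enumerate(starts, start=1):
        end = min(start + group_size - 1, total_pages)
        plans.append(GroupPlan(part_index, start, end))
    return plans
```

Planning every group before any conversion starts means all output paths are known up front, which is what keeps overwrite and conflict checks deterministic.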
|
||||
|
||||
### WP14.2: Conversion Orchestration
|
||||
|
||||
Actions:
|
||||
|
||||
- Rework chunk-mode `convert_pdf()` and `convert_input()` orchestration so `chunk_pages` creates grouped output tasks.
|
||||
- Run one-page MinerU inputs in source-page order.
|
||||
- Keep temporary page PDFs and intermediate page outputs under local temporary directories.
|
||||
- Keep `BatchConversionResult` at the grouped-output level.
|
||||
- Keep strict-local validation unchanged.
|
||||
|
||||
Expected output:
|
||||
|
||||
- The public API keeps returning multiple grouped results in chunk mode while the adapter is called once per source page internally.
|
||||
|
||||
### WP14.3: Markdown And Asset Group Assembly
|
||||
|
||||
Actions:
|
||||
|
||||
- Build a focused helper to merge page Markdown and page assets into a grouped output.
|
||||
- Insert invisible `<!-- source-page: N -->` boundaries.
|
||||
- Rewrite per-page asset links to `page-NNN/` asset subdirectories.
|
||||
- Run final group-level local quality checks after asset rewriting.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Grouped Markdown renders in Obsidian and assets do not collide across pages.
|
||||
|
||||
### WP14.4: Metadata, Warnings, And Report Assembly
|
||||
|
||||
Actions:
|
||||
|
||||
- Aggregate per-page metadata into grouped metadata.
|
||||
- Adjust page indexes from page-local `0` to group-local indexes.
|
||||
- Preserve original source page numbers in `engine_options` and text fidelity records.
|
||||
- Add `page_conversion` engine options.
|
||||
- Add a report line for single-page conversion mode and grouped output size.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Metadata/report can explain both facts: MinerU saw one page at a time, while the user received grouped Markdown files.
|
||||
|
||||
### WP14.5: CLI, UI, And Documentation
|
||||
|
||||
Actions:
|
||||
|
||||
- Update CLI help for `--chunk-pages` from "pre-conversion PDF chunking" to "group converted pages into output files of N pages; MinerU runs one page at a time."
|
||||
- Update README and architecture docs with the new behavior.
|
||||
- Update the Windows UI label/help text so the field represents output group size.
|
||||
- Keep runner command construction using `--chunk-pages N`.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Users do not confuse `--chunk-pages 20` with a 20-page MinerU input.
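
One way to express "`--chunk-pages` without a value means 20" is an optional-argument flag with a constant. The sketch assumes the CLI is built on `argparse`, which this contract does not state; `positive_int` and `build_parser` are hypothetical names.

```python
import argparse


def positive_int(value: str) -> int:
    """Reject zero and negative group sizes at parse time."""
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pdf2md-sketch")
    parser.add_argument("pdf")
    parser.add_argument(
        "--chunk-pages",
        nargs="?",          # the value is optional
        type=positive_int,
        const=20,           # used when the flag is given without a value
        default=None,       # flag absent: non-chunked conversion
        metavar="N",
        help=(
            "group converted pages into output files of N pages; "
            "MinerU runs one page at a time"
        ),
    )
    return parser
```

With `nargs="?"`, a bare `--chunk-pages` should come after the positional PDF path or directly before another option, as in the commands shown in this document; otherwise the parser would try to read the next token as `N`.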
|
||||
|
||||
### WP14.6: Tests
|
||||
|
||||
Default fast tests:
|
||||
|
||||
- Generated blank local PDFs verify page count and group planning for 1, 13, 20, 21, 40, and 41 pages.
|
||||
- `--chunk-pages` without a value still passes `20`.
|
||||
- `convert_pdf(..., chunk_pages=20)` for 41 pages calls the fake adapter 41 times and returns 3 grouped `ConversionResult` objects.
|
||||
- `convert_pdf(..., chunk_pages=20)` for 13 pages calls the fake adapter 13 times and returns 1 grouped output named `part-001.pages-001-013`.
|
||||
- `convert_pdf(..., chunk_pages=1)` returns one grouped output per source page.
|
||||
- Temporary one-page PDFs and temporary per-page outputs are deleted after conversion.
|
||||
- A failed internal page conversion does not stop later pages and appears in grouped metadata/report.
|
||||
- A group with only failed pages returns a failed result and writes no Markdown.
|
||||
- Asset filenames from different pages do not collide in the grouped assets directory.
|
||||
- Per-page warnings and text fidelity records are adjusted to group-local page indexes while preserving original source page numbers.
|
||||
- Existing non-chunked conversion tests keep passing unchanged.
|
||||
- UI runner tests continue to build fixed argument lists with `shell=False`.
|
||||
|
||||
Optional local validation:
|
||||
|
||||
```powershell
|
||||
$env:MINERU_MODEL_SOURCE='local'
|
||||
$pdf = (Get-ChildItem samples -Filter '2007*.pdf' | Select-Object -First 1).FullName
|
||||
uv run pdf2md convert $pdf --out outputs\sprint14-2007-page-grouped --overwrite --chunk-pages
|
||||
```
|
||||
|
||||
Expected optional validation:
|
||||
|
||||
- The 13-page Korean sample emits one grouped Markdown file for pages 1-13.
|
||||
- Metadata/report show exact page-level text fidelity records.
|
||||
- Generated outputs stay ignored and uncommitted.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- Chunk mode runs MinerU on one-page temporary PDFs only.
|
||||
- `chunk_pages` controls final grouped output page count.
|
||||
- Default group size remains 20 when `--chunk-pages` is supplied without a value.
|
||||
- Grouped Markdown, metadata JSON, report Markdown, and grouped assets directory are written.
|
||||
- Grouped metadata preserves original source PDF, original source SHA-256, group page range, one-page conversion mode, page warnings, and text fidelity provenance.
|
||||
- Failed page conversions are explicit, nonfatal to later pages, and visible in report/metadata.
|
||||
- Default tests remain fast and local.
|
||||
- Strict-local policy remains unchanged.
|
||||
- Non-chunked conversion behavior remains backward-compatible.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
- Chunk mode sends more than one source page to MinerU in a single temporary PDF.
|
||||
- `--chunk-pages` continues to mean MinerU input chunk size after this sprint.
|
||||
- Grouped outputs lose source page provenance or hide failed pages.
|
||||
- Asset links collide or point outside the grouped assets directory.
|
||||
- Default tests require real MinerU, GPU, model files, network, Obsidian, MathJax, or `samples/`.
|
||||
- The implementation adds a remote API/backend path, alternate conversion engine, router mode, or OpenAI-compatible backend.
|
||||
- Sample PDFs, generated outputs, retained temporary page outputs, or `dist/pdf2md-ui.exe` are committed.
|
||||
|
||||
## Verification Commands
|
||||
|
||||
```powershell
|
||||
uv run pytest tests/test_pdf_splitter.py tests/test_conversion.py tests/test_cli.py tests/test_paths.py tests/test_metadata.py tests/test_report.py tests/test_ui_runner.py
|
||||
uv run pytest
|
||||
git diff --check
|
||||
git status --short --untracked-files=all
|
||||
```
|
||||
|
||||
The optional local validation command is listed in WP14.6 and should be run only when a long GPU conversion is acceptable.
|
||||
|
||||
## Handoff Requirements
|
||||
|
||||
After implementation:
|
||||
|
||||
- Update `PROGRESS.md` with files changed, commands run, test outcomes, optional sample validation outcome, known failures, residual risks, and next action.
|
||||
- Archive completed implementation details in `docs/WORKARCHIVE.md` after verification.
|
||||
- Keep sample PDFs, generated outputs, retained temporary page outputs, and build artifacts out of the commit.
|
||||
- Record whether the 2007 Korean sample was validated with grouped page conversion and how many grouped outputs were produced.
|
||||
|
||||
Implementation handoff on 2026-05-11:
|
||||
|
||||
- Implemented grouped page conversion in `src/pdf2md/conversion.py` with one-page temporary MinerU inputs and grouped public outputs.
|
||||
- Added report output for `page_conversion` engine options.
|
||||
- Updated CLI help, UI label text, README, architecture, implementation plan, and coordination/archive docs.
|
||||
- Verification: targeted Sprint 14 tests passed, the 101-test related suite passed, and full `uv run pytest` passed 202 tests with 1 optional skip.
|
||||
- Optional real MinerU validation on the 2007 Korean sample was not run during this implementation pass.
|
||||
|
||||
## Future Sprint Boundary
|
||||
|
||||
A later sprint may make grouped page conversion the default even without `--chunk-pages`, add resumable page caches, or add a debug option to retain intermediate per-page outputs. Those behaviors are intentionally out of Sprint 14 scope.
|
||||
@@ -0,0 +1,431 @@
|
||||
# Sprint 15 Contract: NVIDIA GPU Detection And Auto MinerU Profile
|
||||
|
||||
Status: Implemented
|
||||
Last updated: 2026-05-12
|
||||
|
||||
## Objective
|
||||
|
||||
Add a strict-local runtime profiling layer that detects installed NVIDIA GPUs and applies conservative MinerU environment tuning by default.
|
||||
|
||||
The default runtime profile is `auto`. In `auto`, the converter should keep 8GB and pre-Turing GPUs conservative, while allowing a slightly more aggressive local MinerU configuration only when the selected NVIDIA GPU has at least 16GB VRAM and no pre-Turing compatibility warning.
|
||||
|
||||
This sprint is motivated by local evidence from `samples\FourNodeQuadrilateralShellElementMITC4.pdf`: Sprint 14's one-page conversion path used `cuda:0` correctly, but GTX 1070 Ti 8GB stayed near full VRAM use and stalled on source page 2. The next useful test should be on a stronger NVIDIA GPU with explicit runtime diagnostics and reproducible MinerU environment settings.
|
||||
|
||||
## Source Basis
|
||||
|
||||
Use these source-backed facts during implementation:
|
||||
|
||||
- MinerU CLI supports `mineru -p <input_path> -o <output_path>` and, without `--api-url`, launches a temporary local `mineru-api`: https://opendatalab.github.io/MinerU/usage/cli_tools/
|
||||
- MinerU CLI documents `-b/--backend`, `-f/--formula`, `-t/--table`, `--api-url`, and related options, but this project must not expose remote/API or backend selection paths in v1: https://opendatalab.github.io/MinerU/usage/cli_tools/
|
||||
- MinerU environment variables include `MINERU_PDF_RENDER_THREADS`, `MINERU_PROCESSING_WINDOW_SIZE`, `MINERU_API_MAX_CONCURRENT_REQUESTS`, and timeout settings: https://opendatalab.github.io/MinerU/usage/cli_tools/
|
||||
- MinerU advanced CLI docs support selecting visible GPU devices with `CUDA_VISIBLE_DEVICES`: https://opendatalab.github.io/MinerU/usage/advanced_cli_parameters/
|
||||
- MinerU local deployment docs list auto-engine GPU requirements around 8GB+ VRAM and GPU acceleration for Volta-or-later devices: https://opendatalab.github.io/MinerU/quick_start/
|
||||
- MinerU extension docs say `vllm` and `lmdeploy` acceleration extras are alternatives and should not both be installed just for this sprint: https://opendatalab.github.io/MinerU/quick_start/extension_modules/
|
||||
|
||||
Access date for the source review: 2026-05-12.
|
||||
|
||||
## Current Precondition
|
||||
|
||||
- MinerU 3.1.0 remains the only conversion engine.
|
||||
- Conversion runs through direct local `mineru` CLI execution only.
|
||||
- Strict-local allows only the direct CLI and MinerU CLI-internal temporary local `mineru-api`; remote API/backend paths remain prohibited.
|
||||
- `pdf2md convert` defaults to `--gpu cuda:0`.
|
||||
- `MinerUAdapter` currently maps `cuda:N` to `MINERU_DEVICE_MODE=cuda` and `CUDA_VISIBLE_DEVICES=N`.
|
||||
- `pdf2md doctor` already reports NVIDIA GPU visibility, PyTorch CUDA visibility, GPU names, and Pascal/pre-Turing warnings.
|
||||
- Sprint 14 chunk mode runs one source page per MinerU invocation when `--chunk-pages` is active.
|
||||
|
||||
## Contract Assumptions
|
||||
|
||||
- Keep `--gpu cuda:0` as the default for backward compatibility with PRD and existing docs.
|
||||
- Add `--gpu auto` as an opt-in GPU selection mode that chooses the visible NVIDIA GPU with the largest reported VRAM.
|
||||
- Add `--mineru-profile {auto,safe,performance}` with default `auto`.
|
||||
- Keep all conversion requests sequential in Sprint 15. Do not introduce parallel page conversion.
|
||||
- Keep formula and table parsing enabled. Do not optimize by disabling required output quality features.
|
||||
- Do not add `--backend`, `--api-url`, `--url`, router mode, HTTP client backend, remote OpenAI-compatible backend, or remote model server support.
|
||||
- Treat MinerU environment tuning as best-effort. If GPU inventory cannot be read, continue with safe profile settings and a warning/provenance record rather than guessing aggressive values.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed during implementation:
|
||||
|
||||
- Create `src/pdf2md/gpu.py`
|
||||
- Create `src/pdf2md/mineru_profile.py`
|
||||
- Modify `src/pdf2md/mineru_adapter.py`
|
||||
- Modify `src/pdf2md/conversion.py`
|
||||
- Modify `src/pdf2md/cli.py`
|
||||
- Modify `src/pdf2md/doctor.py`
|
||||
- Modify `src/pdf2md_ui/runner.py` only if the UI command builder needs profile passthrough
|
||||
- Modify `src/pdf2md_ui/app.py` only if a minimal profile control is necessary
|
||||
- Add `tests/test_gpu.py`
|
||||
- Add `tests/test_mineru_profile.py`
|
||||
- Modify `tests/test_mineru_adapter.py`
|
||||
- Modify `tests/test_conversion.py`
|
||||
- Modify `tests/test_cli.py`
|
||||
- Modify `tests/test_doctor.py`
|
||||
- Modify `tests/test_ui_runner.py` only if UI command construction changes
|
||||
- Modify `README.md`
|
||||
- Modify `ARCHITECTURE.md`
|
||||
- Modify `PRD.md` if CLI option documentation changes
|
||||
- Modify `docs/V1IMPLEMENTATIONPLAN.md`
|
||||
- Modify `PLAN.md`
|
||||
- Modify `PROGRESS.md`
|
||||
- Modify `docs/WORKARCHIVE.md` after implementation
|
||||
|
||||
Not allowed:
|
||||
|
||||
- Adding another conversion engine or runtime engine selector.
|
||||
- Passing `--api-url`, `--url`, or any remote endpoint to MinerU.
|
||||
- Adding `mineru-router`, HTTP client backend, or OpenAI-compatible backend usage.
|
||||
- Installing `vllm`, `lmdeploy`, CUDA packages, models, or any runtime package automatically.
|
||||
- Changing the default conversion engine or disabling formula/table recognition.
|
||||
- Making default tests depend on real MinerU, GPU, CUDA, PyTorch, model files, network, Obsidian, MathJax, or `samples/`.
|
||||
- Committing sample PDFs, generated `outputs/`, retained temporary page outputs, local model files, or `dist/pdf2md-ui.exe`.
|
||||
|
||||
## Product Behavior
|
||||
|
||||
### CLI
|
||||
|
||||
Existing behavior remains valid:
|
||||
|
||||
```powershell
|
||||
uv run pdf2md convert paper.pdf --out outputs
|
||||
uv run pdf2md convert paper.pdf --out outputs --gpu cuda:0
|
||||
```
|
||||
|
||||
New behavior:
|
||||
|
||||
```powershell
|
||||
uv run pdf2md convert paper.pdf --out outputs --mineru-profile auto
|
||||
uv run pdf2md convert paper.pdf --out outputs --mineru-profile safe
|
||||
uv run pdf2md convert paper.pdf --out outputs --mineru-profile performance
|
||||
uv run pdf2md convert paper.pdf --out outputs --gpu auto --mineru-profile auto
|
||||
```
|
||||
|
||||
Rules:
|
||||
|
||||
- `--mineru-profile` defaults to `auto`.
|
||||
- `--gpu cuda:N` selects a concrete CUDA index and tunes MinerU for that selected GPU when inventory is available.
|
||||
- `--gpu N` is still normalized to `cuda:N`.
|
||||
- `--gpu auto` selects the visible NVIDIA GPU with the largest VRAM from local GPU inventory.
|
||||
- If `--gpu auto` cannot find a visible NVIDIA GPU, fail clearly before conversion rather than silently switching to CPU.
|
||||
- If `--mineru-profile performance` is requested on a selected GPU below 16GB VRAM or with pre-Turing risk, downgrade to safe settings with a warning in metadata/report. Do not fail solely because performance was unsafe.
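
The parsing side of these rules might look like the sketch below. It assumes an `argparse`-based CLI and covers only the three `--gpu` forms named above (`auto`, `cuda:N`, `N`); `normalize_gpu` is a hypothetical helper, and any other device strings the real CLI accepts are out of scope here.

```python
import argparse
import re


def normalize_gpu(value: str) -> str:
    """Map 'N' and 'cuda:N' to 'cuda:N'; pass 'auto' through unchanged."""
    text = value.strip().lower()
    if text == "auto":
        return "auto"
    match = re.fullmatch(r"(?:cuda:)?(\d+)", text)
    if match is None:
        raise argparse.ArgumentTypeError(f"unsupported --gpu value: {value!r}")
    return f"cuda:{int(match.group(1))}"


parser = argparse.ArgumentParser(prog="pdf2md-sketch")
# The default stays cuda:0 for backward compatibility.
parser.add_argument("--gpu", type=normalize_gpu, default="cuda:0")
# argparse rejects any value outside `choices` with a usage error.
parser.add_argument(
    "--mineru-profile",
    choices=("auto", "safe", "performance"),
    default="auto",
)
```

Resolving `auto` to a concrete device, and downgrading an unsafe `performance` request, would happen later against the GPU inventory rather than in the argument parser.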
|
||||
|
||||
### Doctor
|
||||
|
||||
`pdf2md doctor` should report:
|
||||
|
||||
- All visible NVIDIA GPUs with index, name, total VRAM, and driver version from `nvidia-smi`.
|
||||
- PyTorch CUDA device names and compute capabilities when available.
|
||||
- Selected default GPU recommendation for `--gpu auto`.
|
||||
- Recommended MinerU profile for the detected primary GPU.
|
||||
- Existing Pascal/pre-Turing warnings.
|
||||
|
||||
Doctor must not require a real conversion, model load, network access, or package download.
|
||||
|
||||
### Auto Profile Policy
|
||||
|
||||
Use a small deterministic policy table. Values are intentionally conservative because the converter runs real PDFs and should prefer completion over peak throughput.
|
||||
|
||||
| Selected GPU | Auto policy | MinerU environment |
|
||||
| --- | --- | --- |
|
||||
| No GPU inventory, CUDA requested | Safe fallback with warning | `MINERU_PROCESSING_WINDOW_SIZE=1`, `MINERU_API_MAX_CONCURRENT_REQUESTS=1`, `MINERU_PDF_RENDER_THREADS=1` |
|
||||
| Pre-Turing or VRAM < 12GB | Safe | `MINERU_PROCESSING_WINDOW_SIZE=1`, `MINERU_API_MAX_CONCURRENT_REQUESTS=1`, `MINERU_PDF_RENDER_THREADS=1` |
|
||||
| 12GB <= VRAM < 16GB and Turing-or-newer | Auto conservative | `MINERU_PROCESSING_WINDOW_SIZE=4`, `MINERU_API_MAX_CONCURRENT_REQUESTS=1`, `MINERU_PDF_RENDER_THREADS=2` |
|
||||
| VRAM >= 16GB and Turing-or-newer | Auto moderately aggressive | `MINERU_PROCESSING_WINDOW_SIZE=8`, `MINERU_API_MAX_CONCURRENT_REQUESTS=1`, `MINERU_PDF_RENDER_THREADS=4` |
|
||||
| Explicit `safe` | Safe regardless of GPU | `MINERU_PROCESSING_WINDOW_SIZE=1`, `MINERU_API_MAX_CONCURRENT_REQUESTS=1`, `MINERU_PDF_RENDER_THREADS=1` |
|
||||
| Explicit `performance` on VRAM >= 16GB and Turing-or-newer | Performance | `MINERU_PROCESSING_WINDOW_SIZE=16`, `MINERU_API_MAX_CONCURRENT_REQUESTS=1`, `MINERU_PDF_RENDER_THREADS=4` |
|
||||
| Explicit `performance` on weaker GPU | Downgraded safe with warning | safe values |
|
||||
|
||||
Do not set `MINERU_HYBRID_BATCH_RATIO` in Sprint 15 because MinerU docs describe it as commonly used for `hybrid-http-client`, which this project prohibits in v1.
|
||||
|
||||
Do not set backend CLI flags in Sprint 15. The default MinerU backend remains MinerU-owned.
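
The policy table can be read as a single pure function. The sketch below is an illustration, not the project's `mineru_profile.py`: `resolve_profile`, the applied-profile labels, and plain-string warnings (instead of `WarningRecord`) are assumptions. It also assumes "16GB" means 16 × 1024 MiB; how marketed capacity maps to the MiB figure reported by `nvidia-smi` is not specified here, and some cards may report slightly less than the nominal value.

```python
from typing import Optional

MIB_PER_GIB = 1024


def _env(window: int, threads: int) -> dict[str, str]:
    """Build the allowlisted MinerU environment for one policy row."""
    return {
        "MINERU_PROCESSING_WINDOW_SIZE": str(window),
        "MINERU_API_MAX_CONCURRENT_REQUESTS": "1",  # always sequential
        "MINERU_PDF_RENDER_THREADS": str(threads),
    }


def resolve_profile(
    requested: str, vram_mib: Optional[int], pre_turing: bool
) -> tuple[str, dict[str, str], list[str]]:
    """Return (applied_label, environment, warnings) for the selected GPU."""
    if requested not in ("auto", "safe", "performance"):
        raise ValueError(f"unknown MinerU profile: {requested!r}")
    if requested == "safe":
        return "safe", _env(1, 1), []
    if vram_mib is None:
        # No inventory: never guess aggressive values.
        return "safe", _env(1, 1), ["GPU inventory unavailable; safe settings applied"]
    strong = vram_mib >= 16 * MIB_PER_GIB and not pre_turing
    if requested == "performance":
        if strong:
            return "performance", _env(16, 4), []
        return "safe", _env(1, 1), ["performance profile downgraded to safe for this GPU"]
    # requested == "auto"
    if pre_turing or vram_mib < 12 * MIB_PER_GIB:
        return "safe", _env(1, 1), []
    if not strong:
        return "auto-conservative", _env(4, 2), []
    return "auto-moderate", _env(8, 4), []
```

Keeping the function free of I/O means every row of the table can be covered by a fast unit test with made-up VRAM values.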
|
||||
|
||||
## Architecture Plan
|
||||
|
||||
### WP15.1: GPU Inventory Boundary
|
||||
|
||||
Actions:
|
||||
|
||||
- Add `src/pdf2md/gpu.py`.
|
||||
- Define immutable `GpuInfo` and `GpuInventory` records.
|
||||
- Parse `nvidia-smi --query-gpu=index,name,memory.total,driver_version --format=csv,noheader,nounits`.
|
||||
- Parse memory in MiB as an integer.
|
||||
- Mark pre-Turing risk using the existing name-based heuristic for GTX 10xx and other pre-Turing names.
|
||||
- Optionally enrich compute capability through PyTorch when available, but keep PyTorch optional and mockable.
|
||||
- Provide `select_gpu(gpus, requested)` for `cuda:N`, `N`, and `auto`.
|
||||
|
||||
Expected output:
|
||||
|
||||
- GPU detection is independently testable with captured command output strings.
|
||||
- No real `nvidia-smi`, GPU, or PyTorch is needed in default tests.
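
A minimal sketch of the inventory boundary, working only on captured command output so no GPU is needed. The record fields, the choice to skip malformed lines, and the error messages are assumptions for illustration; the real `gpu.py` may handle malformed rows differently (for example by recording a warning).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuInfo:
    index: int
    name: str
    memory_total_mib: int
    driver_version: str


def parse_nvidia_smi_csv(output: str) -> list[GpuInfo]:
    """Parse index,name,memory.total,driver_version rows (noheader, nounits)."""
    gpus = []
    for line in output.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            continue  # blank or unexpected line
        index, name, memory, driver = fields
        try:
            gpus.append(GpuInfo(int(index), name, int(memory), driver))
        except ValueError:
            continue  # malformed index or memory field (assumed: skip the row)
    return gpus


def select_gpu(gpus: list[GpuInfo], requested: str) -> GpuInfo:
    """Resolve 'auto', 'cuda:N', or 'N' against the parsed inventory."""
    if not gpus:
        raise ValueError("no visible NVIDIA GPU found")
    if requested == "auto":
        # Largest VRAM wins; on a tie the first listed GPU is kept.
        return max(gpus, key=lambda gpu: gpu.memory_total_mib)
    text = requested[5:] if requested.startswith("cuda:") else requested
    if not text.isdigit():
        raise ValueError(f"unsupported GPU request: {requested!r}")
    for gpu in gpus:
        if gpu.index == int(text):
            return gpu
    raise ValueError(f"GPU index {text} is not visible")
```

In tests, the `output` argument would simply be a string literal shaped like `nvidia-smi` output, so neither the command nor a GPU has to exist on the test machine.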
|
||||
|
||||
### WP15.2: MinerU Profile Policy
|
||||
|
||||
Actions:
|
||||
|
||||
- Add `src/pdf2md/mineru_profile.py`.
|
||||
- Define supported profile names: `auto`, `safe`, `performance`.
|
||||
- Define a result record containing:
|
||||
- requested profile,
|
||||
- applied profile,
|
||||
- selected GPU index if known,
|
||||
- selected GPU name if known,
|
||||
- selected GPU VRAM MiB if known,
|
||||
- environment variables to set,
|
||||
- warnings or info messages as project `WarningRecord` values.
|
||||
- Implement the policy table above.
|
||||
- Keep profile environment values in a small allowlist.
|
||||
|
||||
Expected output:
|
||||
|
||||
- The policy can be tested without running MinerU.
|
||||
- Performance profile cannot silently overcommit weak GPUs.
|
||||
|
||||
### WP15.3: Adapter Environment Integration
|
||||
|
||||
Actions:
|
||||
|
||||
- Extend `MinerUOptions` with `mineru_profile: str = "auto"` and optional resolved profile metadata.
|
||||
- Keep strict-local validation for every option string.
|
||||
- Update `_mineru_environment()` to merge:
|
||||
- `MINERU_DEVICE_MODE=cuda`,
|
||||
- `CUDA_VISIBLE_DEVICES=<selected index>`,
|
||||
- profile environment variables from `mineru_profile.py`.
|
||||
- Restore previous environment values after subprocess execution.
|
||||
- Include profile details in `engine_options`.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Real MinerU still receives only the direct local CLI command shape:
|
||||
|
||||
```text
|
||||
mineru -p <input> -o <output>
|
||||
```
|
||||
|
||||
- Tuning is done through local environment variables, not remote/API/backend flags.
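
Setting and restoring the environment around the subprocess can be sketched with a context manager. `temporary_environment` is a hypothetical helper; whether the adapter mutates `os.environ` this way or passes a copied mapping to the subprocess is not stated in this contract.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_environment(overrides: dict[str, str]) -> Iterator[None]:
    """Apply environment overrides, then restore the previous values."""
    missing = object()  # sentinel for variables that were not set before
    previous = {name: os.environ.get(name, missing) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, value in previous.items():
            if value is missing:
                os.environ.pop(name, None)  # variable did not exist before
            else:
                os.environ[name] = value    # put the old value back
```

The MinerU subprocess would be launched inside the `with` block. An alternative that avoids touching the parent process at all is to build a merged copy of the environment and pass it through the `env` argument of `subprocess.run`.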
|
||||
|
||||
### WP15.4: Conversion And CLI Wiring
|
||||
|
||||
Actions:
|
||||
|
||||
- Add `--mineru-profile` to `pdf2md convert`.
|
||||
- Accept `--gpu auto`.
|
||||
- Resolve selected GPU and profile before calling the adapter.
|
||||
- Surface profile warnings in conversion metadata/report warnings.
|
||||
- Preserve existing `--gpu cuda:0` default.
|
||||
- Ensure `convert_pdf()` can receive the profile through the Python API.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Default conversions use `mineru_profile=auto`.
|
||||
- Existing calls with no new flags continue to work.
|
||||
- Metadata explains which profile was applied.
|
||||
|
||||
### WP15.5: Doctor Reporting
|
||||
|
||||
Actions:
|
||||
|
||||
- Reuse `gpu.py` inventory parsing in `doctor.py`.
|
||||
- Keep the existing `gpu` and `pytorch` checks, but make GPU details more explicit.
|
||||
- Add a doctor detail line for auto-selected GPU and recommended profile.
|
||||
- Keep warning-only behavior for Pascal/pre-Turing GPUs.
|
||||
|
||||
Expected output:
|
||||
|
||||
- On a stronger PC, `pdf2md doctor` shows enough evidence to decide whether `auto` or `performance` is appropriate.
|
||||
- On the current GTX 1070 Ti, doctor still warns and recommends safe/conservative behavior.
|
||||
|
||||
### WP15.6: Documentation
|
||||
|
||||
Actions:
|
||||
|
||||
- Update README setup and conversion docs with `--mineru-profile`.
|
||||
- Update ARCHITECTURE to document that tuning uses strict-local environment variables only.
|
||||
- Update PRD CLI section if the new public flag is added.
|
||||
- Update `docs/V1IMPLEMENTATIONPLAN.md`, `PLAN.md`, and `PROGRESS.md`.
|
||||
- Archive implementation details in `docs/WORKARCHIVE.md` only after implementation and verification.
|
||||
|
||||
Expected output:
|
||||
|
||||
- Users can move the repo to a stronger NVIDIA GPU PC, run `pdf2md doctor`, and understand the selected profile.
|
||||
|
||||
## Tests
|
||||
|
||||
Default fast tests:
|
||||
|
||||
- GPU inventory parser handles one RTX GPU, multiple GPUs, no GPU lines, and malformed memory fields.
|
||||
- `select_gpu(..., "auto")` selects the largest VRAM GPU.
|
||||
- `select_gpu(..., "cuda:1")` selects index 1 and errors when absent.
|
||||
- `select_gpu(..., "1")` normalizes to index 1.
|
||||
- `auto` profile returns safe values for GTX 1070 Ti 8GB.
|
||||
- `auto` profile returns moderately aggressive values for an RTX GPU with 16GB or more.
|
||||
- `performance` profile returns performance values only for 16GB+ Turing-or-newer GPUs.
|
||||
- `performance` profile on GTX 1070 Ti downgrades to safe and returns a warning.
|
||||
- Adapter sets and restores `MINERU_DEVICE_MODE`, `CUDA_VISIBLE_DEVICES`, `MINERU_PROCESSING_WINDOW_SIZE`, `MINERU_API_MAX_CONCURRENT_REQUESTS`, and `MINERU_PDF_RENDER_THREADS`.
|
||||
- Strict-local validation rejects remote/API/backend-like option strings in profile-related fields.
|
||||
- CLI default passes `mineru_profile=auto`.
|
||||
- CLI accepts `--mineru-profile safe` and `--mineru-profile performance`.
|
||||
- CLI rejects invalid profile values.
|
||||
- Doctor report includes visible GPU details and recommended profile with mocked command outputs.
|
||||
- Existing conversion, chunking, metadata, report, and UI tests remain green.
|
||||
|
||||
Optional local validation on a stronger NVIDIA GPU PC:
|
||||
|
||||
```powershell
|
||||
uv run pdf2md doctor
|
||||
$env:MINERU_MODEL_SOURCE='local'
|
||||
uv run pdf2md convert samples\FourNodeQuadrilateralShellElementMITC4.pdf --out outputs\fournode-sprint15-auto --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local
|
||||
```
|
||||
|
||||
Expected optional validation:
|
||||
|
||||
- Doctor reports the stronger GPU name, VRAM, and recommended profile.
|
||||
- Conversion metadata records `mineru_profile` and selected GPU information.
|
||||
- Generated outputs stay ignored and uncommitted.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- `--mineru-profile auto` is the default conversion behavior.
|
||||
- `auto` uses safe settings on the current GTX 1070 Ti 8GB and stronger settings only on 16GB+ Turing-or-newer NVIDIA GPUs.
|
||||
- `--gpu auto` can choose the largest visible NVIDIA GPU without adding remote/runtime backend support.
|
||||
- MinerU command shape remains direct local CLI only.
|
||||
- Strict-local prohibitions remain enforced.
|
||||
- `pdf2md doctor` provides actionable GPU/profile information.
|
||||
- Metadata/report preserve the applied runtime profile.
|
||||
- Default tests remain fast, mocked, local, and independent of real MinerU/GPU/model files/network/samples.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
- Implementation adds runtime backend selection or exposes `--backend`.
|
||||
- Implementation passes `--api-url`, `--url`, router, HTTP client backend, or remote OpenAI-compatible backend values.
|
||||
- `auto` profile applies aggressive settings to GTX 1070 Ti 8GB or other pre-Turing/low-VRAM GPUs.
|
||||
- Existing `--gpu cuda:0` behavior breaks.
|
||||
- Profile tuning disables formula or table parsing.
|
||||
- Doctor or tests require real GPU, real MinerU execution, model files, network, Obsidian, MathJax, or `samples/`.
|
||||
- Sample PDFs, generated outputs, local model files, or `dist/pdf2md-ui.exe` are committed.
|
||||
|
||||
## Implementation Task Plan
|
||||
|
||||
### Task 1: GPU Inventory
|
||||
|
||||
Files:
|
||||
|
||||
- Create `src/pdf2md/gpu.py`
|
||||
- Create `tests/test_gpu.py`
|
||||
|
||||
Steps:
|
||||
|
||||
- [x] Add failing tests for parsing `nvidia-smi` CSV output.
|
||||
- [x] Add failing tests for `auto`, `cuda:N`, and numeric GPU selection.
|
||||
- [x] Implement immutable GPU records and parser helpers.
|
||||
- [x] Implement selection errors as `ValueError` with clear messages.
|
||||
- [x] Run `uv run pytest tests/test_gpu.py`.
|
||||
- [x] Commit GPU inventory boundary.
|
||||
|
||||
### Task 2: MinerU Profile Policy
|
||||
|
||||
Files:
|
||||
|
||||
- Create `src/pdf2md/mineru_profile.py`
|
||||
- Create `tests/test_mineru_profile.py`
|
||||
|
||||
Steps:
|
||||
|
||||
- [x] Add failing tests for safe, auto, and performance profile policy.
|
||||
- [x] Add tests proving 16GB+ Turing-or-newer GPUs get the moderately aggressive auto environment.
|
||||
- [x] Add tests proving GTX 1070 Ti 8GB stays safe.
|
||||
- [x] Implement the allowlisted environment mapping.
|
||||
- [x] Run `uv run pytest tests/test_mineru_profile.py tests/test_gpu.py`.
|
||||
- [x] Commit profile policy.
|
||||
|
||||
### Task 3: Adapter And Conversion Wiring
|
||||
|
||||
Files:
|
||||
|
||||
- Modify `src/pdf2md/mineru_adapter.py`
|
||||
- Modify `src/pdf2md/conversion.py`
|
||||
- Modify `tests/test_mineru_adapter.py`
|
||||
- Modify `tests/test_conversion.py`
|
||||
|
||||
Steps:
|
||||
|
||||
- [x] Add failing adapter tests for profile environment variables and environment restoration.
|
||||
- [x] Add failing conversion tests that metadata receives applied profile information.
|
||||
- [x] Extend `MinerUOptions` and conversion options minimally.
|
||||
- [x] Merge GPU and profile environment variables before the MinerU subprocess.
|
||||
- [x] Run `uv run pytest tests/test_mineru_adapter.py tests/test_conversion.py tests/test_mineru_profile.py tests/test_gpu.py`.
|
||||
- [x] Commit adapter/conversion wiring.
|
||||
|
||||
### Task 4: CLI And Doctor
|
||||
|
||||
Files:
|
||||
|
||||
- Modify `src/pdf2md/cli.py`
|
||||
- Modify `src/pdf2md/doctor.py`
|
||||
- Modify `tests/test_cli.py`
|
||||
- Modify `tests/test_doctor.py`
|
||||
|
||||
Steps:
|
||||
|
||||
- [x] Add failing CLI tests for default `auto`, explicit `safe`, explicit `performance`, invalid profile rejection, and `--gpu auto`.
|
||||
- [x] Add failing doctor tests for GPU inventory and recommended profile details.
|
||||
- [x] Implement CLI argument parsing and doctor report additions.
|
||||
- [x] Run `uv run pytest tests/test_cli.py tests/test_doctor.py tests/test_gpu.py tests/test_mineru_profile.py`.
|
||||
- [x] Commit CLI and doctor wiring.
|
||||
|
||||
### Task 5: UI And Documentation
|
||||
|
||||
Files:
|
||||
|
||||
- Modify `src/pdf2md_ui/runner.py` only if explicit UI profile passthrough is needed
|
||||
- Modify `src/pdf2md_ui/app.py` only if explicit UI profile control is needed
|
||||
- Modify `tests/test_ui_runner.py` only if runner command construction changes
|
||||
- Modify `README.md`
|
||||
- Modify `ARCHITECTURE.md`
|
||||
- Modify `PRD.md`
|
||||
- Modify `docs/V1IMPLEMENTATIONPLAN.md`
|
||||
- Modify `PLAN.md`
|
||||
- Modify `PROGRESS.md`
|
||||
- Modify `docs/WORKARCHIVE.md` after implementation
|
||||
|
||||
Steps:
|
||||
|
||||
- [x] Keep UI unchanged if default CLI `auto` profile is enough for the first implementation pass.
|
||||
- [x] If UI exposes a profile control, add tests for fixed argument-list construction with `shell=False`.
|
||||
- [x] Document `--mineru-profile`, `--gpu auto`, profile policy, strict-local boundaries, and stronger-PC validation command.
|
||||
- [x] Run focused docs/UI tests if changed.
|
||||
- [x] Run final verification commands.
|
||||
- [x] Commit documentation and final coordination updates.
|
||||
|
||||
## Verification Commands
|
||||
|
||||
```powershell
uv run pytest tests/test_gpu.py tests/test_mineru_profile.py tests/test_mineru_adapter.py tests/test_conversion.py tests/test_cli.py tests/test_doctor.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```
|
||||
|
||||
Optional stronger-PC validation is listed in the Tests section and must remain explicit opt-in.
|
||||
|
||||
## Handoff Requirements
|
||||
|
||||
After implementation:
|
||||
|
||||
- Update `PROGRESS.md` with files changed, commands run, test outcomes, optional stronger-PC validation outcome, known failures, residual risks, and next action.
|
||||
- Archive completed implementation details in `docs/WORKARCHIVE.md`.
|
||||
- Keep generated outputs, sample PDFs, local model files, and UI build artifacts out of the commit.
|
||||
- Record the detected GPU, applied profile, and whether `samples\FourNodeQuadrilateralShellElementMITC4.pdf` completed on the stronger PC.
|
||||
|
||||
Implementation handoff:
|
||||
|
||||
- Files changed: `src/pdf2md/gpu.py`, `src/pdf2md/mineru_profile.py`, `src/pdf2md/mineru_adapter.py`, `src/pdf2md/conversion.py`, `src/pdf2md/cli.py`, `src/pdf2md/doctor.py`, docs, and focused tests.
|
||||
- Commands run: `uv run pytest tests/test_gpu.py tests/test_mineru_profile.py tests/test_mineru_adapter.py tests/test_conversion.py tests/test_cli.py tests/test_doctor.py`; `uv run pytest`; `uv run pdf2md doctor`.
|
||||
- Tests passed: targeted Sprint 15 suite passed 101 tests; full default suite passed 225 tests with 1 optional skip; local doctor returned WARN with expected GTX 1070 Ti safe-profile recommendation.
|
||||
- Known failures: optional stronger-PC real MinerU conversion validation was not run in this workspace.
|
||||
- Residual risks: GTX 1070 Ti 8GB remains likely to stall on hard pages; stronger-PC behavior still needs local runtime validation.
|
||||
- Next action: on a stronger NVIDIA GPU PC, run `pdf2md doctor` and an explicit local conversion with `--gpu auto --mineru-profile auto`.
|
||||
|
||||
## Future Sprint Boundary
|
||||
|
||||
A later sprint may add page-level timeout handling, resumable page caches, or a performance mode that can run multiple page conversions concurrently on GPUs with enough VRAM. Those behaviors are intentionally out of Sprint 15 scope.
|
||||
@@ -0,0 +1,412 @@
|
||||
# Sprint 16 Contract: Simplified Output Layout
|
||||
|
||||
Status: Implemented
|
||||
Last updated: 2026-05-12
|
||||
|
||||
## Objective
|
||||
|
||||
Simplify conversion outputs so each input PDF gets one predictable output folder named after the PDF stem, all images live under one `images` folder, Markdown parts use `_001`, `_002` numbering, one human-readable report is written per PDF, and no metadata JSON file is persisted.
|
||||
|
||||
This sprint changes the public output contract. It supersedes the older v1 output layout that wrote sibling `<stem>.md`, `<stem>.assets`, `<stem>.metadata.json`, and `<stem>.report.md` files.
|
||||
|
||||
## Product Output Contract
|
||||
|
||||
For an input PDF:
|
||||
|
||||
```text
paper.pdf
```

and output root:

```text
outputs/
```

write:

```text
outputs/
  paper/
    paper_001.md
    paper_002.md
    paper_report.md
    images/
      ...
```
|
||||
|
||||
Rules:
|
||||
|
||||
- `paper` is the PDF stem, meaning the original filename without `.pdf`.
|
||||
- A one-part conversion still writes `paper_001.md`.
|
||||
- A multi-part conversion writes `paper_001.md`, `paper_002.md`, and so on.
|
||||
- Part numbering uses at least three digits and grows only when the part count exceeds 999.
|
||||
- All generated image and media assets for the PDF live under `paper/images/`.
|
||||
- Markdown links must point to `images/<asset-name>`.
|
||||
- The report is a single file at `paper/paper_report.md`.
|
||||
- No `<stem>.metadata.json`, part metadata JSON, or sidecar metadata JSON is written.
|
||||
- Internal metadata records may still be built in memory to produce reports, warnings, counts, and `ConversionResult` fields.
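
The naming rules above can be sketched as a small path planner. `DocumentLayout` and `plan_document_layout` are hypothetical names used for illustration and are not the project's `plan_outputs()` API. The sketch assumes that once the part count exceeds 999, every part name widens to the same number of digits.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DocumentLayout:
    """Hypothetical per-PDF output plan for the simplified layout."""
    folder: Path
    markdown_parts: tuple[Path, ...]
    images_dir: Path
    report_path: Path


def plan_document_layout(pdf_path: Path, out_root: Path, part_count: int = 1) -> DocumentLayout:
    """Plan `<out>/<stem>/<stem>_NNN.md`, `images/`, and `<stem>_report.md`."""
    if part_count < 1:
        raise ValueError("part_count must be at least 1")
    stem = pdf_path.stem  # original filename without ".pdf"
    folder = out_root / stem
    width = max(3, len(str(part_count)))  # at least three digits
    parts = tuple(
        folder / f"{stem}_{index:0{width}d}.md" for index in range(1, part_count + 1)
    )
    return DocumentLayout(
        folder=folder,
        markdown_parts=parts,
        images_dir=folder / "images",
        report_path=folder / f"{stem}_report.md",
    )
```

Note that no metadata path appears in the plan, matching the rule that no metadata JSON is written.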
|
||||
|
||||
## Contract Assumptions
|
||||
|
||||
- The user request "metadata is not needed" means metadata JSON should not be written as a user-facing output file. It does not mean removing internal metadata objects needed for report generation and warning aggregation.
|
||||
- Keep `--chunk-pages` semantics from Sprint 14: when enabled, MinerU receives one source page per run and final Markdown files are grouped by `chunk_pages`.
|
||||
- If `--chunk-pages` is absent, the whole PDF is still converted in one MinerU run and written as `<stem>_001.md`.
|
||||
- Keep `--chunk-pages` without a value as the default grouping size of 20.
|
||||
- Keep `--metadata` accepted as a backward-compatible no-op for one sprint, but update help text to say metadata JSON output is disabled in the simplified layout.
|
||||
- `pdf2md recheck` remains supported only for legacy outputs that still have adjacent metadata JSON. New simplified outputs should fail recheck clearly until a later sprint designs metadata-free recheck.
|
||||
- Recursive directory conversion should preserve the discovered relative parent before the PDF stem folder: `outputs/<relative-parent>/<stem>/<stem>_001.md`.
|
||||
- If two inputs would map to the same output folder and overwrite is false, fail during preflight. Do not invent automatic suffixes.
|
||||
- `--keep-raw` should place raw MinerU diagnostics under `paper/raw/` so raw outputs do not clutter the main folder.
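
The two flag assumptions above (`--chunk-pages` defaulting to 20 when given without a value, and `--metadata` kept as a no-op) map naturally onto `argparse`. The sketch below is a guess at one possible shape, not the parser in `src/pdf2md/cli.py`; `build_parser` and the program name are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a minimal parser showing only the two flags discussed above."""
    parser = argparse.ArgumentParser(prog="pdf2md-sketch")
    parser.add_argument(
        "--chunk-pages",
        nargs="?",      # the value is optional
        type=int,
        const=20,       # used when the flag appears without a value
        default=None,   # used when the flag is absent: whole-PDF conversion
        help="group Markdown output by N source pages (20 when N is omitted)",
    )
    parser.add_argument(
        "--metadata",
        action="store_true",
        help="accepted for compatibility; metadata JSON output is disabled in the simplified layout",
    )
    return parser
```

With `nargs="?"`, a bare `--chunk-pages` should be followed by another option or end the command line; otherwise `argparse` tries to read the next positional argument as the page count.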
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed during implementation:
|
||||
|
||||
- Modify `src/pdf2md/paths.py`.
|
||||
- Modify `src/pdf2md/pdf_splitter.py` only if part naming needs helper support.
|
||||
- Modify `src/pdf2md/conversion.py`.
|
||||
- Modify `src/pdf2md/report.py` or add a focused aggregate report helper if one report needs multiple part summaries.
|
||||
- Modify `src/pdf2md/cli.py`.
|
||||
- Modify `src/pdf2md_ui/runner.py` and `src/pdf2md_ui/app.py` only if UI text or expected output descriptions mention metadata/report paths.
|
||||
- Modify `tests/test_paths.py`.
|
||||
- Modify `tests/test_conversion.py`.
|
||||
- Modify `tests/test_cli.py`.
|
||||
- Modify `tests/test_report.py`.
|
||||
- Modify `tests/test_ui_runner.py` only if UI command/output assumptions change.
|
||||
- Modify `tests/integration/test_v1_fast_release_gate.py`.
|
||||
- Modify `tests/integration/test_optional_mineru_fixtures.py`.
|
||||
- Modify `README.md`.
|
||||
- Modify `PRD.md`.
|
||||
- Modify `ARCHITECTURE.md`.
|
||||
- Modify `docs/V1IMPLEMENTATIONPLAN.md`.
|
||||
- Modify `PLAN.md`.
|
||||
- Modify `PROGRESS.md`.
|
||||
- Modify `docs/WORKARCHIVE.md` after implementation.
|
||||
|
||||
Not allowed:
|
||||
|
||||
- Do not change the fixed engine, MinerU 3.1.0.
|
||||
- Do not add another conversion engine.
|
||||
- Do not add remote/API/backend paths.
|
||||
- Do not change `--gpu`, `--mineru-profile`, or strict-local behavior except where report text reflects the new layout.
|
||||
- Do not make default tests depend on real MinerU, GPU, CUDA, model files, network, Obsidian, MathJax, or `samples/`.
|
||||
- Do not commit generated `outputs/`, sample PDFs, local model files, or `dist/pdf2md-ui.exe`.
|
||||
|
||||
## Architecture Plan
|
||||
|
||||
### WP16.1: Document-Level Output Layout
|
||||
|
||||
Add or reshape path planning so final outputs are planned per source PDF folder instead of as sibling files.
|
||||
|
||||
Expected final paths for a single PDF:
|
||||
|
||||
```text
<out>/<stem>/<stem>_001.md
<out>/<stem>/images/
<out>/<stem>/<stem>_report.md
```
|
||||
|
||||
Expected final paths for recursive input:
|
||||
|
||||
```text
<out>/<relative-parent>/<stem>/<stem>_001.md
<out>/<relative-parent>/<stem>/images/
<out>/<relative-parent>/<stem>/<stem>_report.md
```
|
||||
|
||||
Implementation guidance:
|
||||
|
||||
- Keep `DiscoveredPdf.relative_parent` behavior.
|
||||
- Add a focused part-planning helper rather than encoding final output names through fake temporary PDF filenames.
|
||||
- Keep `PlannedOutput` if the existing conversion code can use it cleanly, but allow multiple Markdown parts to share the same `assets_dir` and `report_path`.
|
||||
- Duplicate-path detection must reject duplicate Markdown files and raw directories, but it must allow shared `images/` and shared report paths for parts belonging to the same source PDF.
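
The duplicate-path rule in the last bullet can be sketched as below. `PlannedPart` is a hypothetical stand-in, not the project's `PlannedOutput`, and raw directories are left out for brevity. The idea is that a Markdown path may appear only once, while an `images/` or report path may repeat as long as every part using it belongs to the same source PDF.

```python
from collections import Counter
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PlannedPart:
    """Hypothetical record for one planned Markdown part."""
    source_pdf: Path
    markdown_path: Path
    images_dir: Path
    report_path: Path


def find_conflicts(parts: list[PlannedPart]) -> list[Path]:
    """Return output paths that would collide, sorted for stable error messages."""
    conflicts: set[Path] = set()
    # Markdown files must be unique across all planned parts.
    markdown_counts = Counter(part.markdown_path for part in parts)
    conflicts.update(path for path, count in markdown_counts.items() if count > 1)
    # Shared paths are fine only when a single source PDF owns them.
    owners: dict[Path, Path] = {}
    for part in parts:
        for shared in (part.images_dir, part.report_path):
            owner = owners.setdefault(shared, part.source_pdf)
            if owner != part.source_pdf:
                conflicts.add(shared)
    return sorted(conflicts)
```

A preflight step could raise an error when the returned list is non-empty and overwrite is false.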
|
||||
|
||||
### WP16.2: Markdown Part Numbering
|
||||
|
||||
Replace public part names:
|
||||
|
||||
```text
paper.part-001.pages-001-020.md
paper.part-002.pages-021-040.md
```
|
||||
|
||||
with:
|
||||
|
||||
```text
paper_001.md
paper_002.md
```
|
||||
|
||||
Rules:
|
||||
|
||||
- Part index is based on final output group order, not source page number.
|
||||
- The report must still record source page ranges for each part.
|
||||
- Failed groups should not create a Markdown file, but the report must mention the failed part and source page range.
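
These numbering rules can be sketched as follows. `PartGroup`, `group_pages`, and `markdown_names` are hypothetical helpers. The sketch assumes a failed group keeps its index, so later parts are not renumbered; the contract does not state this explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartGroup:
    """One final output group with its inclusive 1-based source page range."""
    index: int
    first_page: int
    last_page: int


def group_pages(page_count: int, chunk_pages: int) -> list[PartGroup]:
    """Split source pages 1..page_count into consecutive groups of chunk_pages."""
    if page_count < 1 or chunk_pages < 1:
        raise ValueError("page_count and chunk_pages must be positive")
    groups = []
    for index, first in enumerate(range(1, page_count + 1, chunk_pages), start=1):
        last = min(first + chunk_pages - 1, page_count)
        groups.append(PartGroup(index, first, last))
    return groups


def markdown_names(stem: str, groups: list[PartGroup], failed: set[int]) -> list[str]:
    """Name the Markdown files that get written; failed groups produce no file."""
    width = max(3, len(str(len(groups))))
    return [f"{stem}_{g.index:0{width}d}.md" for g in groups if g.index not in failed]
```

The page ranges stay on each `PartGroup`, so a report can still list the range of a failed part even though no Markdown file exists for it.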
|
||||
|
||||
### WP16.3: Shared Images Folder
|
||||
|
||||
Replace per-output asset directories:
|
||||
|
||||
```text
paper.part-001.pages-001-020.assets/
paper.part-002.pages-021-040.assets/
```
|
||||
|
||||
with:
|
||||
|
||||
```text
paper/images/
```
|
||||
|
||||
Implementation guidance:
|
||||
|
||||
- Copy all assets for one source PDF into the shared `images/` folder.
|
||||
- Rewrite Markdown links to `images/<asset-name>`.
|
||||
- Use deterministic collision-safe filenames. Recommended pattern:
  - page-known assets: `page-001_<original-name>`, with `-002` suffixes when needed.
  - page-unknown assets: `asset-001<suffix>`, preserving the original suffix when available.
|
||||
- Keep asset-link validation pointed at the shared `images/` directory.
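
A possible reading of the recommended filename pattern is sketched below. `shared_asset_name` and `markdown_link` are hypothetical helpers. The sketch assumes the `-002` collision suffix goes before the file extension, which the contract leaves open.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def shared_asset_name(original_name: str, page: int | None, taken: set[str]) -> str:
    """Pick a unique name inside the shared images folder and record it in `taken`."""
    original = PurePosixPath(original_name)
    suffix = original.suffix
    if page is not None:
        base = f"page-{page:03d}_{original.stem}"
        candidate = f"{base}{suffix}"
        counter = 2
        while candidate in taken:
            candidate = f"{base}-{counter:03d}{suffix}"
            counter += 1
    else:
        counter = 1
        candidate = f"asset-{counter:03d}{suffix}"
        while candidate in taken:
            counter += 1
            candidate = f"asset-{counter:03d}{suffix}"
    taken.add(candidate)
    return candidate


def markdown_link(asset_name: str) -> str:
    """Return the relative link target used inside a Markdown part."""
    return f"images/{asset_name}"
```

The names are deterministic only if assets are processed in a stable order, for example sorted by source page and then by original name.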
|
||||
|
||||
### WP16.4: One Report, No Metadata JSON
|
||||
|
||||
Stop writing metadata JSON as a user-facing output file.
|
||||
|
||||
Implementation guidance:
|
||||
|
||||
- Continue building internal metadata dictionaries or records for each part so report generation and `ConversionResult` summaries stay traceable.
|
||||
- Add an aggregate report path at `<stem>/<stem>_report.md`.
|
||||
- The report must include:
  - source PDF path,
  - output folder path,
  - Markdown part list with page ranges,
  - engine and engine options,
  - final status,
  - warning count,
  - asset count,
  - missing/invalid asset link counts,
  - inline/display formula counts,
  - MathJax render error count,
  - text fidelity summary when available,
  - failed source pages or failed parts when any exist,
  - warnings grouped by page or part.
|
||||
- `ConversionResult.metadata_path` should be `None` for simplified outputs.
|
||||
- `ConversionResult.report_path` should point to the shared report path.
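
A reduced sketch of aggregate report rendering from in-memory records follows. It covers only a few of the required fields. `PartRecord`, `render_report`, the status labels, and the report wording are all hypothetical, not the format produced by `src/pdf2md/report.py`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PartRecord:
    """Hypothetical in-memory record for one output part."""
    index: int
    first_page: int
    last_page: int
    markdown_name: str | None = None  # None when the part failed
    asset_count: int = 0
    warnings: list[str] = field(default_factory=list)


def render_report(source_pdf: str, output_folder: str, parts: list[PartRecord]) -> str:
    """Render one Markdown report that aggregates every part of a PDF."""
    failed = [part for part in parts if part.markdown_name is None]
    if not failed:
        status = "ok"
    elif len(failed) == len(parts):
        status = "failed"
    else:
        status = "partial"
    lines = [
        "# Conversion Report",
        "",
        f"- Source PDF: `{source_pdf}`",
        f"- Output folder: `{output_folder}`",
        f"- Final status: {status}",
        f"- Warning count: {sum(len(part.warnings) for part in parts)}",
        f"- Asset count: {sum(part.asset_count for part in parts)}",
        "",
        "## Parts",
        "",
    ]
    for part in parts:
        label = part.markdown_name or "(failed, no Markdown written)"
        lines.append(f"- Part {part.index:03d}: pages {part.first_page}-{part.last_page}: {label}")
        lines.extend(f"  - warning: {warning}" for warning in part.warnings)
    return "\n".join(lines) + "\n"
```

Because the records live only in memory, nothing here writes a metadata JSON file; the caller writes the returned text once to `<stem>/<stem>_report.md`.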
|
||||
|
||||
### WP16.5: CLI, UI, And Documentation
|
||||
|
||||
Update user-facing docs and tests to remove metadata JSON as an expected output.
|
||||
|
||||
Implementation guidance:
|
||||
|
||||
- `pdf2md convert` summary may keep printing Markdown paths and warning counts.
|
||||
- Update CLI help for `--metadata` to say metadata JSON output is disabled or deprecated in the simplified layout.
|
||||
- Update README examples to show the new folder layout.
|
||||
- Update PRD and ARCHITECTURE so they no longer claim metadata JSON is required as a public artifact.
|
||||
- Keep internal provenance wording clear: warnings and report are still derived from internal metadata-like records.
|
||||
- Update optional fixture documentation so generated metadata JSON is not required for sample validation.
|
||||
|
||||
## Implementation Task Plan
|
||||
|
||||
### Task 1: Path Planning For Simplified Layout
|
||||
|
||||
Files:
|
||||
|
||||
- Modify `src/pdf2md/paths.py`.
|
||||
- Modify `tests/test_paths.py`.
|
||||
|
||||
Steps:
|
||||
|
||||
- [ ] Add failing tests showing `plan_outputs()` maps `paper.pdf` to `out/paper/paper_001.md`, `out/paper/images`, no metadata path, and `out/paper/paper_report.md`.
|
||||
- [ ] Add a failing test for Korean filenames, using the PDF stem exactly as the output folder and file prefix.
|
||||
- [ ] Add a failing test for recursive input preserving `relative_parent`.
|
||||
- [ ] Add a failing test that duplicate source stems in the same relative parent conflict before conversion.
|
||||
- [ ] Implement the minimal path planning changes.
|
||||
- [ ] Run `uv run pytest tests/test_paths.py`.
|
||||
- [ ] Commit path planning changes.
|
||||
|
||||
### Task 2: Single-Output Conversion Writes Simplified Files
|
||||
|
||||
Files:
|
||||
|
||||
- Modify `src/pdf2md/conversion.py`.
|
||||
- Modify `tests/test_conversion.py`.
|
||||
- Modify `tests/test_cli.py`.
|
||||
|
||||
Steps:
|
||||
|
||||
- [ ] Add failing conversion tests showing a non-chunked fake-adapter conversion writes `out/paper/paper_001.md`, `out/paper/images`, and `out/paper/paper_report.md`.
|
||||
- [ ] Add failing assertions that no `.metadata.json` file is written and `result.metadata_path is None`.
|
||||
- [ ] Add failing CLI test showing `pdf2md convert paper.pdf --out out` creates the simplified folder.
|
||||
- [ ] Implement the minimal conversion changes for non-chunked output.
|
||||
- [ ] Run `uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_paths.py`.
|
||||
- [ ] Commit single-output conversion changes.
|
||||
|
||||
### Task 3: Grouped Output Parts And Shared Images
|
||||
|
||||
Files:
|
||||
|
||||
- Modify `src/pdf2md/conversion.py`.
|
||||
- Modify `src/pdf2md/pdf_splitter.py` only if a small helper is needed.
|
||||
- Modify `tests/test_conversion.py`.
|
||||
- Modify `tests/test_cli.py`.
|
||||
|
||||
Steps:
|
||||
|
||||
- [ ] Add failing tests for `chunk_pages=20` showing final Markdown names are `paper_001.md`, `paper_002.md`, not `paper.part-...md`.
|
||||
- [ ] Add failing tests proving all grouped assets are copied into `paper/images/` and Markdown links use `images/...`.
|
||||
- [ ] Add failing tests proving asset collisions across pages get deterministic unique filenames.
|
||||
- [ ] Add failing tests proving failed page conversions are represented in the shared report while later pages still convert.
|
||||
- [ ] Implement grouped output naming and shared image handling.
|
||||
- [ ] Run `uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_pdf_splitter.py`.
|
||||
- [ ] Commit grouped output changes.
|
||||
|
||||
### Task 4: Aggregate Report Without Metadata JSON
|
||||
|
||||
Files:
|
||||
|
||||
- Modify `src/pdf2md/report.py` or add a focused aggregate report helper.
|
||||
- Modify `src/pdf2md/conversion.py`.
|
||||
- Modify `tests/test_report.py`.
|
||||
- Modify `tests/test_conversion.py`.
|
||||
|
||||
Steps:
|
||||
|
||||
- [ ] Add failing report tests for a one-file report listing multiple Markdown parts and source page ranges.
|
||||
- [ ] Add failing conversion tests proving only one report exists for a chunked PDF.
|
||||
- [ ] Add failing tests proving report summary totals combine all output parts.
|
||||
- [ ] Add failing tests proving all-failed conversions write a report but no Markdown part.
|
||||
- [ ] Implement aggregate report rendering from internal metadata records.
|
||||
- [ ] Run `uv run pytest tests/test_report.py tests/test_conversion.py`.
|
||||
- [ ] Commit report changes.
|
||||
|
||||
### Task 5: Recheck, CLI Compatibility, UI Text, And Docs
|
||||
|
||||
Files:
|
||||
|
||||
- Modify `src/pdf2md/cli.py`.
|
||||
- Modify `src/pdf2md/conversion.py`.
|
||||
- Modify `src/pdf2md_ui/runner.py` and `src/pdf2md_ui/app.py` only if text/output assumptions change.
|
||||
- Modify `README.md`.
|
||||
- Modify `PRD.md`.
|
||||
- Modify `ARCHITECTURE.md`.
|
||||
- Modify `docs/V1IMPLEMENTATIONPLAN.md`.
|
||||
- Modify `tests/test_cli.py`.
|
||||
- Modify `tests/test_ui_runner.py` only if UI behavior changes.
|
||||
- Modify `tests/integration/test_v1_fast_release_gate.py`.
|
||||
- Modify `tests/integration/test_optional_mineru_fixtures.py`.
|
||||
|
||||
Steps:
|
||||
|
||||
- [ ] Add failing CLI tests proving `--metadata` remains accepted but no metadata JSON is written.
|
||||
- [ ] Add failing recheck test proving simplified outputs without metadata fail with a clear legacy-metadata message.
|
||||
- [ ] Update integration tests to require Markdown part files, one report, and image links, not metadata JSON.
|
||||
- [ ] Update README, PRD, ARCHITECTURE, and release-gate wording for the simplified layout.
|
||||
- [ ] Implement CLI/recheck/doc changes.
|
||||
- [ ] Run `uv run pytest tests/test_cli.py tests/test_ui_runner.py tests/integration/test_v1_fast_release_gate.py`.
|
||||
- [ ] Commit CLI, UI, integration, and documentation changes.
|
||||
|
||||
### Task 6: Final Verification And Handoff
|
||||
|
||||
Files:
|
||||
|
||||
- Modify `PLAN.md`.
|
||||
- Modify `PROGRESS.md`.
|
||||
- Modify `docs/WORKARCHIVE.md` after implementation.
|
||||
- Modify `docs/Sprints/SPRINT16CONTRACT.md` status and handoff fields.
|
||||
|
||||
Steps:
|
||||
|
||||
- [ ] Run focused Sprint 16 verification:

```powershell
uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
```

- [ ] Run full default verification:

```powershell
uv run pytest
```

- [ ] Run diff check:

```powershell
git diff --check
```
|
||||
|
||||
- [ ] Update `PROGRESS.md` with files changed, checks run, residual risks, and next actions.
|
||||
- [ ] Archive completed implementation evidence in `docs/WORKARCHIVE.md`.
|
||||
- [ ] Commit final coordination updates.
|
||||
|
||||
## Verification Commands
|
||||
|
||||
```powershell
uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```
|
||||
|
||||
Optional local fixture validation after implementation:
|
||||
|
||||
```powershell
$env:MINERU_MODEL_SOURCE='local'
uv run pdf2md convert samples\SolidElement.pdf --out outputs\SolidElement_sprint16_layout --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local
```
|
||||
|
||||
Expected optional validation:
|
||||
|
||||
- Output folder is `outputs\SolidElement_sprint16_layout\SolidElement\` for the command above, that is, the explicitly provided output root plus `SolidElement\`.
|
||||
- Markdown part is `SolidElement_001.md` for the 6-page sample.
|
||||
- Report is `SolidElement_report.md`.
|
||||
- Images are under `images\`.
|
||||
- No metadata JSON exists.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- Each input PDF writes into an output folder named after the PDF stem.
|
||||
- Markdown outputs are named `<stem>_001.md`, `<stem>_002.md`, and so on.
|
||||
- All image/media assets for one PDF live under `<stem>/images/`.
|
||||
- Markdown links point to `images/...`.
|
||||
- Exactly one report file is written per input PDF at `<stem>/<stem>_report.md`.
|
||||
- No metadata JSON file is written for new conversions.
|
||||
- Internal warning, provenance, formula count, asset count, and text fidelity information remains available in the report.
|
||||
- Chunk mode still converts one source page per MinerU run and groups Markdown by `chunk_pages`.
|
||||
- Strict-local and MinerU-only constraints remain unchanged.
|
||||
- Default tests stay fast and local.
|
||||
|
||||
## Hard Failure Criteria
|
||||
|
||||
- Any new conversion writes `.metadata.json` as a public output.
|
||||
- Output files keep old `part-001.pages-...` names.
|
||||
- Assets are split into per-part `.assets` folders.
|
||||
- More than one report is written for one input PDF.
|
||||
- Markdown links point outside the PDF output folder.
|
||||
- Chunk mode stops using one source page per MinerU run.
|
||||
- Strict-local enforcement is weakened.
|
||||
- Default tests require real MinerU, GPU, model files, network, Obsidian, MathJax, or `samples/`.
|
||||
- Sample PDFs, generated outputs, local model files, or `dist/pdf2md-ui.exe` are committed.
|
||||
|
||||
## Open Questions
|
||||
|
||||
- Should metadata-free `pdf2md recheck` be restored in a later sprint by deriving enough state from the report and Markdown, or is rerunning conversion acceptable for simplified outputs?
|
||||
- Should raw MinerU outputs under `--keep-raw` be flattened into `raw/` or kept per part under `raw/<stem>_001/`? This contract recommends per-part raw folders to avoid collisions.
|
||||
|
||||
## Handoff Requirements
|
||||
|
||||
After implementation:
|
||||
|
||||
- Update this contract status to `Implemented`.
|
||||
- Record final file layout examples in `README.md`.
|
||||
- Record verification commands and outcomes in `PROGRESS.md`.
|
||||
- Archive implementation and optional sample validation results in `docs/WORKARCHIVE.md`.
|
||||
- Keep generated outputs and sample PDFs uncommitted.
|
||||
|
||||
## Implementation Handoff
|
||||
|
||||
- Files changed: `src/pdf2md/paths.py`, `src/pdf2md/conversion.py`, `src/pdf2md/report.py`, `src/pdf2md/cli.py`, `src/pdf2md_ui/runner.py`, focused tests, and current docs.
|
||||
- Output layout implemented: `<out>/<stem>/<stem>_001.md`, additional numbered parts when grouped, `<out>/<stem>/images/`, and `<out>/<stem>/<stem>_report.md`.
|
||||
- Metadata JSON behavior: new conversions do not write public `.metadata.json`; `ConversionResult.metadata_path` is `None`; internal metadata-like records still feed reports and tests.
|
||||
- Recheck behavior: `pdf2md recheck` remains legacy-only and requires adjacent metadata JSON.
|
||||
- Verification recorded in `PROGRESS.md`: focused Sprint 16 tests passed, full `uv run pytest` passed 227 tests with 1 optional skip, and `git diff --check` passed with line-ending warnings only.
|
||||
@@ -0,0 +1,440 @@
|
||||
# Sprint 17 Contract: Offline Windows Installer
|
||||
|
||||
Status: Abandoned
|
||||
Last updated: 2026-05-13
|
||||
|
||||
## Abandonment Note
|
||||
|
||||
Sprint 17 was abandoned at the user's request on 2026-05-13 before implementation began. This document remains as a historical planning record only. Do not implement or extend this contract unless the user explicitly reopens offline installer work.
|
||||
|
||||
## Objective
|
||||
|
||||
Create a large offline Windows installer that can install the existing local `pdf2md` runtime on another Windows PC without internet access.
|
||||
|
||||
The installer must install or stage all application-owned files needed after download time: the minimal UI executable, the project runtime, a target-local Python virtual environment created from bundled wheels, CUDA PyTorch wheels, MinerU 3.1.0 wheels and dependencies, local MinerU model files, optional local Node.js/MathJax assets, Start Menu shortcuts, setup logs, and a post-install `pdf2md doctor` verification path.
|
||||
|
||||
This sprint does not change conversion behavior. It packages the already implemented CLI/UI/runtime for offline use.
|
||||
|
||||
## Product Decision
|
||||
|
||||
The offline package should create the target PC virtual environment during installation instead of copying the current development `.venv`.
|
||||
|
||||
Reasoning:
|
||||
|
||||
- Python virtual environments and console entry points often contain absolute paths and are not a reliable redistribution unit.
|
||||
- A target-local `.venv` created from a bundled wheelhouse is more reproducible and easier to repair.
|
||||
- The installer can keep the wheelhouse for offline repair, uninstall/reinstall, and audit.
|
||||
|
||||
## Installer Shape
|
||||
|
||||
Recommended installer technology:
|
||||
|
||||
- Inno Setup for the Windows installer shell because it can compile scripts from the command line with `ISCC.exe`, returns deterministic exit codes, and is simple enough for a per-user installer.
|
||||
- PowerShell scripts for payload build, target runtime install, and target verification.
|
||||
- PyInstaller remains only the UI executable builder. It must not become the full MinerU/PyTorch/model bundler.
|
||||
|
||||
Default install root:
|
||||
|
||||
```text
%LOCALAPPDATA%\Programs\ConvertPDFToMD\
```
|
||||
|
||||
Installed layout:
|
||||
|
||||
```text
ConvertPDFToMD/
  app/
    pdf2md-ui.exe
  runtime/
    pyproject.toml
    uv.lock
    README.md
    src/
    tools/
    package.json
    package-lock.json
    .venv/
  payload/
    python/
    uv/
    wheelhouse/
    requirements-runtime-cu126.txt
    models/
    node/
    node_modules/
    payload-manifest.json
    SHA256SUMS.txt
    THIRD_PARTY_NOTICES.md
  scripts/
    install-runtime.ps1
    repair-runtime.ps1
    run-doctor.ps1
  logs/
```
|
||||
|
||||
Generated artifacts that must remain untracked:
|
||||
|
||||
```text
dist/offline-installer/
dist/Pdf2MdOfflineSetup-*.exe
```
|
||||
|
||||
## Payload Contents
|
||||
|
||||
The first offline payload targets Windows x64, Python 3.12, CUDA PyTorch `2.6.0+cu126`, `torchvision 0.21.0+cu126`, and `mineru[core]==3.1.0`.
|
||||
|
||||
Required:
|
||||
|
||||
- `dist/pdf2md-ui.exe` from the existing PyInstaller build.
|
||||
- Tracked project runtime files needed to run `uv run pdf2md`.
|
||||
- A Windows x64 Python 3.12 installer or an equivalent approved Python runtime package.
|
||||
- A Windows x64 `uv.exe`.
|
||||
- A wheelhouse containing:
  - the current project wheel,
  - `pypdf`,
  - `torch==2.6.0`,
  - `torchvision==0.21.0`,
  - `mineru[core]==3.1.0`,
  - all transitive Python runtime dependencies.
|
||||
- Local MinerU model files and the model config template needed for `MINERU_MODEL_SOURCE=local`.
|
||||
- A manifest listing every payload file, size, SHA-256 hash, source URL or local source, and license family.
|
||||
|
||||
Optional but recommended:
|
||||
|
||||
- Portable local Node.js runtime.
|
||||
- `node_modules/` containing the locked MathJax checker dependencies from `package-lock.json`.
|
||||
|
||||
Explicitly excluded:
|
||||
|
||||
- `samples/`.
|
||||
- `outputs/`.
|
||||
- `.git/`.
|
||||
- The development `.venv/`.
|
||||
- Local generated PyInstaller `build/` folders and `.spec` files unless the implementation deliberately adds a stable project-owned spec file.
|
||||
- NVIDIA GPU drivers and CUDA Toolkit installers. The installer may check for a compatible NVIDIA driver through `nvidia-smi`, but it should not redistribute GPU drivers in this sprint.
|
||||
|
||||
## Touched Surfaces
|
||||
|
||||
Allowed during implementation:
|
||||
|
||||
- Create `packaging/offline/build-offline-payload.ps1`.
|
||||
- Create `packaging/offline/verify-offline-payload.ps1`.
|
||||
- Create `packaging/offline/install-runtime.ps1`.
|
||||
- Create `packaging/offline/repair-runtime.ps1`.
|
||||
- Create `packaging/offline/run-doctor.ps1`.
|
||||
- Create `packaging/offline/Pdf2MdOffline.iss`.
|
||||
- Create `packaging/offline/requirements-runtime-cu126.txt`.
|
||||
- Create `packaging/offline/README.md`.
|
||||
- Create `packaging/offline/THIRD_PARTY_NOTICES.md`.
|
||||
- Create `src/pdf2md/packaging_manifest.py` only if a Python helper is simpler than repeating manifest logic in PowerShell.
|
||||
- Modify `src/pdf2md_ui/runner.py` so the UI can resolve an installed target-local `.venv\Scripts\pdf2md.exe` before falling back to PATH or `uv run pdf2md`.
|
||||
- Modify `src/pdf2md_ui/app.py` only if the project root default must prefer the installed runtime folder.
|
||||
- Modify `tests/test_ui_runner.py`.
|
||||
- Create `tests/test_offline_packaging.py`.
|
||||
- Modify `README.md`.
|
||||
- Modify `docs/V1RELEASECHECKLIST.md`.
|
||||
- Modify `PLAN.md`.
|
||||
- Modify `PROGRESS.md`.
|
||||
- Modify `docs/WORKARCHIVE.md` after implementation.
|
||||
|
||||
Not allowed:
|
||||
|
||||
- Do not change the fixed conversion engine, MinerU 3.1.0.
|
||||
- Do not add a second conversion engine.
|
||||
- Do not add runtime network calls, `--api-url`, router mode, remote APIs, HTTP client backends, remote OpenAI-compatible backends, or hosted renderers.
|
||||
- Do not copy the development `.venv` as the installed runtime.
|
||||
- Do not make default tests depend on real MinerU, GPU, model files, network, Obsidian, MathJax, Inno Setup, or `samples/`.
|
||||
- Do not commit generated installer payloads, model files, wheelhouse files, Python installers, `dist/`, `outputs/`, or `samples/`.
|
||||
|
||||
## Architecture Plan
|
||||
|
||||
### WP17.1: Offline Payload Builder
|
||||
|
||||
Add a build script that creates a clean staging folder under `dist/offline-installer/` with `app/`, `runtime/`, and `payload/` subfolders that mirror the final install layout.
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- Rebuild `dist/pdf2md-ui.exe`.
|
||||
- Build the project wheel into the staging wheelhouse.
|
||||
- Download or collect Python wheels for the target runtime on a connected build PC.
|
||||
- Collect the Windows Python runtime package and `uv.exe`.
|
||||
- Copy project runtime files, excluding `.git`, `.venv`, `outputs/`, `samples/`, and temporary build artifacts.
|
||||
- Copy local MinerU model files from a configured source path.
|
||||
- Optionally copy portable Node.js and the locked `node_modules/`.
|
||||
- Generate `payload-manifest.json` and `SHA256SUMS.txt`.
|
||||
- Fail if any required file is missing or if any wheel dependency would require internet during installation.
|
||||
|
||||
The builder may use `python -m pip download` on the connected build PC. The target installer must use only local files, for example `uv pip install --no-index --find-links`.
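
The manifest and checksum responsibilities above could be prototyped in Python, which `src/pdf2md/packaging_manifest.py` would allow. The sketch below is hypothetical: `sha256_of`, `build_manifest`, and `format_sha256sums` are illustrative names, the `SHA256SUMS.txt` line format follows the common `sha256sum` convention as an assumption, and the source URL and license family fields required by the contract are omitted.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large model files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(payload_root: Path) -> list[dict[str, object]]:
    """List every payload file with its relative path, size, and SHA-256 hash."""
    entries: list[dict[str, object]] = []
    files = sorted(p for p in payload_root.rglob("*") if p.is_file())
    for path in files:
        entries.append(
            {
                "path": path.relative_to(payload_root).as_posix(),
                "size": path.stat().st_size,
                "sha256": sha256_of(path),
            }
        )
    return entries


def format_sha256sums(entries: list[dict[str, object]]) -> str:
    """Render entries as '<hash>  <relative path>' lines (assumed format)."""
    return "".join(f"{entry['sha256']}  {entry['path']}\n" for entry in entries)
```

Serializing `build_manifest(...)` with `json.dumps` would give a starting point for `payload-manifest.json`. The manifest should be generated before the manifest and checksum files themselves are written into the payload folder, so they do not end up listing themselves.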
|
||||
|
||||
### WP17.2: Target Runtime Installer

Add a PowerShell install script that runs from the installed payload and creates the real runtime on the target PC.

Responsibilities:

- Verify payload hashes before installing.
- Install or locate Python 3.12 x64.
- Create `runtime\.venv` on the target PC.
- Install packages from `payload\wheelhouse` with network disabled.
- Install the project wheel into the target `.venv`.
- Preserve the bundled wheelhouse for offline repair.
- Configure `MINERU_MODEL_SOURCE=local` for UI/CLI child processes.
- Configure local MinerU model paths without silently overwriting an unrelated user `mineru.json`.
- If `%USERPROFILE%\mineru.json` already exists and points elsewhere, prompt in interactive mode; in silent mode, fail clearly and leave `repair-runtime.ps1` instructions.
- Run `pdf2md doctor` and write the result to `logs\doctor-after-install.txt`.

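The `mineru.json` rules above amount to a small decision table. A minimal Python sketch of that table follows. It assumes the caller has already extracted the configured model directory from any existing `mineru.json`, since the file's schema is not described here, and `ConfigAction` and `decide_mineru_config_action` are hypothetical names.

```python
import ntpath
from enum import Enum
from typing import Optional


class ConfigAction(Enum):
    WRITE = "write"    # no existing config: safe to create one
    KEEP = "keep"      # existing config already points at the installed models
    PROMPT = "prompt"  # interactive install: ask before changing anything
    FAIL = "fail"      # silent install: stop and point the user at repair-runtime.ps1


def _normalize(path: str) -> str:
    # ntpath applies Windows path rules even when this runs on another OS.
    return ntpath.normcase(ntpath.normpath(path))


def decide_mineru_config_action(
    existing_models_dir: Optional[str],
    installed_models_dir: str,
    silent: bool,
) -> ConfigAction:
    """Decide how to treat an existing mineru.json without ever overwriting it silently."""
    if existing_models_dir is None:
        return ConfigAction.WRITE
    if _normalize(existing_models_dir) == _normalize(installed_models_dir):
        return ConfigAction.KEEP
    return ConfigAction.FAIL if silent else ConfigAction.PROMPT
```

Keeping the decision in a pure function makes the "never silently overwritten" rule testable without touching a real user profile.
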
### WP17.3: UI Runtime Resolution

Adjust the UI runner for an installed offline layout.

Resolution order:

1. Explicit configured `pdf2md` command.
2. Installed runtime `.venv\Scripts\pdf2md.exe` under the selected project root.
3. `pdf2md` on PATH.
4. Bundled `uv.exe` plus `uv run --offline pdf2md` under the selected project root.
5. Existing system `uv run pdf2md` fallback.

Child environment rules:

- Set `MINERU_MODEL_SOURCE=local` unless the variable is already set explicitly.
- Add installed `.venv\Scripts` to PATH for runtime console scripts.
- Add installed portable Node.js path to PATH when bundled.
- Set `UV_OFFLINE=1` when using the installed offline runtime.
- Do not add remote endpoints or backend flags.

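The resolution order and child environment rules can be sketched as two pure functions. This is a hedged illustration rather than the actual `runner.py` code. The function names, the injected `file_exists` and `which` callables, and the location of the bundled `uv.exe` directly under the project root are all assumptions.

```python
import ntpath
from typing import Callable, Mapping, Optional, Sequence


def resolve_pdf2md_command(
    project_root: str,
    configured: Optional[Sequence[str]],
    file_exists: Callable[[str], bool],
    which: Callable[[str], Optional[str]],
) -> list:
    """Return the command prefix for pdf2md, following the five-step order above."""
    if configured:                                   # 1. explicit configuration
        return list(configured)
    venv_exe = ntpath.join(project_root, ".venv", "Scripts", "pdf2md.exe")
    if file_exists(venv_exe):                        # 2. installed runtime
        return [venv_exe]
    on_path = which("pdf2md")
    if on_path:                                      # 3. pdf2md on PATH
        return [on_path]
    bundled_uv = ntpath.join(project_root, "uv.exe")  # assumed location
    if file_exists(bundled_uv):                      # 4. bundled uv, offline
        return [bundled_uv, "run", "--offline", "pdf2md"]
    return ["uv", "run", "pdf2md"]                   # 5. system uv fallback


def build_child_env(
    base_env: Mapping[str, str],
    scripts_dir: Optional[str],
    node_dir: Optional[str],
    offline: bool,
) -> dict:
    """Build the child process environment without adding remote endpoints."""
    env = dict(base_env)
    env.setdefault("MINERU_MODEL_SOURCE", "local")   # keep an explicit user value
    parts = [d for d in (scripts_dir, node_dir) if d]
    if env.get("PATH"):
        parts.append(env["PATH"])
    if parts:
        env["PATH"] = ";".join(parts)                # ";" is the Windows PATH separator
    if offline:
        env["UV_OFFLINE"] = "1"
    return env
```

Injecting `file_exists` and `which` keeps the resolver testable with fake paths, which fits the requirement that default tests stay independent of a real installed runtime. The caller is still responsible for running the `uv` variants with the project root as the working directory.
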
### WP17.4: Inno Setup Installer

Add an Inno Setup script that installs the payload and invokes the target runtime installer.

Installer behavior:

- Default to per-user install under `%LOCALAPPDATA%\Programs\ConvertPDFToMD`.
- Create Start Menu shortcuts for:
  - `ConvertPDFToMD` UI,
  - `PDF2MD Doctor`,
  - `Repair PDF2MD Runtime`.
- Run `install-runtime.ps1` after files are copied.
- Show the doctor log path if setup finishes with WARN.
- Fail the install on target runtime setup failure unless the user explicitly chooses to keep files for manual repair.

### WP17.5: License, Manifest, And Offline Verification

Add docs and checks for redistribution risk.

Required records:

- Python, uv, PyInstaller, PyTorch, MinerU, model files, Node.js, MathJax, and transitive Python/npm dependency notices.
- A manifest with file hashes and source URLs.
- A clear statement that runtime conversion remains local-only and that setup payload creation can use internet only on the build PC.

Verification tiers:

- Fast tests use fake staging folders and fake wheel/model files.
- Build-PC packaging smoke can create the staging folder without committing payload.
- Offline target smoke uses a clean Windows VM with networking disabled.

## Implementation Task Plan

### Task 1: Packaging Manifest And Ignore Policy

Files:

- Create `tests/test_offline_packaging.py`.
- Create `src/pdf2md/packaging_manifest.py` if needed.
- Modify `.gitignore`.

Steps:

- Add failing tests for manifest generation with SHA-256, file size, relative path, and source label.
- Add failing tests that payload paths under `dist/offline-installer/`, wheelhouse files, model files, and generated installer executables stay ignored.
- Implement the smallest manifest helper or PowerShell-compatible JSON format.
- Run `uv run pytest tests/test_offline_packaging.py`.
- Commit manifest and ignore-policy changes.

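A manifest helper that records the four fields named in the steps above might look like the following sketch. The function names and the JSON key names (`path`, `size`, `sha256`, `source`) are assumptions for illustration and not a fixed schema.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(payload_root: Path, source_label: str) -> list:
    """List every file under payload_root with relative path, size, SHA-256, and source label."""
    entries = []
    for path in sorted(p for p in payload_root.rglob("*") if p.is_file()):
        entries.append(
            {
                "path": path.relative_to(payload_root).as_posix(),
                "size": path.stat().st_size,
                "sha256": sha256_of(path),
                "source": source_label,
            }
        )
    return entries


def manifest_json(entries: list) -> str:
    """Serialize to plain JSON, which PowerShell can read with ConvertFrom-Json."""
    return json.dumps(entries, indent=2)
```

Sorting the paths and using forward slashes keeps the manifest stable between builds, so a changed hash in a diff points at a changed file rather than at ordering noise.
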
### Task 2: Offline Payload Builder

Files:

- Create `packaging/offline/build-offline-payload.ps1`.
- Create `packaging/offline/requirements-runtime-cu126.txt`.
- Create `packaging/offline/README.md`.
- Create `packaging/offline/verify-offline-payload.ps1`.
- Modify `tests/test_offline_packaging.py`.

Steps:

- Add tests that the builder rejects missing UI exe, missing model source, missing Python runtime package, missing `uv.exe`, and empty wheelhouse.
- Add tests that the builder excludes `.venv`, `.git`, `samples`, and `outputs`, and excludes `node_modules` unless it is explicitly copied as the optional locked MathJax payload.
- Implement payload staging, manifest generation, and payload verification.
- Run `uv run pytest tests/test_offline_packaging.py`.
- Run a dry build command that uses fake payload inputs.
- Commit builder changes.

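The rejection cases in the first step can be expressed as one validation pass that collects every problem before failing. The sketch below is in Python for illustration only, because the planned builder is a PowerShell script. `validate_builder_inputs` and its parameter names are hypothetical.

```python
from pathlib import Path


def validate_builder_inputs(
    ui_exe: Path,
    python_installer: Path,
    uv_exe: Path,
    model_source: Path,
    wheelhouse: Path,
) -> list:
    """Return a list of human-readable errors; an empty list means the inputs are usable."""
    errors = []
    required_files = (
        ("UI exe", ui_exe),
        ("Python runtime package", python_installer),
        ("uv.exe", uv_exe),
    )
    for label, path in required_files:
        if not path.is_file():
            errors.append(f"missing {label}: {path}")
    # The model source must be a folder that actually contains something.
    if not model_source.is_dir() or not any(model_source.iterdir()):
        errors.append(f"missing or empty model source: {model_source}")
    # An empty wheelhouse would force a network install on the target PC.
    if not wheelhouse.is_dir() or not any(wheelhouse.glob("*.whl")):
        errors.append(f"empty wheelhouse: {wheelhouse}")
    return errors
```

Returning all errors at once, instead of stopping at the first, lets a single dry build report every missing input.
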
### Task 3: Target Runtime Install And Repair Scripts

Files:

- Create `packaging/offline/install-runtime.ps1`.
- Create `packaging/offline/repair-runtime.ps1`.
- Create `packaging/offline/run-doctor.ps1`.
- Modify `tests/test_offline_packaging.py`.

Steps:

- Add tests that scripts contain `--no-index`, `--find-links`, and `UV_OFFLINE=1`, and contain no target-install commands that reference `http://` or `https://`.
- Add tests that existing `mineru.json` handling is explicit and never silently overwritten.
- Implement target-local `.venv` creation, offline package install, model config handling, doctor logging, and repair flow.
- Run `uv run pytest tests/test_offline_packaging.py`.
- Commit install-script changes.

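A static check of the install scripts, as described in the first step, could be sketched like this. It assumes the check looks for the bare name `UV_OFFLINE` rather than the literal `UV_OFFLINE=1`, because a PowerShell script would more likely set it as `$env:UV_OFFLINE = "1"`. The helper name and the rule that comment lines may cite URLs are also assumptions.

```python
import re

# Tokens the target install script is expected to contain.
REQUIRED_TOKENS = ("--no-index", "--find-links", "UV_OFFLINE")
URL_PATTERN = re.compile(r"https?://", re.IGNORECASE)


def find_offline_violations(script_text: str) -> list:
    """Return problems that would let a target install script reach the network."""
    problems = [
        f"missing token: {token}"
        for token in REQUIRED_TOKENS
        if token not in script_text
    ]
    for number, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # assumption: comment lines may cite documentation URLs
        if URL_PATTERN.search(stripped):
            problems.append(f"line {number}: remote URL in command")
    return problems
```

A pytest case would read each script under `packaging/offline/` as text and assert that this function returns an empty list, which keeps the check fast and independent of PowerShell itself.
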
### Task 4: UI Installed Runtime Resolution

Files:

- Modify `src/pdf2md_ui/runner.py`.
- Modify `src/pdf2md_ui/app.py` only if needed.
- Modify `tests/test_ui_runner.py`.

Steps:

- Add failing tests for project-root `.venv\Scripts\pdf2md.exe` resolution before PATH.
- Add failing tests for bundled `uv.exe` plus `uv run --offline pdf2md` fallback.
- Add failing tests that the child environment prepends `.venv\Scripts` and bundled Node.js when present.
- Implement the minimal runner changes.
- Run `uv run pytest tests/test_ui_runner.py`.
- Commit UI resolution changes.

### Task 5: Inno Setup Script

Files:

- Create `packaging/offline/Pdf2MdOffline.iss`.
- Modify `tests/test_offline_packaging.py`.

Steps:

- Add tests that the Inno script references the expected payload directories, Start Menu shortcuts, and runtime install script.
- Add tests that the script does not reference `samples`, `outputs`, `.venv`, or remote URLs.
- Implement the Inno script.
- On a build PC with Inno Setup installed, run `ISCC.exe packaging\offline\Pdf2MdOffline.iss`.
- Commit installer-script changes without committing the generated installer.

### Task 6: Documentation And Release Gate

Files:

- Modify `README.md`.
- Modify `docs/V1RELEASECHECKLIST.md`.
- Modify `docs/Sprints/SPRINT17CONTRACT.md`.
- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.

Steps:

- Document build-PC prerequisites and target-PC prerequisites.
- Document the offline artifact layout, expected size risk, and repair flow.
- Document the clean offline VM smoke test.
- Record final verification outcomes and residual risks.
- Commit documentation and handoff updates.

## Verification Commands

Default fast checks:

```powershell
uv run pytest tests/test_offline_packaging.py tests/test_ui_runner.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```

Build-PC packaging checks:

```powershell
uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py
$pythonInstaller = "C:\BuildCache\python-3.12-amd64.exe"
$uvExe = "C:\BuildCache\uv.exe"
$mineruModels = "C:\BuildCache\mineru-models"
powershell -ExecutionPolicy Bypass -File packaging\offline\build-offline-payload.ps1 -Configuration Release -PythonInstaller $pythonInstaller -UvExe $uvExe -MinerUModelSource $mineruModels
powershell -ExecutionPolicy Bypass -File packaging\offline\verify-offline-payload.ps1 -PayloadRoot dist\offline-installer\payload
ISCC.exe packaging\offline\Pdf2MdOffline.iss
```

Offline target smoke:

```powershell
# Run on a clean Windows x64 VM with networking disabled after copying only the installer.
.\Pdf2MdOfflineSetup-*.exe
& "$env:LOCALAPPDATA\Programs\ConvertPDFToMD\scripts\run-doctor.ps1"
& "$env:LOCALAPPDATA\Programs\ConvertPDFToMD\runtime\.venv\Scripts\pdf2md.exe" --version
& "$env:LOCALAPPDATA\Programs\ConvertPDFToMD\runtime\.venv\Scripts\pdf2md.exe" doctor
```

Optional conversion smoke on the offline target:

```powershell
& "$env:LOCALAPPDATA\Programs\ConvertPDFToMD\runtime\.venv\Scripts\pdf2md.exe" convert C:\LocalTest\SolidElement.pdf --out C:\LocalTest\outputs --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local
```

Expected optional output:

```text
C:\LocalTest\outputs\SolidElement\SolidElement_001.md
C:\LocalTest\outputs\SolidElement\SolidElement_report.md
C:\LocalTest\outputs\SolidElement\images\
```

## Acceptance Criteria

- The generated installer can install the runtime on a clean Windows x64 target without internet access.
- The target runtime has a newly created local `.venv`; it is not a copied development `.venv`.
- `pdf2md --version` runs from the installed `.venv`.
- `pdf2md doctor` runs without network access and reports all install-relevant failures or warnings clearly.
- The UI launches from the Start Menu and resolves the installed runtime without manual project-root configuration.
- MinerU uses local models through `MINERU_MODEL_SOURCE=local` and local model config.
- Python package installation uses only bundled local wheels.
- The wheelhouse and model payload are hash-verified before install.
- No generated payload, model file, wheel, installer exe, sample PDF, or conversion output is committed.
- Default tests remain fast and independent of real MinerU, GPU, model files, network, Inno Setup, MathJax, or `samples/`.

## Hard Failure Criteria

- The target installer downloads anything from the internet.
- The UI or CLI introduces a runtime document upload path.
- The installer silently overwrites an unrelated existing `mineru.json`.
- The installer copies the development `.venv` as the installed runtime.
- The installed UI cannot find `pdf2md` without manually editing settings on a clean install.
- `pdf2md doctor` is skipped or its failure is hidden.
- Payload hash verification is missing.
- License/model redistribution review is skipped before sharing the installer outside the current personal environment.
- NVIDIA drivers or CUDA Toolkit installers are redistributed in this sprint.

## Open Risks

- The final installer may be very large because CUDA PyTorch wheels, MinerU dependencies, model weights, and optional Node/MathJax assets are large.
- MinerU model redistribution terms and transitive package/model licenses must be reviewed before broader sharing.
- Target PCs still need compatible NVIDIA hardware and drivers. The installer can verify and report this, but it cannot guarantee GPU compatibility.
- Some conversions can still stall or run slowly on GTX 1070 Ti 8GB; packaging does not solve runtime performance.
- Inno Setup may need practical size and antivirus/SmartScreen validation once real model payloads are included.

## Sources

- PyInstaller usage: https://pyinstaller.org/en/stable/usage.html
- Inno Setup command-line compiler: https://documentation.help/Inno-Setup/topic_compilercmdline.htm
- uv CLI `--offline` behavior: https://docs.astral.sh/uv/reference/cli/
- uv cache behavior: https://docs.astral.sh/uv/concepts/cache/
- pip offline install/download behavior: https://pip.pypa.io/en/stable/cli/pip_install.html and https://pip.pypa.io/en/stable/cli/pip_download/
- PyTorch previous version wheel command for CUDA 12.6: https://pytorch.org/get-started/previous-versions/
- MinerU local model source behavior: https://opendatalab.github.io/MinerU/usage/model_source/

## Handoff Requirements

After implementation:

- Update this contract status to `Implemented` or record the failed gate.
- Record payload size and generated installer path in `PROGRESS.md`.
- Record verification commands and outcomes in `PROGRESS.md`.
- Archive implementation evidence and offline VM smoke results in `docs/WORKARCHIVE.md`.
- Keep generated offline payloads, wheels, model files, installer exe, `dist/`, `outputs/`, and `samples/` uncommitted.
@@ -134,7 +134,7 @@ Not allowed:
- Do not run model setup automatically.
- Do not require the local GTX 1070 Ti to pass CUDA/PyTorch checks in the default test loop.
- Do not improve OCR/model accuracy.
- Do not introduce a manual review UI or web UI.
- Do not introduce a manual review UI, hosted web UI, or local desktop launcher in Sprint 9.
- Do not add alternate conversion engines or fallback engines.
- Do not benchmark against cloud OCR/API services.
- Do not commit sample PDFs, sample-derived outputs, or large binary fixtures.