modify pdftomd

김경종
2026-05-14 10:16:59 +09:00
parent 2232b51fc9
commit dc11880140
69 changed files with 7784 additions and 1150 deletions
+3 -3
@@ -1,13 +1,13 @@
# Knowledge Base: Local PDF-to-Markdown Converter for Math-Heavy Documents
-Last updated: 2026-05-07
+Last updated: 2026-05-11
## 1. Product Direction
This project will build a local-first PDF-to-Markdown converter for math-heavy academic PDFs and books. The v1 target is intentionally narrow:
- Processing policy: local-only. Do not send user PDFs to cloud OCR or external AI APIs.
-- Primary interface: CLI plus Python library.
+- Primary interface: CLI plus Python library. A later thin local desktop launcher may wrap the CLI, but it must not become a separate conversion pipeline.
- Primary output: Obsidian-friendly Markdown.
- Main conversion engine: MinerU 3.1.0.
- Math output: inline math as `$...$`, display math as `$$...$$`.
@@ -73,7 +73,7 @@ Rules:
- Inline math: `$...$`.
- Display math: `$$...$$` on separate lines.
-- Store extracted images in a sibling assets directory, for example `paper.assets/page-003-figure-01.png`.
+- Store extracted images in the PDF output folder's shared `images/` directory, for example `paper/images/page-003_figure-01.png`.
- Use relative links from the Markdown file to assets.
- Preserve page boundaries in metadata, not by noisy visible page markers in the main Markdown.
- Prefer normal Markdown tables for simple tables.
+2 -2
@@ -19,7 +19,7 @@ Relevant existing behavior:
- Conversion remains local-only.
- MinerU 3.1.0 remains the only PDF conversion engine.
- Quality warnings are non-fatal unless no usable output can be produced.
-- Metadata and `.report.md` already include `math_render_error_count`.
+- Internal provenance and `_report.md` include `math_render_error_count`.
- Default tests must not require real MinerU, GPU, Node.js, MathJax, network, Obsidian, or sample PDFs.
## References
@@ -237,7 +237,7 @@ Optional local tests:
- `pdf2md doctor` reports MathJax checker availability clearly.
- Conversion still succeeds when MathJax is unavailable, with an info warning.
- Conversion still succeeds when individual formulas fail, with warning records.
-- `.metadata.json` and `.report.md` show actual math render failure counts when MathJax is available.
+- Internal provenance and `_report.md` show actual math render failure counts when MathJax is available.
- The generated Markdown is not changed by the checker.
## Hard Failure Criteria
+218
@@ -0,0 +1,218 @@
# Sprint 12 Contract: Minimal Windows UI Launcher
Status: Implemented with residual conversion-smoke risk
Last updated: 2026-05-11
## Objective
Build a minimal Windows desktop launcher for the existing `pdf2md` CLI and package the launcher itself as `dist/pdf2md-ui.exe`.
The UI must remain a thin local launcher. It must not become a second conversion engine, a hosted app, a manual review workflow, or a bundled redistribution of MinerU, CUDA PyTorch, model weights, Node.js, or MathJax.
## Research Basis
- Primary research document: `docs/UI_RESEARCH.md`.
- The recommended implementation path is `tkinter`/`ttk`, a subprocess runner around `pdf2md` or `uv run pdf2md`, and PyInstaller for the Windows executable.
## Current Precondition
- `pdf2md doctor`, `pdf2md convert`, and `pdf2md recheck` are implemented.
- Conversion remains strict-local and MinerU-only.
- Current CLI output is coarse during MinerU execution because the adapter captures MinerU subprocess output internally.
- UI research is complete.
- UI implementation exists under `src/pdf2md_ui/`.
- `dist\pdf2md-ui.exe` can be built with PyInstaller.
## Touched Surfaces
Allowed during implementation:
- `src/pdf2md_ui/__init__.py`
- `src/pdf2md_ui/app.py`
- `src/pdf2md_ui/runner.py`
- `tests/test_ui_runner.py`
- `pyproject.toml`
- `uv.lock`
- `README.md`
- `PLAN.md`
- `PROGRESS.md`
- `docs/WORKARCHIVE.md`
- `docs/V1IMPLEMENTATIONPLAN.md`
Generated but not committed unless explicitly requested:
- `build/`
- `dist/`
- `*.spec`
- generated conversion outputs under `outputs/`
Not allowed:
- Runtime document upload paths.
- Remote OCR, hosted LLM/VLM, hosted renderers, or remote document parsing APIs.
- `--api-url`, router mode, HTTP client backends, remote OpenAI-compatible endpoints, or runtime engine selection.
- Direct UI calls to `mineru`; the UI must call the project-owned `pdf2md` CLI.
- Bundling MinerU, CUDA PyTorch, local model weights, Node.js, or MathJax into the first UI executable.
- Batch queues, drag/drop, PDF preview, Markdown preview, Obsidian automation, installer generation, or code signing in this sprint.
- Mandatory default tests that require real MinerU, GPU, model files, network, Obsidian, or `samples/`.
## Product Behavior
The first UI is a single-window launcher:
- Select one input PDF.
- Select an output root, defaulting to `outputs`; the current CLI creates the final `<stem>\` folder inside it.
- Configure only existing CLI options:
- overwrite
- keep raw output
- optional grouped pages with default `20`
- GPU device with default `cuda:0`, including `auto` when supported by the CLI
- MinerU profile `auto|safe|performance` with default `auto`
- Run `Doctor`.
- Run `Convert`.
- Run `Recheck` for an existing Markdown output.
- Cancel a running subprocess.
- Open the output directory after completion.
- Show a read-only log and indeterminate progress while a command is running.
Command resolution:
1. Use a configured command if present.
2. Else use `pdf2md` from `PATH`.
3. Else use `uv run pdf2md` from a configured project root containing `pyproject.toml`.
4. Else report a setup error and direct the user to run `pdf2md doctor`.
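The resolution order above can be sketched as a small pure function. This is a minimal illustration, not the project's actual runner: the names `resolve_pdf2md_command` and `SetupError`, the injectable `which` parameter, and the choice to return the working directory alongside the argument list are all assumptions made for the example.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional, Sequence, Tuple


class SetupError(RuntimeError):
    """Hypothetical error raised when no usable pdf2md command is found."""


def resolve_pdf2md_command(
    configured: Optional[Sequence[str]] = None,
    project_root: Optional[Path] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Tuple[list, Optional[Path]]:
    """Return (argv prefix, working directory) following the four-step order."""
    # 1. An explicitly configured command wins.
    if configured:
        return list(configured), None
    # 2. Otherwise use pdf2md from PATH.
    found = which("pdf2md")
    if found:
        return [found], None
    # 3. Otherwise run through uv from a project root that has pyproject.toml.
    if project_root is not None and (project_root / "pyproject.toml").is_file():
        return ["uv", "run", "pdf2md"], project_root
    # 4. Otherwise report a setup error.
    raise SetupError("pdf2md was not found; run `pdf2md doctor` to check the setup.")
```

Injecting `which` keeps the function testable without touching the real `PATH`, which fits the requirement that default tests stay independent of the local environment.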
## Architecture Plan
### WP12.1: CLI Runner
Actions:
- Add a runner module that builds fixed argument lists for `doctor`, `convert`, and `recheck`.
- Use `subprocess.Popen` with `shell=False`.
- Set `MINERU_MODEL_SOURCE=local` in the child environment unless already set.
- Merge stderr into stdout for a single UI log stream.
- Read subprocess output on a worker thread and report status events to the UI.
- Add a Windows process-tree cancellation helper that falls back to `taskkill /pid <pid> /t /f` only if normal termination does not finish promptly.
Expected output:
- Testable command-construction and process-management code that never accepts arbitrary shell text from the UI.
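A minimal sketch of the runner behavior listed in WP12.1, under stated assumptions: the function names, the `grace_seconds` timeout, and the non-Windows `kill()` fallback are illustrative choices and are not taken from `src/pdf2md_ui/runner.py`.

```python
import os
import subprocess
import sys
from typing import Mapping, Optional, Sequence


def build_child_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Copy the environment and default MINERU_MODEL_SOURCE to 'local'."""
    env = dict(os.environ if base is None else base)
    env.setdefault("MINERU_MODEL_SOURCE", "local")  # keep an existing value
    return env


def start_process(argv: Sequence[str]) -> subprocess.Popen:
    """Start a fixed argument list without a shell, merging stderr into stdout."""
    return subprocess.Popen(
        list(argv),
        shell=False,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # single log stream for the UI
        text=True,
        encoding="utf-8",
        errors="replace",
        env=build_child_env(),
    )


def cancel_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> None:
    """Try normal termination first, then escalate to the whole process tree."""
    if proc.poll() is not None:
        return  # already finished
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
        return
    except subprocess.TimeoutExpired:
        pass
    if sys.platform == "win32":
        # Kill the process tree only after normal termination timed out.
        subprocess.run(
            ["taskkill", "/pid", str(proc.pid), "/t", "/f"],
            shell=False, check=False, capture_output=True,
        )
    else:
        proc.kill()  # assumption: fallback so the sketch also runs off Windows
    proc.wait()
```

Because every command is a list passed with `shell=False`, text typed into the UI can only ever become an argument value, never shell syntax.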
### WP12.2: Minimal Tk UI
Actions:
- Add a `tkinter`/`ttk` app with file and directory pickers, option controls, command buttons, progress indicator, and log pane.
- Keep long-running work off Tk's main event-loop thread.
- Disable conflicting controls while a command is running.
- Surface non-zero exit codes clearly.
Expected output:
- A simple local GUI for existing CLI workflows.
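One common way to keep long-running work away from the Tk thread is a worker thread that posts events to a `queue.Queue`, which the UI drains periodically. The sketch below shows only that hand-off and deliberately avoids importing `tkinter`; `start_worker`, `drain_events`, and the event names `log`, `done`, and `error` are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterable


def start_worker(produce_lines: Callable[[], Iterable[str]],
                 events: queue.Queue) -> threading.Thread:
    """Run produce_lines() on a background thread and post events to the queue."""
    def work() -> None:
        try:
            for line in produce_lines():
                events.put(("log", line))
            events.put(("done", ""))
        except Exception as exc:  # report failures instead of dying silently
            events.put(("error", str(exc)))

    thread = threading.Thread(target=work, daemon=True)
    thread.start()
    return thread


def drain_events(events: queue.Queue, handle: Callable[[str, str], None]) -> bool:
    """Handle all queued events; return True once a final event was seen."""
    finished = False
    while True:
        try:
            kind, payload = events.get_nowait()
        except queue.Empty:
            return finished
        handle(kind, payload)
        if kind in ("done", "error"):
            finished = True
```

In a Tk app, a `poll()` callback would call `drain_events(...)` and reschedule itself with `root.after(100, poll)` until it returns `True`, so widget updates happen only on the Tk thread.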
### WP12.3: Build
Actions:
- Add PyInstaller only to a build dependency group such as `ui-build`.
- Build the executable with:
```powershell
uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py
```
Expected output:
- `dist\pdf2md-ui.exe` exists after the build.
## Verification Checks
Default checks:
- `uv run pytest tests/test_ui_runner.py`
- `uv run pytest tests/test_cli.py` if shared CLI behavior changes
- `git diff --check`
- `git status --short --untracked-files=all`
Build check:
```powershell
uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py
Test-Path dist\pdf2md-ui.exe
```
Manual smoke:
1. Launch `dist\pdf2md-ui.exe`.
2. Run Doctor from the UI.
3. Convert one small local sample into an ignored `outputs/` directory.
4. Confirm Markdown, report Markdown, and assets are produced as expected for the active output layout.
## Acceptance Criteria
- The UI invokes `pdf2md` or `uv run pdf2md`; it never invokes `mineru` directly.
- Commands are fixed argument lists and run with `shell=False`.
- The UI remains responsive while a conversion is running.
- Cancel attempts to stop the process tree on Windows.
- Doctor and conversion exit codes are visible in the UI.
- PyInstaller produces `dist\pdf2md-ui.exe`.
- Default tests stay independent of real MinerU, GPU, model files, network, Obsidian, and `samples/`.
## Hard Failure Criteria
- UI code exposes arbitrary shell command execution.
- UI exposes remote/API options or weakens strict-local policy.
- UI claims conversion success without checking the CLI exit code.
- UI freezes during a long conversion because the CLI runs on Tk's main event-loop thread.
- The first UI executable bundles MinerU, CUDA PyTorch, model weights, Node.js, or MathJax.
- Build outputs, generated conversion outputs, local models, or sample PDFs are committed.
## Handoff Requirements
After implementation:
- Update `PROGRESS.md` with files changed, commands run, test outcomes, build outcome, known failures, residual risks, and next action.
- Move completed implementation details to `docs/WORKARCHIVE.md` after verification.
- Keep sample PDFs and generated outputs out of the commit.
## Implementation Handoff
Files changed:
- `src/pdf2md_ui/__init__.py`
- `src/pdf2md_ui/app.py`
- `src/pdf2md_ui/runner.py`
- `tests/test_ui_runner.py`
- `pyproject.toml`
- `uv.lock`
- `README.md`
- `PLAN.md`
- `PROGRESS.md`
- `docs/WORKARCHIVE.md`
- `docs/V1IMPLEMENTATIONPLAN.md`
Verification:
- `uv run pytest tests\test_ui_runner.py`: passed 16 tests.
- `uv run pytest`: passed 188 tests with 1 optional skip.
- `uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py`: passed.
- `Test-Path dist\pdf2md-ui.exe`: returned `True`.
- `uv run pdf2md doctor`: returned WARN only for the documented GTX 1070 Ti/Pascal compatibility risk.
- Launch smoke for `dist\pdf2md-ui.exe`: process started and was then terminated by the smoke script.
Follow-up refresh on 2026-05-12:
- Updated the UI command builder and form controls for the Sprint 15 `--mineru-profile auto|safe|performance` CLI option.
- Rebuilt `dist\pdf2md-ui.exe` after Sprint 16 simplified output layout and Sprint 15 profile changes.
- `uv run pytest tests\test_ui_runner.py`: passed 17 tests.
- Launch smoke for the rebuilt `dist\pdf2md-ui.exe`: process started and was then terminated by the smoke script.
Known failure:
- A CLI conversion smoke using `samples\FourNodeQuadrilateralShellElementMITC4.pdf` and the same command shape used by the UI did not finish within the 15-minute timeout. The spawned process tree was terminated with `taskkill`.
Residual risk:
- A hands-on UI Doctor click and UI conversion click should still be run when the local MinerU runtime is expected to complete within an acceptable time.
+292
@@ -0,0 +1,292 @@
# Sprint 13 Contract: Text Layer Fidelity Diagnostics
Status: Implemented
Last updated: 2026-05-11
## Objective
Add a local pypdf-based text fidelity diagnostic pass that compares source PDF text-layer extraction with MinerU-generated Markdown text on a per-page basis where page mapping is available.
The first priority is diagnosis, not automatic body-text replacement. This sprint should record enough evidence in metadata JSON and `<stem>.report.md` to identify pages where MinerU likely misrecognized Korean body text, especially missing Hangul syllables, unexpected CJK ideographs, and abnormal spacing. It may mark pypdf text as a future replacement candidate, but it must not replace Markdown body text in this sprint.
## Current Precondition
- MinerU 3.1.0 remains the only conversion engine.
- Conversion runs through direct local `mineru` CLI execution only.
- `pypdf` is already used by the project for local PDF chunk planning.
- `pdf2md convert` writes Markdown, metadata JSON, and `<stem>.report.md`.
- `pdf2md recheck` can regenerate metadata/report from an existing Markdown file.
- Chunked conversion records original source page ranges in metadata `engine_options.chunk`.
- The 2007 Korean shell-structure sample showed clear text fidelity problems:
- pypdf can extract more accurate Hangul from the digital text layer.
- MinerU Markdown can omit Hangul syllables or misrecognize headings/body text as unrelated CJK characters.
- The source text layer itself can contain abnormal spacing between Hangul syllables.
## Touched Surfaces
Allowed during implementation:
- `src/pdf2md/text_fidelity.py`
- `src/pdf2md/ir.py`
- `src/pdf2md/metadata.py`
- `src/pdf2md/report.py`
- `src/pdf2md/conversion.py`
- `tests/test_text_fidelity.py`
- `tests/test_metadata.py`
- `tests/test_report.py`
- `tests/test_conversion.py`
- `docs/V1IMPLEMENTATIONPLAN.md`
- `PLAN.md`
- `PROGRESS.md`
- `docs/WORKARCHIVE.md` after completion
Allowed only if needed for CLI/API wiring:
- `src/pdf2md/cli.py`
- `tests/test_cli.py`
- `README.md`
Not allowed:
- Replacing Markdown body text with pypdf text in this sprint.
- Adding a second conversion engine or engine selector.
- Adding remote OCR, hosted LLM/VLM, remote document parsing, `--api-url`, router mode, HTTP client backends, or remote OpenAI-compatible endpoints.
- Mandatory default tests that require real MinerU, GPU, model files, network, Obsidian, or committed `samples/`.
- Committing sample PDFs or generated `outputs/`.
## Product Behavior
Text fidelity diagnostics should run automatically after MinerU Markdown normalization and local quality checks have produced the final Markdown candidate.
For each page that can be compared, metadata should record a compact diagnostic object with at least:
- `page_index`: zero-based output page index.
- `source_page_number`: one-based original PDF page number when known.
- `pypdf_text_available`: whether pypdf extracted non-empty source text.
- `markdown_text_available`: whether comparable Markdown text exists for the page.
- `pypdf_hangul_count`: Hangul syllable count from pypdf text.
- `markdown_hangul_count`: Hangul syllable count from Markdown text.
- `hangul_count_delta`: `markdown_hangul_count - pypdf_hangul_count`.
- `hangul_count_ratio`: Markdown Hangul count divided by pypdf Hangul count, or `null` when unavailable.
- `unexpected_cjk_count`: count of CJK Unified Ideographs in Markdown that are suspicious in a page with Korean source text.
- `pypdf_hangul_spacing_anomaly_ratio`: ratio of Hangul-to-Hangul whitespace breaks in pypdf text.
- `markdown_hangul_spacing_anomaly_ratio`: ratio of Hangul-to-Hangul whitespace breaks in Markdown text.
- `text_similarity`: normalized text similarity between pypdf text and Markdown text.
- `replacement_candidate`: `true` only when pypdf text appears more reliable than Markdown text under conservative thresholds.
- `comparison_status`: one of `checked`, `source_text_missing`, `markdown_page_unavailable`, or `page_mapping_uncertain`.
Metadata summary should include:
- `text_fidelity_checked_page_count`.
- `text_fidelity_low_page_count`.
- `text_fidelity_unexpected_cjk_count`.
- `text_fidelity_replacement_candidate_page_count`.
- `text_fidelity_page_mapping_uncertain_count`.
Report Markdown should add a dedicated `## Text Fidelity` section showing:
- checked page count and low-fidelity page count.
- total unexpected CJK count.
- replacement candidate page count.
- pages with low similarity.
- pages with high unexpected CJK count.
- pages where page-level comparison could not be trusted.
Warning behavior:
- Add `TEXT_LAYER_AVAILABLE` as an info warning when pypdf source text is available and diagnostics run.
- Add `TEXT_FIDELITY_LOW` as a warning for pages below the fidelity threshold.
- Add `UNEXPECTED_CJK_IN_KOREAN_TEXT` as a warning when suspicious CJK ideographs appear in Markdown for pages with Korean source text.
- Add `HANGUL_SPACING_SUSPECT` as an info or warning-level signal when pypdf or Markdown text has a high Hangul spacing anomaly ratio.
- Add `TEXT_PAGE_MAPPING_UNCERTAIN` as an info warning when page-level Markdown mapping is not reliable enough for per-page metrics.
Replacement candidate policy:
- `replacement_candidate` is a diagnostic marker only.
- It must not change Markdown output.
- It should be `true` only when:
- pypdf source text is available,
- pypdf Hangul count is materially higher than Markdown Hangul count or Markdown has suspicious CJK ideographs,
- pypdf spacing anomalies are not so severe that the source text layer is clearly unusable,
- page mapping is `checked`.
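The policy above can be read as a single conservative predicate. The sketch below is illustrative only: the contract does not state numeric thresholds, so `min_hangul_gain_ratio` and `max_spacing_anomaly_ratio` and their default values are assumptions, as is the function name.

```python
def is_replacement_candidate(
    *,
    pypdf_text_available: bool,
    pypdf_hangul_count: int,
    markdown_hangul_count: int,
    unexpected_cjk_count: int,
    pypdf_hangul_spacing_anomaly_ratio: float,
    comparison_status: str,
    min_hangul_gain_ratio: float = 1.2,      # assumed threshold
    max_spacing_anomaly_ratio: float = 0.5,  # assumed threshold
) -> bool:
    """Diagnostic marker only; it never changes the Markdown output."""
    # Page mapping must be credible and source text must exist.
    if comparison_status != "checked" or not pypdf_text_available:
        return False
    # A badly spaced source text layer is not a trustworthy replacement.
    if pypdf_hangul_spacing_anomaly_ratio > max_spacing_anomaly_ratio:
        return False
    materially_more_hangul = (
        pypdf_hangul_count > markdown_hangul_count
        and pypdf_hangul_count >= markdown_hangul_count * min_hangul_gain_ratio
    )
    return materially_more_hangul or unexpected_cjk_count > 0
```

Keeping the thresholds as keyword parameters would let a later sprint tune them from real diagnostics without changing the call sites.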
## Architecture Plan
### WP13.1: Text Fidelity Module
Actions:
- Add `src/pdf2md/text_fidelity.py`.
- Use `pypdf.PdfReader` to extract source page text locally.
- Define immutable result records for per-page metrics and summary metrics.
- Strip Markdown syntax, image links, fenced code, inline code, and math spans before text comparison.
- Normalize text for comparison without mutating the output Markdown:
- Unicode NFKC normalization for comparison strings only.
- collapse whitespace for similarity only.
- compute raw-count metrics on the un-normalized text so that spacing anomalies stay visible.
- Count Hangul syllables with the Hangul syllable block.
- Count suspicious CJK ideographs with CJK Unified Ideograph ranges, excluding Hangul ranges.
- Compute similarity with a deterministic standard-library algorithm such as `difflib.SequenceMatcher`.
Expected output:
- Pure local helper functions that are independently testable and do not call MinerU, network services, or the filesystem except for reading the source PDF.
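The counting and comparison helpers described in WP13.1 might look roughly like the following standard-library sketch. All function names are hypothetical, the Markdown stripping is deliberately simplistic, and the spacing anomaly ratio uses an assumed definition (whitespace-separated Hangul pairs divided by all consecutive Hangul pairs), since the contract does not fix the formula.

```python
import re
import unicodedata
from difflib import SequenceMatcher

_STRIP_PATTERNS = [
    (re.compile(r"```.*?```", re.DOTALL), " "),      # fenced code
    (re.compile(r"\$\$.*?\$\$", re.DOTALL), " "),    # display math
    (re.compile(r"\$[^$\n]+\$"), " "),               # inline math
    (re.compile(r"`[^`\n]+`"), " "),                 # inline code
    (re.compile(r"!\[[^\]]*\]\([^)]*\)"), " "),      # image links
    (re.compile(r"\[([^\]]*)\]\([^)]*\)"), r"\1"),   # keep link text only
    (re.compile(r"^[ \t]{0,3}(?:#{1,6}|>|[-*+])\s+", re.MULTILINE), ""),  # block markers
]


def is_hangul_syllable(ch: str) -> bool:
    return "\uac00" <= ch <= "\ud7a3"  # Hangul Syllables block


def count_hangul(text: str) -> int:
    return sum(1 for ch in text if is_hangul_syllable(ch))


def count_cjk_ideographs(text: str) -> int:
    # CJK Unified Ideographs plus Extension A; Hangul is not in these ranges.
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf")


def hangul_spacing_anomaly_ratio(text: str) -> float:
    """Assumed definition: share of consecutive Hangul pairs split by whitespace."""
    transitions = breaks = 0
    prev_hangul = gap = False
    for ch in text:
        if is_hangul_syllable(ch):
            if prev_hangul:
                transitions += 1
                breaks += 1 if gap else 0
            prev_hangul, gap = True, False
        elif ch.isspace():
            gap = True
        else:
            prev_hangul = gap = False
    return breaks / transitions if transitions else 0.0


def comparable_text(markdown: str) -> str:
    """Strip Markdown syntax, then NFKC-normalize and collapse whitespace."""
    for pattern, replacement in _STRIP_PATTERNS:
        markdown = pattern.sub(replacement, markdown)
    return " ".join(unicodedata.normalize("NFKC", markdown).split())


def text_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b, autojunk=False).ratio()
```

The counts operate on raw text while `comparable_text` is used only for similarity, so normalization never hides a spacing problem and never touches the output Markdown.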
### WP13.2: Page Mapping Boundary
Actions:
- Derive source page numbers from `engine_options.chunk` when chunking is active.
- Use project page records and any reliable raw structured page count to decide whether page-level comparison is possible.
- If Markdown cannot be mapped to pages reliably, produce `TEXT_PAGE_MAPPING_UNCERTAIN` and avoid pretending per-page metrics are exact.
- For the initial implementation, allow a conservative fallback for single-page mocked outputs and chunk outputs where one Markdown file corresponds to a known source page range.
Expected output:
- Page-level diagnostics are only marked `checked` when the mapping is credible.
- Ambiguous cases are visible in metadata/report instead of producing misleading page metrics.
### WP13.3: Metadata And Warning Integration
Actions:
- Add warning codes in `src/pdf2md/ir.py`.
- Add text fidelity fields to metadata without changing existing top-level fields used by current tests.
- Extend `build_summary()` to include text fidelity summary counts when diagnostics are present.
- Ensure warnings retain `page_index` where available.
- Preserve JSON serializability and deterministic key ordering on write.
Expected output:
- Metadata contains compact page-level text fidelity diagnostics and summary counts.
- Existing metadata consumers remain compatible.
### WP13.4: Report Integration
Actions:
- Extend `render_report()` to render a `## Text Fidelity` section when diagnostics exist.
- Keep the report derived from metadata and quality results.
- Include low-fidelity pages and replacement candidate pages in human-readable form.
- Do not include full extracted page text in the report.
Expected output:
- A human can identify which pages need attention without opening metadata JSON first.
### WP13.5: Conversion And Recheck Integration
Actions:
- Run text fidelity diagnostics during `convert` after final Markdown preparation and before metadata/report writing.
- Run the same diagnostics during `recheck` when the original source PDF path still exists.
- If the source PDF is missing during `recheck`, preserve existing behavior and add a clear nonfatal warning or omit diagnostics.
- Keep chunked conversion page ranges tied to original source page numbers.
Expected output:
- Fresh conversions and rechecks can produce text fidelity diagnostics without rerunning MinerU.
### WP13.6: Tests
Default fast tests:
- pypdf extraction boundary handles generated local PDFs without requiring real MinerU or sample files.
- Hangul count, unexpected CJK count, and spacing anomaly ratio helpers use direct Korean/CJK strings.
- Markdown text stripping ignores math, image links, fenced code, and inline code.
- Similarity score is deterministic for equivalent and degraded text.
- Metadata contains text fidelity summary fields when diagnostics are present.
- Report contains `## Text Fidelity` and page-level warning summaries.
- Conversion with a fake adapter records `TEXT_FIDELITY_LOW` when Markdown omits Hangul from a source-text PDF.
- Recheck reruns diagnostics when source PDF exists.
- Missing source PDF during recheck remains nonfatal.
Optional local validation:
- Convert the local 2007 Korean shell-structure sample with chunking to ignored `outputs\`.
- Confirm the report flags the pages where the previous output had missing Hangul and unexpected CJK characters.
- Do not commit sample PDFs or generated outputs.
## Acceptance Criteria
- Default tests pass without real MinerU, GPU, model files, network, Obsidian, or `samples/`.
- Diagnostics are local-only and use pypdf source text only from the local PDF.
- Metadata JSON records page-level text fidelity metrics where page mapping is credible.
- Metadata summary records aggregate text fidelity counts.
- `<stem>.report.md` includes a text fidelity section when diagnostics exist.
- Suspicious Korean text loss produces structured warnings with page provenance where available.
- Replacement candidate markers are recorded only as diagnostics and do not alter Markdown content.
- Existing math, asset, table, chunk, strict-local, and UI behavior remains unchanged.
## Hard Failure Criteria
- Markdown body text is replaced automatically in this sprint.
- Page-level metrics are reported as exact when page mapping is uncertain.
- Diagnostics upload PDFs, page text, Markdown, or extracted text to any remote service.
- Default tests require MinerU, CUDA/GPU, model files, network, Obsidian, or `samples/`.
- Existing output schema fields are removed or renamed.
- `samples/`, generated `outputs/`, or `dist/pdf2md-ui.exe` are committed.
## Verification Commands
```powershell
uv run pytest tests/test_text_fidelity.py tests/test_metadata.py tests/test_report.py tests/test_conversion.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```
Optional local validation:
```powershell
$env:MINERU_MODEL_SOURCE='local'
$pdf = (Get-ChildItem samples -Filter '2007*.pdf' | Select-Object -First 1).FullName
uv run pdf2md convert $pdf --out outputs\sprint13-2007-text-fidelity --overwrite --chunk-pages 5
```
## Handoff Requirements
After implementation:
- Update `PROGRESS.md` with files changed, commands run, test outcomes, optional sample validation outcome, known failures, residual risks, and next action.
- Archive completed implementation details in `docs/WORKARCHIVE.md` after verification.
- Keep sample PDFs, generated outputs, and build artifacts out of the commit.
- Record whether page-level mapping was exact, approximate, or unavailable for the validated sample.
## Implementation Handoff
Files changed:
- `src/pdf2md/text_fidelity.py`
- `src/pdf2md/ir.py`
- `src/pdf2md/metadata.py`
- `src/pdf2md/report.py`
- `src/pdf2md/conversion.py`
- `tests/test_text_fidelity.py`
- `tests/test_metadata.py`
- `tests/test_report.py`
- `tests/test_conversion.py`
- `ARCHITECTURE.md`
- `PLAN.md`
- `PROGRESS.md`
- `docs/WORKARCHIVE.md`
- `docs/V1IMPLEMENTATIONPLAN.md`
Verification:
- `uv run pytest tests/test_text_fidelity.py tests/test_metadata.py tests/test_report.py tests/test_conversion.py`: passed 49 tests.
- `uv run pytest`: passed 198 tests with 1 optional skip.
Known failures:
- None in the default fast test suite.
Residual risks:
- Page-level Markdown mapping is only scored when credible. Multi-page Markdown without reliable page boundaries is reported as `TEXT_PAGE_MAPPING_UNCERTAIN` rather than guessed.
- Automatic body-text replacement remains out of scope and is not implemented.
- Optional real MinerU validation on the local 2007 Korean shell-structure sample was not run during implementation to avoid a long GPU conversion.
## Future Sprint Boundary
A later sprint may implement controlled body-text replacement from pypdf text after Sprint 13 diagnostics show reliable thresholds. That future sprint must have its own contract and must preserve math, tables, figures, asset links, and Markdown structure from MinerU unless explicitly redesigned.
+378
@@ -0,0 +1,378 @@
# Sprint 14 Contract: Single-Page Conversion With Grouped Outputs
Status: Implemented
Last updated: 2026-05-11
## Objective
Replace the current fixed-size pre-conversion chunking behavior with a safer long-PDF workflow:
1. When chunk mode is active, split the source PDF into one-page temporary PDFs.
2. Convert each one-page PDF sequentially through the existing local MinerU CLI adapter.
3. Merge successfully converted page Markdown into grouped output files, one file per configured output group of pages.
4. Keep the default output group size at 20 pages when `--chunk-pages` is supplied without a value.
This sprint is motivated by local evidence from `samples/2007쉘구조물의유한요소해석에대하여.pdf`: a 5-page MinerU input chunk stalled on GTX 1070 Ti 8GB, while one-page conversion completed all 13 pages.
## Current Precondition
- MinerU 3.1.0 remains the only conversion engine.
- Conversion runs through direct local `mineru` CLI execution only.
- Strict-local allows only the direct CLI and MinerU CLI-internal temporary local `mineru-api`; remote API/backend paths remain prohibited.
- `pypdf` is already available and used for local PDF chunk planning and temporary chunk PDF writing.
- `pdf2md convert` currently supports `--chunk-pages [PAGES]`.
- Chunk mode currently treats `chunk_pages` as the MinerU input PDF page count and writes one final Markdown file per input chunk.
- `convert_pdf(..., chunk_pages=N)` currently returns `BatchConversionResult` in chunk mode.
- Sprint 13 text fidelity diagnostics are most accurate when each MinerU Markdown output maps to exactly one source page.
## Contract Assumptions
- Keep chunk mode opt-in for this sprint. If `chunk_pages` is `None`, the existing non-chunked full-PDF conversion path remains unchanged.
- Keep the public option name `--chunk-pages` for CLI/API compatibility, but redefine its behavior in chunk mode as the output group size, not the MinerU input size.
- If `--chunk-pages` is present without a value, use `DEFAULT_CHUNK_PAGES == 20` as the output group size.
- In chunk mode, even a PDF with fewer than `chunk_pages` pages is converted internally one page at a time and emitted as one grouped output file.
- Final grouped outputs are the public conversion results. Temporary per-page Markdown, metadata, reports, assets, and one-page PDFs are not retained unless a later sprint explicitly adds debug retention.
## Touched Surfaces
Allowed during implementation:
- `src/pdf2md/pdf_splitter.py`
- `src/pdf2md/conversion.py`
- `src/pdf2md/paths.py`
- `src/pdf2md/metadata.py`
- `src/pdf2md/report.py`
- `src/pdf2md/cli.py`
- `src/pdf2md_ui/app.py`
- `src/pdf2md_ui/runner.py`
- `tests/test_pdf_splitter.py`
- `tests/test_conversion.py`
- `tests/test_cli.py`
- `tests/test_paths.py`
- `tests/test_metadata.py`
- `tests/test_report.py`
- `tests/test_ui_runner.py`
- `README.md`
- `ARCHITECTURE.md`
- `docs/V1IMPLEMENTATIONPLAN.md`
- `PLAN.md`
- `PROGRESS.md`
- `docs/WORKARCHIVE.md` after implementation
Allowed if a focused helper boundary keeps `conversion.py` simpler:
- Create `src/pdf2md/page_grouping.py`
- Create `tests/test_page_grouping.py`
Not allowed:
- Adding another conversion engine or runtime engine selector.
- Running page conversions in parallel by default. GTX 1070 Ti 8GB memory pressure makes sequential conversion the safe default.
- Adding cloud OCR, hosted LLM/VLM, remote document parsing, `--api-url`, router mode, HTTP client backends, or remote OpenAI-compatible endpoints.
- Making default tests depend on real MinerU, GPU, model files, network, Obsidian, MathJax, or `samples/`.
- Committing sample PDFs, generated `outputs/`, retained temporary page outputs, or `dist/pdf2md-ui.exe`.
## Product Behavior
### Activation
Existing non-chunked conversion remains unchanged:
```powershell
uv run pdf2md convert paper.pdf --out outputs
```
Grouped page conversion is enabled by `--chunk-pages`:
```powershell
uv run pdf2md convert paper.pdf --out outputs --chunk-pages
uv run pdf2md convert paper.pdf --out outputs --chunk-pages 20
uv run pdf2md convert paper.pdf --out outputs --chunk-pages 1
```
Behavior:
- `--chunk-pages` means output group size.
- `--chunk-pages 20` converts pages 1, 2, 3, ... as independent one-page MinerU jobs, then emits grouped outputs covering pages 1-20, 21-40, and so on.
- `--chunk-pages 1` emits one final output file per source page.
- `convert_pdf(..., chunk_pages=N)` still returns `BatchConversionResult`; each `ConversionResult` represents one final grouped output file, not each internal one-page MinerU run.
### Output Naming
Use the existing part/page-range naming shape for grouped outputs:
```text
<stem>.part-001.pages-001-020.md
<stem>.part-001.pages-001-020.metadata.json
<stem>.part-001.pages-001-020.report.md
<stem>.part-001.pages-001-020.assets/
<stem>.part-002.pages-021-040.md
...
```
If a 13-page PDF is converted with `--chunk-pages 20`, it emits:
```text
<stem>.part-001.pages-001-013.md
<stem>.part-001.pages-001-013.metadata.json
<stem>.part-001.pages-001-013.report.md
<stem>.part-001.pages-001-013.assets/
```
This is an intentional behavior change from Sprint 10: short PDFs in chunk mode no longer bypass chunk mode and no longer write `<stem>.md`.
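The naming shape above is mechanical enough to express as a small helper. This is a sketch with hypothetical function names; only the resulting file-name pattern comes from the contract.

```python
def grouped_output_stem(stem: str, part_index: int, page_start: int, page_end: int) -> str:
    """Build '<stem>.part-NNN.pages-SSS-EEE' with zero-padded, one-based numbers."""
    return f"{stem}.part-{part_index:03d}.pages-{page_start:03d}-{page_end:03d}"


def grouped_output_names(stem: str, part_index: int, page_start: int, page_end: int) -> dict:
    """Return the public file and directory names for one grouped output."""
    base = grouped_output_stem(stem, part_index, page_start, page_end)
    return {
        "markdown": f"{base}.md",
        "metadata": f"{base}.metadata.json",
        "report": f"{base}.report.md",
        "assets_dir": f"{base}.assets",
    }
```

For a 13-page PDF named `paper.pdf` with group size 20, `grouped_output_stem("paper", 1, 1, 13)` yields `paper.part-001.pages-001-013`.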
### Internal Page Conversion
For every source page in chunk mode:
- Write a one-page temporary PDF with pypdf.
- Run the existing local MinerU adapter against that one-page PDF.
- Normalize Markdown, copy page assets into a temporary page assets directory, run MathJax checks/repair, and run Sprint 13 text fidelity diagnostics against the original source page.
- Delete the one-page temporary PDF and temporary per-page final files after grouped output generation.
The implementation should reuse existing conversion primitives where practical, but it must avoid writing final public files for every page before grouping.
### Markdown Grouping
For each output group:
- Concatenate successful page Markdown in source page order.
- Separate pages with blank lines and an HTML comment that is invisible in Obsidian preview:
```markdown
<!-- source-page: 7 -->
```
- Do not add visible page headings or instructional text.
- If a page conversion fails, do not invent Markdown for that page. Add an invisible comment at the page boundary:
```markdown
<!-- source-page: 7 conversion failed; see report -->
```
- Preserve Obsidian-friendly math delimiters and display math spacing after concatenation.
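A minimal sketch of the concatenation rules, assuming that each page is represented as a `(source_page_number, markdown_or_None)` pair and that the invisible marker is placed before each page's content. Both the data shape and the marker placement are assumptions; the comment formats are taken from the rules above.

```python
from typing import List, Optional, Tuple


def group_page_markdown(pages: List[Tuple[int, Optional[str]]]) -> str:
    """Join per-page Markdown in source order with invisible page markers.

    A page whose Markdown is None is treated as a failed conversion.
    """
    parts = []
    for page_number, markdown in sorted(pages, key=lambda item: item[0]):
        if markdown is None:
            # Never invent content for a failed page; leave only a marker.
            parts.append(f"<!-- source-page: {page_number} conversion failed; see report -->")
            continue
        parts.append(f"<!-- source-page: {page_number} -->")
        body = markdown.strip("\n")
        if body:
            parts.append(body)
    # Blank lines between blocks keep display math on its own lines.
    return "\n\n".join(parts) + "\n"
```

Joining with blank lines means a page that ends with `$$...$$` and a page that starts with text can never fuse into one paragraph.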
### Asset Grouping
Assets from temporary per-page outputs must be copied into the grouped assets directory with collision-proof names.
Recommended destination layout:
```text
<stem>.part-001.pages-001-020.assets/page-001/<asset-name>
<stem>.part-001.pages-001-020.assets/page-002/<asset-name>
```
Markdown image links must be rewritten to the grouped assets directory. This keeps repeated MinerU asset filenames from different pages from overwriting each other.
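The link rewrite can be illustrated with a regular expression over Markdown image links. This sketch assumes that every image target in a per-page Markdown file is a local relative path without spaces or parentheses; the function name and parameters are hypothetical, and copying the files themselves is left out.

```python
import re
from pathlib import PurePosixPath

# Matches ![alt](target) where target contains no spaces or parentheses.
_IMAGE_LINK = re.compile(r"(!\[[^\]]*\]\()([^)\s]+)(\))")


def rewrite_page_image_links(markdown: str, page_number: int, grouped_assets_dir: str) -> str:
    """Point each image link at '<grouped assets dir>/page-NNN/<asset name>'."""
    def replace(match: re.Match) -> str:
        # Keep only the file name; the per-page folder prevents collisions.
        name = PurePosixPath(match.group(2).replace("\\", "/")).name
        target = f"{grouped_assets_dir}/page-{page_number:03d}/{name}"
        return match.group(1) + target + match.group(3)

    return _IMAGE_LINK.sub(replace, markdown)
```

Two pages that both produce an asset with the same file name then resolve to `page-001/<name>` and `page-002/<name>`, so neither overwrites the other.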
### Metadata And Report Grouping
Grouped metadata must be derived from per-page conversion records plus group-level checks.
Required metadata behavior:
- `source_pdf` remains the original source PDF path.
- `source_sha256` remains the original source PDF hash.
- `pages` contains one page record per source page in the group.
- Page indexes in grouped metadata are group-local zero-based indexes.
- Original source page numbers remain visible in chunk/page conversion provenance.
- Warnings from per-page conversions are preserved with adjusted group-local page indexes.
- Warnings for failed page conversions are added with original source page context.
- `text_fidelity` records are carried from one-page checks and keep exact `source_page_number` values.
- Summary counts are aggregated from the grouped metadata and grouped Markdown.
Required `engine_options` shape:
```json
{
  "chunk": {
    "original_source_pdf": "...",
    "chunk_index": 1,
    "total_chunks": 3,
    "source_page_start": 1,
    "source_page_end": 20,
    "chunk_page_count": 20
  },
  "page_conversion": {
    "mode": "single_page",
    "mineru_input_page_count": 1,
    "output_group_page_count": 20,
    "failed_source_pages": []
  }
}
```
Report Markdown must continue to include the existing chunk context line and should add a concise page-conversion line, for example:
```text
- Page conversion mode: single-page MinerU inputs, grouped output size: 20
```
## Failure Policy
- Convert pages sequentially.
- If a page fails, continue with later pages.
- If at least one page in a group succeeds, write the grouped Markdown/metadata/report and mark final status `partial`.
- If every page in a group fails, return a failed `ConversionResult` for that grouped output and do not write Markdown for that group.
- Failed pages must be visible in metadata/report warnings.
- There is no silent fallback and no retry loop in this sprint.
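The status rule can be sketched in a few lines. The contract names only `partial` and a failed result; the `"success"` label for a group with no failed pages is an assumption of this example, as is the function name.

```python
from typing import Sequence


def group_status(page_succeeded: Sequence[bool]) -> str:
    """Derive a grouped-output status from per-page outcomes."""
    if not any(page_succeeded):
        # Every page failed (or the group is empty): no Markdown is written.
        return "failed"
    if all(page_succeeded):
        return "success"  # assumed label; not named in the contract
    # At least one page succeeded and at least one failed.
    return "partial"
```

A caller would write the grouped Markdown and report for any status other than `"failed"`.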
## Architecture Plan
### WP14.1: Page And Group Planning
Actions:
- Extend `pdf_splitter.py` or add `page_grouping.py` with project-owned records for:
  - one-page MinerU input plans,
  - final output group plans,
  - original source page ranges,
  - deterministic output stems.
- Keep pypdf page extraction local and temporary.
- Validate output group size as a positive integer.
- Plan output groups before conversion starts so overwrite/conflict behavior remains deterministic.
Expected output:
- A 41-page PDF with group size 20 plans 41 one-page MinerU inputs and 3 final grouped outputs.
- A 13-page PDF with group size 20 plans 13 one-page MinerU inputs and 1 final grouped output.
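The planning arithmetic behind these expected outputs can be sketched as below. The function names are hypothetical; the stem format follows the `part-001.pages-001-013` naming used elsewhere in this contract.

```python
from typing import List, Tuple


def plan_output_groups(page_count: int, group_size: int) -> List[Tuple[int, int]]:
    """Return inclusive one-based (start, end) source page ranges per group."""
    if isinstance(group_size, bool) or not isinstance(group_size, int) or group_size <= 0:
        raise ValueError("group size must be a positive integer")
    if page_count < 0:
        raise ValueError("page count must not be negative")
    return [
        (start, min(start + group_size - 1, page_count))
        for start in range(1, page_count + 1, group_size)
    ]


def group_stem(part_index: int, start: int, end: int) -> str:
    """Deterministic stem suffix for one grouped output."""
    return f"part-{part_index:03d}.pages-{start:03d}-{end:03d}"
```

Each source page still becomes its own one-page MinerU input; only the final outputs follow these ranges.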
### WP14.2: Conversion Orchestration
Actions:
- Rework chunk-mode `convert_pdf()` and `convert_input()` orchestration so `chunk_pages` creates grouped output tasks.
- Run one-page MinerU inputs in source-page order.
- Keep temporary page PDFs and intermediate page outputs under local temporary directories.
- Keep `BatchConversionResult` at the grouped-output level.
- Keep strict-local validation unchanged.
Expected output:
- The public API keeps returning multiple grouped results in chunk mode while the adapter is called once per source page internally.
### WP14.3: Markdown And Asset Group Assembly
Actions:
- Build a focused helper to merge page Markdown and page assets into a grouped output.
- Insert invisible `<!-- source-page: N -->` boundaries.
- Rewrite per-page asset links to `page-NNN/` asset subdirectories.
- Run final group-level local quality checks after asset rewriting.
Expected output:
- Grouped Markdown renders in Obsidian and assets do not collide across pages.
### WP14.4: Metadata, Warnings, And Report Assembly
Actions:
- Aggregate per-page metadata into grouped metadata.
- Adjust page indexes from page-local `0` to group-local indexes.
- Preserve original source page numbers in `engine_options` and text fidelity records.
- Add `page_conversion` engine options.
- Add a report line for single-page conversion mode and grouped output size.
Expected output:
- Metadata/report can explain both facts: MinerU saw one page at a time, while the user received grouped Markdown files.
### WP14.5: CLI, UI, And Documentation
Actions:
- Update CLI help for `--chunk-pages` from "pre-conversion PDF chunking" to "group converted pages into output files of N pages; MinerU runs one page at a time."
- Update README and architecture docs with the new behavior.
- Update the Windows UI label/help text so the field represents output group size.
- Keep runner command construction using `--chunk-pages N`.
Expected output:
- Users do not confuse `--chunk-pages 20` with a 20-page MinerU input.
### WP14.6: Tests
Default fast tests:
- Generated blank local PDFs verify page count and group planning for 1, 13, 20, 21, 40, and 41 pages.
- `--chunk-pages` without a value still passes `20`.
- `convert_pdf(..., chunk_pages=20)` for 41 pages calls the fake adapter 41 times and returns 3 grouped `ConversionResult` objects.
- `convert_pdf(..., chunk_pages=20)` for 13 pages calls the fake adapter 13 times and returns 1 grouped output named `part-001.pages-001-013`.
- `convert_pdf(..., chunk_pages=1)` returns one grouped output per source page.
- Temporary one-page PDFs and temporary per-page outputs are deleted after conversion.
- A failed internal page conversion does not stop later pages and appears in grouped metadata/report.
- A group with only failed pages returns a failed result and writes no Markdown.
- Asset filenames from different pages do not collide in the grouped assets directory.
- Per-page warnings and text fidelity records are adjusted to group-local page indexes while preserving original source page numbers.
- Existing non-chunked conversion tests keep passing unchanged.
- UI runner tests continue to build fixed argument lists with `shell=False`.
Optional local validation:
```powershell
$env:MINERU_MODEL_SOURCE='local'
$pdf = (Get-ChildItem samples -Filter '2007*.pdf' | Select-Object -First 1).FullName
uv run pdf2md convert $pdf --out outputs\sprint14-2007-page-grouped --overwrite --chunk-pages
```
Expected optional validation:
- The 13-page Korean sample emits one grouped Markdown file for pages 1-13.
- Metadata/report show exact page-level text fidelity records.
- Generated outputs stay ignored and uncommitted.
## Acceptance Criteria
- Chunk mode runs MinerU on one-page temporary PDFs only.
- `chunk_pages` controls final grouped output page count.
- Default group size remains 20 when `--chunk-pages` is supplied without a value.
- Grouped Markdown, metadata JSON, report Markdown, and grouped assets directory are written.
- Grouped metadata preserves original source PDF, original source SHA-256, group page range, one-page conversion mode, page warnings, and text fidelity provenance.
- Failed page conversions are explicit, nonfatal to later pages, and visible in report/metadata.
- Default tests remain fast and local.
- Strict-local policy remains unchanged.
- Non-chunked conversion behavior remains backward-compatible.
## Hard Failure Criteria
- Chunk mode sends more than one source page to MinerU in a single temporary PDF.
- `--chunk-pages` continues to mean MinerU input chunk size after this sprint.
- Grouped outputs lose source page provenance or hide failed pages.
- Asset links collide or point outside the grouped assets directory.
- Default tests require real MinerU, GPU, model files, network, Obsidian, MathJax, or `samples/`.
- The implementation adds a remote API/backend path, alternate conversion engine, router mode, or OpenAI-compatible backend.
- Sample PDFs, generated outputs, retained temporary page outputs, or `dist/pdf2md-ui.exe` are committed.
## Verification Commands
```powershell
uv run pytest tests/test_pdf_splitter.py tests/test_conversion.py tests/test_cli.py tests/test_paths.py tests/test_metadata.py tests/test_report.py tests/test_ui_runner.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```
The optional local validation command is listed in WP14.6 and should be run only when a long GPU conversion is acceptable.
## Handoff Requirements
After implementation:
- Update `PROGRESS.md` with files changed, commands run, test outcomes, optional sample validation outcome, known failures, residual risks, and next action.
- Archive completed implementation details in `docs/WORKARCHIVE.md` after verification.
- Keep sample PDFs, generated outputs, retained temporary page outputs, and build artifacts out of the commit.
- Record whether the 2007 Korean sample was validated with grouped page conversion and how many grouped outputs were produced.
Implementation handoff on 2026-05-11:
- Implemented grouped page conversion in `src/pdf2md/conversion.py` with one-page temporary MinerU inputs and grouped public outputs.
- Added report output for `page_conversion` engine options.
- Updated CLI help, UI label text, README, architecture, implementation plan, and coordination/archive docs.
- Verification: targeted Sprint 14 tests passed, the 101-test related suite passed, and full `uv run pytest` passed 202 tests with 1 optional skip.
- Optional real MinerU validation on the 2007 Korean sample was not run during this implementation pass.
## Future Sprint Boundary
A later sprint may make grouped page conversion the default even without `--chunk-pages`, add resumable page caches, or add a debug option to retain intermediate per-page outputs. Those behaviors are intentionally out of Sprint 14 scope.
+431
@@ -0,0 +1,431 @@
# Sprint 15 Contract: NVIDIA GPU Detection And Auto MinerU Profile
Status: Implemented
Last updated: 2026-05-12
## Objective
Add a strict-local runtime profiling layer that detects installed NVIDIA GPUs and applies conservative MinerU environment tuning by default.
The default runtime profile is `auto`. In `auto`, the converter should keep 8GB and pre-Turing GPUs conservative, while allowing a slightly more aggressive local MinerU configuration only when the selected NVIDIA GPU has at least 16GB VRAM and no pre-Turing compatibility warning.
This sprint is motivated by local evidence from `samples\FourNodeQuadrilateralShellElementMITC4.pdf`: Sprint 14's one-page conversion path used `cuda:0` correctly, but the GTX 1070 Ti 8GB stayed near full VRAM use and stalled on source page 2. The next useful test should run on a stronger NVIDIA GPU with explicit runtime diagnostics and reproducible MinerU environment settings.
## Source Basis
Use these source-backed facts during implementation:
- MinerU CLI supports `mineru -p <input_path> -o <output_path>` and, without `--api-url`, launches a temporary local `mineru-api`: https://opendatalab.github.io/MinerU/usage/cli_tools/
- MinerU CLI documents `-b/--backend`, `-f/--formula`, `-t/--table`, `--api-url`, and related options, but this project must not expose remote/API or backend selection paths in v1: https://opendatalab.github.io/MinerU/usage/cli_tools/
- MinerU environment variables include `MINERU_PDF_RENDER_THREADS`, `MINERU_PROCESSING_WINDOW_SIZE`, `MINERU_API_MAX_CONCURRENT_REQUESTS`, and timeout settings: https://opendatalab.github.io/MinerU/usage/cli_tools/
- MinerU advanced CLI docs support selecting visible GPU devices with `CUDA_VISIBLE_DEVICES`: https://opendatalab.github.io/MinerU/usage/advanced_cli_parameters/
- MinerU local deployment docs list auto-engine GPU requirements around 8GB+ VRAM and GPU acceleration for Volta-or-later devices: https://opendatalab.github.io/MinerU/quick_start/
- MinerU extension docs say `vllm` and `lmdeploy` acceleration extras are alternatives and should not both be installed just for this sprint: https://opendatalab.github.io/MinerU/quick_start/extension_modules/
Access date for the source review: 2026-05-12.
## Current Precondition
- MinerU 3.1.0 remains the only conversion engine.
- Conversion runs through direct local `mineru` CLI execution only.
- Strict-local allows only the direct CLI and MinerU CLI-internal temporary local `mineru-api`; remote API/backend paths remain prohibited.
- `pdf2md convert` defaults to `--gpu cuda:0`.
- `MinerUAdapter` currently maps `cuda:N` to `MINERU_DEVICE_MODE=cuda` and `CUDA_VISIBLE_DEVICES=N`.
- `pdf2md doctor` already reports NVIDIA GPU visibility, PyTorch CUDA visibility, GPU names, and Pascal/pre-Turing warnings.
- Sprint 14 chunk mode runs one source page per MinerU invocation when `--chunk-pages` is active.
## Contract Assumptions
- Keep `--gpu cuda:0` as the default for backward compatibility with PRD and existing docs.
- Add `--gpu auto` as an opt-in GPU selection mode that chooses the visible NVIDIA GPU with the largest reported VRAM.
- Add `--mineru-profile {auto,safe,performance}` with default `auto`.
- Keep all conversion requests sequential in Sprint 15. Do not introduce parallel page conversion.
- Keep formula and table parsing enabled. Do not optimize by disabling required output quality features.
- Do not add `--backend`, `--api-url`, `--url`, router mode, HTTP client backend, remote OpenAI-compatible backend, or remote model server support.
- Treat MinerU environment tuning as best-effort. If GPU inventory cannot be read, continue with safe profile settings and a warning/provenance record rather than guessing aggressive values.
## Touched Surfaces
Allowed during implementation:
- Create `src/pdf2md/gpu.py`
- Create `src/pdf2md/mineru_profile.py`
- Modify `src/pdf2md/mineru_adapter.py`
- Modify `src/pdf2md/conversion.py`
- Modify `src/pdf2md/cli.py`
- Modify `src/pdf2md/doctor.py`
- Modify `src/pdf2md_ui/runner.py` only if the UI command builder needs profile passthrough
- Modify `src/pdf2md_ui/app.py` only if a minimal profile control is necessary
- Add `tests/test_gpu.py`
- Add `tests/test_mineru_profile.py`
- Modify `tests/test_mineru_adapter.py`
- Modify `tests/test_conversion.py`
- Modify `tests/test_cli.py`
- Modify `tests/test_doctor.py`
- Modify `tests/test_ui_runner.py` only if UI command construction changes
- Modify `README.md`
- Modify `ARCHITECTURE.md`
- Modify `PRD.md` if CLI option documentation changes
- Modify `docs/V1IMPLEMENTATIONPLAN.md`
- Modify `PLAN.md`
- Modify `PROGRESS.md`
- Modify `docs/WORKARCHIVE.md` after implementation
Not allowed:
- Adding another conversion engine or runtime engine selector.
- Passing `--api-url`, `--url`, or any remote endpoint to MinerU.
- Adding `mineru-router`, HTTP client backend, or OpenAI-compatible backend usage.
- Installing `vllm`, `lmdeploy`, CUDA packages, models, or any runtime package automatically.
- Changing the default conversion engine or disabling formula/table recognition.
- Making default tests depend on real MinerU, GPU, CUDA, PyTorch, model files, network, Obsidian, MathJax, or `samples/`.
- Committing sample PDFs, generated `outputs/`, retained temporary page outputs, local model files, or `dist/pdf2md-ui.exe`.
## Product Behavior
### CLI
Existing behavior remains valid:
```powershell
uv run pdf2md convert paper.pdf --out outputs
uv run pdf2md convert paper.pdf --out outputs --gpu cuda:0
```
New behavior:
```powershell
uv run pdf2md convert paper.pdf --out outputs --mineru-profile auto
uv run pdf2md convert paper.pdf --out outputs --mineru-profile safe
uv run pdf2md convert paper.pdf --out outputs --mineru-profile performance
uv run pdf2md convert paper.pdf --out outputs --gpu auto --mineru-profile auto
```
Rules:
- `--mineru-profile` defaults to `auto`.
- `--gpu cuda:N` selects a concrete CUDA index and tunes MinerU for that selected GPU when inventory is available.
- `--gpu N` is still normalized to `cuda:N`.
- `--gpu auto` selects the visible NVIDIA GPU with the largest VRAM from local GPU inventory.
- If `--gpu auto` cannot find a visible NVIDIA GPU, fail clearly before conversion rather than silently switching to CPU.
- If `--mineru-profile performance` is requested on a selected GPU below 16GB VRAM or with pre-Turing risk, downgrade to safe settings with a warning in metadata/report. Do not fail solely because performance was unsafe.
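A minimal sketch of how these flag rules could be parsed, assuming `argparse`; the project's actual CLI framework is not stated in this contract, and `normalize_gpu` and `build_parser` are hypothetical names. Inventory lookup and the performance downgrade are deliberately left out because they happen after parsing.

```python
import argparse


def normalize_gpu(value: str) -> str:
    """Normalize `N` to `cuda:N`; pass `cuda:N` and `auto` through."""
    text = value.strip().lower()
    if text == "auto":
        return "auto"
    if text.isdigit():
        return f"cuda:{int(text)}"
    if text.startswith("cuda:") and text[len("cuda:"):].isdigit():
        return f"cuda:{int(text[len('cuda:'):])}"
    raise argparse.ArgumentTypeError(f"unsupported GPU selector: {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Sketch only: the real `pdf2md convert` command has more options.
    parser = argparse.ArgumentParser(prog="pdf2md-convert-sketch")
    parser.add_argument("--gpu", type=normalize_gpu, default="cuda:0")
    parser.add_argument(
        "--mineru-profile",
        choices=("auto", "safe", "performance"),
        default="auto",
    )
    return parser
```

With `choices`, an invalid profile value is rejected at parse time, before any GPU inventory or conversion work starts.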
### Doctor
`pdf2md doctor` should report:
- All visible NVIDIA GPUs with index, name, total VRAM, and driver version from `nvidia-smi`.
- PyTorch CUDA device names and compute capabilities when available.
- Selected default GPU recommendation for `--gpu auto`.
- Recommended MinerU profile for the detected primary GPU.
- Existing Pascal/pre-Turing warnings.
Doctor must not require a real conversion, model load, network access, or package download.
### Auto Profile Policy
Use a small deterministic policy table. Values are intentionally conservative because the converter runs real PDFs and should prefer completion over peak throughput.
| Selected GPU | Auto policy | MinerU environment |
| --- | --- | --- |
| No GPU inventory, CUDA requested | Safe fallback with warning | `MINERU_PROCESSING_WINDOW_SIZE=1`, `MINERU_API_MAX_CONCURRENT_REQUESTS=1`, `MINERU_PDF_RENDER_THREADS=1` |
| Pre-Turing or VRAM < 12GB | Safe | `MINERU_PROCESSING_WINDOW_SIZE=1`, `MINERU_API_MAX_CONCURRENT_REQUESTS=1`, `MINERU_PDF_RENDER_THREADS=1` |
| 12GB <= VRAM < 16GB | Auto conservative | `MINERU_PROCESSING_WINDOW_SIZE=4`, `MINERU_API_MAX_CONCURRENT_REQUESTS=1`, `MINERU_PDF_RENDER_THREADS=2` |
| VRAM >= 16GB and Turing-or-newer | Auto moderately aggressive | `MINERU_PROCESSING_WINDOW_SIZE=8`, `MINERU_API_MAX_CONCURRENT_REQUESTS=1`, `MINERU_PDF_RENDER_THREADS=4` |
| Explicit `safe` | Safe regardless of GPU | `MINERU_PROCESSING_WINDOW_SIZE=1`, `MINERU_API_MAX_CONCURRENT_REQUESTS=1`, `MINERU_PDF_RENDER_THREADS=1` |
| Explicit `performance` on VRAM >= 16GB and Turing-or-newer | Performance | `MINERU_PROCESSING_WINDOW_SIZE=16`, `MINERU_API_MAX_CONCURRENT_REQUESTS=1`, `MINERU_PDF_RENDER_THREADS=4` |
| Explicit `performance` on weaker GPU | Downgraded safe with warning | safe values |
Do not set `MINERU_HYBRID_BATCH_RATIO` in Sprint 15 because MinerU docs describe it as commonly used for `hybrid-http-client`, which this project prohibits in v1.
Do not set backend CLI flags in Sprint 15. The default MinerU backend remains MinerU-owned.
## Architecture Plan
### WP15.1: GPU Inventory Boundary
Actions:
- Add `src/pdf2md/gpu.py`.
- Define immutable `GpuInfo` and `GpuInventory` records.
- Parse `nvidia-smi --query-gpu=index,name,memory.total,driver_version --format=csv,noheader,nounits`.
- Parse memory in MiB as an integer.
- Mark pre-Turing risk using the existing name-based heuristic for GTX 10xx and other pre-Turing GPU names.
- Optionally enrich compute capability through PyTorch when available, but keep PyTorch optional and mockable.
- Provide `select_gpu(gpus, requested)` for `cuda:N`, `N`, and `auto`.
Expected output:
- GPU detection is independently testable with captured command output strings.
- No real `nvidia-smi`, GPU, or PyTorch is needed in default tests.
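A minimal sketch of the inventory boundary, assuming the four-column CSV shape produced by the `nvidia-smi` query above. The record fields, the choice to skip malformed lines, and the error messages are assumptions for illustration; the pre-Turing heuristic and the `GpuInventory` wrapper are omitted to keep the example short.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GpuInfo:
    index: int
    name: str
    memory_total_mib: int
    driver_version: str


def parse_nvidia_smi_csv(output: str) -> List[GpuInfo]:
    """Parse `index, name, memory.total, driver_version` CSV lines."""
    gpus = []
    for line in output.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            continue
        index, name, memory, driver = fields
        try:
            gpus.append(GpuInfo(int(index), name, int(memory), driver))
        except ValueError:
            continue  # assumption: skip lines with a malformed index or memory
    return gpus


def select_gpu(gpus: List[GpuInfo], requested: str) -> GpuInfo:
    """Resolve `cuda:N`, `N`, or `auto` against the parsed inventory."""
    if not gpus:
        raise ValueError("no visible NVIDIA GPU")
    text = requested.strip().lower()
    if text == "auto":
        return max(gpus, key=lambda gpu: gpu.memory_total_mib)
    if text.startswith("cuda:"):
        text = text[len("cuda:"):]
    if not text.isdigit():
        raise ValueError(f"unsupported GPU selector: {requested!r}")
    for gpu in gpus:
        if gpu.index == int(text):
            return gpu
    raise ValueError(f"GPU index {text} is not visible")
```

Because the parser takes a string, tests can feed it captured command output instead of invoking `nvidia-smi`.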
### WP15.2: MinerU Profile Policy
Actions:
- Add `src/pdf2md/mineru_profile.py`.
- Define supported profile names: `auto`, `safe`, `performance`.
- Define a result record containing:
  - requested profile,
  - applied profile,
  - selected GPU index if known,
  - selected GPU name if known,
  - selected GPU VRAM MiB if known,
  - environment variables to set,
  - warnings or info messages as project `WarningRecord` values.
- Implement the policy table above.
- Keep profile environment values in a small allowlist.
Expected output:
- The policy can be tested without running MinerU.
- Performance profile cannot silently overcommit weak GPUs.
### WP15.3: Adapter Environment Integration
Actions:
- Extend `MinerUOptions` with `mineru_profile: str = "auto"` and optional resolved profile metadata.
- Keep strict-local validation for every option string.
- Update `_mineru_environment()` to merge:
  - `MINERU_DEVICE_MODE=cuda`,
  - `CUDA_VISIBLE_DEVICES=<selected index>`,
  - profile environment variables from `mineru_profile.py`.
- Restore the previous environment values after subprocess execution.
- Include profile details in `engine_options`.
Expected output:
- Real MinerU still receives only the direct local CLI command shape:
```text
mineru -p <input> -o <output>
```
- Tuning is done through local environment variables, not remote/API/backend flags.
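The set-and-restore behavior can be sketched with a context manager. The names `temporary_environ`, `build_mineru_environment`, and `ALLOWED_PROFILE_VARS` are hypothetical; the variable names themselves come from this contract.

```python
import os
from contextlib import contextmanager
from typing import Dict, Iterator

ALLOWED_PROFILE_VARS = frozenset(
    {
        "MINERU_PROCESSING_WINDOW_SIZE",
        "MINERU_API_MAX_CONCURRENT_REQUESTS",
        "MINERU_PDF_RENDER_THREADS",
    }
)


def build_mineru_environment(gpu_index: int, profile_env: Dict[str, str]) -> Dict[str, str]:
    """Merge device selection with allowlisted profile variables."""
    unexpected = set(profile_env) - ALLOWED_PROFILE_VARS
    if unexpected:
        raise ValueError(f"profile variables not allowlisted: {sorted(unexpected)}")
    env = {"MINERU_DEVICE_MODE": "cuda", "CUDA_VISIBLE_DEVICES": str(gpu_index)}
    env.update(profile_env)
    return env


@contextmanager
def temporary_environ(overrides: Dict[str, str]) -> Iterator[None]:
    """Apply overrides to os.environ and restore prior values on exit."""
    previous = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, value in previous.items():
            if value is None:
                os.environ.pop(name, None)  # variable did not exist before
            else:
                os.environ[name] = value
```

An alternative design is to pass a copied environment through `subprocess.run(..., env=...)`, which avoids mutating the parent process at all; the sketch follows the set-and-restore wording of this work package instead.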
### WP15.4: Conversion And CLI Wiring
Actions:
- Add `--mineru-profile` to `pdf2md convert`.
- Accept `--gpu auto`.
- Resolve selected GPU and profile before calling the adapter.
- Surface profile warnings in conversion metadata/report warnings.
- Preserve existing `--gpu cuda:0` default.
- Ensure `convert_pdf()` can receive the profile through the Python API.
Expected output:
- Default conversions use `mineru_profile=auto`.
- Existing calls with no new flags continue to work.
- Metadata explains which profile was applied.
### WP15.5: Doctor Reporting
Actions:
- Reuse `gpu.py` inventory parsing in `doctor.py`.
- Keep the existing `gpu` and `pytorch` checks, but make GPU details more explicit.
- Add a doctor detail line for auto-selected GPU and recommended profile.
- Keep warning-only behavior for Pascal/pre-Turing GPUs.
Expected output:
- On a stronger PC, `pdf2md doctor` shows enough evidence to decide whether `auto` or `performance` is appropriate.
- On the current GTX 1070 Ti, doctor still warns and recommends safe/conservative behavior.
### WP15.6: Documentation
Actions:
- Update README setup and conversion docs with `--mineru-profile`.
- Update ARCHITECTURE to document that tuning uses strict-local environment variables only.
- Update PRD CLI section if the new public flag is added.
- Update `docs/V1IMPLEMENTATIONPLAN.md`, `PLAN.md`, and `PROGRESS.md`.
- Archive implementation details in `docs/WORKARCHIVE.md` only after implementation and verification.
Expected output:
- Users can move the repo to a stronger NVIDIA GPU PC, run `pdf2md doctor`, and understand the selected profile.
## Tests
Default fast tests:
- GPU inventory parser handles one RTX GPU, multiple GPUs, no GPU lines, and malformed memory fields.
- `select_gpu(..., "auto")` selects the largest VRAM GPU.
- `select_gpu(..., "cuda:1")` selects index 1 and errors when absent.
- `select_gpu(..., "1")` normalizes to index 1.
- `auto` profile returns safe values for GTX 1070 Ti 8GB.
- `auto` profile returns moderately aggressive values for an RTX GPU with 16GB or more.
- `performance` profile returns performance values only for 16GB+ Turing-or-newer GPUs.
- `performance` profile on GTX 1070 Ti downgrades to safe and returns a warning.
- Adapter sets and restores `MINERU_DEVICE_MODE`, `CUDA_VISIBLE_DEVICES`, `MINERU_PROCESSING_WINDOW_SIZE`, `MINERU_API_MAX_CONCURRENT_REQUESTS`, and `MINERU_PDF_RENDER_THREADS`.
- Strict-local validation rejects remote/API/backend-like option strings in profile-related fields.
- CLI default passes `mineru_profile=auto`.
- CLI accepts `--mineru-profile safe` and `--mineru-profile performance`.
- CLI rejects invalid profile values.
- Doctor report includes visible GPU details and recommended profile with mocked command outputs.
- Existing conversion, chunking, metadata, report, and UI tests remain green.
Optional local validation on a stronger NVIDIA GPU PC:
```powershell
uv run pdf2md doctor
$env:MINERU_MODEL_SOURCE='local'
uv run pdf2md convert samples\FourNodeQuadrilateralShellElementMITC4.pdf --out outputs\fournode-sprint15-auto --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local
```
Expected optional validation:
- Doctor reports the stronger GPU name, VRAM, and recommended profile.
- Conversion metadata records `mineru_profile` and selected GPU information.
- Generated outputs stay ignored and uncommitted.
## Acceptance Criteria
- `--mineru-profile auto` is the default conversion behavior.
- `auto` uses safe settings on the current GTX 1070 Ti 8GB and stronger settings only on 16GB+ Turing-or-newer NVIDIA GPUs.
- `--gpu auto` can choose the largest visible NVIDIA GPU without adding remote/runtime backend support.
- MinerU command shape remains direct local CLI only.
- Strict-local prohibitions remain enforced.
- `pdf2md doctor` provides actionable GPU/profile information.
- Metadata/report preserve the applied runtime profile.
- Default tests remain fast, mocked, local, and independent of real MinerU/GPU/model files/network/samples.
## Hard Failure Criteria
- Implementation adds runtime backend selection or exposes `--backend`.
- Implementation passes `--api-url`, `--url`, router, HTTP client backend, or remote OpenAI-compatible backend values.
- `auto` profile applies aggressive settings to GTX 1070 Ti 8GB or other pre-Turing/low-VRAM GPUs.
- Existing `--gpu cuda:0` behavior breaks.
- Profile tuning disables formula or table parsing.
- Doctor or tests require real GPU, real MinerU execution, model files, network, Obsidian, MathJax, or `samples/`.
- Sample PDFs, generated outputs, local model files, or `dist/pdf2md-ui.exe` are committed.
## Implementation Task Plan
### Task 1: GPU Inventory
Files:
- Create `src/pdf2md/gpu.py`
- Create `tests/test_gpu.py`
Steps:
- [x] Add failing tests for parsing `nvidia-smi` CSV output.
- [x] Add failing tests for `auto`, `cuda:N`, and numeric GPU selection.
- [x] Implement immutable GPU records and parser helpers.
- [x] Implement selection errors as `ValueError` with clear messages.
- [x] Run `uv run pytest tests/test_gpu.py`.
- [x] Commit GPU inventory boundary.
### Task 2: MinerU Profile Policy
Files:
- Create `src/pdf2md/mineru_profile.py`
- Create `tests/test_mineru_profile.py`
Steps:
- [x] Add failing tests for safe, auto, and performance profile policy.
- [x] Add tests proving 16GB+ Turing-or-newer GPUs get the moderately aggressive auto environment.
- [x] Add tests proving GTX 1070 Ti 8GB stays safe.
- [x] Implement the allowlisted environment mapping.
- [x] Run `uv run pytest tests/test_mineru_profile.py tests/test_gpu.py`.
- [x] Commit profile policy.
### Task 3: Adapter And Conversion Wiring
Files:
- Modify `src/pdf2md/mineru_adapter.py`
- Modify `src/pdf2md/conversion.py`
- Modify `tests/test_mineru_adapter.py`
- Modify `tests/test_conversion.py`
Steps:
- [x] Add failing adapter tests for profile environment variables and environment restoration.
- [x] Add failing conversion tests that metadata receives applied profile information.
- [x] Extend `MinerUOptions` and conversion options minimally.
- [x] Merge GPU and profile environment variables before the MinerU subprocess.
- [x] Run `uv run pytest tests/test_mineru_adapter.py tests/test_conversion.py tests/test_mineru_profile.py tests/test_gpu.py`.
- [x] Commit adapter/conversion wiring.
### Task 4: CLI And Doctor
Files:
- Modify `src/pdf2md/cli.py`
- Modify `src/pdf2md/doctor.py`
- Modify `tests/test_cli.py`
- Modify `tests/test_doctor.py`
Steps:
- [x] Add failing CLI tests for default `auto`, explicit `safe`, explicit `performance`, invalid profile rejection, and `--gpu auto`.
- [x] Add failing doctor tests for GPU inventory and recommended profile details.
- [x] Implement CLI argument parsing and doctor report additions.
- [x] Run `uv run pytest tests/test_cli.py tests/test_doctor.py tests/test_gpu.py tests/test_mineru_profile.py`.
- [x] Commit CLI and doctor wiring.
### Task 5: UI And Documentation
Files:
- Modify `src/pdf2md_ui/runner.py` only if explicit UI profile passthrough is needed
- Modify `src/pdf2md_ui/app.py` only if explicit UI profile control is needed
- Modify `tests/test_ui_runner.py` only if runner command construction changes
- Modify `README.md`
- Modify `ARCHITECTURE.md`
- Modify `PRD.md`
- Modify `docs/V1IMPLEMENTATIONPLAN.md`
- Modify `PLAN.md`
- Modify `PROGRESS.md`
- Modify `docs/WORKARCHIVE.md` after implementation
Steps:
- [x] Keep UI unchanged if default CLI `auto` profile is enough for the first implementation pass.
- [x] If UI exposes a profile control, add tests for fixed argument-list construction with `shell=False`.
- [x] Document `--mineru-profile`, `--gpu auto`, profile policy, strict-local boundaries, and stronger-PC validation command.
- [x] Run focused docs/UI tests if changed.
- [x] Run final verification commands.
- [x] Commit documentation and final coordination updates.
## Verification Commands
```powershell
uv run pytest tests/test_gpu.py tests/test_mineru_profile.py tests/test_mineru_adapter.py tests/test_conversion.py tests/test_cli.py tests/test_doctor.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```
The optional stronger-PC validation is listed in the Tests section and must remain an explicit opt-in.
## Handoff Requirements
After implementation:
- Update `PROGRESS.md` with files changed, commands run, test outcomes, optional stronger-PC validation outcome, known failures, residual risks, and next action.
- Archive completed implementation details in `docs/WORKARCHIVE.md`.
- Keep generated outputs, sample PDFs, local model files, and UI build artifacts out of the commit.
- Record the detected GPU, applied profile, and whether `samples\FourNodeQuadrilateralShellElementMITC4.pdf` completed on the stronger PC.
Implementation handoff:
- Files changed: `src/pdf2md/gpu.py`, `src/pdf2md/mineru_profile.py`, `src/pdf2md/mineru_adapter.py`, `src/pdf2md/conversion.py`, `src/pdf2md/cli.py`, `src/pdf2md/doctor.py`, docs, and focused tests.
- Commands run: `uv run pytest tests/test_gpu.py tests/test_mineru_profile.py tests/test_mineru_adapter.py tests/test_conversion.py tests/test_cli.py tests/test_doctor.py`; `uv run pytest`; `uv run pdf2md doctor`.
- Tests passed: targeted Sprint 15 suite passed 101 tests; full default suite passed 225 tests with 1 optional skip; local doctor returned WARN with expected GTX 1070 Ti safe-profile recommendation.
- Known failures: optional stronger-PC real MinerU conversion validation was not run in this workspace.
- Residual risks: GTX 1070 Ti 8GB remains likely to stall on hard pages; stronger-PC behavior still needs local runtime validation.
- Next action: on a stronger NVIDIA GPU PC, run `pdf2md doctor` and an explicit local conversion with `--gpu auto --mineru-profile auto`.
## Future Sprint Boundary
A later sprint may add page-level timeout handling, resumable page caches, or a performance mode that can run multiple page conversions concurrently on GPUs with enough VRAM. Those behaviors are intentionally out of Sprint 15 scope.
+412
@@ -0,0 +1,412 @@
# Sprint 16 Contract: Simplified Output Layout
Status: Implemented
Last updated: 2026-05-12
## Objective
Simplify conversion outputs so each input PDF gets one predictable output folder named after the PDF stem, all images live under one `images` folder, Markdown parts use `_001`, `_002` numbering, one human-readable report is written per PDF, and no metadata JSON file is persisted.
This sprint changes the public output contract. It supersedes the older v1 output layout that wrote sibling `<stem>.md`, `<stem>.assets`, `<stem>.metadata.json`, and `<stem>.report.md` files.
## Product Output Contract
For an input PDF:
```text
paper.pdf
```
and output root:
```text
outputs/
```
write:
```text
outputs/
  paper/
    paper_001.md
    paper_002.md
    paper_report.md
    images/
      ...
```
Rules:
- `paper` is the PDF stem, meaning the original filename without `.pdf`.
- A one-part conversion still writes `paper_001.md`.
- A multi-part conversion writes `paper_001.md`, `paper_002.md`, and so on.
- Part numbering uses at least three digits and grows only when the part count exceeds 999.
- All generated image and media assets for the PDF live under `paper/images/`.
- Markdown links must point to `images/<asset-name>`.
- The report is a single file at `paper/paper_report.md`.
- No `<stem>.metadata.json`, part metadata JSON, or sidecar metadata JSON is written.
- Internal metadata records may still be built in memory to produce reports, warnings, counts, and `ConversionResult` fields.
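The part-numbering rule can be sketched as follows. The function name is hypothetical, and padding every part to the wider width once the count exceeds 999 is one reading of the rule, adopted here as an assumption.

```python
def part_filename(stem: str, part_index: int, total_parts: int) -> str:
    """Return `<stem>_NNN.md` with at least three digits."""
    if not 1 <= part_index <= total_parts:
        raise ValueError("part index must be within 1..total_parts")
    # Three digits by default; widen only when the part count needs it.
    width = max(3, len(str(total_parts)))
    return f"{stem}_{part_index:0{width}d}.md"
```

Deriving the width from the total part count keeps filenames sortable within one output folder.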
## Contract Assumptions
- The user request "metadata is not needed" means metadata JSON should not be written as a user-facing output file. It does not mean removing internal metadata objects needed for report generation and warning aggregation.
- Keep `--chunk-pages` semantics from Sprint 14: when enabled, MinerU receives one source page per run and final Markdown files are grouped by `chunk_pages`.
- If `--chunk-pages` is absent, the whole PDF is still converted in one MinerU run and written as `<stem>_001.md`.
- Keep `--chunk-pages` without a value as the default grouping size of 20.
- Keep `--metadata` accepted as a backward-compatible no-op for one sprint, but update help text to say metadata JSON output is disabled in the simplified layout.
- `pdf2md recheck` remains supported only for legacy outputs that still have adjacent metadata JSON. New simplified outputs should fail recheck clearly until a later sprint designs metadata-free recheck.
- Recursive directory conversion should preserve the discovered relative parent before the PDF stem folder: `outputs/<relative-parent>/<stem>/<stem>_001.md`.
- If two inputs would map to the same output folder and overwrite is false, fail during preflight. Do not invent automatic suffixes.
- `--keep-raw` should place raw MinerU diagnostics under `paper/raw/` so raw outputs do not clutter the main folder.
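The preflight collision check can be illustrated with a pure function over discovered inputs. The input shape `(relative_parent, filename)` and the function name are assumptions for this sketch; a real preflight would also consult the overwrite flag and existing folders on disk.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Tuple


def find_folder_collisions(inputs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each contested output folder to the input filenames claiming it."""
    claimed = defaultdict(list)
    for relative_parent, filename in inputs:
        stem = PurePosixPath(filename).stem
        folder = (PurePosixPath(relative_parent) / stem).as_posix()
        claimed[folder].append(filename)
    # Only folders claimed by more than one input are collisions.
    return {folder: names for folder, names in claimed.items() if len(names) > 1}
```

A non-empty result with overwrite disabled would fail the run before any conversion starts, with no automatic suffixing.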
## Touched Surfaces
Allowed during implementation:
- Modify `src/pdf2md/paths.py`.
- Modify `src/pdf2md/pdf_splitter.py` only if part naming needs helper support.
- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md/report.py` or add a focused aggregate report helper if one report needs multiple part summaries.
- Modify `src/pdf2md/cli.py`.
- Modify `src/pdf2md_ui/runner.py` and `src/pdf2md_ui/app.py` only if UI text or expected output descriptions mention metadata/report paths.
- Modify `tests/test_paths.py`.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.
- Modify `tests/test_report.py`.
- Modify `tests/test_ui_runner.py` only if UI command/output assumptions change.
- Modify `tests/integration/test_v1_fast_release_gate.py`.
- Modify `tests/integration/test_optional_mineru_fixtures.py`.
- Modify `README.md`.
- Modify `PRD.md`.
- Modify `ARCHITECTURE.md`.
- Modify `docs/V1IMPLEMENTATIONPLAN.md`.
- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.
Not allowed:
- Do not change MinerU 3.1.0 as the fixed engine.
- Do not add another conversion engine.
- Do not add remote/API/backend paths.
- Do not change `--gpu`, `--mineru-profile`, or strict-local behavior except where report text reflects the new layout.
- Do not make default tests depend on real MinerU, GPU, CUDA, model files, network, Obsidian, MathJax, or `samples/`.
- Do not commit generated `outputs/`, sample PDFs, local model files, or `dist/pdf2md-ui.exe`.
## Architecture Plan
### WP16.1: Document-Level Output Layout
Add or reshape path planning so final outputs are planned per source PDF folder instead of as sibling files.
Expected final paths for a single PDF:
```text
<out>/<stem>/<stem>_001.md
<out>/<stem>/images/
<out>/<stem>/<stem>_report.md
```
Expected final paths for recursive input:
```text
<out>/<relative-parent>/<stem>/<stem>_001.md
<out>/<relative-parent>/<stem>/images/
<out>/<relative-parent>/<stem>/<stem>_report.md
```
Implementation guidance:
- Keep `DiscoveredPdf.relative_parent` behavior.
- Add a focused part-planning helper rather than encoding final output names through fake temporary PDF filenames.
- Keep `PlannedOutput` if the existing conversion code can use it cleanly, but allow multiple Markdown parts to share the same `assets_dir` and `report_path`.
- Duplicate-path detection must reject duplicate Markdown files and raw directories, but it must allow shared `images/` and shared report paths for parts belonging to the same source PDF.
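
The layout and conflict rule above could be sketched as follows. This is illustrative only: `plan_layout` and `find_conflicts` are hypothetical names, not the existing `plan_outputs()` or `PlannedOutput` API. The sketch treats a path as conflicting only when more than one source PDF claims it, so parts of the same PDF can share `images/` and the report.

```python
from __future__ import annotations

from collections import defaultdict
from pathlib import Path


def plan_layout(out_root: Path, relative_parent: Path, stem: str) -> dict[str, Path]:
    """Plan the per-document folder: <out>/<relative-parent>/<stem>/..."""
    folder = out_root / relative_parent / stem
    return {
        "first_markdown": folder / f"{stem}_001.md",
        "images_dir": folder / "images",
        "report": folder / f"{stem}_report.md",
    }


def find_conflicts(plans: dict[Path, dict[str, Path]]) -> list[Path]:
    """Return planned output paths that are claimed by more than one source PDF."""
    owners: dict[Path, set[Path]] = defaultdict(set)
    for source, layout in plans.items():
        for path in layout.values():
            owners[path].add(source)
    return sorted(path for path, sources in owners.items() if len(sources) > 1)
```

A preflight step could call `find_conflicts` before any conversion starts and fail when the result is non-empty and overwrite is false.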
### WP16.2: Markdown Part Numbering
Replace public part names:
```text
paper.part-001.pages-001-020.md
paper.part-002.pages-021-040.md
```
with:
```text
paper_001.md
paper_002.md
```
Rules:
- Part index is based on final output group order, not source page number.
- The report must still record source page ranges for each part.
- Failed groups should not create a Markdown file, but the report must mention the failed part and source page range.
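
As a small illustration of the naming rule, the helper below zero-pads the part index to three digits and widens only once the part count exceeds 999. `part_filename` is a hypothetical name used for this sketch.

```python
def part_filename(stem: str, index: int, part_count: int) -> str:
    """Build '<stem>_NNN.md' from the 1-based output group index."""
    if not 1 <= index <= part_count:
        raise ValueError("index must be between 1 and part_count")
    # Three digits minimum; four or more only when part_count needs them.
    width = max(3, len(str(part_count)))
    return f"{stem}_{index:0{width}d}.md"
```

The stem is used as-is, so non-ASCII stems such as Korean filenames carry through unchanged.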
### WP16.3: Shared Images Folder
Replace per-output asset directories:
```text
paper.part-001.pages-001-020.assets/
paper.part-002.pages-021-040.assets/
```
with:
```text
paper/images/
```
Implementation guidance:
- Copy all assets for one source PDF into the shared `images/` folder.
- Rewrite Markdown links to `images/<asset-name>`.
- Use deterministic collision-safe filenames. Recommended pattern:
- page-known assets: `page-001_<original-name>`, with `-002` suffixes when needed.
- page-unknown assets: `asset-001<suffix>`, preserving the original suffix when available.
- Keep asset-link validation pointed at the shared `images/` directory.
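
One possible shape for the recommended filename pattern is sketched below. `AssetNamer` is a hypothetical helper, and the case-insensitive collision check is an added assumption (chosen because the target file system is Windows), not something this contract specifies.

```python
from __future__ import annotations

from pathlib import PurePosixPath


class AssetNamer:
    """Hands out deterministic, collision-free names for one PDF's images/ folder."""

    def __init__(self) -> None:
        self._used: set[str] = set()
        self._unknown_count = 0

    def name_for(self, original_name: str, page: int | None) -> str:
        original = PurePosixPath(original_name)
        if page is None:
            # Page-unknown assets: asset-001<suffix>, keeping the original suffix.
            self._unknown_count += 1
            candidate = f"asset-{self._unknown_count:03d}{original.suffix}"
        else:
            # Page-known assets: page-001_<original-name>.
            candidate = f"page-{page:03d}_{original.name}"
        return self._reserve(candidate)

    def _reserve(self, candidate: str) -> str:
        path = PurePosixPath(candidate)
        unique, counter = candidate, 2
        # Compare lowercased names (assumption: case-insensitive file system).
        while unique.lower() in self._used:
            unique = f"{path.stem}-{counter:03d}{path.suffix}"
            counter += 1
        self._used.add(unique.lower())
        return unique
```

Because names depend only on the order in which assets are registered, processing pages in source order keeps the output deterministic across runs.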
### WP16.4: One Report, No Metadata JSON
Stop writing metadata JSON as a user-facing output file.
Implementation guidance:
- Continue building internal metadata dictionaries or records for each part so report generation and `ConversionResult` summaries stay traceable.
- Add an aggregate report path at `<stem>/<stem>_report.md`.
- The report must include:
- source PDF path,
- output folder path,
- Markdown part list with page ranges,
- engine and engine options,
- final status,
- warning count,
- asset count,
- missing/invalid asset link counts,
- inline/display formula counts,
- MathJax render error count,
- text fidelity summary when available,
- failed source pages or failed parts when any exist,
- warnings grouped by page or part.
- `ConversionResult.metadata_path` should be `None` for simplified outputs.
- `ConversionResult.report_path` should point to the shared report path.
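
A rough sketch of how per-part in-memory records could be folded into the report totals listed above. `PartRecord` and `summarize` are hypothetical names, and the field set is a subset chosen for illustration; only `math_render_error_count` is a name that appears elsewhere in the project docs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PartRecord:
    """In-memory record for one output group; never written as JSON."""

    index: int
    first_page: int
    last_page: int
    failed: bool = False
    warnings: list[str] = field(default_factory=list)
    asset_count: int = 0
    inline_formula_count: int = 0
    display_formula_count: int = 0
    math_render_error_count: int = 0


def summarize(parts: list[PartRecord]) -> dict[str, object]:
    """Combine all parts of one source PDF into a single report summary."""
    return {
        "part_count": len(parts),
        "failed_page_ranges": [(p.first_page, p.last_page) for p in parts if p.failed],
        "warning_count": sum(len(p.warnings) for p in parts),
        "asset_count": sum(p.asset_count for p in parts),
        "inline_formula_count": sum(p.inline_formula_count for p in parts),
        "display_formula_count": sum(p.display_formula_count for p in parts),
        "math_render_error_count": sum(p.math_render_error_count for p in parts),
    }
```

A report renderer could take this summary plus the part list and write the single `<stem>_report.md`, including when every part has failed.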
### WP16.5: CLI, UI, And Documentation
Update user-facing docs and tests to remove metadata JSON as an expected output.
Implementation guidance:
- `pdf2md convert` summary may keep printing Markdown paths and warning counts.
- Update CLI help for `--metadata` to say metadata JSON output is disabled or deprecated in the simplified layout.
- Update README examples to show the new folder layout.
- Update PRD and ARCHITECTURE so they no longer claim metadata JSON is required as a public artifact.
- Keep internal provenance wording clear: warnings and report are still derived from internal metadata-like records.
- Update optional fixture documentation so generated metadata JSON is not required for sample validation.
## Implementation Task Plan
### Task 1: Path Planning For Simplified Layout
Files:
- Modify `src/pdf2md/paths.py`.
- Modify `tests/test_paths.py`.
Steps:
- [ ] Add failing tests showing `plan_outputs()` maps `paper.pdf` to `out/paper/paper_001.md`, `out/paper/images`, no metadata path, and `out/paper/paper_report.md`.
- [ ] Add a failing test for Korean filenames, using the PDF stem exactly as the output folder and file prefix.
- [ ] Add a failing test for recursive input preserving `relative_parent`.
- [ ] Add a failing test that duplicate source stems in the same relative parent conflict before conversion.
- [ ] Implement the minimal path planning changes.
- [ ] Run `uv run pytest tests/test_paths.py`.
- [ ] Commit path planning changes.
### Task 2: Single-Output Conversion Writes Simplified Files
Files:
- Modify `src/pdf2md/conversion.py`.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.
Steps:
- [ ] Add failing conversion tests showing a non-chunked fake-adapter conversion writes `out/paper/paper_001.md`, `out/paper/images`, and `out/paper/paper_report.md`.
- [ ] Add failing assertions that no `.metadata.json` file is written and `result.metadata_path is None`.
- [ ] Add failing CLI test showing `pdf2md convert paper.pdf --out out` creates the simplified folder.
- [ ] Implement the minimal conversion changes for non-chunked output.
- [ ] Run `uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_paths.py`.
- [ ] Commit single-output conversion changes.
### Task 3: Grouped Output Parts And Shared Images
Files:
- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md/pdf_splitter.py` only if a small helper is needed.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.
Steps:
- [ ] Add failing tests for `chunk_pages=20` showing final Markdown names are `paper_001.md`, `paper_002.md`, not `paper.part-...md`.
- [ ] Add failing tests proving all grouped assets are copied into `paper/images/` and Markdown links use `images/...`.
- [ ] Add failing tests proving asset collisions across pages get deterministic unique filenames.
- [ ] Add failing tests proving failed page conversions are represented in the shared report while later pages still convert.
- [ ] Implement grouped output naming and shared image handling.
- [ ] Run `uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_pdf_splitter.py`.
- [ ] Commit grouped output changes.
### Task 4: Aggregate Report Without Metadata JSON
Files:
- Modify `src/pdf2md/report.py` or add a focused aggregate report helper.
- Modify `src/pdf2md/conversion.py`.
- Modify `tests/test_report.py`.
- Modify `tests/test_conversion.py`.
Steps:
- [ ] Add failing report tests for a one-file report listing multiple Markdown parts and source page ranges.
- [ ] Add failing conversion tests proving only one report exists for a chunked PDF.
- [ ] Add failing tests proving report summary totals combine all output parts.
- [ ] Add failing tests proving all-failed conversions write a report but no Markdown part.
- [ ] Implement aggregate report rendering from internal metadata records.
- [ ] Run `uv run pytest tests/test_report.py tests/test_conversion.py`.
- [ ] Commit report changes.
### Task 5: Recheck, CLI Compatibility, UI Text, And Docs
Files:
- Modify `src/pdf2md/cli.py`.
- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md_ui/runner.py` and `src/pdf2md_ui/app.py` only if text/output assumptions change.
- Modify `README.md`.
- Modify `PRD.md`.
- Modify `ARCHITECTURE.md`.
- Modify `docs/V1IMPLEMENTATIONPLAN.md`.
- Modify `tests/test_cli.py`.
- Modify `tests/test_ui_runner.py` only if UI behavior changes.
- Modify `tests/integration/test_v1_fast_release_gate.py`.
- Modify `tests/integration/test_optional_mineru_fixtures.py`.
Steps:
- [ ] Add failing CLI tests proving `--metadata` remains accepted but no metadata JSON is written.
- [ ] Add failing recheck test proving simplified outputs without metadata fail with a clear legacy-metadata message.
- [ ] Update integration tests to require Markdown part files, one report, and image links, not metadata JSON.
- [ ] Update README, PRD, ARCHITECTURE, and release-gate wording for the simplified layout.
- [ ] Implement CLI/recheck/doc changes.
- [ ] Run `uv run pytest tests/test_cli.py tests/test_ui_runner.py tests/integration/test_v1_fast_release_gate.py`.
- [ ] Commit CLI, UI, integration, and documentation changes.
### Task 6: Final Verification And Handoff
Files:
- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.
- Modify `docs/Sprints/SPRINT16CONTRACT.md` status and handoff fields.
Steps:
- [ ] Run focused Sprint 16 verification:
```powershell
uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
```
- [ ] Run full default verification:
```powershell
uv run pytest
```
- [ ] Run diff check:
```powershell
git diff --check
```
- [ ] Update `PROGRESS.md` with files changed, checks run, residual risks, and next actions.
- [ ] Archive completed implementation evidence in `docs/WORKARCHIVE.md`.
- [ ] Commit final coordination updates.
## Verification Commands
```powershell
uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```
Optional local fixture validation after implementation:
```powershell
$env:MINERU_MODEL_SOURCE='local'
uv run pdf2md convert samples\SolidElement.pdf --out outputs\SolidElement_sprint16_layout --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local
```
Expected optional validation:
- Output folder is the `--out` root plus `SolidElement\`. For the command above, that is `outputs\SolidElement_sprint16_layout\SolidElement\`.
- Markdown part is `SolidElement_001.md` for the 6-page sample.
- Report is `SolidElement_report.md`.
- Images are under `images\`.
- No metadata JSON exists.
## Acceptance Criteria
- Each input PDF writes into an output folder named after the PDF stem.
- Markdown outputs are named `<stem>_001.md`, `<stem>_002.md`, and so on.
- All image/media assets for one PDF live under `<stem>/images/`.
- Markdown links point to `images/...`.
- Exactly one report file is written per input PDF at `<stem>/<stem>_report.md`.
- No metadata JSON file is written for new conversions.
- Internal warning, provenance, formula count, asset count, and text fidelity information remains available in the report.
- Chunk mode still converts one source page per MinerU run and groups Markdown by `chunk_pages`.
- Strict-local and MinerU-only constraints remain unchanged.
- Default tests stay fast and local.
## Hard Failure Criteria
- Any new conversion writes `.metadata.json` as a public output.
- Output files keep old `part-001.pages-...` names.
- Assets are split into per-part `.assets` folders.
- More than one report is written for one input PDF.
- Markdown links point outside the PDF output folder.
- Chunk mode stops using one source page per MinerU run.
- Strict-local enforcement is weakened.
- Default tests require real MinerU, GPU, model files, network, Obsidian, MathJax, or `samples/`.
- Sample PDFs, generated outputs, local model files, or `dist/pdf2md-ui.exe` are committed.
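
The link-related failure criterion above could be checked in a fast test with something like the sketch below. `links_outside_images` is a hypothetical helper, and the regular expression covers only plain inline Markdown image links; it is not a full Markdown parser.

```python
import re
from pathlib import PurePosixPath

# Matches ![alt](target) with an optional "title"; reference-style links are ignored.
IMAGE_LINK = re.compile(r'!\[[^\]]*\]\(([^)\s]+)(?:\s+"[^"]*")?\)')


def links_outside_images(markdown: str) -> list[str]:
    """Return image link targets that do not stay inside the shared images/ folder."""
    offenders = []
    for target in IMAGE_LINK.findall(markdown):
        parts = PurePosixPath(target.replace("\\", "/")).parts
        inside = len(parts) >= 2 and parts[0] == "images" and ".." not in parts
        if not inside:
            offenders.append(target)
    return offenders
```

An empty result for every generated part is one way to assert that links point to `images/...` and never leave the PDF output folder.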
## Open Questions
- Should metadata-free `pdf2md recheck` be restored in a later sprint by deriving enough state from the report and Markdown, or is rerunning conversion acceptable for simplified outputs?
- Should raw MinerU outputs under `--keep-raw` be flattened into `raw/` or kept per part under `raw/<stem>_001/`? This contract recommends per-part raw folders to avoid collisions.
## Handoff Requirements
After implementation:
- Update this contract status to `Implemented`.
- Record final file layout examples in `README.md`.
- Record verification commands and outcomes in `PROGRESS.md`.
- Archive implementation and optional sample validation results in `docs/WORKARCHIVE.md`.
- Keep generated outputs and sample PDFs uncommitted.
## Implementation Handoff
- Files changed: `src/pdf2md/paths.py`, `src/pdf2md/conversion.py`, `src/pdf2md/report.py`, `src/pdf2md/cli.py`, `src/pdf2md_ui/runner.py`, focused tests, and current docs.
- Output layout implemented: `<out>/<stem>/<stem>_001.md`, additional numbered parts when grouped, `<out>/<stem>/images/`, and `<out>/<stem>/<stem>_report.md`.
- Metadata JSON behavior: new conversions do not write public `.metadata.json`; `ConversionResult.metadata_path` is `None`; internal metadata-like records still feed reports and tests.
- Recheck behavior: `pdf2md recheck` remains legacy-only and requires adjacent metadata JSON.
- Verification recorded in `PROGRESS.md`: focused Sprint 16 tests passed, full `uv run pytest` passed 227 tests with 1 optional skip, and `git diff --check` passed with line-ending warnings only.
+440
@@ -0,0 +1,440 @@
# Sprint 17 Contract: Offline Windows Installer
Status: Abandoned
Last updated: 2026-05-13
## Abandonment Note
Sprint 17 was abandoned at the user's request on 2026-05-13 before implementation began. This document remains as a historical planning record only. Do not implement or extend this contract unless the user explicitly reopens offline installer work.
## Objective
Create a large offline Windows installer that can install the existing local `pdf2md` runtime on another Windows PC without internet access.
The installer must install or stage all application-owned files needed after download time: the minimal UI executable, the project runtime, a target-local Python virtual environment created from bundled wheels, CUDA PyTorch wheels, MinerU 3.1.0 wheels and dependencies, local MinerU model files, optional local Node.js/MathJax assets, Start Menu shortcuts, setup logs, and a post-install `pdf2md doctor` verification path.
This sprint does not change conversion behavior. It packages the already implemented CLI/UI/runtime for offline use.
## Product Decision
The offline package should create the target PC virtual environment during installation instead of copying the current development `.venv`.
Reasoning:
- Python virtual environments and console entry points often contain absolute paths and are not a reliable redistribution unit.
- A target-local `.venv` created from a bundled wheelhouse is more reproducible and easier to repair.
- The installer can keep the wheelhouse for offline repair, uninstall/reinstall, and audit.
## Installer Shape
Recommended installer technology:
- Inno Setup for the Windows installer shell because it can compile scripts from the command line with `ISCC.exe`, returns deterministic exit codes, and is simple enough for a per-user installer.
- PowerShell scripts for payload build, target runtime install, and target verification.
- PyInstaller remains only the UI executable builder. It must not become the full MinerU/PyTorch/model bundler.
Default install root:
```text
%LOCALAPPDATA%\Programs\ConvertPDFToMD\
```
Installed layout:
```text
ConvertPDFToMD/
app/
pdf2md-ui.exe
runtime/
pyproject.toml
uv.lock
README.md
src/
tools/
package.json
package-lock.json
.venv/
payload/
python/
uv/
wheelhouse/
requirements-runtime-cu126.txt
models/
node/
node_modules/
payload-manifest.json
SHA256SUMS.txt
THIRD_PARTY_NOTICES.md
scripts/
install-runtime.ps1
repair-runtime.ps1
run-doctor.ps1
logs/
```
Generated artifacts that must remain untracked:
```text
dist/offline-installer/
dist/Pdf2MdOfflineSetup-*.exe
```
## Payload Contents
The first offline payload targets Windows x64, Python 3.12, CUDA PyTorch `2.6.0+cu126`, `torchvision 0.21.0+cu126`, and `mineru[core]==3.1.0`.
Required:
- `dist/pdf2md-ui.exe` from the existing PyInstaller build.
- Tracked project runtime files needed to run `uv run pdf2md`.
- A Windows x64 Python 3.12 installer or an equivalent approved Python runtime package.
- A Windows x64 `uv.exe`.
- A wheelhouse containing:
- the current project wheel,
- `pypdf`,
- `torch==2.6.0`,
- `torchvision==0.21.0`,
- `mineru[core]==3.1.0`,
- all transitive Python runtime dependencies.
- Local MinerU model files and the model config template needed for `MINERU_MODEL_SOURCE=local`.
- A manifest listing every payload file, size, SHA-256 hash, source URL or local source, and license family.
Optional but recommended:
- Portable local Node.js runtime.
- `node_modules/` containing the locked MathJax checker dependencies from `package-lock.json`.
Explicitly excluded:
- `samples/`.
- `outputs/`.
- `.git/`.
- The development `.venv/`.
- Local generated PyInstaller `build/` folders and `.spec` files unless the implementation deliberately adds a stable project-owned spec file.
- NVIDIA GPU drivers and CUDA Toolkit installers. The installer may check for a compatible NVIDIA driver through `nvidia-smi`, but it should not redistribute GPU drivers in this sprint.
## Touched Surfaces
Allowed during implementation:
- Create `packaging/offline/build-offline-payload.ps1`.
- Create `packaging/offline/verify-offline-payload.ps1`.
- Create `packaging/offline/install-runtime.ps1`.
- Create `packaging/offline/repair-runtime.ps1`.
- Create `packaging/offline/run-doctor.ps1`.
- Create `packaging/offline/Pdf2MdOffline.iss`.
- Create `packaging/offline/requirements-runtime-cu126.txt`.
- Create `packaging/offline/README.md`.
- Create `packaging/offline/THIRD_PARTY_NOTICES.md`.
- Create `src/pdf2md/packaging_manifest.py` only if a Python helper is simpler than repeating manifest logic in PowerShell.
- Modify `src/pdf2md_ui/runner.py` so the UI can resolve an installed target-local `.venv\Scripts\pdf2md.exe` before falling back to PATH or `uv run pdf2md`.
- Modify `src/pdf2md_ui/app.py` only if the project root default must prefer the installed runtime folder.
- Modify `tests/test_ui_runner.py`.
- Create `tests/test_offline_packaging.py`.
- Modify `README.md`.
- Modify `docs/V1RELEASECHECKLIST.md`.
- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.
Not allowed:
- Do not change MinerU 3.1.0 as the fixed conversion engine.
- Do not add a second conversion engine.
- Do not add runtime network calls, `--api-url`, router mode, remote APIs, HTTP client backends, remote OpenAI-compatible backends, or hosted renderers.
- Do not copy the development `.venv` as the installed runtime.
- Do not make default tests depend on real MinerU, GPU, model files, network, Obsidian, MathJax, Inno Setup, or `samples/`.
- Do not commit generated installer payloads, model files, wheelhouse files, Python installers, `dist/`, `outputs/`, or `samples/`.
## Architecture Plan
### WP17.1: Offline Payload Builder
Add a build script that creates a clean staging folder under `dist/offline-installer/` with `app/`, `runtime/`, and `payload/` subfolders that mirror the final install layout.
Responsibilities:
- Rebuild `dist/pdf2md-ui.exe`.
- Build the project wheel into the staging wheelhouse.
- Download or collect Python wheels for the target runtime on a connected build PC.
- Collect the Windows Python runtime package and `uv.exe`.
- Copy project runtime files without `.git`, `.venv`, `outputs/`, `samples/`, and build trash.
- Copy local MinerU model files from a configured source path.
- Optionally copy portable Node.js and the locked `node_modules/`.
- Generate `payload-manifest.json` and `SHA256SUMS.txt`.
- Fail if any required file is missing or if any wheel dependency would require internet during installation.
The builder may use `python -m pip download` on the connected build PC. The target installer must use only local files, for example `uv pip install --no-index --find-links`.
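
For the manifest responsibility above, a minimal Python sketch might look like the following. The function names and the single `source_label` argument are assumptions made for illustration; the contract's manifest also calls for a source URL or local source and a license family per file, which this sketch leaves out.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large wheels and model files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(payload_root: Path, source_label: str) -> list[dict[str, object]]:
    """List every payload file with its relative path, size, SHA-256, and source label."""
    entries = []
    for path in sorted(p for p in payload_root.rglob("*") if p.is_file()):
        entries.append(
            {
                "path": path.relative_to(payload_root).as_posix(),
                "size": path.stat().st_size,
                "sha256": sha256_of(path),
                "source": source_label,
            }
        )
    return entries
```

Serializing the result with `json.dumps` would give a `payload-manifest.json` body, and the same hashes could be written out line by line for `SHA256SUMS.txt`.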
### WP17.2: Target Runtime Installer
Add a PowerShell install script that runs from the installed payload and creates the real runtime on the target PC.
Responsibilities:
- Verify payload hashes before installing.
- Install or locate Python 3.12 x64.
- Create `runtime\.venv` on the target PC.
- Install packages from `payload\wheelhouse` with network disabled.
- Install the project wheel into the target `.venv`.
- Preserve the bundled wheelhouse for offline repair.
- Configure `MINERU_MODEL_SOURCE=local` for UI/CLI child processes.
- Configure local MinerU model paths without silently overwriting an unrelated user `mineru.json`.
- If `%USERPROFILE%\mineru.json` already exists and points elsewhere, prompt in interactive mode; in silent mode, fail clearly and leave `repair-runtime.ps1` instructions.
- Run `pdf2md doctor` and write the result to `logs\doctor-after-install.txt`.
### WP17.3: UI Runtime Resolution
Adjust the UI runner for an installed offline layout.
Resolution order:
1. Explicit configured `pdf2md` command.
2. Installed runtime `.venv\Scripts\pdf2md.exe` under the selected project root.
3. `pdf2md` on PATH.
4. Bundled `uv.exe` plus `uv run --offline pdf2md` under the selected project root.
5. Existing system `uv run pdf2md` fallback.
Child environment rules:
- Set `MINERU_MODEL_SOURCE=local` unless explicitly set.
- Add installed `.venv\Scripts` to PATH for runtime console scripts.
- Add installed portable Node.js path to PATH when bundled.
- Set `UV_OFFLINE=1` when using the installed offline runtime.
- Do not add remote endpoints or backend flags.
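
The resolution order and child environment rules above can be sketched as below. Both function names, the `bundled_uv` parameter, and the injectable `which` argument are hypothetical; the real `src/pdf2md_ui/runner.py` may be structured differently. The bundled Node.js PATH entry is omitted to keep the sketch short.

```python
from __future__ import annotations

import os
import shutil
from pathlib import Path
from typing import Callable, Mapping


def resolve_pdf2md_command(
    project_root: Path,
    configured: list[str] | None = None,
    bundled_uv: Path | None = None,
    which: Callable[[str], str | None] = shutil.which,
) -> list[str]:
    """Pick the pdf2md command using the five-step order from WP17.3."""
    if configured:
        return list(configured)  # 1. explicit configured command
    venv_exe = project_root / ".venv" / "Scripts" / "pdf2md.exe"
    if venv_exe.is_file():
        return [str(venv_exe)]  # 2. installed target-local runtime
    on_path = which("pdf2md")
    if on_path:
        return [on_path]  # 3. pdf2md on PATH
    if bundled_uv is not None and bundled_uv.is_file():
        return [str(bundled_uv), "run", "--offline", "pdf2md"]  # 4. bundled uv
    return ["uv", "run", "pdf2md"]  # 5. existing system uv fallback


def build_child_env(base_env: Mapping[str, str], venv_scripts: Path, offline: bool) -> dict[str, str]:
    """Apply the child environment rules without touching the parent environment."""
    env = dict(base_env)
    env.setdefault("MINERU_MODEL_SOURCE", "local")  # keep an explicit user value
    entries = [str(venv_scripts)] + [p for p in [env.get("PATH", "")] if p]
    env["PATH"] = os.pathsep.join(entries)
    if offline:
        env["UV_OFFLINE"] = "1"
    return env
```

Injecting `which` keeps the resolver testable with fake folders, which fits the rule that default tests must not depend on a real installed runtime.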
### WP17.4: Inno Setup Installer
Add an Inno Setup script that installs the payload and invokes the target runtime installer.
Installer behavior:
- Default to per-user install under `%LOCALAPPDATA%\Programs\ConvertPDFToMD`.
- Create Start Menu shortcuts for:
- `ConvertPDFToMD` UI,
- `PDF2MD Doctor`,
- `Repair PDF2MD Runtime`.
- Run `install-runtime.ps1` after files are copied.
- Show the doctor log path if setup finishes with WARN.
- Fail the install on target runtime setup failure unless the user explicitly chooses to keep files for manual repair.
### WP17.5: License, Manifest, And Offline Verification
Add docs and checks for redistribution risk.
Required records:
- Python, uv, PyInstaller, PyTorch, MinerU, model files, Node.js, MathJax, and transitive Python/npm dependency notices.
- A manifest with file hashes and source URLs.
- A clear statement that runtime conversion remains local-only and that setup payload creation can use internet only on the build PC.
Verification tiers:
- Fast tests use fake staging folders and fake wheel/model files.
- Build-PC packaging smoke can create the staging folder without committing payload.
- Offline target smoke uses a clean Windows VM with networking disabled.
## Implementation Task Plan
### Task 1: Packaging Manifest And Ignore Policy
Files:
- Create `tests/test_offline_packaging.py`.
- Create `src/pdf2md/packaging_manifest.py` if needed.
- Modify `.gitignore`.
Steps:
- Add failing tests for manifest generation with SHA-256, file size, relative path, and source label.
- Add failing tests that payload paths under `dist/offline-installer/`, wheelhouse files, model files, and generated installer executables stay ignored.
- Implement the smallest manifest helper or PowerShell-compatible JSON format.
- Run `uv run pytest tests/test_offline_packaging.py`.
- Commit manifest and ignore-policy changes.
### Task 2: Offline Payload Builder
Files:
- Create `packaging/offline/build-offline-payload.ps1`.
- Create `packaging/offline/requirements-runtime-cu126.txt`.
- Create `packaging/offline/README.md`.
- Create `packaging/offline/verify-offline-payload.ps1`.
- Modify `tests/test_offline_packaging.py`.
Steps:
- Add tests that the builder rejects missing UI exe, missing model source, missing Python runtime package, missing `uv.exe`, and empty wheelhouse.
- Add tests that the builder excludes `.venv`, `.git`, `samples`, and `outputs`, and that it excludes `node_modules` unless it is explicitly copied as the optional locked MathJax payload.
- Implement payload staging, manifest generation, and payload verification.
- Run `uv run pytest tests/test_offline_packaging.py`.
- Run a dry build command that uses fake payload inputs.
- Commit builder changes.
### Task 3: Target Runtime Install And Repair Scripts
Files:
- Create `packaging/offline/install-runtime.ps1`.
- Create `packaging/offline/repair-runtime.ps1`.
- Create `packaging/offline/run-doctor.ps1`.
- Modify `tests/test_offline_packaging.py`.
Steps:
- Add tests that scripts contain `--no-index`, `--find-links`, `UV_OFFLINE=1`, and no `http://` or `https://` target-install commands.
- Add tests that existing `mineru.json` handling is explicit and never silently overwritten.
- Implement target-local `.venv` creation, offline package install, model config handling, doctor logging, and repair flow.
- Run `uv run pytest tests/test_offline_packaging.py`.
- Commit install-script changes.
### Task 4: UI Installed Runtime Resolution
Files:
- Modify `src/pdf2md_ui/runner.py`.
- Modify `src/pdf2md_ui/app.py` only if needed.
- Modify `tests/test_ui_runner.py`.
Steps:
- Add failing tests for project-root `.venv\Scripts\pdf2md.exe` resolution before PATH.
- Add failing tests for bundled `uv.exe` plus `uv run --offline pdf2md` fallback.
- Add failing tests that the child environment prepends `.venv\Scripts` and bundled Node.js when present.
- Implement the minimal runner changes.
- Run `uv run pytest tests/test_ui_runner.py`.
- Commit UI resolution changes.
### Task 5: Inno Setup Script
Files:
- Create `packaging/offline/Pdf2MdOffline.iss`.
- Modify `tests/test_offline_packaging.py`.
Steps:
- Add tests that the Inno script references the expected payload directories, Start Menu shortcuts, and runtime install script.
- Add tests that the script does not reference `samples`, `outputs`, `.venv`, or remote URLs.
- Implement the Inno script.
- On a build PC with Inno Setup installed, run `ISCC.exe packaging\offline\Pdf2MdOffline.iss`.
- Commit installer-script changes without committing the generated installer.
### Task 6: Documentation And Release Gate
Files:
- Modify `README.md`.
- Modify `docs/V1RELEASECHECKLIST.md`.
- Modify `docs/Sprints/SPRINT17CONTRACT.md`.
- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.
Steps:
- Document build-PC prerequisites and target-PC prerequisites.
- Document the offline artifact layout, expected size risk, and repair flow.
- Document the clean offline VM smoke test.
- Record final verification outcomes and residual risks.
- Commit documentation and handoff updates.
## Verification Commands
Default fast checks:
```powershell
uv run pytest tests/test_offline_packaging.py tests/test_ui_runner.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```
Build-PC packaging checks:
```powershell
uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py
$pythonInstaller = "C:\BuildCache\python-3.12-amd64.exe"
$uvExe = "C:\BuildCache\uv.exe"
$mineruModels = "C:\BuildCache\mineru-models"
powershell -ExecutionPolicy Bypass -File packaging\offline\build-offline-payload.ps1 -Configuration Release -PythonInstaller $pythonInstaller -UvExe $uvExe -MinerUModelSource $mineruModels
powershell -ExecutionPolicy Bypass -File packaging\offline\verify-offline-payload.ps1 -PayloadRoot dist\offline-installer\payload
ISCC.exe packaging\offline\Pdf2MdOffline.iss
```
Offline target smoke:
```powershell
# Run on a clean Windows x64 VM with networking disabled after copying only the installer.
.\Pdf2MdOfflineSetup-*.exe
& "$env:LOCALAPPDATA\Programs\ConvertPDFToMD\scripts\run-doctor.ps1"
& "$env:LOCALAPPDATA\Programs\ConvertPDFToMD\runtime\.venv\Scripts\pdf2md.exe" --version
& "$env:LOCALAPPDATA\Programs\ConvertPDFToMD\runtime\.venv\Scripts\pdf2md.exe" doctor
```
Optional conversion smoke on the offline target:
```powershell
& "$env:LOCALAPPDATA\Programs\ConvertPDFToMD\runtime\.venv\Scripts\pdf2md.exe" convert C:\LocalTest\SolidElement.pdf --out C:\LocalTest\outputs --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local
```
Expected optional output:
```text
C:\LocalTest\outputs\SolidElement\SolidElement_001.md
C:\LocalTest\outputs\SolidElement\SolidElement_report.md
C:\LocalTest\outputs\SolidElement\images\
```
## Acceptance Criteria
- The generated installer can install the runtime on a clean Windows x64 target without internet access.
- The target runtime has a newly created local `.venv`; it is not a copied development `.venv`.
- `pdf2md --version` runs from the installed `.venv`.
- `pdf2md doctor` runs without network access and reports all install-relevant failures or warnings clearly.
- The UI launches from the Start Menu and resolves the installed runtime without manual project-root configuration.
- MinerU uses local models through `MINERU_MODEL_SOURCE=local` and local model config.
- Python package installation uses only bundled local wheels.
- The wheelhouse and model payload are hash-verified before install.
- No generated payload, model file, wheel, installer exe, sample PDF, or conversion output is committed.
- Default tests remain fast and independent of real MinerU, GPU, model files, network, Inno Setup, MathJax, or `samples/`.
## Hard Failure Criteria
- The target installer downloads anything from the internet.
- The UI or CLI introduces a runtime document upload path.
- The installer silently overwrites an unrelated existing `mineru.json`.
- The installer copies the development `.venv` as the installed runtime.
- The installed UI cannot find `pdf2md` without manually editing settings on a clean install.
- `pdf2md doctor` is skipped or its failure is hidden.
- Payload hash verification is missing.
- License/model redistribution review is skipped before sharing the installer outside the current personal environment.
- NVIDIA drivers or CUDA Toolkit installers are redistributed in this sprint.
## Open Risks
- The final installer may be very large because CUDA PyTorch wheels, MinerU dependencies, model weights, and optional Node/MathJax assets are large.
- MinerU model redistribution terms and transitive package/model licenses must be reviewed before broader sharing.
- Target PCs still need compatible NVIDIA hardware and drivers. The installer can verify and report this, but it cannot guarantee GPU compatibility.
- Some conversions can still stall or run slowly on GTX 1070 Ti 8GB; packaging does not solve runtime performance.
- The Inno Setup installer may need validation against practical size limits and antivirus/SmartScreen behavior once real model payloads are included.
## Sources
- PyInstaller usage: https://pyinstaller.org/en/stable/usage.html
- Inno Setup command-line compiler: https://documentation.help/Inno-Setup/topic_compilercmdline.htm
- uv CLI `--offline` behavior: https://docs.astral.sh/uv/reference/cli/
- uv cache behavior: https://docs.astral.sh/uv/concepts/cache/
- pip offline install/download behavior: https://pip.pypa.io/en/stable/cli/pip_install.html and https://pip.pypa.io/en/stable/cli/pip_download/
- PyTorch previous version wheel command for CUDA 12.6: https://pytorch.org/get-started/previous-versions/
- MinerU local model source behavior: https://opendatalab.github.io/MinerU/usage/model_source/
## Handoff Requirements
After implementation:
- Update this contract status to `Implemented` or record the failed gate.
- Record payload size and generated installer path in `PROGRESS.md`.
- Record verification commands and outcomes in `PROGRESS.md`.
- Archive implementation evidence and offline VM smoke results in `docs/WORKARCHIVE.md`.
- Keep generated offline payloads, wheels, model files, installer exe, `dist/`, `outputs/`, and `samples/` uncommitted.
+1 -1
View File
@@ -134,7 +134,7 @@ Not allowed:
- Do not run model setup automatically.
- Do not require the local GTX 1070 Ti to pass CUDA/PyTorch checks in the default test loop.
- Do not improve OCR/model accuracy.
- Do not introduce a manual review UI or web UI.
- Do not introduce a manual review UI, hosted web UI, or local desktop launcher in Sprint 9.
- Do not add alternate conversion engines or fallback engines.
- Do not benchmark against cloud OCR/API services.
- Do not commit sample PDFs, sample-derived outputs, or large binary fixtures.
+237
View File
@@ -0,0 +1,237 @@
# UI Research: Minimal Windows Launcher For pdf2md
Last updated: 2026-05-11
## Scope
User request:
- Build a minimal UI that uses the existing `pdf2md` CLI.
- Build it into a Windows `.exe`.
- Research the implementation path before coding.
This document is research and planning input only. It does not change runtime behavior.
## Current Project Fit
The existing converter is already centered on a CLI:
```powershell
uv run pdf2md doctor
uv run pdf2md convert INPUT --out OUTPUT --overwrite
uv run pdf2md recheck OUTPUT.md
```
The UI should preserve the current architecture:
- Use MinerU 3.1.0 through the direct local `mineru` CLI only.
- Keep strict-local behavior. Do not expose `--api-url`, remote endpoints, router mode, cloud OCR, remote LLMs, or external document uploads.
- Treat the UI `.exe` as a launcher for the existing local runtime, not as a fully self-contained bundle of MinerU, PyTorch, CUDA DLLs, local models, Node.js, and MathJax.
- Keep generated Markdown parts, report Markdown, assets, and raw output behavior owned by the existing CLI.
## Recommendation
Use a thin Python desktop launcher:
- UI framework: `tkinter` plus `tkinter.ttk`.
- CLI execution: `subprocess.Popen` with `shell=False`, argument lists, a worker thread, and a queue back to the UI thread.
- Packaging: PyInstaller `--onefile --windowed` for a lightweight `pdf2md-ui.exe`.
- Runtime command: prefer `pdf2md` if it is on `PATH`; otherwise run `uv run pdf2md` with a configured project root.
This is the lowest-risk path because `tkinter` is in the Python standard library, `ttk` provides native themed widgets, and PyInstaller directly supports graphical windowed apps on Windows. The UI remains small and avoids bundling the large GPU conversion stack into the UI executable.
## Why Not Bundle The Whole Converter Into One EXE
Bundling the full conversion runtime into a single executable is not a good v1 target:
- The runtime includes CUDA PyTorch, MinerU, model files, optional Node.js/MathJax support, and local cache/config state.
- Model weights and transitive licenses are already documented as redistribution-sensitive.
- One-file executables extract at startup; large bundles can start slowly and create antivirus or SmartScreen friction.
- The project already uses `uv` and a known local `.venv`; the UI can call that stable runtime.
Recommended v1 interpretation of ".exe":
- Build `pdf2md-ui.exe` as the desktop UI.
- Require the local converter runtime to be installed and pass `pdf2md doctor`.
- Let the UI surface doctor failures clearly instead of pretending to be a complete installer.
Future redistribution can be revisited later as a separate packaging and license sprint.
## UI Framework Options
| Option | Fit | Pros | Cons | Decision |
| --- | --- | --- | --- | --- |
| `tkinter` + `ttk` | Strong | Standard library, native file dialogs, themed widgets, minimal dependencies, easy PyInstaller build. Python docs warn that long work must not block Tk's single-threaded event loop, which matches a worker-thread runner design. | Visual polish is modest. Advanced drag/drop usually needs extra packages. | Recommended for v1. |
| PySide6 / Qt for Python | Medium | Polished widgets, strong desktop model, official Python bindings. | Adds large Qt dependency, LGPL/commercial considerations, more complex deployment. Qt docs describe PyInstaller and Nuitka paths, plus caveats around virtualenv/system package selection and Qt plugin bundling. | Keep as a later polish option. |
| CustomTkinter | Medium | More modern look on top of Tkinter. | Official wiki notes PyInstaller packaging data-file issues and recommends `--onedir` instead of `--onefile`. Adds dependency for mostly visual benefit. | Avoid for v1. |
| Flet | Low/medium | Modern Flutter-based Python UI, official `flet build windows`. | Windows packaging requires Visual Studio 2022 with Desktop development with C++ workload. Heavier stack than needed for a form/log launcher. | Avoid for v1. |
| Tauri | Low | Sidecar pattern can embed external binaries and produce polished small desktop apps. | Requires Rust and frontend stack, sidecar permissions, target-triple binary naming, and more architecture than needed. | Avoid for v1. |
| Briefcase | Medium | Produces Windows app folders, MSI installers, and ZIPs; useful for installer-style distribution. | More installer-oriented than needed for a first thin launcher. | Consider after v1 UI works. |
## Packaging Options
| Tool | Relevant facts | Fit |
| --- | --- | --- |
| PyInstaller | Supports one-folder and one-file bundles. On Windows it can create graphical apps without a console window. `--onefile`, `--windowed`, `--name`, `--icon`, and spec files cover the expected needs. PyInstaller's license includes an exception allowing bundled applications to be shipped under the application's own license, subject to dependency licenses. | Recommended. |
| Nuitka | Can create standalone, onefile, and app-mode outputs, and emits `.exe` on Windows. Requires a C compiler/toolchain and makes builds longer and more complex. | Good later if PyInstaller output has startup or AV problems. |
| `pyside6-deploy` | Official Qt for Python deployment tool wrapping Nuitka. Produces `.exe` on Windows. | Only relevant if choosing PySide6. |
| Briefcase | Windows outputs include app folders plus MSI or ZIP packaging. Uses an embedded Python distribution. | Useful for installer sprint, not the first UI executable. |
| Flet build | Official Windows build path exists but requires Visual Studio C++ workload. | Too much setup for this project. |
## CLI Runner Design
The UI should not call MinerU directly. It should call the project-owned CLI:
```text
pdf2md doctor
pdf2md convert <input.pdf> --out <output-dir> --overwrite --gpu cuda:0
pdf2md recheck <output.md>
```
Command resolution:
1. If the configured command exists, use it.
2. Else if `pdf2md` is on `PATH`, run `pdf2md`.
3. Else if `uv` is on `PATH` and a configured project root contains `pyproject.toml`, run `uv run pdf2md` with `cwd=<project-root>`.
4. Else show a setup error and suggest running `pdf2md doctor` in the repository.
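The resolution order above can be sketched as follows. This is a minimal illustration, not the project's actual `runner.py`: the function name `resolve_pdf2md_command`, its parameters, and the choice to raise `FileNotFoundError` for the setup error are assumptions made for the example.

```python
import shutil
from pathlib import Path


def resolve_pdf2md_command(configured=None, project_root=None, which=shutil.which):
    """Return (argv_prefix, cwd) for invoking pdf2md, or raise if no runtime is found.

    `which` is injectable so tests can fake PATH lookups (hypothetical design choice).
    """
    # 1. An explicitly configured command wins if it resolves to an executable.
    if configured and which(configured):
        return [configured], None
    # 2. Otherwise use pdf2md from PATH.
    if which("pdf2md"):
        return ["pdf2md"], None
    # 3. Otherwise use uv inside a project root that contains pyproject.toml.
    if which("uv") and project_root is not None:
        root = Path(project_root)
        if (root / "pyproject.toml").is_file():
            return ["uv", "run", "pdf2md"], root
    # 4. Nothing usable: raise a setup error for the UI to display.
    raise FileNotFoundError(
        "pdf2md was not found. Run `pdf2md doctor` in the repository to check the setup."
    )
```

Returning the working directory alongside the argument prefix keeps the `cwd=<project-root>` requirement of the `uv run` path next to the command that needs it.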
Subprocess rules:
- Always pass an argument list with `shell=False`.
- Set `cwd` explicitly when running through `uv`.
- Set `MINERU_MODEL_SOURCE=local` in the child environment unless the user already set it.
- Merge stderr into stdout for a single UI log stream.
- Read output line by line in a background thread.
- Communicate to Tk through `queue.Queue` and `root.after(...)`.
- Store the process PID so Cancel can terminate it.
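A minimal sketch of these subprocess rules, assuming a hypothetical helper named `start_cli` and a `None` sentinel to mark process exit. The UTF-8 decoding with replacement is also an assumption; the real child output encoding on Windows may differ.

```python
import os
import queue
import subprocess
import threading


def start_cli(argv, cwd=None):
    """Start a CLI command and stream its merged output into a queue.

    Returns (process, lines). `lines` receives each output line, then None
    once the process has exited.
    """
    env = os.environ.copy()
    # setdefault keeps any value the user already set.
    env.setdefault("MINERU_MODEL_SOURCE", "local")
    process = subprocess.Popen(
        argv,                      # argument list, never a command string
        cwd=cwd,
        env=env,
        shell=False,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into a single log stream
        text=True,
        encoding="utf-8",
        errors="replace",
    )
    lines = queue.Queue()

    def pump():
        # Runs in a background thread so the Tk event loop is never blocked.
        for line in process.stdout:
            lines.put(line.rstrip("\n"))
        process.wait()
        lines.put(None)  # sentinel: the process has exited

    threading.Thread(target=pump, daemon=True).start()
    return process, lines
```

On the Tk side, a callback scheduled with `root.after(...)` would drain `lines` with `get_nowait()` and append to the log pane, while `process.pid` stays available for Cancel.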
Cancellation on Windows:
- First call `Popen.terminate()`.
- If the process does not exit promptly, call `taskkill /pid <pid> /t /f` to end the process tree. Microsoft documents `/t` as ending child processes and `/f` as forceful termination.
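The two-step cancellation could look like the sketch below. The function name `cancel_process`, the five-second grace period, and the injectable `run` parameter are assumptions for illustration; only `Popen.terminate()` and the `taskkill /pid <pid> /t /f` escalation come from the plan above.

```python
import subprocess


def cancel_process(process, grace_seconds=5.0, run=subprocess.run):
    """Terminate a child process, escalating to a process-tree kill on timeout.

    Returns "terminated" if the process exited within the grace period,
    otherwise "killed" after taskkill was invoked.
    """
    process.terminate()
    try:
        process.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        # /t ends child processes as well, /f forces termination.
        run(["taskkill", "/pid", str(process.pid), "/t", "/f"], check=False)
        return "killed"
```

Passing `run` as a parameter lets the fast tests assert on the `taskkill` argument list with a mocked process, without touching real processes.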
Current limitation:
- The existing MinerU adapter uses `subprocess.run(..., capture_output=True)` inside `pdf2md`, so detailed MinerU progress may not stream until the CLI completes. The v1 UI should use an indeterminate progress bar plus final CLI output. A future CLI sprint can add streaming progress/events if needed.
## Minimal UI Shape
Single window, no landing page:
- Input PDF: file picker.
- Output directory: directory picker, defaulting to `outputs/<pdf-stem>`.
- Options:
  - `Overwrite` checkbox.
  - `Keep raw MinerU output` checkbox.
  - `Group pages` checkbox plus numeric field, default `20`.
  - `GPU` field, default `cuda:0`.
- Buttons:
  - `Doctor`.
  - `Convert`.
  - `Cancel`.
  - `Open output`.
- Status:
  - Indeterminate progress bar while running.
  - Read-only log pane.
  - Last output paths from CLI/report when conversion completes.
No v1 drag/drop, batch queue, config editor, PDF preview, Markdown preview, or Obsidian integration. Those would add scope without helping the first `.exe` workflow.
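As a rough sketch, the form fields above could map onto a fixed `pdf2md convert` argument list like this. The function name `build_convert_args` and the field-to-flag mapping are assumptions: the flags `--out`, `--overwrite`, `--keep-raw`, `--chunk-pages`, and `--gpu` appear elsewhere in the project docs, but pairing `Group pages` with `--chunk-pages` is inferred rather than confirmed.

```python
from pathlib import Path


def build_convert_args(input_pdf, output_dir, *, overwrite=False, keep_raw=False,
                       group_pages=None, gpu="cuda:0"):
    """Build the argument list that follows the resolved pdf2md command."""
    pdf = Path(input_pdf)
    if pdf.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF path: {pdf}")
    # Each value is its own list element; nothing is joined into a command string.
    args = ["convert", str(pdf), "--out", str(Path(output_dir))]
    if overwrite:
        args.append("--overwrite")
    if keep_raw:
        args.append("--keep-raw")
    if group_pages is not None:
        pages = int(group_pages)
        if pages < 1:
            raise ValueError("group size must be a positive integer")
        args += ["--chunk-pages", str(pages)]
    if gpu:
        args += ["--gpu", gpu]
    return args
```

Because every field becomes a separate list element, paths with spaces or non-ASCII characters need no quoting, and no field can inject an extra command.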
## Build Shape
Proposed files:
```text
src/
  pdf2md_ui/
    __init__.py
    app.py
    runner.py
tests/
  test_ui_runner.py
```
Proposed dependency policy:
- No runtime GUI dependency beyond the standard library.
- Add PyInstaller only to a local dependency group such as `ui-build`, not to the converter runtime dependencies.
Proposed build commands:
```powershell
uv add --group ui-build "pyinstaller>=6.20,<7"
uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py
```
Expected artifact:
```text
dist/pdf2md-ui.exe
```
The built UI executable should be tested from the repository first, because `uv run pdf2md` needs a project root. If the executable is moved elsewhere, the UI should ask for and remember the project root in a small settings file under `%APPDATA%\pdf2md-ui\settings.json`.
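A minimal sketch of that settings file, assuming hypothetical helper names and a single `project_root` key. The fallback to the home directory when `APPDATA` is unset is an assumption added so the example runs outside Windows.

```python
import json
import os
from pathlib import Path


def settings_path(appdata=None):
    """Return the settings file location under the APPDATA pdf2md-ui folder."""
    base = appdata or os.environ.get("APPDATA") or str(Path.home())
    return Path(base) / "pdf2md-ui" / "settings.json"


def load_settings(path):
    """Read settings, treating a missing or corrupt file as empty."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):  # ValueError covers invalid JSON and bad encoding
        return {}
    return data if isinstance(data, dict) else {}


def save_project_root(path, project_root):
    """Persist only the project root; no PDF contents or extracted text."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    settings = load_settings(path)
    settings["project_root"] = str(project_root)
    path.write_text(json.dumps(settings, indent=2), encoding="utf-8")
```

Treating an unreadable file as empty means a damaged settings file only causes the UI to ask for the project root again.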
## Verification Plan
Fast tests:
- Command resolution with fake PATH/project-root cases.
- Command construction for `doctor`, `convert`, `recheck`.
- No generated command contains prohibited strict-local tokens such as `--api-url`, `http://`, `https://`, `router`, or `openai`.
- Output-directory defaulting for ASCII and non-ASCII PDF names using temporary files.
- Cancel path calls the Windows process-tree termination helper when needed, using a mocked process.
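Two of these fast checks can be sketched as small pure functions. The helper names are hypothetical. The token list is copied from the bullet above. The substring match is deliberately blunt, so it suits commands built from neutral fixture paths rather than arbitrary user paths.

```python
from pathlib import Path

PROHIBITED_TOKENS = ("--api-url", "http://", "https://", "router", "openai")


def find_prohibited_tokens(argv):
    """Return the prohibited strict-local tokens found anywhere in an argument list."""
    lowered = [arg.lower() for arg in argv]
    return [token for token in PROHIBITED_TOKENS
            if any(token in arg for arg in lowered)]


def default_output_dir(input_pdf, outputs_root="outputs"):
    """Default the output directory to <outputs_root>/<pdf-stem>."""
    return Path(outputs_root) / Path(input_pdf).stem
```

A test would assert that `find_prohibited_tokens` returns an empty list for every generated `doctor`, `convert`, and `recheck` command, and that `default_output_dir` keeps non-ASCII stems intact.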
Build verification:
```powershell
uv run pytest tests/test_ui_runner.py
uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py
Test-Path dist\pdf2md-ui.exe
```
Manual smoke verification:
1. Launch `dist\pdf2md-ui.exe`.
2. Run Doctor from the UI.
3. Select a small local sample PDF.
4. Convert to an ignored `outputs/` folder.
5. Confirm the UI reports completion and the simplified output folder contains `*_001.md`, `images/`, and `*_report.md`.
## Security, Privacy, And Distribution Notes
- The UI must not introduce any network document path.
- The UI must not expose arbitrary command execution. It should build fixed `pdf2md` argument lists from validated fields.
- Use `shell=False`; never concatenate user-provided paths into a command string.
- Do not store PDF contents or extracted text in settings.
- Do not include sample PDFs or generated outputs in the build or commit.
- Unsigned Windows executables may trigger SmartScreen. Microsoft documents that unsigned files start with no reputation, and even signed new binaries can show warnings until reputation accumulates. Code signing can be planned later if the tool is distributed beyond personal use.
- If signing is added later, SignTool from the Windows SDK is the documented Microsoft tool. Current SignTool docs require digest options such as `/fd` and `/td`, with SHA-256 recommended.
## Open Risks
- A thin launcher depends on an installed and healthy local runtime. The UI must make `doctor` prominent.
- Current CLI progress is coarse because `pdf2md` captures MinerU subprocess output. This is acceptable for v1 but limits progress detail.
- Cancelling a conversion can leave partially written ignored outputs; the UI should label a cancelled run clearly and not delete user-selected output directories unless a later requirement defines cleanup.
- If the UI is redistributed, licenses for MinerU, PyTorch, Qt if ever used, model weights, and bundled tools must be reviewed before packaging more than the thin UI launcher.
## Sources
- Python `tkinter`: https://docs.python.org/3/library/tkinter.html
- Python `tkinter.ttk`: https://docs.python.org/3/library/tkinter.ttk.html
- Python `subprocess`: https://docs.python.org/3/library/subprocess.html
- PyInstaller usage: https://pyinstaller.org/en/stable/usage.html
- PyInstaller requirements: https://pyinstaller.org/en/stable/requirements.html
- PyInstaller license: https://pyinstaller.org/en/stable/license.html
- PyInstaller runtime information: https://pyinstaller.org/en/stable/runtime-information.html
- Nuitka user manual: https://nuitka.net/user-documentation/user-manual.html
- Qt for Python PyInstaller deployment: https://doc.qt.io/qtforpython-6/deployment/deployment-pyinstaller.html
- `pyside6-deploy`: https://doc.qt.io/qtforpython-6.5/deployment/deployment-pyside6-deploy.html
- Qt for Python licenses: https://doc.qt.io/qtforpython-6/licenses.html
- Flet build: https://flet.dev/docs/cli/flet-build/
- Flet Windows packaging: https://flet.dev/docs/publish/windows/
- Tauri sidecars: https://tauri.app/develop/sidecar/
- Briefcase Windows packaging: https://briefcase.beeware.org/en/latest/reference/platforms/windows/
- uv dependency groups: https://docs.astral.sh/uv/concepts/projects/dependencies/
- Microsoft `taskkill`: https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/taskkill
- Microsoft SmartScreen reputation: https://learn.microsoft.com/en-us/windows/apps/package-and-deploy/smartscreen-reputation
- Microsoft SignTool: https://learn.microsoft.com/en-us/windows/win32/seccrypto/signtool
+77 -620
View File
@@ -1,28 +1,45 @@
# V1 Implementation Plan: Local PDF-to-Markdown Converter
Last updated: 2026-05-08
Last updated: 2026-05-13
This document is the implementation plan for v1. It does not replace `PRD.md` or `ARCHITECTURE.md`; use those files as the source of product requirements and system design. This plan explains the order of work, sprint contracts, verification gates, and agent ownership for implementing the converter.
This document tracks the current v1 implementation state and open future decisions. It does not replace `PRD.md` or `ARCHITECTURE.md`; use those files as the source of product requirements and system design. Completed sprint details are archived in `docs/WORKARCHIVE.md`, and detailed acceptance criteria remain in `docs/Sprints/*.md`.
Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents. Sprint 11 implemented conservative MathJax warning mitigation for failed math spans.
## 1. Current V1 State
## 1. V1 Outcome
The core v1 converter is implemented through Sprint 16. The implemented system includes:
- Python 3.12 package and `pdf2md` CLI.
- Direct local MinerU 3.1.0 CLI adapter with strict-local enforcement.
- Obsidian-friendly Markdown normalization.
- Internal provenance, structured warnings, quality checks, and one human-readable report.
- `pdf2md doctor`.
- Optional grouped page conversion through `--chunk-pages`.
- Local MathJax render checking and conservative failed-span repair.
- pypdf-based text fidelity diagnostics.
- NVIDIA GPU inventory, `--gpu auto`, and `--mineru-profile auto|safe|performance`.
- Simplified output layout: `<out>/<stem>/<stem>_001.md`, shared `<out>/<stem>/images/`, and `<out>/<stem>/<stem>_report.md`.
- No public metadata JSON for new conversions.
- Minimal Windows UI launcher over the existing CLI, including direct-folder PDF batch conversion through sequential `pdf2md convert` subprocesses.
Historical implementation evidence, verification commands, and sample conversion results are in `docs/WORKARCHIVE.md`.
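The simplified output layout can be illustrated with a small path-planning sketch. The function name `plan_outputs` and its return shape are assumptions for illustration and do not describe the project's actual `paths.py`.

```python
from pathlib import Path


def plan_outputs(out_dir, pdf_path, part_count):
    """Compute the simplified output layout for one PDF."""
    stem = Path(pdf_path).stem
    folder = Path(out_dir) / stem
    return {
        # <out>/<stem>/<stem>_001.md, <stem>_002.md, ...
        "parts": [folder / f"{stem}_{index:03d}.md"
                  for index in range(1, part_count + 1)],
        # Shared image/media directory for all parts.
        "images": folder / "images",
        # One human-readable report per PDF.
        "report": folder / f"{stem}_report.md",
    }
```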
## 2. V1 Outcome
v1 is complete when a local user can run:
```bash
uv run pdf2md doctor
uv run pdf2md convert paper.pdf --out out --metadata
uv run pdf2md convert pdfs --out out --recursive --metadata
uv run pdf2md convert paper.pdf --out out
uv run pdf2md convert pdfs --out out --recursive
```
and receive, for each PDF:
- Obsidian-friendly Markdown.
- A stable sibling assets directory when assets exist.
- `<stem>.metadata.json`.
- `<stem>.report.md`.
- Clear warnings when math, tables, assets, reading order, GPU availability, or MinerU execution are uncertain.
- Obsidian-friendly Markdown parts under `<out>/<stem>/<stem>_001.md`, `<stem>_002.md`, and so on.
- A stable shared image/media directory under `<out>/<stem>/images/`.
- One human-readable report under `<out>/<stem>/<stem>_report.md`.
- No persisted metadata JSON for new conversions.
- Clear warnings when math, tables, assets, reading order, text fidelity, GPU availability, or MinerU execution are uncertain.
Long PDFs can be chunked explicitly:
@@ -31,11 +48,11 @@ uv run pdf2md convert paper.pdf --out out --chunk-pages
uv run pdf2md convert paper.pdf --out out --chunk-pages 20
```
Chunked conversion writes separate outputs per chunk and does not merge Markdown files.
When `--chunk-pages` is active, MinerU receives one-page temporary PDFs and final Markdown files are grouped by the configured page count. Temporary one-page PDFs and intermediate per-page outputs are deleted.
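The grouping step can be sketched as below, assuming 1-based page numbers and a hypothetical `group_pages` helper. How the converter actually merges per-page outputs is not shown here.

```python
def group_pages(page_count, group_size):
    """Split 1-based page numbers into consecutive groups of at most group_size."""
    if page_count < 0 or group_size < 1:
        raise ValueError("page_count must be >= 0 and group_size must be >= 1")
    return [
        list(range(start, min(start + group_size, page_count + 1)))
        for start in range(1, page_count + 1, group_size)
    ]
```

For a 45-page PDF with a group size of 20, this yields three groups covering pages 1 to 20, 21 to 40, and 41 to 45, which would correspond to three Markdown parts.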
The converter must use MinerU 3.1.0 through direct local CLI execution only. It must not silently fall back to another engine.
The Windows UI launcher is a convenience wrapper over `pdf2md`; it is not a separate conversion pipeline. UI folder batch conversion runs direct-child PDFs sequentially through the same CLI conversion path.
## 2. Non-Negotiable Constraints
## 3. Non-Negotiable Constraints
- Python 3.12 and `uv`.
- MinerU 3.1.0 is the only conversion engine.
@@ -45,34 +62,10 @@ The converter must use MinerU 3.1.0 through direct local CLI execution only. It
- Target hardware: NVIDIA GTX 1070 Ti 8GB.
- Digital PDFs with text layers are the v1 priority.
- `samples/` is local fixture context and must not be committed unless explicitly requested.
- UI launcher must invoke `pdf2md` or `uv run pdf2md`; it must not call MinerU directly or bundle the full conversion runtime.
- Every substantial implementation chunk needs a sprint contract and independent evaluation.
## 3. Harness Operating Model
Use the project long-running harness only for substantial implementation work.
1. `harness-planner-agent` turns the next user request into a sprint contract.
2. `evaluation-agent` reviews the contract before code changes start.
3. `feature-generator-agent` implements one approved contract at a time.
4. `feature-generator-agent` runs self-checks and records residual risks.
5. `evaluation-agent` independently verifies the result against the contract.
6. The parent agent updates `PROGRESS.md`, commits the completed change, and leaves a handoff.
After a chunk is no longer active, archive completed-work details in `docs/WORKARCHIVE.md` and keep `PROGRESS.md` focused on current status, blockers, and next actions.
Each sprint contract must include:
- Objective.
- Touched surfaces.
- Expected outputs.
- Non-goals.
- Verification checks.
- Hard failure criteria.
- Handoff fields.
## 4. Proposed Repository Layout
Create this layout incrementally; do not scaffold unused modules before a sprint needs them.
## 4. Current Repository Layout
```text
pyproject.toml
@@ -91,615 +84,79 @@ src/
quality.py
report.py
doctor.py
gpu.py
mineru_profile.py
math_render.py
math_repair.py
text_fidelity.py
pdf2md_ui/
__init__.py
app.py
runner.py
tests/
unit/
integration/
fixtures/
scripts/
install-mineru.ps1
install-models.py
docs/
Sprints/
superpowers/
```
Planned module responsibilities:
Do not scaffold unused modules before a sprint needs them.
- `cli.py`: command parsing, CLI summaries, exit codes.
- `conversion.py`: orchestration for one PDF and batch input.
- `paths.py`: input discovery, output path planning, overwrite checks.
- `mineru_adapter.py`: direct local MinerU CLI boundary.
- `ir.py`: project-owned document/page/block/asset/warning records.
- `markdown.py`: Obsidian Markdown normalization.
- `metadata.py`: metadata schema creation and warning aggregation.
- `quality.py`: local checks for assets, math renderability, and output sanity.
- `report.py`: `<stem>.report.md` generation from metadata.
- `doctor.py`: environment, dependency, CUDA/GPU, MinerU, and cache diagnostics.
## 5. Sprint Sequence
### Sprint 0: Source And Environment Verification
Active contract:
- `docs/Sprints/SPRINT0CONTRACT.md`
Objective:
- Verify the facts needed before implementation starts.
Touched surfaces:
- `docs/KNOWLEDGEBASE.md`
- `docs/V1IMPLEMENTATIONPLAN.md` if sequencing changes
- `PROGRESS.md`
Expected outputs:
- Confirmed MinerU 3.1.0 install command, CLI invocation shape, version command, output paths, and local execution behavior.
- Confirmed Python 3.12, `uv`, CUDA/PyTorch, and GTX 1070 Ti 8GB risks.
- Confirmed license notes needed before redistribution.
Verification checks:
- All volatile facts cite official MinerU, Python, uv, PyTorch/CUDA, or license sources.
- No candidate engine comparison is reintroduced.
- No implementation code is created.
Hard failure criteria:
- MinerU 3.1.0 cannot be reasonably invoked through a direct local CLI on the target environment.
- Python 3.12 compatibility is not viable without changing project requirements.
Primary agents:
- `research-agent`
- `local-setup-agent`
- `license-privacy-agent`
### Sprint 1: Project Scaffold And Fast Test Loop
Active contract:
- `docs/Sprints/SPRINT1CONTRACT.md`
Objective:
- Create the minimal Python project structure and a fast local test loop.
Touched surfaces:
- `pyproject.toml`
- `src/pdf2md/__init__.py`
- `tests/`
- Development documentation if needed
Expected outputs:
- `uv sync` works.
- `uv run pytest` works.
- Project package imports as `pdf2md`.
- CLI entry point name `pdf2md` is reserved but may initially expose only `doctor` or a clear placeholder until later sprints.
- If `uv` is still unavailable locally, Sprint 1 records that blocker and is not marked complete.
Verification checks:
- Import test passes.
- Empty test suite or initial scaffold tests pass.
- No runtime network dependency is introduced.
Hard failure criteria:
- Project cannot be installed with `uv`.
- Scaffolding adds speculative config systems, extra engines, or unused abstractions.
Primary agents:
- `harness-planner-agent`
- `feature-generator-agent`
- `evaluation-agent`
### Sprint 2: Paths, Input Discovery, And Overwrite Planning
Active contract:
- `docs/Sprints/SPRINT2CONTRACT.md`
Objective:
- Implement deterministic input and output planning before conversion logic exists.
Touched surfaces:
- `paths.py`
- `conversion.py` skeleton if needed
- CLI path handling tests
Expected outputs:
- Single PDF discovery.
- Directory PDF discovery.
- Recursive traversal only when requested.
- Deterministic output paths for Markdown, assets, metadata JSON, report, and optional raw MinerU output.
- Existing-output protection unless `--overwrite` is passed.
Verification checks:
- Unit tests for single PDF path planning.
- Unit tests for directory and recursive discovery.
- Unit tests for overwrite behavior.
- Tests include Korean or non-ASCII filename handling using generated temporary files, not committed sample PDFs.
Hard failure criteria:
- Output planning can overwrite user files without explicit overwrite intent.
- Directory conversion descends recursively without `--recursive`.
Primary agents:
- `feature-generator-agent`
- `evaluation-agent`
### Sprint 3: Domain Records, Metadata, And Warning Model
Active contract:
- `docs/Sprints/SPRINT3CONTRACT.md`
Objective:
- Define project-owned records before binding to MinerU output.
Touched surfaces:
- `ir.py`
- `metadata.py`
- `report.py` skeleton if needed
- Unit tests
Expected outputs:
- Document, page, block, asset, and warning records.
- Stable warning codes from `ARCHITECTURE.md`.
- Metadata JSON builder with required top-level and summary fields.
- Warning aggregation logic.
Verification checks:
- Unit tests for metadata schema creation.
- Unit tests for warning aggregation.
- Unit tests for optional fields such as bbox and confidence being preserved only when present.
Hard failure criteria:
- Public API requires raw MinerU objects.
- Metadata omits source PDF, SHA-256, engine, pages, warnings, assets, or summary.
Primary agents:
- `metadata-agent`
- `feature-generator-agent`
- `evaluation-agent`
### Sprint 4: MinerU Adapter With Mocked Contract
Active contract:
- `docs/Sprints/SPRINT4CONTRACT.md`
Objective:
- Build the direct local MinerU adapter boundary with mocked outputs first.
Touched surfaces:
- `mineru_adapter.py`
- `doctor.py` partial checks
- Adapter tests with fake subprocess results and fake output directories
Expected outputs:
- Adapter availability check.
- Version check.
- Direct CLI command construction.
- Strict-local command validation.
- Subprocess execution wrapper capturing stdout, stderr, exit code, and paths.
- Parsed adapter result object with raw Markdown, raw structured data when available, assets, warnings, engine, engine version, options, exit code, and stderr.
- Baseline command shape based on MinerU 3.1.0 direct local CLI: `mineru -p <input> -o <output>`.
- Strict-local validation allows CLI-internal temporary local `mineru-api` orchestration, while rejecting `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends.
Verification checks:
- Mocked successful MinerU output test.
- Mocked missing MinerU test.
- Mocked non-zero exit test.
- Test that prohibited remote/API flags cannot be introduced.
- No real MinerU/model dependency in default tests.
Hard failure criteria:
- Adapter passes `--api-url`, uses router mode, uses an HTTP client backend, or connects to a remote API or remote OpenAI-compatible backend.
- Adapter falls back to another engine after MinerU failure.
- Tests require model downloads by default.
Primary agents:
- `mineru-integration-agent`
- `feature-generator-agent`
- `evaluation-agent`
### Sprint 5: Obsidian Markdown Normalization And Assets
Active contract:
- `docs/Sprints/SPRINT5CONTRACT.md`
Objective:
- Normalize MinerU/project IR output into Obsidian-friendly Markdown.
Touched surfaces:
- `markdown.py`
- `quality.py` partial asset link checks
- Unit tests
Expected outputs:
- Inline math delimiter normalization to `$...$`.
- Display math delimiter normalization to `$$...$$`.
- Blank-line normalization around display math.
- Relative asset link normalization.
- Simple table preservation and complex table fallback warnings.
- No visible page markers by default.
Verification checks:
- Unit tests for inline math.
- Unit tests for display math spacing.
- Unit tests for underscores/carets inside math.
- Unit tests for relative asset links.
- Unit tests for table fallback warning behavior.
Hard failure criteria:
- Normalization rewrites LaTeX semantics without deterministic tests.
- Generated links are absolute when relative links are required.
- Page provenance is only visible in Markdown and missing from metadata.
Primary agents:
- `obsidian-markdown-agent`
- `feature-generator-agent`
- `evaluation-agent`
### Sprint 6: Quality Checks And Report Generation
Active contract:
- `docs/Sprints/SPRINT6CONTRACT.md`
Objective:
- Produce local quality signals and human-readable reports from metadata.
Touched surfaces:
- `quality.py`
- `report.py`
- `metadata.py`
- Unit tests
Expected outputs:
- Missing asset link count.
- Math renderability check interface with graceful unavailable-tool handling.
- Pages-with-warnings summary.
- `<stem>.report.md` generated from metadata.
- Final status: `success`, `partial`, or `failed`.
Verification checks:
- Unit tests for report content.
- Unit tests for missing asset link count.
- Unit tests for math render failure aggregation.
- Report generation does not re-run MinerU.
Hard failure criteria:
- Report diverges from JSON metadata.
- Math render failures are silently ignored.
- Quality checks require network access.
Primary agents:
- `metadata-agent`
- `evaluation-agent`
- `feature-generator-agent`
### Sprint 7: Conversion Orchestrator, CLI, And Python API
Active contract:
- `docs/Sprints/SPRINT7CONTRACT.md`
Objective:
- Connect path planning, MinerU adapter, normalization, metadata, report, and summaries.
Touched surfaces:
- `conversion.py`
- `cli.py`
- `__init__.py`
- CLI and API tests
Expected outputs:
- `convert_pdf(input_path, output_dir, metadata=True)` public API.
- `pdf2md convert INPUT --out OUTPUT_DIR`.
- `--metadata`, `--keep-raw`, `--recursive`, `--overwrite`, `--gpu`, and `--strict-local` behavior.
- Batch conversion for directories.
- CLI summary with warning counts.
Verification checks:
- API test with mocked MinerU adapter.
- CLI single PDF test with mocked MinerU adapter.
- CLI directory test with mocked MinerU adapter.
- Existing output test.
- Failure summary test.
Hard failure criteria:
- Public API exposes raw MinerU objects as required return fields.
- CLI writes outputs after a hard failure that should stop conversion.
- CLI suppresses warning counts.
Primary agents:
- `feature-generator-agent`
- `requirements-guard-agent`
- `evaluation-agent`
### Sprint 8: Doctor And Setup Documentation
Active contract:
- `docs/Sprints/SPRINT8CONTRACT.md`
## 5. Active Next Sprint
Status:
- Implemented.
- No active implementation sprint.
Objective:
Next implementation work should start from a new user-approved requirement and, if substantial, a new sprint contract.
- Make local setup failures explicit before users run conversions.
## 6. Abandoned Planning
Touched surfaces:
- `doctor.py`
- `cli.py`
- `README.md`
- `scripts/install-mineru.ps1`
- `scripts/install-models.py`
- Tests for mocked environment checks
Expected outputs:
- `pdf2md doctor` reports Python version, `uv`, CUDA/PyTorch GPU visibility, MinerU availability, MinerU version, and detectable model/cache paths.
- GPU unavailable warning is clear.
- Missing `uv` is reported clearly.
- Pre-Turing/Pascal GPU risk is reported clearly for GTX 1070 Ti compute capability 6.1.
- Missing required dependency causes doctor failure.
- Setup docs explain Windows PowerShell, Python 3.12, `uv`, MinerU, models, GPU expectations, and local-only behavior.
Verification checks:
- Mocked doctor tests for success, missing MinerU, missing GPU, and missing dependency.
- Documentation review for no cloud/API runtime path.
Hard failure criteria:
- Doctor says the environment is healthy when MinerU is missing.
- Doctor implies cloud/API fallback is supported.
Primary agents:
- `local-setup-agent`
- `license-privacy-agent`
- `evaluation-agent`
### Sprint 9: Local Fixture Evaluation And V1 Release Gate
Active contract:
- `docs/Sprints/SPRINT9CONTRACT.md`
### Sprint 17: Offline Windows Installer
Status:
- Implemented.
- Abandoned at the user's request on 2026-05-13.
Objective:
Historical references:
- Validate the end-to-end v1 behavior against local samples without committing samples.
- `docs/Sprints/SPRINT17CONTRACT.md`.
- `docs/superpowers/plans/2026-05-12-offline-installer.md`.
Touched surfaces:
Do not implement or extend Sprint 17 unless the user explicitly reopens offline installer work.
- `tests/integration/`
- Optional local-only fixture manifest that does not include sample PDFs
- `README.md`
- `PROGRESS.md`
## 7. Future Decisions
Expected outputs:
- Decide whether simplified outputs need a metadata-free `pdf2md recheck`; current `recheck` remains legacy-only for outputs with adjacent metadata JSON.
- Validate `--gpu auto --mineru-profile auto` on a stronger NVIDIA GPU PC.
- Fast mocked integration suite.
- Optional MinerU-dependent local test command.
- Local sample coverage notes in `PROGRESS.md`.
- V1 release checklist status.
## 8. Harness Operating Model
Verification checks:
Use the project long-running harness only for substantial implementation work.
- `uv run pytest` passes without model downloads.
- Optional MinerU test is clearly marked and skipped unless explicitly enabled.
- Representative sample produces Markdown, metadata JSON, report Markdown, and asset paths.
- Obsidian math delimiter expectations are met.
- No sample PDFs are staged.
1. `harness-planner-agent` turns the next user request into a sprint contract.
2. `evaluation-agent` reviews the contract before code changes start.
3. `feature-generator-agent` implements one approved contract at a time.
4. `feature-generator-agent` runs self-checks and records residual risks.
5. `evaluation-agent` independently verifies the result against the contract.
6. The parent agent updates `PROGRESS.md`, commits the completed change, and leaves a handoff.
Hard failure criteria:
After a chunk is no longer active, archive completed-work details in `docs/WORKARCHIVE.md` and keep `PROGRESS.md` focused on current status, blockers, and next actions.
- Default tests require GPU, MinerU models, or network access.
- Sample files are added to git unintentionally.
- V1 release checklist passes without metadata/report generation.
## 9. Completed Sprint Archive
Primary agents and skills:
Completed sprint details have been moved out of this active implementation plan.
- `evaluation-agent`
- `requirements-guard-agent`
- `fixture-evaluation` skill
- Summary and verification evidence: `docs/WORKARCHIVE.md`.
- Detailed historical contracts: `docs/Sprints/SPRINT0CONTRACT.md` through `docs/Sprints/SPRINT16CONTRACT.md`.
- UI folder batch design and execution record: `docs/superpowers/specs/2026-05-13-ui-folder-batch-conversion-design.md` and `docs/superpowers/plans/2026-05-13-ui-folder-batch-conversion.md`.
- Abandoned Sprint 17 planning record: `docs/Sprints/SPRINT17CONTRACT.md` and `docs/superpowers/plans/2026-05-12-offline-installer.md`.
### Sprint 10: Pre-Conversion PDF Page Chunking
Active contract:
- `docs/Sprints/SPRINT10CONTRACT.md`
Status:
- Implemented.
Objective:
- Split long PDFs into temporary fixed-size page chunks before MinerU conversion.
Touched surfaces:
- `pdf_splitter.py`
- `conversion.py`
- `cli.py`
- `report.py`
- README and Sprint 10 documentation
- Unit tests for splitter, conversion, CLI, and report behavior
Expected outputs:
- `pdf2md convert INPUT --out OUTPUT --chunk-pages` enables 20-page chunks.
- `pdf2md convert INPUT --out OUTPUT --chunk-pages N` enables custom positive chunk size.
- `convert_pdf(..., chunk_pages=N)` returns a `BatchConversionResult` in chunk mode.
- Temporary chunk PDFs are deleted after conversion completes.
- Chunk Markdown files are separate and named with original page ranges.
- Metadata and report content expose original source path and chunk page ranges.
Verification checks:
- pypdf-based local blank PDF tests cover page counts, chunk ranges, and written chunk page counts.
- Mocked conversion tests verify one adapter call per chunk, failed-chunk continuation, chunk metadata/report context, and temporary chunk cleanup.
- CLI tests verify `--chunk-pages` without a value uses 20 pages.
Hard failure criteria:
- Chunking uploads document content or uses another conversion engine.
- Chunk outputs are merged.
- Default tests require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
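The fixed-size chunking described for Sprint 10 can be illustrated with a small range calculation. This is a minimal sketch, not the contents of `pdf_splitter.py`: the function name `chunk_page_ranges` and the choice of 1-based inclusive page ranges are assumptions made for illustration. Only the 20-page default and the positive-size requirement come from the sprint notes above.

```python
def chunk_page_ranges(total_pages: int, chunk_pages: int = 20) -> list[tuple[int, int]]:
    """Return inclusive, 1-based (start, end) page ranges of at most chunk_pages pages.

    Hypothetical helper for illustration; the real splitter may use other names
    or 0-based indices.
    """
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if chunk_pages <= 0:
        raise ValueError("chunk_pages must be a positive integer")
    return [
        (start, min(start + chunk_pages - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_pages)
    ]
```

Under these assumptions, a 45-page PDF with the default size would yield three ranges, the last one shorter than the others, which matches the expectation that chunk Markdown files are named with original page ranges.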
### Sprint 11: MathJax Warning Mitigation
Active contract:
- `docs/Sprints/SPRINT11CONTRACT.md`
Status:
- Implemented.
Objective:
- Repair narrow MathJax-invalid formula artifacts after initial local validation and before final output writing.
Touched surfaces:
- `quality.py`
- `math_repair.py`
- `conversion.py`
- `ir.py`
- Unit tests for quality details, repair rules, conversion, and recheck behavior
Expected outputs:
- Failed math expression records expose body, display mode, span, and checker message.
- Repair candidates are generated only for failed math spans.
- Repeated same-direction scripts are disambiguated with an empty group.
- Truncated `\end{a}` array endings are repaired when array environments are unbalanced.
- `convert` and `recheck` share the same repair behavior.
- Applied repairs are recorded as `MATH_RENDER_REPAIRED` info warnings and do not count as math render errors.
Verification checks:
- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
- `samples/MITC공부.pdf` validates locally with `Math render error count: 0`.
Hard failure criteria:
- Repair changes math spans that did not fail local MathJax validation.
- Repair claims success without candidate revalidation.
- Repair introduces remote services, alternate engines, or mandatory sample-dependent default tests.
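The "empty group" rule for repeated same-direction scripts can be sketched as a single regular-expression substitution. This is an illustrative sketch only, not the code in `math_repair.py`: the function name is hypothetical, it handles only directly adjacent scripts such as `x^a^b`, and it assumes script arguments are a single character, a control word, or a brace group without nesting. It also omits the two safeguards the sprint requires, namely applying candidates only to spans that failed validation and revalidating each candidate before accepting it.

```python
import re

# A script argument: a non-nested brace group, a control word, or one plain character.
_SCRIPT_ARG = r"(?:\{[^{}]*\}|\\[A-Za-z]+|[^\s{}\\^_])"

# A script (^ or _) with its argument, immediately followed by the same script marker.
_REPEATED_SCRIPT = re.compile(r"(([\^_])" + _SCRIPT_ARG + r")(?=\2)")


def disambiguate_repeated_scripts(body: str) -> str:
    """Insert an empty group between directly repeated ^...^ or _..._ scripts."""
    return _REPEATED_SCRIPT.sub(r"\g<1>{}", body)
```

A mixed pair such as `x_i^2` is left untouched because the lookahead only fires when the next marker matches the current one.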
## 6. Cross-Cutting Acceptance Criteria
Every implementation sprint must preserve these acceptance criteria:
- No runtime remote document processing path exists.
- MinerU is the only conversion engine.
- Failures are explicit and traceable.
- Warnings are structured and countable.
- Markdown and metadata can be traced back to source pages where available.
- Reports are generated from metadata.
- Default tests are fast and local.
- `samples/` remains untracked unless explicitly requested.
## 7. First Implementation Request Contract Template
Use this template when implementation begins.
```markdown
## Sprint Contract
Objective:
Touched surfaces:
Expected outputs:
Non-goals:
Verification checks:
Hard failure criteria:
Handoff fields:
- Files changed:
- Commands run:
- Tests passed:
- Known failures:
- Residual risks:
- Next action:
```
## 8. Open Risks
- MinerU 3.1.0 install and CLI behavior are source-verified, but real local output still needs a later local probe before release.
- GTX 1070 Ti 8GB is visible locally, but it is Pascal compute capability 6.1; `doctor` and setup docs must make CUDA/PyTorch limits clear.
- `uv` is installed per-user at `C:\Users\user\.local\bin`, but a new shell may need PATH refresh before `uv` is visible.
- Formula renderability checks and conservative warning mitigation are implemented, but formula reconstruction remains best effort and should keep warnings/provenance visible.
- Some PDFs will have tables or formulas that cannot be faithfully represented in Markdown; metadata and `.report.md` must surface this instead of hiding it.
- Redistribution license obligations must be reviewed before packaging, redistribution, or bundling model weights.
## 9. Recommended Next Step
Run optional real local MinerU validation on a long sample only when requested. Default verification should continue to use mocked adapters and generated temporary PDFs so it remains independent of MinerU, GPU, model files, network access, and `samples/`.
Facts carried forward from Sprint 0:
Facts carried forward from completed work:
- MinerU is fixed to version 3.1.0.
- Direct local CLI command shape is `mineru -p <input> -o <output>`.
- MinerU output layout should be treated as optional-file based until locally probed.
- Python 3.12 is compatible with the pinned MinerU package range.
- GTX 1070 Ti CUDA/PyTorch support needs explicit doctor validation.
- MinerU/model license posture is acceptable for personal local use. Redistribution remains out of scope until reviewed.
- Formula reconstruction remains best effort and must keep warnings/provenance visible.
- MinerU/model license posture is acceptable for personal local use. Redistribution remains gated by license review.
+8 -9
@@ -76,13 +76,13 @@ This optional pytest path runs `pdf2md doctor` first. If doctor has a hard failu
A sample conversion is successful only when all of these are true:
- The command exits 0.
- The planned Markdown file exists: `<output>\<stem>.md`.
- The planned metadata JSON exists: `<output>\<stem>.metadata.json`.
- The planned quality report exists: `<output>\<stem>.report.md`.
- Metadata and report warning counts are consistent enough to explain math, table, reading-order, asset, MinerU, and checker-unavailable risks.
- The planned Markdown part exists: `<output>\<stem>\<stem>_001.md`.
- The planned quality report exists: `<output>\<stem>\<stem>_report.md`.
- No public `.metadata.json` sidecar is written for new conversions.
- The report warning counts are consistent enough to explain math, table, reading-order, asset, MinerU, and checker-unavailable risks.
- Any Markdown image links resolve relative to the Markdown file, or missing/broken links are reported as warnings.
Missing Markdown, metadata JSON, or `.report.md` means the sample failed or is blocked. Do not count it as a partial success for release gating.
Missing Markdown part or `_report.md` means the sample failed or is blocked. Do not count it as a partial success for release gating.
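The file-level part of the revised success criteria can be checked mechanically. The following is a minimal sketch under stated assumptions: the function name is hypothetical, it covers only the first Markdown part, the report, and the absence of a metadata sidecar, and it does not inspect the exit code, warning counts, or image links.

```python
from pathlib import Path


def missing_or_unexpected_outputs(output_root: Path, stem: str) -> list[str]:
    """Return problems with the simplified layout; an empty list means the files look right."""
    folder = output_root / stem
    problems: list[str] = []
    if not (folder / f"{stem}_001.md").is_file():
        problems.append(f"missing Markdown part: {stem}_001.md")
    if not (folder / f"{stem}_report.md").is_file():
        problems.append(f"missing quality report: {stem}_report.md")
    # New conversions must not write a public metadata sidecar.
    if folder.is_dir() and any(folder.glob("*.metadata.json")):
        problems.append("unexpected .metadata.json sidecar")
    return problems
```

A non-empty result would mean the sample is failed or blocked for release gating rather than a partial success.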
For each attempted sample, record at least:
@@ -90,8 +90,7 @@ For each attempted sample, record at least:
- Command run.
- Exit code.
- Generated Markdown path.
- Generated metadata JSON path.
- Generated `.report.md` path.
- Generated `_report.md` path.
- Warning count and final status.
- Math renderability failures or checker-unavailable count.
- Table fallback or degradation count when available.
@@ -110,7 +109,7 @@ Local fixture coverage should include these risk categories where samples are av
- Figure, caption, or extracted asset links.
- Korean or non-ASCII filename/path handling.
Observed local fixture map as of 2026-05-08:
Observed local fixture map as of 2026-05-11:
| Local sample | Fixture risks covered | Notes |
| --- | --- | --- |
@@ -126,7 +125,7 @@ Coverage gaps to keep visible:
- A table-dominant sample with known formula cells would make table degradation easier to judge.
- A figure-heavy sample with expected extracted assets would make asset link validation easier to judge.
Do not score fixture quality only by plain-text edit distance. Include math delimiter/renderability behavior, tables, reading order, assets, metadata fields, warning usefulness, and `.report.md` usefulness.
Do not score fixture quality only by plain-text edit distance. Include math delimiter/renderability behavior, tables, reading order, assets, report provenance, warning usefulness, and `_report.md` usefulness.
## No-Sample-Commit Check
+53 -2
@@ -1,6 +1,6 @@
# Work Archive
Last updated: 2026-05-08
Last updated: 2026-05-13
This document stores completed project work, historical sprint outcomes, environment setup results, and sample conversion evidence. `PROGRESS.md` should stay focused on current status, blockers, and next actions. Read this archive when a task needs past implementation context, previous verification commands, or historical handoff details.
@@ -34,6 +34,16 @@ This document stores completed project work, historical sprint outcomes, environ
| GPU default/runtime setup | Made conversion default to `cuda:0`, mapped CUDA requests to MinerU subprocess environment variables, rebuilt `.venv`, installed CUDA-enabled PyTorch and MinerU 3.1.0, downloaded MinerU models, and set `MINERU_MODEL_SOURCE=local`. | `README.md`, `src/pdf2md/mineru_adapter.py`, `src/pdf2md/conversion.py` |
| MathJax checker | Planned and implemented local MathJax render checker with Node.js helper, Python wrapper, conversion integration, and doctor diagnostics. | `docs/MATHJAXCHECKERPLAN.md`, `tools/mathjax-checker/check.mjs`, `src/pdf2md/math_render.py` |
| Sprint 10 | Implemented opt-in pre-conversion PDF chunking with `pypdf`, temporary chunk PDF cleanup, `--chunk-pages [PAGES]`, chunk metadata/report context, and mocked tests. | `docs/Sprints/SPRINT10CONTRACT.md`, `src/pdf2md/pdf_splitter.py` |
| Sprint 11 | Implemented conservative MathJax warning mitigation with failed-expression details, `src/pdf2md/math_repair.py`, shared `convert`/`recheck` repair integration, and `MATH_RENDER_REPAIRED` info warnings. | `docs/Sprints/SPRINT11CONTRACT.md`, `src/pdf2md/math_repair.py`, `src/pdf2md/quality.py`, `src/pdf2md/conversion.py` |
| UI research and Sprint 12 planning | Researched minimal Windows UI launcher options and planned a thin `tkinter`/`ttk` launcher over the existing CLI with PyInstaller build output at `dist/pdf2md-ui.exe`. | `docs/UI_RESEARCH.md`, `docs/Sprints/SPRINT12CONTRACT.md`, `PLAN.md` |
| Sprint 12 | Implemented a minimal `tkinter`/`ttk` Windows UI launcher over `pdf2md` or `uv run pdf2md`, with fixed argument-list subprocess calls, worker-thread logging, cancellation, Recheck support, and PyInstaller build output at `dist/pdf2md-ui.exe`. | `docs/Sprints/SPRINT12CONTRACT.md`, `src/pdf2md_ui/`, `tests/test_ui_runner.py` |
| Sprint 13 | Implemented local pypdf text layer fidelity diagnostics, including Hangul count deltas, unexpected CJK counts, text similarity, Hangul spacing anomaly ratios, replacement-candidate markers, metadata/report integration, and `recheck` support without automatic body-text replacement. | `docs/Sprints/SPRINT13CONTRACT.md`, `src/pdf2md/text_fidelity.py`, `src/pdf2md/conversion.py`, `src/pdf2md/metadata.py`, `src/pdf2md/report.py` |
| Sprint 14 | Changed chunk mode so MinerU receives one source page per run while final Markdown, metadata, report, and assets are grouped by `chunk_pages`. Failed page conversions are nonfatal within partially successful groups and are recorded in metadata/report output. | `docs/Sprints/SPRINT14CONTRACT.md`, `src/pdf2md/conversion.py`, `src/pdf2md/report.py`, `tests/test_conversion.py` |
| Sprint 15 | Implemented NVIDIA GPU inventory parsing, optional `--gpu auto`, default `--mineru-profile auto`, conservative MinerU environment tuning, profile provenance in metadata/report output, and doctor GPU/profile recommendations. | `docs/Sprints/SPRINT15CONTRACT.md`, `src/pdf2md/gpu.py`, `src/pdf2md/mineru_profile.py`, `src/pdf2md/conversion.py`, `src/pdf2md/doctor.py` |
| Sprint 16 | Simplified public conversion outputs to one PDF-stem folder, numbered Markdown parts, shared `images/`, one `_report.md`, no persisted metadata JSON, compatibility-no-op `--metadata`, and legacy-only `recheck`. | `docs/Sprints/SPRINT16CONTRACT.md`, `src/pdf2md/paths.py`, `src/pdf2md/conversion.py`, `src/pdf2md/report.py`, `src/pdf2md/cli.py` |
| UI direct-folder batch conversion | Added a minimal UI workflow that selects one folder, discovers direct-child PDFs only, and sequentially runs the existing `pdf2md convert` command for each file with the selected options. | `docs/superpowers/specs/2026-05-13-ui-folder-batch-conversion-design.md`, `docs/superpowers/plans/2026-05-13-ui-folder-batch-conversion.md`, `src/pdf2md_ui/runner.py`, `src/pdf2md_ui/app.py` |
| Sprint 17 planning | Planned a large offline Windows installer, then abandoned the sprint at the user's request before implementation began. | `docs/Sprints/SPRINT17CONTRACT.md`, `docs/superpowers/plans/2026-05-12-offline-installer.md` |
| Documentation archive cleanup | Moved completed implementation details out of `PLAN.md`, `PROGRESS.md`, and `docs/V1IMPLEMENTATIONPLAN.md`, then removed Sprint 17 from active planned work after it was abandoned. | `PLAN.md`, `PROGRESS.md`, `docs/V1IMPLEMENTATIONPLAN.md`, `docs/WORKARCHIVE.md` |
## Runtime Setup Archive
@@ -43,12 +53,13 @@ This document stores completed project work, historical sprint outcomes, environ
- `uv` installed per-user at `C:\Users\user\.local\bin`.
- GPU target: NVIDIA GTX 1070 Ti 8GB.
- Local GPU observed: NVIDIA GeForce GTX 1070 Ti, driver 577.00, 8192 MiB VRAM, WDDM.
- Default conversion device/profile: `--gpu cuda:0` and `--mineru-profile auto`.
- MinerU execution mode: direct local `mineru` CLI only.
- MinerU 3.1.0 CLI-internal temporary local `mineru-api` is allowed when the CLI runs without `--api-url`.
- GTX 1070 Ti runtime setup used `torch==2.6.0+cu126`, `torchvision==0.21.0+cu126`, and `mineru[core]==3.1.0`.
- MinerU models were downloaded with `uv run mineru-models-download -s huggingface -m all`.
- Runtime model loading uses `MINERU_MODEL_SOURCE=local`.
- Current doctor status after setup is WARN because GTX 1070 Ti is Pascal/pre-Turing; MinerU, CUDA PyTorch, local model config, MathJax checker, and strict-local checks pass.
- Current doctor status after setup is WARN because GTX 1070 Ti is Pascal/pre-Turing; MinerU, CUDA PyTorch, local model config, MathJax checker, and strict-local checks pass. Sprint 15 doctor output selects `cuda:0` for `--gpu auto` on this machine and recommends MinerU profile `safe`.
## Sample Conversion Archive
@@ -58,6 +69,11 @@ Generated outputs are ignored under `outputs/` and are not committed.
| --- | --- | --- | --- |
| `samples/MITC공부.pdf` | Completed after CUDA-enabled runtime setup. | `outputs/MITC공부/` | 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 1 info warning at the time of that run because the local MathJax checker was unavailable. |
| `samples/FourNodeQuadrilateralShellElementMITC4.pdf` | Completed with default GPU request and `MINERU_MODEL_SOURCE=local`. | `outputs/FourNodeQuadrilateralShellElementMITC4/` | Report status `success`: 7 pages, 22 assets, 38 inline formulas, 16 display formulas, 0 math render errors, 0 warnings. |
| `samples/FourNodeQuadrilateralShellElementMITC4.pdf` | Sprint 14 sample smoke stalled and was terminated. | No final output directory. | On 2026-05-12, `--chunk-pages` entered the one-page conversion path and used `cuda:0` with GPU utilization near 100%. Source page 1 completed, but source page 2 stayed active for more than 15 minutes total runtime with no final grouped output, so the process tree was terminated and the temporary `pdf2md.pages.*` directory was removed. |
| `samples/MITC공부.pdf` | Reconverted after Sprint 11 mitigation. | `outputs/MITC공부/` and `outputs/sprint11-MITC공부/` | Report status `partial` from 2 `MATH_RENDER_REPAIRED` info warnings: 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 0 MathJax render errors, and 0 missing or invalid asset links. |
| `samples/2007쉘구조물의유한요소해석에대하여.pdf` | Completed after Sprint 13 validation with 1-page chunking. | `outputs/2007쉘구조물의유한요소해석에대하여_pages1/` | A fresh `--chunk-pages 5` attempt stayed on part 001 for over 40 minutes with GPU near full utilization and no output, so it was terminated. The clean `--chunk-pages 1` run completed 13/13 chunks with 0 failures, 44 warnings, 0 MathJax render errors, 13 low text-fidelity pages, 15 unexpected CJK characters, 13 diagnostic replacement-candidate pages, and 0 uncertain page mappings. |
| `samples/SolidElement.pdf` | Completed after Sprint 15 GPU/profile implementation with `--gpu auto --mineru-profile auto --chunk-pages`. | `outputs/SolidElement_sprint15_auto_20260512/` | Completed in about 11 minutes 51 seconds on GTX 1070 Ti. Report status `partial`: 6 pages, 0 failed pages, safe profile applied, 71 assets, 3 inline formulas, 55 display formulas, 0 MathJax render errors, 0 missing/invalid asset links, 11 warnings, and 5 low text-fidelity pages. |
| `samples/SolidElement.pdf` | Completed after Sprint 16 simplified output layout with `--gpu auto --mineru-profile auto --chunk-pages`. | `outputs/SolidElement/` | Completed in about 17 minutes 51 seconds on GTX 1070 Ti. Produced `SolidElement_001.md`, `SolidElement_report.md`, shared `images/` with 71 assets, and no persisted metadata JSON. Report status `partial`: 6 pages, 0 failed pages, safe profile applied, 3 inline formulas, 55 display formulas, 0 MathJax render errors, 0 missing/invalid asset links, 11 warnings, and 5 low text-fidelity pages. |
## Historical Verification Highlights
@@ -73,6 +89,41 @@ Generated outputs are ignored under `outputs/` and are not committed.
- CUDA runtime rebuild: verified CUDA with an actual tensor operation on `NVIDIA GeForce GTX 1070 Ti`, compute capability 6.1; `mineru --version` reported 3.1.0.
- MathJax checker: `npm run mathjax-checker:health` returned `{"ok":true}` after local `npm install`; full suite passed 150 tests with 1 optional skip after integration.
- Sprint 10 chunking: targeted chunking tests passed 42 tests; full default suite passed 163 tests with 1 optional skip; `git diff --check` passed with line-ending warnings only.
- Sprint 11 MathJax warning mitigation: targeted tests passed 56 tests; full default suite passed 172 tests with 1 optional skip; requested `samples/MITC공부.pdf` validation produced 0 MathJax render errors and 2 traceable repair info warnings.
- UI research and Sprint 12 planning: `docs/UI_RESEARCH.md` and `docs/Sprints/SPRINT12CONTRACT.md` were added; no implementation tests were required because this was documentation and planning only.
- Sprint 12 UI implementation: `uv run pytest tests\test_ui_runner.py` passed 16 tests; `uv run pytest` passed 188 tests with 1 optional skip; `uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py` produced `dist\pdf2md-ui.exe`; `uv run pdf2md doctor` returned WARN only for the documented GTX 1070 Ti/Pascal compatibility risk; launch smoke confirmed the executable process starts.
- Sprint 12 residual smoke risk: a direct CLI conversion smoke using `samples\FourNodeQuadrilateralShellElementMITC4.pdf` and the same command shape used by the UI exceeded the 15-minute timeout on 2026-05-11. The spawned process tree was terminated with `taskkill`.
- Sprint 13 text fidelity diagnostics: `uv run pytest tests/test_text_fidelity.py tests/test_metadata.py tests/test_report.py tests/test_conversion.py` passed 49 tests; `uv run pytest` passed 198 tests with 1 optional skip.
- Sprint 13 sample validation on 2026-05-11: `samples/2007쉘구조물의유한요소해석에대하여.pdf` completed with `--chunk-pages 1` under `outputs/2007쉘구조물의유한요소해석에대하여_pages1/`; generated 13 Markdown files, 13 metadata JSON files, and 13 report files.
- Sprint 14 grouped page conversion: targeted red tests first failed against the Sprint 10 chunking behavior, then passed after implementation. `uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/test_pdf_splitter.py tests/test_paths.py tests/test_metadata.py tests/test_ui_runner.py` passed 101 tests; full `uv run pytest` passed 202 tests with 1 optional skip.
- Sprint 14 sample smoke on 2026-05-12: `uv run pdf2md convert samples\FourNodeQuadrilateralShellElementMITC4.pdf --out outputs\FourNodeQuadrilateralShellElementMITC4_sprint14_20260512_112342 --chunk-pages --strict-local` used `cuda:0` with GPU utilization near 100%, reached source page 2, then exceeded 15 minutes total runtime without producing a final output directory. The process tree was terminated and the leftover temporary directory was removed.
- Sprint 15 NVIDIA GPU detection/profile tuning: targeted tests `uv run pytest tests/test_gpu.py tests/test_mineru_profile.py tests/test_mineru_adapter.py tests/test_conversion.py tests/test_cli.py tests/test_doctor.py` passed 101 tests. Full `uv run pytest` passed 225 tests with 1 optional skip. `uv run pdf2md doctor` returned WARN on the local GTX 1070 Ti, reported GPU 0 with 8192 MiB VRAM, selected `cuda:0` for `--gpu auto`, and recommended profile `safe`. Optional stronger-PC real MinerU conversion validation was not run in this workspace.
- SolidElement sample validation on 2026-05-12: `uv run pdf2md convert samples\SolidElement.pdf --out outputs\SolidElement_sprint15_auto_20260512 --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local` completed successfully with one grouped output and no failed source pages.
- Sprint 16 simplified output layout: focused verification `uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/test_ui_runner.py tests/integration/test_v1_fast_release_gate.py -q` passed 91 tests; full `uv run pytest` passed 227 tests with 1 optional skip; `git diff --check` passed with line-ending warnings only. New conversions write `<out>/<stem>/<stem>_001.md`, shared `<out>/<stem>/images/`, and `<out>/<stem>/<stem>_report.md`; no public `.metadata.json` is written.
- Sprint 16 SolidElement sample validation on 2026-05-12: `uv run pdf2md convert samples\SolidElement.pdf --out outputs --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local` completed successfully with one simplified Markdown part, one report, shared images, no public metadata JSON, and no failed source pages.
- UI direct-folder batch conversion on 2026-05-13: `uv run pytest tests/test_ui_runner.py -q` passed 19 tests; `uv run python -m py_compile src\pdf2md_ui\app.py src\pdf2md_ui\runner.py` passed; `uv run pytest -q` passed 230 tests with 1 skipped; PyInstaller rebuilt `dist\pdf2md-ui.exe`; a short process-start smoke confirmed the executable starts.
- Sprint 17 planning on 2026-05-12: `docs/Sprints/SPRINT17CONTRACT.md` and `docs/superpowers/plans/2026-05-12-offline-installer.md` were added. No implementation tests were required because this was planning only.
- Sprint 17 abandonment on 2026-05-13: offline installer planning was abandoned at the user's request before implementation began. The contract and plan remain historical records only.
## Archived V1 Implementation Plan
`docs/V1IMPLEMENTATIONPLAN.md` now tracks current state and planned next work only. Completed Sprint 0 through Sprint 16 details are archived here and in their respective `docs/Sprints/SPRINT*CONTRACT.md` files.
Current completed v1 capability summary:
- Python 3.12 package and `pdf2md` CLI.
- Direct local MinerU 3.1.0 CLI adapter with strict-local enforcement.
- Obsidian Markdown normalization, local quality checks, internal provenance, and one human-readable report.
- `pdf2md doctor`, local MathJax checking, conservative MathJax warning mitigation, and pypdf text fidelity diagnostics.
- Opt-in grouped page conversion where MinerU receives one source page per run.
- NVIDIA GPU detection, `--gpu auto`, and `--mineru-profile auto|safe|performance`.
- Simplified public output layout with no public metadata JSON for new conversions.
- Minimal Windows UI launcher with direct-folder batch conversion through sequential existing CLI calls.
Current planned next work:
- No active implementation sprint. Future substantial work should start from a new user-approved requirement and sprint contract.
## Historical Blockers And Resolutions
@@ -0,0 +1,683 @@
# Offline Windows Installer Implementation Plan
> **Status:** Abandoned at the user's request on 2026-05-13 before implementation began. This file is retained as historical planning context only. Do not execute this plan unless the user explicitly reopens offline installer work.
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Build an offline Windows installer that installs the existing `pdf2md` CLI/UI runtime on another Windows x64 PC without internet access.
**Architecture:** Build a large installer payload on an internet-connected build PC, then create the target PC `.venv` locally from bundled wheels during installation. Keep conversion behavior unchanged and keep the UI as a launcher over the installed project-owned `pdf2md` CLI.
**Tech Stack:** Python 3.12, uv, pip wheelhouse/download workflow, PyInstaller, PowerShell, Inno Setup, MinerU 3.1.0, CUDA PyTorch `2.6.0+cu126`, optional Node.js/MathJax.
---
## File Structure
- `docs/Sprints/SPRINT17CONTRACT.md`: sprint contract, scope, acceptance criteria, and hard failure criteria.
- `packaging/offline/build-offline-payload.ps1`: connected build-PC script that stages all offline files under `dist/offline-installer/`.
- `packaging/offline/verify-offline-payload.ps1`: build-PC and target-PC script that validates `payload-manifest.json` and hashes.
- `packaging/offline/install-runtime.ps1`: target-PC installer script that hash-verifies the payload, creates `.venv`, installs from local wheels, configures local models, and runs doctor.
- `packaging/offline/repair-runtime.ps1`: target-PC repair script that recreates `.venv` from the retained wheelhouse.
- `packaging/offline/run-doctor.ps1`: shortcut target for post-install diagnostics.
- `packaging/offline/Pdf2MdOffline.iss`: Inno Setup installer script.
- `packaging/offline/requirements-runtime-cu126.txt`: pinned offline runtime requirement set for Windows x64 CUDA 12.6 wheels.
- `packaging/offline/README.md`: build and install instructions.
- `packaging/offline/THIRD_PARTY_NOTICES.md`: redistribution notes and license links for bundled payload families.
- `src/pdf2md/packaging_manifest.py`: optional small helper for deterministic manifest/hash generation.
- `src/pdf2md_ui/runner.py`: installed runtime command resolution and child environment updates.
- `src/pdf2md_ui/app.py`: installed runtime project-root default only if needed.
- `tests/test_offline_packaging.py`: fast tests for manifest, script safety, and installer script contents with fake payloads.
- `tests/test_ui_runner.py`: fast tests for installed `.venv` and bundled `uv --offline` command resolution.
- `.gitignore`: ignore generated payload, wheelhouse, models, and installer outputs.
- `README.md` and `docs/V1RELEASECHECKLIST.md`: user-facing build/release documentation.
- `PLAN.md`, `PROGRESS.md`, `docs/WORKARCHIVE.md`: coordination and handoff.
## Task 1: Packaging Manifest And Ignore Policy
**Files:**
- Create: `tests/test_offline_packaging.py`
- Create: `src/pdf2md/packaging_manifest.py`
- Modify: `.gitignore`
- [ ] **Step 1: Write the failing manifest tests**
```python
from pathlib import Path

from pdf2md.packaging_manifest import build_payload_manifest


def test_build_payload_manifest_records_hash_size_and_source(tmp_path: Path) -> None:
    payload = tmp_path / "payload"
    payload.mkdir()
    wheel = payload / "wheelhouse" / "example-1.0-py3-none-any.whl"
    wheel.parent.mkdir()
    wheel.write_bytes(b"wheel-bytes")

    manifest = build_payload_manifest(
        payload,
        sources={"wheelhouse/example-1.0-py3-none-any.whl": "local test wheel"},
    )

    assert manifest["files"] == [
        {
            "path": "wheelhouse/example-1.0-py3-none-any.whl",
            "size": 11,
            "sha256": "9ceb18f15662bb87e54af2f5953c0484d2ef76f5444d87913360b9ef87d7296d",
            "source": "local test wheel",
        }
    ]


def test_build_payload_manifest_uses_forward_slash_relative_paths(tmp_path: Path) -> None:
    payload = tmp_path / "payload"
    nested = payload / "models" / "mineru" / "model.bin"
    nested.parent.mkdir(parents=True)
    nested.write_bytes(b"model")

    manifest = build_payload_manifest(payload, sources={})

    assert manifest["files"][0]["path"] == "models/mineru/model.bin"
```
- [ ] **Step 2: Run the manifest tests to verify failure**
Run:
```powershell
uv run pytest tests/test_offline_packaging.py -q
```
Expected: FAIL because `pdf2md.packaging_manifest` does not exist.
- [ ] **Step 3: Implement the minimal manifest helper**
```python
"""Offline installer payload manifest helpers."""

from __future__ import annotations

import hashlib
from pathlib import Path
from typing import Mapping, TypedDict


class ManifestFile(TypedDict):
    path: str
    size: int
    sha256: str
    source: str


class PayloadManifest(TypedDict):
    files: list[ManifestFile]


def build_payload_manifest(payload_root: str | Path, *, sources: Mapping[str, str]) -> PayloadManifest:
    root = Path(payload_root)
    files: list[ManifestFile] = []
    for path in sorted(candidate for candidate in root.rglob("*") if candidate.is_file()):
        relative = path.relative_to(root).as_posix()
        files.append(
            {
                "path": relative,
                "size": path.stat().st_size,
                "sha256": _sha256(path),
                "source": sources.get(relative, "unknown"),
            }
        )
    return {"files": files}


def _sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()
```
- [ ] **Step 4: Add generated payload ignores**
Append to `.gitignore`:
```gitignore
dist/
packaging/offline/_payload/
packaging/offline/_wheelhouse/
packaging/offline/_models/
*.issig
*.exe.tmp
```
If `dist/` is already ignored implicitly by an existing entry, keep one clear `dist/` entry and avoid duplicates.
- [ ] **Step 5: Run tests**
Run:
```powershell
uv run pytest tests/test_offline_packaging.py -q
git diff --check
```
Expected: tests PASS; diff check has no whitespace errors.
- [ ] **Step 6: Commit**
```powershell
git add .gitignore src\pdf2md\packaging_manifest.py tests\test_offline_packaging.py
git commit -m "feat: add offline payload manifest helper"
```
## Task 2: Offline Payload Builder
**Files:**
- Create: `packaging/offline/build-offline-payload.ps1`
- Create: `packaging/offline/verify-offline-payload.ps1`
- Create: `packaging/offline/requirements-runtime-cu126.txt`
- Create: `packaging/offline/README.md`
- Modify: `tests/test_offline_packaging.py`
- [ ] **Step 1: Write tests for builder safety**
```python
from pathlib import Path


def test_payload_builder_excludes_development_and_sample_paths() -> None:
    script = Path("packaging/offline/build-offline-payload.ps1").read_text(encoding="utf-8")
    assert ".git" in script
    assert ".venv" in script
    assert "samples" in script
    assert "outputs" in script
    assert "Copy-Item -Recurse -Force" in script


def test_runtime_requirements_pin_core_gpu_stack() -> None:
    requirements = Path("packaging/offline/requirements-runtime-cu126.txt").read_text(encoding="utf-8")
    assert "torch==2.6.0" in requirements
    assert "torchvision==0.21.0" in requirements
    assert "mineru[core]==3.1.0" in requirements
    assert "pypdf" in requirements
```
- [ ] **Step 2: Run tests to verify failure**
Run:
```powershell
uv run pytest tests/test_offline_packaging.py -q
```
Expected: FAIL because the packaging files do not exist.
- [ ] **Step 3: Create the pinned requirements file**
```text
convert-pdf-to-md==0.1.0
pypdf>=6.10.2,<7
torch==2.6.0
torchvision==0.21.0
mineru[core]==3.1.0
```
- [ ] **Step 4: Create the payload builder skeleton**
The script must accept explicit input paths and fail when required payload pieces are missing:
```powershell
param(
    [string]$Configuration = "Release",
    [string]$PythonInstaller,
    [string]$UvExe,
    [string]$MinerUModelSource,
    [string]$NodeRoot = "",
    [string]$OutputRoot = "dist\offline-installer"
)
$ErrorActionPreference = "Stop"
$RepoRoot = Resolve-Path (Join-Path $PSScriptRoot "..\..")
$StageRoot = Join-Path $RepoRoot $OutputRoot
$AppRoot = Join-Path $StageRoot "app"
$RuntimeRoot = Join-Path $StageRoot "runtime"
$PayloadRoot = Join-Path $StageRoot "payload"

# Fail early when any required input is missing.
if (-not (Test-Path $PythonInstaller)) { throw "Missing Python installer: $PythonInstaller" }
if (-not (Test-Path $UvExe)) { throw "Missing uv.exe: $UvExe" }
if (-not (Test-Path $MinerUModelSource)) { throw "Missing MinerU model source: $MinerUModelSource" }
if (-not (Test-Path (Join-Path $RepoRoot "dist\pdf2md-ui.exe"))) { throw "Missing UI exe. Build dist\pdf2md-ui.exe first." }

# Always start from an empty stage so stale files never leak into the installer.
Remove-Item -LiteralPath $StageRoot -Recurse -Force -ErrorAction SilentlyContinue
New-Item -ItemType Directory -Path $AppRoot,$RuntimeRoot,$PayloadRoot | Out-Null

# Names that must never be staged. Only the explicit allow-list below is copied.
$Excluded = @(".git", ".venv", "samples", "outputs", "dist", "build", "node_modules", ".pytest_cache", "__pycache__")
Copy-Item -Recurse -Force (Join-Path $RepoRoot "src") (Join-Path $RuntimeRoot "src")
Copy-Item -Force (Join-Path $RepoRoot "pyproject.toml") (Join-Path $RuntimeRoot "pyproject.toml")
Copy-Item -Force (Join-Path $RepoRoot "uv.lock") (Join-Path $RuntimeRoot "uv.lock")
Copy-Item -Force (Join-Path $RepoRoot "README.md") (Join-Path $RuntimeRoot "README.md")
Copy-Item -Force (Join-Path $RepoRoot "dist\pdf2md-ui.exe") (Join-Path $AppRoot "pdf2md-ui.exe")

# Create payload\wheelhouse now so the pip download step has a destination.
New-Item -ItemType Directory -Path (Join-Path $PayloadRoot "python"),(Join-Path $PayloadRoot "uv"),(Join-Path $PayloadRoot "wheelhouse") | Out-Null
Copy-Item -Force $PythonInstaller (Join-Path $PayloadRoot "python\python-3.12-amd64.exe")
Copy-Item -Force $UvExe (Join-Path $PayloadRoot "uv\uv.exe")
# install-runtime.ps1 reads the requirements file from the payload root.
Copy-Item -Force (Join-Path $PSScriptRoot "requirements-runtime-cu126.txt") (Join-Path $PayloadRoot "requirements-runtime-cu126.txt")
Copy-Item -Recurse -Force $MinerUModelSource (Join-Path $PayloadRoot "models")
if ($NodeRoot -and (Test-Path $NodeRoot)) {
    Copy-Item -Recurse -Force $NodeRoot (Join-Path $PayloadRoot "node")
}
Write-Host "Offline installer stage created at $StageRoot"
Write-Host "Use pip download on the connected build PC to fill payload\wheelhouse before compiling the installer."
```
- [ ] **Step 5: Document the connected wheelhouse build command**
Add to `packaging/offline/README.md`:
```powershell
uv build --wheel
Copy-Item dist\convert_pdf_to_md-0.1.0-py3-none-any.whl dist\offline-installer\payload\wheelhouse\
py -3.12 -m pip download -d dist\offline-installer\payload\wheelhouse -r packaging\offline\requirements-runtime-cu126.txt --find-links dist\offline-installer\payload\wheelhouse --extra-index-url https://download.pytorch.org/whl/cu126
```
- [ ] **Step 6: Add the payload verifier**
`verify-offline-payload.ps1` must read `payload\payload-manifest.json`, recompute SHA-256 for each listed file, and fail when a file is missing or changed.
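The check itself is small. As a rough illustration of the logic only (the real verifier is the PowerShell script above, and the function names `verify_payload` and `sha256_of` are hypothetical), a Python sketch that assumes the manifest uses the `files` / `path` / `size` / `sha256` shape produced by `build_payload_manifest` might look like this:

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks (hypothetical helper, mirrors _sha256)."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_payload(payload_root: Path, manifest_name: str = "payload-manifest.json") -> list[str]:
    """Return a list of problems; an empty list means the payload matches."""
    manifest = json.loads((payload_root / manifest_name).read_text(encoding="utf-8"))
    problems: list[str] = []
    for entry in manifest["files"]:
        target = payload_root / entry["path"]
        if not target.is_file():
            problems.append(f"missing: {entry['path']}")
        elif target.stat().st_size != entry["size"] or sha256_of(target) != entry["sha256"]:
            # Size is compared first because it is cheap; the hash is the real check.
            problems.append(f"changed: {entry['path']}")
    return problems
```

A caller would treat any non-empty result as a failure and exit non-zero, which is the behavior the PowerShell verifier is required to have.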
- [ ] **Step 7: Run tests**
Run:
```powershell
uv run pytest tests/test_offline_packaging.py -q
git diff --check
```
Expected: PASS.
- [ ] **Step 8: Commit**
```powershell
git add packaging\offline\build-offline-payload.ps1 packaging\offline\verify-offline-payload.ps1 packaging\offline\requirements-runtime-cu126.txt packaging\offline\README.md tests\test_offline_packaging.py
git commit -m "feat: add offline payload builder"
```
## Task 3: Target Runtime Install And Repair Scripts
**Files:**
- Create: `packaging/offline/install-runtime.ps1`
- Create: `packaging/offline/repair-runtime.ps1`
- Create: `packaging/offline/run-doctor.ps1`
- Modify: `tests/test_offline_packaging.py`
- [ ] **Step 1: Write script safety tests**
```python
from pathlib import Path


def test_install_runtime_uses_only_local_package_sources() -> None:
    script = Path("packaging/offline/install-runtime.ps1").read_text(encoding="utf-8")
    assert "--no-index" in script
    assert "--find-links" in script
    assert "UV_OFFLINE" in script
    assert "https://" not in script
    assert "http://" not in script


def test_install_runtime_does_not_silently_overwrite_mineru_config() -> None:
    script = Path("packaging/offline/install-runtime.ps1").read_text(encoding="utf-8")
    assert "mineru.json" in script
    assert "Backup" in script
    assert "Silent" in script
    assert "throw" in script
```
- [ ] **Step 2: Run tests to verify failure**
Run:
```powershell
uv run pytest tests/test_offline_packaging.py -q
```
Expected: FAIL because scripts do not exist.
- [ ] **Step 3: Implement `install-runtime.ps1`**
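The script must install only from the bundled payload, stop on the first failing step, and never silently overwrite an existing `mineru.json`: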
```powershell
param(
    [string]$InstallRoot = "$env:LOCALAPPDATA\Programs\ConvertPDFToMD",
    [switch]$Silent
)
$ErrorActionPreference = "Stop"
$PayloadRoot = Join-Path $InstallRoot "payload"
$RuntimeRoot = Join-Path $InstallRoot "runtime"
$VenvPython = Join-Path $RuntimeRoot ".venv\Scripts\python.exe"
$VenvPdf2Md = Join-Path $RuntimeRoot ".venv\Scripts\pdf2md.exe"
$UvExe = Join-Path $PayloadRoot "uv\uv.exe"
$Wheelhouse = Join-Path $PayloadRoot "wheelhouse"
$Requirements = Join-Path $PayloadRoot "requirements-runtime-cu126.txt"
$LogRoot = Join-Path $InstallRoot "logs"
New-Item -ItemType Directory -Path $LogRoot -Force | Out-Null
$env:UV_OFFLINE = "1"
$env:MINERU_MODEL_SOURCE = "local"
if (-not (Test-Path $UvExe)) { throw "Missing bundled uv.exe: $UvExe" }
if (-not (Test-Path $Wheelhouse)) { throw "Missing wheelhouse: $Wheelhouse" }
if (-not (Test-Path $Requirements)) { throw "Missing requirements: $Requirements" }
& $UvExe venv (Join-Path $RuntimeRoot ".venv") --python 3.12
if ($LASTEXITCODE -ne 0) { throw "uv venv failed with exit code $LASTEXITCODE" }
& $UvExe pip install --python $VenvPython --no-index --find-links $Wheelhouse -r $Requirements
if ($LASTEXITCODE -ne 0) { throw "offline package install failed with exit code $LASTEXITCODE" }
& $UvExe pip check --python $VenvPython
if ($LASTEXITCODE -ne 0) { throw "uv pip check failed with exit code $LASTEXITCODE" }
$MinerUConfig = Join-Path $env:USERPROFILE "mineru.json"
if (Test-Path $MinerUConfig) {
    # A silent install cannot ask the user, so refuse instead of touching the config.
    if ($Silent) { throw "Existing mineru.json requires interactive confirmation: $MinerUConfig" }
    $Backup = "$MinerUConfig.pdf2md-backup-$(Get-Date -Format yyyyMMddHHmmss)"
    Copy-Item -Force $MinerUConfig $Backup
}
& $VenvPdf2Md doctor *> (Join-Path $LogRoot "doctor-after-install.txt")
if ($LASTEXITCODE -ne 0) { throw "pdf2md doctor failed with exit code $LASTEXITCODE" }
```
- [ ] **Step 4: Implement repair and doctor scripts**
`repair-runtime.ps1` reruns `install-runtime.ps1` for an existing install root. `run-doctor.ps1` runs the installed `.venv\Scripts\pdf2md.exe doctor` and writes `logs\doctor-latest.txt`.
- [ ] **Step 5: Run tests**
Run:
```powershell
uv run pytest tests/test_offline_packaging.py -q
git diff --check
```
Expected: PASS.
- [ ] **Step 6: Commit**
```powershell
git add packaging\offline\install-runtime.ps1 packaging\offline\repair-runtime.ps1 packaging\offline\run-doctor.ps1 tests\test_offline_packaging.py
git commit -m "feat: add offline runtime install scripts"
```
## Task 4: UI Installed Runtime Resolution
**Files:**
- Modify: `src/pdf2md_ui/runner.py`
- Modify: `src/pdf2md_ui/app.py` only if needed
- Modify: `tests/test_ui_runner.py`
- [ ] **Step 1: Add failing runner tests**
```python
from pathlib import Path

from pdf2md_ui.runner import resolve_cli_command


def test_resolve_prefers_project_venv_pdf2md(tmp_path: Path) -> None:
    root = tmp_path / "runtime"
    scripts = root / ".venv" / "Scripts"
    scripts.mkdir(parents=True)
    (root / "pyproject.toml").write_text("[project]\nname='x'\n", encoding="utf-8")
    pdf2md = scripts / "pdf2md.exe"
    pdf2md.write_text("", encoding="utf-8")

    resolved = resolve_cli_command(project_root=root, which=lambda name: None)

    assert resolved.args_prefix == (str(pdf2md),)
    assert resolved.cwd is None
    assert resolved.source == "venv"


def test_resolve_uses_bundled_uv_offline_when_no_venv_command(tmp_path: Path) -> None:
    root = tmp_path / "runtime"
    root.mkdir()
    (root / "pyproject.toml").write_text("[project]\nname='x'\n", encoding="utf-8")
    uv = tmp_path / "payload" / "uv" / "uv.exe"
    uv.parent.mkdir(parents=True)
    uv.write_text("", encoding="utf-8")

    resolved = resolve_cli_command(project_root=root, bundled_uv=uv, which=lambda name: None)

    assert resolved.args_prefix == (str(uv), "run", "--offline", "pdf2md")
    assert resolved.cwd == root
    assert resolved.source == "bundled-uv"
```
- [ ] **Step 2: Run tests to verify failure**
Run:
```powershell
uv run pytest tests/test_ui_runner.py -q
```
Expected: FAIL because the runner does not yet support installed `.venv` or bundled uv resolution.
- [ ] **Step 3: Implement minimal runner changes**
Add `bundled_uv` as an optional keyword to `resolve_cli_command`, check `<project_root>\.venv\Scripts\pdf2md.exe` after configured command and before PATH, and use bundled `uv run --offline pdf2md` before system `uv`.
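The resolution order described above can be sketched as follows. This is a standalone illustration, not the project's `runner.py`: the `ResolvedCommand` field names come from the tests in this plan, while the `configured` parameter name, the `"configured"` and `"uv"` source labels, the exact PATH and system-`uv` argument prefixes, and the final error are assumptions.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass(frozen=True)
class ResolvedCommand:
    args_prefix: tuple[str, ...]
    cwd: Optional[Path]
    source: str


def resolve_cli_command(
    *,
    project_root: Path,
    configured: Optional[str] = None,  # hypothetical name for the configured command
    bundled_uv: Optional[Path] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> ResolvedCommand:
    # 1. An explicitly configured command always wins.
    if configured:
        return ResolvedCommand((configured,), None, "configured")
    # 2. The installed runtime's own virtual environment.
    venv_cli = project_root / ".venv" / "Scripts" / "pdf2md.exe"
    if venv_cli.is_file():
        return ResolvedCommand((str(venv_cli),), None, "venv")
    # 3. A pdf2md executable found on PATH.
    on_path = which("pdf2md")
    if on_path:
        return ResolvedCommand((on_path,), None, "path")
    # 4. The bundled uv, forced offline, run from the runtime root.
    if bundled_uv is not None and bundled_uv.is_file():
        return ResolvedCommand((str(bundled_uv), "run", "--offline", "pdf2md"), project_root, "bundled-uv")
    # 5. A system-wide uv as the last resort (assumed prefix).
    system_uv = which("uv")
    if system_uv:
        return ResolvedCommand((system_uv, "run", "pdf2md"), project_root, "uv")
    raise FileNotFoundError("pdf2md CLI not found")
```

Injecting `which` keeps the function testable without touching the real PATH, which is how the tests in Step 1 drive it.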
- [ ] **Step 4: Add child environment tests**
Add a test that `build_child_environment(project_root=runtime_root)` prepends `.venv\Scripts` and `payload\node` when those folders exist, while preserving `MINERU_MODEL_SOURCE=custom` if the user already set it.
- [ ] **Step 5: Run tests**
Run:
```powershell
uv run pytest tests/test_ui_runner.py -q
```
Expected: PASS.
- [ ] **Step 6: Commit**
```powershell
git add src\pdf2md_ui\runner.py src\pdf2md_ui\app.py tests\test_ui_runner.py
git commit -m "feat: resolve installed offline runtime from UI"
```
## Task 5: Inno Setup Script
**Files:**
- Create: `packaging/offline/Pdf2MdOffline.iss`
- Modify: `tests/test_offline_packaging.py`
- [ ] **Step 1: Add Inno script tests**
```python
from pathlib import Path


def test_inno_script_installs_payload_and_shortcuts() -> None:
    script = Path("packaging/offline/Pdf2MdOffline.iss").read_text(encoding="utf-8")
    assert "DefaultDirName={localappdata}\\Programs\\ConvertPDFToMD" in script
    assert "payload\\*" in script
    assert "app\\*" in script
    assert "runtime\\*" in script
    assert "pdf2md-ui.exe" in script
    assert "install-runtime.ps1" in script
    assert "PDF2MD Doctor" in script
    assert "Repair PDF2MD Runtime" in script


def test_inno_script_excludes_development_artifacts() -> None:
    script = Path("packaging/offline/Pdf2MdOffline.iss").read_text(encoding="utf-8")
    assert "samples" not in script
    assert "outputs" not in script
    assert ".venv" not in script
```
- [ ] **Step 2: Run tests to verify failure**
Run:
```powershell
uv run pytest tests/test_offline_packaging.py -q
```
Expected: FAIL because the Inno script does not exist.
- [ ] **Step 3: Create the Inno script**
```ini
[Setup]
AppId={{PDF2MD-OFFLINE-INSTALLER}}
AppName=ConvertPDFToMD
AppVersion=0.1.0
DefaultDirName={localappdata}\Programs\ConvertPDFToMD
DefaultGroupName=ConvertPDFToMD
OutputDir=..\..\dist
OutputBaseFilename=Pdf2MdOfflineSetup-0.1.0
Compression=lzma2
SolidCompression=yes
PrivilegesRequired=lowest
[Files]
Source: "..\..\dist\offline-installer\payload\*"; DestDir: "{app}\payload"; Flags: recursesubdirs createallsubdirs
Source: "..\..\dist\offline-installer\app\*"; DestDir: "{app}\app"; Flags: recursesubdirs createallsubdirs
Source: "..\..\dist\offline-installer\runtime\*"; DestDir: "{app}\runtime"; Flags: recursesubdirs createallsubdirs
Source: "install-runtime.ps1"; DestDir: "{app}\scripts"
Source: "repair-runtime.ps1"; DestDir: "{app}\scripts"
Source: "run-doctor.ps1"; DestDir: "{app}\scripts"
[Icons]
Name: "{group}\ConvertPDFToMD"; Filename: "{app}\app\pdf2md-ui.exe"; WorkingDir: "{app}\runtime"
Name: "{group}\PDF2MD Doctor"; Filename: "powershell.exe"; Parameters: "-ExecutionPolicy Bypass -File ""{app}\scripts\run-doctor.ps1"""; WorkingDir: "{app}"
Name: "{group}\Repair PDF2MD Runtime"; Filename: "powershell.exe"; Parameters: "-ExecutionPolicy Bypass -File ""{app}\scripts\repair-runtime.ps1"""; WorkingDir: "{app}"
[Run]
Filename: "powershell.exe"; Parameters: "-ExecutionPolicy Bypass -File ""{app}\scripts\install-runtime.ps1"" -InstallRoot ""{app}"""; StatusMsg: "Installing offline pdf2md runtime..."; Flags: runhidden
```
- [ ] **Step 4: Run tests**
Run:
```powershell
uv run pytest tests/test_offline_packaging.py -q
git diff --check
```
Expected: PASS.
- [ ] **Step 5: Compile with Inno Setup on a build PC**
Run:
```powershell
ISCC.exe packaging\offline\Pdf2MdOffline.iss
```
Expected: exit code 0 and `dist\Pdf2MdOfflineSetup-0.1.0.exe` exists. Do not commit the generated exe.
- [ ] **Step 6: Commit**
```powershell
git add packaging\offline\Pdf2MdOffline.iss tests\test_offline_packaging.py
git commit -m "feat: add offline installer script"
```
## Task 6: Documentation, Verification, And Handoff
**Files:**
- Modify: `README.md`
- Modify: `docs/V1RELEASECHECKLIST.md`
- Modify: `docs/Sprints/SPRINT17CONTRACT.md`
- Modify: `PLAN.md`
- Modify: `PROGRESS.md`
- Modify: `docs/WORKARCHIVE.md`
- [ ] **Step 1: Document build and install flow**
Add a README section with:
```markdown
## Offline Windows Installer
The offline installer is built on an internet-connected Windows x64 build PC, then copied to a target Windows x64 PC with networking disabled. The target installer creates a fresh `.venv` from bundled wheels; it does not copy the development `.venv`.
```
- [ ] **Step 2: Document verification gates**
Add to `docs/V1RELEASECHECKLIST.md`:
```markdown
### Offline Installer Gate
- Build `dist\pdf2md-ui.exe`.
- Stage the offline payload.
- Verify payload hashes.
- Compile the Inno Setup installer.
- Install on a clean Windows x64 VM with networking disabled.
- Run `pdf2md doctor` from the installed `.venv`.
- Run one optional local conversion only when a local test PDF is available and generated outputs remain ignored.
```
- [ ] **Step 3: Run final fast tests**
Run:
```powershell
uv run pytest tests/test_offline_packaging.py tests/test_ui_runner.py
uv run pytest
git diff --check
```
Expected: PASS, except pre-existing documented optional skips.
- [ ] **Step 4: Run packaging smoke on build PC**
Run:
```powershell
uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py
$pythonInstaller = "C:\BuildCache\python-3.12-amd64.exe"
$uvExe = "C:\BuildCache\uv.exe"
$mineruModels = "C:\BuildCache\mineru-models"
powershell -ExecutionPolicy Bypass -File packaging\offline\build-offline-payload.ps1 -Configuration Release -PythonInstaller $pythonInstaller -UvExe $uvExe -MinerUModelSource $mineruModels
ISCC.exe packaging\offline\Pdf2MdOffline.iss
```
Expected: installer exe exists under `dist\`; generated files remain untracked.
- [ ] **Step 5: Update coordination docs**
Record changed files, verification output, generated installer path, payload size, and residual risks in `PROGRESS.md`. Move final implementation evidence and offline VM smoke results to `docs/WORKARCHIVE.md`.
- [ ] **Step 6: Commit final docs**
```powershell
git add README.md docs\V1RELEASECHECKLIST.md docs\Sprints\SPRINT17CONTRACT.md PLAN.md PROGRESS.md docs\WORKARCHIVE.md
git commit -m "docs: record offline installer release gate"
```
## Execution Notes
- Do not commit payload contents, wheels, model files, Python installers, Node binaries, generated installer exe files, `samples/`, or `outputs/`.
- Keep runtime conversion strict-local. Setup-time payload creation may use internet only on the build PC.
- Treat license/model redistribution review as a release gate before sharing the installer outside the current personal environment.
@@ -0,0 +1,111 @@
# UI Folder Batch Conversion Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add a minimal UI folder workflow that converts every direct-child PDF in a selected folder by sequentially invoking the existing `pdf2md convert` CLI.
**Architecture:** Keep the converter and CLI unchanged. Add deterministic folder discovery and batch command construction to `src/pdf2md_ui/runner.py`, then make `src/pdf2md_ui/app.py` run a list of `CommandSpec` objects sequentially on the existing worker-thread/event-queue pattern.
**Tech Stack:** Python 3.12, tkinter/ttk, pytest, PyInstaller, existing `pdf2md_ui.runner` subprocess wrapper.
---
### Task 1: Runner Batch Helpers
**Files:**
- Modify: `tests/test_ui_runner.py`
- Modify: `src/pdf2md_ui/runner.py`
- [x] **Step 1: Write failing tests**
```python
def test_list_direct_pdf_files_returns_sorted_direct_children_only(tmp_path: Path) -> None:
    (tmp_path / "b.PDF").write_text("", encoding="utf-8")
    (tmp_path / "a.pdf").write_text("", encoding="utf-8")
    nested = tmp_path / "nested"
    nested.mkdir()
    (nested / "c.pdf").write_text("", encoding="utf-8")
    (tmp_path / "notes.txt").write_text("", encoding="utf-8")

    assert [path.name for path in list_direct_pdf_files(tmp_path)] == ["a.pdf", "b.PDF"]
```
```python
def test_build_batch_convert_commands_reuses_convert_options(tmp_path: Path) -> None:
    resolved = ResolvedCommand(("pdf2md",), cwd=None, source="path")
    pdfs = [tmp_path / "a.pdf", tmp_path / "b.pdf"]

    commands = build_batch_convert_commands(
        resolved,
        pdfs,
        tmp_path / "out",
        overwrite=True,
        keep_raw=True,
        chunk_pages=5,
        gpu="auto",
        mineru_profile="safe",
    )

    assert [command.args[2] for command in commands] == [str(pdfs[0]), str(pdfs[1])]
    assert all("--chunk-pages" in command.args for command in commands)
    assert all("--mineru-profile" in command.args for command in commands)
```
- [x] **Step 2: Run tests to verify RED**
Run: `uv run pytest tests/test_ui_runner.py::test_list_direct_pdf_files_returns_sorted_direct_children_only tests/test_ui_runner.py::test_build_batch_convert_commands_reuses_convert_options -q`
Expected: FAIL because the new helpers are not defined.
- [x] **Step 3: Implement minimal runner helpers**
Add `list_direct_pdf_files(folder)` using `Path.iterdir()` and case-insensitive `.pdf` suffix matching. Add `build_batch_convert_commands()` that loops over the provided PDF paths and delegates to `build_convert_command()`.
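A minimal sketch of the discovery helper, assuming that "deterministic name order" means a case-insensitive sort on the file name (the plan does not pin the exact sort key):

```python
from pathlib import Path


def list_direct_pdf_files(folder: Path) -> list[Path]:
    """Return PDFs directly inside folder, ignoring nested folders."""
    pdfs = [
        child
        for child in Path(folder).iterdir()
        # suffix.lower() makes both "a.pdf" and "b.PDF" match.
        if child.is_file() and child.suffix.lower() == ".pdf"
    ]
    # Assumed ordering: case-insensitive by name, with the raw name as tie-breaker.
    return sorted(pdfs, key=lambda path: (path.name.lower(), path.name))
```

Because `iterdir()` never descends into subdirectories, the nested-folder exclusion falls out of the API choice and needs no extra filtering.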
- [x] **Step 4: Run tests to verify GREEN**
Run: `uv run pytest tests/test_ui_runner.py -q`
Expected: all UI runner tests pass.
### Task 2: Tk UI Batch Execution
**Files:**
- Modify: `src/pdf2md_ui/app.py`
- [x] **Step 1: Add folder state and controls**
Add `input_folder_var`, a path row labeled `Input folder`, and a `Convert folder` button beside the existing action buttons.
- [x] **Step 2: Add batch command startup**
Implement `_choose_folder()`, `_run_folder_convert()`, and `_start_command_sequence()`. `_run_folder_convert()` validates the folder and output directory, parses `chunk_pages`, builds commands through the runner helper, and starts the sequence.
- [x] **Step 3: Add sequential worker behavior**
Run each command synchronously on the worker thread. Emit log messages before each file starts. Stop after the first non-zero exit code. If Cancel is requested, terminate the active command and do not start later commands.
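The worker loop can be sketched without any Tk code. The function below is illustrative only: `run_commands_sequentially`, its parameters, and the use of a `threading.Event` for Cancel are assumptions, and the real UI posts log lines through its event queue rather than a callback.

```python
import subprocess
import threading
from typing import Callable, Optional, Sequence


def run_commands_sequentially(
    commands: Sequence[Sequence[str]],
    *,
    cancel: threading.Event,
    log: Callable[[str], None] = print,
    poll_seconds: float = 0.1,
) -> Optional[int]:
    """Return the last exit code, or None when the batch was cancelled."""
    exit_code = 0
    for index, args in enumerate(commands, start=1):
        if cancel.is_set():
            log("Cancelled; remaining files skipped.")
            return None
        log(f"[{index}/{len(commands)}] {' '.join(args)}")
        process = subprocess.Popen(list(args))
        while True:
            try:
                exit_code = process.wait(timeout=poll_seconds)
                break
            except subprocess.TimeoutExpired:
                # Poll the cancel flag while the child is still running.
                if cancel.is_set():
                    process.terminate()
                    process.wait()
                    log("Cancelled; active command terminated.")
                    return None
        if exit_code != 0:
            # Stop on the first failure so later PDFs are not started.
            log(f"Stopped: command exited with code {exit_code}.")
            return exit_code
    return exit_code
```

Polling with a short `wait` timeout keeps Cancel responsive without a second thread per process.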
- [x] **Step 4: Run focused tests**
Run: `uv run pytest tests/test_ui_runner.py -q`
Expected: all UI runner tests pass; UI app imports without syntax errors through test collection.
### Task 3: Build and Handoff
**Files:**
- Modify: `PROGRESS.md`
- Generated ignored output: `dist/pdf2md-ui.exe`
- [x] **Step 1: Rebuild the UI executable**
Run: `uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py`
Expected: exit code 0 and `dist\pdf2md-ui.exe` exists.
- [x] **Step 2: Update progress**
Record the new UI folder batch feature and verification commands in `PROGRESS.md`.
- [x] **Step 3: Check and commit**
Run: `git diff --check`, `git status --short`, then commit only the scoped source, test, and documentation changes.
@@ -0,0 +1,33 @@
# UI Folder Batch Conversion Design
## Goal
Add a minimal UI workflow that lets the user select one folder and convert every PDF directly inside that folder to Markdown.
## Scope
- Include only files with a `.pdf` extension (matched case-insensitively, so `b.PDF` counts) directly under the selected folder.
- Exclude PDFs in nested folders.
- Reuse the existing `pdf2md convert` CLI command for each PDF.
- Keep conversion sequential to avoid GPU and MinerU runtime contention.
- Apply the existing UI conversion options to every PDF in the batch: output directory, overwrite, keep raw, chunk pages, GPU, and MinerU profile.
## Design
The runner layer owns folder discovery and batch command construction. It will expose a small helper that returns direct-child PDF paths in deterministic name order and another helper that builds one fixed-argument `CommandSpec` per PDF by calling the existing `build_convert_command()`.
The Tk UI adds an input-folder row and a folder-convert button. When the user starts folder conversion, the UI validates the selected folder, builds the command list, and runs commands one at a time on the existing worker thread pattern. It logs each PDF before it starts, stops on the first non-zero exit code, and honors Cancel by terminating the currently running process and not starting later PDFs.
## Non-Goals
- No recursive folder conversion.
- No parallel conversion.
- No new CLI command.
- No direct MinerU invocation from the UI.
- No remote/API options or arbitrary shell command execution.
## Verification
- Add focused runner tests for direct-child PDF discovery, nested PDF exclusion, deterministic ordering, and batch command construction.
- Run `uv run pytest tests/test_ui_runner.py`.
- Rebuild the UI executable with PyInstaller and confirm `dist/pdf2md-ui.exe` exists.