# Sprint 17 Contract: Offline Windows Installer
Status: Abandoned
Last updated: 2026-05-13
## Abandonment Note
Sprint 17 was abandoned at the user's request on 2026-05-13 before implementation began. This document remains as a historical planning record only. Do not implement or extend this contract unless the user explicitly reopens offline installer work.
## Objective
Create a large offline Windows installer that can install the existing local `pdf2md` runtime on another Windows PC without internet access.
The installer must install or stage every application-owned file needed once the installer has been downloaded: the minimal UI executable, the project runtime, a target-local Python virtual environment created from bundled wheels, CUDA PyTorch wheels, MinerU 3.1.0 wheels and dependencies, local MinerU model files, optional local Node.js/MathJax assets, Start Menu shortcuts, setup logs, and a post-install `pdf2md doctor` verification path.
This sprint does not change conversion behavior. It packages the already implemented CLI/UI/runtime for offline use.
## Product Decision
The offline package should create the target PC virtual environment during installation instead of copying the current development `.venv`.
Reasoning:
- Python virtual environments and console entry points often contain absolute paths and are not a reliable redistribution unit.
- A target-local `.venv` created from a bundled wheelhouse is more reproducible and easier to repair.
- The installer can keep the wheelhouse for offline repair, uninstall/reinstall, and audit.
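As a rough illustration of this decision, the sketch below builds the two commands a target-side installer might run: one to create a fresh `.venv`, one to populate it from the bundled wheelhouse only. The function name and its parameters are hypothetical and not part of the project; the `uv venv` and `uv pip install --no-index --find-links` invocations are the ones this contract already refers to.

```python
from pathlib import Path


def offline_install_commands(
    uv_exe: Path,
    python_exe: Path,
    runtime_dir: Path,
    wheelhouse: Path,
    requirements: Path,
) -> list[list[str]]:
    """Return the uv commands that build a fresh venv from local wheels only.

    Hypothetical helper: the commands are returned, not executed, so a caller
    can log them before running each one with subprocess.run(..., check=True).
    """
    venv_dir = runtime_dir / ".venv"
    venv_python = venv_dir / "Scripts" / "python.exe"  # Windows venv layout
    return [
        # 1. Create the venv on the target PC instead of copying one.
        [str(uv_exe), "venv", str(venv_dir), "--python", str(python_exe)],
        # 2. Install from the bundled wheelhouse; --no-index forbids any package index.
        [
            str(uv_exe), "pip", "install",
            "--no-index",
            "--find-links", str(wheelhouse),
            "--python", str(venv_python),
            "-r", str(requirements),
        ],
    ]
```

Because the venv is created in place, its absolute paths match the target PC, and rerunning the same two commands against the preserved wheelhouse doubles as a repair path.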
## Installer Shape
Recommended installer technology:
- Inno Setup for the Windows installer shell because it can compile scripts from the command line with `ISCC.exe`, returns deterministic exit codes, and is simple enough for a per-user installer.
- PowerShell scripts for payload build, target runtime install, and target verification.
- PyInstaller remains only the UI executable builder. It must not become the full MinerU/PyTorch/model bundler.
Default install root:
```text
%LOCALAPPDATA%\Programs\ConvertPDFToMD\
```
Installed layout:
```text
ConvertPDFToMD/
  app/
    pdf2md-ui.exe
  runtime/
    pyproject.toml
    uv.lock
    README.md
    src/
    tools/
    package.json
    package-lock.json
    .venv/
  payload/
    python/
    uv/
    wheelhouse/
    requirements-runtime-cu126.txt
    models/
    node/
    node_modules/
    payload-manifest.json
    SHA256SUMS.txt
    THIRD_PARTY_NOTICES.md
  scripts/
    install-runtime.ps1
    repair-runtime.ps1
    run-doctor.ps1
  logs/
```
Generated artifacts that must remain untracked:
```text
dist/offline-installer/
dist/Pdf2MdOfflineSetup-*.exe
```
## Payload Contents
The first offline payload targets Windows x64, Python 3.12, CUDA PyTorch `2.6.0+cu126`, `torchvision 0.21.0+cu126`, and `mineru[core]==3.1.0`.
Required:
- `dist/pdf2md-ui.exe` from the existing PyInstaller build.
- Tracked project runtime files needed to run `uv run pdf2md`.
- A Windows x64 Python 3.12 installer or an equivalent approved Python runtime package.
- A Windows x64 `uv.exe`.
- A wheelhouse containing:
  - the current project wheel,
  - `pypdf`,
  - `torch==2.6.0`,
  - `torchvision==0.21.0`,
  - `mineru[core]==3.1.0`,
  - all transitive Python runtime dependencies.
- Local MinerU model files and the model config template needed for `MINERU_MODEL_SOURCE=local`.
- A manifest listing every payload file, size, SHA-256 hash, source URL or local source, and license family.
Optional but recommended:
- Portable local Node.js runtime.
- `node_modules/` containing the locked MathJax checker dependencies from `package-lock.json`.
Explicitly excluded:
- `samples/`.
- `outputs/`.
- `.git/`.
- The development `.venv/`.
- Local generated PyInstaller `build/` folders and `.spec` files unless the implementation deliberately adds a stable project-owned spec file.
- NVIDIA GPU drivers and CUDA Toolkit installers. The installer may check for a compatible NVIDIA driver through `nvidia-smi`, but it should not redistribute GPU drivers in this sprint.
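The exclusion list can be enforced at copy time. The following is a minimal sketch using the standard library's `shutil.copytree` with `shutil.ignore_patterns`; the function name `stage_runtime` and the exact pattern tuple are assumptions for illustration, not project code.

```python
import shutil
from pathlib import Path

# Names the contract excludes from the runtime copy. "dist" is added here as an
# assumption because staging lives under dist/ and must not be copied into itself.
EXCLUDED = (".git", ".venv", "samples", "outputs", "build", "dist", "*.spec")


def stage_runtime(project_root: Path, staging_runtime: Path) -> None:
    """Copy the project tree into staging, skipping excluded names at any depth."""
    shutil.copytree(
        project_root,
        staging_runtime,
        ignore=shutil.ignore_patterns(*EXCLUDED),
    )
```

`ignore_patterns` matches by name at every directory level, so a nested folder that happens to be called `build` or `samples` would also be skipped. A real builder may prefer an explicit allow-list of tracked files for that reason.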
## Touched Surfaces
Allowed during implementation:
- Create `packaging/offline/build-offline-payload.ps1`.
- Create `packaging/offline/verify-offline-payload.ps1`.
- Create `packaging/offline/install-runtime.ps1`.
- Create `packaging/offline/repair-runtime.ps1`.
- Create `packaging/offline/run-doctor.ps1`.
- Create `packaging/offline/Pdf2MdOffline.iss`.
- Create `packaging/offline/requirements-runtime-cu126.txt`.
- Create `packaging/offline/README.md`.
- Create `packaging/offline/THIRD_PARTY_NOTICES.md`.
- Create `src/pdf2md/packaging_manifest.py` only if a Python helper is simpler than repeating manifest logic in PowerShell.
- Modify `src/pdf2md_ui/runner.py` so the UI can resolve an installed target-local `.venv\Scripts\pdf2md.exe` before falling back to PATH or `uv run pdf2md`.
- Modify `src/pdf2md_ui/app.py` only if the project root default must prefer the installed runtime folder.
- Modify `tests/test_ui_runner.py`.
- Create `tests/test_offline_packaging.py`.
- Modify `README.md`.
- Modify `.gitignore` (required by Task 1).
- Modify `docs/V1RELEASECHECKLIST.md`.
- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.
Not allowed:
- Do not change MinerU 3.1.0 as the fixed conversion engine.
- Do not add a second conversion engine.
- Do not add runtime network calls, `--api-url`, router mode, remote APIs, HTTP client backends, remote OpenAI-compatible backends, or hosted renderers.
- Do not copy the development `.venv` as the installed runtime.
- Do not make default tests depend on real MinerU, GPU, model files, network, Obsidian, MathJax, Inno Setup, or `samples/`.
- Do not commit generated installer payloads, model files, wheelhouse files, Python installers, `dist/`, `outputs/`, or `samples/`.
## Architecture Plan
### WP17.1: Offline Payload Builder
Add a build script that creates a clean staging folder under `dist/offline-installer/` with `app/`, `runtime/`, and `payload/` subfolders that mirror the final install layout.
Responsibilities:
- Rebuild `dist/pdf2md-ui.exe`.
- Build the project wheel into the staging wheelhouse.
- Download or collect Python wheels for the target runtime on a connected build PC.
- Collect the Windows Python runtime package and `uv.exe`.
- Copy project runtime files, excluding `.git`, `.venv`, `outputs/`, `samples/`, and build artifacts.
- Copy local MinerU model files from a configured source path.
- Optionally copy portable Node.js and the locked `node_modules/`.
- Generate `payload-manifest.json` and `SHA256SUMS.txt`.
- Fail if any required file is missing or if any wheel dependency would require internet during installation.
The builder may use `python -m pip download` on the connected build PC. The target installer must use only local files, for example `uv pip install --no-index --find-links`.
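A minimal sketch of the manifest step, assuming it is written as a Python helper rather than in PowerShell. The function names, the JSON field names, and the `sha256sum`-style line format for `SHA256SUMS.txt` are assumptions. The license family field is left out because it needs human input.

```python
import hashlib
import json
from pathlib import Path

# The two generated files are skipped so that reruns stay stable.
GENERATED = {"payload-manifest.json", "SHA256SUMS.txt"}


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large wheels and models fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(payload_root: Path, source_label: str = "local") -> list[dict]:
    """Return one entry per payload file: relative path, size, hash, source."""
    entries = []
    for path in sorted(p for p in payload_root.rglob("*") if p.is_file()):
        if path.name in GENERATED:
            continue
        entries.append({
            "path": path.relative_to(payload_root).as_posix(),
            "size": path.stat().st_size,
            "sha256": sha256_of(path),
            "source": source_label,
        })
    return entries


def write_manifest(payload_root: Path, out_dir: Path) -> None:
    """Write payload-manifest.json and SHA256SUMS.txt into out_dir."""
    entries = build_manifest(payload_root)
    (out_dir / "payload-manifest.json").write_text(
        json.dumps(entries, indent=2), encoding="utf-8"
    )
    lines = [f"{e['sha256']}  {e['path']}" for e in entries]
    (out_dir / "SHA256SUMS.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Sorting the paths and storing them in POSIX form keeps the output byte-identical across rebuilds of the same payload, which makes the manifest easy to diff.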
### WP17.2: Target Runtime Installer
Add a PowerShell install script that runs from the installed payload and creates the real runtime on the target PC.
Responsibilities:
- Verify payload hashes before installing.
- Install or locate Python 3.12 x64.
- Create `runtime\.venv` on the target PC.
- Install packages from `payload\wheelhouse` with network disabled.
- Install the project wheel into the target `.venv`.
- Preserve the bundled wheelhouse for offline repair.
- Configure `MINERU_MODEL_SOURCE=local` for UI/CLI child processes.
- Configure local MinerU model paths without silently overwriting an unrelated user `mineru.json`.
- If `%USERPROFILE%\mineru.json` already exists and points elsewhere, prompt in interactive mode; in silent mode, fail with a clear error and point the user to `repair-runtime.ps1`.
- Run `pdf2md doctor` and write the result to `logs\doctor-after-install.txt`.
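The `mineru.json` rule above is essentially a small decision table. The sketch below expresses it in Python under stated assumptions: the names `ConfigAction` and `decide_config_action` are hypothetical, and "points at our models" is approximated by any string value in the file resolving inside the installed model folder, without assuming a particular MinerU config schema.

```python
import json
import os
from enum import Enum
from pathlib import Path


class ConfigAction(Enum):
    WRITE = "write"    # no config yet: safe to create one
    KEEP = "keep"      # existing config already points into our model folder
    PROMPT = "prompt"  # interactive install: ask before changing anything
    FAIL = "fail"      # silent install: stop and refer to repair-runtime.ps1


def _string_values(node):
    """Yield every string found anywhere in a parsed JSON value."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from _string_values(value)
    elif isinstance(node, list):
        for value in node:
            yield from _string_values(value)


def decide_config_action(config_path: Path, models_dir: Path, silent: bool) -> ConfigAction:
    """Pick an action without ever overwriting an unrelated config."""
    if not config_path.exists():
        return ConfigAction.WRITE
    try:
        parsed = json.loads(config_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        parsed = None  # an unreadable config is treated as unrelated
    wanted = os.path.normcase(os.path.normpath(str(models_dir)))
    for value in _string_values(parsed):
        candidate = os.path.normcase(os.path.normpath(value))
        if candidate == wanted or candidate.startswith(wanted + os.sep):
            return ConfigAction.KEEP
    return ConfigAction.FAIL if silent else ConfigAction.PROMPT
```

Keeping the decision separate from the file write makes the "never silently overwritten" rule testable with fake files, without MinerU installed.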
### WP17.3: UI Runtime Resolution
Adjust the UI runner for an installed offline layout.
Resolution order:
1. Explicit configured `pdf2md` command.
2. Installed runtime `.venv\Scripts\pdf2md.exe` under the selected project root.
3. `pdf2md` on PATH.
4. Bundled `uv.exe` plus `uv run --offline pdf2md` under the selected project root.
5. Existing system `uv run pdf2md` fallback.
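The resolution order above could be sketched as a single function. This is an illustration only: `resolve_pdf2md_command` and its parameters are hypothetical names, not the existing `runner.py` API, and the `which` parameter exists solely so tests can replace the PATH lookup.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional, Sequence


def resolve_pdf2md_command(
    project_root: Path,
    configured: Optional[Sequence[str]] = None,
    bundled_uv: Optional[Path] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the argv prefix used to launch pdf2md, following the five steps."""
    # 1. An explicitly configured command always wins.
    if configured:
        return list(configured)
    # 2. The installed target-local venv under the selected project root.
    venv_exe = project_root / ".venv" / "Scripts" / "pdf2md.exe"
    if venv_exe.is_file():
        return [str(venv_exe)]
    # 3. A pdf2md executable found on PATH.
    on_path = which("pdf2md")
    if on_path:
        return [on_path]
    # 4. Bundled uv in offline mode; the caller runs it with cwd=project_root.
    if bundled_uv is not None and bundled_uv.is_file():
        return [str(bundled_uv), "run", "--offline", "pdf2md"]
    # 5. Existing fallback: system uv.
    return ["uv", "run", "pdf2md"]
```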
Child environment rules:
- Set `MINERU_MODEL_SOURCE=local` unless the variable is already set explicitly.
- Add installed `.venv\Scripts` to PATH for runtime console scripts.
- Add installed portable Node.js path to PATH when bundled.
- Set `UV_OFFLINE=1` when using the installed offline runtime.
- Do not add remote endpoints or backend flags.
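A minimal sketch of the child environment rules, assuming the runner builds a fresh environment mapping for each child process. `build_child_env` and its parameters are hypothetical names used for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def build_child_env(
    base_env: Mapping[str, str],
    runtime_dir: Path,
    node_dir: Optional[Path] = None,
) -> dict[str, str]:
    """Return a copy of base_env adjusted for the installed offline runtime."""
    env = dict(base_env)
    # Respect an explicit user choice; default to local models otherwise.
    env.setdefault("MINERU_MODEL_SOURCE", "local")
    # Keep uv from touching the network when the installed runtime is used.
    env["UV_OFFLINE"] = "1"
    # Prepend runtime console scripts, then bundled Node.js when it exists.
    extra = [str(runtime_dir / ".venv" / "Scripts")]
    if node_dir is not None and node_dir.is_dir():
        extra.append(str(node_dir))
    existing = env.get("PATH", "")
    env["PATH"] = os.pathsep.join(extra + ([existing] if existing else []))
    return env
```

The function only adds local paths and two variables. It never introduces endpoint or backend settings, in line with the last rule.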
### WP17.4: Inno Setup Installer
Add an Inno Setup script that installs the payload and invokes the target runtime installer.
Installer behavior:
- Default to per-user install under `%LOCALAPPDATA%\Programs\ConvertPDFToMD`.
- Create Start Menu shortcuts for:
  - `ConvertPDFToMD` UI,
  - `PDF2MD Doctor`,
  - `Repair PDF2MD Runtime`.
- Run `install-runtime.ps1` after files are copied.
- Show the doctor log path if setup finishes with WARN.
- Fail the install on target runtime setup failure unless the user explicitly chooses to keep files for manual repair.
### WP17.5: License, Manifest, And Offline Verification
Add docs and checks for redistribution risk.
Required records:
- Python, uv, PyInstaller, PyTorch, MinerU, model files, Node.js, MathJax, and transitive Python/npm dependency notices.
- A manifest with file hashes and source URLs.
- A clear statement that runtime conversion remains local-only and that setup payload creation can use internet only on the build PC.
Verification tiers:
- Fast tests use fake staging folders and fake wheel/model files.
- Build-PC packaging smoke can create the staging folder without committing payload.
- Offline target smoke uses a clean Windows VM with networking disabled.
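The fast tier can exercise hash verification against fake files. The sketch below checks a payload folder against `SHA256SUMS.txt`, assuming the common `<hex digest>  <relative path>` line format. That format and the function name are assumptions, since the contract does not fix them.

```python
import hashlib
from pathlib import Path


def verify_sha256sums(payload_root: Path, sums_file: Path) -> list[str]:
    """Return a list of problems; an empty list means every listed file matched."""
    problems = []
    for line in sums_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, rel_path = line.partition("  ")
        target = payload_root / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
            continue
        digest = hashlib.sha256()
        with target.open("rb") as fh:
            # Chunked reads keep memory flat for multi-gigabyte wheels and models.
            for chunk in iter(lambda: fh.read(1024 * 1024), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.strip().lower():
            problems.append(f"hash mismatch: {rel_path}")
    return problems
```

Returning every problem rather than stopping at the first one gives the setup log a complete picture when a payload is damaged in transit.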
## Implementation Task Plan
### Task 1: Packaging Manifest And Ignore Policy
Files:
- Create `tests/test_offline_packaging.py`.
- Create `src/pdf2md/packaging_manifest.py` if needed.
- Modify `.gitignore`.
Steps:
- Add failing tests for manifest generation with SHA-256, file size, relative path, and source label.
- Add failing tests that payload paths under `dist/offline-installer/`, wheelhouse files, model files, and generated installer executables stay ignored.
- Implement the smallest manifest helper or PowerShell-compatible JSON format.
- Run `uv run pytest tests/test_offline_packaging.py`.
- Commit manifest and ignore-policy changes.
### Task 2: Offline Payload Builder
Files:
- Create `packaging/offline/build-offline-payload.ps1`.
- Create `packaging/offline/requirements-runtime-cu126.txt`.
- Create `packaging/offline/README.md`.
- Create `packaging/offline/verify-offline-payload.ps1`.
- Modify `tests/test_offline_packaging.py`.
Steps:
- Add tests that the builder rejects missing UI exe, missing model source, missing Python runtime package, missing `uv.exe`, and empty wheelhouse.
- Add tests that the builder excludes `.venv`, `.git`, `samples`, and `outputs`, and excludes `node_modules` unless it is explicitly copied as the optional locked MathJax payload.
- Implement payload staging, manifest generation, and payload verification.
- Run `uv run pytest tests/test_offline_packaging.py`.
- Run a dry build command that uses fake payload inputs.
- Commit builder changes.
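The rejection rules in the first step could be sketched as a pre-flight check that runs before any staging happens. The function name and parameter names below are hypothetical; the checked inputs are the ones listed in the step.

```python
from pathlib import Path


def missing_payload_inputs(
    ui_exe: Path,
    python_package: Path,
    uv_exe: Path,
    model_source: Path,
    wheelhouse: Path,
) -> list[str]:
    """Return one message per missing required input; empty means ready to stage."""
    problems = []
    required_files = (
        ("UI exe", ui_exe),
        ("Python runtime package", python_package),
        ("uv.exe", uv_exe),
    )
    for label, path in required_files:
        if not path.is_file():
            problems.append(f"{label} not found: {path}")
    # The model source must exist and contain at least one entry.
    if not model_source.is_dir() or not any(model_source.iterdir()):
        problems.append(f"model source missing or empty: {model_source}")
    # An empty wheelhouse would force the target install to reach for an index.
    if not wheelhouse.is_dir() or not any(wheelhouse.glob("*.whl")):
        problems.append(f"wheelhouse has no wheels: {wheelhouse}")
    return problems
```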
### Task 3: Target Runtime Install And Repair Scripts
Files:
- Create `packaging/offline/install-runtime.ps1`.
- Create `packaging/offline/repair-runtime.ps1`.
- Create `packaging/offline/run-doctor.ps1`.
- Modify `tests/test_offline_packaging.py`.
Steps:
- Add tests that scripts contain `--no-index`, `--find-links`, and `UV_OFFLINE=1`, and that target-install commands contain no `http://` or `https://` URLs.
- Add tests that existing `mineru.json` handling is explicit and never silently overwritten.
- Implement target-local `.venv` creation, offline package install, model config handling, doctor logging, and repair flow.
- Run `uv run pytest tests/test_offline_packaging.py`.
- Commit install-script changes.
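The static script checks in the first step could look roughly like the sketch below, which scans script text instead of executing it. The helper name is hypothetical. The token list checks for `UV_OFFLINE` rather than the literal `UV_OFFLINE=1`, on the assumption that a PowerShell script would spell the assignment as `$env:UV_OFFLINE = "1"`. Comment stripping is deliberately naive.

```python
import re

REQUIRED_TOKENS = ("--no-index", "--find-links", "UV_OFFLINE")
URL_PATTERN = re.compile(r"https?://", re.IGNORECASE)


def offline_script_problems(script_text: str) -> list[str]:
    """Return problems found in an install script; empty means it looks offline-only."""
    problems = [f"missing {token}" for token in REQUIRED_TOKENS if token not in script_text]
    for number, line in enumerate(script_text.splitlines(), start=1):
        # Naive: treat everything after the first '#' as a PowerShell comment.
        code = line.split("#", 1)[0]
        if URL_PATTERN.search(code):
            problems.append(f"line {number}: remote URL in command")
    return problems
```

A test can read each `packaging/offline/*.ps1` file and assert that the returned list is empty, which keeps the check independent of PowerShell, uv, and the network.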
### Task 4: UI Installed Runtime Resolution
Files:
- Modify `src/pdf2md_ui/runner.py`.
- Modify `src/pdf2md_ui/app.py` only if needed.
- Modify `tests/test_ui_runner.py`.
Steps:
- Add failing tests for project-root `.venv\Scripts\pdf2md.exe` resolution before PATH.
- Add failing tests for bundled `uv.exe` plus `uv run --offline pdf2md` fallback.
- Add failing tests that the child environment prepends `.venv\Scripts` and bundled Node.js when present.
- Implement the minimal runner changes.
- Run `uv run pytest tests/test_ui_runner.py`.
- Commit UI resolution changes.
### Task 5: Inno Setup Script
Files:
- Create `packaging/offline/Pdf2MdOffline.iss`.
- Modify `tests/test_offline_packaging.py`.
Steps:
- Add tests that the Inno script references the expected payload directories, Start Menu shortcuts, and runtime install script.
- Add tests that the script does not reference `samples`, `outputs`, `.venv`, or remote URLs.
- Implement the Inno script.
- On a build PC with Inno Setup installed, run `ISCC.exe packaging\offline\Pdf2MdOffline.iss`.
- Commit installer-script changes without committing the generated installer.
### Task 6: Documentation And Release Gate
Files:
- Modify `README.md`.
- Modify `docs/V1RELEASECHECKLIST.md`.
- Modify `docs/Sprints/SPRINT17CONTRACT.md`.
- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.
Steps:
- Document build-PC prerequisites and target-PC prerequisites.
- Document the offline artifact layout, expected size risk, and repair flow.
- Document the clean offline VM smoke test.
- Record final verification outcomes and residual risks.
- Commit documentation and handoff updates.
## Verification Commands
Default fast checks:
```powershell
uv run pytest tests/test_offline_packaging.py tests/test_ui_runner.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```
Build-PC packaging checks:
```powershell
uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py
$pythonInstaller = "C:\BuildCache\python-3.12-amd64.exe"
$uvExe = "C:\BuildCache\uv.exe"
$mineruModels = "C:\BuildCache\mineru-models"
powershell -ExecutionPolicy Bypass -File packaging\offline\build-offline-payload.ps1 -Configuration Release -PythonInstaller $pythonInstaller -UvExe $uvExe -MinerUModelSource $mineruModels
powershell -ExecutionPolicy Bypass -File packaging\offline\verify-offline-payload.ps1 -PayloadRoot dist\offline-installer\payload
ISCC.exe packaging\offline\Pdf2MdOffline.iss
```
Offline target smoke:
```powershell
# Run on a clean Windows x64 VM with networking disabled after copying only the installer.
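# Wait for the GUI installer to exit before running the checks below.
$setup = Get-Item .\Pdf2MdOfflineSetup-*.exe
Start-Process -FilePath $setup.FullName -Wait
# A clean VM may block unsigned scripts, so bypass the execution policy as in the build commands.
powershell -ExecutionPolicy Bypass -File "$env:LOCALAPPDATA\Programs\ConvertPDFToMD\scripts\run-doctor.ps1"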
& "$env:LOCALAPPDATA\Programs\ConvertPDFToMD\runtime\.venv\Scripts\pdf2md.exe" --version
& "$env:LOCALAPPDATA\Programs\ConvertPDFToMD\runtime\.venv\Scripts\pdf2md.exe" doctor
```
Optional conversion smoke on the offline target:
```powershell
& "$env:LOCALAPPDATA\Programs\ConvertPDFToMD\runtime\.venv\Scripts\pdf2md.exe" convert C:\LocalTest\SolidElement.pdf --out C:\LocalTest\outputs --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local
```
Expected optional output:
```text
C:\LocalTest\outputs\SolidElement\SolidElement_001.md
C:\LocalTest\outputs\SolidElement\SolidElement_report.md
C:\LocalTest\outputs\SolidElement\images\
```
## Acceptance Criteria
- The generated installer can install the runtime on a clean Windows x64 target without internet access.
- The target runtime has a newly created local `.venv`; it is not a copied development `.venv`.
- `pdf2md --version` runs from the installed `.venv`.
- `pdf2md doctor` runs without network access and reports all install-relevant failures or warnings clearly.
- The UI launches from the Start Menu and resolves the installed runtime without manual project-root configuration.
- MinerU uses local models through `MINERU_MODEL_SOURCE=local` and local model config.
- Python package installation uses only bundled local wheels.
- The wheelhouse and model payload are hash-verified before install.
- No generated payload, model file, wheel, installer exe, sample PDF, or conversion output is committed.
- Default tests remain fast and independent of real MinerU, GPU, model files, network, Inno Setup, MathJax, or `samples/`.
## Hard Failure Criteria
- The target installer downloads anything from the internet.
- The UI or CLI introduces a runtime document upload path.
- The installer silently overwrites an unrelated existing `mineru.json`.
- The installer copies the development `.venv` as the installed runtime.
- The installed UI cannot find `pdf2md` without manually editing settings on a clean install.
- `pdf2md doctor` is skipped or its failure is hidden.
- Payload hash verification is missing.
- License/model redistribution review is skipped before sharing the installer outside the current personal environment.
- NVIDIA drivers or CUDA Toolkit installers are redistributed in this sprint.
## Open Risks
- The final installer may be very large: CUDA PyTorch wheels, MinerU dependencies, model weights, and optional Node/MathJax assets all add substantial size.
- MinerU model redistribution terms and transitive package/model licenses must be reviewed before broader sharing.
- Target PCs still need compatible NVIDIA hardware and drivers. The installer can verify and report this, but it cannot guarantee GPU compatibility.
- Some conversions can still stall or run slowly on GTX 1070 Ti 8GB; packaging does not solve runtime performance.
- The Inno Setup installer may need validation of practical size limits and antivirus/SmartScreen behavior once real model payloads are included.
## Sources
- PyInstaller usage: https://pyinstaller.org/en/stable/usage.html
- Inno Setup command-line compiler: https://documentation.help/Inno-Setup/topic_compilercmdline.htm
- uv CLI `--offline` behavior: https://docs.astral.sh/uv/reference/cli/
- uv cache behavior: https://docs.astral.sh/uv/concepts/cache/
- pip offline install/download behavior: https://pip.pypa.io/en/stable/cli/pip_install.html and https://pip.pypa.io/en/stable/cli/pip_download/
- PyTorch previous version wheel command for CUDA 12.6: https://pytorch.org/get-started/previous-versions/
- MinerU local model source behavior: https://opendatalab.github.io/MinerU/usage/model_source/
## Handoff Requirements
After implementation:
- Update this contract status to `Implemented` or record the failed gate.
- Record payload size and generated installer path in `PROGRESS.md`.
- Record verification commands and outcomes in `PROGRESS.md`.
- Archive implementation evidence and offline VM smoke results in `docs/WORKARCHIVE.md`.
- Keep generated offline payloads, wheels, model files, installer exe, `dist/`, `outputs/`, and `samples/` uncommitted.