modify pdftomd
+53
-2
@@ -1,6 +1,6 @@
# Work Archive
Last updated: 2026-05-08
Last updated: 2026-05-13
This document stores completed project work, historical sprint outcomes, environment setup results, and sample conversion evidence. `PROGRESS.md` should stay focused on current status, blockers, and next actions. Read this archive when a task needs past implementation context, previous verification commands, or historical handoff details.
@@ -34,6 +34,16 @@ This document stores completed project work, historical sprint outcomes, environ
| GPU default/runtime setup | Made conversion default to `cuda:0`, mapped CUDA requests to MinerU subprocess environment variables, rebuilt `.venv`, installed CUDA-enabled PyTorch and MinerU 3.1.0, downloaded MinerU models, and set `MINERU_MODEL_SOURCE=local`. | `README.md`, `src/pdf2md/mineru_adapter.py`, `src/pdf2md/conversion.py` |
| MathJax checker | Planned and implemented local MathJax render checker with Node.js helper, Python wrapper, conversion integration, and doctor diagnostics. | `docs/MATHJAXCHECKERPLAN.md`, `tools/mathjax-checker/check.mjs`, `src/pdf2md/math_render.py` |
| Sprint 10 | Implemented opt-in pre-conversion PDF chunking with `pypdf`, temporary chunk PDF cleanup, `--chunk-pages [PAGES]`, chunk metadata/report context, and mocked tests. | `docs/Sprints/SPRINT10CONTRACT.md`, `src/pdf2md/pdf_splitter.py` |
| Sprint 11 | Implemented conservative MathJax warning mitigation with failed-expression details, `src/pdf2md/math_repair.py`, shared `convert`/`recheck` repair integration, and `MATH_RENDER_REPAIRED` info warnings. | `docs/Sprints/SPRINT11CONTRACT.md`, `src/pdf2md/math_repair.py`, `src/pdf2md/quality.py`, `src/pdf2md/conversion.py` |
| UI research and Sprint 12 planning | Researched minimal Windows UI launcher options and planned a thin `tkinter`/`ttk` launcher over the existing CLI with PyInstaller build output at `dist/pdf2md-ui.exe`. | `docs/UI_RESEARCH.md`, `docs/Sprints/SPRINT12CONTRACT.md`, `PLAN.md` |
| Sprint 12 | Implemented a minimal `tkinter`/`ttk` Windows UI launcher over `pdf2md` or `uv run pdf2md`, with fixed argument-list subprocess calls, worker-thread logging, cancellation, Recheck support, and PyInstaller build output at `dist/pdf2md-ui.exe`. | `docs/Sprints/SPRINT12CONTRACT.md`, `src/pdf2md_ui/`, `tests/test_ui_runner.py` |
| Sprint 13 | Implemented local pypdf text layer fidelity diagnostics, including Hangul count deltas, unexpected CJK counts, text similarity, Hangul spacing anomaly ratios, replacement-candidate markers, metadata/report integration, and `recheck` support without automatic body-text replacement. | `docs/Sprints/SPRINT13CONTRACT.md`, `src/pdf2md/text_fidelity.py`, `src/pdf2md/conversion.py`, `src/pdf2md/metadata.py`, `src/pdf2md/report.py` |
| Sprint 14 | Changed chunk mode so MinerU receives one source page per run while final Markdown, metadata, report, and assets are grouped by `chunk_pages`. Failed page conversions are nonfatal within partially successful groups and are recorded in metadata/report output. | `docs/Sprints/SPRINT14CONTRACT.md`, `src/pdf2md/conversion.py`, `src/pdf2md/report.py`, `tests/test_conversion.py` |
| Sprint 15 | Implemented NVIDIA GPU inventory parsing, optional `--gpu auto`, default `--mineru-profile auto`, conservative MinerU environment tuning, profile provenance in metadata/report output, and doctor GPU/profile recommendations. | `docs/Sprints/SPRINT15CONTRACT.md`, `src/pdf2md/gpu.py`, `src/pdf2md/mineru_profile.py`, `src/pdf2md/conversion.py`, `src/pdf2md/doctor.py` |
| Sprint 16 | Simplified public conversion outputs to one PDF-stem folder, numbered Markdown parts, shared `images/`, one `_report.md`, no persisted metadata JSON, compatibility-no-op `--metadata`, and legacy-only `recheck`. | `docs/Sprints/SPRINT16CONTRACT.md`, `src/pdf2md/paths.py`, `src/pdf2md/conversion.py`, `src/pdf2md/report.py`, `src/pdf2md/cli.py` |
| UI direct-folder batch conversion | Added a minimal UI workflow that selects one folder, discovers direct-child PDFs only, and sequentially runs the existing `pdf2md convert` command for each file with the selected options. | `docs/superpowers/specs/2026-05-13-ui-folder-batch-conversion-design.md`, `docs/superpowers/plans/2026-05-13-ui-folder-batch-conversion.md`, `src/pdf2md_ui/runner.py`, `src/pdf2md_ui/app.py` |
| Sprint 17 planning | Planned a large offline Windows installer, then abandoned the sprint at the user's request before implementation began. | `docs/Sprints/SPRINT17CONTRACT.md`, `docs/superpowers/plans/2026-05-12-offline-installer.md` |
| Documentation archive cleanup | Moved completed implementation details out of `PLAN.md`, `PROGRESS.md`, and `docs/V1IMPLEMENTATIONPLAN.md`, then removed Sprint 17 from active planned work after it was abandoned. | `PLAN.md`, `PROGRESS.md`, `docs/V1IMPLEMENTATIONPLAN.md`, `docs/WORKARCHIVE.md` |
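
The Sprint 14 row describes MinerU receiving one source page per run while the final outputs are grouped by `chunk_pages`. A minimal sketch of that grouping, assuming 1-based page numbers; `group_pages` is a hypothetical helper name and is not taken from `src/pdf2md/conversion.py`:

```python
def group_pages(page_count: int, chunk_pages: int) -> list[list[int]]:
    """Split 1-based page numbers into consecutive groups of at most chunk_pages."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if chunk_pages < 1:
        raise ValueError("chunk_pages must be at least 1")
    pages = list(range(1, page_count + 1))
    return [pages[i:i + chunk_pages] for i in range(0, page_count, chunk_pages)]


# Each inner list is one output group; every page in it would still be
# converted in its own run before the results are merged.
for index, group in enumerate(group_pages(13, 5), start=1):
    print(f"group {index:03d}: pages {group[0]}-{group[-1]}")
```

Under this sketch, a failed page would be recorded against its group instead of aborting the other pages in the same group.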
## Runtime Setup Archive
@@ -43,12 +53,13 @@ This document stores completed project work, historical sprint outcomes, environ
- `uv` installed per-user at `C:\Users\user\.local\bin`.
- GPU target: NVIDIA GTX 1070 Ti 8GB.
- Local GPU observed: NVIDIA GeForce GTX 1070 Ti, driver 577.00, 8192 MiB VRAM, WDDM.
- Default conversion device/profile: `--gpu cuda:0` and `--mineru-profile auto`.
- MinerU execution mode: direct local `mineru` CLI only.
- MinerU 3.1.0 CLI-internal temporary local `mineru-api` is allowed when the CLI runs without `--api-url`.
- GTX 1070 Ti runtime setup used `torch==2.6.0+cu126`, `torchvision==0.21.0+cu126`, and `mineru[core]==3.1.0`.
- MinerU models were downloaded with `uv run mineru-models-download -s huggingface -m all`.
- Runtime model loading uses `MINERU_MODEL_SOURCE=local`.
- Current doctor status after setup is WARN because GTX 1070 Ti is Pascal/pre-Turing; MinerU, CUDA PyTorch, local model config, MathJax checker, and strict-local checks pass. Sprint 15 doctor output selects `cuda:0` for `--gpu auto` on this machine and recommends MinerU profile `safe`.
## Sample Conversion Archive
@@ -58,6 +69,11 @@ Generated outputs are ignored under `outputs/` and are not committed.
| --- | --- | --- | --- |
| `samples/MITC공부.pdf` | Completed after CUDA-enabled runtime setup. | `outputs/MITC공부/` | 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 1 info warning at the time of that run because the local MathJax checker was unavailable. |
| `samples/FourNodeQuadrilateralShellElementMITC4.pdf` | Completed with default GPU request and `MINERU_MODEL_SOURCE=local`. | `outputs/FourNodeQuadrilateralShellElementMITC4/` | Report status `success`: 7 pages, 22 assets, 38 inline formulas, 16 display formulas, 0 math render errors, 0 warnings. |
| `samples/FourNodeQuadrilateralShellElementMITC4.pdf` | Sprint 14 sample smoke stalled and was terminated. | No final output directory. | On 2026-05-12, `--chunk-pages` entered the one-page conversion path and used `cuda:0` with GPU utilization near 100%. Source page 1 completed, but source page 2 stayed active for more than 15 minutes total runtime with no final grouped output, so the process tree was terminated and the temporary `pdf2md.pages.*` directory was removed. |
| `samples/MITC공부.pdf` | Reconverted after Sprint 11 mitigation. | `outputs/MITC공부/` and `outputs/sprint11-MITC공부/` | Report status `partial` from 2 `MATH_RENDER_REPAIRED` info warnings: 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 0 MathJax render errors, and 0 missing or invalid asset links. |
| `samples/2007쉘구조물의유한요소해석에대하여.pdf` | Completed after Sprint 13 validation with 1-page chunking. | `outputs/2007쉘구조물의유한요소해석에대하여_pages1/` | A fresh `--chunk-pages 5` attempt stayed on part 001 for over 40 minutes with GPU near full utilization and no output, so it was terminated. The clean `--chunk-pages 1` run completed 13/13 chunks with 0 failures, 44 warnings, 0 MathJax render errors, 13 low text-fidelity pages, 15 unexpected CJK characters, 13 diagnostic replacement-candidate pages, and 0 uncertain page mappings. |
| `samples/SolidElement.pdf` | Completed after Sprint 15 GPU/profile implementation with `--gpu auto --mineru-profile auto --chunk-pages`. | `outputs/SolidElement_sprint15_auto_20260512/` | Completed in about 11 minutes 51 seconds on GTX 1070 Ti. Report status `partial`: 6 pages, 0 failed pages, safe profile applied, 71 assets, 3 inline formulas, 55 display formulas, 0 MathJax render errors, 0 missing/invalid asset links, 11 warnings, and 5 low text-fidelity pages. |
| `samples/SolidElement.pdf` | Completed after Sprint 16 simplified output layout with `--gpu auto --mineru-profile auto --chunk-pages`. | `outputs/SolidElement/` | Completed in about 17 minutes 51 seconds on GTX 1070 Ti. Produced `SolidElement_001.md`, `SolidElement_report.md`, shared `images/` with 71 assets, and no persisted metadata JSON. Report status `partial`: 6 pages, 0 failed pages, safe profile applied, 3 inline formulas, 55 display formulas, 0 MathJax render errors, 0 missing/invalid asset links, 11 warnings, and 5 low text-fidelity pages. |
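
Several rows above report text-fidelity counts such as low text-fidelity pages and unexpected CJK characters. A rough standard-library sketch of how comparable numbers could be computed, assuming "Hangul" means the precomposed syllable block U+AC00 to U+D7A3 and "unexpected CJK" means CJK Unified Ideographs (U+4E00 to U+9FFF) that do not occur in the PDF text layer. The function names are hypothetical and this is not the logic in `src/pdf2md/text_fidelity.py`:

```python
from difflib import SequenceMatcher


def count_hangul(text: str) -> int:
    """Count precomposed Hangul syllables."""
    return sum(1 for ch in text if "\uac00" <= ch <= "\ud7a3")


def unexpected_cjk(markdown: str, pdf_text: str) -> int:
    """Count CJK ideographs in the Markdown that never appear in the PDF text layer."""
    allowed = {ch for ch in pdf_text if "\u4e00" <= ch <= "\u9fff"}
    return sum(1 for ch in markdown if "\u4e00" <= ch <= "\u9fff" and ch not in allowed)


def fidelity(markdown: str, pdf_text: str) -> dict[str, float]:
    """Compare converted Markdown text against the PDF text layer for one page."""
    return {
        "hangul_delta": count_hangul(markdown) - count_hangul(pdf_text),
        "unexpected_cjk": unexpected_cjk(markdown, pdf_text),
        "similarity": SequenceMatcher(None, markdown, pdf_text).ratio(),
    }


# One Hangul syllable was misread as an ideograph in the converted text.
print(fidelity("쉘 구조物의 해석", "쉘 구조물의 해석"))
```

These values are diagnostic only in this sketch, matching the archive's note that body text is not replaced automatically.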
## Historical Verification Highlights
@@ -73,6 +89,41 @@ Generated outputs are ignored under `outputs/` and are not committed.
- CUDA runtime rebuild: verified CUDA with an actual tensor operation on `NVIDIA GeForce GTX 1070 Ti`, compute capability 6.1; `mineru --version` reported 3.1.0.
- MathJax checker: `npm run mathjax-checker:health` returned `{"ok":true}` after local `npm install`; full suite passed 150 tests with 1 optional skip after integration.
- Sprint 10 chunking: targeted chunking tests passed 42 tests; full default suite passed 163 tests with 1 optional skip; `git diff --check` passed with line-ending warnings only.
- Sprint 11 MathJax warning mitigation: targeted tests passed 56 tests; full default suite passed 172 tests with 1 optional skip; requested `samples/MITC공부.pdf` validation produced 0 MathJax render errors and 2 traceable repair info warnings.
- UI research and Sprint 12 planning: `docs/UI_RESEARCH.md` and `docs/Sprints/SPRINT12CONTRACT.md` were added; no implementation tests were required because this was documentation and planning only.
- Sprint 12 UI implementation: `uv run pytest tests\test_ui_runner.py` passed 16 tests; `uv run pytest` passed 188 tests with 1 optional skip; `uv run --group ui-build pyinstaller --clean --onefile --windowed --name pdf2md-ui src\pdf2md_ui\app.py` produced `dist\pdf2md-ui.exe`; `uv run pdf2md doctor` returned WARN only for the documented GTX 1070 Ti/Pascal compatibility risk; launch smoke confirmed the executable process starts.
- Sprint 12 residual smoke risk: a direct CLI conversion smoke using `samples\FourNodeQuadrilateralShellElementMITC4.pdf` and the same command shape used by the UI exceeded the 15-minute timeout on 2026-05-11. The spawned process tree was terminated with `taskkill`.
- Sprint 13 text fidelity diagnostics: `uv run pytest tests/test_text_fidelity.py tests/test_metadata.py tests/test_report.py tests/test_conversion.py` passed 49 tests; `uv run pytest` passed 198 tests with 1 optional skip.
- Sprint 13 sample validation on 2026-05-11: `samples/2007쉘구조물의유한요소해석에대하여.pdf` completed with `--chunk-pages 1` under `outputs/2007쉘구조물의유한요소해석에대하여_pages1/`; generated 13 Markdown files, 13 metadata JSON files, and 13 report files.
- Sprint 14 grouped page conversion: targeted red tests first failed against the Sprint 10 chunking behavior, then passed after implementation. `uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/test_pdf_splitter.py tests/test_paths.py tests/test_metadata.py tests/test_ui_runner.py` passed 101 tests; full `uv run pytest` passed 202 tests with 1 optional skip.
- Sprint 14 sample smoke on 2026-05-12: `uv run pdf2md convert samples\FourNodeQuadrilateralShellElementMITC4.pdf --out outputs\FourNodeQuadrilateralShellElementMITC4_sprint14_20260512_112342 --chunk-pages --strict-local` used `cuda:0` with GPU utilization near 100%, reached source page 2, then exceeded 15 minutes total runtime without producing a final output directory. The process tree was terminated and the leftover temporary directory was removed.
- Sprint 15 NVIDIA GPU detection/profile tuning: targeted tests `uv run pytest tests/test_gpu.py tests/test_mineru_profile.py tests/test_mineru_adapter.py tests/test_conversion.py tests/test_cli.py tests/test_doctor.py` passed 101 tests. Full `uv run pytest` passed 225 tests with 1 optional skip. `uv run pdf2md doctor` returned WARN on the local GTX 1070 Ti, reported GPU 0 with 8192 MiB VRAM, selected `cuda:0` for `--gpu auto`, and recommended profile `safe`. Optional stronger-PC real MinerU conversion validation was not run in this workspace.
- SolidElement sample validation on 2026-05-12: `uv run pdf2md convert samples\SolidElement.pdf --out outputs\SolidElement_sprint15_auto_20260512 --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local` completed successfully with one grouped output and no failed source pages.
- Sprint 16 simplified output layout: focused verification `uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/test_ui_runner.py tests/integration/test_v1_fast_release_gate.py -q` passed 91 tests; full `uv run pytest` passed 227 tests with 1 optional skip; `git diff --check` passed with line-ending warnings only. New conversions write `<out>/<stem>/<stem>_001.md`, shared `<out>/<stem>/images/`, and `<out>/<stem>/<stem>_report.md`; no public `.metadata.json` is written.
- Sprint 16 SolidElement sample validation on 2026-05-12: `uv run pdf2md convert samples\SolidElement.pdf --out outputs --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local` completed successfully with one simplified Markdown part, one report, shared images, no public metadata JSON, and no failed source pages.
- UI direct-folder batch conversion on 2026-05-13: `uv run pytest tests/test_ui_runner.py -q` passed 19 tests; `uv run python -m py_compile src\pdf2md_ui\app.py src\pdf2md_ui\runner.py` passed; `uv run pytest -q` passed 230 tests with 1 skipped; PyInstaller rebuilt `dist\pdf2md-ui.exe`; a short process-start smoke confirmed the executable starts.
- Sprint 17 planning on 2026-05-12: `docs/Sprints/SPRINT17CONTRACT.md` and `docs/superpowers/plans/2026-05-12-offline-installer.md` were added. No implementation tests were required because this was planning only.
- Sprint 17 abandonment on 2026-05-13: offline installer planning was abandoned at the user's request before implementation began. The contract and plan remain historical records only.
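
The Sprint 16 item above gives the simplified layout as `<out>/<stem>/<stem>_001.md`, `<out>/<stem>/images/`, and `<out>/<stem>/<stem>_report.md`. A minimal `pathlib` sketch of that layout; `output_paths` is a hypothetical helper, and the three-digit zero-padded part number is inferred from the `_001` example, not confirmed against `src/pdf2md/paths.py`:

```python
from pathlib import Path


def output_paths(pdf: Path, out: Path, parts: int = 1) -> dict[str, object]:
    """Derive the per-PDF output folder, Markdown parts, image folder, and report path."""
    stem = pdf.stem
    folder = out / stem
    return {
        "folder": folder,
        "parts": [folder / f"{stem}_{n:03d}.md" for n in range(1, parts + 1)],
        "images": folder / "images",
        "report": folder / f"{stem}_report.md",
    }


layout = output_paths(Path("samples/SolidElement.pdf"), Path("outputs"))
print(layout["parts"][0].as_posix())  # outputs/SolidElement/SolidElement_001.md
print(layout["report"].as_posix())    # outputs/SolidElement/SolidElement_report.md
```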
## Archived V1 Implementation Plan
`docs/V1IMPLEMENTATIONPLAN.md` now tracks current state and planned next work only. Completed Sprint 0 through Sprint 16 details are archived here and in their respective `docs/Sprints/SPRINT*CONTRACT.md` files.
Current completed v1 capability summary:
- Python 3.12 package and `pdf2md` CLI.
- Direct local MinerU 3.1.0 CLI adapter with strict-local enforcement.
- Obsidian Markdown normalization, local quality checks, internal provenance, and one human-readable report.
- `pdf2md doctor`, local MathJax checking, conservative MathJax warning mitigation, and pypdf text fidelity diagnostics.
- Opt-in grouped page conversion where MinerU receives one source page per run.
- NVIDIA GPU detection, `--gpu auto`, and `--mineru-profile auto|safe|performance`.
- Simplified public output layout with no public metadata JSON for new conversions.
- Minimal Windows UI launcher with direct-folder batch conversion through sequential existing CLI calls.
Current planned next work:
- No active implementation sprint. Future substantial work should start from a new user-approved requirement and sprint contract.
## Historical Blockers And Resolutions