modify pdftomd

김경종
2026-05-14 10:16:59 +09:00
parent 2232b51fc9
commit dc11880140
69 changed files with 7784 additions and 1150 deletions
+25 -38
@@ -6,64 +6,51 @@ This file records current progress for agents. Read it before starting work, the
- Project direction is documented in `PRD.md`, `ARCHITECTURE.md`, `AGENTS.md`, and `docs/KNOWLEDGEBASE.md`.
- MinerU 3.1.0 is fixed as the only conversion engine.
- The converter currently includes path planning, project-owned records, metadata, direct local MinerU adapter boundary, Obsidian Markdown normalization, local quality checks, report rendering, conversion orchestration, `pdf2md convert`, `pdf2md recheck`, `pdf2md doctor`, local MathJax render checking, conservative MathJax warning mitigation, release-gate tests, and opt-in pre-conversion PDF chunking.
- `docs/V1IMPLEMENTATIONPLAN.md` defines the v1 implementation sequence.
- `docs/Sprints/` contains completed sprint contracts through Sprint 11.
- `docs/WORKARCHIVE.md` contains completed sprint history, historical verification results, runtime setup notes, and sample conversion evidence.
- The converter currently includes path planning, project-owned records, internal provenance, direct local MinerU adapter boundary, Obsidian Markdown normalization, local quality checks, report rendering, conversion orchestration, simplified output layout, `pdf2md convert`, legacy `pdf2md recheck`, `pdf2md doctor`, local MathJax render checking, conservative MathJax warning mitigation, release-gate tests, opt-in grouped page conversion, a minimal Windows UI launcher with direct-folder PDF batch conversion, pypdf-based text layer fidelity diagnostics, NVIDIA GPU inventory, optional `--gpu auto`, and MinerU profile tuning.
- `docs/V1IMPLEMENTATIONPLAN.md` now tracks current v1 state and open future decisions; completed implementation details are archived in `docs/WORKARCHIVE.md`.
- `docs/Sprints/` contains completed sprint contracts through Sprint 16 and the abandoned Sprint 17 offline installer contract.
- `docs/superpowers/specs/2026-05-13-ui-folder-batch-conversion-design.md` and `docs/superpowers/plans/2026-05-13-ui-folder-batch-conversion.md` record the completed UI direct-folder batch work.
- `docs/WORKARCHIVE.md` contains completed sprint history, historical verification results, runtime setup notes, sample conversion evidence, archived UI work, and abandoned Sprint 17 planning context.
- `samples/` exists locally as fixture context.
- `outputs/` is ignored and contains local generated conversion outputs.
- `outputs/`, `build/`, and `dist/` are local generated artifact locations and must stay out of commits.
## Environment Notes
- OS/workspace: Windows PowerShell in `C:\git\PDFToMD`.
- OS/workspace: Windows PowerShell in `D:\Work\Repos\AICoding\ConvertPDFToMD`.
- Python target: 3.12.
- Local project Python observed: 3.12.13 in `.venv`.
- `uv` is installed per-user at `C:\Users\baram\.local\bin`.
- Target GPU documented for the original project setup: NVIDIA GTX 1070 Ti 8GB.
- Current PC GPU observed by `doctor`: NVIDIA GeForce RTX 4080 SUPER 16GB.
- Local project Python observed: 3.12.7 through `uv run pdf2md doctor` on 2026-05-11.
- `uv` is installed per-user at `C:\Users\user\.local\bin`.
- Target GPU documented for this project setup: NVIDIA GTX 1070 Ti 8GB.
- Current PC GPU observed by `doctor`: NVIDIA GeForce GTX 1070 Ti 8GB.
- Default conversion device: `cuda:0`.
- Default MinerU profile: `auto`.
- MinerU execution mode: direct local `mineru` CLI only.
- Strict-local allows MinerU 3.1.0's CLI-internal temporary local `mineru-api` when the CLI runs without `--api-url`.
- Strict-local prohibits `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends.
- Current `.venv` has project fast-test dependencies, CUDA-enabled PyTorch `2.6.0+cu126`, `torchvision 0.21.0+cu126`, and `mineru[core]==3.1.0`.
- Current `pdf2md doctor` status is PASS. MinerU, RTX 4080 SUPER CUDA PyTorch, local model config, MathJax, and strict-local checks pass.
- Current `.venv` has project fast-test dependencies, CUDA-enabled PyTorch `2.6.0+cu126`, `torchvision 0.21.0+cu126`, `mineru[core]==3.1.0`, local MathJax npm dependencies, and local MinerU models.
- Current `pdf2md doctor` status is WARN because GTX 1070 Ti is Pascal/pre-Turing; MinerU, CUDA PyTorch visibility, local model config, MathJax, and strict-local checks otherwise pass. Doctor selects `cuda:0` for `--gpu auto` on this machine and recommends MinerU profile `safe`.
- MinerU models were downloaded from Hugging Face by explicit setup command. Runtime model loading uses `MINERU_MODEL_SOURCE=local`.
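
The strict-local rules in the notes above could be enforced with a small pre-flight check before the `mineru` CLI is launched. The following is a minimal sketch, not the project's actual implementation: `strict_local_violations` and `FORBIDDEN_FLAGS` are hypothetical names, and the only inputs taken from these notes are the `--api-url` flag and the `MINERU_MODEL_SOURCE=local` setting.

```python
import os

# Flags treated as strict-local violations in this sketch. Only `--api-url`
# is named in the notes; the tuple and helper below are hypothetical.
FORBIDDEN_FLAGS = ("--api-url",)


def strict_local_violations(argv, env=None):
    """Return a list of human-readable strict-local violations (empty if none)."""
    env = os.environ if env is None else env
    problems = []
    for arg in argv:
        for flag in FORBIDDEN_FLAGS:
            # Match both the `--api-url URL` and `--api-url=URL` spellings.
            if arg == flag or arg.startswith(flag + "="):
                problems.append(f"forbidden flag: {flag}")
    # Runtime model loading is expected to use local models only.
    if env.get("MINERU_MODEL_SOURCE") != "local":
        problems.append("MINERU_MODEL_SOURCE must be 'local'")
    return problems
```

A caller would refuse to start the conversion whenever the returned list is non-empty. Router mode, HTTP client backends, and remote OpenAI-compatible backends would need their own checks, which this sketch does not attempt.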
## Recent Completed Work
- Archived completed sprint and setup history into `docs/WORKARCHIVE.md`.
- Added `docs/WORKARCHIVE.md` references to `AGENTS.md`, `PLAN.md`, `docs/V1IMPLEMENTATIONPLAN.md`, relevant `.codex/agents/*.toml`, `.codex/commands/*.md`, and project skills.
- Sprint 10 is implemented with `pypdf>=6.10.2,<7`, `src/pdf2md/pdf_splitter.py`, `--chunk-pages [PAGES]`, chunk-aware conversion orchestration, temporary chunk cleanup, and chunk report context.
- `--chunk-pages` is opt-in; when present without a value it uses 20 pages.
- `convert_pdf()` returns `BatchConversionResult` when `chunk_pages` is set and keeps returning `ConversionResult` when chunking is unset.
- Converted `samples/FourNodeQuadrilateralShellElementMITC4.pdf` with `MINERU_MODEL_SOURCE=local` and default `--gpu cuda:0`; output was written to ignored `outputs/FourNodeQuadrilateralShellElementMITC4/`.
- The FourNode sample conversion report status was `success`: 7 pages, 22 assets, 38 inline formulas, 16 display formulas, 0 math render errors, and 0 warnings.
- Installed uv `0.11.12` at `C:\Users\baram\.local\bin`, installed uv-managed CPython `3.12.13`, created `.venv`, and ran `uv sync`.
- Verified base project environment with `uv run pytest`: 163 passed, 1 skipped.
- Installed runtime dependencies on this PC: CUDA PyTorch `2.6.0+cu126`, `torchvision 0.21.0+cu126`, `mineru[core]==3.1.0`, local MathJax npm dependencies, and local MinerU models.
- Set user environment variable `MINERU_MODEL_SOURCE=local`.
- Verified full local runtime with `uv run pdf2md doctor`: PASS.
- Verified real local sample conversion: `samples/FourNodeQuadrilateralShellElementMITC4.pdf` to ignored `outputs/runtime-smoke/`, status `success`, 7 pages, 22 assets, 38 inline formulas, 16 display formulas, 0 math render errors, and 0 warnings.
- Converted `samples/MITC공부.pdf` to ignored `outputs/MITC공부/`; report status was `partial`: 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 2 MathJax render warnings, and 0 missing or invalid asset links.
- Added `recheck_markdown()` and `pdf2md recheck <markdown.md>` to rerun local quality checks for an existing generated Markdown file and rewrite the adjacent metadata JSON and `.report.md` without rerunning MinerU.
- Verified `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`; the command regenerated metadata/report and still reported 2 warnings because the current Markdown still contains the two MathJax-invalid expressions.
- Reconverted `samples/MITC공부.pdf` with `--overwrite` to ignored `outputs/MITC공부/`; report status remains `partial`: 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 2 MathJax render warnings, and 0 missing or invalid asset links.
- Sprint 11 implemented conservative MathJax warning mitigation with failed-expression details, `src/pdf2md/math_repair.py`, shared `convert`/`recheck` repair integration, and `MATH_RENDER_REPAIRED` info warnings.
- Verified default fast suite: `uv run pytest` passed 172 tests with 1 skipped.
- Verified requested real sample: `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite` succeeded with 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 0 MathJax render errors, and 2 `MATH_RENDER_REPAIRED` info warnings.
- Reconverted `samples/MITC공부.pdf` to ignored `outputs/MITC공부/` with Sprint 11 mitigation; report status is `partial` from 2 `MATH_RENDER_REPAIRED` info warnings, with 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 0 MathJax render errors, and 0 missing or invalid asset links.
- Archived completed coordination details from `PLAN.md`, `PROGRESS.md`, and `docs/V1IMPLEMENTATIONPLAN.md` into `docs/WORKARCHIVE.md`.
- Refreshed current docs so abandoned Sprint 17 offline installer planning, completed UI direct-folder batch conversion, simplified output layout, legacy-only `recheck`, and no-public-metadata behavior are consistently referenced.
- Updated project agent/source-document references so future document reviews and implementation work can find Sprint 15/16 contracts, abandoned Sprint 17 context, and the UI folder batch design/plan.
- Abandoned Sprint 17 offline installer planning at the user's request. The contract and plan remain as historical records only.
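
The `--chunk-pages [PAGES]` behavior described in this list (off unless given, 20 pages when given without a value) maps naturally onto an optional-value argument. The sketch below assumes `argparse`. The notes do not say which CLI framework `pdf2md` uses, and `build_parser` and `page_groups` are hypothetical helpers rather than functions from `src/pdf2md/pdf_splitter.py`.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="pdf2md-sketch")
    parser.add_argument(
        "--chunk-pages",
        nargs="?",       # the value after the flag is optional
        const=20,        # used when the flag appears without a value
        default=None,    # used when the flag is absent (chunking stays off)
        type=int,
        metavar="PAGES",
    )
    return parser


def page_groups(total_pages, chunk_pages):
    """Split 1-based page numbers into inclusive (start, end) groups."""
    if chunk_pages is None:
        # Chunking is opt-in: without it the whole document is one group.
        return [(1, total_pages)]
    if chunk_pages < 1:
        raise ValueError("chunk_pages must be >= 1")
    return [
        (start, min(start + chunk_pages - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_pages)
    ]
```

Under these assumptions, a 13-page source with `--chunk-pages 5` yields the groups `(1, 5)`, `(6, 10)`, and `(11, 13)`, while omitting the flag keeps a single group for the whole document.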
## In Progress
- No active implementation chunk.
- No active implementation sprint.
## Blockers
- No active blocker.
- Residual risk: direct CLI conversion smoke tests for `samples\FourNodeQuadrilateralShellElementMITC4.pdf` exceeded the 15-minute timeout on 2026-05-11, and a run with Sprint 14 `--chunk-pages` stalled on source page 2 on 2026-05-12, so the hands-on UI conversion smoke test remains pending.
- Residual risk: conversion can still be impractically slow or stall on GTX 1070 Ti 8GB for some source pages even when Sprint 14 sends one source page to MinerU at a time.
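
Because the residual risks above involve runs that exceed a time limit or stall, a smoke test can wrap the conversion command in a hard timeout so a stalled run is reported rather than left hanging. This is a minimal sketch using the standard library only. `run_with_timeout` is a hypothetical helper, and the command passed to it is up to the caller.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd; return ("ok", returncode) or ("timeout", None)."""
    try:
        # subprocess.run kills the child process when the timeout expires,
        # then raises TimeoutExpired.
        completed = subprocess.run(cmd, timeout=timeout_seconds, check=False)
    except subprocess.TimeoutExpired:
        return ("timeout", None)
    return ("ok", completed.returncode)
```

A 15-minute limit corresponds to `timeout_seconds=900`. Note that only the direct child is killed on timeout, so any grandchild processes started by the command may need separate cleanup.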
## Next Actions
1. Review generated sample Markdown outputs in Obsidian if visual quality needs manual assessment.
2. Run additional real local sample validation only if requested, especially for new MathJax failure messages not covered by Sprint 11's narrow repair rules.
3. Run optional real local chunked conversion on a long sample only if requested.
4. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
1. Run hands-on UI smoke when practical: launch `dist\pdf2md-ui.exe`, click Doctor, then run one small local conversion to ignored `outputs/`.
2. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
3. Decide in a future sprint whether simplified outputs need metadata-free `pdf2md recheck`; current behavior intentionally remains legacy-only.
4. On a stronger NVIDIA GPU PC, run `uv run pdf2md doctor` and an optional local conversion with `--gpu auto --mineru-profile auto` to validate the auto profile against ignored `outputs/`.