modify pdftomd
+25
-38
@@ -6,64 +6,51 @@ This file records current progress for agents. Read it before starting work, the
- Project direction is documented in `PRD.md`, `ARCHITECTURE.md`, `AGENTS.md`, and `docs/KNOWLEDGEBASE.md`.
- MinerU 3.1.0 is fixed as the only conversion engine.
- The converter currently includes path planning, project-owned records, metadata, direct local MinerU adapter boundary, Obsidian Markdown normalization, local quality checks, report rendering, conversion orchestration, `pdf2md convert`, `pdf2md recheck`, `pdf2md doctor`, local MathJax render checking, conservative MathJax warning mitigation, release-gate tests, and opt-in pre-conversion PDF chunking.
- `docs/V1IMPLEMENTATIONPLAN.md` defines the v1 implementation sequence.
- `docs/Sprints/` contains completed sprint contracts through Sprint 11.
- `docs/WORKARCHIVE.md` contains completed sprint history, historical verification results, runtime setup notes, and sample conversion evidence.
- The converter currently includes path planning, project-owned records, internal provenance, direct local MinerU adapter boundary, Obsidian Markdown normalization, local quality checks, report rendering, conversion orchestration, simplified output layout, `pdf2md convert`, legacy `pdf2md recheck`, `pdf2md doctor`, local MathJax render checking, conservative MathJax warning mitigation, release-gate tests, opt-in grouped page conversion, a minimal Windows UI launcher with direct-folder PDF batch conversion, pypdf-based text layer fidelity diagnostics, NVIDIA GPU inventory, optional `--gpu auto`, and MinerU profile tuning.
- `docs/V1IMPLEMENTATIONPLAN.md` now tracks current v1 state and open future decisions; completed implementation details are archived in `docs/WORKARCHIVE.md`.
- `docs/Sprints/` contains completed sprint contracts through Sprint 16 and the abandoned Sprint 17 offline installer contract.
- `docs/superpowers/specs/2026-05-13-ui-folder-batch-conversion-design.md` and `docs/superpowers/plans/2026-05-13-ui-folder-batch-conversion.md` record the completed UI direct-folder batch work.
- `docs/WORKARCHIVE.md` contains completed sprint history, historical verification results, runtime setup notes, sample conversion evidence, archived UI work, and abandoned Sprint 17 planning context.
- `samples/` exists locally as fixture context.
- `outputs/` is ignored and contains local generated conversion outputs.
- `outputs/`, `build/`, and `dist/` are local generated artifact locations and must stay out of commits.

## Environment Notes

- OS/workspace: Windows PowerShell in `C:\git\PDFToMD`.
- OS/workspace: Windows PowerShell in `D:\Work\Repos\AICoding\ConvertPDFToMD`.
- Python target: 3.12.
- Local project Python observed: 3.12.13 in `.venv`.
- `uv` is installed per-user at `C:\Users\baram\.local\bin`.
- Target GPU documented for the original project setup: NVIDIA GTX 1070 Ti 8GB.
- Current PC GPU observed by `doctor`: NVIDIA GeForce RTX 4080 SUPER 16GB.
- Local project Python observed: 3.12.7 through `uv run pdf2md doctor` on 2026-05-11.
- `uv` is installed per-user at `C:\Users\user\.local\bin`.
- Target GPU documented for this project setup: NVIDIA GTX 1070 Ti 8GB.
- Current PC GPU observed by `doctor`: NVIDIA GeForce GTX 1070 Ti 8GB.
- Default conversion device: `cuda:0`.
- Default MinerU profile: `auto`.
- MinerU execution mode: direct local `mineru` CLI only.
- Strict-local allows MinerU 3.1.0's CLI-internal temporary local `mineru-api` when the CLI runs without `--api-url`.
- Strict-local prohibits `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends.
- Current `.venv` has project fast-test dependencies, CUDA-enabled PyTorch `2.6.0+cu126`, `torchvision 0.21.0+cu126`, and `mineru[core]==3.1.0`.
- Current `pdf2md doctor` status is PASS. MinerU, RTX 4080 SUPER CUDA PyTorch, local model config, MathJax, and strict-local checks pass.
- Current `.venv` has project fast-test dependencies, CUDA-enabled PyTorch `2.6.0+cu126`, `torchvision 0.21.0+cu126`, `mineru[core]==3.1.0`, local MathJax npm dependencies, and local MinerU models.
- Current `pdf2md doctor` status is WARN because GTX 1070 Ti is Pascal/pre-Turing; MinerU, CUDA PyTorch visibility, local model config, MathJax, and strict-local checks otherwise pass. Doctor selects `cuda:0` for `--gpu auto` on this machine and recommends MinerU profile `safe`.
- MinerU models were downloaded from Hugging Face by explicit setup command. Runtime model loading uses `MINERU_MODEL_SOURCE=local`.
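The strict-local notes above (no `--api-url`, runtime model loading with `MINERU_MODEL_SOURCE=local`) can be sketched as a small pre-flight check. This is only an illustration under stated assumptions: the function name `strict_local_violations` and its messages are hypothetical and are not taken from the project's code, and the check covers just the two rules that can be read directly from a command line and an environment mapping.

```python
from typing import Mapping, Sequence


def strict_local_violations(argv: Sequence[str], env: Mapping[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the call looks strict-local.

    Hypothetical helper: it only mirrors two rules stated in the notes.
    """
    problems: list[str] = []
    # The notes prohibit --api-url; accept both "--api-url X" and "--api-url=X" spellings.
    if any(arg == "--api-url" or arg.startswith("--api-url=") for arg in argv):
        problems.append("--api-url is not allowed in strict-local mode")
    # The notes state that runtime model loading uses MINERU_MODEL_SOURCE=local.
    if env.get("MINERU_MODEL_SOURCE") != "local":
        problems.append("MINERU_MODEL_SOURCE must be 'local'")
    return problems
```

For example, `strict_local_violations(["mineru"], {"MINERU_MODEL_SOURCE": "local"})` returns an empty list, while adding `--api-url=...` to the argument list or dropping the environment variable each adds one message.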
## Recent Completed Work

- Archived completed sprint and setup history into `docs/WORKARCHIVE.md`.
- Added `docs/WORKARCHIVE.md` references to `AGENTS.md`, `PLAN.md`, `docs/V1IMPLEMENTATIONPLAN.md`, relevant `.codex/agents/*.toml`, `.codex/commands/*.md`, and project skills.
- Sprint 10 is implemented with `pypdf>=6.10.2,<7`, `src/pdf2md/pdf_splitter.py`, `--chunk-pages [PAGES]`, chunk-aware conversion orchestration, temporary chunk cleanup, and chunk report context.
- `--chunk-pages` is opt-in; when present without a value it uses 20 pages.
- `convert_pdf()` returns `BatchConversionResult` when `chunk_pages` is set and keeps returning `ConversionResult` when chunking is unset.
- Converted `samples/FourNodeQuadrilateralShellElementMITC4.pdf` with `MINERU_MODEL_SOURCE=local` and default `--gpu cuda:0`; output was written to ignored `outputs/FourNodeQuadrilateralShellElementMITC4/`.
- The FourNode sample conversion report status was `success`: 7 pages, 22 assets, 38 inline formulas, 16 display formulas, 0 math render errors, and 0 warnings.
- Installed uv `0.11.12` at `C:\Users\baram\.local\bin`, installed uv-managed CPython `3.12.13`, created `.venv`, and ran `uv sync`.
- Verified base project environment with `uv run pytest`: 163 passed, 1 skipped.
- Installed runtime dependencies on this PC: CUDA PyTorch `2.6.0+cu126`, `torchvision 0.21.0+cu126`, `mineru[core]==3.1.0`, local MathJax npm dependencies, and local MinerU models.
- Set user environment variable `MINERU_MODEL_SOURCE=local`.
- Verified full local runtime with `uv run pdf2md doctor`: PASS.
- Verified real local sample conversion: `samples/FourNodeQuadrilateralShellElementMITC4.pdf` to ignored `outputs/runtime-smoke/`, status `success`, 7 pages, 22 assets, 38 inline formulas, 16 display formulas, 0 math render errors, and 0 warnings.
- Converted `samples/MITC공부.pdf` to ignored `outputs/MITC공부/`; report status was `partial`: 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 2 MathJax render warnings, and 0 missing or invalid asset links.
- Added `recheck_markdown()` and `pdf2md recheck <markdown.md>` to rerun local quality checks for an existing generated Markdown file and rewrite the adjacent metadata JSON and `.report.md` without rerunning MinerU.
- Verified `uv run pdf2md recheck outputs\MITC공부\MITC공부.md`; the command regenerated metadata/report and still reported 2 warnings because the current Markdown still contains the two MathJax-invalid expressions.
- Reconverted `samples/MITC공부.pdf` with `--overwrite` to ignored `outputs/MITC공부/`; report status remains `partial`: 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 2 MathJax render warnings, and 0 missing or invalid asset links.
- Sprint 11 implemented conservative MathJax warning mitigation with failed-expression details, `src/pdf2md/math_repair.py`, shared `convert`/`recheck` repair integration, and `MATH_RENDER_REPAIRED` info warnings.
- Verified default fast suite: `uv run pytest` passed 172 tests with 1 skipped.
- Verified requested real sample: `uv run pdf2md convert samples\MITC공부.pdf --out outputs\sprint11-MITC공부 --overwrite` succeeded with 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 0 MathJax render errors, and 2 `MATH_RENDER_REPAIRED` info warnings.
- Reconverted `samples/MITC공부.pdf` to ignored `outputs/MITC공부/` with Sprint 11 mitigation; report status is `partial` from 2 `MATH_RENDER_REPAIRED` info warnings, with 13 pages, 107 assets, 23 inline formulas, 103 display formulas, 0 MathJax render errors, and 0 missing or invalid asset links.
- Archived completed coordination details from `PLAN.md`, `PROGRESS.md`, and `docs/V1IMPLEMENTATIONPLAN.md` into `docs/WORKARCHIVE.md`.
- Refreshed current docs so abandoned Sprint 17 offline installer planning, completed UI direct-folder batch conversion, simplified output layout, legacy-only `recheck`, and no-public-metadata behavior are consistently referenced.
- Updated project agent/source-document references so future document reviews and implementation work can find Sprint 15/16 contracts, abandoned Sprint 17 context, and the UI folder batch design/plan.
- Abandoned Sprint 17 offline installer planning at the user's request. The contract and plan remain as historical records only.
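The `--chunk-pages [PAGES]` behavior recorded above (opt-in, 20 pages when the flag is given without a value) maps onto a standard `argparse` optional-value pattern. The following is a minimal sketch, assuming an `argparse`-based CLI; the program name `pdf2md-sketch` and the function `build_parser` are hypothetical, and the project's real CLI may be built differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser illustrating an opt-in flag with an optional value."""
    parser = argparse.ArgumentParser(prog="pdf2md-sketch")
    parser.add_argument("pdf", help="path to the source PDF")
    parser.add_argument(
        "--chunk-pages",
        nargs="?",       # the value after the flag is optional
        type=int,
        const=20,        # used when the flag is present without a value
        default=None,    # used when the flag is absent: chunking stays off
        metavar="PAGES",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    mode = "unchunked" if args.chunk_pages is None else f"chunks of {args.chunk_pages} pages"
    print(f"{args.pdf}: {mode}")
```

With this shape, a caller can branch on `chunk_pages is None` to choose between the single-result and batch-result paths described for `convert_pdf()`. The positional PDF path goes before a bare `--chunk-pages`, since otherwise `argparse` would try to read the path as the page count.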
## In Progress

- No active implementation chunk.
- No active implementation sprint.

## Blockers

- No active blocker.
- Residual risk: direct CLI conversion smokes for `samples\FourNodeQuadrilateralShellElementMITC4.pdf` exceeded the 15-minute timeout on 2026-05-11 and stalled on source page 2 with Sprint 14 `--chunk-pages` on 2026-05-12, so hands-on UI conversion smoke remains pending.
- Residual risk: conversion can still be impractically slow or stall on GTX 1070 Ti 8GB for some source pages even when Sprint 14 sends one source page to MinerU at a time.

## Next Actions

1. Review generated sample Markdown outputs in Obsidian if visual quality needs manual assessment.
2. Run additional real local sample validation only if requested, especially for new MathJax failure messages not covered by Sprint 11's narrow repair rules.
3. Run optional real local chunked conversion on a long sample only if requested.
4. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
1. Run hands-on UI smoke when practical: launch `dist\pdf2md-ui.exe`, click Doctor, then run one small local conversion to ignored `outputs/`.
2. Preserve strict-local runtime behavior: use local model paths, direct CLI execution, and no user-specified API or remote backend.
3. Decide in a future sprint whether simplified outputs need metadata-free `pdf2md recheck`; current behavior intentionally remains legacy-only.
4. On a stronger NVIDIA GPU PC, run `uv run pdf2md doctor` and an optional local conversion with `--gpu auto --mineru-profile auto` to validate the auto profile against ignored `outputs/`.
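The doctor result on a stronger GPU should differ because the WARN recorded in the environment notes is tied to the GPU generation (GTX 1070 Ti is Pascal, pre-Turing). A rough sketch of such a generation check, comparing CUDA compute capability against Turing's 7.5, is shown below. The function name `gpu_generation_status` and the exact threshold logic are assumptions for illustration; the notes state only that pre-Turing hardware yields WARN, not how `pdf2md doctor` decides it.

```python
# CUDA compute capability of the Turing generation (major, minor).
TURING_CAPABILITY = (7, 5)


def gpu_generation_status(capability: tuple[int, int]) -> str:
    """Hypothetical check: WARN for pre-Turing GPUs, PASS otherwise.

    `capability` is a (major, minor) compute capability pair, for example
    the value PyTorch reports via torch.cuda.get_device_capability(0).
    """
    # Tuples compare element by element, so (6, 1) < (7, 5) < (8, 9).
    return "WARN" if capability < TURING_CAPABILITY else "PASS"


if __name__ == "__main__":
    print(gpu_generation_status((6, 1)))  # GTX 1070 Ti (Pascal)
    print(gpu_generation_status((8, 9)))  # RTX 4080 SUPER (Ada Lovelace)
```

Keeping the check as a pure function of the capability pair makes it testable without a GPU; the caller supplies whatever the local CUDA runtime reports.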