modify pdftomd
This commit is contained in:
+77
-620
@@ -1,28 +1,45 @@
|
||||
# V1 Implementation Plan: Local PDF-to-Markdown Converter
|
||||
|
||||
Last updated: 2026-05-08
|
||||
Last updated: 2026-05-13
|
||||
|
||||
This document is the implementation plan for v1. It does not replace `PRD.md` or `ARCHITECTURE.md`; use those files as the source of product requirements and system design. This plan explains the order of work, sprint contracts, verification gates, and agent ownership for implementing the converter.
|
||||
This document tracks the current v1 implementation state and open future decisions. It does not replace `PRD.md` or `ARCHITECTURE.md`; use those files as the source of product requirements and system design. Completed sprint details are archived in `docs/WORKARCHIVE.md`, and detailed acceptance criteria remain in `docs/Sprints/*.md`.
|
||||
|
||||
Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents. Sprint 11 implemented conservative MathJax warning mitigation for failed math spans.
|
||||
## 1. Current V1 State
|
||||
|
||||
## 1. V1 Outcome
|
||||
The core v1 converter is implemented through Sprint 16. The implemented system includes:
|
||||
|
||||
- Python 3.12 package and `pdf2md` CLI.
|
||||
- Direct local MinerU 3.1.0 CLI adapter with strict-local enforcement.
|
||||
- Obsidian-friendly Markdown normalization.
|
||||
- Internal provenance, structured warnings, quality checks, and one human-readable report.
|
||||
- `pdf2md doctor`.
|
||||
- Optional grouped page conversion through `--chunk-pages`.
|
||||
- Local MathJax render checking and conservative failed-span repair.
|
||||
- pypdf-based text fidelity diagnostics.
|
||||
- NVIDIA GPU inventory, `--gpu auto`, and `--mineru-profile auto|safe|performance`.
|
||||
- Simplified output layout: `<out>/<stem>/<stem>_001.md`, shared `<out>/<stem>/images/`, and `<out>/<stem>/<stem>_report.md`.
|
||||
- No public metadata JSON for new conversions.
|
||||
- Minimal Windows UI launcher over the existing CLI, including direct-folder PDF batch conversion through sequential `pdf2md convert` subprocesses.
|
||||
|
||||
Historical implementation evidence, verification commands, and sample conversion results are in `docs/WORKARCHIVE.md`.
|
||||
|
||||
## 2. V1 Outcome
|
||||
|
||||
v1 is complete when a local user can run:
|
||||
|
||||
```bash
|
||||
uv run pdf2md doctor
|
||||
uv run pdf2md convert paper.pdf --out out --metadata
|
||||
uv run pdf2md convert pdfs --out out --recursive --metadata
|
||||
uv run pdf2md convert paper.pdf --out out
|
||||
uv run pdf2md convert pdfs --out out --recursive
|
||||
```
|
||||
|
||||
and receive, for each PDF:
|
||||
|
||||
- Obsidian-friendly Markdown.
|
||||
- A stable sibling assets directory when assets exist.
|
||||
- `<stem>.metadata.json`.
|
||||
- `<stem>.report.md`.
|
||||
- Clear warnings when math, tables, assets, reading order, GPU availability, or MinerU execution are uncertain.
|
||||
- Obsidian-friendly Markdown parts under `<out>/<stem>/<stem>_001.md`, `<stem>_002.md`, and so on.
|
||||
- A stable shared image/media directory under `<out>/<stem>/images/`.
|
||||
- One human-readable report under `<out>/<stem>/<stem>_report.md`.
|
||||
- No persisted metadata JSON for new conversions.
|
||||
- Clear warnings when math, tables, assets, reading order, text fidelity, GPU availability, or MinerU execution are uncertain.
|
||||
|
||||
Long PDFs can be chunked explicitly:
|
||||
|
||||
@@ -31,11 +48,11 @@ uv run pdf2md convert paper.pdf --out out --chunk-pages
|
||||
uv run pdf2md convert paper.pdf --out out --chunk-pages 20
|
||||
```
|
||||
|
||||
Chunked conversion writes separate outputs per chunk and does not merge Markdown files.
|
||||
When `--chunk-pages` is active, MinerU receives one-page temporary PDFs and final Markdown files are grouped by the configured page count. Temporary one-page PDFs and intermediate per-page outputs are deleted.
|
||||
|
||||
The converter must use MinerU 3.1.0 through direct local CLI execution only. It must not silently fallback to another engine.
|
||||
The Windows UI launcher is a convenience wrapper over `pdf2md`; it is not a separate conversion pipeline. UI folder batch conversion runs direct-child PDFs sequentially through the same CLI conversion path.
|
||||
|
||||
## 2. Non-Negotiable Constraints
|
||||
## 3. Non-Negotiable Constraints
|
||||
|
||||
- Python 3.12 and `uv`.
|
||||
- MinerU 3.1.0 is the only conversion engine.
|
||||
@@ -45,34 +62,10 @@ The converter must use MinerU 3.1.0 through direct local CLI execution only. It
|
||||
- Target hardware: NVIDIA GTX 1070 Ti 8GB.
|
||||
- Digital PDFs with text layers are the v1 priority.
|
||||
- `samples/` is local fixture context and must not be committed unless explicitly requested.
|
||||
- UI launcher must invoke `pdf2md` or `uv run pdf2md`; it must not call MinerU directly or bundle the full conversion runtime.
|
||||
- Every substantial implementation chunk needs a sprint contract and independent evaluation.
|
||||
|
||||
## 3. Harness Operating Model
|
||||
|
||||
Use the project long-running harness only for substantial implementation work.
|
||||
|
||||
1. `harness-planner-agent` turns the next user request into a sprint contract.
|
||||
2. `evaluation-agent` reviews the contract before code changes start.
|
||||
3. `feature-generator-agent` implements one approved contract at a time.
|
||||
4. `feature-generator-agent` runs self-checks and records residual risks.
|
||||
5. `evaluation-agent` independently verifies the result against the contract.
|
||||
6. The parent agent updates `PROGRESS.md`, commits the completed change, and leaves a handoff.
|
||||
|
||||
After a chunk is no longer active, archive completed-work details in `docs/WORKARCHIVE.md` and keep `PROGRESS.md` focused on current status, blockers, and next actions.
|
||||
|
||||
Each sprint contract must include:
|
||||
|
||||
- Objective.
|
||||
- Touched surfaces.
|
||||
- Expected outputs.
|
||||
- Non-goals.
|
||||
- Verification checks.
|
||||
- Hard failure criteria.
|
||||
- Handoff fields.
|
||||
|
||||
## 4. Proposed Repository Layout
|
||||
|
||||
Create this layout incrementally; do not scaffold unused modules before a sprint needs them.
|
||||
## 4. Current Repository Layout
|
||||
|
||||
```text
|
||||
pyproject.toml
|
||||
@@ -91,615 +84,79 @@ src/
|
||||
quality.py
|
||||
report.py
|
||||
doctor.py
|
||||
gpu.py
|
||||
mineru_profile.py
|
||||
math_render.py
|
||||
math_repair.py
|
||||
text_fidelity.py
|
||||
pdf2md_ui/
|
||||
__init__.py
|
||||
app.py
|
||||
runner.py
|
||||
tests/
|
||||
unit/
|
||||
integration/
|
||||
fixtures/
|
||||
scripts/
|
||||
install-mineru.ps1
|
||||
install-models.py
|
||||
docs/
|
||||
Sprints/
|
||||
superpowers/
|
||||
```
|
||||
|
||||
Planned module responsibilities:
|
||||
Do not scaffold unused modules before a sprint needs them.
|
||||
|
||||
- `cli.py`: command parsing, CLI summaries, exit codes.
|
||||
- `conversion.py`: orchestration for one PDF and batch input.
|
||||
- `paths.py`: input discovery, output path planning, overwrite checks.
|
||||
- `mineru_adapter.py`: direct local MinerU CLI boundary.
|
||||
- `ir.py`: project-owned document/page/block/asset/warning records.
|
||||
- `markdown.py`: Obsidian Markdown normalization.
|
||||
- `metadata.py`: metadata schema creation and warning aggregation.
|
||||
- `quality.py`: local checks for assets, math renderability, and output sanity.
|
||||
- `report.py`: `<stem>.report.md` generation from metadata.
|
||||
- `doctor.py`: environment, dependency, CUDA/GPU, MinerU, and cache diagnostics.
|
||||
|
||||
## 5. Sprint Sequence
|
||||
|
||||
### Sprint 0: Source And Environment Verification
|
||||
|
||||
Active contract:
|
||||
|
||||
- `docs/Sprints/SPRINT0CONTRACT.md`
|
||||
|
||||
Objective:
|
||||
|
||||
- Verify the facts needed before implementation starts.
|
||||
|
||||
Touched surfaces:
|
||||
|
||||
- `docs/KNOWLEDGEBASE.md`
|
||||
- `docs/V1IMPLEMENTATIONPLAN.md` if sequencing changes
|
||||
- `PROGRESS.md`
|
||||
|
||||
Expected outputs:
|
||||
|
||||
- Confirmed MinerU 3.1.0 install command, CLI invocation shape, version command, output paths, and local execution behavior.
|
||||
- Confirmed Python 3.12, `uv`, CUDA/PyTorch, and GTX 1070 Ti 8GB risks.
|
||||
- Confirmed license notes needed before redistribution.
|
||||
|
||||
Verification checks:
|
||||
|
||||
- All volatile facts cite official MinerU, Python, uv, PyTorch/CUDA, or license sources.
|
||||
- No candidate engine comparison is reintroduced.
|
||||
- No implementation code is created.
|
||||
|
||||
Hard failure criteria:
|
||||
|
||||
- MinerU 3.1.0 cannot be reasonably invoked through a direct local CLI on the target environment.
|
||||
- Python 3.12 compatibility is not viable without changing project requirements.
|
||||
|
||||
Primary agents:
|
||||
|
||||
- `research-agent`
|
||||
- `local-setup-agent`
|
||||
- `license-privacy-agent`
|
||||
|
||||
### Sprint 1: Project Scaffold And Fast Test Loop
|
||||
|
||||
Active contract:
|
||||
|
||||
- `docs/Sprints/SPRINT1CONTRACT.md`
|
||||
|
||||
Objective:
|
||||
|
||||
- Create the minimal Python project structure and a fast local test loop.
|
||||
|
||||
Touched surfaces:
|
||||
|
||||
- `pyproject.toml`
|
||||
- `src/pdf2md/__init__.py`
|
||||
- `tests/`
|
||||
- Development documentation if needed
|
||||
|
||||
Expected outputs:
|
||||
|
||||
- `uv sync` works.
|
||||
- `uv run pytest` works.
|
||||
- Project package imports as `pdf2md`.
|
||||
- CLI entry point name `pdf2md` is reserved but may initially expose only `doctor` or a clear placeholder until later sprints.
|
||||
- If `uv` is still unavailable locally, Sprint 1 records that blocker and is not marked complete.
|
||||
|
||||
Verification checks:
|
||||
|
||||
- Import test passes.
|
||||
- Empty test suite or initial scaffold tests pass.
|
||||
- No runtime network dependency is introduced.
|
||||
|
||||
Hard failure criteria:
|
||||
|
||||
- Project cannot be installed with `uv`.
|
||||
- Scaffolding adds speculative config systems, extra engines, or unused abstractions.
|
||||
|
||||
Primary agents:
|
||||
|
||||
- `harness-planner-agent`
|
||||
- `feature-generator-agent`
|
||||
- `evaluation-agent`
|
||||
|
||||
### Sprint 2: Paths, Input Discovery, And Overwrite Planning
|
||||
|
||||
Active contract:
|
||||
|
||||
- `docs/Sprints/SPRINT2CONTRACT.md`
|
||||
|
||||
Objective:
|
||||
|
||||
- Implement deterministic input and output planning before conversion logic exists.
|
||||
|
||||
Touched surfaces:
|
||||
|
||||
- `paths.py`
|
||||
- `conversion.py` skeleton if needed
|
||||
- CLI path handling tests
|
||||
|
||||
Expected outputs:
|
||||
|
||||
- Single PDF discovery.
|
||||
- Directory PDF discovery.
|
||||
- Recursive traversal only when requested.
|
||||
- Deterministic output paths for Markdown, assets, metadata JSON, report, and optional raw MinerU output.
|
||||
- Existing-output protection unless `--overwrite` is passed.
|
||||
|
||||
Verification checks:
|
||||
|
||||
- Unit tests for single PDF path planning.
|
||||
- Unit tests for directory and recursive discovery.
|
||||
- Unit tests for overwrite behavior.
|
||||
- Tests include Korean or non-ASCII filename handling using generated temporary files, not committed sample PDFs.
|
||||
|
||||
Hard failure criteria:
|
||||
|
||||
- Output planning can overwrite user files without explicit overwrite intent.
|
||||
- Directory conversion descends recursively without `--recursive`.
|
||||
|
||||
Primary agents:
|
||||
|
||||
- `feature-generator-agent`
|
||||
- `evaluation-agent`
|
||||
|
||||
### Sprint 3: Domain Records, Metadata, And Warning Model
|
||||
|
||||
Active contract:
|
||||
|
||||
- `docs/Sprints/SPRINT3CONTRACT.md`
|
||||
|
||||
Objective:
|
||||
|
||||
- Define project-owned records before binding to MinerU output.
|
||||
|
||||
Touched surfaces:
|
||||
|
||||
- `ir.py`
|
||||
- `metadata.py`
|
||||
- `report.py` skeleton if needed
|
||||
- Unit tests
|
||||
|
||||
Expected outputs:
|
||||
|
||||
- Document, page, block, asset, and warning records.
|
||||
- Stable warning codes from `ARCHITECTURE.md`.
|
||||
- Metadata JSON builder with required top-level and summary fields.
|
||||
- Warning aggregation logic.
|
||||
|
||||
Verification checks:
|
||||
|
||||
- Unit tests for metadata schema creation.
|
||||
- Unit tests for warning aggregation.
|
||||
- Unit tests for optional fields such as bbox and confidence being preserved only when present.
|
||||
|
||||
Hard failure criteria:
|
||||
|
||||
- Public API requires raw MinerU objects.
|
||||
- Metadata omits source PDF, SHA-256, engine, pages, warnings, assets, or summary.
|
||||
|
||||
Primary agents:
|
||||
|
||||
- `metadata-agent`
|
||||
- `feature-generator-agent`
|
||||
- `evaluation-agent`
|
||||
|
||||
### Sprint 4: MinerU Adapter With Mocked Contract
|
||||
|
||||
Active contract:
|
||||
|
||||
- `docs/Sprints/SPRINT4CONTRACT.md`
|
||||
|
||||
Objective:
|
||||
|
||||
- Build the direct local MinerU adapter boundary with mocked outputs first.
|
||||
|
||||
Touched surfaces:
|
||||
|
||||
- `mineru_adapter.py`
|
||||
- `doctor.py` partial checks
|
||||
- Adapter tests with fake subprocess results and fake output directories
|
||||
|
||||
Expected outputs:
|
||||
|
||||
- Adapter availability check.
|
||||
- Version check.
|
||||
- Direct CLI command construction.
|
||||
- Strict-local command validation.
|
||||
- Subprocess execution wrapper capturing stdout, stderr, exit code, and paths.
|
||||
- Parsed adapter result object with raw Markdown, raw structured data when available, assets, warnings, engine, engine version, options, exit code, and stderr.
|
||||
- Baseline command shape based on MinerU 3.1.0 direct local CLI: `mineru -p <input> -o <output>`.
|
||||
- Strict-local validation allows CLI-internal temporary local `mineru-api` orchestration, while rejecting `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends.
|
||||
|
||||
Verification checks:
|
||||
|
||||
- Mocked successful MinerU output test.
|
||||
- Mocked missing MinerU test.
|
||||
- Mocked non-zero exit test.
|
||||
- Test that prohibited remote/API flags cannot be introduced.
|
||||
- No real MinerU/model dependency in default tests.
|
||||
|
||||
Hard failure criteria:
|
||||
|
||||
- Adapter passes `--api-url`, uses router mode, uses an HTTP client backend, or connects to a remote API or remote OpenAI-compatible backend.
|
||||
- Adapter falls back to another engine after MinerU failure.
|
||||
- Tests require model downloads by default.
|
||||
|
||||
Primary agents:
|
||||
|
||||
- `mineru-integration-agent`
|
||||
- `feature-generator-agent`
|
||||
- `evaluation-agent`
|
||||
|
||||
### Sprint 5: Obsidian Markdown Normalization And Assets
|
||||
|
||||
Active contract:
|
||||
|
||||
- `docs/Sprints/SPRINT5CONTRACT.md`
|
||||
|
||||
Objective:
|
||||
|
||||
- Normalize MinerU/project IR output into Obsidian-friendly Markdown.
|
||||
|
||||
Touched surfaces:
|
||||
|
||||
- `markdown.py`
|
||||
- `quality.py` partial asset link checks
|
||||
- Unit tests
|
||||
|
||||
Expected outputs:
|
||||
|
||||
- Inline math delimiter normalization to `$...$`.
|
||||
- Display math delimiter normalization to `$$...$$`.
|
||||
- Blank-line normalization around display math.
|
||||
- Relative asset link normalization.
|
||||
- Simple table preservation and complex table fallback warnings.
|
||||
- No visible page markers by default.
|
||||
|
||||
Verification checks:
|
||||
|
||||
- Unit tests for inline math.
|
||||
- Unit tests for display math spacing.
|
||||
- Unit tests for underscores/carets inside math.
|
||||
- Unit tests for relative asset links.
|
||||
- Unit tests for table fallback warning behavior.
|
||||
|
||||
Hard failure criteria:
|
||||
|
||||
- Normalization rewrites LaTeX semantics without deterministic tests.
|
||||
- Generated links are absolute when relative links are required.
|
||||
- Page provenance is only visible in Markdown and missing from metadata.
|
||||
|
||||
Primary agents:
|
||||
|
||||
- `obsidian-markdown-agent`
|
||||
- `feature-generator-agent`
|
||||
- `evaluation-agent`
|
||||
|
||||
### Sprint 6: Quality Checks And Report Generation
|
||||
|
||||
Active contract:
|
||||
|
||||
- `docs/Sprints/SPRINT6CONTRACT.md`
|
||||
|
||||
Objective:
|
||||
|
||||
- Produce local quality signals and human-readable reports from metadata.
|
||||
|
||||
Touched surfaces:
|
||||
|
||||
- `quality.py`
|
||||
- `report.py`
|
||||
- `metadata.py`
|
||||
- Unit tests
|
||||
|
||||
Expected outputs:
|
||||
|
||||
- Missing asset link count.
|
||||
- Math renderability check interface with graceful unavailable-tool handling.
|
||||
- Pages-with-warnings summary.
|
||||
- `<stem>.report.md` generated from metadata.
|
||||
- Final status: `success`, `partial`, or `failed`.
|
||||
|
||||
Verification checks:
|
||||
|
||||
- Unit tests for report content.
|
||||
- Unit tests for missing asset link count.
|
||||
- Unit tests for math render failure aggregation.
|
||||
- Report generation does not re-run MinerU.
|
||||
|
||||
Hard failure criteria:
|
||||
|
||||
- Report diverges from JSON metadata.
|
||||
- Math render failures are silently ignored.
|
||||
- Quality checks require network access.
|
||||
|
||||
Primary agents:
|
||||
|
||||
- `metadata-agent`
|
||||
- `evaluation-agent`
|
||||
- `feature-generator-agent`
|
||||
|
||||
### Sprint 7: Conversion Orchestrator, CLI, And Python API
|
||||
|
||||
Active contract:
|
||||
|
||||
- `docs/Sprints/SPRINT7CONTRACT.md`
|
||||
|
||||
Objective:
|
||||
|
||||
- Connect path planning, MinerU adapter, normalization, metadata, report, and summaries.
|
||||
|
||||
Touched surfaces:
|
||||
|
||||
- `conversion.py`
|
||||
- `cli.py`
|
||||
- `__init__.py`
|
||||
- CLI and API tests
|
||||
|
||||
Expected outputs:
|
||||
|
||||
- `convert_pdf(input_path, output_dir, metadata=True)` public API.
|
||||
- `pdf2md convert INPUT --out OUTPUT_DIR`.
|
||||
- `--metadata`, `--keep-raw`, `--recursive`, `--overwrite`, `--gpu`, and `--strict-local` behavior.
|
||||
- Batch conversion for directories.
|
||||
- CLI summary with warning counts.
|
||||
|
||||
Verification checks:
|
||||
|
||||
- API test with mocked MinerU adapter.
|
||||
- CLI single PDF test with mocked MinerU adapter.
|
||||
- CLI directory test with mocked MinerU adapter.
|
||||
- Existing output test.
|
||||
- Failure summary test.
|
||||
|
||||
Hard failure criteria:
|
||||
|
||||
- Public API exposes raw MinerU objects as required return fields.
|
||||
- CLI writes outputs after a hard failure that should stop conversion.
|
||||
- CLI suppresses warning counts.
|
||||
|
||||
Primary agents:
|
||||
|
||||
- `feature-generator-agent`
|
||||
- `requirements-guard-agent`
|
||||
- `evaluation-agent`
|
||||
|
||||
### Sprint 8: Doctor And Setup Documentation
|
||||
|
||||
Active contract:
|
||||
|
||||
- `docs/Sprints/SPRINT8CONTRACT.md`
|
||||
## 5. Active Next Sprint
|
||||
|
||||
Status:
|
||||
|
||||
- Implemented.
|
||||
- No active implementation sprint.
|
||||
|
||||
Objective:
|
||||
Next implementation work should start from a new user-approved requirement and, if substantial, a new sprint contract.
|
||||
|
||||
- Make local setup failures explicit before users run conversions.
|
||||
## 6. Abandoned Planning
|
||||
|
||||
Touched surfaces:
|
||||
|
||||
- `doctor.py`
|
||||
- `cli.py`
|
||||
- `README.md`
|
||||
- `scripts/install-mineru.ps1`
|
||||
- `scripts/install-models.py`
|
||||
- Tests for mocked environment checks
|
||||
|
||||
Expected outputs:
|
||||
|
||||
- `pdf2md doctor` reports Python version, `uv`, CUDA/PyTorch GPU visibility, MinerU availability, MinerU version, and detectable model/cache paths.
|
||||
- GPU unavailable warning is clear.
|
||||
- Missing `uv` is reported clearly.
|
||||
- Pre-Turing/Pascal GPU risk is reported clearly for GTX 1070 Ti compute capability 6.1.
|
||||
- Missing required dependency causes doctor failure.
|
||||
- Setup docs explain Windows PowerShell, Python 3.12, `uv`, MinerU, models, GPU expectations, and local-only behavior.
|
||||
|
||||
Verification checks:
|
||||
|
||||
- Mocked doctor tests for success, missing MinerU, missing GPU, and missing dependency.
|
||||
- Documentation review for no cloud/API runtime path.
|
||||
|
||||
Hard failure criteria:
|
||||
|
||||
- Doctor says the environment is healthy when MinerU is missing.
|
||||
- Doctor implies cloud/API fallback is supported.
|
||||
|
||||
Primary agents:
|
||||
|
||||
- `local-setup-agent`
|
||||
- `license-privacy-agent`
|
||||
- `evaluation-agent`
|
||||
|
||||
### Sprint 9: Local Fixture Evaluation And V1 Release Gate
|
||||
|
||||
Active contract:
|
||||
|
||||
- `docs/Sprints/SPRINT9CONTRACT.md`
|
||||
### Sprint 17: Offline Windows Installer
|
||||
|
||||
Status:
|
||||
|
||||
- Implemented.
|
||||
- Abandoned at the user's request on 2026-05-13.
|
||||
|
||||
Objective:
|
||||
Historical references:
|
||||
|
||||
- Validate the end-to-end v1 behavior against local samples without committing samples.
|
||||
- `docs/Sprints/SPRINT17CONTRACT.md`.
|
||||
- `docs/superpowers/plans/2026-05-12-offline-installer.md`.
|
||||
|
||||
Touched surfaces:
|
||||
Do not implement or extend Sprint 17 unless the user explicitly reopens offline installer work.
|
||||
|
||||
- `tests/integration/`
|
||||
- Optional local-only fixture manifest that does not include sample PDFs
|
||||
- `README.md`
|
||||
- `PROGRESS.md`
|
||||
## 7. Future Decisions
|
||||
|
||||
Expected outputs:
|
||||
- Decide whether simplified outputs need a metadata-free `pdf2md recheck`; current `recheck` remains legacy-only for outputs with adjacent metadata JSON.
|
||||
- Validate `--gpu auto --mineru-profile auto` on a stronger NVIDIA GPU PC.
|
||||
|
||||
- Fast mocked integration suite.
|
||||
- Optional MinerU-dependent local test command.
|
||||
- Local sample coverage notes in `PROGRESS.md`.
|
||||
- V1 release checklist status.
|
||||
## 8. Harness Operating Model
|
||||
|
||||
Verification checks:
|
||||
Use the project long-running harness only for substantial implementation work.
|
||||
|
||||
- `uv run pytest` passes without model downloads.
|
||||
- Optional MinerU test is clearly marked and skipped unless explicitly enabled.
|
||||
- Representative sample produces Markdown, metadata JSON, report Markdown, and asset paths.
|
||||
- Obsidian math delimiter expectations are met.
|
||||
- No sample PDFs are staged.
|
||||
1. `harness-planner-agent` turns the next user request into a sprint contract.
|
||||
2. `evaluation-agent` reviews the contract before code changes start.
|
||||
3. `feature-generator-agent` implements one approved contract at a time.
|
||||
4. `feature-generator-agent` runs self-checks and records residual risks.
|
||||
5. `evaluation-agent` independently verifies the result against the contract.
|
||||
6. The parent agent updates `PROGRESS.md`, commits the completed change, and leaves a handoff.
|
||||
|
||||
Hard failure criteria:
|
||||
After a chunk is no longer active, archive completed-work details in `docs/WORKARCHIVE.md` and keep `PROGRESS.md` focused on current status, blockers, and next actions.
|
||||
|
||||
- Default tests require GPU, MinerU models, or network access.
|
||||
- Sample files are added to git unintentionally.
|
||||
- V1 release checklist passes without metadata/report generation.
|
||||
## 9. Completed Sprint Archive
|
||||
|
||||
Primary agents and skills:
|
||||
Completed sprint details have been moved out of this active implementation plan.
|
||||
|
||||
- `evaluation-agent`
|
||||
- `requirements-guard-agent`
|
||||
- `fixture-evaluation` skill
|
||||
- Summary and verification evidence: `docs/WORKARCHIVE.md`.
|
||||
- Detailed historical contracts: `docs/Sprints/SPRINT0CONTRACT.md` through `docs/Sprints/SPRINT16CONTRACT.md`.
|
||||
- UI folder batch design and execution record: `docs/superpowers/specs/2026-05-13-ui-folder-batch-conversion-design.md` and `docs/superpowers/plans/2026-05-13-ui-folder-batch-conversion.md`.
|
||||
- Abandoned Sprint 17 planning record: `docs/Sprints/SPRINT17CONTRACT.md` and `docs/superpowers/plans/2026-05-12-offline-installer.md`.
|
||||
|
||||
### Sprint 10: Pre-Conversion PDF Page Chunking
|
||||
|
||||
Active contract:
|
||||
|
||||
- `docs/Sprints/SPRINT10CONTRACT.md`
|
||||
|
||||
Status:
|
||||
|
||||
- Implemented.
|
||||
|
||||
Objective:
|
||||
|
||||
- Split long PDFs into temporary fixed-size page chunks before MinerU conversion.
|
||||
|
||||
Touched surfaces:
|
||||
|
||||
- `pdf_splitter.py`
|
||||
- `conversion.py`
|
||||
- `cli.py`
|
||||
- `report.py`
|
||||
- README and Sprint 10 documentation
|
||||
- Unit tests for splitter, conversion, CLI, and report behavior
|
||||
|
||||
Expected outputs:
|
||||
|
||||
- `pdf2md convert INPUT --out OUTPUT --chunk-pages` enables 20-page chunks.
|
||||
- `pdf2md convert INPUT --out OUTPUT --chunk-pages N` enables custom positive chunk size.
|
||||
- `convert_pdf(..., chunk_pages=N)` returns a `BatchConversionResult` in chunk mode.
|
||||
- Temporary chunk PDFs are deleted after conversion completes.
|
||||
- Chunk Markdown files are separate and named with original page ranges.
|
||||
- Metadata and report content expose original source path and chunk page ranges.
|
||||
|
||||
Verification checks:
|
||||
|
||||
- pypdf-based local blank PDF tests cover page counts, chunk ranges, and written chunk page counts.
|
||||
- Mocked conversion tests verify one adapter call per chunk, failed-chunk continuation, chunk metadata/report context, and temporary chunk cleanup.
|
||||
- CLI tests verify `--chunk-pages` without a value uses 20 pages.
|
||||
|
||||
Hard failure criteria:
|
||||
|
||||
- Chunking uploads document content or uses another conversion engine.
|
||||
- Chunk outputs are merged.
|
||||
- Default tests require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
||||
|
||||
### Sprint 11: MathJax Warning Mitigation
|
||||
|
||||
Active contract:
|
||||
|
||||
- `docs/Sprints/SPRINT11CONTRACT.md`
|
||||
|
||||
Status:
|
||||
|
||||
- Implemented.
|
||||
|
||||
Objective:
|
||||
|
||||
- Repair narrow MathJax-invalid formula artifacts after initial local validation and before final output writing.
|
||||
|
||||
Touched surfaces:
|
||||
|
||||
- `quality.py`
|
||||
- `math_repair.py`
|
||||
- `conversion.py`
|
||||
- `ir.py`
|
||||
- Unit tests for quality details, repair rules, conversion, and recheck behavior
|
||||
|
||||
Expected outputs:
|
||||
|
||||
- Failed math expression records expose body, display mode, span, and checker message.
|
||||
- Repair candidates are generated only for failed math spans.
|
||||
- Repeated same-direction scripts are disambiguated with an empty group.
|
||||
- Truncated `\end{a}` array endings are repaired when array environments are unbalanced.
|
||||
- `convert` and `recheck` share the same repair behavior.
|
||||
- Applied repairs are recorded as `MATH_RENDER_REPAIRED` info warnings and do not count as math render errors.
|
||||
|
||||
Verification checks:
|
||||
|
||||
- Default fast tests pass without real MinerU, GPU, Node.js, MathJax, network, Obsidian, or `samples/`.
|
||||
- `samples/MITC공부.pdf` validates locally with `Math render error count: 0`.
|
||||
|
||||
Hard failure criteria:
|
||||
|
||||
- Repair changes math spans that did not fail local MathJax validation.
|
||||
- Repair claims success without candidate revalidation.
|
||||
- Repair introduces remote services, alternate engines, or mandatory sample-dependent default tests.
|
||||
|
||||
## 6. Cross-Cutting Acceptance Criteria
|
||||
|
||||
Every implementation sprint must preserve these acceptance criteria:
|
||||
|
||||
- No runtime remote document processing path exists.
|
||||
- MinerU is the only conversion engine.
|
||||
- Failures are explicit and traceable.
|
||||
- Warnings are structured and countable.
|
||||
- Markdown and metadata can be traced back to source pages where available.
|
||||
- Reports are generated from metadata.
|
||||
- Default tests are fast and local.
|
||||
- `samples/` remains untracked unless explicitly requested.
|
||||
|
||||
## 7. First Implementation Request Contract Template
|
||||
|
||||
Use this template when implementation begins.
|
||||
|
||||
```markdown
|
||||
## Sprint Contract
|
||||
|
||||
Objective:
|
||||
|
||||
Touched surfaces:
|
||||
|
||||
Expected outputs:
|
||||
|
||||
Non-goals:
|
||||
|
||||
Verification checks:
|
||||
|
||||
Hard failure criteria:
|
||||
|
||||
Handoff fields:
|
||||
- Files changed:
|
||||
- Commands run:
|
||||
- Tests passed:
|
||||
- Known failures:
|
||||
- Residual risks:
|
||||
- Next action:
|
||||
```
|
||||
|
||||
## 8. Open Risks
|
||||
|
||||
- MinerU 3.1.0 install and CLI behavior are source-verified, but real local output still needs a later local probe before release.
|
||||
- GTX 1070 Ti 8GB is visible locally, but it is Pascal compute capability 6.1; `doctor` and setup docs must make CUDA/PyTorch limits clear.
|
||||
- `uv` is installed per-user at `C:\Users\user\.local\bin`, but a new shell may need PATH refresh before `uv` is visible.
|
||||
- Formula renderability checks and conservative warning mitigation are implemented, but formula reconstruction remains best effort and should keep warnings/provenance visible.
|
||||
- Some PDFs will have tables or formulas that cannot be faithfully represented in Markdown; metadata and `.report.md` must surface this instead of hiding it.
|
||||
- Redistribution license obligations must be reviewed before packaging, redistribution, or bundling model weights.
|
||||
|
||||
## 9. Recommended Next Step
|
||||
|
||||
Run optional real local MinerU validation on a long sample only when requested. Default verification should continue to use mocked adapters and generated temporary PDFs so it remains independent of MinerU, GPU, model files, network access, and `samples/`.
|
||||
|
||||
Facts carried forward from Sprint 0:
|
||||
Facts carried forward from completed work:
|
||||
|
||||
- MinerU is fixed to version 3.1.0.
|
||||
- Direct local CLI command shape is `mineru -p <input> -o <output>`.
|
||||
- MinerU output layout should be treated as optional-file based until locally probed.
|
||||
- Python 3.12 is compatible with the pinned MinerU package range.
|
||||
- GTX 1070 Ti CUDA/PyTorch support needs explicit doctor validation.
|
||||
- MinerU/model license posture is acceptable for personal local use. Redistribution remains out of scope until reviewed.
|
||||
- Formula reconstruction remains best effort and must keep warnings/provenance visible.
|
||||
- MinerU/model license posture is acceptable for personal local use. Redistribution remains gated by license review.
|
||||
|
||||
Reference in New Issue
Block a user