664 lines
20 KiB
Markdown
664 lines
20 KiB
Markdown
# V1 Implementation Plan: Local PDF-to-Markdown Converter
|
|
|
|
Last updated: 2026-05-08
|
|
|
|
This document is the implementation plan for v1. It does not replace `PRD.md` or `ARCHITECTURE.md`; use those files as the source of product requirements and system design. This plan explains the order of work, sprint contracts, verification gates, and agent ownership for implementing the converter.
|
|
|
|
Sprint 1 created the Python package scaffold and CLI placeholder. Sprint 2 created path planning. Sprint 3 created project-owned records and metadata construction. Sprint 4 created the mocked direct local MinerU adapter boundary. Sprint 5 created the Obsidian Markdown normalization boundary. Sprint 6 created local quality-check and report-rendering boundaries. Sprint 7 implemented conversion orchestration, the public conversion API, and the `pdf2md convert` CLI path with fake-adapter tests. Sprint 8 implemented mockable doctor diagnostics, the `pdf2md doctor` CLI path, and setup documentation. Sprint 9 implemented fast mocked integration tests, explicit opt-in local MinerU fixture evaluation, and the v1 release checklist. Sprint 10 implemented opt-in pre-conversion PDF chunking for long documents.
|
|
|
|
## 1. V1 Outcome
|
|
|
|
v1 is complete when a local user can run:
|
|
|
|
```bash
|
|
uv run pdf2md doctor
|
|
uv run pdf2md convert paper.pdf --out out --metadata
|
|
uv run pdf2md convert pdfs --out out --recursive --metadata
|
|
```
|
|
|
|
and receive, for each PDF:
|
|
|
|
- Obsidian-friendly Markdown.
|
|
- A stable sibling assets directory when assets exist.
|
|
- `<stem>.metadata.json`.
|
|
- `<stem>.report.md`.
|
|
- Clear warnings when math, tables, assets, reading order, GPU availability, or MinerU execution are uncertain.
|
|
|
|
Long PDFs can be chunked explicitly:
|
|
|
|
```bash
|
|
uv run pdf2md convert paper.pdf --out out --chunk-pages
|
|
uv run pdf2md convert paper.pdf --out out --chunk-pages 20
|
|
```
|
|
|
|
Chunked conversion writes separate outputs per chunk and does not merge Markdown files.
|
|
|
|
The converter must use MinerU 3.1.0 through direct local CLI execution only. It must not silently fallback to another engine.
|
|
|
|
## 2. Non-Negotiable Constraints
|
|
|
|
- Python 3.12 and `uv`.
|
|
- MinerU 3.1.0 is the only conversion engine.
|
|
- Direct local MinerU CLI execution only.
|
|
- MinerU 3.1.0 may launch a temporary local `mineru-api` internally when CLI runs without `--api-url`.
|
|
- No cloud OCR, hosted LLM/VLM, remote document parser, `--api-url`, remote APIs, router mode, HTTP client backends, or remote OpenAI-compatible backends.
|
|
- Target hardware: NVIDIA GTX 1070 Ti 8GB.
|
|
- Digital PDFs with text layers are the v1 priority.
|
|
- `samples/` is local fixture context and must not be committed unless explicitly requested.
|
|
- Every substantial implementation chunk needs a sprint contract and independent evaluation.
|
|
|
|
## 3. Harness Operating Model
|
|
|
|
Use the project long-running harness only for substantial implementation work.
|
|
|
|
1. `harness-planner-agent` turns the next user request into a sprint contract.
|
|
2. `evaluation-agent` reviews the contract before code changes start.
|
|
3. `feature-generator-agent` implements one approved contract at a time.
|
|
4. `feature-generator-agent` runs self-checks and records residual risks.
|
|
5. `evaluation-agent` independently verifies the result against the contract.
|
|
6. The parent agent updates `PROGRESS.md`, commits the completed change, and leaves a handoff.
|
|
|
|
After a chunk is no longer active, archive completed-work details in `docs/WORKARCHIVE.md` and keep `PROGRESS.md` focused on current status, blockers, and next actions.
|
|
|
|
Each sprint contract must include:
|
|
|
|
- Objective.
|
|
- Touched surfaces.
|
|
- Expected outputs.
|
|
- Non-goals.
|
|
- Verification checks.
|
|
- Hard failure criteria.
|
|
- Handoff fields.
|
|
|
|
## 4. Proposed Repository Layout
|
|
|
|
Create this layout incrementally; do not scaffold unused modules before a sprint needs them.
|
|
|
|
```text
|
|
pyproject.toml
|
|
README.md
|
|
src/
|
|
pdf2md/
|
|
__init__.py
|
|
cli.py
|
|
conversion.py
|
|
pdf_splitter.py
|
|
paths.py
|
|
mineru_adapter.py
|
|
ir.py
|
|
markdown.py
|
|
metadata.py
|
|
quality.py
|
|
report.py
|
|
doctor.py
|
|
tests/
|
|
unit/
|
|
integration/
|
|
fixtures/
|
|
scripts/
|
|
install-mineru.ps1
|
|
install-models.py
|
|
```
|
|
|
|
Planned module responsibilities:
|
|
|
|
- `cli.py`: command parsing, CLI summaries, exit codes.
|
|
- `conversion.py`: orchestration for one PDF and batch input.
|
|
- `paths.py`: input discovery, output path planning, overwrite checks.
|
|
- `mineru_adapter.py`: direct local MinerU CLI boundary.
|
|
- `ir.py`: project-owned document/page/block/asset/warning records.
|
|
- `markdown.py`: Obsidian Markdown normalization.
|
|
- `metadata.py`: metadata schema creation and warning aggregation.
|
|
- `quality.py`: local checks for assets, math renderability, and output sanity.
|
|
- `report.py`: `<stem>.report.md` generation from metadata.
|
|
- `doctor.py`: environment, dependency, CUDA/GPU, MinerU, and cache diagnostics.
|
|
|
|
## 5. Sprint Sequence
|
|
|
|
### Sprint 0: Source And Environment Verification
|
|
|
|
Active contract:
|
|
|
|
- `docs/Sprints/SPRINT0CONTRACT.md`
|
|
|
|
Objective:
|
|
|
|
- Verify the facts needed before implementation starts.
|
|
|
|
Touched surfaces:
|
|
|
|
- `docs/KNOWLEDGEBASE.md`
|
|
- `docs/V1IMPLEMENTATIONPLAN.md` if sequencing changes
|
|
- `PROGRESS.md`
|
|
|
|
Expected outputs:
|
|
|
|
- Confirmed MinerU 3.1.0 install command, CLI invocation shape, version command, output paths, and local execution behavior.
|
|
- Confirmed Python 3.12, `uv`, CUDA/PyTorch, and GTX 1070 Ti 8GB risks.
|
|
- Confirmed license notes needed before redistribution.
|
|
|
|
Verification checks:
|
|
|
|
- All volatile facts cite official MinerU, Python, uv, PyTorch/CUDA, or license sources.
|
|
- No candidate engine comparison is reintroduced.
|
|
- No implementation code is created.
|
|
|
|
Hard failure criteria:
|
|
|
|
- MinerU 3.1.0 cannot be reasonably invoked through a direct local CLI on the target environment.
|
|
- Python 3.12 compatibility is not viable without changing project requirements.
|
|
|
|
Primary agents:
|
|
|
|
- `research-agent`
|
|
- `local-setup-agent`
|
|
- `license-privacy-agent`
|
|
|
|
### Sprint 1: Project Scaffold And Fast Test Loop
|
|
|
|
Active contract:
|
|
|
|
- `docs/Sprints/SPRINT1CONTRACT.md`
|
|
|
|
Objective:
|
|
|
|
- Create the minimal Python project structure and a fast local test loop.
|
|
|
|
Touched surfaces:
|
|
|
|
- `pyproject.toml`
|
|
- `src/pdf2md/__init__.py`
|
|
- `tests/`
|
|
- Development documentation if needed
|
|
|
|
Expected outputs:
|
|
|
|
- `uv sync` works.
|
|
- `uv run pytest` works.
|
|
- Project package imports as `pdf2md`.
|
|
- CLI entry point name `pdf2md` is reserved but may initially expose only `doctor` or a clear placeholder until later sprints.
|
|
- If `uv` is still unavailable locally, Sprint 1 records that blocker and is not marked complete.
|
|
|
|
Verification checks:
|
|
|
|
- Import test passes.
|
|
- Empty test suite or initial scaffold tests pass.
|
|
- No runtime network dependency is introduced.
|
|
|
|
Hard failure criteria:
|
|
|
|
- Project cannot be installed with `uv`.
|
|
- Scaffolding adds speculative config systems, extra engines, or unused abstractions.
|
|
|
|
Primary agents:
|
|
|
|
- `harness-planner-agent`
|
|
- `feature-generator-agent`
|
|
- `evaluation-agent`
|
|
|
|
### Sprint 2: Paths, Input Discovery, And Overwrite Planning
|
|
|
|
Active contract:
|
|
|
|
- `docs/Sprints/SPRINT2CONTRACT.md`
|
|
|
|
Objective:
|
|
|
|
- Implement deterministic input and output planning before conversion logic exists.
|
|
|
|
Touched surfaces:
|
|
|
|
- `paths.py`
|
|
- `conversion.py` skeleton if needed
|
|
- CLI path handling tests
|
|
|
|
Expected outputs:
|
|
|
|
- Single PDF discovery.
|
|
- Directory PDF discovery.
|
|
- Recursive traversal only when requested.
|
|
- Deterministic output paths for Markdown, assets, metadata JSON, report, and optional raw MinerU output.
|
|
- Existing-output protection unless `--overwrite` is passed.
|
|
|
|
Verification checks:
|
|
|
|
- Unit tests for single PDF path planning.
|
|
- Unit tests for directory and recursive discovery.
|
|
- Unit tests for overwrite behavior.
|
|
- Tests include Korean or non-ASCII filename handling using generated temporary files, not committed sample PDFs.
|
|
|
|
Hard failure criteria:
|
|
|
|
- Output planning can overwrite user files without explicit overwrite intent.
|
|
- Directory conversion descends recursively without `--recursive`.
|
|
|
|
Primary agents:
|
|
|
|
- `feature-generator-agent`
|
|
- `evaluation-agent`
|
|
|
|
### Sprint 3: Domain Records, Metadata, And Warning Model
|
|
|
|
Active contract:
|
|
|
|
- `docs/Sprints/SPRINT3CONTRACT.md`
|
|
|
|
Objective:
|
|
|
|
- Define project-owned records before binding to MinerU output.
|
|
|
|
Touched surfaces:
|
|
|
|
- `ir.py`
|
|
- `metadata.py`
|
|
- `report.py` skeleton if needed
|
|
- Unit tests
|
|
|
|
Expected outputs:
|
|
|
|
- Document, page, block, asset, and warning records.
|
|
- Stable warning codes from `ARCHITECTURE.md`.
|
|
- Metadata JSON builder with required top-level and summary fields.
|
|
- Warning aggregation logic.
|
|
|
|
Verification checks:
|
|
|
|
- Unit tests for metadata schema creation.
|
|
- Unit tests for warning aggregation.
|
|
- Unit tests for optional fields such as bbox and confidence being preserved only when present.
|
|
|
|
Hard failure criteria:
|
|
|
|
- Public API requires raw MinerU objects.
|
|
- Metadata omits source PDF, SHA-256, engine, pages, warnings, assets, or summary.
|
|
|
|
Primary agents:
|
|
|
|
- `metadata-agent`
|
|
- `feature-generator-agent`
|
|
- `evaluation-agent`
|
|
|
|
### Sprint 4: MinerU Adapter With Mocked Contract
|
|
|
|
Active contract:
|
|
|
|
- `docs/Sprints/SPRINT4CONTRACT.md`
|
|
|
|
Objective:
|
|
|
|
- Build the direct local MinerU adapter boundary with mocked outputs first.
|
|
|
|
Touched surfaces:
|
|
|
|
- `mineru_adapter.py`
|
|
- `doctor.py` partial checks
|
|
- Adapter tests with fake subprocess results and fake output directories
|
|
|
|
Expected outputs:
|
|
|
|
- Adapter availability check.
|
|
- Version check.
|
|
- Direct CLI command construction.
|
|
- Strict-local command validation.
|
|
- Subprocess execution wrapper capturing stdout, stderr, exit code, and paths.
|
|
- Parsed adapter result object with raw Markdown, raw structured data when available, assets, warnings, engine, engine version, options, exit code, and stderr.
|
|
- Baseline command shape based on MinerU 3.1.0 direct local CLI: `mineru -p <input> -o <output>`.
|
|
- Strict-local validation allows CLI-internal temporary local `mineru-api` orchestration, while rejecting `--api-url`, remote APIs, router mode, HTTP client backends, and remote OpenAI-compatible backends.
|
|
|
|
Verification checks:
|
|
|
|
- Mocked successful MinerU output test.
|
|
- Mocked missing MinerU test.
|
|
- Mocked non-zero exit test.
|
|
- Test that prohibited remote/API flags cannot be introduced.
|
|
- No real MinerU/model dependency in default tests.
|
|
|
|
Hard failure criteria:
|
|
|
|
- Adapter passes `--api-url`, uses router mode, uses an HTTP client backend, or connects to a remote API or remote OpenAI-compatible backend.
|
|
- Adapter falls back to another engine after MinerU failure.
|
|
- Tests require model downloads by default.
|
|
|
|
Primary agents:
|
|
|
|
- `mineru-integration-agent`
|
|
- `feature-generator-agent`
|
|
- `evaluation-agent`
|
|
|
|
### Sprint 5: Obsidian Markdown Normalization And Assets
|
|
|
|
Active contract:
|
|
|
|
- `docs/Sprints/SPRINT5CONTRACT.md`
|
|
|
|
Objective:
|
|
|
|
- Normalize MinerU/project IR output into Obsidian-friendly Markdown.
|
|
|
|
Touched surfaces:
|
|
|
|
- `markdown.py`
|
|
- `quality.py` partial asset link checks
|
|
- Unit tests
|
|
|
|
Expected outputs:
|
|
|
|
- Inline math delimiter normalization to `$...$`.
|
|
- Display math delimiter normalization to `$$...$$`.
|
|
- Blank-line normalization around display math.
|
|
- Relative asset link normalization.
|
|
- Simple table preservation and complex table fallback warnings.
|
|
- No visible page markers by default.
|
|
|
|
Verification checks:
|
|
|
|
- Unit tests for inline math.
|
|
- Unit tests for display math spacing.
|
|
- Unit tests for underscores/carets inside math.
|
|
- Unit tests for relative asset links.
|
|
- Unit tests for table fallback warning behavior.
|
|
|
|
Hard failure criteria:
|
|
|
|
- Normalization rewrites LaTeX semantics without deterministic tests.
|
|
- Generated links are absolute when relative links are required.
|
|
- Page provenance is only visible in Markdown and missing from metadata.
|
|
|
|
Primary agents:
|
|
|
|
- `obsidian-markdown-agent`
|
|
- `feature-generator-agent`
|
|
- `evaluation-agent`
|
|
|
|
### Sprint 6: Quality Checks And Report Generation
|
|
|
|
Active contract:
|
|
|
|
- `docs/Sprints/SPRINT6CONTRACT.md`
|
|
|
|
Objective:
|
|
|
|
- Produce local quality signals and human-readable reports from metadata.
|
|
|
|
Touched surfaces:
|
|
|
|
- `quality.py`
|
|
- `report.py`
|
|
- `metadata.py`
|
|
- Unit tests
|
|
|
|
Expected outputs:
|
|
|
|
- Missing asset link count.
|
|
- Math renderability check interface with graceful unavailable-tool handling.
|
|
- Pages-with-warnings summary.
|
|
- `<stem>.report.md` generated from metadata.
|
|
- Final status: `success`, `partial`, or `failed`.
|
|
|
|
Verification checks:
|
|
|
|
- Unit tests for report content.
|
|
- Unit tests for missing asset link count.
|
|
- Unit tests for math render failure aggregation.
|
|
- Report generation does not re-run MinerU.
|
|
|
|
Hard failure criteria:
|
|
|
|
- Report diverges from JSON metadata.
|
|
- Math render failures are silently ignored.
|
|
- Quality checks require network access.
|
|
|
|
Primary agents:
|
|
|
|
- `metadata-agent`
|
|
- `evaluation-agent`
|
|
- `feature-generator-agent`
|
|
|
|
### Sprint 7: Conversion Orchestrator, CLI, And Python API
|
|
|
|
Active contract:
|
|
|
|
- `docs/Sprints/SPRINT7CONTRACT.md`
|
|
|
|
Objective:
|
|
|
|
- Connect path planning, MinerU adapter, normalization, metadata, report, and summaries.
|
|
|
|
Touched surfaces:
|
|
|
|
- `conversion.py`
|
|
- `cli.py`
|
|
- `__init__.py`
|
|
- CLI and API tests
|
|
|
|
Expected outputs:
|
|
|
|
- `convert_pdf(input_path, output_dir, metadata=True)` public API.
|
|
- `pdf2md convert INPUT --out OUTPUT_DIR`.
|
|
- `--metadata`, `--keep-raw`, `--recursive`, `--overwrite`, `--gpu`, and `--strict-local` behavior.
|
|
- Batch conversion for directories.
|
|
- CLI summary with warning counts.
|
|
|
|
Verification checks:
|
|
|
|
- API test with mocked MinerU adapter.
|
|
- CLI single PDF test with mocked MinerU adapter.
|
|
- CLI directory test with mocked MinerU adapter.
|
|
- Existing output test.
|
|
- Failure summary test.
|
|
|
|
Hard failure criteria:
|
|
|
|
- Public API exposes raw MinerU objects as required return fields.
|
|
- CLI writes outputs after a hard failure that should stop conversion.
|
|
- CLI suppresses warning counts.
|
|
|
|
Primary agents:
|
|
|
|
- `feature-generator-agent`
|
|
- `requirements-guard-agent`
|
|
- `evaluation-agent`
|
|
|
|
### Sprint 8: Doctor And Setup Documentation
|
|
|
|
Active contract:
|
|
|
|
- `docs/Sprints/SPRINT8CONTRACT.md`
|
|
|
|
Status:
|
|
|
|
- Implemented.
|
|
|
|
Objective:
|
|
|
|
- Make local setup failures explicit before users run conversions.
|
|
|
|
Touched surfaces:
|
|
|
|
- `doctor.py`
|
|
- `cli.py`
|
|
- `README.md`
|
|
- `scripts/install-mineru.ps1`
|
|
- `scripts/install-models.py`
|
|
- Tests for mocked environment checks
|
|
|
|
Expected outputs:
|
|
|
|
- `pdf2md doctor` reports Python version, `uv`, CUDA/PyTorch GPU visibility, MinerU availability, MinerU version, and detectable model/cache paths.
|
|
- GPU unavailable warning is clear.
|
|
- Missing `uv` is reported clearly.
|
|
- Pre-Turing/Pascal GPU risk is reported clearly for GTX 1070 Ti compute capability 6.1.
|
|
- Missing required dependency causes doctor failure.
|
|
- Setup docs explain Windows PowerShell, Python 3.12, `uv`, MinerU, models, GPU expectations, and local-only behavior.
|
|
|
|
Verification checks:
|
|
|
|
- Mocked doctor tests for success, missing MinerU, missing GPU, and missing dependency.
|
|
- Documentation review for no cloud/API runtime path.
|
|
|
|
Hard failure criteria:
|
|
|
|
- Doctor says the environment is healthy when MinerU is missing.
|
|
- Doctor implies cloud/API fallback is supported.
|
|
|
|
Primary agents:
|
|
|
|
- `local-setup-agent`
|
|
- `license-privacy-agent`
|
|
- `evaluation-agent`
|
|
|
|
### Sprint 9: Local Fixture Evaluation And V1 Release Gate
|
|
|
|
Active contract:
|
|
|
|
- `docs/Sprints/SPRINT9CONTRACT.md`
|
|
|
|
Status:
|
|
|
|
- Implemented.
|
|
|
|
Objective:
|
|
|
|
- Validate the end-to-end v1 behavior against local samples without committing samples.
|
|
|
|
Touched surfaces:
|
|
|
|
- `tests/integration/`
|
|
- Optional local-only fixture manifest that does not include sample PDFs
|
|
- `README.md`
|
|
- `PROGRESS.md`
|
|
|
|
Expected outputs:
|
|
|
|
- Fast mocked integration suite.
|
|
- Optional MinerU-dependent local test command.
|
|
- Local sample coverage notes in `PROGRESS.md`.
|
|
- V1 release checklist status.
|
|
|
|
Verification checks:
|
|
|
|
- `uv run pytest` passes without model downloads.
|
|
- Optional MinerU test is clearly marked and skipped unless explicitly enabled.
|
|
- Representative sample produces Markdown, metadata JSON, report Markdown, and asset paths.
|
|
- Obsidian math delimiter expectations are met.
|
|
- No sample PDFs are staged.
|
|
|
|
Hard failure criteria:
|
|
|
|
- Default tests require GPU, MinerU models, or network access.
|
|
- Sample files are added to git unintentionally.
|
|
- V1 release checklist passes without metadata/report generation.
|
|
|
|
Primary agents and skills:
|
|
|
|
- `evaluation-agent`
|
|
- `requirements-guard-agent`
|
|
- `fixture-evaluation` skill
|
|
|
|
### Sprint 10: Pre-Conversion PDF Page Chunking
|
|
|
|
Active contract:
|
|
|
|
- `docs/Sprints/SPRINT10CONTRACT.md`
|
|
|
|
Status:
|
|
|
|
- Implemented.
|
|
|
|
Objective:
|
|
|
|
- Split long PDFs into temporary fixed-size page chunks before MinerU conversion.
|
|
|
|
Touched surfaces:
|
|
|
|
- `pdf_splitter.py`
|
|
- `conversion.py`
|
|
- `cli.py`
|
|
- `report.py`
|
|
- README and Sprint 10 documentation
|
|
- Unit tests for splitter, conversion, CLI, and report behavior
|
|
|
|
Expected outputs:
|
|
|
|
- `pdf2md convert INPUT --out OUTPUT --chunk-pages` enables 20-page chunks.
|
|
- `pdf2md convert INPUT --out OUTPUT --chunk-pages N` enables custom positive chunk size.
|
|
- `convert_pdf(..., chunk_pages=N)` returns a `BatchConversionResult` in chunk mode.
|
|
- Temporary chunk PDFs are deleted after conversion completes.
|
|
- Chunk Markdown files are separate and named with original page ranges.
|
|
- Metadata and report content expose original source path and chunk page ranges.
|
|
|
|
Verification checks:
|
|
|
|
- pypdf-based local blank PDF tests cover page counts, chunk ranges, and written chunk page counts.
|
|
- Mocked conversion tests verify one adapter call per chunk, failed-chunk continuation, chunk metadata/report context, and temporary chunk cleanup.
|
|
- CLI tests verify `--chunk-pages` without a value uses 20 pages.
|
|
|
|
Hard failure criteria:
|
|
|
|
- Chunking uploads document content or uses another conversion engine.
|
|
- Chunk outputs are merged.
|
|
- Default tests require real MinerU, GPU, model files, network, Obsidian, LaTeX tooling, or `samples/`.
|
|
|
|
## 6. Cross-Cutting Acceptance Criteria
|
|
|
|
Every implementation sprint must preserve these acceptance criteria:
|
|
|
|
- No runtime remote document processing path exists.
|
|
- MinerU is the only conversion engine.
|
|
- Failures are explicit and traceable.
|
|
- Warnings are structured and countable.
|
|
- Markdown and metadata can be traced back to source pages where available.
|
|
- Reports are generated from metadata.
|
|
- Default tests are fast and local.
|
|
- `samples/` remains untracked unless explicitly requested.
|
|
|
|
## 7. First Implementation Request Contract Template
|
|
|
|
Use this template when implementation begins.
|
|
|
|
```markdown
|
|
## Sprint Contract
|
|
|
|
Objective:
|
|
|
|
Touched surfaces:
|
|
|
|
Expected outputs:
|
|
|
|
Non-goals:
|
|
|
|
Verification checks:
|
|
|
|
Hard failure criteria:
|
|
|
|
Handoff fields:
|
|
- Files changed:
|
|
- Commands run:
|
|
- Tests passed:
|
|
- Known failures:
|
|
- Residual risks:
|
|
- Next action:
|
|
```
|
|
|
|
## 8. Open Risks
|
|
|
|
- MinerU 3.1.0 install and CLI behavior are source-verified, but real local output still needs a later local probe before release.
|
|
- GTX 1070 Ti 8GB is visible locally, but it is Pascal compute capability 6.1; `doctor` and setup docs must make CUDA/PyTorch limits clear.
|
|
- `uv` is installed per-user at `C:\Users\user\.local\bin`, but a new shell may need PATH refresh before `uv` is visible.
|
|
- Formula renderability checks need a local tool choice; the implementation should start with an interface and graceful unavailable-tool warning if needed.
|
|
- Some PDFs will have tables or formulas that cannot be faithfully represented in Markdown; metadata and `.report.md` must surface this instead of hiding it.
|
|
- Redistribution license obligations must be reviewed before packaging, redistribution, or bundling model weights.
|
|
|
|
## 9. Recommended Next Step
|
|
|
|
Run optional real local MinerU validation on a long sample only when requested. Default verification should continue to use mocked adapters and generated temporary PDFs so it remains independent of MinerU, GPU, model files, network access, and `samples/`.
|
|
|
|
Facts carried forward from Sprint 0:
|
|
|
|
- MinerU is fixed to version 3.1.0.
|
|
- Direct local CLI command shape is `mineru -p <input> -o <output>`.
|
|
- MinerU output layout should be treated as optional-file based until locally probed.
|
|
- Python 3.12 is compatible with the pinned MinerU package range.
|
|
- GTX 1070 Ti CUDA/PyTorch support needs explicit doctor validation.
|
|
- MinerU/model license posture is acceptable for personal local use. Redistribution remains out of scope until reviewed.
|