# Sprint 16 Contract: Simplified Output Layout

Status: Implemented
Last updated: 2026-05-12

## Objective

Simplify conversion outputs so each input PDF gets one predictable output folder named after the PDF stem, all images live under one `images` folder, Markdown parts use `_001`, `_002` numbering, one human-readable report is written per PDF, and no metadata JSON file is persisted.

This sprint changes the public output contract. It supersedes the older v1 output layout that wrote sibling `<stem>.md`, `<stem>.assets`, `<stem>.metadata.json`, and `<stem>.report.md` files.

## Product Output Contract

For an input PDF:

```text
paper.pdf
```

and output root:

```text
outputs/
```

write:

```text
outputs/
  paper/
    paper_001.md
    paper_002.md
    paper_report.md
    images/
      ...
```

Rules:

- `paper` is the PDF stem, meaning the original filename without `.pdf`.
- A one-part conversion still writes `paper_001.md`.
- A multi-part conversion writes `paper_001.md`, `paper_002.md`, and so on.
- Part numbering uses at least three digits and grows only when the part count exceeds 999.
- All generated image and media assets for the PDF live under `paper/images/`.
- Markdown links must point to `images/<asset-name>`.
- The report is a single file at `paper/paper_report.md`.
- No `<stem>.metadata.json`, part metadata JSON, or sidecar metadata JSON is written.
- Internal metadata records may still be built in memory to produce reports, warnings, counts, and `ConversionResult` fields.
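The rules above can be sketched as a small path planner. This is a minimal illustration, not the project's `plan_outputs()`: the names `SimplifiedLayout` and `plan_simplified_layout` are hypothetical, and the sketch assumes that once the part count exceeds 999 every part number is padded to the wider width.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SimplifiedLayout:
    """Hypothetical container for the per-PDF output paths in this contract."""

    stem: str
    folder: Path
    images_dir: Path
    report_path: Path

    def part_path(self, index: int, part_count: int = 1) -> Path:
        # At least three digits; the width grows only when part_count exceeds 999.
        # Assumption: all parts share one width once it grows.
        width = max(3, len(str(part_count)))
        return self.folder / f"{self.stem}_{index:0{width}d}.md"


def plan_simplified_layout(
    pdf_path: Path, out_root: Path, relative_parent: Path = Path()
) -> SimplifiedLayout:
    """Map one source PDF to <out>/<relative-parent>/<stem>/..."""
    stem = pdf_path.stem  # filename without the final ".pdf"
    folder = out_root / relative_parent / stem
    return SimplifiedLayout(
        stem=stem,
        folder=folder,
        images_dir=folder / "images",
        report_path=folder / f"{stem}_report.md",
    )
```

For example, `plan_simplified_layout(Path("paper.pdf"), Path("outputs")).part_path(1)` yields `outputs/paper/paper_001.md`, which matches the one-part rule above.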

## Contract Assumptions

- The user request "metadata is not needed" means metadata JSON should not be written as a user-facing output file. It does not mean removing internal metadata objects needed for report generation and warning aggregation.
- Keep `--chunk-pages` semantics from Sprint 14: when enabled, MinerU receives one source page per run and final Markdown files are grouped by `chunk_pages`.
- If `--chunk-pages` is absent, the whole PDF is still converted in one MinerU run and written as `<stem>_001.md`.
- Keep a bare `--chunk-pages` (no value) meaning the default grouping size of 20.
- Keep `--metadata` accepted as a backward-compatible no-op for one sprint, but update help text to say metadata JSON output is disabled in the simplified layout.
- `pdf2md recheck` remains supported only for legacy outputs that still have adjacent metadata JSON. New simplified outputs should fail recheck clearly until a later sprint designs metadata-free recheck.
- Recursive directory conversion should preserve the discovered relative parent before the PDF stem folder: `outputs/<relative-parent>/<stem>/<stem>_001.md`.
- If two inputs would map to the same output folder and overwrite is false, fail during preflight. Do not invent automatic suffixes.
- `--keep-raw` should place raw MinerU diagnostics under `paper/raw/` so raw outputs do not clutter the main folder.
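The flag semantics above (absent, bare, or valued `--chunk-pages`, plus a no-op `--metadata`) can be illustrated with the standard library's `argparse`. This is only a sketch: the contract does not say which CLI framework `src/pdf2md/cli.py` uses, and `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser showing only the two flags discussed above."""
    parser = argparse.ArgumentParser(prog="pdf2md-sketch")
    parser.add_argument(
        "--chunk-pages",
        nargs="?",      # the value is optional
        type=int,
        const=20,       # used when the flag is given without a value
        default=None,   # used when the flag is absent: one run, one part
        help="Group output Markdown by N source pages (20 when no value is given).",
    )
    parser.add_argument(
        "--metadata",
        action="store_true",
        help="Accepted for compatibility; metadata JSON output is disabled "
        "in the simplified layout.",
    )
    return parser
```

With this shape, `--chunk-pages` alone parses to 20, `--chunk-pages 5` parses to 5, and omitting the flag leaves `chunk_pages` as `None`.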

## Touched Surfaces

Allowed during implementation:

- Modify `src/pdf2md/paths.py`.
- Modify `src/pdf2md/pdf_splitter.py` only if part naming needs helper support.
- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md/report.py` or add a focused aggregate report helper if one report needs multiple part summaries.
- Modify `src/pdf2md/cli.py`.
- Modify `src/pdf2md_ui/runner.py` and `src/pdf2md_ui/app.py` only if UI text or expected output descriptions mention metadata/report paths.
- Modify `tests/test_paths.py`.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.
- Modify `tests/test_report.py`.
- Modify `tests/test_ui_runner.py` only if UI command/output assumptions change.
- Modify `tests/integration/test_v1_fast_release_gate.py`.
- Modify `tests/integration/test_optional_mineru_fixtures.py`.
- Modify `README.md`.
- Modify `PRD.md`.
- Modify `ARCHITECTURE.md`.
- Modify `docs/V1IMPLEMENTATIONPLAN.md`.
- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.

Not allowed:

- Do not change MinerU 3.1.0 as the fixed engine.
- Do not add another conversion engine.
- Do not add remote/API/backend paths.
- Do not change `--gpu`, `--mineru-profile`, or strict-local behavior except where report text reflects the new layout.
- Do not make default tests depend on real MinerU, GPU, CUDA, model files, network, Obsidian, MathJax, or `samples/`.
- Do not commit generated `outputs/`, sample PDFs, local model files, or `dist/pdf2md-ui.exe`.

## Architecture Plan

### WP16.1: Document-Level Output Layout

Add or reshape path planning so final outputs are planned per source PDF folder instead of as sibling files.

Expected final paths for a single PDF:

```text
<out>/<stem>/<stem>_001.md
<out>/<stem>/images/
<out>/<stem>/<stem>_report.md
```

Expected final paths for recursive input:

```text
<out>/<relative-parent>/<stem>/<stem>_001.md
<out>/<relative-parent>/<stem>/images/
<out>/<relative-parent>/<stem>/<stem>_report.md
```

Implementation guidance:

- Keep `DiscoveredPdf.relative_parent` behavior.
- Add a focused part-planning helper rather than encoding final output names through fake temporary PDF filenames.
- Keep `PlannedOutput` if the existing conversion code can use it cleanly, but allow multiple Markdown parts to share the same `assets_dir` and `report_path`.
- Duplicate-path detection must reject duplicate Markdown files and raw directories, but it must allow shared `images/` and shared report paths for parts belonging to the same source PDF.
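One way to express the duplicate-path rule in the last bullet is sketched below. `PlannedPart` and `find_conflicts` are hypothetical names, not the real `PlannedOutput` API, and raw directories are left out for brevity (they would be checked the same way as Markdown paths).

```python
from collections import Counter
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PlannedPart:
    """Hypothetical stand-in for one planned Markdown part."""

    source_pdf: Path
    markdown_path: Path
    assets_dir: Path
    report_path: Path


def find_conflicts(parts: list[PlannedPart]) -> list[str]:
    """Return preflight conflict messages; an empty list means the plan is safe."""
    conflicts: list[str] = []

    # Markdown files must be unique across the whole plan.
    counts = Counter(part.markdown_path for part in parts)
    conflicts += [f"duplicate Markdown path: {path}" for path, n in counts.items() if n > 1]

    # images/ and the report may be shared, but only by parts of one source PDF.
    owners: dict[Path, Path] = {}
    for part in parts:
        for shared in (part.assets_dir, part.report_path):
            owner = owners.setdefault(shared, part.source_pdf)
            if owner != part.source_pdf:
                conflicts.append(
                    f"{shared} is claimed by both {owner} and {part.source_pdf}"
                )
    return conflicts
```

Two parts of the same PDF that share `images/` and the report produce no conflicts, while two different PDFs with the same stem and relative parent are rejected before any conversion starts.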

### WP16.2: Markdown Part Numbering

Replace public part names:

```text
paper.part-001.pages-001-020.md
paper.part-002.pages-021-040.md
```

with:

```text
paper_001.md
paper_002.md
```

Rules:

- Part index is based on final output group order, not source page number.
- The report must still record source page ranges for each part.
- Failed groups should not create a Markdown file, but the report must mention the failed part and source page range.
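A minimal sketch of the grouping behind these rules follows. `PartGroup` and `group_pages` are hypothetical names. The sketch assigns indexes before conversion, so a failed group would keep its index and page range for the report; the contract does not state whether later parts are renumbered, so treat that as an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartGroup:
    index: int       # 1-based position in final output order
    first_page: int  # 1-based source page, inclusive
    last_page: int   # 1-based source page, inclusive


def group_pages(page_count: int, chunk_pages: Optional[int]) -> list[PartGroup]:
    """Split pages 1..page_count into consecutive groups of chunk_pages."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    # Without chunking the whole PDF is a single part.
    size = page_count if chunk_pages is None else chunk_pages
    if size < 1:
        raise ValueError("chunk_pages must be at least 1")
    return [
        PartGroup(index, start, min(start + size - 1, page_count))
        for index, start in enumerate(range(1, page_count + 1, size), start=1)
    ]
```

A 45-page PDF with `chunk_pages=20` gives three groups covering pages 1-20, 21-40, and 41-45, which would be written as `paper_001.md` through `paper_003.md`.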

### WP16.3: Shared Images Folder

Replace per-output asset directories:

```text
paper.part-001.pages-001-020.assets/
paper.part-002.pages-021-040.assets/
```

with:

```text
paper/images/
```

Implementation guidance:

- Copy all assets for one source PDF into the shared `images/` folder.
- Rewrite Markdown links to `images/<asset-name>`.
- Use deterministic collision-safe filenames. Recommended pattern:
  - page-known assets: `page-001_<original-name>`, with `-002` suffixes when needed.
  - page-unknown assets: `asset-001<suffix>`, preserving the original suffix when available.
- Keep asset-link validation pointed at the shared `images/` directory.
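The recommended naming pattern could look roughly like the sketch below. `AssetNamer` and `images_link` are hypothetical helpers, and placing the `-002` collision suffix before the file extension is an assumption, since the contract does not pin down its position. Names are deterministic only if assets are fed in a stable order, for example by page and then by original name.

```python
from pathlib import PurePosixPath
from typing import Optional


class AssetNamer:
    """Hands out unique filenames for one source PDF's shared images/ folder."""

    def __init__(self) -> None:
        self._used: set[str] = set()
        self._unknown_count = 0

    def name_for(self, original_name: str, page: Optional[int]) -> str:
        original = PurePosixPath(original_name)
        if page is None:
            # Page-unknown asset: asset-001<suffix>, keeping the original suffix.
            self._unknown_count += 1
            candidate = f"asset-{self._unknown_count:03d}{original.suffix}"
        else:
            # Page-known asset: page-001_<original-name>.
            candidate = f"page-{page:03d}_{original.name}"

        # Resolve collisions with -002, -003, ... before the extension (assumed).
        base = PurePosixPath(candidate)
        unique, counter = candidate, 2
        while unique in self._used:
            unique = f"{base.stem}-{counter:03d}{base.suffix}"
            counter += 1
        self._used.add(unique)
        return unique


def images_link(asset_name: str) -> str:
    """Relative Markdown link target inside the PDF's output folder."""
    return f"images/{asset_name}"
```

Keeping one `AssetNamer` per source PDF, rather than per part, is what lets every part share a single `images/` folder without overwriting earlier assets.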

### WP16.4: One Report, No Metadata JSON

Stop writing metadata JSON as a user-facing output file.

Implementation guidance:

- Continue building internal metadata dictionaries or records for each part so report generation and `ConversionResult` summaries stay traceable.
- Add an aggregate report path at `<stem>/<stem>_report.md`.
- The report must include:
  - source PDF path,
  - output folder path,
  - Markdown part list with page ranges,
  - engine and engine options,
  - final status,
  - warning count,
  - asset count,
  - missing/invalid asset link counts,
  - inline/display formula counts,
  - MathJax render error count,
  - text fidelity summary when available,
  - failed source pages or failed parts when any exist,
  - warnings grouped by page or part.
- `ConversionResult.metadata_path` should be `None` for simplified outputs.
- `ConversionResult.report_path` should point to the shared report path.
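To illustrate how one report might be rendered from in-memory part records, here is a deliberately reduced sketch. It covers only a few of the required fields; `PartRecord`, `render_report`, the status labels, and the report wording are all hypothetical and do not describe the actual `src/pdf2md/report.py` output.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PartRecord:
    """Hypothetical in-memory record kept per part instead of metadata JSON."""

    index: int
    first_page: int
    last_page: int
    markdown_name: Optional[str]  # None when the part failed
    warnings: list[str] = field(default_factory=list)
    asset_count: int = 0


def render_report(stem: str, parts: list[PartRecord]) -> str:
    failed = [p for p in parts if p.markdown_name is None]
    if parts and len(failed) == len(parts):
        status = "failed"
    elif failed:
        status = "partial"
    else:
        status = "ok"

    lines = [
        f"# Conversion Report: {stem}",
        "",
        f"- Final status: {status}",
        f"- Warning count: {sum(len(p.warnings) for p in parts)}",
        f"- Asset count: {sum(p.asset_count for p in parts)}",
        "",
        "## Parts",
        "",
    ]
    for p in parts:
        target = p.markdown_name or "FAILED (no Markdown written)"
        lines.append(f"- Part {p.index:03d}, pages {p.first_page}-{p.last_page}: {target}")
    return "\n".join(lines) + "\n"
```

Summing over all records is what keeps the totals correct for chunked PDFs, and a failed part still appears with its source page range even though no Markdown file exists for it.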

### WP16.5: CLI, UI, And Documentation

Update user-facing docs and tests to remove metadata JSON as an expected output.

Implementation guidance:

- `pdf2md convert` summary may keep printing Markdown paths and warning counts.
- Update CLI help for `--metadata` to say metadata JSON output is disabled or deprecated in the simplified layout.
- Update README examples to show the new folder layout.
- Update PRD and ARCHITECTURE so they no longer claim metadata JSON is required as a public artifact.
- Keep internal provenance wording clear: warnings and report are still derived from internal metadata-like records.
- Update optional fixture documentation so generated metadata JSON is not required for sample validation.

## Implementation Task Plan

### Task 1: Path Planning For Simplified Layout

Files:

- Modify `src/pdf2md/paths.py`.
- Modify `tests/test_paths.py`.

Steps:

- [ ] Add failing tests showing `plan_outputs()` maps `paper.pdf` to `out/paper/paper_001.md`, `out/paper/images`, no metadata path, and `out/paper/paper_report.md`.
- [ ] Add a failing test for Korean filenames, using the PDF stem exactly as the output folder and file prefix.
- [ ] Add a failing test for recursive input preserving `relative_parent`.
- [ ] Add a failing test that duplicate source stems in the same relative parent conflict before conversion.
- [ ] Implement the minimal path planning changes.
- [ ] Run `uv run pytest tests/test_paths.py`.
- [ ] Commit path planning changes.

### Task 2: Single-Output Conversion Writes Simplified Files

Files:

- Modify `src/pdf2md/conversion.py`.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.

Steps:

- [ ] Add failing conversion tests showing a non-chunked fake-adapter conversion writes `out/paper/paper_001.md`, `out/paper/images`, and `out/paper/paper_report.md`.
- [ ] Add failing assertions that no `.metadata.json` file is written and `result.metadata_path is None`.
- [ ] Add failing CLI test showing `pdf2md convert paper.pdf --out out` creates the simplified folder.
- [ ] Implement the minimal conversion changes for non-chunked output.
- [ ] Run `uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_paths.py`.
- [ ] Commit single-output conversion changes.

### Task 3: Grouped Output Parts And Shared Images

Files:

- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md/pdf_splitter.py` only if a small helper is needed.
- Modify `tests/test_conversion.py`.
- Modify `tests/test_cli.py`.

Steps:

- [ ] Add failing tests for `chunk_pages=20` showing final Markdown names are `paper_001.md`, `paper_002.md`, not `paper.part-...md`.
- [ ] Add failing tests proving all grouped assets are copied into `paper/images/` and Markdown links use `images/...`.
- [ ] Add failing tests proving asset collisions across pages get deterministic unique filenames.
- [ ] Add failing tests proving failed page conversions are represented in the shared report while later pages still convert.
- [ ] Implement grouped output naming and shared image handling.
- [ ] Run `uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_pdf_splitter.py`.
- [ ] Commit grouped output changes.

### Task 4: Aggregate Report Without Metadata JSON

Files:

- Modify `src/pdf2md/report.py` or add a focused aggregate report helper.
- Modify `src/pdf2md/conversion.py`.
- Modify `tests/test_report.py`.
- Modify `tests/test_conversion.py`.

Steps:

- [ ] Add failing report tests for a one-file report listing multiple Markdown parts and source page ranges.
- [ ] Add failing conversion tests proving only one report exists for a chunked PDF.
- [ ] Add failing tests proving report summary totals combine all output parts.
- [ ] Add failing tests proving all-failed conversions write a report but no Markdown part.
- [ ] Implement aggregate report rendering from internal metadata records.
- [ ] Run `uv run pytest tests/test_report.py tests/test_conversion.py`.
- [ ] Commit report changes.

### Task 5: Recheck, CLI Compatibility, UI Text, And Docs

Files:

- Modify `src/pdf2md/cli.py`.
- Modify `src/pdf2md/conversion.py`.
- Modify `src/pdf2md_ui/runner.py` and `src/pdf2md_ui/app.py` only if text/output assumptions change.
- Modify `README.md`.
- Modify `PRD.md`.
- Modify `ARCHITECTURE.md`.
- Modify `docs/V1IMPLEMENTATIONPLAN.md`.
- Modify `tests/test_cli.py`.
- Modify `tests/test_ui_runner.py` only if UI behavior changes.
- Modify `tests/integration/test_v1_fast_release_gate.py`.
- Modify `tests/integration/test_optional_mineru_fixtures.py`.

Steps:

- [ ] Add failing CLI tests proving `--metadata` remains accepted but no metadata JSON is written.
- [ ] Add failing recheck test proving simplified outputs without metadata fail with a clear legacy-metadata message.
- [ ] Update integration tests to require Markdown part files, one report, and image links, not metadata JSON.
- [ ] Update README, PRD, ARCHITECTURE, and release-gate wording for the simplified layout.
- [ ] Implement CLI/recheck/doc changes.
- [ ] Run `uv run pytest tests/test_cli.py tests/test_ui_runner.py tests/integration/test_v1_fast_release_gate.py`.
- [ ] Commit CLI, UI, integration, and documentation changes.
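The recheck step in this task asks for a clear failure when legacy metadata is missing. A minimal sketch of such a guard is shown here; `LegacyMetadataRequired`, `require_legacy_metadata`, and the message text are hypothetical, and the sketch assumes the legacy sibling naming `<stem>.md` next to `<stem>.metadata.json` described in the Objective.

```python
from pathlib import Path


class LegacyMetadataRequired(Exception):
    """Raised when recheck is asked to run on an output without metadata JSON."""


def require_legacy_metadata(markdown_path: Path) -> Path:
    """Return the adjacent legacy metadata path, or fail with a clear message."""
    metadata_path = markdown_path.with_name(f"{markdown_path.stem}.metadata.json")
    if not metadata_path.is_file():
        raise LegacyMetadataRequired(
            f"recheck requires legacy metadata JSON next to {markdown_path.name}, "
            f"but {metadata_path.name} was not found; simplified-layout outputs "
            "do not include metadata JSON"
        )
    return metadata_path
```

A test for the simplified layout would then only need to assert that calling the guard on `paper_001.md` raises and that the message names the missing file.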

### Task 6: Final Verification And Handoff

Files:

- Modify `PLAN.md`.
- Modify `PROGRESS.md`.
- Modify `docs/WORKARCHIVE.md` after implementation.
- Modify `docs/Sprints/SPRINT16CONTRACT.md` status and handoff fields.

Steps:

- [ ] Run focused Sprint 16 verification:

```powershell
uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
```

- [ ] Run full default verification:

```powershell
uv run pytest
```

- [ ] Run diff check:

```powershell
git diff --check
```

- [ ] Update `PROGRESS.md` with files changed, checks run, residual risks, and next actions.
- [ ] Archive completed implementation evidence in `docs/WORKARCHIVE.md`.
- [ ] Commit final coordination updates.

## Verification Commands

```powershell
uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
uv run pytest
git diff --check
git status --short --untracked-files=all
```

Optional local fixture validation after implementation:

```powershell
$env:MINERU_MODEL_SOURCE='local'
uv run pdf2md convert samples\SolidElement.pdf --out outputs\SolidElement_sprint16_layout --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local
```

Expected optional validation:

- Output folder is the provided output root plus `SolidElement\`; for the command above that is `outputs\SolidElement_sprint16_layout\SolidElement\`.
- Markdown part is `SolidElement_001.md` for the 6-page sample.
- Report is `SolidElement_report.md`.
- Images are under `images\`.
- No metadata JSON exists.

## Acceptance Criteria

- Each input PDF writes into an output folder named after the PDF stem.
- Markdown outputs are named `<stem>_001.md`, `<stem>_002.md`, and so on.
- All image/media assets for one PDF live under `<stem>/images/`.
- Markdown links point to `images/...`.
- Exactly one report file is written per input PDF at `<stem>/<stem>_report.md`.
- No metadata JSON file is written for new conversions.
- Internal warning, provenance, formula count, asset count, and text fidelity information remains available in the report.
- Chunk mode still converts one source page per MinerU run and groups Markdown by `chunk_pages`.
- Strict-local and MinerU-only constraints remain unchanged.
- Default tests stay fast and local.

## Hard Failure Criteria

- Any new conversion writes `.metadata.json` as a public output.
- Output files keep old `part-001.pages-...` names.
- Assets are split into per-part `.assets` folders.
- More than one report is written for one input PDF.
- Markdown links point outside the PDF output folder.
- Chunk mode stops using one source page per MinerU run.
- Strict-local enforcement is weakened.
- Default tests require real MinerU, GPU, model files, network, Obsidian, MathJax, or `samples/`.
- Sample PDFs, generated outputs, local model files, or `dist/pdf2md-ui.exe` are committed.
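Several of the layout-related criteria above can be checked mechanically on a finished output folder. The following is a rough sketch of such a checker, not part of the project: `check_simplified_folder` is a hypothetical helper, it inspects only the top level of the folder (so a `raw/` directory is ignored), and its regular expression handles only simple `![alt](target)` image links.

```python
import re
from pathlib import Path

IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def check_simplified_folder(folder: Path) -> list[str]:
    """Return a list of layout problems; an empty list means none were found."""
    stem = folder.name
    problems: list[str] = []
    entries = list(folder.iterdir())

    for entry in entries:
        if entry.name.endswith(".metadata.json"):
            problems.append(f"metadata JSON present: {entry.name}")
        if ".part-" in entry.name:
            problems.append(f"legacy part name present: {entry.name}")
        if entry.is_dir() and entry.name.endswith(".assets"):
            problems.append(f"per-part assets folder present: {entry.name}")

    reports = sorted(e.name for e in entries if e.name.endswith("_report.md"))
    if reports != [f"{stem}_report.md"]:
        problems.append(f"expected exactly one report {stem}_report.md, found {reports}")

    # Markdown parts are <stem>_NNN.md with at least three digits.
    part_name = re.compile(re.escape(stem) + r"_\d{3,}\.md")
    for part in sorted(e for e in entries if part_name.fullmatch(e.name)):
        for target in IMAGE_LINK.findall(part.read_text(encoding="utf-8")):
            if not target.startswith("images/") or not (folder / target).is_file():
                problems.append(f"{part.name}: bad image link {target}")
    return problems
```

A check like this could back the optional fixture validation, for instance by asserting that `check_simplified_folder(Path("outputs/paper"))` returns an empty list after a conversion.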

## Open Questions

- Should metadata-free `pdf2md recheck` be restored in a later sprint by deriving enough state from the report and Markdown, or is rerunning conversion acceptable for simplified outputs?
- Should raw MinerU outputs under `--keep-raw` be flattened into `raw/` or kept per part under `raw/<stem>_001/`? This contract recommends per-part raw folders to avoid collisions.

## Handoff Requirements

After implementation:

- Update this contract status to `Implemented`.
- Record final file layout examples in `README.md`.
- Record verification commands and outcomes in `PROGRESS.md`.
- Archive implementation and optional sample validation results in `docs/WORKARCHIVE.md`.
- Keep generated outputs and sample PDFs uncommitted.

## Implementation Handoff

- Files changed: `src/pdf2md/paths.py`, `src/pdf2md/conversion.py`, `src/pdf2md/report.py`, `src/pdf2md/cli.py`, `src/pdf2md_ui/runner.py`, focused tests, and current docs.
- Output layout implemented: `<out>/<stem>/<stem>_001.md`, additional numbered parts when grouped, `<out>/<stem>/images/`, and `<out>/<stem>/<stem>_report.md`.
- Metadata JSON behavior: new conversions do not write public `.metadata.json`; `ConversionResult.metadata_path` is `None`; internal metadata-like records still feed reports and tests.
- Recheck behavior: `pdf2md recheck` remains legacy-only and requires adjacent metadata JSON.
- Verification recorded in `PROGRESS.md`: focused Sprint 16 tests passed, full `uv run pytest` passed 227 tests with 1 optional skip, and `git diff --check` passed with line-ending warnings only.