baram2584/PDFToMD

김경종 dc11880140 modify pdftomd

2026-05-14 10:16:59 +09:00

Sprint 16 Contract: Simplified Output Layout

Status: Implemented
Last updated: 2026-05-12

Objective

Simplify conversion outputs so that each input PDF gets one predictable output folder named after the PDF stem. Within that folder, all images live under a single images folder, Markdown parts are numbered _001, _002, and so on, one human-readable report is written per PDF, and no metadata JSON file is persisted.

This sprint changes the public output contract. It supersedes the older v1 output layout that wrote sibling <stem>.md, <stem>.assets, <stem>.metadata.json, and <stem>.report.md files.

Product Output Contract

For an input PDF:

paper.pdf

and output root:

outputs/

write:

outputs/
  paper/
    paper_001.md
    paper_002.md
    paper_report.md
    images/
      ...

Rules:

paper is the PDF stem, meaning the original filename without .pdf.
A one-part conversion still writes paper_001.md.
A multi-part conversion writes paper_001.md, paper_002.md, and so on.
Part numbering uses at least three digits and grows only when the part count exceeds 999.
All generated image and media assets for the PDF live under paper/images/.
Markdown links must point to images/<asset-name>.
The report is a single file at paper/paper_report.md.
No <stem>.metadata.json, part metadata JSON, or sidecar metadata JSON is written.
Internal metadata records may still be built in memory to produce reports, warnings, counts, and ConversionResult fields.
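
The naming rules above can be sketched in a few lines of Python. This is an illustration only, not the code in src/pdf2md/paths.py: part_filename and document_layout are hypothetical names, and the sketch assumes that once the part count exceeds 999 every part of that PDF is padded to the wider width.

```python
from pathlib import Path


def part_filename(stem: str, index: int, part_count: int) -> str:
    """Return '<stem>_NNN.md' padded to at least three digits."""
    if not 1 <= index <= part_count:
        raise ValueError("index must be between 1 and part_count")
    # Assumption: the width is derived from the total part count,
    # so it only grows past three digits when there are 1000+ parts.
    width = max(3, len(str(part_count)))
    return f"{stem}_{index:0{width}d}.md"


def document_layout(out_root: Path, pdf_path: Path, part_count: int) -> dict:
    """Compute the folder, part, images, and report paths for one PDF."""
    stem = pdf_path.stem  # original filename without the final .pdf
    folder = out_root / stem
    return {
        "folder": folder,
        "parts": [
            folder / part_filename(stem, i, part_count)
            for i in range(1, part_count + 1)
        ],
        "images": folder / "images",
        "report": folder / f"{stem}_report.md",
    }
```

For example, document_layout(Path("outputs"), Path("paper.pdf"), 2) yields outputs/paper/paper_001.md and outputs/paper/paper_002.md, plus outputs/paper/images and outputs/paper/paper_report.md.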

Contract Assumptions

The user request "metadata is not needed" means metadata JSON should not be written as a user-facing output file. It does not mean removing internal metadata objects needed for report generation and warning aggregation.
Keep --chunk-pages semantics from Sprint 14: when enabled, MinerU receives one source page per run and final Markdown files are grouped by chunk_pages.
If --chunk-pages is absent, the whole PDF is still converted in one MinerU run and written as <stem>_001.md.
Keep --chunk-pages without a value as the default grouping size of 20.
Keep --metadata accepted as a backward-compatible no-op for one sprint, but update help text to say metadata JSON output is disabled in the simplified layout.
pdf2md recheck remains supported only for legacy outputs that still have adjacent metadata JSON. New simplified outputs should fail recheck clearly until a later sprint designs metadata-free recheck.
Recursive directory conversion should preserve the discovered relative parent before the PDF stem folder: outputs/<relative-parent>/<stem>/<stem>_001.md.
If two inputs would map to the same output folder and overwrite is false, fail during preflight. Do not invent automatic suffixes.
--keep-raw should place raw MinerU diagnostics under paper/raw/ so raw outputs do not clutter the main folder.
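
The --chunk-pages and --metadata assumptions above map onto a standard optional-value flag. The sketch below uses argparse purely for illustration; whether src/pdf2md/cli.py is built on argparse is not stated in this contract, and build_parser and the pdf2md-sketch program name are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser covering only the flags discussed above."""
    parser = argparse.ArgumentParser(prog="pdf2md-sketch")
    parser.add_argument("pdf")
    parser.add_argument("--out", default="outputs")
    # Absent        -> None (whole PDF converted in one run).
    # Bare flag     -> 20   (default grouping size).
    # "--chunk-pages 5" -> 5.
    parser.add_argument("--chunk-pages", nargs="?", type=int, const=20, default=None)
    # Still accepted so existing scripts keep working, but ignored.
    parser.add_argument(
        "--metadata",
        action="store_true",
        help="Deprecated no-op: metadata JSON output is disabled "
        "in the simplified layout.",
    )
    return parser
```

One caveat of this approach: an optional-value flag consumes the next non-option token, so a bare --chunk-pages placed directly before the positional PDF path would try to parse the path as an integer. Placing the flag after the path, or before another option, avoids this.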

Touched Surfaces

Allowed during implementation:

Modify src/pdf2md/paths.py.
Modify src/pdf2md/pdf_splitter.py only if part naming needs helper support.
Modify src/pdf2md/conversion.py.
Modify src/pdf2md/report.py or add a focused aggregate report helper if one report needs multiple part summaries.
Modify src/pdf2md/cli.py.
Modify src/pdf2md_ui/runner.py and src/pdf2md_ui/app.py only if UI text or expected output descriptions mention metadata/report paths.
Modify tests/test_paths.py.
Modify tests/test_conversion.py.
Modify tests/test_cli.py.
Modify tests/test_report.py.
Modify tests/test_ui_runner.py only if UI command/output assumptions change.
Modify tests/integration/test_v1_fast_release_gate.py.
Modify tests/integration/test_optional_mineru_fixtures.py.
Modify README.md.
Modify PRD.md.
Modify ARCHITECTURE.md.
Modify docs/V1IMPLEMENTATIONPLAN.md.
Modify PLAN.md.
Modify PROGRESS.md.
Modify docs/WORKARCHIVE.md after implementation.

Not allowed:

Do not change MinerU 3.1.0 as the fixed engine.
Do not add another conversion engine.
Do not add remote/API/backend paths.
Do not change --gpu, --mineru-profile, or strict-local behavior except where report text reflects the new layout.
Do not make default tests depend on real MinerU, GPU, CUDA, model files, network, Obsidian, MathJax, or samples/.
Do not commit generated outputs/, sample PDFs, local model files, or dist/pdf2md-ui.exe.

Architecture Plan

WP16.1: Document-Level Output Layout

Add or reshape path planning so final outputs are planned per source PDF folder instead of as sibling files.

Expected final paths for a single PDF:

<out>/<stem>/<stem>_001.md
<out>/<stem>/images/
<out>/<stem>/<stem>_report.md

Expected final paths for recursive input:

<out>/<relative-parent>/<stem>/<stem>_001.md
<out>/<relative-parent>/<stem>/images/
<out>/<relative-parent>/<stem>/<stem>_report.md

Implementation guidance:

Keep DiscoveredPdf.relative_parent behavior.
Add a focused part-planning helper rather than encoding final output names through fake temporary PDF filenames.
Keep PlannedOutput if the existing conversion code can use it cleanly, but allow multiple Markdown parts to share the same assets_dir and report_path.
Duplicate-path detection must reject duplicate Markdown files and raw directories, but it must allow shared images/ and shared report paths for parts belonging to the same source PDF.
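
The duplicate-path rule can be sketched as follows. PlannedPart and find_conflicts are hypothetical stand-ins for whatever PlannedOutput ends up looking like; the sketch assumes exact path comparison and does not account for case-insensitive filesystems.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class PlannedPart:
    """Hypothetical stand-in for one planned Markdown part."""
    source_pdf: Path
    markdown_path: Path
    assets_dir: Path
    report_path: Path
    raw_dir: Optional[Path] = None


def find_conflicts(parts: list[PlannedPart]) -> list[str]:
    """Return human-readable conflicts; an empty list means the plan is safe."""
    conflicts: list[str] = []
    exclusive: dict[Path, Path] = {}  # Markdown/raw path -> owning source PDF
    shared: dict[Path, Path] = {}     # images/report path -> owning source PDF
    for part in parts:
        # Markdown files and raw directories must be unique across all parts.
        for path in (part.markdown_path, part.raw_dir):
            if path is None:
                continue
            if path in exclusive:
                conflicts.append(f"duplicate output path: {path}")
            else:
                exclusive[path] = part.source_pdf
        # images/ and the report may repeat, but only within one source PDF.
        for path in (part.assets_dir, part.report_path):
            owner = shared.setdefault(path, part.source_pdf)
            if owner != part.source_pdf:
                conflicts.append(f"{path} is shared by {owner} and {part.source_pdf}")
    return conflicts
```

A preflight step could call this once over the full plan and fail before any conversion starts when the list is non-empty and overwrite is false.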

WP16.2: Markdown Part Numbering

Replace public part names:

paper.part-001.pages-001-020.md
paper.part-002.pages-021-040.md

with:

paper_001.md
paper_002.md

Rules:

Part index is based on final output group order, not source page number.
The report must still record source page ranges for each part.
Failed groups should not create a Markdown file, but the report must mention the failed part and source page range.
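
A minimal sketch of the grouping rule, assuming part indices are assigned before conversion so that a failed group keeps its number and page range for the report. PartGroup and group_pages are hypothetical names, not the helpers in src/pdf2md/pdf_splitter.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartGroup:
    index: int       # 1-based position in final output order
    first_page: int  # 1-based source page, inclusive
    last_page: int   # 1-based source page, inclusive


def group_pages(page_count: int, chunk_pages: int) -> list[PartGroup]:
    """Split pages 1..page_count into consecutive groups of chunk_pages."""
    if page_count < 1 or chunk_pages < 1:
        raise ValueError("page_count and chunk_pages must be positive")
    groups = []
    starts = range(1, page_count + 1, chunk_pages)
    for index, first in enumerate(starts, start=1):
        last = min(first + chunk_pages - 1, page_count)
        groups.append(PartGroup(index, first, last))
    return groups
```

With chunk_pages=20, a 45-page PDF produces groups for pages 1-20, 21-40, and 41-45, which would map to paper_001.md, paper_002.md, and paper_003.md. The page range travels with the group so the report can still cite it even though the filename no longer does.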

WP16.3: Shared Images Folder

Replace per-output asset directories:

paper.part-001.pages-001-020.assets/
paper.part-002.pages-021-040.assets/

with:

paper/images/

Implementation guidance:

Copy all assets for one source PDF into the shared images/ folder.
Rewrite Markdown links to images/<asset-name>.
Use deterministic collision-safe filenames. Recommended pattern:
- page-known assets: page-001_<original-name>, with -002 suffixes when needed.
- page-unknown assets: asset-001<suffix>, preserving the original suffix when available.
Keep asset-link validation pointed at the shared images/ directory.
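
One way to realize the recommended naming pattern is a small allocator that remembers which names it has handed out. AssetNamer is a hypothetical helper; placing the -002 suffix before the file extension and comparing names case-insensitively are assumptions of this sketch, not requirements stated above.

```python
from pathlib import PurePath
from typing import Optional


class AssetNamer:
    """Hand out unique filenames for one PDF's shared images/ folder."""

    def __init__(self) -> None:
        self._used: set[str] = set()
        self._unknown_count = 0

    def allocate(self, original_name: str, page: Optional[int] = None) -> str:
        original = PurePath(original_name)
        if page is None:
            # Page-unknown asset: asset-001<suffix>, keeping the suffix if any.
            self._unknown_count += 1
            candidate = f"asset-{self._unknown_count:03d}{original.suffix}"
        else:
            # Page-known asset: page-001_<original-name>.
            candidate = f"page-{page:03d}_{original.name}"
        name = candidate
        counter = 2
        base = PurePath(candidate)
        # Assumption: compare lowercased names so the result is also safe
        # on case-insensitive filesystems.
        while name.lower() in self._used:
            name = f"{base.stem}-{counter:03d}{base.suffix}"
            counter += 1
        self._used.add(name.lower())
        return name
```

The output is deterministic only if assets are fed in a stable order, for example sorted by source page and then by original name.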

WP16.4: One Report, No Metadata JSON

Stop writing metadata JSON as a user-facing output file.

Implementation guidance:

Continue building internal metadata dictionaries or records for each part so report generation and ConversionResult summaries stay traceable.
Add an aggregate report path at <stem>/<stem>_report.md.
The report must include:
- source PDF path,
- output folder path,
- Markdown part list with page ranges,
- engine and engine options,
- final status,
- warning count,
- asset count,
- missing/invalid asset link counts,
- inline/display formula counts,
- MathJax render error count,
- text fidelity summary when available,
- failed source pages or failed parts when any exist,
- warnings grouped by page or part.
ConversionResult.metadata_path should be None for simplified outputs.
ConversionResult.report_path should point to the shared report path.
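
The aggregate report could be rendered from in-memory part records roughly as below. PartRecord, render_report, the status strings, and the report headings are all hypothetical, and only a subset of the required fields is shown; the real records in src/pdf2md/report.py may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class PartRecord:
    """Hypothetical in-memory record for one Markdown part."""
    name: str
    first_page: int
    last_page: int
    status: str  # assumed values: "ok" or "failed"
    asset_count: int = 0
    warnings: list[str] = field(default_factory=list)


def render_report(source_pdf: str, output_folder: str, parts: list[PartRecord]) -> str:
    """Render one report covering every part of a single source PDF."""
    failed = [p for p in parts if p.status != "ok"]
    if not failed:
        final_status = "ok"
    elif len(failed) == len(parts):
        final_status = "failed"
    else:
        final_status = "partial"
    lines = [
        f"# Conversion report: {source_pdf}",
        "",
        f"- Source PDF: {source_pdf}",
        f"- Output folder: {output_folder}",
        f"- Final status: {final_status}",
        f"- Warning count: {sum(len(p.warnings) for p in parts)}",
        f"- Asset count: {sum(p.asset_count for p in parts)}",
        "",
        "## Markdown parts",
        "",
    ]
    for p in parts:
        # Failed parts have no file on disk but are still listed with pages.
        label = p.name if p.status == "ok" else f"{p.name} (not written, failed)"
        lines.append(f"- {label}: pages {p.first_page}-{p.last_page}")
    warned = [p for p in parts if p.warnings]
    if warned:
        lines += ["", "## Warnings by part", ""]
        for p in warned:
            lines.append(f"- {p.name}")
            lines.extend(f"  - {w}" for w in p.warnings)
    return "\n".join(lines) + "\n"
```

Because the totals are summed across records at render time, no per-part JSON needs to be persisted to produce them.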

WP16.5: CLI, UI, And Documentation

Update user-facing docs and tests to remove metadata JSON as an expected output.

Implementation guidance:

pdf2md convert summary may keep printing Markdown paths and warning counts.
Update CLI help for --metadata to say metadata JSON output is disabled or deprecated in the simplified layout.
Update README examples to show the new folder layout.
Update PRD and ARCHITECTURE so they no longer claim metadata JSON is required as a public artifact.
Keep internal provenance wording clear: warnings and report are still derived from internal metadata-like records.
Update optional fixture documentation so generated metadata JSON is not required for sample validation.

Implementation Task Plan

Task 1: Path Planning For Simplified Layout

Files:

Modify src/pdf2md/paths.py.
Modify tests/test_paths.py.

Steps:

Add failing tests showing plan_outputs() maps paper.pdf to out/paper/paper_001.md, out/paper/images, no metadata path, and out/paper/paper_report.md.
Add a failing test for Korean filenames, using the PDF stem exactly as the output folder and file prefix.
Add a failing test for recursive input preserving relative_parent.
Add a failing test that duplicate source stems in the same relative parent conflict before conversion.
Implement the minimal path planning changes.
Run uv run pytest tests/test_paths.py.
Commit path planning changes.

Task 2: Single-Output Conversion Writes Simplified Files

Files:

Modify src/pdf2md/conversion.py.
Modify tests/test_conversion.py.
Modify tests/test_cli.py.

Steps:

Add failing conversion tests showing a non-chunked fake-adapter conversion writes out/paper/paper_001.md, out/paper/images, and out/paper/paper_report.md.
Add failing assertions that no .metadata.json file is written and result.metadata_path is None.
Add failing CLI test showing pdf2md convert paper.pdf --out out creates the simplified folder.
Implement the minimal conversion changes for non-chunked output.
Run uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_paths.py.
Commit single-output conversion changes.

Task 3: Grouped Output Parts And Shared Images

Files:

Modify src/pdf2md/conversion.py.
Modify src/pdf2md/pdf_splitter.py only if a small helper is needed.
Modify tests/test_conversion.py.
Modify tests/test_cli.py.

Steps:

Add failing tests for chunk_pages=20 showing final Markdown names are paper_001.md, paper_002.md, not paper.part-...md.
Add failing tests proving all grouped assets are copied into paper/images/ and Markdown links use images/....
Add failing tests proving asset collisions across pages get deterministic unique filenames.
Add failing tests proving failed page conversions are represented in the shared report while later pages still convert.
Implement grouped output naming and shared image handling.
Run uv run pytest tests/test_conversion.py tests/test_cli.py tests/test_pdf_splitter.py.
Commit grouped output changes.
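
The link-rewriting behavior that these tests target can be sketched with a regular expression. rewrite_image_links is a hypothetical helper; it only handles plain inline image links of the form ![alt](target) and leaves links with titles, reference-style links, and HTML img tags untouched.

```python
import re

# Matches ![alt](target) where target has no spaces or closing parenthesis.
IMAGE_LINK = re.compile(r"(!\[[^\]]*\]\()([^)\s]+)(\))")


def rewrite_image_links(markdown: str, renamed: dict[str, str]) -> str:
    """Point image links at images/<asset-name>.

    `renamed` maps each original link target to the final filename
    chosen for that asset inside the shared images/ folder.
    """

    def replace(match: re.Match) -> str:
        new_name = renamed.get(match.group(2))
        if new_name is None:
            # Unknown targets are left as-is so link validation can flag them.
            return match.group(0)
        return f"{match.group(1)}images/{new_name}{match.group(3)}"

    return IMAGE_LINK.sub(replace, markdown)
```

Leaving unmapped targets unchanged, rather than guessing, keeps missing or invalid asset links visible to the validation step that feeds the report counts.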

Task 4: Aggregate Report Without Metadata JSON

Files:

Modify src/pdf2md/report.py or add a focused aggregate report helper.
Modify src/pdf2md/conversion.py.
Modify tests/test_report.py.
Modify tests/test_conversion.py.

Steps:

Add failing report tests for a one-file report listing multiple Markdown parts and source page ranges.
Add failing conversion tests proving only one report exists for a chunked PDF.
Add failing tests proving report summary totals combine all output parts.
Add failing tests proving all-failed conversions write a report but no Markdown part.
Implement aggregate report rendering from internal metadata records.
Run uv run pytest tests/test_report.py tests/test_conversion.py.
Commit report changes.

Task 5: Recheck, CLI Compatibility, UI Text, And Docs

Files:

Modify src/pdf2md/cli.py.
Modify src/pdf2md/conversion.py.
Modify src/pdf2md_ui/runner.py and src/pdf2md_ui/app.py only if text/output assumptions change.
Modify README.md.
Modify PRD.md.
Modify ARCHITECTURE.md.
Modify docs/V1IMPLEMENTATIONPLAN.md.
Modify tests/test_cli.py.
Modify tests/test_ui_runner.py only if UI behavior changes.
Modify tests/integration/test_v1_fast_release_gate.py.
Modify tests/integration/test_optional_mineru_fixtures.py.

Steps:

Add failing CLI tests proving --metadata remains accepted but no metadata JSON is written.
Add failing recheck test proving simplified outputs without metadata fail with a clear legacy-metadata message.
Update integration tests to require Markdown part files, one report, and image links, not metadata JSON.
Update README, PRD, ARCHITECTURE, and release-gate wording for the simplified layout.
Implement CLI/recheck/doc changes.
Run uv run pytest tests/test_cli.py tests/test_ui_runner.py tests/integration/test_v1_fast_release_gate.py.
Commit CLI, UI, integration, and documentation changes.

Task 6: Final Verification And Handoff

Files:

Modify PLAN.md.
Modify PROGRESS.md.
Modify docs/WORKARCHIVE.md after implementation.
Modify docs/Sprints/SPRINT16CONTRACT.md status and handoff fields.

Steps:

Run focused Sprint 16 verification:

uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py

Run full default verification:

uv run pytest

Run diff check:

git diff --check

Update PROGRESS.md with files changed, checks run, residual risks, and next actions.
Archive completed implementation evidence in docs/WORKARCHIVE.md.
Commit final coordination updates.

Verification Commands

uv run pytest tests/test_paths.py tests/test_conversion.py tests/test_cli.py tests/test_report.py tests/integration/test_v1_fast_release_gate.py
uv run pytest
git diff --check
git status --short --untracked-files=all

Optional local fixture validation after implementation:

$env:MINERU_MODEL_SOURCE='local'
uv run pdf2md convert samples\SolidElement.pdf --out outputs\SolidElement_sprint16_layout --overwrite --chunk-pages --gpu auto --mineru-profile auto --strict-local

Expected optional validation:

Output folder is the output root plus SolidElement\. For the command above, that is outputs\SolidElement_sprint16_layout\SolidElement\.
Markdown part is SolidElement_001.md for the 6-page sample.
Report is SolidElement_report.md.
Images are under images\.
No metadata JSON exists.

Acceptance Criteria

Each input PDF writes into an output folder named after the PDF stem.
Markdown outputs are named <stem>_001.md, <stem>_002.md, and so on.
All image/media assets for one PDF live under <stem>/images/.
Markdown links point to images/....
Exactly one report file is written per input PDF at <stem>/<stem>_report.md.
No metadata JSON file is written for new conversions.
Internal warning, provenance, formula count, asset count, and text fidelity information remains available in the report.
Chunk mode still converts one source page per MinerU run and groups Markdown by chunk_pages.
Strict-local and MinerU-only constraints remain unchanged.
Default tests stay fast and local.

Hard Failure Criteria

Any new conversion writes .metadata.json as a public output.
Output files keep old part-001.pages-... names.
Assets are split into per-part .assets folders.
More than one report is written for one input PDF.
Markdown links point outside the PDF output folder.
Chunk mode stops using one source page per MinerU run.
Strict-local enforcement is weakened.
Default tests require real MinerU, GPU, model files, network, Obsidian, MathJax, or samples/.
Sample PDFs, generated outputs, local model files, or dist/pdf2md-ui.exe are committed.
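
Several of the layout-related failure criteria above can be checked mechanically against one output folder. The sketch below is a hypothetical test helper, not part of the pdf2md package; it inspects only the direct children of the folder and assumes the folder name equals the PDF stem.

```python
import re
from pathlib import Path

# Old public part names looked like paper.part-001.pages-001-020.md
OLD_PART_NAME = re.compile(r"\.part-\d+\.pages-\d+-\d+")


def layout_violations(doc_folder: Path) -> list[str]:
    """Return layout problems found directly inside one PDF's output folder."""
    stem = doc_folder.name
    problems: list[str] = []
    reports: list[str] = []
    for entry in sorted(doc_folder.iterdir()):
        name = entry.name
        if name.endswith(".metadata.json"):
            problems.append(f"metadata JSON written: {name}")
        if OLD_PART_NAME.search(name):
            problems.append(f"old part naming: {name}")
        if entry.is_dir() and name.endswith(".assets"):
            problems.append(f"per-part assets folder: {name}")
        if entry.is_file() and name.endswith("_report.md"):
            reports.append(name)
    if reports != [f"{stem}_report.md"]:
        problems.append(
            f"expected exactly one report {stem}_report.md, found {reports}"
        )
    return problems
```

A fake-adapter conversion test could assert that layout_violations(out / "paper") returns an empty list, which keeps the check fast and independent of real MinerU, GPU, or sample files.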

Open Questions

Should metadata-free pdf2md recheck be restored in a later sprint by deriving enough state from the report and Markdown, or is rerunning conversion acceptable for simplified outputs?
Should raw MinerU outputs under --keep-raw be flattened into raw/ or kept per part under raw/<stem>_001/? This contract recommends per-part raw folders to avoid collisions.

Handoff Requirements

After implementation:

Update this contract status to Implemented.
Record final file layout examples in README.md.
Record verification commands and outcomes in PROGRESS.md.
Archive implementation and optional sample validation results in docs/WORKARCHIVE.md.
Keep generated outputs and sample PDFs uncommitted.

Implementation Handoff

Files changed: src/pdf2md/paths.py, src/pdf2md/conversion.py, src/pdf2md/report.py, src/pdf2md/cli.py, src/pdf2md_ui/runner.py, focused tests, and current docs.
Output layout implemented: <out>/<stem>/<stem>_001.md, additional numbered parts when grouped, <out>/<stem>/images/, and <out>/<stem>/<stem>_report.md.
Metadata JSON behavior: new conversions do not write public .metadata.json; ConversionResult.metadata_path is None; internal metadata-like records still feed reports and tests.
Recheck behavior: pdf2md recheck remains legacy-only and requires adjacent metadata JSON.
Verification recorded in PROGRESS.md: focused Sprint 16 tests passed, full uv run pytest passed 227 tests with 1 optional skip, and git diff --check passed with line-ending warnings only.
