baram2584/PDFToMD

김경종 88d6b92283 add pdftomd

2026-05-08 16:42:19 +09:00

Sprint 10 Contract: Pre-Conversion PDF Page Chunking

Status: Implemented
Last updated: 2026-05-08

Objective

Add an opt-in pre-conversion workflow for long PDFs:

Split each source PDF into fixed-size chunk PDFs (20 pages by default).
Convert each chunk PDF independently through the existing MinerU conversion pipeline.
Do not merge the generated Markdown files.

The feature is intended to reduce long-document memory/runtime pressure and make partial progress usable when one chunk fails. It must preserve strict-local execution, keep MinerU 3.1.0 as the only conversion engine, and keep default tests independent of real MinerU, GPU, CUDA, model files, network access, Obsidian, LaTeX tooling, and samples/.

Research Summary

Sources checked on 2026-05-08:

pypdf PyPI: current release observed as 6.10.2, uploaded 2026-04-15; metadata lists BSD-3-Clause, Python >=3.9, Python 3.12 support, and describes pypdf as a pure-Python PDF library capable of splitting, merging, cropping, and transforming PDF pages.
pypdf merging docs: PdfWriter.append() can append a complete or partial source PDF; examples use zero-based page ranges such as (0, 10). The docs recommend append or merge over low-level add_page / insert_page.
pypdf streaming docs: PdfReader and PdfWriter support file-like objects, but the project should write chunk PDFs to local disk because MinerU accepts local file paths.
pypdf PdfWriter docs: writer operations clone/copy PDF objects into the destination. The docs warn that cloning linked objects can copy more than just the visible page object in some cases, so chunk output size must be checked in tests.
MinerU CLI tools docs: the direct mineru CLI accepts -p/--path, -o/--output, -s/--start, and -e/--end; without --api-url, it launches a temporary local mineru-api.
PyMuPDF PyPI: PyMuPDF is fast and local, but PyPI lists dual licensing under GNU AGPL v3 or an Artifex commercial license.
pikepdf page assembly docs: pikepdf can split pages and transfer page-associated data; it is a capable fallback candidate but adds a QPDF-backed dependency and is not needed for a first implementation.

Recommended Package Decision

Use pypdf for Sprint 10.

Rationale:

It is pure Python and fits the current Python 3.12 + uv workflow.
It has permissive BSD-3-Clause metadata on PyPI.
It directly supports page-level PDF assembly with PdfReader / PdfWriter.
It avoids adding PyMuPDF's AGPL/commercial licensing considerations for a simple split-only feature.
It avoids adding pikepdf/QPDF native dependency complexity before there is evidence that pypdf cannot handle the project samples.

Recommended dependency range for implementation:

dependencies = [
    "pypdf>=6.10.2,<7",
]

The implementation adds this dependency to pyproject.toml and uv.lock.

Current Precondition

pdf2md convert already converts one PDF or a directory of PDFs.
Existing conversion output per input includes:
- Markdown
- optional metadata JSON, enabled by default
- <stem>.report.md
- assets directory
- optional raw MinerU output
plan_outputs() already enforces overwrite and output-root safety.
convert_input() already handles directory batches and continues after per-file failures.
The MinerU adapter accepts one PDF path at a time and runs direct local mineru CLI.
samples/ is local and untracked; do not commit sample PDFs or generated outputs.

Touched Surfaces

Allowed during implementation:

pyproject.toml
uv.lock
src/pdf2md/pdf_splitter.py
src/pdf2md/paths.py
src/pdf2md/conversion.py
src/pdf2md/cli.py
src/pdf2md/ir.py only if new warning codes or chunk provenance records are required
src/pdf2md/metadata.py only for chunk provenance fields
src/pdf2md/report.py only to expose chunk provenance in reports
tests/test_pdf_splitter.py
tests/test_conversion.py
tests/test_cli.py
tests/test_paths.py
tests/test_metadata.py
tests/integration/ for mocked chunk workflow coverage
README.md
docs/V1IMPLEMENTATIONPLAN.md
docs/Sprints/SPRINT10CONTRACT.md
PLAN.md
PROGRESS.md

Not allowed:

Runtime engine selection or alternate conversion engines.
Use of cloud OCR, remote LLM/VLM, hosted renderers, hosted document parsers, --api-url, router mode, HTTP client backends, remote APIs, or remote OpenAI-compatible backends.
Mandatory default tests requiring real MinerU, GPU, CUDA, model files, network, Obsidian, LaTeX tooling, or samples/.
Committed files under samples/.
Committed generated conversion outputs.
Automatic model or package downloads triggered by import time, doctor, convert, or tests.
Markdown merge behavior for chunk outputs.
Claims that chunking improves formula correctness; it is only a processing-control feature.

Product Behavior

Activation:

Chunking is opt-in and existing conversion behavior is unchanged when chunk_pages is unset.
CLI: pdf2md convert INPUT --out OUTPUT_DIR --chunk-pages uses the default chunk size of 20 pages.
CLI: pdf2md convert INPUT --out OUTPUT_DIR --chunk-pages 20 uses an explicit positive chunk size.
Python API: convert_pdf(..., chunk_pages=20) and convert_input(..., chunk_pages=20).
convert_pdf() returns ConversionResult without chunking and BatchConversionResult when chunk mode is active.
chunk_pages must be None or a positive integer.
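
The `chunk_pages` rule above could be enforced with a small guard along these lines. This is a minimal sketch, not the project's code: `validate_chunk_pages` is a hypothetical helper name, and rejecting `bool` values is an extra assumption that the contract does not state.

```python
from typing import Optional


def validate_chunk_pages(chunk_pages: Optional[int]) -> Optional[int]:
    """Return chunk_pages unchanged if it is None or a positive integer.

    Hypothetical helper: the name and the bool rejection are assumptions.
    """
    if chunk_pages is None:
        return None  # chunking disabled; existing behavior applies
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(chunk_pages, bool) or not isinstance(chunk_pages, int):
        raise ValueError(
            f"chunk_pages must be None or a positive integer, got {chunk_pages!r}"
        )
    if chunk_pages <= 0:
        raise ValueError(f"chunk_pages must be positive, got {chunk_pages}")
    return chunk_pages
```

Calling the same guard from both `convert_pdf()` and `convert_input()` would keep the CLI and the Python API consistent.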

Chunking behavior:

If chunk_pages is unset, current behavior remains unchanged.
If chunk_pages=20 and a PDF has 20 or fewer pages, conversion may either:
- convert the original PDF directly, or
- create one chunk PDF and convert that chunk.
Recommended: convert the original directly when total_pages <= chunk_pages to avoid unnecessary intermediate files.
If a PDF has more than 20 pages, split it into chunk PDFs with ranges:
- chunk 1: source pages 1-20
- chunk 2: source pages 21-40
- chunk N: remaining pages
Convert chunk PDFs sequentially, not in parallel. The GTX 1070 Ti has only 8 GB of memory, so sequential conversion is the safer default.
If one chunk conversion fails, continue with later chunks and report the failed chunk clearly.
Do not merge Markdown outputs.
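
The range rules above can be sketched as a pure planning function that touches no files. The names `PageRange` and `plan_page_ranges` are hypothetical and chosen only for illustration; the project's own plan type may look different.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PageRange:
    """Hypothetical record: one chunk's 1-based inclusive source page range."""

    index: int       # 1-based chunk index
    start_page: int  # first source page, 1-based inclusive
    end_page: int    # last source page, 1-based inclusive


def plan_page_ranges(total_pages: int, chunk_pages: int = 20) -> List[PageRange]:
    """Split pages 1..total_pages into consecutive chunks of chunk_pages."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    if chunk_pages <= 0:
        raise ValueError("chunk_pages must be positive")
    ranges = []
    starts = range(1, total_pages + 1, chunk_pages)
    for index, start in enumerate(starts, start=1):
        # The last chunk holds whatever pages remain.
        end = min(start + chunk_pages - 1, total_pages)
        ranges.append(PageRange(index, start, end))
    return ranges
```

For a 45-page source with the default size, this yields ranges 1-20, 21-40, and 41-45. A source of 20 pages or fewer yields a single range, which a caller can treat as the signal to convert the original PDF directly.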

Recommended chunk output naming:

<stem>.part-001.pages-001-020.md
<stem>.part-001.pages-001-020.metadata.json
<stem>.part-001.pages-001-020.report.md
<stem>.part-001.pages-001-020.assets/

<stem>.part-002.pages-021-040.md
...
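
The naming pattern above could be generated as follows. This is a sketch under assumptions: `chunk_stem` and `chunk_output_names` are hypothetical helper names, and the three-digit zero padding is read off the examples as a minimum width.

```python
from typing import Dict


def chunk_stem(stem: str, index: int, start_page: int, end_page: int) -> str:
    """Build '<stem>.part-NNN.pages-SSS-EEE' from 1-based values."""
    # :03d pads to at least three digits; larger numbers simply grow wider.
    return f"{stem}.part-{index:03d}.pages-{start_page:03d}-{end_page:03d}"


def chunk_output_names(
    stem: str, index: int, start_page: int, end_page: int
) -> Dict[str, str]:
    """Return the per-chunk output names listed in the contract."""
    base = chunk_stem(stem, index, start_page, end_page)
    return {
        "markdown": f"{base}.md",
        "metadata": f"{base}.metadata.json",
        "report": f"{base}.report.md",
        "assets": f"{base}.assets",
    }
```

Because the stem is inserted verbatim, Korean and other non-ASCII stems pass through unchanged.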

Recommended chunk PDF staging:

Use a temporary working directory.
Delete temporary chunk PDFs after conversion completes, including when --keep-raw is enabled.
Do not add a separate --keep-chunks flag in Sprint 10.
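
One way to get the staging behavior above is Python's `tempfile.TemporaryDirectory`, which removes the directory when its `with` block exits. The sketch below assumes two caller-supplied callbacks, `write_chunk` and `convert_chunk`; both are hypothetical stand-ins for the splitter and the existing conversion path.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def run_with_temp_chunks(
    chunk_filenames: List[str],
    write_chunk: Callable[[Path], None],
    convert_chunk: Callable[[Path], None],
) -> Path:
    """Stage each chunk in a temporary directory and convert it there.

    Returns the staging path, which no longer exists once this returns.
    """
    # TemporaryDirectory deletes the directory on exit, including when a
    # callback raises and the exception propagates to the caller.
    with tempfile.TemporaryDirectory(prefix="pdf2md-chunks-") as tmp:
        staging = Path(tmp)
        for name in chunk_filenames:
            chunk_path = staging / name
            write_chunk(chunk_path)    # e.g. write the chunk PDF to disk
            convert_chunk(chunk_path)  # e.g. pass the local path onward
    return staging
```

Nothing in this sketch looks at a raw-retention setting, which matches the rule that chunk PDFs are deleted even when `--keep-raw` is enabled.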

Provenance Requirements

Each chunk conversion must preserve original-source context.

Required chunk fields in metadata or engine options:

original source PDF path
original source SHA-256
chunk PDF path when retained, or chunk PDF filename when temporary
chunk index, 1-based
total chunk count
source page start, 1-based inclusive
source page end, 1-based inclusive
chunk page count
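
The required fields above could be carried in a small project-owned record. The following is a minimal sketch: `ChunkProvenance`, its field names, and `sha256_of_file` are hypothetical, and the contract does not prescribe this exact shape.

```python
import hashlib
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any, Dict


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so long PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


@dataclass(frozen=True)
class ChunkProvenance:
    """Hypothetical record of where a chunk came from (all names assumed)."""

    source_pdf: str         # original source PDF path
    source_sha256: str      # SHA-256 of the original source PDF
    chunk_pdf_name: str     # filename of the temporary chunk PDF
    chunk_index: int        # 1-based
    chunk_count: int        # total number of chunks for this source
    source_page_start: int  # 1-based inclusive
    source_page_end: int    # 1-based inclusive

    @property
    def chunk_page_count(self) -> int:
        return self.source_page_end - self.source_page_start + 1

    def to_dict(self) -> Dict[str, Any]:
        """Plain-dict form suitable for JSON metadata."""
        data = asdict(self)
        data["chunk_page_count"] = self.chunk_page_count
        return data
```

Keeping the record to plain strings and integers means it serializes to JSON without exposing any PDF-library objects.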

Page provenance must distinguish:

chunk-local page index, starting at 0 for MinerU output
original source page number, starting at 1 for user-facing reports

The report should include a short chunk context line, for example:

- Chunk: 2/5, source pages: 21-40
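
The two page numberings and the report line can be tied together with a pair of small functions. Both names are hypothetical and the sketch only illustrates the arithmetic.

```python
def source_page_number(chunk_start_page: int, chunk_local_index: int) -> int:
    """Map a 0-based chunk-local page index to a 1-based source page number.

    chunk_start_page is the chunk's first source page (1-based inclusive).
    """
    if chunk_start_page < 1:
        raise ValueError("chunk_start_page is 1-based and must be >= 1")
    if chunk_local_index < 0:
        raise ValueError("chunk_local_index is 0-based and must be >= 0")
    return chunk_start_page + chunk_local_index


def chunk_context_line(index: int, count: int, start_page: int, end_page: int) -> str:
    """Format the short chunk context line for a report."""
    return f"- Chunk: {index}/{count}, source pages: {start_page}-{end_page}"
```

For the chunk covering source pages 21-40, chunk-local index 0 maps to source page 21 and index 19 maps to source page 40.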

Architecture Plan

WP10.1: PDF Splitter Module

Owner:

feature-generator-agent
mineru-integration-agent

Actions:

Add src/pdf2md/pdf_splitter.py.
Define project-owned PdfChunkPlan.
Implement page counting with pypdf.PdfReader.
Implement chunk planning without writing files.
Implement chunk writing with pypdf.PdfWriter.append(source, (start, end)) or an equivalent tested PdfReader/PdfWriter path.
Use zero-based half-open page ranges internally and one-based inclusive ranges for filenames and reports.
Reject invalid chunk sizes with a clear ValueError.
Fail clearly on encrypted/password-protected PDFs unless a later sprint adds password handling.

Expected output:

Deterministic chunk plans and local chunk PDFs suitable for the existing MinerU adapter.
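
As a rough illustration of the write path, the sketch below converts a one-based inclusive range to the zero-based half-open form and passes it to `PdfWriter.append`. The function names are hypothetical. The pypdf import is done lazily inside each function so that importing the module does not require pypdf; that is a choice made for this sketch, not a stated project requirement.

```python
from pathlib import Path
from typing import Tuple


def to_half_open(start_page: int, end_page: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive range to a 0-based half-open range."""
    if start_page < 1 or end_page < start_page:
        raise ValueError(f"invalid page range: {start_page}-{end_page}")
    return (start_page - 1, end_page)


def count_pages(pdf_path: Path) -> int:
    """Count pages, failing clearly on encrypted input."""
    from pypdf import PdfReader  # lazy import (sketch-level choice)

    reader = PdfReader(str(pdf_path))
    if reader.is_encrypted:
        raise ValueError(f"encrypted PDFs are not supported: {pdf_path}")
    return len(reader.pages)


def write_chunk_pdf(source: Path, dest: Path, start_page: int, end_page: int) -> None:
    """Write source pages start_page..end_page (1-based inclusive) to dest."""
    from pypdf import PdfWriter  # lazy import (sketch-level choice)

    writer = PdfWriter()
    # append() takes a (start, stop) tuple of zero-based, half-open indices.
    writer.append(str(source), pages=to_half_open(start_page, end_page))
    with dest.open("wb") as handle:
        writer.write(handle)
```

For example, `write_chunk_pdf(src, dest, 21, 40)` appends zero-based pages 20 through 39, which are source pages 21 to 40.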

WP10.2: Chunk-Aware Path Planning

Owner:

feature-generator-agent

Actions:

Extend output planning so chunk outputs are deterministic and conflict-checked before conversion starts.
Avoid collisions between original-output stems and chunk-output stems.
Preserve output-root escape prevention.
Respect --overwrite.
Keep Korean and non-ASCII source stems working.

Expected output:

A long PDF can produce multiple planned Markdown/metadata/report/assets outputs without overwriting another chunk.
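
The pre-flight checks in this work package might look like the sketch below. It is not the project's `plan_outputs()`: `check_planned_outputs` is a hypothetical, standalone function that only shows the three checks (root escape, collision between planned outputs, and existing files without overwrite).

```python
from pathlib import Path
from typing import Iterable, List


def check_planned_outputs(
    output_root: Path, planned: Iterable[Path], overwrite: bool = False
) -> List[Path]:
    """Validate every planned output path before any conversion starts."""
    root = output_root.resolve()
    seen = set()
    checked = []
    for path in planned:
        resolved = (root / path).resolve()
        # Reject '..' segments or absolute paths that leave the output root.
        if not resolved.is_relative_to(root):  # requires Python 3.9+
            raise ValueError(f"planned output escapes the output root: {path}")
        # Reject two planned outputs that resolve to the same location.
        if resolved in seen:
            raise ValueError(f"planned outputs collide: {path}")
        # Reject existing files unless overwrite was requested.
        if resolved.exists() and not overwrite:
            raise FileExistsError(f"output already exists: {resolved}")
        seen.add(resolved)
        checked.append(resolved)
    return checked
```

Running all checks up front means a conflict on a later chunk is reported before the first chunk is converted. The sketch compares resolved paths exactly, so it does not detect names that differ only by case on a case-insensitive filesystem.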

WP10.3: Conversion Orchestration

Owner:

feature-generator-agent
mineru-integration-agent

Actions:

Add chunk mode to convert_pdf() and convert_input().
When chunk mode is active, split before calling the MinerU adapter.
Reuse existing per-PDF conversion path for each chunk PDF rather than creating a second conversion pipeline.
Continue conversion after a chunk-level failure and aggregate a batch-like result for the source.
Ensure temporary chunk directories are cleaned up after conversion completes, including when --keep-raw is enabled.
Keep strict-local validation unchanged.

Expected output:

Long PDF conversion yields separate Markdown/metadata/report/assets outputs per chunk.
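
The continue-after-failure rule can be illustrated with a short sequential loop. `ChunkOutcome` and `convert_chunks_sequentially` are hypothetical names, and `convert_one` stands in for the existing per-PDF conversion path; the real result types (`ConversionResult`, `BatchConversionResult`) are not modeled here.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Optional


@dataclass
class ChunkOutcome:
    """Hypothetical per-chunk result used only in this sketch."""

    chunk_pdf: Path
    ok: bool
    error: Optional[str] = None


def convert_chunks_sequentially(
    chunk_pdfs: List[Path], convert_one: Callable[[Path], None]
) -> List[ChunkOutcome]:
    """Convert chunks one at a time and keep going after a failure."""
    outcomes = []
    for chunk_pdf in chunk_pdfs:  # strictly sequential, no parallelism
        try:
            convert_one(chunk_pdf)
        except Exception as exc:
            # Record the failure explicitly instead of skipping it silently.
            message = f"{type(exc).__name__}: {exc}"
            outcomes.append(ChunkOutcome(chunk_pdf, ok=False, error=message))
        else:
            outcomes.append(ChunkOutcome(chunk_pdf, ok=True))
    return outcomes
```

A caller can then report every failed chunk by filtering for `ok` being false, while still using the outputs of the chunks that succeeded.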

WP10.4: Metadata And Report Chunk Provenance

Owner:

metadata-agent
obsidian-markdown-agent

Actions:

Add chunk provenance fields without exposing raw pypdf objects.
Keep existing required metadata fields valid.
Keep original source provenance visible even though MinerU sees a chunk PDF as input.
Ensure chunk reports are readable without opening JSON metadata.

Expected output:

Users can map each output file back to original source pages.

WP10.5: CLI And API Surface

Owner:

feature-generator-agent
requirements-guard-agent

Actions:

Add --chunk-pages [INTEGER] to pdf2md convert.
Keep chunking disabled unless the option is present.
Use 20 pages when --chunk-pages is present without an explicit value.
Validate positive integer input.
Keep --out, --metadata, --keep-raw, --recursive, --overwrite, --gpu, and strict-local behavior unchanged.
Update README with the long-PDF workflow.

Expected output:

uv run pdf2md convert samples/long.pdf --out outputs --chunk-pages 20
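
An option that is off by default, uses 20 when given without a value, and accepts an explicit positive integer can be expressed in the standard library's `argparse` with `nargs="?"` and `const`. The contract does not say which CLI framework `pdf2md` uses, so the parser below is an assumption-laden sketch rather than the project's command definition.

```python
import argparse


def positive_int(text: str) -> int:
    """argparse type: accept only integers greater than zero."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"expected a positive integer, got {text!r}"
        ) from None
    if value <= 0:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {value}")
    return value


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the 'pdf2md convert' command."""
    parser = argparse.ArgumentParser(prog="pdf2md convert")
    parser.add_argument("input")
    parser.add_argument("--out", required=True)
    parser.add_argument(
        "--chunk-pages",
        nargs="?",       # the value is optional
        const=20,        # used when the flag is given without a value
        default=None,    # flag absent: chunking stays disabled
        type=positive_int,
        metavar="INTEGER",
    )
    return parser
```

With this parser, `--chunk-pages` alone gives 20, `--chunk-pages 5` gives 5, and omitting the flag gives `None`. One caveat of optional-value flags in `argparse`: if `--chunk-pages` is placed directly before `INPUT` with no value, the parser tries to read `INPUT` as the chunk size.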

WP10.6: Tests

Owner:

feature-generator-agent
evaluation-agent

Default tests must not require real MinerU or sample PDFs.

Required tests:

Build small local PDF fixtures using pypdf blank pages or minimal test PDFs.
Page count detection for 1, 20, 21, 40, and 41 pages.
Chunk planning produces expected 1-based filenames and page ranges.
Chunk writing produces PDFs with the expected page counts.
Non-positive chunk size is rejected.
Existing conversion without --chunk-pages is unchanged.
Chunked conversion calls the fake adapter once per chunk.
Chunked conversion writes separate Markdown, metadata JSON, report Markdown, and assets per chunk.
--overwrite and conflict detection work for all planned chunk outputs.
A failed chunk does not silently fall back and does not prevent later chunks from being attempted.
Metadata/report contain original source PDF and source page range.
CLI validates --chunk-pages and prints a useful summary.
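
Blank-page fixtures for the page-count cases above can be built with pypdf's `PdfWriter.add_blank_page`. The helper names below are hypothetical, and pypdf is imported lazily inside the fixture builder as a sketch-level choice.

```python
from pathlib import Path

# Expected chunk counts for the listed page counts at 20 pages per chunk.
EXPECTED_CHUNKS = {1: 1, 20: 1, 21: 2, 40: 2, 41: 3}


def expected_chunk_count(page_count: int, chunk_pages: int = 20) -> int:
    """Ceiling division: number of chunks needed for page_count pages."""
    return -(-page_count // chunk_pages)


def make_blank_pdf(path: Path, page_count: int) -> Path:
    """Write a PDF with page_count blank pages for use as a test fixture."""
    from pypdf import PdfWriter  # lazy import (sketch-level choice)

    writer = PdfWriter()
    for _ in range(page_count):
        writer.add_blank_page(width=612, height=792)  # US Letter, in points
    with path.open("wb") as handle:
        writer.write(handle)
    return path
```

In a pytest suite, `make_blank_pdf(tmp_path / "doc.pdf", 41)` would give a 41-page input without touching samples/, MinerU, or the network.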

Optional local validation:

Run chunked conversion on a local samples/ PDF only by explicit user request or opt-in gate.
Do not commit generated chunk PDFs or outputs.

Acceptance Criteria

Sprint 10 implementation can split a PDF into 20-page chunk PDFs before MinerU conversion.
Chunk PDFs are converted one by one using the existing direct local MinerU CLI adapter.
Markdown outputs are separate and not merged.
Metadata/report files show chunk index and original page range.
Default test suite passes without real MinerU, GPU, CUDA, model files, network, Obsidian, LaTeX tooling, or samples/.
Strict-local policy remains unchanged.
Existing non-chunked conversion behavior remains backward-compatible.

Hard Failure Criteria

Chunking uses a remote PDF service or uploads document content.
Chunking introduces an alternate Markdown conversion engine.
Default tests require real MinerU, GPU, CUDA, model files, network, or local samples.
Chunk outputs overwrite each other or overwrite non-chunk outputs without --overwrite.
Chunk metadata loses original source page provenance.
The implementation merges Markdown despite this contract's non-merge requirement.
The implementation silently skips failed chunks without warnings.

Resolved Decisions

Activation mode: opt-in with --chunk-pages; the option defaults to 20 pages when no value is supplied.
Chunk PDF retention: temporary chunk PDFs only; they are deleted after conversion completes.
API return type: convert_pdf() returns a BatchConversionResult when chunk mode is active.

Verification Commands For Implementation

uv sync
uv run pytest tests/test_pdf_splitter.py tests/test_conversion.py tests/test_cli.py tests/test_paths.py tests/test_metadata.py
uv run pytest
git diff --check
git status --short --untracked-files=all

Optional local command after implementation and explicit user approval:

uv run pdf2md convert samples/MITC공부.pdf --out outputs --chunk-pages 20 --overwrite

Handoff Requirements

After implementation:

Update PROGRESS.md with files changed, commands run, tests passed, optional sample status, known failures, residual risks, and next action.
Do not mark the sprint implemented until independent evaluation or equivalent focused review verifies the acceptance criteria.
Commit the completed change without including samples/ or generated outputs.

Implementation Handoff

Files changed: pyproject.toml, uv.lock, src/pdf2md/pdf_splitter.py, src/pdf2md/conversion.py, src/pdf2md/cli.py, src/pdf2md/__init__.py, src/pdf2md/report.py, tests, README, docs/V1IMPLEMENTATIONPLAN.md, PLAN.md, and PROGRESS.md.
Verification status: the targeted unit tests passed (42 tests); the full local test suite passed (163 tests, 1 optional skip); git diff --check passed with line-ending warnings only.
Optional local sample conversion remains out of scope unless explicitly requested.
