add files

김경종
2026-04-30 17:05:19 +09:00
parent f3e01b5a8c
commit 7e985ae94a
135 changed files with 41205 additions and 0 deletions
@@ -0,0 +1,30 @@
{
"project": "PDFtoMD",
"phase": "0-harness-foundation",
"steps": [
{
"step": 0,
"name": "sample-metadata-contract",
"status": "completed",
"summary": "Created deterministic samples/metadata.json and metadata contract tests for current sample PDFs."
},
{
"step": 1,
"name": "core-package-skeleton",
"status": "completed",
"summary": "Created importable pdftomd package skeleton, pyproject metadata, and typed core models."
},
{
"step": 2,
"name": "page-preanalysis-contract",
"status": "completed",
"summary": "Added PyMuPDF-only PDF preanalysis with page facts, OCR candidates, and 20-page chunk ranges."
},
{
"step": 3,
"name": "markdown-quality-gates",
"status": "completed",
"summary": "Added focused Markdown quality gates for math, LaTeX, tables, image links, frontmatter, and anchors."
}
]
}
@@ -0,0 +1,63 @@
# Step 0: sample-metadata-contract
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/PRD.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md
## Task
Create the first sample corpus metadata contract without implementing the conversion engine.
The metadata must classify every PDF currently under `samples/` by traits that future regression tests can use:
- text layer quality
- scanned or mixed scanned/text pages
- multi-column or complex layout risk
- formula density
- table density
- figure density
- Korean filename/path coverage
- target regression focus
Use deterministic JSON so future agents can update it with minimal diff noise.
## Sprint Contract
- Done means: `samples/metadata.json` exists, includes every current PDF by exact relative path, and has enough structured fields for future tests to select OCR, layout, formula, table, figure, and Korean-path cases.
- Hard thresholds:
- Every current `samples/*.pdf` appears exactly once.
- Metadata is valid UTF-8 JSON.
- Tests fail if a sample PDF is added without metadata.
- Tests fail if duplicate sample paths exist in the metadata.
- No conversion engine code is introduced in this step.
- Files owned:
- `samples/metadata.json`
- `tests/test_sample_metadata.py`
- `PROGRESS.md`
- `phases/0-harness-foundation/index.json`
- Dependencies:
- Existing sample PDFs under `samples/`
- PyMuPDF may be used only for lightweight page count/text/image inspection if needed.
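
The hard thresholds above can be sketched as a pure check that a pytest test could call. This is a minimal sketch, not the project's actual test: the metadata schema (a list of entries with a `path` field) and the names `check_metadata` and `dump_metadata` are assumptions made for illustration.

```python
import json
from collections import Counter


def check_metadata(pdf_paths, entries):
    """Return contract violations; an empty list means the contract holds.

    pdf_paths: relative paths of PDFs found under samples/.
    entries:   metadata entries, each assumed to carry a "path" field.
    """
    listed = Counter(entry["path"] for entry in entries)
    on_disk = set(pdf_paths)
    problems = []
    for path in sorted(on_disk - set(listed)):
        problems.append(f"missing metadata: {path}")
    for path in sorted(set(listed) - on_disk):
        problems.append(f"metadata without PDF: {path}")
    for path, count in sorted(listed.items()):
        if count > 1:
            problems.append(f"duplicate entry: {path}")
    return problems


def dump_metadata(data):
    """Serialize deterministically: sorted keys, fixed indent, readable Korean."""
    return json.dumps(data, ensure_ascii=False, indent=2, sort_keys=True) + "\n"
```

In a real test, `pdf_paths` would likely come from something like `sorted(p.as_posix() for p in Path("samples").glob("*.pdf"))`. Passing `ensure_ascii=False` keeps Korean filenames readable in the JSON file instead of escaping them.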
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_sample_metadata.py
```
## Verification
1. Run the acceptance commands.
2. Confirm `samples/metadata.json` paths match `samples/*.pdf`.
3. Confirm Korean filenames remain readable in JSON.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
## Do Not
- Do not create `src/` or conversion engine modules in this step.
- Do not rename, delete, compress, or rewrite sample PDFs.
- Do not add sidecar output files for converted documents.
- Do not add a new custom agent.
@@ -0,0 +1,59 @@
# Step 1: core-package-skeleton
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /phases/0-harness-foundation/step0.md
- /phases/0-harness-foundation/index.json
## Task
Create the minimal Python package skeleton and internal data contracts needed by later parser, pre-analysis, and renderer steps.
The skeleton should establish importable modules and typed models only. It should not call Marker, Nougat, PyMuPDF, OCR, CUDA, or the filesystem-heavy conversion path yet.
Suggested module boundary:
- `src/pdftomd/__init__.py`
- `src/pdftomd/models.py`
- `tests/test_models.py`
The exact type names may differ if the local design suggests better names, but the contracts must represent document identity, page ranges, block roles, bounding boxes, assets, formulas, tables, figures, and chunk metadata.
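
As an illustration of the kind of contract meant here, the sketch below shows dependency-free typed models with a page range invariant. All type and field names (`PageRange`, `BoundingBox`, `Block`, `BlockRole`) are hypothetical; the real names are left to the implementer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BlockRole(str, Enum):
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    TABLE = "table"
    FIGURE = "figure"
    FORMULA = "formula"


@dataclass(frozen=True)
class PageRange:
    """Inclusive, 1-based page range."""
    start: int
    end: int

    def __post_init__(self):
        # Invariant: ranges start at page 1 or later and never run backwards.
        if self.start < 1 or self.end < self.start:
            raise ValueError(f"invalid page range: {self.start}-{self.end}")

    @property
    def page_count(self):
        return self.end - self.start + 1


@dataclass(frozen=True)
class BoundingBox:
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Block:
    block_id: str
    role: BlockRole
    page: int
    bbox: Optional[BoundingBox] = None
    text: str = ""
```

Frozen dataclasses from the standard library keep the models hashable and free of parser or UI imports, which matches the dependency threshold in the sprint contract below.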
## Sprint Contract
- Done means: future steps have stable importable types for page analysis, block modeling, chunk metadata, and output assets.
- Hard thresholds:
- Tests cover model construction, deterministic slug/path-relevant fields, and page range invariants.
- Models do not depend on Marker, Nougat, PyMuPDF, torch, pandas, or PyQt.
- The package imports on Windows with `.\venv\python.exe`.
- Public contracts are documented by tests or clear docstrings.
- Files owned:
- `src/pdftomd/`
- `tests/test_models.py`
- `PROGRESS.md`
- `phases/0-harness-foundation/index.json`
- Dependencies:
- Step 0 metadata should be complete or explicitly blocked.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_models.py
```
## Verification
1. Run the acceptance commands.
2. Confirm package imports with `.\venv\python.exe -c "import pdftomd; print(pdftomd.__name__)"`.
3. Confirm no heavy parser/model imports are introduced.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
## Do Not
- Do not implement actual PDF parsing.
- Do not run Marker or Nougat.
- Do not add CLI commands.
- Do not add PyQt UI code.
- Do not widen the output contract beyond `docs/ARCHITECTURE.md`.
@@ -0,0 +1,63 @@
# Step 2: page-preanalysis-contract
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md
- /phases/0-harness-foundation/step0.md
- /phases/0-harness-foundation/step1.md
- /phases/0-harness-foundation/index.json
## Task
Implement the lightweight page pre-analysis contract that decides what later conversion steps need to know before Marker runs.
This step should use PyMuPDF only for fast document/page inspection:
- page count
- text length or text density per page
- image count per page
- OCR candidate flag per page
- basic long-document chunk candidates using the 20-page target
The output should be typed using the models from Step 1.
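
The chunking and OCR-candidate logic can be kept separate from PDF access, which makes it easy to test deterministically. The sketch below assumes the page facts were already collected (for example from PyMuPDF's `page.get_text()` and `page.get_images()`); the `PageFacts` type, the `MIN_TEXT_CHARS` threshold, and the OCR rule itself are illustrative assumptions, not the project's policy.

```python
from dataclasses import dataclass

# Hypothetical threshold: pages with fewer characters count as "no text layer".
MIN_TEXT_CHARS = 50


@dataclass(frozen=True)
class PageFacts:
    page_number: int  # 1-based
    text_chars: int
    image_count: int

    @property
    def ocr_candidate(self):
        # Assumed rule: almost no extractable text but at least one image.
        return self.text_chars < MIN_TEXT_CHARS and self.image_count > 0


def chunk_ranges(page_count, target=20):
    """Split pages 1..page_count into inclusive ranges of at most `target` pages."""
    if page_count < 0 or target < 1:
        raise ValueError("page_count must be >= 0 and target must be >= 1")
    return [
        (start, min(start + target - 1, page_count))
        for start in range(1, page_count + 1, target)
    ]
```

Clamping each range end with `min(..., page_count)` is what guarantees that chunk candidates never exceed the document page count.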
## Sprint Contract
- Done means: given a PDF path, the pre-analysis API returns deterministic page-level facts and chunk candidates without running Marker, Nougat, OCR, or GPU code.
- Hard thresholds:
- Tests cover at least one text-heavy sample and one mixed/scanned-risk sample from `samples/metadata.json`.
- Tests cover Korean path handling through `pathlib`.
- OCR candidate logic is deterministic and documented by tests.
- Chunk candidates never exceed the document page count.
- Actual conversion and Markdown rendering are not implemented here.
- Files owned:
- `src/pdftomd/preanalysis.py`
- model additions in `src/pdftomd/models.py` only if required
- `tests/test_preanalysis.py`
- `PROGRESS.md`
- `phases/0-harness-foundation/index.json`
- Dependencies:
- Step 0 sample metadata
- Step 1 package skeleton and models
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_preanalysis.py
```
## Verification
1. Run the acceptance commands.
2. Confirm PyMuPDF is the only PDF inspection dependency used in this step.
3. Confirm the sample metadata traits and test expectations are consistent.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
## Do Not
- Do not call Marker, Nougat, Surya, torch, or OCR.
- Do not write conversion output under `output/`.
- Do not create resume cache or runtime state files.
- Do not implement reading-order reconstruction in this step.
@@ -0,0 +1,63 @@
# Step 3: markdown-quality-gates
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /phases/0-harness-foundation/step0.md
- /phases/0-harness-foundation/step1.md
- /phases/0-harness-foundation/step2.md
- /phases/0-harness-foundation/index.json
## Task
Create focused Markdown quality gate functions that later renderer and conversion steps can call.
This step should validate generated Markdown-like strings and asset references without requiring a full PDF conversion. It should prefer structured checks over full snapshot comparison.
Quality gates should cover:
- math delimiter balance for `$...$` and `$$...$$`
- LaTeX `\begin{...}` / `\end{...}` pairs
- image link path existence or modeled asset reference existence
- table parseability for simple Markdown tables
- chunk frontmatter fields required by the output contract
- caption/reference anchor shape where confidence is sufficient
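
Two of the gates listed above, delimiter balance and `\begin`/`\end` pairing, can be sketched as small functions that return diagnostics rather than modifying the input. The function names and message wording are assumptions; the checks are deliberately simple and do not attempt full LaTeX parsing.

```python
import re

_ENV = re.compile(r"\\(begin|end)\{([^}]+)\}")


def check_math_delimiters(markdown):
    """Report unbalanced `$$` and `$` delimiters; never mutates the input."""
    problems = []
    # Escaped dollars such as \$5 are literal text, not math delimiters.
    text = re.sub(r"\\\$", "", markdown)
    if text.count("$$") % 2:
        problems.append("unbalanced $$ block delimiter")
    # Drop block delimiters so `$$` is not counted as two inline `$`.
    text = text.replace("$$", "")
    if text.count("$") % 2:
        problems.append("unbalanced $ inline delimiter")
    return problems


def check_environments(markdown):
    """Report `\\begin{...}` / `\\end{...}` pairs that do not nest correctly."""
    problems = []
    stack = []
    for kind, name in _ENV.findall(markdown):
        if kind == "begin":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected \\end{{{name}}}")
    problems.extend(f"unclosed \\begin{{{name}}}" for name in stack)
    return problems
```

Returning a list of messages keeps the gates usable both in pytest assertions and in later evaluator reporting.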
## Sprint Contract
- Done means: later renderer steps have reusable validation functions and focused pytest coverage for Markdown output risks.
- Hard thresholds:
- Tests include passing and failing examples for math delimiter checks.
- Tests include a complex table case where Markdown limitations are represented as an allowed HTML/fallback decision.
- Tests do not rely on full Markdown snapshot equality.
- Validation functions do not mutate generated Markdown silently unless an explicit repair function is named and tested.
- No PDF parsing or renderer implementation is introduced here.
- Files owned:
- `src/pdftomd/quality.py`
- model additions in `src/pdftomd/models.py` only if required
- `tests/test_quality.py`
- `PROGRESS.md`
- `phases/0-harness-foundation/index.json`
- Dependencies:
- Step 1 package skeleton and models
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_quality.py
```
## Verification
1. Run the acceptance commands.
2. Confirm quality gates are focused assertions, not whole-document snapshots.
3. Confirm failures return actionable messages for evaluator use.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
## Do Not
- Do not implement Marker/Nougat adapters.
- Do not implement the full Markdown renderer.
- Do not introduce an LLM correction path.
- Do not write warning/error messages into generated Markdown content.
@@ -0,0 +1,30 @@
{
"project": "PDFtoMD",
"phase": "1-core-runtime-contracts",
"steps": [
{
"step": 0,
"name": "input-normalization-slug",
"status": "completed",
"summary": "Added deterministic PDF path normalization, document identity creation, anchors, and output bundle path contracts."
},
{
"step": 1,
"name": "conversion-options-config",
"status": "completed",
"summary": "Added typed conversion options with runtime mode and formula parser defaults matching project policy."
},
{
"step": 2,
"name": "output-bundle-contract",
"status": "completed",
"summary": "Added deterministic output bundle paths and separated runtime artifact paths from document output."
},
{
"step": 3,
"name": "runtime-cache-policy",
"status": "completed",
"summary": "Added model cache and runtime artifact path policies with explicit offline environment mappings."
}
]
}
@@ -0,0 +1,38 @@
# Step 0: input-normalization-slug
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/0-harness-foundation/index.json
## Task
Implement deterministic input normalization and document slug generation for local PDF paths.
Cover `pathlib` handling for Korean filenames, spaces, relative paths, absolute paths, and long Windows paths. The API should not invoke Marker, Nougat, PyMuPDF, or any conversion logic.
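
A minimal sketch of deterministic slug generation follows. The slug shape (normalized filename stem plus a short hash) and the name `document_slug` are assumptions for illustration; the step does not prescribe a format.

```python
import hashlib
import re
import unicodedata
from pathlib import Path


def document_slug(pdf_path):
    """Build a stable slug from a PDF path without touching the filesystem."""
    pdf_path = Path(pdf_path)
    if pdf_path.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF path: {pdf_path}")
    # NFC keeps composed Hangul stable across platforms that decompose names.
    stem = unicodedata.normalize("NFC", pdf_path.stem)
    # \w is Unicode-aware, so Korean letters survive; runs of other
    # characters (spaces, punctuation) collapse into a single hyphen.
    words = re.sub(r"[^\w]+", "-", stem).strip("-").lower()
    # The short digest separates stems that normalize to the same words.
    digest = hashlib.sha256(stem.encode("utf-8")).hexdigest()[:8]
    return f"{words or 'document'}-{digest}"
```

This version derives the slug from the filename only, so the same file reached through a relative or an absolute path gets the same slug. Whether the directory should also contribute is a design decision this sketch leaves open.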
## Sprint Contract
- Done means: the core package has a tested function or small module that normalizes input PDF paths and produces stable document slugs.
- Hard thresholds: same input path and options produce the same slug; non-PDF paths fail clearly; Korean and spaced paths are tested; no parser import is introduced.
- Files owned: `src/pdftomd/`, `tests/`, `PROGRESS.md`, `phases/1-core-runtime-contracts/index.json`.
- Dependencies: Phase 0 package skeleton and model contracts.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm `PROGRESS.md` records the handoff and validation result.
3. Update this phase index step to `completed`, `blocked`, or `error`.
## Do Not
- Do not implement PDF parsing.
- Do not write conversion output.
- Do not add UI code.
@@ -0,0 +1,38 @@
# Step 1: conversion-options-config
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/ADR.md
- /phases/1-core-runtime-contracts/step0.md
## Task
Define the typed conversion options and runtime configuration used by CLI, library, parser adapters, renderer, and UI.
Include runtime mode, device behavior, chunk target pages, formula parser mode, Nougat command path, output directory, model cache location, and resume/log options.
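
The options could be modeled roughly as below. This is a sketch under stated assumptions: the field names, the `cpu` mode, and every default except the 20-page chunk target are placeholders, since the authoritative defaults live in the project policy documents.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional


class RuntimeMode(str, Enum):
    AUTO = "auto"
    CUDA = "cuda"
    CPU = "cpu"  # assumed third mode, not confirmed by the step text


@dataclass(frozen=True)
class ConversionOptions:
    runtime_mode: RuntimeMode = RuntimeMode.AUTO
    chunk_target_pages: int = 20
    formula_parser: str = "nougat"          # placeholder default
    nougat_command: Optional[Path] = None
    output_dir: Path = Path("output")       # placeholder default
    model_cache_dir: Optional[Path] = None
    resume: bool = False


def resolve_device(options, cuda_available):
    """Apply fail-fast semantics for `cuda` and fallback semantics for `auto`."""
    if options.runtime_mode is RuntimeMode.CUDA:
        if not cuda_available:
            raise RuntimeError("runtime mode 'cuda' requested but CUDA is unavailable")
        return "cuda"
    if options.runtime_mode is RuntimeMode.AUTO:
        return "cuda" if cuda_available else "cpu"
    return "cpu"
```

Passing `cuda_available` in as a plain boolean keeps this step free of any CUDA initialization; a later runtime phase can supply the real probe.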
## Sprint Contract
- Done means: conversion options have defaults matching project policy and can be constructed by tests without CLI parsing.
- Hard thresholds: explicit `cuda` fail-fast semantics and `auto` fallback semantics are represented; Nougat remains formula-only; PyQt and hosted API options are not introduced.
- Files owned: `src/pdftomd/`, `tests/`, `PROGRESS.md`, `phases/1-core-runtime-contracts/index.json`.
- Dependencies: Step 0 normalized path/slug contract.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm defaults align with `docs/ARCHITECTURE.md` and `docs/CONVERSION_POLICY.md`.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not add command-line parsing yet.
- Do not initialize CUDA, Marker, or Nougat.
- Do not add external API settings.
@@ -0,0 +1,39 @@
# Step 2: output-bundle-contract
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/1-core-runtime-contracts/step0.md
- /phases/1-core-runtime-contracts/step1.md
## Task
Define deterministic output bundle path rules for chunk Markdown files, image assets, anchors, and runtime artifacts.
This is a contract step. It may include lightweight path helpers and tests, but it should not render Markdown or write parsed document content.
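
A lightweight path helper in this spirit might look like the sketch below. The directory layout and file name patterns are hypothetical; `docs/ARCHITECTURE.md` remains the source of truth for the real contract.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class OutputBundle:
    """Pure path arithmetic: nothing here creates files or directories."""
    output_root: PurePosixPath
    runtime_root: PurePosixPath
    slug: str

    @property
    def document_dir(self):
        return self.output_root / self.slug

    def chunk_path(self, index):
        # Zero padding keeps lexicographic order equal to chunk order.
        return self.document_dir / f"{self.slug}-chunk-{index:03d}.md"

    def image_path(self, page, index, ext="png"):
        return self.document_dir / "images" / f"p{page:04d}-img{index:02d}.{ext}"

    def log_path(self):
        # Runtime artifacts live under a separate root, not in the bundle.
        return self.runtime_root / self.slug / "convert.log"
```

`PurePosixPath` is used here only so the example yields the same strings on every platform; production code would more likely use `pathlib.Path`.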
## Sprint Contract
- Done means: output directory, chunk file names, image asset names, and runtime log/state locations are modeled and tested.
- Hard thresholds: document output sidecars remain out of scope; runtime logs/state are separated from Markdown bundle output; asset naming is deterministic.
- Files owned: `src/pdftomd/`, `tests/`, `PROGRESS.md`, `phases/1-core-runtime-contracts/index.json`.
- Dependencies: Steps 0 and 1.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm generated path contracts match `docs/ARCHITECTURE.md`.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not implement the renderer.
- Do not write files under `output/` in tests unless using a temp directory.
- Do not create sidecar metadata output.
@@ -0,0 +1,39 @@
# Step 3: runtime-cache-policy
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/TOOLCHAIN.md
- /docs/CONVERSION_POLICY.md
- /phases/1-core-runtime-contracts/step1.md
- /phases/1-core-runtime-contracts/step2.md
## Task
Establish model cache, log path, and resume state policy as typed contracts and documented path helpers.
The result should prepare later CLI/runtime phases to use local model cache paths and offline-preferred model loading.
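
One way to keep environment overrides deterministic is to pass the environment in as a mapping instead of reading `os.environ` inside the helper. In the sketch below, `PDFTOMD_MODEL_CACHE` is a hypothetical variable name. `HF_HOME`, `HF_HUB_OFFLINE`, and `TRANSFORMERS_OFFLINE` are real Hugging Face variables, but whether the parsers used by this project honor them is an assumption that would need checking.

```python
from pathlib import Path

# Hypothetical override variable for this project.
CACHE_ENV_VAR = "PDFTOMD_MODEL_CACHE"


def resolve_model_cache(env, default):
    """Pick the cache directory from an explicit env mapping, else the default."""
    override = env.get(CACHE_ENV_VAR, "").strip()
    return Path(override) if override else Path(default)


def offline_environment(cache_dir):
    """Environment mapping that asks Hugging Face libraries to stay offline."""
    return {
        "HF_HOME": str(cache_dir),
        "HF_HUB_OFFLINE": "1",
        "TRANSFORMERS_OFFLINE": "1",
    }
```

Because both helpers only compute values, tests can cover them without triggering any download or touching the real user cache.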
## Sprint Contract
- Done means: model cache and runtime cache path contracts are tested and documented without downloading models.
- Hard thresholds: no network download is triggered; logs/state remain outside generated Markdown content; environment variable overrides are deterministic.
- Files owned: `src/pdftomd/`, `tests/`, `docs/TOOLCHAIN.md`, `PROGRESS.md`, `phases/1-core-runtime-contracts/index.json`.
- Dependencies: Steps 1 and 2.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm `docs/TOOLCHAIN.md` stays consistent with any cache path decisions.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not download Marker or Nougat weights.
- Do not add hosted storage or cloud cache behavior.
- Do not write warnings into Markdown output.
@@ -0,0 +1,26 @@
{
"project": "PDFtoMD",
"phase": "2-marker-adapter",
"steps": [
{
"step": 0,
"name": "marker-invocation-adapter",
"status": "pending"
},
{
"step": 1,
"name": "ocr-plan-handoff",
"status": "pending"
},
{
"step": 2,
"name": "marker-block-normalization",
"status": "pending"
},
{
"step": 3,
"name": "marker-failure-reporting",
"status": "pending"
}
]
}
@@ -0,0 +1,38 @@
# Step 0: marker-invocation-adapter
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/TOOLCHAIN.md
- /phases/1-core-runtime-contracts/index.json
## Task
Implement the first Marker adapter boundary that invokes Marker through a small internal interface.
Keep this adapter isolated so tests can use fakes without loading large models. Real Marker invocation should be smoke-testable but not required for every unit test.
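
The narrow interface could be expressed with a `typing.Protocol` so that unit tests substitute a fake for the real parser. Everything below (`DocumentParser`, `ParseResult`, `FakeParser`, `run_parser`) is a hypothetical sketch and makes no call into Marker itself.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Protocol


@dataclass(frozen=True)
class ParseResult:
    blocks: tuple = ()
    error: Optional[str] = None


class DocumentParser(Protocol):
    def parse(self, pdf_path: Path) -> ParseResult: ...


class FakeParser:
    """Test double: returns a canned result or raises a canned exception."""

    def __init__(self, result=None, error=None):
        self._result = result
        self._error = error

    def parse(self, pdf_path):
        if self._error is not None:
            raise self._error
        return self._result


def run_parser(parser, pdf_path):
    """Call the parser and convert exceptions into a structured failure."""
    try:
        return parser.parse(pdf_path)
    except Exception as exc:
        return ParseResult(error=f"{type(exc).__name__}: {exc}")
```

A real adapter would implement the same `parse` method and load Marker lazily, so importing the module never pulls in large models.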
## Sprint Contract
- Done means: Marker invocation is behind a narrow interface and can return structured parse results or clear failures.
- Hard thresholds: Marker remains the primary document parser; Nougat is not used here; unit tests avoid mandatory model downloads; parser errors are structured.
- Files owned: `src/pdftomd/marker_adapter.py`, related tests, `PROGRESS.md`, `phases/2-marker-adapter/index.json`.
- Dependencies: Phase 1 runtime contracts.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm the adapter can be tested without external services.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not parse formulas with Nougat.
- Do not implement Markdown rendering.
- Do not make every test load Marker models.
@@ -0,0 +1,38 @@
# Step 1: ocr-plan-handoff
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/0-harness-foundation/step2.md
- /phases/2-marker-adapter/step0.md
## Task
Connect PyMuPDF page pre-analysis results to the Marker adapter as an OCR/layout handoff plan.
The goal is to preserve page-level OCR decisions without making the entire document scan-only or text-only.
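
Page-aware OCR intent can be carried as compact page ranges rather than a document-wide flag. The helper below is a sketch with a hypothetical name; how the ranges are passed into Marker configuration is not specified here and is left out.

```python
def ocr_page_ranges(ocr_pages):
    """Collapse 1-based OCR candidate pages into sorted inclusive ranges."""
    ranges = []
    for page in sorted(set(ocr_pages)):
        if ranges and page == ranges[-1][1] + 1:
            # Page continues the current run: extend its end.
            ranges[-1] = (ranges[-1][0], page)
        else:
            ranges.append((page, page))
    return ranges
```

An empty result means no page needs OCR, which is distinct from a text-only assumption for the whole document.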
## Sprint Contract
- Done means: the adapter accepts page-level OCR candidates and passes the relevant intent into Marker configuration or records an explicit unsupported-path fallback.
- Hard thresholds: OCR decisions stay page-aware; PyMuPDF remains pre-analysis only; no OCR logs are inserted into Markdown.
- Files owned: `src/pdftomd/marker_adapter.py`, `src/pdftomd/preanalysis.py` if needed, tests, `PROGRESS.md`, phase index.
- Dependencies: Phase 0 pre-analysis and Step 0 Marker adapter.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm mixed text/scanned sample traits are represented in tests.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not force document-wide OCR when only selected pages need OCR.
- Do not implement reading-order fixes here.
- Do not add a second primary parser.
@@ -0,0 +1,39 @@
# Step 2: marker-block-normalization
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/0-harness-foundation/step1.md
- /phases/2-marker-adapter/step0.md
## Task
Map Marker structured output into the internal block model for headings, paragraphs, lists, tables, figures, captions, and equation candidates.
Prefer structured Marker APIs or JSON-like structures over scraping final Markdown.
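
A mapping over a fake Marker-like structure might look like the sketch below. The `block_type` strings and dictionary keys model a test fixture and are assumptions, not a confirmed Marker schema; the role names are likewise hypothetical.

```python
# Assumed fixture vocabulary; verify against real Marker output before use.
ROLE_BY_TYPE = {
    "SectionHeader": "heading",
    "Text": "paragraph",
    "ListItem": "list_item",
    "Table": "table",
    "Figure": "figure",
    "Caption": "caption",
    "Equation": "equation",
}


def normalize_block(raw, page):
    """Map one raw block dict to an internal block dict, keeping page and bbox."""
    role = ROLE_BY_TYPE.get(raw.get("block_type", ""), "unknown")
    bbox = raw.get("bbox")
    return {
        "role": role,
        "page": page,
        "bbox": tuple(bbox) if bbox else None,
        "text": raw.get("text", ""),
        # Equations are only flagged here; conversion belongs to Phase 3.
        "formula_candidate": role == "equation",
    }
```

Unknown block types map to `"unknown"` instead of being dropped, so page and bounding-box metadata survive even for shapes the mapping does not yet cover.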
## Sprint Contract
- Done means: fake Marker structures and at least one real or recorded sample shape map into internal block types.
- Hard thresholds: semantic block roles are preserved; bounding boxes and page numbers survive where available; formula blocks are only marked as candidates for Phase 3.
- Files owned: `src/pdftomd/marker_adapter.py`, model additions if required, tests, `PROGRESS.md`, phase index.
- Dependencies: Phase 0 models and Step 0 adapter.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm no final Markdown scraping is required for normal block mapping.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not perform Nougat conversion.
- Do not render Markdown.
- Do not discard page or bounding-box metadata without a documented reason.
@@ -0,0 +1,38 @@
# Step 3: marker-failure-reporting
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/2-marker-adapter/step0.md
- /phases/2-marker-adapter/step2.md
## Task
Define structured Marker failure reporting for parser errors, unsupported pages, timeout-like failures, and recoverable partial output.
This prepares later CLI and resume behavior without writing CLI code.
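
A typed failure record could be as small as the sketch below. The failure kinds mirror the four categories named in the task; the type names, fields, and message format are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureKind(str, Enum):
    PARSER_ERROR = "parser_error"
    UNSUPPORTED_PAGE = "unsupported_page"
    TIMEOUT = "timeout"
    PARTIAL_OUTPUT = "partial_output"


@dataclass(frozen=True)
class MarkerFailure:
    kind: FailureKind
    message: str
    page: Optional[int] = None
    chunk_index: Optional[int] = None
    fallback_eligible: bool = False

    def describe(self):
        """One-line message for runtime reports, never for Markdown output."""
        where = []
        if self.chunk_index is not None:
            where.append(f"chunk {self.chunk_index}")
        if self.page is not None:
            where.append(f"page {self.page}")
        location = ", ".join(where) or "document"
        return f"[{self.kind.value}] {location}: {self.message}"
```

Keeping `fallback_eligible` as an explicit field avoids inferring fallback behavior from the failure kind alone.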
## Sprint Contract
- Done means: Marker adapter failures are typed, testable, and do not corrupt generated Markdown content.
- Hard thresholds: failures include page/chunk context where available; errors go to runtime reporting paths, not document body; fallback eligibility is explicit.
- Files owned: `src/pdftomd/marker_adapter.py`, error/reporting models, tests, `PROGRESS.md`, phase index.
- Dependencies: Steps 0 and 2.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm failure messages are actionable for CLI and evaluator use.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not silently swallow Marker failures.
- Do not implement resume state here.
- Do not write errors into Markdown chunks.
@@ -0,0 +1,26 @@
{
"project": "PDFtoMD",
"phase": "3-formula-pipeline",
"steps": [
{
"step": 0,
"name": "formula-block-detection",
"status": "pending"
},
{
"step": 1,
"name": "nougat-command-adapter",
"status": "pending"
},
{
"step": 2,
"name": "latex-validation-repair",
"status": "pending"
},
{
"step": 3,
"name": "formula-reference-links",
"status": "pending"
}
]
}
@@ -0,0 +1,37 @@
# Step 0: formula-block-detection
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/2-marker-adapter/step2.md
## Task
Implement formula candidate detection from normalized Marker blocks.
Detect Marker equation blocks and text-pattern candidates while classifying inline versus block formulas based on block role and layout hints.
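
For the text-pattern side, one workable heuristic borrows Pandoc's rule for dollar math: the opening `$` must be followed by a non-space character, and the closing `$` must be preceded by a non-space character and not followed by a digit. The sketch below applies that rule; the function names and role strings are hypothetical, and the heuristic is one possible choice, not the project's confirmed method.

```python
import re

# Opening `$`: not escaped, not part of `$$`, followed by a non-space char.
# Closing `$`: preceded by a non-space char, not escaped, not followed by a
# digit or another `$`.
_INLINE_MATH = re.compile(r"(?<![\\$])\$(?=[^\s$])(.+?)(?<=\S)(?<!\\)\$(?![\d$])")


def inline_math_candidates(text):
    """Return the contents of likely inline math spans in plain text."""
    return [m.group(1) for m in _INLINE_MATH.finditer(text)]


def classify_formula(block_role, text):
    """Prefer the structured block role; fall back to text patterns."""
    if block_role == "equation":
        return "block"
    if inline_math_candidates(text):
        return "inline"
    return None
```

With this rule, a sentence such as "It costs $5 and $10 total" yields no candidates, because the second `$` is preceded by a space and therefore cannot close a span.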
## Sprint Contract
- Done means: formula candidates are represented as internal objects ready for Nougat or Marker fallback.
- Hard thresholds: ordinary currency-like dollar text is not blindly treated as math; inline/block distinction is tested; no Nougat invocation occurs yet.
- Files owned: `src/pdftomd/formulas.py`, tests, `PROGRESS.md`, `phases/3-formula-pipeline/index.json`.
- Dependencies: Phase 2 block normalization.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm tests include inline and block formula candidates.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not call Nougat.
- Do not render Markdown math.
- Do not make regex the only source when structured block role exists.
@@ -0,0 +1,38 @@
# Step 1: nougat-command-adapter
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/TOOLCHAIN.md
- /docs/CONVERSION_POLICY.md
- /phases/3-formula-pipeline/step0.md
## Task
Implement the Nougat formula-only adapter boundary.
The adapter should accept formula candidates and return LaTeX candidates or structured failure results. It should support a configured Nougat command path and be mockable in unit tests.
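
Making the process runner injectable is one way to keep the adapter mockable. In the sketch below the caller supplies the full argument list, so no Nougat flags are assumed. Reading the LaTeX from standard output is an assumption made purely for illustration; the real Nougat CLI may deliver results differently (for example as output files), and the adapter would need to follow whatever the tool actually does.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FormulaResult:
    latex: Optional[str]
    fallback_text: str  # Marker source text, always preserved
    error: Optional[str] = None


def default_runner(argv):
    """Run the configured command without raising on a non-zero exit."""
    return subprocess.run(list(argv), capture_output=True, text=True, check=False)


def convert_formula(argv, marker_text, runner=default_runner):
    """Run the formula command; on any failure keep the Marker text."""
    try:
        completed = runner(argv)
    except OSError as exc:
        return FormulaResult(None, marker_text, f"could not start command: {exc}")
    if completed.returncode != 0:
        return FormulaResult(None, marker_text, completed.stderr.strip() or "non-zero exit")
    latex = completed.stdout.strip()  # assumed output channel, see note above
    if not latex:
        return FormulaResult(None, marker_text, "empty output")
    return FormulaResult(latex, marker_text)
```

Unit tests pass a fake `runner` that returns a `subprocess.CompletedProcess`, so no GPU, model, or executable is needed.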
## Sprint Contract
- Done means: Nougat execution is isolated behind a testable command adapter and never becomes the primary document parser.
- Hard thresholds: failures preserve Marker fallback text; tests do not require GPU/model execution by default; command path handling works on Windows.
- Files owned: `src/pdftomd/formulas.py`, optional `src/pdftomd/nougat_adapter.py`, tests, `PROGRESS.md`, phase index.
- Dependencies: Step 0 formula candidates and Phase 1 options.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm `.\venv\Scripts\nougat.exe --help` remains documented as an environment check, not a unit-test requirement.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not parse whole PDFs with Nougat.
- Do not require model downloads for normal unit tests.
- Do not discard Marker source text on failure.
@@ -0,0 +1,38 @@
# Step 2: latex-validation-repair
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/0-harness-foundation/step3.md
- /phases/3-formula-pipeline/step1.md
## Task
Implement LaTeX and Markdown math validation for formula outputs, plus explicit repair helpers for safe cases.
Validation should cover delimiter balance and common `\begin{...}` / `\end{...}` pairs.
## Sprint Contract
- Done means: formula output validation returns actionable diagnostics and tested repairs for narrow, deterministic cases.
- Hard thresholds: validation does not silently mutate math; unrepairable failures fall back to Marker text; delimiter tests include both inline and block math.
- Files owned: `src/pdftomd/formulas.py`, `src/pdftomd/quality.py`, tests, `PROGRESS.md`, phase index.
- Dependencies: Phase 0 quality gates and Step 1 Nougat adapter.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm broken delimiter and environment examples are covered.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not build a broad LaTeX parser from scratch.
- Do not use LLM repair.
- Do not hide validation failures.
@@ -0,0 +1,37 @@
# Step 3: formula-reference-links
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/3-formula-pipeline/step2.md
## Task
Preserve formula numbering and body references as internal Markdown link targets when confidence is sufficient.
Support common English and Korean reference patterns such as `Eq. (3)` and `식 (5)`.
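
A sketch of the reference rewrite, covering only the two patterns named above plus `Equation (n)`, is shown below. The anchor format `eq-<n>` and both function names are assumptions. Duplicate formula numbers are excluded from the anchor map so that references to them stay plain text.

```python
import re
from collections import Counter

_REF = re.compile(r"(?:Eq\.|Equation|식)\s*\((?P<num>\d+)\)")


def build_anchor_map(formula_numbers):
    """Map formula numbers to anchors, skipping numbers that occur twice."""
    counts = Counter(formula_numbers)
    return {n: f"eq-{n}" for n in formula_numbers if counts[n] == 1}


def link_formula_refs(text, anchors):
    """Wrap references in Markdown links only when a known anchor exists."""
    def replace(match):
        anchor = anchors.get(match.group("num"))
        if anchor is None:
            return match.group(0)  # missing or ambiguous target: leave as text
        return f"[{match.group(0)}](#{anchor})"

    return _REF.sub(replace, text)
```

The original reference text is kept as the link label, so the formula number remains visible after the rewrite.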
## Sprint Contract
- Done means: formula anchors and reference rewrites are modeled and tested independently from final Markdown rendering.
- Hard thresholds: low-confidence matches remain plain text; duplicate formula numbers do not create unstable anchors; references never point to missing anchors.
- Files owned: `src/pdftomd/formulas.py`, reference model/tests, `PROGRESS.md`, phase index.
- Dependencies: Steps 0 through 2.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm duplicate and missing reference cases are tested.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not rewrite ambiguous references.
- Do not render final Markdown chunks.
- Do not remove the original formula number text.
@@ -0,0 +1,26 @@
{
"project": "PDFtoMD",
"phase": "4-semantic-enrichment",
"steps": [
{
"step": 0,
"name": "reading-order-checks",
"status": "pending"
},
{
"step": 1,
"name": "paragraph-stitching",
"status": "pending"
},
{
"step": 2,
"name": "header-footer-filtering",
"status": "pending"
},
{
"step": 3,
"name": "reference-indexing",
"status": "pending"
}
]
}
@@ -0,0 +1,38 @@
# Step 0: reading-order-checks
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/2-marker-adapter/step2.md
## Task
Create reading-order verification helpers over normalized blocks.
Use page numbers and bounding boxes to detect obvious ordering anomalies in multi-column or inserted-text layouts.
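
One simple check of this kind flags consecutive blocks in the same column where the later block sits higher on the page. The sketch assumes blocks are dicts with `id`, `page`, and `bbox` as `(x0, y0, x1, y1)`, with `y` increasing downward; these shapes and the single rule are illustrative, not a full layout analysis.

```python
def reading_order_anomalies(blocks):
    """Return diagnostics for blocks that appear above their predecessor.

    Only consecutive blocks on the same page whose boxes overlap
    horizontally (same column) are compared, so a jump from the bottom of
    the left column to the top of the right column is not flagged.
    """
    diagnostics = []
    for prev, cur in zip(blocks, blocks[1:]):
        if prev["page"] != cur["page"]:
            continue
        px0, py0, px1, _ = prev["bbox"]
        cx0, cy0, cx1, _ = cur["bbox"]
        same_column = min(px1, cx1) > max(px0, cx0)
        if same_column and cy0 < py0:
            diagnostics.append(
                f"page {cur['page']}: block {cur['id']} appears above "
                f"preceding block {prev['id']}"
            )
    return diagnostics
```

The function only reports; it never reorders the input, which matches the threshold in the sprint contract below.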
## Sprint Contract
- Done means: reading-order checks produce diagnostics that later enrichment and evaluator steps can use.
- Hard thresholds: checks are deterministic; tests include a multi-column-like fixture; helpers do not reorder content silently.
- Files owned: `src/pdftomd/enrichment.py`, tests, `PROGRESS.md`, `phases/4-semantic-enrichment/index.json`.
- Dependencies: Phase 2 normalized block model.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm diagnostics are actionable and tied to page/block ids.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not override Marker ordering without tests.
- Do not render Markdown.
- Do not call Marker or Nougat.
@@ -0,0 +1,37 @@
# Step 1: paragraph-stitching
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/4-semantic-enrichment/step0.md
## Task
Implement paragraph stitching for line-fragmented PDF text blocks.
Handle continuation lines and hyphenated line breaks while preserving likely compound words or identifiers when confidence is low.
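
A text-only sketch of the hyphen rule is shown below. The confidence heuristic (join only when a letter precedes the hyphen and the next line starts in lowercase) is an assumption, and it ignores the bounding-box hints that the real implementation is expected to use when available.

```python
def stitch_lines(lines):
    """Join line fragments into one paragraph string."""
    out = ""
    for line in (raw.strip() for raw in lines):
        if not line:
            continue
        if not out:
            out = line
        elif out.endswith("-"):
            if len(out) > 1 and out[-2].isalpha() and line[0].islower():
                # Likely a soft line-break hyphen: drop it and join.
                out = out[:-1] + line
            else:
                # Low confidence (identifier, capitalized part): keep hyphen.
                out += line
        else:
            out += " " + line
    return out
```

A genuine compound broken at its hyphen, such as "well-" followed by "known", would be joined incorrectly by this rule, which is one reason a dictionary or layout signal may be needed for higher confidence.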
## Sprint Contract
- Done means: paragraph stitching turns line fragments into coherent paragraph blocks with focused tests.
- Hard thresholds: hyphen joins are tested; low-confidence hyphen cases are preserved; list items and headings are not merged into paragraphs.
- Files owned: `src/pdftomd/enrichment.py`, tests, `PROGRESS.md`, phase index.
- Dependencies: Step 0 checks and normalized block model.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm Korean and English text fixtures remain stable.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not rely only on punctuation rules when bounding-box hints exist.
- Do not merge across tables, figures, or formulas.
- Do not modify source PDF files.
@@ -0,0 +1,37 @@
# Step 2: header-footer-filtering
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/4-semantic-enrichment/step1.md
## Task
Detect repeated page headers, footers, and page numbers and separate them from the main Markdown body flow.
The implementation should mark or remove repetitive boilerplate according to policy while keeping enough diagnostics for review.
## Sprint Contract
- Done means: repeated top/bottom page-region text can be identified and excluded from main content in tests.
- Hard thresholds: unique body text is not removed; page number patterns are tested; removal decisions are deterministic.
- Files owned: `src/pdftomd/enrichment.py`, tests, `PROGRESS.md`, phase index.
- Dependencies: Paragraph and block model from earlier steps.
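
One way to make the removal decision deterministic is a repetition threshold over page-edge text. The sketch below is illustrative only: the input shape (a list of top/bottom-region line texts per page), the function names, and the 0.6 ratio are assumptions, not documented policy. Digits are normalized so that running page numbers count as the same text.

```python
import math
import re
from collections import Counter


def normalize(text: str) -> str:
    """Collapse digits so 'Page 3' and 'Page 4' compare equal."""
    return re.sub(r"\d+", "#", text.strip().lower())


def find_boilerplate(pages: list[list[str]], min_ratio: float = 0.6) -> set:
    """Return normalized edge texts that occur on at least min_ratio of pages."""
    counts = Counter()
    for edge_lines in pages:
        counts.update({normalize(t) for t in edge_lines})  # count once per page
    threshold = max(2, math.ceil(min_ratio * len(pages)))
    return {key for key, n in counts.items() if n >= threshold}


def filter_page(edge_lines: list[str], boilerplate: set) -> tuple:
    """Split one page's edge lines into (kept, removed) for diagnostics."""
    kept, removed = [], []
    for text in edge_lines:
        (removed if normalize(text) in boilerplate else kept).append(text)
    return kept, removed
```

Returning the removed lines alongside the kept ones leaves enough information for review without writing anything into the Markdown body. Digit normalization can over-match numbered edge text, so a real rule would likely need position hints too.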
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm false-positive protections are tested.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not delete content without a confidence rule.
- Do not write filtered text into sidecar document outputs.
- Do not implement CLI reporting here.
@@ -0,0 +1,38 @@
# Step 3: reference-indexing
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/3-formula-pipeline/step3.md
- /phases/4-semantic-enrichment/step2.md
## Task
Build a reference index for figures, tables, formulas, captions, and body references.
The index should support later Markdown rendering by providing stable anchors and high-confidence link targets.
## Sprint Contract
- Done means: table, figure, and formula references can be resolved or left plain with reasons.
- Hard thresholds: anchors are deterministic; duplicate labels are handled; missing targets do not produce broken links.
- Files owned: `src/pdftomd/enrichment.py`, reference models/tests, `PROGRESS.md`, phase index.
- Dependencies: Formula links and semantic block enrichment.
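
The anchor and duplicate-label thresholds can be sketched as follows. All names here (`make_anchor`, `build_index`, `resolve`) and the reason strings are hypothetical. Anchors derive only from kind, label, and document order, so they are stable across runs; a reference to a duplicated or missing label is left plain with a reason instead of becoming a link.

```python
import re
from collections import defaultdict
from typing import Optional


def make_anchor(kind: str, label: str) -> str:
    """Build a slug such as 'figure-1' from kind and label."""
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    return f"{kind}-{slug}"


def build_index(targets: list) -> dict:
    """Map (kind, label) to anchors in document order.

    Repeated labels get a '--N' suffix; slugs never contain '--',
    so the suffix cannot collide with another label's anchor.
    """
    index = defaultdict(list)
    for kind, label in targets:
        base = make_anchor(kind, label)
        n = len(index[(kind, label)]) + 1
        index[(kind, label)].append(base if n == 1 else f"{base}--{n}")
    return dict(index)


def resolve(index: dict, kind: str, label: str) -> tuple:
    """Return (anchor or None, reason) for a body reference."""
    anchors = index.get((kind, label), [])
    if len(anchors) == 1:
        return anchors[0], "resolved"
    if not anchors:
        return None, "missing-target"
    return None, "ambiguous-duplicate-label"
```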
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm figure/table/formula reference fixtures are covered.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not rewrite ambiguous references.
- Do not make anchors depend on nondeterministic ordering.
- Do not render final Markdown here.
@@ -0,0 +1,26 @@
{
"project": "PDFtoMD",
"phase": "5-markdown-rendering-assets",
"steps": [
{
"step": 0,
"name": "markdown-block-renderer",
"status": "pending"
},
{
"step": 1,
"name": "table-renderer-fallbacks",
"status": "pending"
},
{
"step": 2,
"name": "figure-asset-writer",
"status": "pending"
},
{
"step": 3,
"name": "chunk-renderer",
"status": "pending"
}
]
}
@@ -0,0 +1,38 @@
# Step 0: markdown-block-renderer
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/4-semantic-enrichment/index.json
## Task
Implement block-level Markdown rendering for headings, paragraphs, lists, blockquotes, formulas, captions, and simple references.
Renderer tests should use internal block fixtures, not live PDF parsing.
## Sprint Contract
- Done means: core block types render to deterministic Markdown strings with focused tests.
- Hard thresholds: math delimiter validation is applied; renderer does not inject warnings/errors into Markdown; output is stable across runs.
- Files owned: `src/pdftomd/renderer.py`, tests, `PROGRESS.md`, `phases/5-markdown-rendering-assets/index.json`.
- Dependencies: Phase 4 enriched blocks and Phase 3 formula outputs.
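
A rough sketch of a block renderer with math delimiter validation, assuming a simplified `Block` fixture that is not the project's real model. Validation failures are raised as exceptions, so no warning text can end up inside the rendered Markdown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical fixture: kind is heading, paragraph, list_item, blockquote, or formula."""
    kind: str
    text: str
    level: int = 1


def has_balanced_math(text: str) -> bool:
    """True when unescaped '$' delimiters occur an even number of times."""
    count, i = 0, 0
    while i < len(text):
        if text[i] == "\\":
            i += 2          # skip the escaped character
            continue
        if text[i] == "$":
            count += 1
        i += 1
    return count % 2 == 0


def render_block(block: Block) -> str:
    if not has_balanced_math(block.text):
        raise ValueError(f"unbalanced math delimiter in {block.kind} block")
    if block.kind == "heading":
        return "#" * block.level + " " + block.text
    if block.kind == "list_item":
        return "- " + block.text
    if block.kind == "blockquote":
        return "> " + block.text
    if block.kind == "formula":
        return "$$\n" + block.text + "\n$$"
    return block.text


def render(blocks: list) -> str:
    """Join rendered blocks with blank lines; same input gives the same output."""
    return "\n\n".join(render_block(b) for b in blocks) + "\n"
```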
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm renderer tests are focused, not full snapshots.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not invoke Marker or Nougat.
- Do not implement table/asset file writing in this step.
- Do not add sidecar document outputs.
@@ -0,0 +1,37 @@
# Step 1: table-renderer-fallbacks
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/5-markdown-rendering-assets/step0.md
## Task
Implement table rendering policy for Markdown tables, limited HTML tables, and image fallback links.
Use structured table objects and avoid ad hoc string parsing for complex cases where possible.
## Sprint Contract
- Done means: simple tables render as Markdown, complex tables can render as limited HTML or fallback references, and table captions/footnotes are preserved.
- Hard thresholds: tests cover merged-cell-like structures, footnotes, captions, and table fallback decisions; invalid table output is detected by quality gates.
- Files owned: `src/pdftomd/renderer.py`, table models/tests, `PROGRESS.md`, phase index.
- Dependencies: Step 0 renderer and Phase 0 quality gates.
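
The fallback decision could look like the sketch below. The `Cell` model, the mode names, and the exact rule (merged or ragged tables go to HTML, empty extraction goes to an image link only when an asset exists) are assumptions for illustration; the real rule belongs to `docs/CONVERSION_POLICY.md`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """Hypothetical structured table cell."""
    text: str
    rowspan: int = 1
    colspan: int = 1


def choose_table_mode(rows: list, asset_path: Optional[str] = None) -> str:
    """Pick 'markdown', 'html', 'image', or 'unrenderable' for a table."""
    if not rows:
        # Nothing was extracted: link an existing asset, never invent content.
        return "image" if asset_path else "unrenderable"
    if any(c.rowspan > 1 or c.colspan > 1 for row in rows for c in row):
        return "html"
    if len({len(row) for row in rows}) == 1:
        return "markdown"
    return "html"  # ragged rows cannot be expressed as a pipe table


def render_markdown_table(rows: list) -> str:
    """Render a rectangular table; the first row is treated as the header."""
    def line(cells):
        return "| " + " | ".join(c.text.replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([line(header), separator, *(line(r) for r in body)])
```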
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm fallback images are linked but not generated unless a table asset exists.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not fake table content that was not extracted.
- Do not discard captions or footnotes.
- Do not implement full HTML sanitizer scope beyond limited table output.
@@ -0,0 +1,39 @@
# Step 2: figure-asset-writer
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/1-core-runtime-contracts/step2.md
- /phases/5-markdown-rendering-assets/step0.md
## Task
Implement deterministic image/figure asset writing and Markdown image reference generation.
Use hash-based deduplication when asset bytes are available and preserve figure captions and reference anchors.
## Sprint Contract
- Done means: figure assets can be written to temp output bundles with deterministic names and Markdown references.
- Hard thresholds: duplicate images share stored assets where configured; Korean path output is tested; missing assets produce validation failures, not silently broken links.
- Files owned: `src/pdftomd/assets.py`, renderer integration/tests, `PROGRESS.md`, phase index.
- Dependencies: Output bundle contract and renderer.
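
Hash-based deduplication can be sketched with the standard library alone. The function names, the `fig-` prefix, and the 16-character digest length are illustrative assumptions. Naming the file after its content hash means identical bytes always map to the same path, and a missing asset raises instead of producing a dead link.

```python
import hashlib
from pathlib import Path


def write_asset(data: bytes, assets_dir: Path, suffix: str = ".png") -> Path:
    """Write image bytes under a content-derived name; duplicates reuse the file."""
    digest = hashlib.sha256(data).hexdigest()[:16]
    assets_dir.mkdir(parents=True, exist_ok=True)
    path = assets_dir / f"fig-{digest}{suffix}"
    if not path.exists():
        path.write_bytes(data)
    return path


def image_reference(asset: Path, bundle_root: Path, caption: str) -> str:
    """Return a Markdown image link relative to the bundle root."""
    if not asset.is_file():
        raise FileNotFoundError(f"missing figure asset: {asset}")
    rel = asset.relative_to(bundle_root).as_posix()  # forward slashes on Windows too
    return f"![{caption}]({rel})"
```

The link is relative to the bundle root, so a Korean or otherwise non-ASCII output directory does not appear inside the Markdown itself.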
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm tests write only to temporary directories.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not write into real `output/` during tests.
- Do not rename source PDFs.
- Do not drop figure captions.
@@ -0,0 +1,39 @@
# Step 3: chunk-renderer
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/1-core-runtime-contracts/step2.md
- /phases/5-markdown-rendering-assets/step2.md
## Task
Implement chunk planning and chunk Markdown bundle writing over enriched blocks.
Chunk boundaries should target 20 pages but preserve logical block integrity for paragraphs, tables, figures, and formulas.
## Sprint Contract
- Done means: chunk files with frontmatter can be written deterministically from internal document fixtures.
- Hard thresholds: block integrity is preserved at chunk boundaries; chunk frontmatter includes minimum context; quality gates run on rendered chunks.
- Files owned: `src/pdftomd/chunking.py`, `src/pdftomd/renderer.py`, tests, `PROGRESS.md`, phase index.
- Dependencies: Renderer, assets, and output bundle contracts.
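
A simplified chunk planner, assuming a hypothetical `Block` fixture that records only the pages a block covers. A new chunk starts when adding the next block would push the page span past the target, and a block is always placed whole.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical fixture: a block and the page range it covers."""
    block_id: str
    first_page: int
    last_page: int


def plan_chunks(blocks: list, target_pages: int = 20) -> list:
    """Group blocks in reading order into chunks of about target_pages pages."""
    chunks, current = [], []
    for block in blocks:
        span = block.last_page - current[0].first_page + 1 if current else 0
        if current and span > target_pages:
            chunks.append(current)   # close the chunk before the overflowing block
            current = []
        current.append(block)        # blocks are never split
    if current:
        chunks.append(current)
    return chunks
```

With this rule a block that straddles the nominal boundary moves entirely into the next chunk, so two chunks can share a page, and a single block longer than the target becomes its own oversized chunk.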
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm long-document chunk fixtures cover boundary behavior.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not split blocks in the middle to satisfy exact 20-page counts.
- Do not create document sidecar metadata files.
- Do not implement CLI orchestration here.
@@ -0,0 +1,31 @@
{
"project": "PDFtoMD",
"phase": "6-cli-runtime-resume",
"steps": [
{
"step": 0,
"name": "cli-entrypoint-options",
"status": "pending"
},
{
"step": 1,
"name": "progress-logging",
"status": "pending"
},
{
"step": 2,
"name": "resume-state",
"status": "pending"
},
{
"step": 3,
"name": "device-oom-policy",
"status": "pending"
},
{
"step": 4,
"name": "model-cache-offline",
"status": "pending"
}
]
}
@@ -0,0 +1,38 @@
# Step 0: cli-entrypoint-options
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /phases/1-core-runtime-contracts/index.json
- /phases/5-markdown-rendering-assets/index.json
## Task
Implement the `python -m pdftomd` CLI entrypoint and option parsing over the existing library API.
Expose input PDF, output directory, formula parser mode, Nougat command, runtime/device, chunk size, logging, and resume options.
## Sprint Contract
- Done means: CLI options map into typed conversion options and can run against a mocked pipeline in tests.
- Hard thresholds: CLI does not duplicate conversion logic; defaults match docs; explicit `cuda` and `auto` modes are represented.
- Files owned: `src/pdftomd/__main__.py`, CLI modules/tests, `README.md` if command docs change, `PROGRESS.md`, phase index.
- Dependencies: Core contracts and renderer pipeline.
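
A sketch of how CLI arguments might map into a typed options object using `argparse`. The flag names, the defaults, and the `ConversionOptions` fields are assumptions, not the documented CLI; only a subset of the listed options is shown. The parser does nothing except build the options value, so conversion logic stays in the library.

```python
import argparse
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ConversionOptions:
    """Hypothetical typed options handed to the library API."""
    input_pdf: Path
    output_dir: Path
    device: str
    chunk_size: int
    resume: bool


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="python -m pdftomd")
    parser.add_argument("input_pdf", type=Path)
    parser.add_argument("--output-dir", type=Path, default=Path("output"))
    parser.add_argument("--device", choices=["auto", "cuda"], default="auto")
    parser.add_argument("--chunk-size", type=int, default=20)
    parser.add_argument("--resume", action="store_true")
    return parser


def parse_options(argv: list) -> ConversionOptions:
    """Parse argv into ConversionOptions; no conversion work happens here."""
    namespace = build_parser().parse_args(argv)
    return ConversionOptions(**vars(namespace))
```

Tests can call `parse_options` directly and pass the result to a mocked pipeline, which keeps them independent of Marker and Nougat.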
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm CLI help text shows documented options.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not put PDF parser or conversion logic inside the CLI argument-parsing code.
- Do not implement PyQt UI.
- Do not silently fall back to CPU in explicit CUDA mode.
@@ -0,0 +1,37 @@
# Step 1: progress-logging
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/6-cli-runtime-resume/step0.md
## Task
Implement progress reporting and stderr/local log behavior for chunk-level conversion.
Progress should summarize chunk success/failure without writing warnings or errors into Markdown content.
## Sprint Contract
- Done means: CLI/runtime tests can observe progress events and log file output in temp locations.
- Hard thresholds: Markdown chunks remain free of warning/error logs; failure summaries include chunk ids; logs use deterministic local paths from Phase 1.
- Files owned: `src/pdftomd/runtime.py`, CLI integration/tests, `PROGRESS.md`, phase index.
- Dependencies: CLI entrypoint and output/cache contracts.
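
Progress reporting that stays out of Markdown can be as small as an event list plus a stream writer. `ChunkEvent` and `ProgressReporter` are hypothetical names; the stream defaults to stderr and can be replaced in tests.

```python
import sys
from dataclasses import dataclass
from typing import Optional, TextIO


@dataclass(frozen=True)
class ChunkEvent:
    """Hypothetical progress event for one chunk."""
    chunk_id: str
    ok: bool
    message: str = ""


class ProgressReporter:
    """Collects chunk events and writes them to a stream, never to Markdown."""

    def __init__(self, stream: Optional[TextIO] = None) -> None:
        self.stream = stream if stream is not None else sys.stderr
        self.events = []

    def report(self, event: ChunkEvent) -> None:
        self.events.append(event)
        status = "ok" if event.ok else "FAILED"
        line = f"[chunk {event.chunk_id}] {status} {event.message}".rstrip()
        print(line, file=self.stream)

    def summary(self) -> str:
        """Summarize successes and list failed chunk ids."""
        failed = [e.chunk_id for e in self.events if not e.ok]
        text = f"{len(self.events) - len(failed)} ok, {len(failed)} failed"
        return text + (": " + ", ".join(failed) if failed else "")
```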
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm stderr/log behavior is tested separately from Markdown output.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not write runtime logs inside generated Markdown.
- Do not require a real PDF conversion for progress unit tests.
- Do not create persistent logs outside temp dirs in tests.
@@ -0,0 +1,37 @@
# Step 2: resume-state
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/6-cli-runtime-resume/step1.md
## Task
Implement runtime resume state for successful and failed chunks.
Resume state is a runtime artifact, not a document output sidecar.
## Sprint Contract
- Done means: conversion can skip completed chunks and retry failed chunks using a local state file in tests.
- Hard thresholds: state format is deterministic; stale state is detected; resume does not skip chunks when input/options changed materially.
- Files owned: `src/pdftomd/resume.py`, runtime integration/tests, `PROGRESS.md`, phase index.
- Dependencies: Progress/logging and chunk renderer contracts.
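
Stale-state detection can be sketched with a fingerprint over the input bytes and the options. The file layout, the status strings, and the function names below are assumptions. If the stored fingerprint differs from the current one, every chunk is rerun.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(pdf_bytes: bytes, options: dict) -> str:
    """Hash the input and a canonical JSON form of the options."""
    hasher = hashlib.sha256(pdf_bytes)
    hasher.update(b"\0")
    hasher.update(json.dumps(options, sort_keys=True).encode("utf-8"))
    return hasher.hexdigest()


def save_state(path: Path, fp: str, chunks: dict) -> None:
    """Write state as sorted, indented JSON so diffs stay minimal."""
    payload = {"fingerprint": fp, "chunks": chunks}
    path.write_text(json.dumps(payload, sort_keys=True, indent=2) + "\n", encoding="utf-8")


def chunks_to_run(path: Path, fp: str, all_chunks: list) -> list:
    """Return chunk ids that still need work; stale or missing state reruns all."""
    if not path.exists():
        return list(all_chunks)
    state = json.loads(path.read_text(encoding="utf-8"))
    if state.get("fingerprint") != fp:
        return list(all_chunks)  # input or options changed materially
    done = {cid for cid, status in state.get("chunks", {}).items() if status == "completed"}
    return [cid for cid in all_chunks if cid not in done]
```

Parser and model versions would presumably belong in `options` as well, so that a toolchain change also invalidates the state.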
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm state files are written only under temp/runtime cache paths in tests.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not treat resume state as part of generated document output.
- Do not skip chunks after parser/version-relevant option changes.
- Do not create hidden global state.
@@ -0,0 +1,39 @@
# Step 3: device-oom-policy
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/TOOLCHAIN.md
- /phases/1-core-runtime-contracts/step1.md
## Task
Implement runtime device selection, CUDA fail-fast behavior, auto CPU fallback behavior, and OOM retry policy hooks.
This step should be tested with mocks and small CUDA smoke checks only where safe.
## Sprint Contract
- Done means: runtime policy enforces explicit CUDA fail-fast, auto fallback warning, and configurable OOM retry reductions.
- Hard thresholds: no silent CPU fallback for explicit CUDA; tests do not require exhausting VRAM; GTX 1070 Ti constraints remain documented.
- Files owned: `src/pdftomd/runtime.py`, tests, `docs/TOOLCHAIN.md` if behavior changes, `PROGRESS.md`, phase index.
- Dependencies: Runtime config options.
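
The device policy can be tested without a GPU if the availability check is injected. In this sketch `select_device`, `DeviceUnavailableError`, and `oom_retry_batch_sizes` are hypothetical names; in real use the callable would presumably be `torch.cuda.is_available`.

```python
import warnings
from typing import Callable


class DeviceUnavailableError(RuntimeError):
    """Raised when an explicitly requested device cannot be used."""


def select_device(requested: str, cuda_available: Callable[[], bool]) -> str:
    """Resolve 'cuda' or 'auto' to a concrete device name."""
    if requested == "cuda":
        if not cuda_available():
            # Fail fast: explicit CUDA never falls back to CPU.
            raise DeviceUnavailableError("CUDA was requested explicitly but is not available")
        return "cuda"
    if requested == "auto":
        if cuda_available():
            return "cuda"
        warnings.warn("CUDA unavailable; falling back to CPU", RuntimeWarning, stacklevel=2)
        return "cpu"
    raise ValueError(f"unknown device mode: {requested!r}")


def oom_retry_batch_sizes(start: int, factor: int = 2, minimum: int = 1) -> list:
    """List the batch sizes to try after successive out-of-memory failures."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    sizes, size = [], start
    while size >= minimum:
        sizes.append(size)
        if size == minimum:
            break
        size = max(minimum, size // factor)
    return sizes
```

The retry schedule is plain data, so tests can check the reductions without ever allocating VRAM.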
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm CUDA smoke test instructions still work separately.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not intentionally trigger real GPU OOM in tests.
- Do not change PyTorch pins without updating `docs/TOOLCHAIN.md`.
- Do not hide runtime warnings.
@@ -0,0 +1,38 @@
# Step 4: model-cache-offline
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/TOOLCHAIN.md
- /docs/ARCHITECTURE.md
- /phases/6-cli-runtime-resume/step3.md
## Task
Document and wire model cache/offline behavior for Marker, Nougat, and Hugging Face cache paths.
Add CLI/runtime hooks for environment variables or explicit cache paths without downloading models during tests.
## Sprint Contract
- Done means: users can see how to pre-download models and run offline, and runtime cache paths are configurable.
- Hard thresholds: no test performs network download; docs include Windows commands; cache path policy matches Phase 1.
- Files owned: `src/pdftomd/runtime.py`, `README.md`, `docs/TOOLCHAIN.md`, tests, `PROGRESS.md`, phase index.
- Dependencies: Device/runtime policy and cache contracts.
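
A sketch of a runtime hook that builds a child-process environment for cache and offline settings. `HF_HOME` and `HF_HUB_OFFLINE` are environment variables read by the Hugging Face Hub client; whether the pinned Marker and Nougat versions honor them is an assumption that must be verified against `docs/TOOLCHAIN.md`. The function name and the `huggingface` subdirectory are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def cache_environment(
    cache_root: Optional[Path],
    offline: bool,
    base: Optional[dict] = None,
) -> dict:
    """Return a copy of the environment with cache/offline variables applied.

    Nothing is downloaded and os.environ is not mutated.
    """
    env = dict(os.environ if base is None else base)
    if cache_root is not None:
        env["HF_HOME"] = str(cache_root / "huggingface")
    if offline:
        env["HF_HUB_OFFLINE"] = "1"
    return env
```

Returning a new mapping rather than editing `os.environ` keeps tests isolated and avoids hidden global state.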
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm offline instructions are clear and do not imply bundled weights.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not download model weights as part of tests.
- Do not commit model caches.
- Do not make online access mandatory for already-cached models.
@@ -0,0 +1,26 @@
{
"project": "PDFtoMD",
"phase": "7-mvp-quality-hardening",
"steps": [
{
"step": 0,
"name": "sample-smoke-conversions",
"status": "pending"
},
{
"step": 1,
"name": "quality-metrics-report",
"status": "pending"
},
{
"step": 2,
"name": "regression-thresholds",
"status": "pending"
},
{
"step": 3,
"name": "mvp-fix-sweep",
"status": "pending"
}
]
}
@@ -0,0 +1,38 @@
# Step 0: sample-smoke-conversions
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/PRD.md
- /docs/CONVERSION_POLICY.md
- /phases/6-cli-runtime-resume/index.json
## Task
Create controlled sample smoke conversion tests for the MVP corpus.
The tests should exercise the end-to-end pipeline on a small selected subset or page range first, then document which full documents are suitable for manual or slower regression runs.
## Sprint Contract
- Done means: at least one text-layer sample and one mixed/scanned-risk sample can be converted in a controlled test path.
- Hard thresholds: tests have runtime bounds; sample selection comes from `samples/metadata.json`; generated output is checked with quality gates.
- Files owned: `tests/`, sample metadata updates if needed, `PROGRESS.md`, `phases/7-mvp-quality-hardening/index.json`.
- Dependencies: CLI/runtime and renderer phases complete.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Record sample coverage and any skipped slow tests in `PROGRESS.md`.
3. Update this phase index.
## Do Not
- Do not make every validation run process all long PDFs if runtime becomes impractical.
- Do not commit generated `output/` bundles.
- Do not weaken quality gates to pass broken output.
@@ -0,0 +1,37 @@
# Step 1: quality-metrics-report
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/PRD.md
- /phases/7-mvp-quality-hardening/step0.md
## Task
Add focused quality metrics for converted Markdown bundles.
Metrics should cover headings, math delimiter balance, LaTeX environment pairs, image links, captions, table parseability, chunk frontmatter, and no-exception conversion.
## Sprint Contract
- Done means: evaluator-friendly quality metrics can be run on sample outputs and produce actionable failure messages.
- Hard thresholds: metrics do not rely on full Markdown snapshots; failures identify file/chunk/block context; reports stay out of generated Markdown.
- Files owned: `src/pdftomd/quality.py`, `tests/`, optional scripts under `scripts/`, `PROGRESS.md`, phase index.
- Dependencies: Step 0 sample smoke conversions and quality gates.
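
As one example of a focused metric, LaTeX environment pairing can be checked with a stack and reported with line numbers. The function name and message format are illustrative, not the project's `quality.py` API.

```python
import re

ENVIRONMENT = re.compile(r"\\(begin|end)\{([^}]+)\}")


def unmatched_environments(markdown: str) -> list:
    """Return one message per unmatched \\begin or \\end, with its line number."""
    stack, problems = [], []
    for match in ENVIRONMENT.finditer(markdown):
        kind, name = match.group(1), match.group(2)
        line = markdown.count("\n", 0, match.start()) + 1
        if kind == "begin":
            stack.append((name, line))
        elif stack and stack[-1][0] == name:
            stack.pop()
        else:
            problems.append(f"line {line}: \\end{{{name}}} has no matching \\begin")
    problems.extend(f"line {line}: \\begin{{{name}}} is never closed" for name, line in stack)
    return problems
```

An empty list means the check passed. Each message carries enough context for a reviewer to locate the failure, and the report is returned to the caller instead of being written into the chunk.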
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm metrics can be used by `harness-review`.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not create broad snapshot baselines as the primary quality gate.
- Do not write quality reports inside Markdown chunks.
- Do not hide per-chunk failures.
@@ -0,0 +1,37 @@
# Step 2: regression-thresholds
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/PRD.md
- /phases/7-mvp-quality-hardening/step1.md
## Task
Define MVP regression thresholds for the sample corpus.
Thresholds should distinguish mandatory fast validation from slower/manual quality checks.
## Sprint Contract
- Done means: MVP pass/fail criteria are encoded in tests or documented commands and tied to sample metadata traits.
- Hard thresholds: mandatory validation remains runnable on the local machine; slow tests are opt-in; failed quality areas are not masked.
- Files owned: `tests/`, `scripts/`, sample metadata updates if needed, `PROGRESS.md`, phase index.
- Dependencies: Quality metrics report.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm slow tests are documented separately if needed.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not make local validation unusably slow.
- Do not turn all failures into warnings.
- Do not remove sample coverage for Korean paths or formulas.
@@ -0,0 +1,37 @@
# Step 3: mvp-fix-sweep
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/PRD.md
- /phases/7-mvp-quality-hardening/step2.md
## Task
Run a focused MVP stabilization pass based on failing quality metrics and sample smoke tests.
This step should fix only defects revealed by prior acceptance criteria and should avoid feature expansion.
## Sprint Contract
- Done means: MVP fast validation and selected sample smoke conversions pass with documented residual risks.
- Hard thresholds: fixes are test-backed; no new primary parser is introduced; out-of-scope UI/API/LLM features remain out of scope.
- Files owned: failing modules and tests identified by prior phase output, `PROGRESS.md`, phase index.
- Dependencies: Regression thresholds and quality reports.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Record remaining quality risks in `PROGRESS.md`.
3. Update this phase index.
## Do Not
- Do not use this as a broad refactor step.
- Do not add new major features.
- Do not bypass failed quality gates without recording a blocker.
@@ -0,0 +1,26 @@
{
"project": "PDFtoMD",
"phase": "8-release-docs-packaging",
"steps": [
{
"step": 0,
"name": "readme-usage-flow",
"status": "pending"
},
{
"step": 1,
"name": "environment-bootstrap-docs",
"status": "pending"
},
{
"step": 2,
"name": "license-checkpoint",
"status": "pending"
},
{
"step": 3,
"name": "release-checklist",
"status": "pending"
}
]
}
@@ -0,0 +1,36 @@
# Step 0: readme-usage-flow
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /README.md
- /phases/7-mvp-quality-hardening/index.json
## Task
Update README usage flow for the MVP CLI.
Document install, validation, basic conversion, formula parser modes, runtime modes, output layout, resume, and logs.
## Sprint Contract
- Done means: a user can follow README instructions to run the local CLI on Windows after environment setup.
- Hard thresholds: commands match implemented CLI; docs do not promise PyQt or hosted API as MVP; generated output contract is accurate.
- Files owned: `README.md`, docs as needed, `PROGRESS.md`, `phases/8-release-docs-packaging/index.json`.
- Dependencies: MVP quality hardening complete.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
```
## Verification
1. Run the acceptance command.
2. Confirm documented commands are copy-pasteable in PowerShell.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not document unimplemented features as available.
- Do not add marketing-style content.
- Do not include model weights in the repository.
@@ -0,0 +1,38 @@
# Step 1: environment-bootstrap-docs
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/TOOLCHAIN.md
- /requirements.txt
## Task
Document and optionally script the repo-local environment bootstrap flow.
Cover Conda Python 3.11, requirements install, CUDA smoke test, `pip check`, and Nougat help check.
## Sprint Contract
- Done means: environment setup instructions reflect the verified GTX 1070 Ti / torch 2.7.1+cu126 baseline.
- Hard thresholds: dependency pins remain consistent across README, TOOLCHAIN, and requirements; no unverified torch upgrade is introduced.
- Files owned: `README.md`, `docs/TOOLCHAIN.md`, optional `scripts/`, `PROGRESS.md`, phase index.
- Dependencies: MVP CLI docs.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pip check
.\venv\Scripts\nougat.exe --help
```
## Verification
1. Run the acceptance commands where the local environment is available.
2. Explain any skipped environment command in `PROGRESS.md`.
3. Update this phase index.
## Do Not
- Do not replace the single `venv` policy.
- Do not require `uv`.
- Do not change pins without official compatibility verification.
@@ -0,0 +1,36 @@
# Step 2: license-checkpoint
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md
## Task
Add a licensing checkpoint document or section for current personal use and future redistribution/commercial review.
This is not legal advice. It should identify Marker GPL/model-weight concerns and when to revisit them.
## Sprint Contract
- Done means: docs clearly state current personal-use context and future review triggers.
- Hard thresholds: docs do not claim legal conclusions; process/API isolation is described only as a risk mitigation candidate; model weights are not redistributed.
- Files owned: `docs/TOOLCHAIN.md`, `docs/ADR.md`, optional `README.md`, `PROGRESS.md`, phase index.
- Dependencies: Release docs context.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
```
## Verification
1. Run the acceptance command.
2. Confirm license notes are cautious and consistent.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not provide legal advice.
- Do not mark the project commercially safe without review.
- Do not vendor model weights.
@@ -0,0 +1,36 @@
# Step 3: release-checklist
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /README.md
- /docs/TOOLCHAIN.md
## Task
Create the local MVP release checklist.
Include validation, sample smoke conversion, environment checks, offline cache readiness, known limitations, and next phase entry conditions.
## Sprint Contract
- Done means: the repository has a concise checklist for deciding whether the local MVP is ready for personal use.
- Hard thresholds: checklist references real commands; known limitations are explicit; PyQt phase remains separate.
- Files owned: `README.md`, optional `docs/RELEASE_CHECKLIST.md`, `PROGRESS.md`, phase index.
- Dependencies: Prior release docs steps.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
```
## Verification
1. Run the acceptance command.
2. Confirm checklist can be followed by a fresh agent or user.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not claim production readiness.
- Do not include hosted API release work.
- Do not start PyQt implementation.
@@ -0,0 +1,26 @@
{
"project": "PDFtoMD",
"phase": "9-pyqt-thin-client",
"steps": [
{
"step": 0,
"name": "ui-api-contract",
"status": "pending"
},
{
"step": 1,
"name": "pyqt-shell",
"status": "pending"
},
{
"step": 2,
"name": "ui-progress-resume",
"status": "pending"
},
{
"step": 3,
"name": "ui-packaging-notes",
"status": "pending"
}
]
}
@@ -0,0 +1,38 @@
# Step 0: ui-api-contract
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/UI_GUIDE.md
- /docs/ARCHITECTURE.md
- /phases/8-release-docs-packaging/index.json
## Task
Define the stable core API contract that PyQt will call.
This step should verify that the UI does not need to import Marker, Nougat, PyMuPDF, or renderer internals directly.
## Sprint Contract
- Done means: UI-facing API functions/classes are documented and tested without building UI screens.
- Hard thresholds: UI remains a thin client; conversion logic is not duplicated; progress/resume events are available through the API.
- Files owned: `src/pdftomd/`, `tests/`, `docs/UI_GUIDE.md`, `PROGRESS.md`, `phases/9-pyqt-thin-client/index.json`.
- Dependencies: Local MVP release-ready CLI/library.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm UI boundary in `docs/UI_GUIDE.md` remains accurate.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not create UI widgets yet.
- Do not import parser internals in UI-facing code.
- Do not change CLI behavior unless tests require it.
@@ -0,0 +1,36 @@
# Step 1: pyqt-shell
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/UI_GUIDE.md
- /phases/9-pyqt-thin-client/step0.md
## Task
Create the minimal PyQt application shell for selecting a PDF, choosing output directory, and configuring runtime/formula options.
Keep the interface quiet and utilitarian for repeated local conversion work.
## Sprint Contract
- Done means: a minimal PyQt window can launch and bind controls to the core API config without running conversion by default.
- Hard thresholds: UI does not duplicate conversion engine logic; long Korean paths remain visible via tooltip or equivalent; defaults match CLI.
- Files owned: UI package/modules, tests if practical, `docs/UI_GUIDE.md`, `PROGRESS.md`, phase index.
- Dependencies: UI API contract.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
```
## Verification
1. Run the acceptance command.
2. If UI launch cannot be automated, record manual verification steps in `PROGRESS.md`.
3. Update this phase index.
## Do Not
- Do not create a marketing landing page.
- Do not call Marker/Nougat directly from widgets.
- Do not make UI the only way to run conversion.
@@ -0,0 +1,36 @@
# Step 2: ui-progress-resume
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/UI_GUIDE.md
- /phases/9-pyqt-thin-client/step1.md
## Task
Connect UI progress, cancellation, error summary, log opening, and resume controls to the core API.
The UI should show chunk success/failure and respect CUDA/auto runtime semantics.
## Sprint Contract
- Done means: UI can display progress events and expose resume behavior without corrupting generated Markdown.
- Hard thresholds: explicit CUDA failures are shown clearly; auto fallback warnings are visible; cancellation does not leave inconsistent runtime state.
- Files owned: UI modules, tests/manual verification notes, `PROGRESS.md`, phase index.
- Dependencies: PyQt shell and core runtime events.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
```
## Verification
1. Run the acceptance command.
2. Record UI smoke test steps and screenshots only if the user asks for them.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not write logs into Markdown content.
- Do not silently fall back from explicit CUDA to CPU.
- Do not implement a separate UI-only resume system.
@@ -0,0 +1,36 @@
# Step 3: ui-packaging-notes
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/UI_GUIDE.md
- /phases/9-pyqt-thin-client/step2.md
## Task
Document PyQt local packaging and known limitations for Windows.
This step may add packaging notes or scripts only if they are consistent with the verified environment.
## Sprint Contract
- Done means: UI packaging limitations and local execution commands are documented without changing the core engine contract.
- Hard thresholds: docs do not promise standalone redistribution without license review; model weights remain external/cache-based; CLI remains supported.
- Files owned: `README.md`, `docs/UI_GUIDE.md`, optional packaging docs/scripts, `PROGRESS.md`, phase index.
- Dependencies: UI progress/resume behavior.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
```
## Verification
1. Run the acceptance command.
2. Confirm packaging notes mention license/model cache constraints.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not build an installer unless explicitly requested.
- Do not bundle model weights.
- Do not drop CLI documentation.
@@ -0,0 +1,44 @@
{
"phases": [
{
"dir": "0-harness-foundation",
"status": "completed"
},
{
"dir": "1-core-runtime-contracts",
"status": "completed"
},
{
"dir": "2-marker-adapter",
"status": "pending"
},
{
"dir": "3-formula-pipeline",
"status": "pending"
},
{
"dir": "4-semantic-enrichment",
"status": "pending"
},
{
"dir": "5-markdown-rendering-assets",
"status": "pending"
},
{
"dir": "6-cli-runtime-resume",
"status": "pending"
},
{
"dir": "7-mvp-quality-hardening",
"status": "pending"
},
{
"dir": "8-release-docs-packaging",
"status": "pending"
},
{
"dir": "9-pyqt-thin-client",
"status": "pending"
}
]
}