remove files

김경종
2026-05-08 16:31:17 +09:00
parent 7e985ae94a
commit 551ab50735
135 changed files with 0 additions and 41205 deletions
@@ -1,20 +0,0 @@
{
  "name": "local-harness-engineering",
  "interface": {
    "displayName": "Local Harness Engineering"
  },
  "plugins": [
    {
      "name": "harness-engineering",
      "source": {
        "source": "local",
        "path": "./plugins/harness-engineering"
      },
      "policy": {
        "installation": "AVAILABLE",
        "authentication": "ON_INSTALL"
      },
      "category": "Productivity"
    }
  ]
}
@@ -1,59 +0,0 @@
---
name: harness-review
description: Review a Harness Engineering repository against its persistent rules and design docs. Use when Codex is asked to review local changes, generated phase files, or implementation output against `AGENTS.md`, `docs/ARCHITECTURE.md`, `docs/ADR.md`, `docs/UI_GUIDE.md`, testing expectations, and Harness step acceptance criteria.
---
# Harness Review
Use this skill when the user wants a repository-grounded review instead of generic commentary.
## Review input set
Read these first:
- `/AGENTS.md`
- `/docs/ARCHITECTURE.md`
- `/docs/HARNESS.md`
- `/docs/ADR.md`
- `/docs/UI_GUIDE.md`
- the changed files or generated `phases/` files under review
If the user explicitly asks for delegated review, prefer the repo custom agent `harness_reviewer` or built-in read-only explorers.
## Checklist
Evaluate the patch against these questions:
1. Does it follow the architecture described in `docs/ARCHITECTURE.md`?
2. Does it stay within the technology choices documented in `docs/ADR.md`?
3. Are new or changed behaviors covered by tests or other explicit validation?
4. Does it violate any CRITICAL rule in `AGENTS.md`?
5. Do generated `phases/` files include a clear Sprint Contract with hard thresholds?
6. Do generated `phases/` files remain self-contained, executable, and internally consistent?
7. If the user expects verification, does `python scripts/validate_workspace.py` succeed or is the failure explained?
## Output rules
- Lead with findings, ordered by severity.
- Include file references for each finding.
- Explain the concrete risk or regression, not just the rule name.
- If there are no findings, say so explicitly and mention residual risks or missing evidence.
- Keep summaries brief after the findings.
## Preferred review table
When the user asks for a checklist-style review, use this table:
| Item | Result | Notes |
|------|------|------|
| Architecture compliance | PASS/FAIL | {details} |
| Tech stack compliance | PASS/FAIL | {details} |
| Test coverage | PASS/FAIL | {details} |
| CRITICAL rules | PASS/FAIL | {details} |
| Build and validation | PASS/FAIL | {details} |
## What not to do
- Do not approve changes just because they compile.
- Do not focus on style-only issues when correctness, architecture drift, or missing validation exists.
- Do not assume a passing hook means the implementation is acceptable; review the actual diff and docs.
@@ -1,4 +0,0 @@
interface:
  display_name: "Harness Review"
  short_description: "Review changes against Harness project rules"
  default_prompt: "Use Harness review to check architecture, tests, and rules."
@@ -1,154 +0,0 @@
---
name: harness-workflow
description: Plan and run the Harness Engineering workflow for this repository. Use when Codex needs to read `AGENTS.md` and `docs/*.md`, discuss implementation scope, draft phase plans, or create/update `phases/index.json`, `phases/{phase}/index.json`, and `phases/{phase}/stepN.md` files for staged execution.
---
# Harness Workflow
Use this skill when the user is working in the Harness template and wants structured planning or phase-file generation.
## Workflow
### 1. Explore first
Read these files before proposing steps:
- `/AGENTS.md`
- `/docs/PRD.md`
- `/docs/ARCHITECTURE.md`
- `/docs/HARNESS.md`
- `/docs/ADR.md`
- `/docs/UI_GUIDE.md`
If the user explicitly asks for parallel exploration, use built-in Codex subagents such as `explorer`, or the repo-scoped custom agent `phase_planner`.
### 2. Discuss before locking the plan
If scope, sequencing, or architecture choices are still ambiguous, surface the decision points before creating `phases/` files.
### 3. Design steps with strict boundaries
When drafting a phase plan:
1. Keep scope minimal. One step should usually touch one layer or one module.
2. Make each step self-contained. Every `stepN.md` must work in an isolated Codex session.
3. List prerequisite files explicitly. Never rely on "as discussed above".
4. Specify interfaces or invariants, not line-by-line implementations.
5. Use executable acceptance commands, not vague success criteria.
6. Include a Sprint Contract with done meaning, hard thresholds, owned files, and dependencies.
7. Write concrete warnings in "do not do X because Y" form.
8. Use kebab-case step names.
## Files to generate
### `phases/index.json`
Top-level phase registry. Append to `phases[]` when the file already exists.
```json
{
  "phases": [
    {
      "dir": "0-mvp",
      "status": "pending"
    }
  ]
}
```
- `dir`: phase directory name.
- `status`: `pending`, `completed`, `error`, or `blocked`.
- Timestamp fields are written by `scripts/execute.py`; do not seed them during planning.
### `phases/{phase}/index.json`
```json
{
  "project": "<project-name>",
  "phase": "<phase-name>",
  "steps": [
    { "step": 0, "name": "project-setup", "status": "pending" },
    { "step": 1, "name": "core-types", "status": "pending" },
    { "step": 2, "name": "api-layer", "status": "pending" }
  ]
}
```
- `project`: from `AGENTS.md`.
- `phase`: directory name.
- `steps[].step`: zero-based integer.
- `steps[].name`: kebab-case slug.
- `steps[].status`: initialize to `pending`.
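The field rules above can be checked mechanically before a phase is executed. The following is a minimal sketch, not part of the Harness scripts: `validate_phase_index` is a hypothetical helper, and it assumes that step numbers are consecutive from zero and that steps share the four status values listed for `phases/index.json`.

```python
import re

# Assumption: steps use the same status vocabulary as phases/index.json.
ALLOWED_STATUSES = {"pending", "completed", "error", "blocked"}
KEBAB_CASE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def validate_phase_index(index: dict) -> list:
    """Return human-readable problems; an empty list means the index looks valid."""
    problems = []
    for key in ("project", "phase"):
        value = index.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or empty '{key}'")
    steps = index.get("steps")
    if not isinstance(steps, list) or not steps:
        problems.append("'steps' must be a non-empty list")
        return problems
    for position, step in enumerate(steps):
        # Assumption: step numbers are zero-based and consecutive.
        if step.get("step") != position:
            problems.append(f"position {position} has step number {step.get('step')!r}")
        name = step.get("name")
        if not isinstance(name, str) or not KEBAB_CASE.match(name):
            problems.append(f"step {position} name {name!r} is not kebab-case")
        if step.get("status") not in ALLOWED_STATUSES:
            problems.append(f"step {position} has unknown status {step.get('status')!r}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a planner report every issue in a single pass.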
### `phases/{phase}/stepN.md`
Each step file should contain:
1. A title.
2. A "read these files first" section.
3. A concrete task section.
4. A Sprint Contract section.
5. Executable acceptance criteria.
6. Verification instructions.
7. Explicit prohibitions.
Recommended structure:
````markdown
# Step {N}: {name}
## Read First
- /AGENTS.md
- /docs/ARCHITECTURE.md
- /docs/ADR.md
- {files from previous steps}
## Task
{specific instructions}
## Sprint Contract
- Done means: {observable or testable completion definition}
- Hard thresholds: {criteria that fail the step if violated}
- Files owned: {paths this step may edit}
- Dependencies: {previous steps or required docs}
## Acceptance Criteria
```bash
python scripts/validate_workspace.py
```
## Verification
1. Run the acceptance commands.
2. Check AGENTS and docs for rule drift.
3. Update the matching step in phases/{phase}/index.json:
   - completed + summary
   - error + error_message
   - blocked + blocked_reason
## Do Not
- {concrete prohibition}
````
## Execution
Run the generated phase with:
```bash
python scripts/execute.py <phase-name>
python scripts/execute.py <phase-name> --push
```
`scripts/execute.py` handles:
- `feat-{phase}` branch checkout/creation
- guardrail injection from `AGENTS.md` and `docs/*.md`
- accumulation of completed-step summaries into later prompts
- up to 3 retries with prior error feedback
- two-phase commit of code changes and metadata updates
- timestamps such as `created_at`, `started_at`, `completed_at`, `failed_at`, and `blocked_at`
## Recovery rules
- If a step is `error`, reset its status to `pending`, remove `error_message`, then rerun.
- If a step is `blocked`, resolve the blocker, reset to `pending`, remove `blocked_reason`, then rerun.
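The two recovery rules amount to the same small edit on `phases/{phase}/index.json`. Here is a hedged sketch of that edit: `reset_step` and `reset_step_file` are hypothetical helpers, not functions from `scripts/execute.py`, and timestamp fields such as `failed_at` are left untouched because the rules above do not say what to do with them.

```python
import json
from pathlib import Path

# Each failure status has one companion field, as named in the recovery rules.
COMPANION_FIELD = {"error": "error_message", "blocked": "blocked_reason"}


def reset_step(index: dict, step_number: int) -> bool:
    """Reset one errored or blocked step to pending. Returns True if it changed."""
    for step in index.get("steps", []):
        if step.get("step") != step_number:
            continue
        field = COMPANION_FIELD.get(step.get("status"))
        if field is None:
            return False  # pending or completed: nothing to reset
        step["status"] = "pending"
        step.pop(field, None)
        return True
    return False  # no step with that number


def reset_step_file(index_path: Path, step_number: int) -> bool:
    """Apply reset_step to an index file on disk, rewriting it only on change."""
    index = json.loads(index_path.read_text(encoding="utf-8"))
    changed = reset_step(index, step_number)
    if changed:
        text = json.dumps(index, indent=2, ensure_ascii=False) + "\n"
        index_path.write_text(text, encoding="utf-8")
    return changed
```

After the reset, the phase is rerun with `python scripts/execute.py <phase-name>` as described under Execution.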
@@ -1,4 +0,0 @@
interface:
  display_name: "Harness Workflow"
  short_description: "Guide Codex through Harness phase planning"
  default_prompt: "Use the Harness workflow to plan phases and step files."
@@ -1,12 +0,0 @@
name = "conversion_architect"
description = "Read-only conversion architecture specialist for parser boundaries, internal block models, chunk policy, renderer contracts, and output structure."
model = "gpt-5.4"
model_reasoning_effort = "high"
sandbox_mode = "read-only"
developer_instructions = """
Read AGENTS.md, PLAN.md, PROGRESS.md, docs/ARCHITECTURE.md, docs/CONVERSION_POLICY.md, and docs/ADR.md before proposing architecture.
Keep Marker as the document-structure source, Nougat as formula-only, and PyMuPDF as pre-analysis/chunk planning unless the user explicitly asks to revisit ADRs.
Define interfaces and invariants rather than line-by-line implementation.
Surface risks around chunk boundaries, fallback paths, deterministic naming, and Markdown output integrity.
Do not edit files unless explicitly instructed.
"""
@@ -1,12 +0,0 @@
name = "formula_pipeline_specialist"
description = "Read-only formula pipeline specialist for Nougat handoff, formula detection, LaTeX validation, numbering, references, and fallback behavior."
model = "gpt-5.4"
model_reasoning_effort = "high"
sandbox_mode = "read-only"
developer_instructions = """
Read AGENTS.md, PLAN.md, PROGRESS.md, docs/CONVERSION_POLICY.md, docs/TOOLCHAIN.md, and docs/ADR.md before advising.
Treat Nougat as formula-only and Marker source text as the required fallback.
Focus on equation block detection, inline/block formula classification, formula numbering, reference anchors, delimiter repair, and begin/end validation.
Call out confidence thresholds and failure modes explicitly.
Do not edit files unless explicitly instructed.
"""
@@ -1,11 +0,0 @@
name = "harness_reviewer"
description = "Read-only reviewer for Harness projects, focused on architecture drift, critical rule violations, and missing validation."
model = "gpt-5.4"
model_reasoning_effort = "high"
sandbox_mode = "read-only"
developer_instructions = """
Review changes like a repository owner.
Prioritize correctness, architecture compliance, behavior regressions, and missing tests over style.
Always compare the patch against AGENTS.md, docs/ARCHITECTURE.md, docs/ADR.md, and the requested acceptance criteria.
Lead with concrete findings and file references. If no material issues are found, say so explicitly and mention residual risks.
"""
@@ -1,11 +0,0 @@
name = "layout_table_figure_specialist"
description = "Read-only layout/table/figure specialist for reading order, paragraph stitching, table rendering, figure extraction, captions, and references."
model = "gpt-5.4"
model_reasoning_effort = "high"
sandbox_mode = "read-only"
developer_instructions = """
Read AGENTS.md, PLAN.md, PROGRESS.md, docs/CONVERSION_POLICY.md, docs/ARCHITECTURE.md, and docs/PRD.md before advising.
Focus on logical reading order, multi-column layouts, header/footer removal, paragraph stitching, Markdown vs HTML table decisions, table screenshot fallback, figure asset naming, deduplication, captions, and internal references.
Return testable heuristics and edge cases grounded in sample PDFs when available.
Do not edit files unless explicitly instructed.
"""
@@ -1,12 +0,0 @@
name = "pdf_toolchain_researcher"
description = "Read-only PDF toolchain researcher for Marker, Nougat, PyMuPDF, PyTorch/CUDA, model cache, and licensing compatibility."
model = "gpt-5.4"
model_reasoning_effort = "high"
sandbox_mode = "read-only"
developer_instructions = """
Read AGENTS.md, PLAN.md, PROGRESS.md, docs/TOOLCHAIN.md, docs/ARCHITECTURE.md, docs/CONVERSION_POLICY.md, and docs/ADR.md before answering.
Focus on official or primary sources for Marker, Nougat, PyMuPDF, PyTorch/CUDA, Markdown math, and comparison baselines.
Return compatibility findings, recommended dependency pins, runtime risks, model cache implications, and licensing questions.
Do not edit files unless the parent agent explicitly asks for a patch.
Do not propose replacing Marker as the primary parser without explaining architecture and ADR impact.
"""
@@ -1,12 +0,0 @@
name = "phase_planner"
description = "Read-heavy Harness planner that decomposes docs into minimal, self-contained phase and step files."
model = "gpt-5.4"
model_reasoning_effort = "high"
sandbox_mode = "read-only"
developer_instructions = """
Plan before implementing.
Read AGENTS.md and the docs directory, identify the smallest coherent phase boundaries, and draft self-contained steps.
Keep each step scoped to one layer or one module when possible.
Do not make code changes unless the parent agent explicitly asks you to write files.
Return concrete file paths, acceptance commands, and blocking assumptions.
"""
@@ -1,12 +0,0 @@
name = "quality_evaluator"
description = "Read-only quality evaluator for focused PDF-to-Markdown tests, sample corpus coverage, regression strategy, and validation gaps."
model = "gpt-5.4"
model_reasoning_effort = "high"
sandbox_mode = "read-only"
developer_instructions = """
Read AGENTS.md, PLAN.md, PROGRESS.md, docs/PRD.md, docs/CONVERSION_POLICY.md, and docs/ARCHITECTURE.md before evaluating quality.
Prefer focused assertions over full Markdown snapshots.
Prioritize tests for headings, formula delimiters, LaTeX environment pairs, table parseability, image links, caption matching, chunk integrity, Windows paths, Korean filenames, and no-exception conversion.
Return concrete pytest targets, fixture needs, and residual risks.
Do not write tests unless explicitly asked.
"""
@@ -1,12 +0,0 @@
name = "sample_corpus_analyst"
description = "Read-only analyst for samples/ PDFs, focused on page traits, text-layer quality, OCR needs, formulas, tables, figures, and regression metadata."
model = "gpt-5.4"
model_reasoning_effort = "high"
sandbox_mode = "read-only"
developer_instructions = """
Read AGENTS.md, PLAN.md, PROGRESS.md, docs/PRD.md, docs/CONVERSION_POLICY.md, and docs/TOOLCHAIN.md before analyzing samples.
Use PyMuPDF-oriented evidence when possible: page count, first-page text length, image count, suspected scan pages, OCR candidates, and layout complexity.
Design sample metadata schema and quality test implications, but do not create or modify metadata files unless explicitly asked.
Preserve Korean filenames exactly in reports.
Return concrete next tests and any sample coverage gaps.
"""
@@ -1,26 +0,0 @@
---
description: Review conversion policy, architecture, ADRs, and AGENTS.md for consistency.
argument-hint: [optional-topic]
allowed-tools: [Read, Glob, Grep, Bash]
---
# /conversion-policy-review
## Arguments
The user invoked this command with: $ARGUMENTS
## Workflow
1. Read `AGENTS.md`, `docs/ARCHITECTURE.md`, `docs/CONVERSION_POLICY.md`, `docs/ADR.md`, `docs/PRD.md`, and `docs/TOOLCHAIN.md`.
2. Check for drift in parser responsibilities, output contract, runtime policy, logging/resume policy, environment pins, and sidecar scope.
3. Lead with concrete inconsistencies and file references.
4. Run `python scripts\validate_workspace.py` if file changes were made or if the user asks.
5. Do not edit files unless explicitly asked.
## Output
- **Findings**
- **Consistency Status**
- **Open Questions**
- **Suggested Fixes**
@@ -1,28 +0,0 @@
---
description: Verify the repo-local PDFtoMD Python environment, CUDA, and Nougat CLI.
argument-hint: [quick|full]
allowed-tools: [Read, Bash]
---
# /env-check
## Arguments
The user invoked this command with: $ARGUMENTS
## Workflow
1. Read `AGENTS.md`, `requirements.txt`, `docs/TOOLCHAIN.md`, and `PROGRESS.md`.
2. Run `.\venv\python.exe -m pip check`.
3. Run a CUDA smoke test with `torch.ones((1,), device="cuda")` unless `$ARGUMENTS` says `quick`.
4. Run `.\venv\Scripts\nougat.exe --help`.
5. Summarize versions and failures.
6. Do not install or upgrade packages unless the user explicitly asks.
## Output
- **Environment**
- **CUDA**
- **Nougat**
- **Dependency Health**
- **Action Needed**
@@ -1,25 +0,0 @@
---
description: Check model cache and offline-readiness assumptions for Marker, Nougat, and Hugging Face assets.
argument-hint: [cache-path-or-empty]
allowed-tools: [Read, Bash]
---
# /model-cache-check
## Arguments
The user invoked this command with: $ARGUMENTS
## Workflow
1. Read `AGENTS.md`, `docs/TOOLCHAIN.md`, `docs/ARCHITECTURE.md`, and `docs/CONVERSION_POLICY.md`.
2. Inspect relevant environment variables and common Hugging Face cache paths.
3. Check whether local cache paths are explicit enough for offline execution.
4. Do not download model weights unless the user explicitly asks.
## Output
- **Cache Paths**
- **Offline Readiness**
- **Missing Assets**
- **Documentation Gaps**
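Steps 2 and 3 of the workflow can be approximated with a read-only report. This sketch assumes the standard Hugging Face variables `HF_HOME`, `HF_HUB_CACHE`, `TRANSFORMERS_CACHE`, `HF_HUB_OFFLINE`, and `TRANSFORMERS_OFFLINE`, plus the default hub cache under the user's home directory. The function name `cache_report` and the shape of its result are illustrative, and whether Marker or Nougat read additional variables is not covered here.

```python
from __future__ import annotations

import os
from pathlib import Path

CACHE_VARS = ("HF_HOME", "HF_HUB_CACHE", "TRANSFORMERS_CACHE")
OFFLINE_VARS = ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")


def cache_report(env: dict[str, str] | None = None) -> dict[str, object]:
    """Summarize cache-related settings without touching the network or the cache."""
    env = dict(os.environ) if env is None else env
    explicit = {name: env[name] for name in CACHE_VARS if env.get(name)}
    offline = {name: env.get(name, "") for name in OFFLINE_VARS}
    default_hub = Path.home() / ".cache" / "huggingface" / "hub"
    return {
        "explicit_cache_paths": explicit,
        "offline_flags": offline,
        "default_hub_cache": str(default_hub),
        "default_hub_cache_exists": default_hub.is_dir(),
        # "Explicit enough" is read here as: at least one cache variable is set.
        "cache_path_is_explicit": bool(explicit),
    }
```

Passing `env` explicitly keeps the function testable. Calling `cache_report()` with no arguments inspects the current process environment.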
@@ -1,28 +0,0 @@
---
description: Draft Harness phase steps for PDFtoMD implementation without executing them.
argument-hint: [phase-goal]
allowed-tools: [Read, Glob, Grep, Write, Edit]
---
# /phase-draft
## Arguments
The user invoked this command with: $ARGUMENTS
## Workflow
1. Read `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, `docs/PRD.md`, `docs/ARCHITECTURE.md`, `docs/CONVERSION_POLICY.md`, `docs/HARNESS.md`, `docs/ADR.md`, and `docs/TOOLCHAIN.md`.
2. Use `$harness-workflow` guidance if phase files should be created.
3. Keep each step self-contained and scoped to one layer or module.
4. Include executable acceptance commands.
5. Include a Sprint Contract with done criteria, hard thresholds, owned files, and dependencies.
6. Do not create phase files unless the user explicitly requested file generation.
## Output
- **Phase Goal**
- **Step List**
- **Dependencies**
- **Acceptance Commands**
- **Do Not**
@@ -1,26 +0,0 @@
---
description: Draft focused pytest coverage for PDFtoMD conversion quality.
argument-hint: [feature-or-sample-focus]
allowed-tools: [Read, Glob, Grep]
---
# /quality-plan
## Arguments
The user invoked this command with: $ARGUMENTS
## Workflow
1. Read `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, `docs/PRD.md`, `docs/ARCHITECTURE.md`, and `docs/CONVERSION_POLICY.md`.
2. Identify focused tests for headings, formulas, tables, images, captions, links, chunk boundaries, Windows paths, Korean filenames, and no-exception conversion.
3. Prefer concrete pytest names and fixture inputs.
4. Do not write tests unless explicitly asked.
## Output
- **Test Goals**
- **Proposed Test Files**
- **Fixture Needs**
- **Acceptance Commands**
- **Residual Risks**
@@ -1,27 +0,0 @@
---
description: Audit samples/ PDFs for page counts, text-layer quality, images, and OCR candidates.
argument-hint: [pdf-glob-or-empty]
allowed-tools: [Read, Glob, Bash, Write, Edit]
---
# /sample-audit
## Arguments
The user invoked this command with: $ARGUMENTS
## Workflow
1. Read `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, and `docs/CONVERSION_POLICY.md`.
2. Use PyMuPDF from `.\venv` to inspect matching `samples/*.pdf` files.
3. Report page count, first-page text length, image counts, suspected scan/OCR pages, Korean filename coverage, and obvious layout risks.
4. If the user asks to write metadata, create or update `samples/metadata.json`; otherwise only report.
5. Update `PROGRESS.md` when files are changed.
## Output
- **Corpus Summary**
- **Per-PDF Traits**
- **OCR Candidates**
- **Test Implications**
- **Recommended Metadata Changes**
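The per-PDF inspection in steps 2 and 3 might look like the sketch below. It assumes PyMuPDF is importable as `fitz` from the repo-local environment. `PdfTraits`, `inspect_pdf`, and the `MIN_TEXT_CHARS` threshold are hypothetical names and values chosen for illustration. The OCR flag is only a first-pass hint, since first-page text length alone is a weak signal.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

# Assumption: an illustrative threshold, not a value taken from the project docs.
MIN_TEXT_CHARS = 50


@dataclass
class PdfTraits:
    name: str  # original filename, kept exactly (including Korean characters)
    page_count: int
    first_page_text_length: int
    image_count: int

    @property
    def ocr_candidate(self) -> bool:
        """First-pass hint only: almost no text on page one, but images present."""
        return self.first_page_text_length < MIN_TEXT_CHARS and self.image_count > 0


def inspect_pdf(path: Path) -> PdfTraits:
    """Collect basic traits for one PDF using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so PdfTraits works without it

    with fitz.open(str(path)) as doc:
        first_text = doc[0].get_text() if doc.page_count else ""
        image_count = sum(len(page.get_images(full=True)) for page in doc)
        return PdfTraits(path.name, doc.page_count, len(first_text.strip()), image_count)
```

A report would then be something like `[inspect_pdf(p) for p in sorted(Path("samples").glob("*.pdf"))]`, with files written only when the user asks for metadata.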
@@ -1,31 +0,0 @@
---
description: Draft or review a step-level generator/evaluator contract before implementation.
argument-hint: [phase-dir step-number]
allowed-tools: [Read, Glob, Grep, Edit]
---
# /sprint-contract
## Arguments
The user invoked this command with: $ARGUMENTS
## Workflow
1. Read `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, `docs/HARNESS.md`, and the target `phases/{phase}/stepN.md`.
2. Confirm the step has a concrete Sprint Contract:
   - Done means
   - Hard thresholds
   - Files owned
   - Dependencies
   - Acceptance commands
   - Explicit Do Not list
3. If the contract is missing or vague, edit only the target step file to make the contract executable by a fresh agent.
4. Do not implement the step.
## Output
- **Target Step**
- **Contract Status**: ready | updated | blocked
- **Evaluator Thresholds**
- **Remaining Ambiguity**
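The contract check in step 2 can be partly automated. The sketch below assumes the step file follows the `stepN.md` layout recommended by the harness-workflow skill (`## Sprint Contract` with `- Done means: ...` style bullets). `missing_contract_parts` is a hypothetical helper. It detects missing or empty items only and does not judge whether the wording is vague.

```python
REQUIRED_SECTIONS = ("## Sprint Contract", "## Acceptance Criteria", "## Do Not")
REQUIRED_ITEMS = ("Done means", "Hard thresholds", "Files owned", "Dependencies")


def missing_contract_parts(step_markdown: str) -> list:
    """List required headings and contract bullets that are absent or empty."""
    lines = [line.strip() for line in step_markdown.splitlines()]
    missing = [section for section in REQUIRED_SECTIONS if section not in lines]
    for item in REQUIRED_ITEMS:
        prefix = f"- {item}:"
        # An item counts only if some text follows the colon.
        filled = any(
            line.startswith(prefix) and line[len(prefix):].strip() for line in lines
        )
        if not filled:
            missing.append(item)
    return missing
```

An empty result maps to a Contract Status of `ready`. Anything else is a prompt to edit the target step file or report it as `blocked`.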
@@ -1,27 +0,0 @@
---
description: Summarize current PDFtoMD plan, progress, blockers, and next work.
argument-hint: [optional-focus]
allowed-tools: [Read, Glob, Grep, Bash]
---
# /status
## Arguments
The user invoked this command with: $ARGUMENTS
## Workflow
1. Read `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, and `docs/HARNESS.md`.
2. Summarize the current project goal, scope, completed work, in-progress work, blockers, and next work.
3. If `$ARGUMENTS` names an area, focus the summary on that area.
4. Do not modify files.
## Output
- **Goal**
- **Current State**
- **Next Work**
- **Blockers**
- **Relevant Files**
- **Active Phase/Step**
@@ -1,9 +0,0 @@
# Project-scoped Codex defaults for the Harness template.
# As of 2026-04-15, hooks are experimental and disabled on native Windows.
[features]
codex_hooks = true
[agents]
max_threads = 6
max_depth = 1
@@ -1,40 +0,0 @@
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "python \".codex/hooks/pre_tool_use_policy.py\"",
            "statusMessage": "Checking risky shell command"
          }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "python \".codex/hooks/stop_continue.py\"",
            "statusMessage": "Running Harness validation",
            "timeout": 300
          },
          {
            "type": "command",
            "command": "python \".codex/hooks/handoff_policy.py\"",
            "statusMessage": "Checking PLAN/PROGRESS handoff",
            "timeout": 60
          },
          {
            "type": "command",
            "command": "python \".codex/hooks/drift_policy.py\"",
            "statusMessage": "Checking documentation drift",
            "timeout": 60
          }
        ]
      }
    ]
  }
}
@@ -1,81 +0,0 @@
#!/usr/bin/env python3
"""Catch high-confidence documentation drift before a Codex turn ends."""

from __future__ import annotations

import json
import subprocess
import sys
from pathlib import Path


def changed_paths(root: Path) -> set[str]:
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=root,
        capture_output=True,
        text=True,
        timeout=20,
    )
    if result.returncode != 0:
        return set()

    paths: set[str] = set()
    for line in result.stdout.splitlines():
        if not line.strip():
            continue
        path = line[3:].replace("\\", "/")
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        paths.add(path)
    return paths


def block(reason: str) -> int:
    json.dump({"decision": "block", "reason": reason}, sys.stdout)
    return 0


def main() -> int:
    try:
        payload = json.load(sys.stdin)
    except json.JSONDecodeError:
        return 0

    if payload.get("stop_hook_active"):
        return 0

    root = Path(payload.get("cwd") or ".").resolve()
    paths = changed_paths(root)

    if "requirements.txt" in paths and "docs/TOOLCHAIN.md" not in paths:
        return block(
            "requirements.txt changed without docs/TOOLCHAIN.md. "
            "Update the toolchain notes with dependency compatibility rationale."
        )

    sample_pdf_changed = any(path.startswith("samples/") and path.lower().endswith(".pdf") for path in paths)
    metadata_changed = "samples/metadata.json" in paths
    if sample_pdf_changed and not metadata_changed:
        return block(
            "A sample PDF changed without samples/metadata.json. "
            "Update the sample metadata mapping so quality tests know the corpus traits."
        )

    policy_docs = {
        "docs/ARCHITECTURE.md",
        "docs/CONVERSION_POLICY.md",
        "docs/ADR.md",
    }
    touched_policy_docs = policy_docs.intersection(paths)
    if touched_policy_docs and "PROGRESS.md" not in paths:
        return block(
            "Architecture or conversion policy docs changed without PROGRESS.md. "
            "Record the decision and handoff context in PROGRESS.md."
        )

    return 0


if __name__ == "__main__":
    raise SystemExit(main())
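One caveat on the path parsing in this hook, sketched under assumptions: with Git's default `core.quotePath` setting, `git status --porcelain` wraps paths containing non-ASCII bytes or special characters in double quotes with escapes, so a Korean-named file under `samples/` may not match `startswith("samples/")`. The NUL-delimited `-z` form does not quote paths. The parser below is a hypothetical alternative (the name `parse_porcelain_z` is not from the repository) for output obtained with `["git", "status", "--porcelain", "-z"]`.

```python
def parse_porcelain_z(raw: str) -> set:
    """Parse `git status --porcelain -z` output into forward-slash paths."""
    paths = set()
    entries = raw.split("\0")
    i = 0
    while i < len(entries):
        entry = entries[i]
        i += 1
        if len(entry) < 4:
            continue  # empty string left after the final NUL
        status, path = entry[:2], entry[3:]
        paths.add(path.replace("\\", "/"))
        if "R" in status or "C" in status:
            # In -z mode a rename/copy entry is "XY new<NUL>old<NUL>";
            # keep the new path and skip the original one.
            i += 1
    return paths
```

On Windows, the `subprocess.run` call would probably also need `encoding="utf-8"` so that non-ASCII paths decode consistently. That part is untested here.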
@@ -1,95 +0,0 @@
#!/usr/bin/env python3
"""Require PLAN/PROGRESS handoff discipline for multi-agent work."""

from __future__ import annotations

import json
import subprocess
import sys
from pathlib import Path

TRACKED_PREFIXES = (
    ".agents/",
    ".codex/",
    "AGENTS.md",
    "PLAN.md",
    "docs/",
    "phases/",
    "plugins/",
    "requirements.txt",
    "scripts/",
    "src/",
    "tests/",
)


def git_status_names(root: Path) -> list[str]:
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=root,
        capture_output=True,
        text=True,
        timeout=20,
    )
    if result.returncode != 0:
        return []

    names: list[str] = []
    for line in result.stdout.splitlines():
        if not line.strip():
            continue
        path = line[3:].replace("\\", "/")
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        names.append(path)
    return names


def is_coordination_relevant(path: str) -> bool:
    return any(path == prefix or path.startswith(prefix) for prefix in TRACKED_PREFIXES)


def block(reason: str) -> int:
    json.dump({"decision": "block", "reason": reason}, sys.stdout)
    return 0


def main() -> int:
    try:
        payload = json.load(sys.stdin)
    except json.JSONDecodeError:
        return 0

    if payload.get("stop_hook_active"):
        return 0

    root = Path(payload.get("cwd") or ".").resolve()
    plan = root / "PLAN.md"
    progress = root / "PROGRESS.md"
    if not plan.exists() or not progress.exists():
        return block(
            "Multi-agent coordination requires PLAN.md and PROGRESS.md. "
            "Create or restore both files before ending the turn."
        )

    changed = git_status_names(root)
    if not changed:
        return 0

    relevant = [path for path in changed if is_coordination_relevant(path)]
    progress_changed = "PROGRESS.md" in changed
    if relevant and not progress_changed:
        return block(
            "Repository planning, docs, code, tests, requirements, or .codex files changed, "
            "but PROGRESS.md was not updated. Add a concise handoff note so the next agent "
            "can see what changed, what was verified, and what remains next."
        )

    return 0


if __name__ == "__main__":
    raise SystemExit(main())
@@ -1,50 +0,0 @@
#!/usr/bin/env python3
"""Block obviously destructive shell commands before Codex runs them."""

from __future__ import annotations

import json
import re
import sys

BLOCK_PATTERNS = (
    r"\brm\s+-rf\b",
    r"\bgit\s+push\s+--force(?:-with-lease)?\b",
    r"\bgit\s+reset\s+--hard\b",
    r"\bgit\s+clean\s+-[a-zA-Z]*f[a-zA-Z]*[dx][a-zA-Z]*\b",
    r"\bgit\s+checkout\s+--\s+\.\b",
    r"\bDROP\s+TABLE\b",
    r"\btruncate\s+table\b",
    r"\bRemove-Item\b.*\b-Recurse\b",
    r"\bdel\b\s+/s\b",
    r"\bconda\s+(?:env\s+)?remove\b.*\b--all\b",
)


def main() -> int:
    try:
        payload = json.load(sys.stdin)
    except json.JSONDecodeError:
        return 0

    command = payload.get("tool_input", {}).get("command", "")
    for pattern in BLOCK_PATTERNS:
        if re.search(pattern, command, re.IGNORECASE):
            json.dump(
                {
                    "hookSpecificOutput": {
                        "hookEventName": "PreToolUse",
                        "permissionDecision": "deny",
                        "permissionDecisionReason": "Harness guardrail blocked a risky shell command.",
                    }
                },
                sys.stdout,
            )
            return 0
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
@@ -1,55 +0,0 @@
#!/usr/bin/env python3
"""Run repository validation when a Codex turn stops and request one more pass if it fails."""

from __future__ import annotations

import json
import subprocess
import sys
from pathlib import Path


def main() -> int:
    try:
        payload = json.load(sys.stdin)
    except json.JSONDecodeError:
        return 0

    if payload.get("stop_hook_active"):
        return 0

    root = Path(payload.get("cwd") or ".").resolve()
    validator = root / "scripts" / "validate_workspace.py"
    if not validator.exists():
        return 0

    result = subprocess.run(
        [sys.executable, str(validator)],
        cwd=root,
        capture_output=True,
        text=True,
        timeout=240,
    )
    if result.returncode == 0:
        return 0

    summary = (result.stdout or result.stderr or "workspace validation failed").strip()
    if len(summary) > 1200:
        summary = summary[:1200].rstrip() + "..."

    json.dump(
        {
            "decision": "block",
            "reason": (
                "Validation failed. Review the output, fix the repo, then continue.\n\n"
                f"{summary}"
            ),
        },
        sys.stdout,
    )
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
@@ -1,23 +0,0 @@
---
name: conversion-architecture
description: Design PDFtoMD conversion architecture, parser boundaries, internal block models, chunk policy, renderer contracts, output structure, logging, and resume behavior. Use when planning or reviewing conversion engine design.
---
# Conversion Architecture
## Workflow
1. Read `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, `docs/ARCHITECTURE.md`, `docs/CONVERSION_POLICY.md`, and `docs/ADR.md`.
2. Keep responsibilities stable:
   - Marker: layout, OCR, reading order, body, headings, tables, figures, captions
   - Nougat: formula-only LaTeX parsing
   - PyMuPDF: page pre-analysis, text-layer quality, page counts, chunk planning
3. Define interfaces and invariants before implementation.
4. Keep output deterministic and chunked under the documented output contract.
5. Record architecture changes in `docs/ADR.md` when decisions change.
## Guardrails
- Do not place conversion logic in a future PyQt UI.
- Do not add document sidecars unless explicitly requested.
- Do not let chunking split a paragraph, table, figure, or formula without a fallback plan.
@@ -1,4 +0,0 @@
interface:
  display_name: "Conversion Architecture"
  short_description: "Plan parser and renderer boundaries"
  default_prompt: "Use $conversion-architecture to design the next PDFtoMD engine phase."
@@ -1,24 +0,0 @@
---
name: formula-quality
description: Plan and review formula extraction quality for PDFtoMD. Use when Codex needs Nougat handoff rules, inline/block formula classification, LaTeX delimiter checks, equation numbering, reference anchors, or Marker fallback behavior.
---
# Formula Quality
## Workflow
1. Read `AGENTS.md`, `docs/CONVERSION_POLICY.md`, `docs/TOOLCHAIN.md`, and `docs/ADR.md`.
2. Identify formula candidates from Marker equation blocks or mathematical text patterns.
3. Classify formulas as inline or block based on layout context.
4. Validate:
   - `$ ... $` and `$$ ... $$` balance
   - `\begin{...}` / `\end{...}` pairs
   - formula numbering
   - body references such as `Eq. (3)` or Korean equation references
5. Use Marker source text as fallback when Nougat fails.
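The first two validation checks in step 4 can be sketched as small pure functions. These are illustrative only: the function names are hypothetical, the delimiter check ignores code spans and treats `\$` as an escaped dollar sign, and neither function replaces a real LaTeX parser.

```python
import re

ENV_TOKEN = re.compile(r"\\(begin|end)\{([^}]*)\}")


def unbalanced_environments(latex: str) -> list:
    """Report \\begin/\\end tokens that do not pair up in stack order."""
    stack = []
    problems = []
    for match in ENV_TOKEN.finditer(latex):
        kind, name = match.group(1), match.group(2)
        if kind == "begin":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected \\end{{{name}}}")
    problems.extend(f"unclosed \\begin{{{name}}}" for name in reversed(stack))
    return problems


def dollar_delimiters_balanced(markdown: str) -> bool:
    """True when both `$$` and single `$` delimiters occur an even number of times."""
    text = re.sub(r"\\\$", "", markdown)  # drop escaped dollar signs
    block_count = text.count("$$")
    inline_count = text.replace("$$", "").count("$")
    return block_count % 2 == 0 and inline_count % 2 == 0
```

When either check fails, the fallback in step 5 applies: keep the Marker source text instead of emitting the broken formula.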
## Guardrails
- Do not pass whole documents through Nougat as the primary parser.
- Do not discard formula text on parse failure.
- Do not rewrite references as links unless the target confidence is sufficient.
@@ -1,4 +0,0 @@
interface:
  display_name: "Formula Quality"
  short_description: "Validate equations and LaTeX output"
  default_prompt: "Use $formula-quality to design formula parsing tests and fallback behavior."
@@ -1,27 +0,0 @@
---
name: markdown-quality
description: Plan and review Markdown output quality for PDFtoMD. Use when Codex needs tests or policies for headings, tables, HTML fallback, image links, captions, frontmatter, chunk integrity, and deterministic output.
---
# Markdown Quality
## Workflow
1. Read `AGENTS.md`, `docs/PRD.md`, `docs/ARCHITECTURE.md`, and `docs/CONVERSION_POLICY.md`.
2. Prefer focused assertions over full snapshots.
3. Validate:
   - heading hierarchy
   - table parseability
   - limited HTML table fallback
   - image link existence
   - figure/table captions
   - internal references
   - chunk frontmatter
   - deterministic filenames and anchors
4. Use Markdown or HTML parsers when practical.
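Two of the checks in step 3, heading hierarchy and image link existence, lend themselves to focused assertions. The sketch below uses regular expressions for brevity where a real Markdown parser would be more robust. The function names are hypothetical, and "hierarchy" is interpreted here as "no heading skips a level", which is an assumption rather than a documented rule.

```python
import re
from pathlib import Path

HEADING = re.compile(r"^(#{1,6})\s+\S")
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


def heading_level_jumps(markdown: str) -> list:
    """Return (line_number, previous_level, level) for headings that skip a level."""
    jumps = []
    previous = 0
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside code fences
            continue
        match = None if in_fence else HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        if previous and level > previous + 1:
            jumps.append((number, previous, level))
        previous = level
    return jumps


def missing_images(markdown: str, base_dir: Path) -> list:
    """Return local image targets that do not exist relative to base_dir."""
    targets = IMAGE_LINK.findall(markdown)
    local = [target for target in targets if "://" not in target]
    return [target for target in local if not (base_dir / target).is_file()]
```

In a pytest suite these would typically be called once per generated chunk, asserting that both return empty lists.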
## Guardrails
- Do not inject runtime warnings into generated Markdown.
- Do not rely only on brittle whole-file snapshots.
- Do not lose complex table content without linking a fallback asset.
@@ -1,4 +0,0 @@
interface:
  display_name: "Markdown Quality"
  short_description: "Check chunk Markdown and assets"
  default_prompt: "Use $markdown-quality to plan focused Markdown output validation."
@@ -1,23 +0,0 @@
---
name: pdf-toolchain
description: Research and maintain PDFtoMD toolchain compatibility for Marker, Nougat, PyMuPDF, PyTorch/CUDA, model cache, and licensing. Use when Codex needs dependency pins, runtime compatibility checks, official-source research, or updates to docs/TOOLCHAIN.md and related ADRs.
---
# PDF Toolchain
## Workflow
1. Read `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, `docs/TOOLCHAIN.md`, `docs/ARCHITECTURE.md`, and `docs/ADR.md`.
2. Prefer official or primary sources for current facts.
3. Verify local facts with commands when relevant:
   - `.\venv\python.exe -m pip check`
   - `.\venv\python.exe -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"`
   - `.\venv\Scripts\nougat.exe --help`
4. Preserve the verified GTX 1070 Ti baseline unless a replacement is tested.
5. Update `docs/TOOLCHAIN.md` and `docs/ADR.md` when dependency decisions change.
## Guardrails
- Do not upgrade `torch`, `transformers`, `albumentations`, `pypdfium2`, `opencv-python-headless`, `Pillow`, or `fsspec` without re-running compatibility checks.
- Do not switch the primary parser away from Marker without an ADR update.
- Do not download model weights unless the user explicitly asks.
@@ -1,4 +0,0 @@
interface:
display_name: "PDF Toolchain"
short_description: "PDF parser and CUDA dependency guidance"
default_prompt: "Use $pdf-toolchain to verify PDFtoMD dependency compatibility and update toolchain notes."
-27
View File
@@ -1,27 +0,0 @@
---
name: sample-corpus
description: Analyze and maintain the PDFtoMD samples corpus. Use when Codex needs to classify samples/ PDFs, design samples/metadata.json, identify OCR candidates, or connect corpus traits to focused regression tests.
---
# Sample Corpus
## Workflow
1. Read `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, `docs/PRD.md`, and `docs/CONVERSION_POLICY.md`.
2. Inspect PDFs with PyMuPDF before proposing tests.
3. Track these traits per PDF:
   - page count
   - text-layer quality
   - scanned or mixed pages
   - multi-column layout
   - formula density
   - table density
   - figure density
   - Korean filename/path coverage
4. If writing metadata, use `samples/metadata.json` and update `PROGRESS.md`.
## Guardrails
- Preserve original sample PDFs.
- Do not rename Korean sample files unless the user explicitly asks.
- Do not treat first-page text length as the only OCR signal.
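
The last guardrail can be illustrated with a per-page check. This is a minimal sketch: the `PageStats` type, the `ocr_candidate_pages` function, and the 200-character threshold are all hypothetical, and the per-page numbers are assumed to have been collected with PyMuPDF beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageStats:
    """Per-page traits, assumed to be gathered elsewhere (for example with PyMuPDF)."""
    text_chars: int
    image_count: int


def ocr_candidate_pages(pages: list[PageStats], min_chars: int = 200) -> list[int]:
    """Return 1-based page numbers that look scanned: little text but at least one image.

    Every page is evaluated, so a text-rich first page cannot hide scanned pages later on.
    The threshold is an illustrative assumption, not a tuned value.
    """
    return [
        number
        for number, page in enumerate(pages, start=1)
        if page.text_chars < min_chars and page.image_count > 0
    ]
```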
@@ -1,4 +0,0 @@
interface:
display_name: "Sample Corpus"
short_description: "Classify PDF samples for quality tests"
default_prompt: "Use $sample-corpus to audit samples/ PDFs and propose regression metadata."
-23
View File
@@ -1,23 +0,0 @@
---
name: windows-runtime
description: Maintain Windows-native PDFtoMD runtime behavior. Use when Codex needs guidance for repo-local venv, CUDA/OOM handling, Korean paths, long paths, model cache, offline operation, stderr logs, or resume cache behavior.
---
# Windows Runtime
## Workflow
1. Read `AGENTS.md`, `docs/TOOLCHAIN.md`, `docs/ARCHITECTURE.md`, and `docs/CONVERSION_POLICY.md`.
2. Verify environment health with:
   - `.\venv\python.exe -m pip check`
   - CUDA smoke test
   - `.\venv\Scripts\nougat.exe --help`
3. Use `pathlib` for path design and tests.
4. Include Korean filenames, spaces, and long Windows paths in test plans.
5. Keep model cache and offline behavior explicit.
## Guardrails
- Do not silently fall back to CPU when the user explicitly requested CUDA.
- Do not choose batch sizes that assume more than 8 GB VRAM.
- Do not delete local environments or sample PDFs without explicit approval.
@@ -1,4 +0,0 @@
interface:
display_name: "Windows Runtime"
short_description: "Windows, CUDA, paths, and offline checks"
default_prompt: "Use $windows-runtime to verify PDFtoMD local runtime assumptions."
-1
View File
@@ -1 +0,0 @@
*.pdf binary
-21
View File
@@ -1,21 +0,0 @@
# Python environments and caches
venv/
.venv/
.models/
__pycache__/
*.py[cod]
# Test and build artifacts
.pytest_cache/
.mypy_cache/
.ruff_cache/
build/
dist/
*.egg-info/
# Conversion outputs
output/
# Local environment files
.env
*.env
-235
View File
@@ -1,235 +0,0 @@
# Project: PDFtoMD
## Repository Role
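- This repository converts PDF documents into a bundle of Markdown documents that an AI agent can easily navigate and read.
- The goal is not plain text extraction but structured conversion that preserves reading order, paragraph flow, formulas, tables, images, captions, and in-text references.
- Text-layer PDFs, scanned PDFs, and mixed text/scanned PDFs are all in scope.
- The primary goal is a CLI/library conversion engine that runs fully locally on native Windows.
- The secondary goal is a PyQt-based Windows UI and optional external API integration.
- Conversion output consists of chunk Markdown files plus the image/table assets they need.
- Separate document-output sidecar artifacts are out of scope until explicitly requested.
- However, local runtime cache/state files used to resume a conversion and stderr/local log files are not document output, so they may be allowed.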
- Persistent repository instructions live in this `AGENTS.md`.
- Reusable repo-scoped workflows live in `.agents/skills/`.
- Project-scoped custom agents live in `.codex/agents/`.
- Experimental hooks live in `.codex/hooks.json`.
## Read First
When starting a new session or a new agent task, read the following documents first.
- `AGENTS.md`
- `PLAN.md`
- `PROGRESS.md`
- `docs/PRD.md`
- `docs/ARCHITECTURE.md`
- `docs/CONVERSION_POLICY.md`
- `docs/HARNESS.md`
- `docs/IMPLEMENTATION_PLAN.md`
- `docs/ADR.md`
- `docs/TOOLCHAIN.md`
- `docs/UI_GUIDE.md`
## Tech Stack
- **Language**: Python 3.11+
- **Environment**: repo-local single `venv`
- **Primary PDF Parser**: `Marker`
- **Primary Mathematical Expression Parser**: `Nougat`
- **PDF Analysis / Splitting**: `PyMuPDF`
- **OCR / Layout Support**: Prefer Marker's OCR/layout features
- **Pre-analysis**: Use PyMuPDF or similar to determine per-page text-layer quality and whether OCR is needed before conversion
- **Table Handling**: Prefer Markdown tables; for complex tables, allow limited HTML tables and table-region image fallbacks
- **Output Target**: Markdown
- **Runtime**: Windows native, local-first, GPU by default, batch/chunk sizes limited for 8 GB VRAM
- **Verified GPU Baseline**: NVIDIA GeForce GTX 1070 Ti 8GB, `torch==2.7.1+cu126`
- **UI**: `PyQt` is a secondary goal and must be a thin client that calls the CLI/library API
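## Architecture Rules
### Parser Engine Strategy
- Use Marker as the default PDF parser.
- Marker is responsible for overall layout tracking, OCR/layout, reading order, headings, body text, tables, figures, captions, and semantic block roles.
- Nougat is used as the formula parser, not as the main PDF parser.
- Nougat is used only to generate formula strings for Marker equation blocks or for blocks in which a formula pattern is detected.
- If Nougat conversion fails, fall back to the original string extracted by Marker to prevent information loss.
- PyMuPDF is used for page counts, text-layer quality, chunk planning, and low-level PDF/page/image operations.
- For detailed conversion policy, `docs/CONVERSION_POLICY.md` takes precedence.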
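
The Nougat fallback rule can be sketched as follows. This is only an illustration: `formula_or_fallback` and the `parse_formula` callable are hypothetical names standing in for whatever adapter invokes Nougat, and treating an empty result as a failure is an added assumption.

```python
from typing import Callable


def formula_or_fallback(marker_text: str, parse_formula: Callable[[], str]) -> str:
    """Return the formula parser's LaTeX, or Marker's original text if parsing fails.

    `parse_formula` is a hypothetical zero-argument callable wrapping the Nougat call.
    """
    try:
        latex = parse_formula()
    except Exception:
        # Any parser failure falls back to Marker's text so no content is lost.
        return marker_text
    # Assumption: a blank result is treated like a failure.
    return latex if latex.strip() else marker_text
```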
### Reading Order & Paragraph Flow Strategy
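PDFs store text by coordinates, so the core problem is reconstructing it into the logical order in which a person reads.
- **Logical Reading Order**: Track the text flow in multi-column documents and in layouts with inserted passages, and arrange it into Markdown's linear structure.
- **Paragraph Stitching**: Remove the line-level fragmentation introduced by PDF extraction and merge lines into complete paragraphs using context and bounding-box information.
- **Header/Footer Filtering**: Separate or exclude repeated patterns at the top/bottom of pages, page numbers, and running headers/footers from the body flow.
- **Semantic Mapping**: Preserve semantic roles such as headings, body text, block quotes, lists, tables, figures, captions, and formulas.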
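
One simple way to approximate header/footer filtering is to look for lines that repeat at page edges. The sketch below is an assumption-laden illustration (the function name, the first/last-line heuristic, and the 60% ratio are all hypothetical); it does not catch page numbers, which change from page to page.

```python
from collections import Counter


def repeated_edge_lines(pages: list[list[str]], min_ratio: float = 0.6) -> set[str]:
    """Return lines that appear as the first or last line on many pages.

    `pages` holds the text lines of each page in reading order.
    """
    counts: Counter[str] = Counter()
    for lines in pages:
        # Only the first and last non-empty-after-strip lines are considered edges.
        edges = {line.strip() for line in (lines[:1] + lines[-1:]) if line.strip()}
        counts.update(edges)
    # Require at least two occurrences so a one-page document yields nothing.
    threshold = max(2, int(len(pages) * min_ratio))
    return {line for line, seen in counts.items() if seen >= threshold}
```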
### Scientific, Mathematical, Table, Figure Strategy
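- For formulas, prefer LaTeX that can be expressed with Markdown math delimiters.
- Inline formulas use `$ ... $` and block formulas use `$$ ... $$`.
- Preserve equation numbers and their in-text references as far as possible, and connect them with internal Markdown links.
- Validate LaTeX for matched delimiters, matched `\begin{...}` / `\end{...}` pairs, and common breakage patterns, and repair it so that rendering does not break.
- The default CLI formula parser is `nougat`.
- For tables, prefer Markdown tables.
- When a Markdown table would lose too much (merged cells, multiple header rows, tables with footnotes), use HTML `<table>` in a limited way.
- For tables with heavy structural loss, also link a table-region image fallback that preserves the original.
- For images, link the extracted file, caption, figure number, and in-text references together.
- Consider hash-based deduplication to reduce duplicate image storage.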
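
The delimiter and `\begin{...}` / `\end{...}` pairing checks can be sketched as below. The function names are hypothetical, and the check is intentionally shallow: it does not handle escaped dollar signs or inline `$ ... $` spans.

```python
import re

BEGIN_END_RE = re.compile(r"\\(begin|end)\{([^}]*)\}")


def unbalanced_environments(latex: str) -> list[str]:
    """Report \\begin/\\end environments that are unclosed or closed out of order."""
    stack: list[str] = []
    problems: list[str] = []
    for match in BEGIN_END_RE.finditer(latex):
        kind, name = match.group(1), match.group(2)
        if kind == "begin":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected \\end{{{name}}}")
    # Anything left on the stack was opened but never closed.
    problems.extend(f"unclosed \\begin{{{name}}}" for name in reversed(stack))
    return problems


def block_delimiters_balanced(markdown: str) -> bool:
    """True when `$$` occurs an even number of times (a coarse pairing check)."""
    return markdown.count("$$") % 2 == 0
```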
### Chunk Strategy
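- Long PDFs are split into 20-page chunks by default for conversion.
- Chunk boundaries prioritize logical block integrity over page count.
- If a paragraph, table, figure, or formula would be cut in the middle, move that block to the previous or next chunk so that it is preserved intact.
- The top of each chunk Markdown file may include minimal context such as the document title, page range, and chunk number as concise frontmatter.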
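
The page-count part of this strategy can be sketched as a simple planner. `plan_chunks` is a hypothetical name, and the sketch covers only the initial 20-page split; the block-integrity adjustment at chunk boundaries would need parser output and is not shown.

```python
def plan_chunks(page_count: int, chunk_size: int = 20) -> list[tuple[int, int]]:
    """Return inclusive 1-based (first_page, last_page) ranges of at most chunk_size pages."""
    if page_count < 0 or chunk_size < 1:
        raise ValueError("page_count must be >= 0 and chunk_size must be >= 1")
    return [
        (start, min(start + chunk_size - 1, page_count))
        for start in range(1, page_count + 1, chunk_size)
    ]
```

For a 76-page document this yields four ranges, the last one covering pages 61 to 76.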
### Output Structure
```text
output/
└── document-slug/
├── document-slug_001.md
├── document-slug_002.md
├── document-slug_003.md
└── images/
├── document-slug_fig-001.png
└── document-slug_fig-003.png
```
Detailed rules:
- Chunk Markdown filenames use `<slug>_<chunk-index:03d>.md`.
- Image assets go in `images/`.
- If a figure number exists, prefer `{document-slug}_fig-{figure-number}.png`.
- If there is no figure number, or a collision is possible, use a deterministic filename that includes a chunk/page/block identifier.
- The same input with the same options must produce the same output paths, anchors, and asset names.
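
The naming rules can be sketched as two small pure functions. The helper names are hypothetical, and the three-digit zero padding of the figure number is inferred from the example tree rather than stated in the rules.

```python
def chunk_filename(slug: str, chunk_index: int) -> str:
    """Build `<slug>_<chunk-index:03d>.md`."""
    return f"{slug}_{chunk_index:03d}.md"


def figure_asset_path(slug: str, figure_number: int) -> str:
    """Build the relative asset path for a numbered figure.

    Assumption: figure numbers are zero-padded to three digits, as in the example tree.
    """
    return f"images/{slug}_fig-{figure_number:03d}.png"
```

Because both functions depend only on their arguments, the same input and options always give the same paths.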
### Runtime Policy
- The default runtime is `cuda`.
- With an explicit `--runtime cuda` or `--device cuda`, fail fast if CUDA is not ready.
- `--runtime auto` allows CPU fallback, after a warning, when needed.
- For the GTX 1070 Ti 8GB baseline, keep the default batch size conservative, around 1 to 2.
- On GPU OOM, retry with a smaller batch/page unit where possible.
- The model cache must use an explicit local path so that offline execution is supported after the initial download.
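
The fail-fast and fallback rules can be sketched as a device resolver. `resolve_device` is a hypothetical helper, the `cpu` runtime value is an assumption, and CUDA availability is passed in as a flag so the sketch does not depend on PyTorch.

```python
import sys


def resolve_device(runtime: str, cuda_available: bool) -> str:
    """Map a requested runtime to the device actually used."""
    if runtime == "cuda":
        if not cuda_available:
            # An explicit CUDA request never falls back silently.
            raise RuntimeError("CUDA was requested but is not available")
        return "cuda"
    if runtime == "auto":
        if cuda_available:
            return "cuda"
        print("warning: CUDA unavailable, falling back to CPU", file=sys.stderr)
        return "cpu"
    if runtime == "cpu":  # assumed to be an accepted value
        return "cpu"
    raise ValueError(f"unknown runtime: {runtime!r}")
```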
### Logging And Resume Policy
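- The CLI shows per-chunk progress and success/failure status.
- Warnings and errors are written to stderr and to a local log file.
- Error logs are not inserted into the generated Markdown.
- A runtime state/cache may be kept so that successful chunks are skipped on re-run and only failed chunks are resumed.
- The runtime state/cache is not part of the document output contract and is distinct from separate sidecar document artifacts.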
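
The resume rule can be sketched with a plain mapping from chunk identifier to status. The state format, the `"success"` status string, and the function name are hypothetical; how the state is persisted is left out.

```python
def chunks_to_resume(state: dict[str, str], chunk_ids: list[str]) -> list[str]:
    """Return chunk ids that still need work, preserving the planned order.

    `state` maps chunk id to a status string; only "success" is skipped,
    so failed and never-attempted chunks are both retried.
    """
    done = {chunk_id for chunk_id, status in state.items() if status == "success"}
    return [chunk_id for chunk_id in chunk_ids if chunk_id not in done]
```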
### Licensing Policy
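- The current usage context is personal use.
- If distribution or commercial use becomes a possibility, the Marker GPL and model weight license terms must be reviewed again.
- Process/API separation is only a candidate for mitigating license risk and is not treated as a legal conclusion.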
### Git
- Use local git.
- Location: `C:\git\PDFToMDWithMath`
- Commit whenever changes are made.
- Commit messages follow the Conventional Commits format.
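## Development Process
- CRITICAL: When implementing a new feature, always write the tests first and then write an implementation that passes them. (TDD)
- The first implementation stage stabilizes the CLI/library conversion engine; the PyQt UI is split off as a secondary goal.
- Conversion quality tests prefer partial checks (headings, formulas, images, tables, captions, links, absence of exceptions) over whole-Markdown snapshot comparison.
- Do not process complex documents with regular expressions alone; where possible, use a Markdown/HTML parser or a structured parser API.
- PDFs in the `samples/` folder are used as the corpus for regression tests and quality evaluation.
- Traits of the `samples/` PDFs are managed in a sample metadata mapping file.
- For research/planning requests, do not create implementation code, phase files, or custom agent files.
- `scripts/execute.py` tidies up the result commit after a step completes, so there is no need to create a separate commit inside a step prompt.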
## Multi-Agent Coordination
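- When several agents divide the work, **the single source of truth for the work plan is `PLAN.md` at the repo root**, and **the single source of truth for progress is `PROGRESS.md` at the repo root**.
- When starting a new session or a new agent task, always read `AGENTS.md`, `PLAN.md`, and `PROGRESS.md` first.
- Before starting work, a new agent must be able to answer the following questions.
  - What is the current project goal?
  - What is my assigned scope, and what is outside it?
  - What work has already been completed?
  - What work is in progress, and what is blocked?
  - What is the next task to pick up right away?
  - Which files or areas of responsibility could conflict with other agents?
- `PLAN.md` is used as the single work plan document for what remains to be done.
- `PLAN.md` records goals, scope, priorities, task breakdown, the responsible agent or role, dependencies, acceptance criteria, and explicit exclusions.
- Work items in `PLAN.md` are written concretely enough that a freshly started agent can select its own assigned work.
- `PROGRESS.md` is used as the single progress document for how far the work has advanced.
- `PROGRESS.md` records completed work, work in progress, blockers, key decisions, validation results, and the work to continue next.
- The "Next Work" section of `PROGRESS.md` is kept in units that the next agent can actually pick up.
- Before starting work, check the current state in `PROGRESS.md` and the assigned scope in `PLAN.md`, and resolve any duplicate work or potential conflict first.
- If the plan changes during work, update `PLAN.md` first and reflect the actual results in `PROGRESS.md`.
- When finishing or pausing work, update `PROGRESS.md` so that the next agent can take over.
- When several agents work in parallel, record each agent's owned files, area of responsibility, and dependencies clearly in `PLAN.md`.
- During parallel work, record completed/failed/blocked status, validation results, and the next handoff in `PROGRESS.md`.
- If `PLAN.md` or `PROGRESS.md` is missing and multi-agent work is needed, draft both files before starting implementation work, or ask the user for approval to create them.
- The Harness `phases/{phase}/index.json` is for phase execution status; for the overall plan and progress record of multi-agent work, `PLAN.md` and `PROGRESS.md` take precedence.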
## Custom Agent Planning
- Custom agent files go in `.codex/agents/<kebab-case-name>.toml`.
- Agent names use snake_case.
- Agent design work starts with read-only agents by default.
- Only agents that the user has explicitly approved are created, one at a time.
- For research/planning requests, do not create implementation code, phase files, or agent files.
## Codex Project Extensions
Project-scoped Codex extensions live under `.codex/`.
### Agents
- `.codex/agents/phase-planner.toml`: Harness phase planning.
- `.codex/agents/harness-reviewer.toml`: Harness repository review.
- `.codex/agents/pdf-toolchain-researcher.toml`: PDF toolchain compatibility and licensing research.
- `.codex/agents/sample-corpus-analyst.toml`: sample PDF corpus analysis.
- `.codex/agents/conversion-architect.toml`: conversion pipeline architecture.
- `.codex/agents/quality-evaluator.toml`: focused quality and regression strategy.
- `.codex/agents/formula-pipeline-specialist.toml`: Nougat/formula pipeline analysis.
- `.codex/agents/layout-table-figure-specialist.toml`: reading order, table, figure, and caption analysis.
### Commands
- `.codex/commands/status.md`: summarize plan/progress/blockers/next work.
- `.codex/commands/env-check.md`: verify Python environment, CUDA, and Nougat CLI.
- `.codex/commands/sample-audit.md`: inspect `samples/` PDF traits.
- `.codex/commands/quality-plan.md`: draft focused pytest strategy.
- `.codex/commands/conversion-policy-review.md`: review policy/architecture/ADR consistency.
- `.codex/commands/model-cache-check.md`: inspect model cache and offline readiness.
- `.codex/commands/phase-draft.md`: draft Harness phase steps.
- `.codex/commands/sprint-contract.md`: draft or review step-level generator/evaluator contracts.
### Skills
- `.codex/skills/pdf-toolchain`: dependency, CUDA, model cache, and license workflow.
- `.codex/skills/sample-corpus`: sample metadata and corpus classification workflow.
- `.codex/skills/conversion-architecture`: parser/renderer/chunk architecture workflow.
- `.codex/skills/formula-quality`: formula parsing and LaTeX validation workflow.
- `.codex/skills/markdown-quality`: Markdown output validation workflow.
- `.codex/skills/windows-runtime`: Windows path, CUDA, model cache, and resume workflow.
### Hooks
- `.codex/hooks/pre_tool_use_policy.py`: blocks high-risk shell commands.
- `.codex/hooks/stop_continue.py`: runs repository validation at stop.
- `.codex/hooks/handoff_policy.py`: enforces `PLAN.md`/`PROGRESS.md` handoff discipline.
- `.codex/hooks/drift_policy.py`: catches high-confidence docs/toolchain/sample metadata drift.
- Hooks are configured in `.codex/hooks.json`; native Windows hook execution may depend on the active Codex surface.
## Harness Workflow
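- For the detailed Harness operating rules, `docs/HARNESS.md` takes precedence.
- The Anthropic-style long-running workflow defaults to a `planner -> generator -> evaluator` separation of roles.
- The Planner creates phases/steps, the Generator performs only one step at a time, and the Evaluator applies hard thresholds from a perspective independent of the implementing agent.
- Each implementation step must state a "Sprint Contract" before any code is written. The contract must include the definition of done, hard thresholds, owned files, dependencies, and validation commands.
- Agreements and review results between Generator and Evaluator are recorded in files, not left only in conversation. The default locations are `phases/{phase}/stepN.md`, `phases/{phase}/index.json`, and `PROGRESS.md`.
- Hooks are an aid and do not replace the Acceptance Criteria in `stepN.md` or the Evaluator's judgment.
- Keep the Harness itself simple. Add a new agent, hook, or command only when it reduces a real failure mode.
- First read `docs/PRD.md`, `docs/ARCHITECTURE.md`, `docs/CONVERSION_POLICY.md`, `docs/ADR.md`, and `docs/TOOLCHAIN.md` to understand the planning and design intent.
- If a step-by-step execution plan is needed, use the repo skill `harness-workflow` to design the files under `phases/`.
- If a review of changes is needed, use the repo skill `harness-review` or Codex's `/review`.
- `phases/{phase}/index.json` is treated as the single source of truth for phase progress.
- Each `stepN.md` is written to be self-contained so that it can be executed in an independent Codex session.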
## Validation
- The default validation script is `python scripts/validate_workspace.py`.
- Python validation must include at least the following.
  - `.\venv\python.exe -m pip check`
  - `.\venv\python.exe -m pytest`
  - CUDA runtime/import smoke test
  - `.\venv\Scripts\nougat.exe --help`
- For a Node project, the `lint`, `build`, and `test` scripts in `package.json` are detected automatically and run in that order.
- For any other stack, specify the validation commands in the `HARNESS_VALIDATION_COMMANDS` environment variable, one per line.
## Commands
- `conda create -p .\venv python=3.11 -y`: create the repo-local Python 3.11 environment
- `.\venv\python.exe -m pip install -r requirements.txt`: install the verified single-environment dependencies
- `python scripts/validate_workspace.py`: validate the repository
- `python scripts/execute.py <phase-dir>`: run a phase sequentially with Codex
- `python scripts/execute.py <phase-dir> --push`: push the branch after the phase completes
- `python -m pdftomd <input.pdf> --formula-parser nougat --nougat-command .\venv\Scripts\nougat.exe`: default conversion path using the Nougat formula parser
- `python -m pdftomd <input.pdf> --formula-parser marker`: compatibility conversion path that keeps Marker's formula strings without Nougat
-151
View File
@@ -1,151 +0,0 @@
# PDFtoMD Multi-Agent Plan
## Goal
Build a Windows-native, local-first PDF-to-Markdown conversion engine that preserves logical reading order, paragraph flow, formulas, tables, figures, captions, and chunked output for AI-agent consumption.
## Current Scope
- Primary deliverable: CLI/library conversion engine.
- Primary PDF parser: Marker.
- Formula parser: Nougat, isolated from the main parser path and used only for mathematical expressions/formulas.
- PDF analysis and chunk planning: PyMuPDF.
- Output: chunked Markdown files plus image/table assets under a document slug directory.
- Default chunk size: 20 pages.
- Runtime target: Windows 10, local GPU first, GTX 1070 Ti with 8 GB VRAM.
- User context: personal use.
- Python environment target: one repo-local Python 3.11 environment.
## Out of Scope For Now
- PyQt UI implementation.
- Hosted conversion API.
- Default LLM correction path.
- Sidecar metadata/log output unless explicitly requested.
- Custom agent file creation until the user approves one agent at a time.
- Engine implementation outside an approved Harness phase.
## Current Inputs
- Repository instructions: `AGENTS.md`.
- Product/design documents: `docs/PRD.md`, `docs/ARCHITECTURE.md`, `docs/ADR.md`, `docs/TOOLCHAIN.md`, `docs/UI_GUIDE.md`.
- Conversion policy decisions: `docs/CONVERSION_POLICY.md`.
- Harness operating guide: `docs/HARNESS.md`.
- Full implementation roadmap: `docs/IMPLEMENTATION_PLAN.md`.
- Executable phase registry: `phases/index.json`.
- Sample corpus:
- `samples/2007쉘구조물의유한요소해석에대하여.pdf`
- `samples/FourNodeQuadrilateralShellElementMITC4.pdf`
- `samples/MITC공부.pdf`
- `samples/유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf`
## Research Tracks
1. Toolchain research
- Verify Marker, Nougat, PyMuPDF, PyTorch/CUDA, Pandas, and Markdown/math-rendering constraints.
- Track licensing risks. Current use is personal, but revisit if distribution or commercial use becomes relevant.
- Compare Marker-first architecture against PyMuPDF4LLM, Docling, and MinerU as quality baselines only.
- Keep `docs/TOOLCHAIN.md` updated when dependency pins or compatibility findings change.
2. Conversion architecture
- Define stable internal document/block types.
- Keep Marker document/block structure as the main source for headings, body text, reading order, figures, tables, and captions.
- Treat Nougat output as formula text input subject to validation and fallback policy.
- Keep PyMuPDF responsible for page counts, chunk planning, and low-level PDF/page operations.
- Follow `docs/CONVERSION_POLICY.md` for OCR decisions, parser handoff rules, fallback behavior, chunk boundary handling, logging, and resume policy.
3. Quality and regression strategy
- Prefer focused assertions over full Markdown snapshots.
- Validate headings, formula delimiters, begin/end pairs, table shape, image links, captions, and no-exception conversion.
- Include Korean filenames and Windows paths in regression coverage.
- Include VRAM pressure and long-document chunking scenarios.
4. Runtime strategy
- Use repo-local Python environments.
- Use a single `venv` for Marker/PyMuPDF/Pandas/tests and Nougat.
- Use CUDA-enabled PyTorch compatible with the installed NVIDIA driver and GTX 1070 Ti.
- Current verified PyTorch choice is `torch==2.7.1+cu126`, because newer `torch==2.11.0+cu128` does not support GTX 1070 Ti `sm_61`.
- Keep Nougat dependency pins explicit inside the unified environment, especially `transformers==4.57.6`, `albumentations==1.3.1`, `pypdfium2==4.30.0`, `opencv-python-headless==4.11.0.86`, `Pillow==10.4.0`, and `fsspec==2026.2.0`.
## Created Project Agent Roles
The user approved creating the project-scoped Codex extensions on 2026-04-30. These read-only agents now live under `.codex/agents/`.
1. `pdf_toolchain_researcher`
- Read-only.
- Owns official-doc research for Marker, Nougat, PyMuPDF, PyTorch/CUDA, Markdown math, and comparison tools.
- Outputs compatibility notes, licensing notes, and recommended dependency constraints.
2. `conversion_architect`
- Read-only at first.
- Owns engine boundaries, internal data contracts, chunk policy, adapter interfaces, and output contract.
- Outputs phase-ready architecture notes and acceptance criteria.
3. `quality_evaluator`
- Read-only at first.
- Owns sample-corpus classification, focused quality checks, regression fixtures, and failure taxonomy.
- Outputs test strategy before implementation begins.
4. `formula_pipeline_specialist`
- Read-only at first.
- Owns Nougat integration assumptions, formula extraction boundaries, LaTeX delimiter validation, and fallback policy.
5. `layout_table_figure_specialist`
- Read-only at first.
- Owns reading order, paragraph stitching, table rendering, figure extraction, caption linking, and cross-reference preservation.
6. `sample_corpus_analyst`
- Read-only at first.
- Owns sample PDF corpus analysis, OCR-candidate identification, metadata schema suggestions, and regression implications.
## Future Agent Roles
- `marker_adapter_worker`: implementation worker for Marker adapter code, after TDD phase approval.
- `markdown_renderer_worker`: implementation worker for Markdown renderer and output contract, after TDD phase approval.
- `runtime_cli_worker`: implementation worker for CLI/runtime/device behavior, after TDD phase approval.
- `test_fixture_worker`: implementation worker for sample metadata and focused pytest fixtures, after TDD phase approval.
## Harness Execution Model
This project now follows a file-based planner/generator/evaluator workflow for long-running work.
1. Planner creates or updates `phases/` steps from `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, and `docs/*.md`.
2. Generator executes one `stepN.md` at a time and stays inside that step's owned files and Do Not list.
3. Evaluator reviews the result against the step's Sprint Contract and hard thresholds before the work is considered complete.
4. Communication and handoff happen through files, not only chat:
- `PLAN.md` for overall work plan.
- `PROGRESS.md` for current state and next handoff.
- `phases/{phase}/index.json` for step execution status.
- `docs/HARNESS.md` for role and contract rules.
## Active Phase Plan
Phase registry:
- `phases/index.json`
Full phase roadmap:
1. `0-harness-foundation`: sample metadata, core models, PyMuPDF pre-analysis contract, Markdown quality gates.
2. `1-core-runtime-contracts`: input normalization, conversion options, output bundle contract, runtime cache policy.
3. `2-marker-adapter`: Marker invocation, OCR plan handoff, block normalization, parser failure reporting.
4. `3-formula-pipeline`: formula candidate detection, Nougat command adapter, LaTeX validation/repair, formula reference links.
5. `4-semantic-enrichment`: reading-order checks, paragraph stitching, header/footer filtering, figure/table/formula reference indexing.
6. `5-markdown-rendering-assets`: block renderer, table renderer/fallbacks, figure asset writer, chunk renderer.
7. `6-cli-runtime-resume`: CLI options, progress/logging, resume state, CUDA/OOM policy, model cache/offline support.
8. `7-mvp-quality-hardening`: sample smoke conversions, quality metrics, regression thresholds, MVP fix sweep.
9. `8-release-docs-packaging`: usage docs, environment bootstrap docs, license checkpoint, local release checklist.
10. `9-pyqt-thin-client`: UI API contract, PyQt shell, UI progress/resume, UI packaging notes.
Detailed phase goals and dependencies are recorded in `docs/IMPLEMENTATION_PLAN.md`. Executable step contracts live under each `phases/{phase}/stepN.md`.
## Priority Order
1. Execute `phases/0-harness-foundation/step0.md` only after the user wants implementation to begin.
2. Keep each implementation step inside its Sprint Contract and TDD requirements.
3. Review each completed phase before starting the next phase.
4. Treat PyQt and external API work as post-MVP unless the user explicitly changes scope.
## Acceptance Criteria For Planning Stage
- `PLAN.md` and `PROGRESS.md` exist and reflect the current goal.
- `docs/CONVERSION_POLICY.md` records parser, OCR, formula, table, figure, chunk, runtime, logging, resume, and quality-test policy decisions.
- `docs/TOOLCHAIN.md` records verified local dependency and compatibility decisions.
- Environment decision is recorded.
- `requirements.txt` records the verified single-environment dependency pins.
- `docs/HARNESS.md` records the planner/generator/evaluator workflow.
- `docs/IMPLEMENTATION_PLAN.md` records the full phase roadmap.
- `phases/index.json` and `phases/*/stepN.md` provide executable self-contained tickets.
- No custom agent file is created without explicit user approval.
- Repository validation command remains runnable:
```bash
python scripts/validate_workspace.py
```
-177
View File
@@ -1,177 +0,0 @@
# PDFtoMD Progress
## Current Status
- Date: 2026-04-30.
- Mode: Phase 1 implementation complete; ready for Phase 1 review or Phase 2 handoff.
- Implementation status: Phase 0 and Phase 1 complete.
- Custom agent creation: project-scoped read-only agents created after user approval.
- Persistent multi-agent coordination files were approved by the user.
## Completed
- Read repository instructions and project documents:
- `AGENTS.md`
- `docs/PRD.md`
- `docs/ARCHITECTURE.md`
- `docs/ADR.md`
- `docs/UI_GUIDE.md`
- Confirmed project direction:
- Windows-native local CLI/library engine first.
- Marker for document structure, reading order, tables, figures, headings, and captions.
- Nougat only for mathematical expressions/formulas.
- PyMuPDF for PDF analysis and chunk planning.
- Markdown chunk output plus image/table assets.
- Recorded detailed conversion policy decisions in `docs/CONVERSION_POLICY.md`.
- Strengthened project documentation with current research and decisions:
- `AGENTS.md`
- `README.md`
- `docs/PRD.md`
- `docs/ARCHITECTURE.md`
- `docs/ADR.md`
- `docs/TOOLCHAIN.md`
- `docs/UI_GUIDE.md`
- Strengthened `AGENTS.md` multi-agent coordination rules so every new agent reads `PLAN.md` and `PROGRESS.md` first and can identify current goals, assigned scope, completed work, blockers, next work, and conflict risks.
- Created project-scoped Codex extensions under `.codex/`:
- Agents: `pdf-toolchain-researcher`, `sample-corpus-analyst`, `conversion-architect`, `quality-evaluator`, `formula-pipeline-specialist`, `layout-table-figure-specialist`.
- Commands: `status`, `env-check`, `sample-audit`, `quality-plan`, `conversion-policy-review`, `model-cache-check`, `phase-draft`.
- Skills: `pdf-toolchain`, `sample-corpus`, `conversion-architecture`, `formula-quality`, `markdown-quality`, `windows-runtime`.
- Hooks: strengthened risky command guard, added handoff policy and drift policy hooks.
- Validated `.codex` extension formats:
- Agent TOML files parsed successfully.
- `.codex/hooks.json` parsed successfully.
- Hook Python scripts compiled successfully.
- All `.codex/skills/*/SKILL.md` files passed `skill-creator` quick validation.
- Confirmed user environment:
- Windows 10.
- NVIDIA GeForce GTX 1070 Ti.
- 8 GB VRAM.
- NVIDIA driver 577.00.
- `nvidia-smi` reports CUDA runtime capability 12.9.
- User reports CUDA 12.4 installed.
- Current detected Python: Miniforge Python 3.12.7.
- Conda is available.
- `uv` is not available.
- Created repo-local environment:
- `venv`: Python 3.11.15, unified Marker/PyMuPDF/Pandas/test/Nougat environment.
- Removed previous experimental `venv-nougat` directory after unified `venv` validation passed.
- Verified unified environment:
- `torch==2.7.1+cu126`
- `torchvision==0.22.1+cu126`
- `marker-pdf==1.10.2`
- `nougat-ocr==0.1.17`
- `transformers==4.57.6`
- `albumentations==1.3.1`
- `fsspec==2026.2.0`
- `pymupdf==1.27.2.3`
- `pandas==3.0.2`
- `pytest==9.0.3`
- `Pillow==10.4.0`
- `pypdfium2==4.30.0`
- `opencv-python-headless==4.11.0.86`
- `pip check`: passed.
- CUDA tensor operation on GTX 1070 Ti: passed.
- `venv\Scripts\nougat.exe --help`: passed.
- Ran earlier repository validation before default Python test discovery was added:
- `python scripts/validate_workspace.py`: passed at that time with no configured validation commands.
- Confirmed sample PDFs:
- `samples/2007쉘구조물의유한요소해석에대하여.pdf`: 13 pages, first page text length 3523, first page images 0.
- `samples/FourNodeQuadrilateralShellElementMITC4.pdf`: 7 pages, first page text length 3269, first page images 0.
- `samples/MITC공부.pdf`: 13 pages, first page text length 226, first page images 2.
- `samples/유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf`: 76 pages, first page text length 446, first page images 10.
- Strengthened the project for Anthropic-style Harness Engineering:
- Added `docs/HARNESS.md` with planner/generator/evaluator roles, file protocol, Sprint Contract template, evaluator hard thresholds, and simplification rules.
- Added executable phase registry `phases/index.json`.
- Added first self-contained phase `phases/0-harness-foundation/` with four pending steps:
- `sample-metadata-contract`
- `core-package-skeleton`
- `page-preanalysis-contract`
- `markdown-quality-gates`
- Updated `AGENTS.md`, `PLAN.md`, `README.md`, `docs/ARCHITECTURE.md`, and `docs/ADR.md` to reference the Harness workflow.
- Added `.codex/commands/sprint-contract.md`.
- Strengthened Harness workflow/review skill guidance to require Sprint Contracts.
- Updated hooks for simpler Windows-friendly command paths and expanded handoff checks to include `phases/`, `scripts/`, `.agents/`, and `plugins/`.
- Made `scripts/validate_workspace.py` discover repo-local Python validation by default.
- Added `scripts/test_validate_workspace.py` and fixed `scripts/test_execute.py` UTF-8 fixture handling on Windows.
- Established the full phase-by-phase implementation roadmap before starting engine implementation:
- Added `docs/IMPLEMENTATION_PLAN.md`.
- Expanded `phases/index.json` from Phase 0 only to Phases 0 through 9.
- Added executable pending step contracts for:
- `1-core-runtime-contracts`
- `2-marker-adapter`
- `3-formula-pipeline`
- `4-semantic-enrichment`
- `5-markdown-rendering-assets`
- `6-cli-runtime-resume`
- `7-mvp-quality-hardening`
- `8-release-docs-packaging`
- `9-pyqt-thin-client`
- Updated `PLAN.md`, `AGENTS.md`, and `README.md` to point new agents to the full implementation roadmap.
- Implemented Phase 0 Harness foundation:
- Step 0 `sample-metadata-contract`: added deterministic `samples/metadata.json` and metadata contract tests.
- Step 1 `core-package-skeleton`: added `pyproject.toml`, importable `src/pdftomd` package, typed model contracts, and model tests.
- Step 2 `page-preanalysis-contract`: added PyMuPDF-only `analyze_pdf()` preanalysis, deterministic OCR candidate logic, and chunk candidate tests.
- Step 3 `markdown-quality-gates`: added focused Markdown quality gates and tests for math delimiters, LaTeX environments, image links, tables, chunk frontmatter, and anchors.
- Parallel work was split by disjoint write scopes: sample metadata/model contracts first, then preanalysis/quality gates.
- Reviewed Phase 0 with `harness-review` criteria:
- No blocking findings.
- Architecture boundary remained intact: Marker and Nougat are not invoked in foundation contracts, and PyMuPDF is limited to page pre-analysis.
- `python scripts\validate_workspace.py` passed before Phase 1 work started.
- Implemented Phase 1 core runtime contracts:
- Step 0 `input-normalization-slug`: added deterministic PDF path normalization, document identity creation, anchors, and output bundle path contracts.
- Step 1 `conversion-options-config`: added typed conversion options, runtime modes, and formula parser options without CLI parsing.
- Step 2 `output-bundle-contract`: added deterministic document bundle paths while keeping runtime artifacts separate from document output.
- Step 3 `runtime-cache-policy`: added explicit `.models/` default cache policy, `PDFTOMD_MODEL_CACHE` override, Hugging Face offline environment mappings, and runtime artifact paths.
- Updated `docs/TOOLCHAIN.md` and `.gitignore` for model cache policy.
## Web Research Notes
- Marker currently supports Markdown/JSON/chunks/HTML output and includes tables, equations, inline math, image extraction, layout, and reading-order functionality.
- Nougat is the intended isolated formula parser candidate; Windows GPU use depends on a correct PyTorch install.
- PyMuPDF remains appropriate for page counting, PDF splitting/chunk planning, and low-level image/page operations.
- PyMuPDF4LLM, Docling, and MinerU are useful comparison baselines but are not the primary parser under the current architecture.
- MathJax notes that `$...$` inline math can conflict with ordinary dollar signs, so delimiter validation is required.
## In Progress
- None.
## Blockers
- None yet.
## Decisions
- Personal-use context lowers immediate licensing risk, but Marker GPL/model license implications must be revisited before redistribution or commercial use.
- Mixed text/scanned PDFs are in scope, with page-level OCR intervention decisions based on lightweight text-layer quality analysis.
- Marker owns layout, reading order, body text, headings, tables, figures, captions, and OCR/layout handling.
- Nougat owns only mathematical expressions and formula blocks, with Marker text fallback on failure.
- Markdown tables are preferred, but limited HTML tables and table-region screenshot fallbacks are allowed for complex tables.
- Figure/table/formula numbers and body references should become internal Markdown links when confidence is sufficient.
- Chunking should prefer logical block boundaries over strict 20-page boundaries when a block would be split.
- Chunk Markdown may include concise frontmatter with core context, but document-output sidecars remain out of scope by default.
- CLI should write warnings/errors to stderr and local logs, not into generated Markdown.
- Resume support may use local runtime state/cache files to skip successful chunks.
- Custom agents will be created later, only one at a time after explicit user approval.
- Planning files are the source of truth for multi-agent coordination.
- Harness phase files now exist. `PLAN.md` remains the overall plan, `PROGRESS.md` remains the handoff state, and `phases/{phase}/index.json` is the phase execution status.
- Each future implementation step should use the `docs/HARNESS.md` planner/generator/evaluator workflow and include a Sprint Contract before code changes.
- Full implementation sequencing is recorded in `docs/IMPLEMENTATION_PLAN.md`; phase files are pending tickets and should not be executed out of dependency order.
- Phase 0 and Phase 1 are complete. `phases/index.json` marks both `0-harness-foundation` and `1-core-runtime-contracts` as completed.
- Main and Nougat dependencies can share one environment when Nougat's loose dependencies are pinned explicitly.
- `torch==2.11.0+cu128` was rejected for this machine because it does not support GTX 1070 Ti `sm_61`.
- `torch==2.7.1+cu126` was selected because it satisfies Marker `torch>=2.7.0` and successfully runs CUDA tensor operations on GTX 1070 Ti.
- `nougat-ocr==0.1.17` requires dependency pins:
  - `transformers==4.57.6`, because `transformers 5.7.0` breaks Nougat imports.
  - `albumentations==1.3.1`, because `albumentations 2.x` breaks Nougat transform initialization.
  - `fsspec==2026.2.0`, because newer `fsspec` conflicts with `datasets`.
  - `pypdfium2==4.30.0`, `opencv-python-headless==4.11.0.86`, and `Pillow==10.4.0`, because Marker/Surya depend on these versions and Nougat can operate with them.
## Next Work
1. Review Phase 1 output with `harness-review` before moving to Phase 2.
2. If review passes, start `phases/2-marker-adapter/step0.md`.
3. Execute phases in order unless `PLAN.md` and `docs/IMPLEMENTATION_PLAN.md` are updated with a clear dependency rationale.
4. Do not create new custom agents unless the user explicitly approves another agent.
## Latest Validation
- `.\venv\python.exe -m pytest scripts\test_validate_workspace.py`: passed, 7 tests.
- `.\venv\python.exe -m py_compile scripts\execute.py scripts\validate_workspace.py .codex\hooks\*.py`: passed.
- JSON parse check for `phases/index.json`, `phases/0-harness-foundation/index.json`, and `.codex/hooks.json`: passed.
- Phase structure check for all `stepN.md` files: passed.
- `.codex/commands/*.md` frontmatter check: passed.
- `python scripts\validate_workspace.py`: passed, 103 tests after Phase 1 implementation.
- `.\venv\python.exe -c "import pdftomd; print(pdftomd.__name__)"`: passed after adding editable package metadata.
-63
View File
@@ -1,63 +0,0 @@
# PDFtoMD
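PDFtoMD is a local-first conversion engine that turns PDFs focused on mathematics, engineering, and mechanics into Markdown document bundles that AI agents can read easily.
The goal is not plain text extraction but a structured conversion that preserves the source document's reading order, paragraph flow, formulas, tables, figures, captions, and in-text references.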
## Status
- Current phase: Harness foundation planning.
- Implementation: not started.
- Primary target: Windows 10 native CLI/library engine.
- UI: future PyQt thin client.
## Core Direction
- Marker handles document structure, reading order, OCR/layout, body text, tables, figures, headings, and captions.
- Nougat handles only mathematical expressions and formula blocks.
- PyMuPDF handles lightweight page analysis, text-layer quality checks, page counts, chunk planning, and low-level PDF operations.
- Mixed text/scanned PDFs are in scope.
- Output is chunked Markdown plus image/table assets under a document slug directory.
## Environment
Use one repo-local Python 3.11 environment.
```powershell
conda create -p .\venv python=3.11 -y
.\venv\python.exe -m pip install -r requirements.txt
```
Verified local baseline:
- Windows 10
- NVIDIA GeForce GTX 1070 Ti, 8 GB VRAM
- NVIDIA driver 577.00
- PyTorch `2.7.1+cu126`
- Marker `1.10.2`
- Nougat OCR `0.1.17`
## Verification
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pip check
.\venv\python.exe -c "import torch; x=torch.ones((1,), device='cuda'); print(torch.__version__, torch.version.cuda, x.item())"
.\venv\Scripts\nougat.exe --help
```
`scripts/validate_workspace.py` now discovers repo-local Python validation by default. It prefers `.\venv\python.exe`, compiles Harness scripts, and runs `scripts/test_*.py` with pytest unless `HARNESS_VALIDATION_COMMANDS` or npm scripts override discovery.
## Important Documents
- `AGENTS.md`: persistent repository instructions.
- `PLAN.md`: multi-agent planning state.
- `PROGRESS.md`: multi-agent progress state.
- `phases/`: executable Harness phase tickets.
- `docs/PRD.md`: product requirements.
- `docs/ARCHITECTURE.md`: engine architecture.
- `docs/CONVERSION_POLICY.md`: detailed conversion decisions.
- `docs/HARNESS.md`: planner/generator/evaluator Harness workflow.
- `docs/IMPLEMENTATION_PLAN.md`: full phase-by-phase implementation roadmap.
- `docs/ADR.md`: architecture decision records.
- `docs/TOOLCHAIN.md`: toolchain and dependency notes.
- `docs/UI_GUIDE.md`: future PyQt UI guidance.
## Sample Corpus
The `samples/` directory is used for quality evaluation and regression tests. Current sample PDFs include Korean filenames, engineering/mechanics documents, formulas, figures, and a long 76-page document.
Before implementation, create a sample metadata mapping file that tags each PDF by text-layer quality, scanned pages, multi-column layout, formula density, table density, figure density, and Korean filename coverage.
-142
View File
@@ -1,142 +0,0 @@
# Architecture Decision Records
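## Philosophy
Core values of the project:
- Accurate formula conversion
- Local operation
- Memory-efficient operation
- A deterministic Markdown bundle that AI agents can navigate easily
- Preservation of source structure and reference relationships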
---
## ADR-001: Marker-first document parsing
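**Decision**: Use Marker as the default PDF parser.
**Rationale**:
- Marker is well suited to tracking document structure, including layout, OCR, reading order, tables, figures, captions, and headings.
- The project goal is not plain text extraction but reconstructing the source's logical structure as Markdown.
**Trade-offs**:
- Marker dependencies and model weights must be managed.
- If distribution becomes likely, the GPL and model licenses must be reviewed.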
---
## ADR-002: Nougat as formula-only parser
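**Decision**: Use Nougat only as a parser for formulas and mathematical expressions, not as a whole-PDF parser.
**Rationale**:
- Nougat is strong at converting formulas in academic documents to LaTeX.
- Marker must own the overall document structure so that reading order, tables, figures, and caption paths stay consistent.
**Trade-offs**:
- A handoff/fallback layer is needed to connect Marker blocks with Nougat results.
- When Nougat fails, Marker's original text string must be used as the fallback.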
---
## ADR-003: PyMuPDF page pre-analysis and chunk planning
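**Decision**: Use PyMuPDF for page counts, text-layer quality, OCR need detection, chunk planning, and low-level PDF operations.
**Rationale**:
- Fast page-level analysis is needed before running the heavy parsers.
- Mixed PDFs require a per-page decision on OCR intervention.
- Long PDFs are split into chunks with a 20-page target while taking logical block boundaries into account.
**Trade-offs**:
- An adapter is needed to reconcile PyMuPDF analysis results with Marker layout results.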
---
## ADR-004: Single Python 3.11 environment
**Decision**: Use a single repo-local Python 3.11 `venv`.
**Rationale**:
- It simplifies the development and execution paths.
- Marker and Nougat work together in one environment when explicit dependency pins are in place.
**Verified key pins**:
- `torch==2.7.1+cu126`
- `torchvision==0.22.1+cu126`
- `marker-pdf==1.10.2`
- `nougat-ocr==0.1.17`
- `transformers==4.57.6`
- `albumentations==1.3.1`
- `pypdfium2==4.30.0`
- `opencv-python-headless==4.11.0.86`
- `Pillow==10.4.0`
- `fsspec==2026.2.0`
**Trade-offs**:
- Nougat's loose dependency bounds mean the requirements pins must be kept strict.
- The latest PyTorch cannot be adopted unconditionally. `torch==2.7.1+cu126` is used because GTX 1070 Ti `sm_61` support is required.
---
## ADR-005: Markdown bundle output without document sidecars by default
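**Decision**: Limit default output to chunk Markdown files and an asset directory.
**Rationale**:
- Prioritize artifacts that AI agents can read and navigate easily.
- Separate sidecar artifacts do not widen the scope until the user explicitly requests them.
**Trade-offs**:
- Conversion diagnostics must be kept separate from document output.
- Runtime logs, state, and cache are allowed but must be distinguished from the document output contract.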
---
## ADR-006: Focused quality assertions over full snapshots
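**Decision**: Prefer focused assertions over full Markdown snapshot comparison.
**Rationale**:
- PDF conversion output is sensitive to line breaks, spacing, and parser versions.
- The core of quality is headings, formulas, tables, images, captions, links, chunk integrity, and the absence of exceptions.
**Trade-offs**:
- Test design becomes more fine-grained.
- A sample metadata mapping is required.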
---
## ADR-007: Runtime fallback policy
**Decision**:
- An explicit `--runtime cuda` or `--device cuda` fails fast when CUDA fails.
- `--runtime auto` warns and then allows CPU fallback.
- On GPU OOM, retry with a smaller batch/page unit where possible.
**Rationale**:
- When the user explicitly requests CUDA, a silent switch to CPU causes unpredictable delays.
- Auto mode should provide flexible execution.
**Trade-offs**:
- Runtime state and error reporting are required.
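The fallback rules in this ADR can be sketched as a small device-resolution helper. This is a minimal illustration, not the project's implementation: the function name `resolve_device`, the `CudaUnavailableError` class, and the `cpu` option are assumptions, and CUDA availability is passed in as a flag so the sketch runs without PyTorch.

```python
import sys


class CudaUnavailableError(RuntimeError):
    """Hypothetical error for an explicit CUDA request that cannot be met."""


def resolve_device(requested: str, cuda_available: bool) -> str:
    """Return the device to use for a requested runtime (hypothetical helper)."""
    if requested == "cuda":
        if not cuda_available:
            # Explicit CUDA request: fail fast instead of silently using the CPU.
            raise CudaUnavailableError("CUDA was requested but is not available")
        return "cuda"
    if requested == "auto":
        if cuda_available:
            return "cuda"
        # Auto mode: warn on stderr, then allow the CPU fallback.
        print("warning: CUDA unavailable, falling back to CPU", file=sys.stderr)
        return "cpu"
    if requested == "cpu":  # assumed option; the ADR only names cuda and auto
        return "cpu"
    raise ValueError(f"unknown runtime: {requested!r}")
```

In practice the flag would probably come from `torch.cuda.is_available()`, although running a small tensor operation on the device is a stronger check on older GPUs, where a build can import and still fail at runtime.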
---
## ADR-008: Future PyQt UI as thin client
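**Decision**: The PyQt UI does not implement the conversion engine itself; it is a thin client that calls the CLI/library API.
**Rationale**:
- The first goal is stabilizing the CLI/library engine.
- Separating UI and core-engine responsibilities makes testing and maintenance easier.
**Trade-offs**:
- The core API contract must be stabilized before UI design.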
---
## ADR-009: File-based planner/generator/evaluator Harness
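**Decision**: Manage long-running work with a Harness workflow that uses `planner -> generator -> evaluator` role separation and file-based handoff.
**Rationale**:
- The PDF conversion engine is long-running work in which parser, OCR, formulas, tables, figures, runtime, and tests are intertwined, so consistency is hard to maintain in a single conversation.
- Small self-contained phase steps make it easy for a new agent to pick up work with fresh context.
- Separating the implementation agent from the evaluation agent reduces self-evaluation bias and enforces verification based on hard thresholds.
- Handoff through `PLAN.md`, `PROGRESS.md`, and `phases/` files lets the current state be reconstructed outside the conversation.
**Trade-offs**:
- Each step incurs the cost of writing a Sprint Contract and verification criteria.
- Adding too many agents, hooks, or commands can turn the Harness itself into a maintenance burden, so follow the simplification rule in `docs/HARNESS.md`.
- Hooks are only an aid; they do not replace evaluator review or acceptance criteria.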
-152
View File
@@ -1,152 +0,0 @@
# Architecture
## Scope
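The current implementation goal is the first-stage target: a Windows-native, local-first CLI/library conversion engine.
- Default parser: `Marker`
- Default formula parser: `Nougat`
- PDF analysis and chunk planning: `PyMuPDF`
- Output: Markdown chunk files plus assets
- Default chunk target: 20 pages
- Default runtime: CUDA
- UI, hosted API, and a default LLM correction path are outside the first-goal scope.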
## Architecture Principles
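- Maintain the Marker-first architecture.
- Nougat is a formula parser, not a whole-document parser.
- PyMuPDF handles fast page-level analysis and chunk planning before heavy conversion.
- Output must be a deterministic Markdown bundle that AI agents can navigate easily.
- Possible loss in complex tables, figures, and formulas is handled through fallbacks and quality validation.
- Generated Markdown must center on the source document's content and must not be polluted with warning/error logs.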
## Pipeline
1. Input normalization
   - Normalize the PDF path using `pathlib`.
   - Support Korean characters, spaces, and long Windows paths.
   - Generate the document slug deterministically.
2. Page pre-analysis
   - Use PyMuPDF to check page count, text length, image count, and text-layer quality.
   - Estimate per page whether OCR is needed.
   - For long documents, plan chunks with a 20-page target while considering logical block boundary preservation.
3. Marker parse
   - Marker handles layout, OCR, reading order, body text, headings, tables, figures, captions, and semantic blocks.
   - Map the Marker Document Model, or an equivalent structured output, to the internal block model.
4. Formula handoff
   - Pass only Marker equation blocks, or blocks in which a formula pattern is detected, to Nougat.
   - Treat Nougat results as LaTeX string candidates that must pass validation and the fallback policy.
   - When Nougat fails, use Marker's original formula string.
5. Semantic enrichment
   - Identify formula numbers, figure numbers, table numbers, captions, and in-text references.
   - When identification confidence is sufficient, connect them with internal Markdown links.
   - Remove repeated header/footer/page-number patterns from the body flow, or separate them from it.
6. Markdown rendering
   - Render heading, paragraph, list, blockquote, table, figure, and equation blocks as Markdown.
   - Prefer Markdown tables, but use limited HTML tables or an image fallback for complex tables.
   - Each chunk may include minimal frontmatter such as document title, page range, and chunk number.
7. Asset writing
   - Save images under `images/` with deterministic filenames.
   - When a figure number exists, prefer `{document-slug}_fig-{figure-number}.png`.
   - On collision or when no number exists, use chunk/page/block identifiers.
   - Reduce duplicate asset storage with hash-based deduplication.
8. Validation and reporting
   - Validate math delimiter balance, LaTeX environment pairs, table parseability, image link existence, caption matching, and chunk boundary integrity.
   - The CLI shows a progress bar and per-chunk success/failure.
   - Record errors and warnings to stderr and the local log.
## Planned Layout
```text
samples/ # regression and quality corpus
tests/ # focused pytest coverage
scripts/ # validation / harness helpers
phases/ # executable Harness phase tickets
src/ # source package, planned
venv/ # repo-local Windows virtual environment, ignored by git
output/ # conversion output, ignored by git
```
## Harness Boundary
- `docs/HARNESS.md` defines the planner/generator/evaluator workflow for long-running work.
- `phases/` files are execution tickets, not architecture policy. Architecture policy remains in `docs/ARCHITECTURE.md`, `docs/CONVERSION_POLICY.md`, and `docs/ADR.md`.
- Each implementation phase must keep parser, formula, pre-analysis, renderer, runtime, and UI responsibilities separated according to this document.
- Evaluator checks should use hard thresholds from each step's Sprint Contract and the focused quality strategy below.
## Output Contract
Output is bundled under the document slug directory.
```text
output/
└── document-slug/
├── document-slug_001.md
├── document-slug_002.md
└── images/
├── document-slug_fig-001.png
└── document-slug_fig-003.png
```
Detailed rules:
- Chunk Markdown filenames follow `<slug>_<chunk-index:03d>.md`.
- Image assets go under `images/`.
- The same input and the same options must produce the same output path.
- Separate document sidecar metadata/log artifacts are not part of the default output contract.
- Local logs and resume state/cache are runtime artifacts and are distinct from the document output contract.
## Runtime Policy
- The default runtime is `cuda`.
- With an explicit `--runtime cuda` or `--device cuda`, fail fast if CUDA is not ready.
- `--runtime auto` prints a warning and falls back to CPU when needed.
- For the GTX 1070 Ti with 8 GB, start with a batch size of about 1 to 2.
- On GPU OOM, retry with a smaller batch/page unit where possible.
- The default formula parser is `nougat`.
- The verified PyTorch baseline is `torch==2.7.1+cu126`.
## Environment
Use a single repo-local Python 3.11 `venv`.
```powershell
conda create -p .\venv python=3.11 -y
.\venv\python.exe -m pip install -r requirements.txt
```
Key pins:
- `torch==2.7.1+cu126`
- `torchvision==0.22.1+cu126`
- `marker-pdf==1.10.2`
- `nougat-ocr==0.1.17`
- `transformers==4.57.6`
- `albumentations==1.3.1`
- `pypdfium2==4.30.0`
- `opencv-python-headless==4.11.0.86`
- `Pillow==10.4.0`
- `fsspec==2026.2.0`
## Model Cache And Offline Mode
- Model cache locations must be managed explicitly.
- After the initial download, offline runs must prefer the already-downloaded weights.
- The README must add separate procedures for model download and offline execution.
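One way to keep the cache location explicit is a resolver that honours an override variable and derives the offline settings from it. The sketch below is hypothetical: the `.models/` default and the `PDFTOMD_MODEL_CACHE` name follow the project's progress notes, while the function names and the exact mapping to the Hugging Face variables `HF_HOME`, `HF_HUB_OFFLINE`, and `TRANSFORMERS_OFFLINE` are assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_model_cache(repo_root: Path,
                        environ: Optional[Mapping[str, str]] = None) -> Path:
    """Return the model cache directory (hypothetical helper).

    PDFTOMD_MODEL_CACHE wins when set; otherwise use <repo_root>/.models.
    """
    env = os.environ if environ is None else environ
    override = env.get("PDFTOMD_MODEL_CACHE")
    return Path(override) if override else repo_root / ".models"


def hugging_face_environment(cache_dir: Path, offline: bool) -> dict:
    """Build environment variables for a run (assumed mapping)."""
    variables = {"HF_HOME": str(cache_dir)}
    if offline:
        # Ask the Hugging Face libraries to use only already-downloaded files.
        variables["HF_HUB_OFFLINE"] = "1"
        variables["TRANSFORMERS_OFFLINE"] = "1"
    return variables
```

Passing the environment in as a mapping keeps the resolver testable without mutating `os.environ`.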
## Quality Strategy
- Full Markdown snapshot comparison is not used as the primary verification method.
- Prefer focused assertions.
- Verification targets:
  - heading hierarchy
  - math delimiter balance
  - LaTeX `\begin` / `\end` pairs
  - image link existence
  - figure/table/formula caption matching
  - table parseability
  - chunk boundary integrity
  - Windows path and Korean filename handling
  - no-exception conversion
## Out of Scope for the First Goal
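- PyQt UI implementation
- Making a hosted conversion API the default path
- Making an LLM correction mode the default path
- Separate sidecar metadata/log artifacts distributed alongside generated documents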
-91
View File
@@ -1,91 +0,0 @@
# Conversion Policy
This document records implementation decisions for the PDF-to-Markdown conversion engine. It is planning guidance, not implementation code.
## Input Classification
- Support mixed PDFs by default: text-layer pages, scanned pages, and mixed pages can appear in the same document.
- Use PyMuPDF or equivalent lightweight page analysis before heavy parsing to estimate text-layer quality per page.
- Decide OCR intervention per page instead of treating the entire PDF as text-only or scan-only.
- Prefer Marker's OCR/layout functionality for scanned or weak text-layer pages.
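The per-page decision above can be sketched as a pure function over lightweight page statistics. This is only an illustration under assumptions: the `PageStats` fields, the `needs_ocr` name, and both thresholds are hypothetical, and the statistics are supplied directly rather than read with PyMuPDF so the example stays self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageStats:
    """Hypothetical per-page measurements gathered before heavy parsing."""
    page_number: int
    text_chars: int             # characters found in the text layer
    image_count: int            # images placed on the page
    replacement_chars: int = 0  # e.g. U+FFFD characters in the text layer


def needs_ocr(stats: PageStats, min_chars: int = 50,
              max_bad_ratio: float = 0.2) -> bool:
    """Decide OCR for one page; the thresholds are illustrative assumptions."""
    if stats.text_chars < max(min_chars, 1):
        # Little or no text layer: OCR only helps if the page carries images.
        return stats.image_count > 0
    # A text layer exists; treat it as weak when too much of it is garbage.
    return stats.replacement_chars / stats.text_chars > max_bad_ratio


def ocr_plan(pages: list) -> dict:
    """Map page number to an OCR decision, one decision per page."""
    return {page.page_number: needs_ocr(page) for page in pages}
```

Keeping the decision per page means a mixed document can send only its scanned or weak pages through OCR.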
## Parser Responsibilities
- Marker owns overall layout tracking, reading order, body extraction, table structure, image extraction, headings, captions, and semantic block roles.
- Nougat owns only mathematical expressions and formula block parsing.
- Do not use Nougat as the main document parser.
- Send a block to Nougat when Marker identifies it as an equation area or when text-pattern detection marks it as mathematical content.
- If Nougat conversion fails, preserve information by falling back to Marker's extracted source text.
## Formula Handling
- Treat formulas embedded inside a sentence without independent line spacing as inline formulas.
- Treat formulas occupying independent line space or vertical whitespace as block formulas.
- Preserve formula numbers detected near the right or bottom side of a formula region.
- Attach anchors to extracted formula numbers and rewrite body references such as `Eq. (3)` or `식 (5)` as internal Markdown links when confidence is sufficient.
- Validate Markdown math delimiters by counting opening and closing `$ ... $` and `$$ ... $$` pairs across each chunk.
- Validate common LaTeX environments by checking matching `\begin{...}` and `\end{...}` names and counts.
- If delimiter or environment validation fails, repair the closest logical location in a way that keeps Markdown rendering intact.
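The two validation rules above can be sketched with the standard library alone. The function names are hypothetical and the checks are deliberately simple: escaped `\$` is ignored, but dollar signs inside code spans or ordinary prices are not distinguished from math delimiters, so a real validator would need more context.

```python
import re

ESCAPED_DOLLAR = re.compile(r"\\\$")
ENVIRONMENT_TOKEN = re.compile(r"\\(begin|end)\{([^}]*)\}")


def math_delimiters_balanced(chunk: str) -> bool:
    """Check that `$` and `$$` delimiters occur in pairs within a chunk."""
    text = ESCAPED_DOLLAR.sub("", chunk)  # drop escaped \$ before counting
    block_count = text.count("$$")
    inline_count = text.replace("$$", "").count("$")
    return block_count % 2 == 0 and inline_count % 2 == 0


def latex_environments_balanced(chunk: str) -> bool:
    """Check that every \\begin{name} is closed by a matching \\end{name}."""
    open_names = []
    for kind, name in ENVIRONMENT_TOKEN.findall(chunk):
        if kind == "begin":
            open_names.append(name)
        elif not open_names or open_names.pop() != name:
            return False  # an \end without a matching \begin
    return not open_names  # leftover names mean an unclosed \begin
```

The checks only report imbalance; locating the closest logical repair point is a separate step that this sketch does not attempt.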
## Tables
- Prefer Markdown tables when structure can be represented without major loss.
- Use limited HTML `<table>` output for tables with merged cells, multi-row headers, or structures that exceed GitHub Flavored Markdown table expressiveness.
- Preserve table footnotes as regular text immediately below the table.
- Preserve top or bottom captions as text and create internal links from body references such as `Table 1`.
- If structured table extraction loses too much information, also save a screenshot of the table region as a fallback asset and link it near the structured output.
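A minimal sketch of the Markdown-first, HTML-fallback choice follows. The function names and the `has_merged_cells` flag are hypothetical, and the HTML branch emits plain cells only, without `colspan` or `rowspan`, so it shows the selection logic rather than a complete merged-cell renderer.

```python
from html import escape


def _markdown_row(cells):
    # Escape pipes so cell text cannot break the table structure.
    return "| " + " | ".join(cell.replace("|", "\\|") for cell in cells) + " |"


def _render_markdown(rows):
    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([_markdown_row(header), separator,
                      *(_markdown_row(row) for row in body)])


def _render_html(rows):
    header, *body = rows
    lines = ["<table>",
             "<tr>" + "".join(f"<th>{escape(c)}</th>" for c in header) + "</tr>"]
    for row in body:
        lines.append("<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


def render_table(rows, has_merged_cells=False):
    """Render rows (first row is the header) as Markdown, else as HTML."""
    ragged = len({len(row) for row in rows}) != 1
    multiline = any("\n" in cell for row in rows for cell in row)
    if has_merged_cells or ragged or multiline:
        return _render_html(rows)
    return _render_markdown(rows)
```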
## Figures And Images
- Use deterministic image asset naming such as `{document-slug}_fig-{figure-number}.png` when a figure number is available.
- Include chunk/page/block identifiers in names or anchors when needed to avoid collisions.
- Place extracted image assets in the document `images/` directory.
- Add figure captions below Markdown image links.
- Rewrite body references such as `Fig. 2` to internal Markdown links when the figure target can be identified.
- Deduplicate extracted images by hash and let repeated references share one asset and anchor.
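The naming and deduplication rules above might look like the following sketch. The `ExtractedImage` type, the function name, and the page/block fallback pattern are assumptions for illustration; the three-digit figure number follows the `document-slug_fig-001.png` example in the output contract.

```python
import hashlib
from typing import NamedTuple, Optional


class ExtractedImage(NamedTuple):
    """Hypothetical record for one image extracted from the PDF."""
    page: int
    block: int
    figure_number: Optional[int]
    data: bytes


def assign_asset_names(slug: str, images: list) -> list:
    """Return one asset name per image; identical bytes share one name."""
    name_by_digest = {}
    used_names = set()
    names = []
    for image in images:
        digest = hashlib.sha256(image.data).hexdigest()
        if digest not in name_by_digest:
            name = None
            if image.figure_number is not None:
                name = f"{slug}_fig-{image.figure_number:03d}.png"
            if name is None or name in used_names:
                # No figure number, or a collision: use page/block identifiers.
                name = f"{slug}_p{image.page:03d}-b{image.block:03d}.png"
            used_names.add(name)
            name_by_digest[digest] = name
        names.append(name_by_digest[digest])
    return names
```

Because names depend only on the slug, figure number, page, block, and image bytes, repeated runs over the same input yield the same asset names.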
## Reading Order And Paragraph Flow
- Stitch lines into paragraphs when a line does not end with terminal punctuation and the next line begins like a continuation, or when bounding-box line spacing matches intra-paragraph spacing.
- Join hyphenated line breaks when a line-ending hyphen is followed by a lowercase continuation without whitespace.
- Preserve hyphens for known compounds, identifiers, or proper nouns when confidence is low.
- Use Marker bounding boxes to validate that the linearized text flow matches expected reading order in sample PDFs.
- Detect repeated header/footer/page-number patterns in stable top/bottom page regions and exclude them from body Markdown, or separate them from the main body flow.
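The hyphenation rules above can be sketched for the lines of a single paragraph. The function name and the `keep_hyphenated` allow-list are hypothetical, and the sketch assumes paragraph grouping has already been decided from punctuation and line spacing.

```python
def stitch_lines(lines, keep_hyphenated=frozenset()):
    """Join the lines of one paragraph, repairing line-break hyphenation."""
    text = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not text:
            text = line
        elif text.endswith("-"):
            head = text.rsplit(" ", 1)[-1]                # e.g. "defor-"
            tail = line.split(" ", 1)[0].rstrip(".,;:")   # e.g. "mation"
            compound = (head + tail).lower()
            if line[0].islower() and compound not in keep_hyphenated:
                # Lowercase continuation: treat the hyphen as a line-break artifact.
                text = text[:-1] + line
            else:
                # Known compound or capitalized continuation: keep the hyphen.
                text += line
        else:
            text += " " + line
    return text
```

A capitalized continuation such as `Navier-` followed by `Stokes` keeps its hyphen, which matches the preference for preserving proper nouns when confidence is low.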
## Chunking
- Use 20 pages as the default chunk target.
- Prefer logical block boundaries over strict page boundaries when a paragraph, formula, table, or figure would be cut in the middle.
- If a block crosses a chunk boundary, keep the block intact by moving it to the previous or next chunk according to the least damaging boundary.
- Add minimal context at the top of each chunk, including document title, page range, and chunk number.
- Avoid sidecar metadata by default; put only core metadata in concise Markdown frontmatter.
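A simplified planner for these rules is sketched below. It is an assumption-laden illustration: `continues_after` is a hypothetical set of page numbers after which a logical block runs on to the next page, and "least damaging boundary" is approximated as the nearest safe page within `max_shift` pages.

```python
def _nearest_safe_end(end, start, page_count, continues_after, max_shift):
    """Find the closest page near `end` that does not split a block."""
    for shift in range(1, max_shift + 1):
        for candidate in (end - shift, end + shift):
            if start <= candidate <= page_count and candidate not in continues_after:
                return candidate
    return end  # no safe boundary nearby; accept the split


def plan_chunks(page_count, target=20, continues_after=frozenset(), max_shift=3):
    """Return 1-based inclusive (start, end) page ranges for each chunk."""
    chunks = []
    start = 1
    while start <= page_count:
        end = min(start + target - 1, page_count)
        if end < page_count and end in continues_after:
            end = _nearest_safe_end(end, start, page_count,
                                    continues_after, max_shift)
        chunks.append((start, end))
        start = end + 1
    return chunks
```

For a 76-page document with no crossing blocks this yields four chunks; marking page 20 as unsafe moves the first boundary to page 19 and shifts the later chunks accordingly.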
## Determinism And Paths
- Ensure the same PDF and same options produce stable output structure and filenames.
- Use deterministic slug, anchor, asset, and chunk naming rules.
- Prefer `pathlib` for filesystem paths.
- Test Korean filenames, paths with spaces, and long Windows paths.
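Deterministic slugs could be derived from the file stem as in the sketch below. The exact slug rules are not specified in this document, so the normalization steps here (NFC normalization, removal of characters Windows forbids in filenames, hyphens for whitespace, lowercasing) and the `document_slug` name are assumptions.

```python
import re
import unicodedata
from pathlib import Path

# Characters that Windows does not allow in file names, plus control characters.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def document_slug(pdf_path) -> str:
    """Derive a stable slug from a PDF path (hypothetical rules)."""
    stem = Path(pdf_path).stem
    # NFC keeps Korean text stable even if the filesystem stores decomposed forms.
    stem = unicodedata.normalize("NFC", stem)
    stem = _INVALID_CHARS.sub("", stem)
    stem = re.sub(r"\s+", "-", stem.strip())
    stem = stem.strip(".-").lower()
    return stem or "document"
```

The function depends only on the path string, so the same input always maps to the same slug, and Korean characters are kept rather than transliterated.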
## Runtime And Recovery
- Use conservative batch sizes, usually 1 or 2, for GTX 1070 Ti 8 GB VRAM.
- If a GPU out-of-memory error occurs, retry with a smaller batch or smaller page unit where possible.
- If the user explicitly requests `--device cuda` or `--runtime cuda`, fail fast instead of silently switching to CPU.
- If the user requests `--runtime auto`, warn and fall back to CPU when CUDA initialization fails.
- Keep model cache locations explicit, preferably under a local project or user-configured model cache directory, so offline operation can reuse already-downloaded weights.
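The out-of-memory retry rule can be sketched independently of any GPU library. The helper name and signature are hypothetical; `MemoryError` stands in for the real exception type, and a caller using PyTorch would presumably pass `torch.cuda.OutOfMemoryError` instead.

```python
def run_with_oom_retry(process_batch, pages, batch_size=2, oom_error=MemoryError):
    """Process pages in batches, halving the batch size after an OOM error.

    `process_batch` is any callable that takes a list of pages and returns a
    list of results; it is an assumption of this sketch, not a project API.
    """
    results = []
    index = 0
    while index < len(pages):
        batch = pages[index:index + batch_size]
        try:
            results.extend(process_batch(batch))
        except oom_error:
            if batch_size == 1:
                raise  # cannot shrink further; surface the failure
            batch_size = max(1, batch_size // 2)
            continue  # retry the same pages with the smaller batch
        index += len(batch)
    return results
```

Once the batch size has been reduced it stays reduced for the rest of the run, which is a conservative choice for an 8 GB card.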
## Logging And Resume
- Show chunk-level progress and success/failure status in the CLI.
- Print warnings and errors to stderr and a local log file.
- Do not inject warnings or error logs into generated Markdown because they reduce document readability and integrity.
- Support resuming failed conversions by skipping already successful chunks when a local state/cache file is available.
- Sidecar outputs are still out of scope unless explicitly requested; a resume state file is a runtime cache, not part of the document output contract.
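Resume support could be backed by a small JSON state file, as in this sketch. The file layout (`completed_chunks`), the function names, and the atomic-rename approach are all assumptions; the state file is a runtime artifact and would live outside the document output directory.

```python
import json
from pathlib import Path


def load_completed(state_path: Path) -> set:
    """Read the set of completed chunk indices; missing or bad files mean none."""
    if not state_path.exists():
        return set()
    try:
        data = json.loads(state_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return set()
    if not isinstance(data, dict):
        return set()
    return set(data.get("completed_chunks", []))


def mark_completed(state_path: Path, chunk_index: int) -> None:
    """Record one successful chunk, writing via a temp file and rename."""
    completed = load_completed(state_path)
    completed.add(chunk_index)
    state_path.parent.mkdir(parents=True, exist_ok=True)
    temp_path = state_path.with_suffix(".tmp")
    temp_path.write_text(json.dumps({"completed_chunks": sorted(completed)}),
                         encoding="utf-8")
    temp_path.replace(state_path)


def pending_chunks(all_chunks, state_path: Path) -> list:
    """Return the chunk indices that still need conversion."""
    completed = load_completed(state_path)
    return [index for index in all_chunks if index not in completed]
```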
## Quality Tests
- Prefer focused assertions over full Markdown snapshots.
- Validate heading structure, formula delimiter balance, LaTeX environment pairs, image links, caption matching, table parseability, and no-exception conversion.
- Use regex and Markdown/HTML parsers where practical instead of ad hoc string checks.
- Maintain a sample metadata mapping file for `samples/` that tags each PDF by traits such as text-layer quality, scanned pages, multi-column layout, formula density, table density, figure density, and Korean filename coverage.
- Use engineering/mechanics PDFs with multi-column layout, formulas, graphs, and tables as the MVP acceptance corpus.
## Licensing
- Current use is personal, which lowers immediate distribution risk.
- If redistribution or commercial use becomes relevant, revisit Marker GPL and model-weight license implications before packaging.
- Process or service isolation can be considered as a licensing risk-mitigation strategy, but it is not a legal conclusion and should be reviewed before distribution.
## UI Boundary
- Keep the core conversion engine as a Python API/CLI package.
- Future PyQt UI should remain a thin client over the same API and must not duplicate conversion logic.
-114
View File
@@ -1,114 +0,0 @@
# Harness Engineering Guide
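This document defines the Harness operating rules for managing long-running agent work in the PDFtoMD project. It is based on the separation of planner, generator, and evaluator, file-based handoff, sprint contracts, and the independent evaluation loop emphasized in Anthropic's article "Harness design for long-running application development".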
## Purpose
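- Split the long conversion-engine development into small self-contained steps.
- Let a new agent pick up work by reading only `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, and the `phases/` files, without prior conversation context.
- Separate the implementation agent from the evaluation agent to reduce self-evaluation bias.
- Fix each step's success conditions in a file before code is written.
- Keep the Harness itself simple, and place complexity only in the necessary verification criteria and step boundaries.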
## Roles
### Planner
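- Reads the product goals and architecture documents and writes phases and steps.
- Does not over-specify implementation details; makes deliverables, ownership scope, acceptance criteria, and prohibited scope clear.
- Deliverables:
  - `PLAN.md` updates
  - `phases/index.json`
  - `phases/{phase}/index.json`
  - `phases/{phase}/stepN.md`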
### Generator
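- Performs only one `stepN.md` at a time.
- Reads the step's "Sprint Contract" before working and, if it is ambiguous, records a blocker in `PROGRESS.md` before implementing.
- Writes tests first in implementation steps that require TDD.
- Deliverables:
  - Code, test, and documentation changes within the step scope
  - `phases/{phase}/index.json` step status update
  - `PROGRESS.md` handoff update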
### Evaluator
- Independently reviews the generator's output.
- Does not pass the step if any agreed criterion fails to meet its hard threshold.
- Records not only pass/fail but also specific, reworkable failure causes.
- Deliverables:
  - A review finding or pass record
  - If needed, `error_message` or `blocked_reason` in `phases/{phase}/index.json`
  - Verification results in `PROGRESS.md`
## File Protocol
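- `AGENTS.md`: repository rules that do not change.
- `PLAN.md`: single source of truth for the overall work plan.
- `PROGRESS.md`: single source of truth for current progress and handoff.
- `docs/*.md`: knowledge about product, architecture, decisions, toolchain, and Harness operation.
- `phases/index.json`: registry of executable phases.
- `phases/{phase}/index.json`: single source of truth for that phase's step status.
- `phases/{phase}/stepN.md`: a ticket that a new agent can execute independently.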
## Step Contract Template
`stepN.md` must include the following information.
````markdown
# Step N: step-name
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/ADR.md
- /docs/CONVERSION_POLICY.md
## Task
Describe concretely the deliverables this step must produce and the files that may be modified.
## Sprint Contract
- Done means: a completion condition the user can observe or that tests can confirm.
- Hard thresholds: criteria where a single failure fails the step.
- Files owned: files or directories this step may modify.
- Dependencies: outputs of previous steps or required documents.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
```
## Verification
1. Run the tests and verification commands.
2. Record the results and the next handoff in `PROGRESS.md`.
3. Update the step in `phases/{phase}/index.json` to one of `completed`, `blocked`, or `error`.
## Do Not
- Do not implement features outside the step scope.
- Do not introduce new parsers or external APIs.
- Do not arbitrarily widen the generated Markdown output contract.
````
## Evaluation Criteria
The PDFtoMD evaluator applies the following hard thresholds first.
| Area | Hard Threshold |
| --- | --- |
| Architecture | Does not break the responsibility boundaries among Marker, Nougat, and PyMuPDF. |
| TDD | Implementation steps add a failing test first, or the step states why no test is needed. |
| Determinism | The same input and options produce the same slug, asset paths, anchors, and Markdown structure. |
| Markdown quality | Headings, math delimiters, tables, image links, captions, and chunk frontmatter can be validated. |
| Runtime | Does not undermine the Windows path, Korean filename, or CUDA/CPU runtime policies. |
| Scope | Does not pull PyQt UI, hosted API, LLM correction, or sidecar output into the first implementation. |
| Handoff | `PROGRESS.md` and the phase index give the next agent sufficient state. |
## When To Use The Full Loop
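- Use the full planner/generator/evaluator loop:
  - when starting a new phase
  - for work with a high cost of failure, such as the parser adapter, chunk planner, renderer, or quality validator
  - for work where several files and documents change together, such as the sample corpus or runtime policy
- Simple documentation typos, small command descriptions, and clear single-test fixes may be handled as ordinary Codex work. Still update `PROGRESS.md`.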
## Simplification Rule
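Keep Harness components only when they actually improve quality.
- If the same verification is repeated in two places, reduce it to one.
- Treat hooks as an aid; they do not replace the step's acceptance criteria or the evaluator's judgment.
- Do not give an agent too much context; specify only the documents and files the step needs.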
-121
View File
@@ -1,121 +0,0 @@
# Implementation Phase Plan
This document is the execution plan that divides the entire PDFtoMD implementation into phases. The detailed execution tickets for each phase live in `phases/{phase}/stepN.md`.
## Planning Principles
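- The first goal is a Windows-native, local-first CLI/library conversion engine.
- The PyQt UI is implemented as a thin client after the core API and CLI have stabilized.
- Each phase presupposes the previous phase's outputs, and each step within a phase must be independently executable by a single agent.
- Implementation phases default to TDD.
- Keep parser responsibility boundaries: Marker for document structure, Nougat for formulas, PyMuPDF for pre-analysis and low-level PDF operations.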
## Phase Overview
| Phase | Goal | Primary Output | Depends On |
| --- | --- | --- | --- |
| 0. Harness foundation | Executable Harness base and minimal quality foundation | sample metadata, core models, preanalysis contract, quality gates | current docs |
| 1. Core runtime contracts | Conversion options, input normalization, output bundle contract, path/cache policy | stable API contracts and tests | Phase 0 |
| 2. Marker adapter | Implement the Marker execution and block normalization boundary | Marker adapter, OCR handoff, block mapping tests | Phase 1 |
| 3. Formula pipeline | Nougat formula-only handoff and LaTeX validation/fallback | formula detector, Nougat adapter, repair/fallback tests | Phase 2 |
| 4. Semantic enrichment | Reinforce paragraphs, reading order, header/footer handling, and reference relationships | enrichment pipeline and reference index | Phase 2, 3 |
| 5. Markdown rendering and assets | Implement the Markdown chunk, table, figure, and asset writers | deterministic Markdown bundle writer | Phase 4 |
| 6. CLI runtime and resume | Implement CLI, progress/logging, runtime, OOM, and resume | user-facing local CLI | Phase 5 |
| 7. MVP quality hardening | Samples-based end-to-end quality verification and regression stabilization | MVP acceptance suite | Phase 6 |
| 8. Release docs and packaging | Organize installation, model cache, offline, and release documentation | local release-ready docs/scripts | Phase 7 |
| 9. PyQt thin client | Windows UI that calls the CLI/library | optional PyQt UI | Phase 8 |
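The "Depends On" column can be checked mechanically before a phase starts. The sketch below encodes the table as a dictionary; the names `PHASE_DEPENDENCIES` and `runnable_phases` are hypothetical and are not part of `scripts/execute.py` or any other script mentioned in this repository.

```python
# Phase number -> phases that must be completed first (from the table above).
PHASE_DEPENDENCIES = {
    0: [],
    1: [0],
    2: [1],
    3: [2],
    4: [2, 3],
    5: [4],
    6: [5],
    7: [6],
    8: [7],
    9: [8],
}


def runnable_phases(completed: set) -> list:
    """Return phases that are not done yet and whose dependencies are all done."""
    return [
        phase
        for phase, dependencies in PHASE_DEPENDENCIES.items()
        if phase not in completed
        and all(dependency in completed for dependency in dependencies)
    ]
```

With Phase 0 and Phase 1 completed, the only runnable phase is Phase 2.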
## Phase 0: Harness Foundation
- Directory: `phases/0-harness-foundation`
- Purpose: Build the shared models, sample metadata, PyMuPDF pre-analysis contract, and Markdown quality gates before implementation.
- Steps:
1. `sample-metadata-contract`
2. `core-package-skeleton`
3. `page-preanalysis-contract`
4. `markdown-quality-gates`
## Phase 1: Core Runtime Contracts
- Directory: `phases/1-core-runtime-contracts`
- Purpose: Stabilize the input, options, path, and output contracts that every phase shares before parsers run.
- Steps:
1. `input-normalization-slug`
2. `conversion-options-config`
3. `output-bundle-contract`
4. `runtime-cache-policy`
## Phase 2: Marker Adapter
- Directory: `phases/2-marker-adapter`
- Purpose: Connect Marker as the primary parser and map the OCR/page plan and Marker's structured output to the internal block model.
- Steps:
1. `marker-invocation-adapter`
2. `ocr-plan-handoff`
3. `marker-block-normalization`
4. `marker-failure-reporting`
## Phase 3: Formula Pipeline
- Directory: `phases/3-formula-pipeline`
- Purpose: Connect Nougat as a formula-only parser and stabilize formula delimiters, numbering, and fallback.
- Steps:
1. `formula-block-detection`
2. `nougat-command-adapter`
3. `latex-validation-repair`
4. `formula-reference-links`
## Phase 4: Semantic Enrichment
- Directory: `phases/4-semantic-enrichment`
- Purpose: Reinforce Marker blocks into a logical structure suited to Markdown.
- Steps:
1. `reading-order-checks`
2. `paragraph-stitching`
3. `header-footer-filtering`
4. `reference-indexing`
## Phase 5: Markdown Rendering And Assets
- Directory: `phases/5-markdown-rendering-assets`
- Purpose: Deterministically generate the chunked Markdown bundle and image/table asset output.
- Steps:
1. `markdown-block-renderer`
2. `table-renderer-fallbacks`
3. `figure-asset-writer`
4. `chunk-renderer`
## Phase 6: CLI Runtime And Resume
- Directory: `phases/6-cli-runtime-resume`
- Purpose: Wrap the conversion engine in a CLI that users can run and implement the runtime/recovery policy.
- Steps:
1. `cli-entrypoint-options`
2. `progress-logging`
3. `resume-state`
4. `device-oom-policy`
5. `model-cache-offline`
## Phase 7: MVP Quality Hardening
- Directory: `phases/7-mvp-quality-hardening`
- Purpose: Lock in end-to-end quality against the sample corpus and pass the MVP acceptance criteria.
- Steps:
1. `sample-smoke-conversions`
2. `quality-metrics-report`
3. `regression-thresholds`
4. `mvp-fix-sweep`
## Phase 8: Release Docs And Packaging
- Directory: `phases/8-release-docs-packaging`
- Purpose: Organize installation, model download, offline execution, and the release checklist for personal local use.
- Steps:
1. `readme-usage-flow`
2. `environment-bootstrap-docs`
3. `license-checkpoint`
4. `release-checklist`
## Phase 9: PyQt Thin Client
- Directory: `phases/9-pyqt-thin-client`
- Purpose: Build a Windows UI that does not duplicate the core engine.
- Steps:
1. `ui-api-contract`
2. `pyqt-shell`
3. `ui-progress-resume`
4. `ui-packaging-notes`
## Deferred Backlog
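- A hosted conversion API is not included in the current phase plan.
- LLM correction mode is not a default path; it needs a separate ADR and phase plan after the MVP.
- If distribution or commercial use becomes realistic, treat the Marker GPL and model weight licenses as subjects for separate legal review.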
-88
View File
@@ -1,88 +0,0 @@
# PRD: PDFtoMD
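## Goal
PDFtoMD is a program that converts PDF documents focused on mathematics, engineering, and mechanics into Markdown document bundles that AI agents can easily access and read.
The goal of this project is not simply to extract a PDF's text, but to reconstruct it as knowledge material that is easy for AI to read while preserving the source document's logical structure.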
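## Problem Definition
- PDFs store text, images, formulas, tables, and captions by coordinates, so the source reading order breaks easily.
- Papers and engineering documents often contain multi-column layouts, formula numbers, figure/table references, and complex tables.
- Documents that mix scanned pages and text-layer pages require the OCR decision to be made per page, not per document.
- AI agents and RAG tools navigate logically divided Markdown chunks with linked assets more reliably than one long PDF.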
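## Users
- Users who want to convert PDF documents to Markdown for use with AI agents, RAG, or personal knowledge management tools
- Users who want to read and manage papers and engineering documents rich in formulas, tables, and images as Markdown
- Users who want to split long PDFs into multiple Markdown files for partial navigation
- Users who want to convert locally in a Windows-native environment without external services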
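## First-Stage MVP Scope
- Fully local execution in a Windows-native environment
- GPU used by default, with stable chunk processing targeted at an 8 GB VRAM environment
- A single repo-local Python 3.11 `venv` environment
- `Marker` as the default PDF parser engine
- Body structure, OCR/layout, reading order, tables, figures, headings, and captions stay on the Marker path
- Mathematical expressions and formulas use the `Nougat` parser
- PyMuPDF pre-analyzes page count, text-layer quality, OCR need, and the chunk plan
- Convert PDF text into Markdown paragraphs and heading structure
- Convert formulas in the PDF into LaTeX using Markdown math delimiters
- When Nougat fails, preserve Marker's original formula string as the fallback
- Extract images from the PDF and link them from the Markdown
- Preserve image figure numbers and captions as far as possible
- Structure tables in the PDF and output them as Markdown tables
- Preserve tables that lose too much as Markdown tables by using a limited HTML table or a table-region image fallback
- Split long documents into chunks with a 20-page target while preserving logical block boundaries
- CLI progress, per-chunk success/failure summary, and stderr/local log recording
- A resume option based on runtime cache/state for restarting failed chunks
- Support for quality verification and regression tests based on the `samples/` PDFs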
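## Second-Stage Scope
- PyQt-based Windows UI
- The UI is implemented as a thin client that calls the CLI/library layer
- Optional external API integration is considered after the conversion engine stabilizes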
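## Out of Scope
- Making a hosted conversion API the default path
- Making an LLM correction mode the default path
- Separate sidecar metadata/log artifacts distributed alongside generated documents
- Duplicating conversion-engine logic inside the PyQt UI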
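## Core Features
1. Convert PDF documents into Markdown document bundles
2. Handle text PDFs, scanned PDFs, and mixed PDFs with per-page OCR decisions
3. Preserve formulas as LaTeX in `$ ... $` or `$$ ... $$` form
4. Connect formula numbers and in-text formula references with internal links as far as possible
5. Rearrange the multi-column layouts common in papers into Markdown's linear structure
6. Extract images and link them from the Markdown
7. Connect figure numbers, captions, and in-text figure references
8. Structure tables and output Markdown, HTML, or fallback images depending on table type
9. Split and convert long PDFs into multiple chunk Markdown files
10. Support Korean filenames, long Windows paths, and paths containing spaces
11. Control batch size and retry on OOM, targeting the GTX 1070 Ti with 8 GB VRAM
12. Provide an explicit model cache policy for offline execution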
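## Quality Criteria
- The source reading order must be maintained naturally in the Markdown.
- The semantic roles of headings, body text, lists, quotations, tables, figures, captions, and formulas must be distinguished.
- Formula delimiters and basic LaTeX structure must not break.
- Formula numbers and in-text references must be connected as far as possible.
- Images, captions, figure numbers, and in-text references must be connected as far as possible.
- Tables must be stored in the format that minimizes structural loss.
- Chunk boundaries must not break paragraphs, tables, figures, or formulas in the middle.
- The same input PDF and the same options must produce the same filenames, anchors, and asset structure.
- Windows paths, Korean filenames, long documents, and GPU out-of-memory situations must be taken into account.
- Errors and warnings must be written to stderr/local log without polluting the Markdown body.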
## Acceptance Criteria
- `python scripts/validate_workspace.py` must succeed.
- `.\venv\python.exe -m pip check` must succeed.
- The CUDA smoke test must succeed on the GTX 1070 Ti.
- `.\venv\Scripts\nougat.exe --help` must succeed.
- The sample metadata mapping file must describe the characteristics of each sample PDF.
- Focused pytest tests must verify headings, formula delimiters, LaTeX environment pairs, image links, caption matching, table parseability, chunk boundaries, and no-exception conversion.
## UI
- The UI is a Phase 2 goal and uses PyQt.
- The UI does not implement the conversion engine itself; it is a thin client that calls the CLI/library layer.
- It follows a minimal, clean, standard Windows design.
-90
@@ -1,90 +0,0 @@
# Toolchain Notes
This document summarizes the researched toolchain choices and local compatibility decisions.
## Verified Environment
- OS: Windows 10
- GPU: NVIDIA GeForce GTX 1070 Ti
- VRAM: 8 GB
- NVIDIA driver: 577.00
- CUDA version reported by `nvidia-smi` (highest version the driver supports): 12.9
- User-installed CUDA toolkit: 12.4
- Python: 3.11.15 in repo-local `venv`
- Environment manager: Conda / Miniforge
## Python Dependencies
Use one repo-local `venv` and install from `requirements.txt`.
Key pins:
- `torch==2.7.1+cu126`
- `torchvision==0.22.1+cu126`
- `marker-pdf==1.10.2`
- `nougat-ocr==0.1.17`
- `transformers==4.57.6`
- `albumentations==1.3.1`
- `pymupdf==1.27.2.3`
- `pandas==3.0.2`
- `pytest==9.0.3`
- `pypdfium2==4.30.0`
- `opencv-python-headless==4.11.0.86`
- `Pillow==10.4.0`
- `fsspec==2026.2.0`
## PyTorch / CUDA Decision
- `torch==2.11.0+cu128` imports on this machine but does not support GTX 1070 Ti `sm_61` at runtime.
- `torch==2.7.1+cu126` satisfies Marker `torch>=2.7.0` and successfully runs CUDA tensor operations on GTX 1070 Ti.
- Keep this pin unless a newer official PyTorch wheel is verified to support `sm_61`.
## Marker
- Marker is the primary document parser.
- It handles OCR/layout, reading order, body text, headings, tables, figures, captions, and semantic block roles.
- It should be consumed through structured output or adapter APIs where possible, not by scraping final Markdown text.
## Nougat
- Nougat is used only for formulas and mathematical expressions.
- `nougat-ocr==0.1.17` has loose dependency bounds, so the project pins compatible versions.
- `transformers 5.x` breaks Nougat imports.
- `albumentations 2.x` breaks Nougat transform initialization.
- Nougat failure must fall back to Marker source text.
## PyMuPDF
- PyMuPDF is used for lightweight page analysis, page counts, text-layer quality checks, OCR intervention planning, chunk planning, and low-level PDF/page operations.
- It is not the primary document parser.
## Comparison Baselines
These tools are useful for research or quality comparison but are not the primary architecture:
- PyMuPDF4LLM
- Docling
- MinerU
- MarkItDown
Do not switch the primary parser without updating `docs/ADR.md`, `docs/ARCHITECTURE.md`, and `docs/CONVERSION_POLICY.md`.
## Reference Links
- Marker PyPI: https://pypi.org/project/marker-pdf/
- Nougat GitHub: https://github.com/facebookresearch/nougat
- PyMuPDF documentation: https://pymupdf.readthedocs.io/
- PyTorch previous versions: https://docs.pytorch.org/get-started/previous-versions/
- GitHub Flavored Markdown spec: https://github.github.io/gfm/
- MathJax TeX delimiters: https://docs.mathjax.org/en/latest/input/tex/delimiters.html
- Docling GitHub: https://github.com/docling-project/docling
- MinerU GitHub: https://github.com/opendatalab/MinerU
## Markdown And Math Rendering
- Markdown table output should target GitHub Flavored Markdown where possible.
- Complex tables may use limited HTML `<table>`.
- Math output uses `$ ... $` for inline formulas and `$$ ... $$` for block formulas.
- `$...$` can conflict with ordinary dollar signs, so delimiter validation and repair are required.
## Model Cache
- Use explicit local cache paths for Marker/Nougat/Hugging Face model downloads.
- README should include model pre-download and offline execution instructions before the engine is released.
- Default project-local model cache path is `.models/`.
- `PDFTOMD_MODEL_CACHE` can override the default cache root.
- The runtime cache policy exposes Hugging Face cache environment variables from that root without downloading models during validation.
- Runtime logs and resume state are runtime artifacts under `output/.pdftomd-runtime/<document-slug>/`, not generated document sidecars.
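The cache rules above can be sketched as two small path helpers. This is only a sketch under stated assumptions: `resolve_model_cache` and `offline_environment` are hypothetical names, the `huggingface` subfolder is an illustrative layout, and setting `HF_HUB_OFFLINE` is an assumption about how offline-preferred loading might be requested. `HF_HOME` and `HF_HUB_OFFLINE` are environment variables read by the Hugging Face Hub client; `PDFTOMD_MODEL_CACHE` and `.models/` come from this document.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def resolve_model_cache(
    env: Optional[Mapping[str, str]] = None,
    project_root: Path = Path("."),
) -> Path:
    """Return the model cache root, honoring the PDFTOMD_MODEL_CACHE override."""
    env = os.environ if env is None else env
    override = env.get("PDFTOMD_MODEL_CACHE")
    # Fall back to the project-local default when no override is set.
    return Path(override) if override else project_root / ".models"


def offline_environment(cache_root: Path) -> Dict[str, str]:
    """Build environment variables for a child process; nothing is downloaded here."""
    return {
        "HF_HOME": str(cache_root / "huggingface"),  # hypothetical subfolder layout
        "HF_HUB_OFFLINE": "1",  # assumption: prefer cached files over network access
    }
```

Passing the environment mapping in as an argument keeps the helpers deterministic in tests, since no test has to mutate `os.environ`.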
## Licensing Notes
- Current user context is personal use.
- Before redistribution or commercial use, revisit Marker GPL and model-weight license implications.
- Process or API isolation can reduce coupling risk, but it is not a substitute for legal review.
-39
@@ -1,39 +0,0 @@
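# UI Design Guide
The UI is a Phase 2 goal. The Phase 1 MVP stabilizes the CLI/library conversion engine first.
## Design Principles
1. Follow a minimal UI that fits the standard Windows environment.
2. Do not duplicate conversion engine logic in the UI.
3. Keep the PyQt UI a thin client that calls the core Python API or the CLI.
4. The user must be able to tell the current state during a long document conversion.
5. Show errors and warnings in a readable way without polluting the generated Markdown.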
## Main Screens
- PDF selection
- Output folder selection
- Runtime selection: `cuda`, `auto`, `cpu`
- Formula parser selection: `nougat`, `marker`
- Chunk size display, keeping the default value
- Progress and per-chunk status display
- Failed chunk summary
- Resume button
- Open local log
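## Interaction Rules
- Defaults must match the CLI defaults.
- When `cuda` is requested explicitly and CUDA initialization fails, do not fall back to CPU automatically; report the failure clearly.
- When running in `auto` mode and CUDA fails, show a warning and then indicate the CPU fallback state.
- Conversion must be cancellable while in progress.
- Successful and failed chunks must be distinguishable.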
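The `cuda` and `auto` rules above can be sketched as one pure function that the CLI and the UI could share. `resolve_device`, `CudaUnavailableError`, and the warning text are hypothetical names for illustration; the caller is assumed to probe CUDA itself and pass the result in as `cuda_ok`.

```python
from typing import Optional, Tuple


class CudaUnavailableError(RuntimeError):
    """Raised when `cuda` was requested explicitly but cannot be used."""


def resolve_device(mode: str, cuda_ok: bool) -> Tuple[str, Optional[str]]:
    """Return (device, warning); warning is None unless a fallback happened."""
    if mode == "cpu":
        return "cpu", None
    if mode == "cuda":
        if not cuda_ok:
            # Explicit request: fail fast instead of silently using the CPU.
            raise CudaUnavailableError(
                "CUDA was requested explicitly but could not be initialized"
            )
        return "cuda", None
    if mode == "auto":
        if cuda_ok:
            return "cuda", None
        return "cpu", "CUDA initialization failed; falling back to CPU"
    raise ValueError(f"unknown runtime mode: {mode!r}")
```

Returning the warning instead of printing it lets the UI show the fallback state while the CLI writes the same text to stderr.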
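## Visual Style
- Use restrained colors and spacing that suit native Windows.
- This is a work-tool UI, so do not use a marketing hero or decorative layouts.
- Provide middle ellipsis or a tooltip so that long filenames and Korean paths are not lost to truncation.
- Provide logs and result paths as copyable text.
## Boundary
- The UI calls only the public API of the `src/` core package, or the CLI.
- The UI does not combine Marker, Nougat, and PyMuPDF directly.
- Keep UI tests separate from core conversion quality tests.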
-30
@@ -1,30 +0,0 @@
{
"project": "PDFtoMD",
"phase": "0-harness-foundation",
"steps": [
{
"step": 0,
"name": "sample-metadata-contract",
"status": "completed",
"summary": "Created deterministic samples/metadata.json and metadata contract tests for current sample PDFs."
},
{
"step": 1,
"name": "core-package-skeleton",
"status": "completed",
"summary": "Created importable pdftomd package skeleton, pyproject metadata, and typed core models."
},
{
"step": 2,
"name": "page-preanalysis-contract",
"status": "completed",
"summary": "Added PyMuPDF-only PDF preanalysis with page facts, OCR candidates, and 20-page chunk ranges."
},
{
"step": 3,
"name": "markdown-quality-gates",
"status": "completed",
"summary": "Added focused Markdown quality gates for math, LaTeX, tables, image links, frontmatter, and anchors."
}
]
}
-63
@@ -1,63 +0,0 @@
# Step 0: sample-metadata-contract
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/PRD.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md
## Task
Create the first sample corpus metadata contract without implementing the conversion engine.
The metadata must classify every PDF currently under `samples/` by traits that future regression tests can use:
- text layer quality
- scanned or mixed scanned/text pages
- multi-column or complex layout risk
- formula density
- table density
- figure density
- Korean filename/path coverage
- target regression focus
Use deterministic JSON so future agents can update it with minimal diff noise.
## Sprint Contract
- Done means: `samples/metadata.json` exists, includes every current PDF by exact relative path, and has enough structured fields for future tests to select OCR, layout, formula, table, figure, and Korean-path cases.
- Hard thresholds:
- Every current `samples/*.pdf` appears exactly once.
- Metadata is valid UTF-8 JSON.
- Tests fail if a sample PDF is added without metadata.
- Tests fail if duplicate sample paths exist in the metadata.
- No conversion engine code is introduced in this step.
- Files owned:
- `samples/metadata.json`
- `tests/test_sample_metadata.py`
- `PROGRESS.md`
- `phases/0-harness-foundation/index.json`
- Dependencies:
- Existing sample PDFs under `samples/`
- PyMuPDF may be used only for lightweight page count/text/image inspection if needed.
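The hard thresholds above could be checked with a comparison like the following. It is a sketch, not the contents of `tests/test_sample_metadata.py`: the function names are hypothetical, and it assumes metadata paths look like `samples/<name>.pdf`, since the real schema of `samples/metadata.json` is not shown here.

```python
import json
from collections import Counter
from typing import Iterable, List


def metadata_problems(pdf_names: Iterable[str], metadata_paths: Iterable[str]) -> List[str]:
    """Compare PDF file names on disk with the paths listed in the metadata."""
    on_disk = sorted(set(pdf_names))
    # Assumption: metadata paths use forward slashes, e.g. "samples/<name>.pdf".
    counts = Counter(path.rsplit("/", 1)[-1] for path in metadata_paths)
    problems = [f"missing metadata: {name}" for name in on_disk if counts[name] == 0]
    problems += [f"duplicate metadata: {name}" for name in sorted(counts) if counts[name] > 1]
    problems += [f"metadata without PDF: {name}" for name in sorted(counts) if name not in on_disk]
    return problems


def dump_metadata(data: dict) -> str:
    """Serialize with stable key order and readable non-ASCII characters."""
    return json.dumps(data, ensure_ascii=False, indent=2, sort_keys=True) + "\n"
```

In a pytest test, `pdf_names` could come from `Path("samples").glob("*.pdf")`, and the test would assert that `metadata_problems(...)` returns an empty list. `ensure_ascii=False` keeps Korean filenames readable in the JSON, and `sort_keys=True` keeps diffs small.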
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_sample_metadata.py
```
## Verification
1. Run the acceptance commands.
2. Confirm `samples/metadata.json` paths match `samples/*.pdf`.
3. Confirm Korean filenames remain readable in JSON.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
## Do Not
- Do not create `src/` or conversion engine modules in this step.
- Do not rename, delete, compress, or rewrite sample PDFs.
- Do not add sidecar output files for converted documents.
- Do not add a new custom agent.
-59
@@ -1,59 +0,0 @@
# Step 1: core-package-skeleton
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /phases/0-harness-foundation/step0.md
- /phases/0-harness-foundation/index.json
## Task
Create the minimal Python package skeleton and internal data contracts needed by later parser, pre-analysis, and renderer steps.
The skeleton should establish importable modules and typed models only. It should not call Marker, Nougat, PyMuPDF, OCR, CUDA, or the filesystem-heavy conversion path yet.
Suggested module boundary:
- `src/pdftomd/__init__.py`
- `src/pdftomd/models.py`
- `tests/test_models.py`
The exact type names may differ if the local design suggests better names, but the contracts must represent document identity, page ranges, block roles, bounding boxes, assets, formulas, tables, figures, and chunk metadata.
## Sprint Contract
- Done means: future steps have stable importable types for page analysis, block modeling, chunk metadata, and output assets.
- Hard thresholds:
- Tests cover model construction, deterministic slug/path-relevant fields, and page range invariants.
- Models do not depend on Marker, Nougat, PyMuPDF, torch, pandas, or PyQt.
- The package imports on Windows with `.\venv\python.exe`.
- Public contracts are documented by tests or clear docstrings.
- Files owned:
- `src/pdftomd/`
- `tests/test_models.py`
- `PROGRESS.md`
- `phases/0-harness-foundation/index.json`
- Dependencies:
- Step 0 metadata should be complete or explicitly blocked.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_models.py
```
## Verification
1. Run the acceptance commands.
2. Confirm package imports with `.\venv\python.exe -c "import pdftomd; print(pdftomd.__name__)"`.
3. Confirm no heavy parser/model imports are introduced.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
## Do Not
- Do not implement actual PDF parsing.
- Do not run Marker or Nougat.
- Do not add CLI commands.
- Do not add PyQt UI code.
- Do not widen the output contract beyond `docs/ARCHITECTURE.md`.
-63
@@ -1,63 +0,0 @@
# Step 2: page-preanalysis-contract
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md
- /phases/0-harness-foundation/step0.md
- /phases/0-harness-foundation/step1.md
- /phases/0-harness-foundation/index.json
## Task
Implement the lightweight page pre-analysis contract that decides what later conversion steps need to know before Marker runs.
This step should use PyMuPDF only for fast document/page inspection:
- page count
- text length or text density per page
- image count per page
- OCR candidate flag per page
- basic long-document chunk candidates using the 20-page target
The output should be typed using the models from Step 1.
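The chunk candidates and the OCR flag described above can be sketched as pure functions over already-extracted page facts. The names, the 1-based inclusive page ranges, and the 50-character threshold are assumptions for illustration; the real step would read the facts with PyMuPDF and return the typed models from Step 1.

```python
from typing import List, Tuple


def plan_chunks(page_count: int, target_pages: int = 20) -> List[Tuple[int, int]]:
    """Split pages 1..page_count into consecutive inclusive ranges of at most target_pages."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if target_pages < 1:
        raise ValueError("target_pages must be at least 1")
    ranges = []
    start = 1
    while start <= page_count:
        end = min(start + target_pages - 1, page_count)
        ranges.append((start, end))
        start = end + 1
    return ranges


def is_ocr_candidate(text_chars: int, image_count: int, min_text_chars: int = 50) -> bool:
    """Flag a page that has at least one image but almost no extractable text."""
    return text_chars < min_text_chars and image_count > 0
```

Because the last range is clamped with `min`, no chunk candidate can exceed the document page count. This sketch ignores logical block boundaries, which a later step would have to respect.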
## Sprint Contract
- Done means: given a PDF path, the pre-analysis API returns deterministic page-level facts and chunk candidates without running Marker, Nougat, OCR, or GPU code.
- Hard thresholds:
- Tests cover at least one text-heavy sample and one mixed/scanned-risk sample from `samples/metadata.json`.
- Tests cover Korean path handling through `pathlib`.
- OCR candidate logic is deterministic and documented by tests.
- Chunk candidates never exceed the document page count.
- Explicit conversion or Markdown rendering is not implemented here.
- Files owned:
- `src/pdftomd/preanalysis.py`
- model additions in `src/pdftomd/models.py` only if required
- `tests/test_preanalysis.py`
- `PROGRESS.md`
- `phases/0-harness-foundation/index.json`
- Dependencies:
- Step 0 sample metadata
- Step 1 package skeleton and models
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_preanalysis.py
```
## Verification
1. Run the acceptance commands.
2. Confirm PyMuPDF is the only PDF inspection dependency used in this step.
3. Confirm the sample metadata traits and test expectations are consistent.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
## Do Not
- Do not call Marker, Nougat, Surya, torch, or OCR.
- Do not write conversion output under `output/`.
- Do not create resume cache or runtime state files.
- Do not implement reading-order reconstruction in this step.
-63
@@ -1,63 +0,0 @@
# Step 3: markdown-quality-gates
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /phases/0-harness-foundation/step0.md
- /phases/0-harness-foundation/step1.md
- /phases/0-harness-foundation/step2.md
- /phases/0-harness-foundation/index.json
## Task
Create focused Markdown quality gate functions that later renderer and conversion steps can call.
This step should validate generated Markdown-like strings and asset references without requiring a full PDF conversion. It should prefer structured checks over full snapshot comparison.
Quality gates should cover:
- math delimiter balance for `$...$` and `$$...$$`
- LaTeX `\begin{...}` / `\end{...}` pairs
- image link path existence or modeled asset reference existence
- table parseability for simple Markdown tables
- chunk frontmatter fields required by the output contract
- caption/reference anchor shape where confidence is sufficient
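As one example of such a gate, the math delimiter check might look like the sketch below. `check_math_delimiters` is a hypothetical name, and the scan is deliberately simple: it treats `\$` as an escaped dollar sign and does not understand code spans or fenced code.

```python
from typing import List


def check_math_delimiters(markdown: str) -> List[str]:
    """Return human-readable problems; an empty list means delimiters are balanced."""
    in_block = False
    in_inline = False
    i = 0
    while i < len(markdown):
        ch = markdown[i]
        if ch == "\\":
            i += 2  # skip the escaped character, e.g. \$ or \\
            continue
        if markdown.startswith("$$", i) and not in_inline:
            in_block = not in_block
            i += 2
            continue
        if ch == "$" and not in_block:
            in_inline = not in_inline
        i += 1
    problems = []
    if in_block:
        problems.append("unclosed $$ block")
    if in_inline:
        problems.append("unclosed $ inline span")
    return problems
```

The function only reports problems and never changes its input, which matches the threshold that validation must not mutate generated Markdown silently. An unescaped currency amount such as `price $5` is reported as an unclosed inline span, so the caller can decide whether to escape it.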
## Sprint Contract
- Done means: later renderer steps have reusable validation functions and focused pytest coverage for Markdown output risks.
- Hard thresholds:
- Tests include passing and failing examples for math delimiter checks.
- Tests include a complex table case where Markdown limitations are represented as an allowed HTML/fallback decision.
- Tests do not rely on full Markdown snapshot equality.
- Validation functions do not mutate generated Markdown silently unless an explicit repair function is named and tested.
- No PDF parsing or renderer implementation is introduced here.
- Files owned:
- `src/pdftomd/quality.py`
- model additions in `src/pdftomd/models.py` only if required
- `tests/test_quality.py`
- `PROGRESS.md`
- `phases/0-harness-foundation/index.json`
- Dependencies:
- Step 1 package skeleton and models
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_quality.py
```
## Verification
1. Run the acceptance commands.
2. Confirm quality gates are focused assertions, not whole-document snapshots.
3. Confirm failures return actionable messages for evaluator use.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
## Do Not
- Do not implement Marker/Nougat adapters.
- Do not implement the full Markdown renderer.
- Do not introduce an LLM correction path.
- Do not write warning/error messages into generated Markdown content.
@@ -1,30 +0,0 @@
{
"project": "PDFtoMD",
"phase": "1-core-runtime-contracts",
"steps": [
{
"step": 0,
"name": "input-normalization-slug",
"status": "completed",
"summary": "Added deterministic PDF path normalization, document identity creation, anchors, and output bundle path contracts."
},
{
"step": 1,
"name": "conversion-options-config",
"status": "completed",
"summary": "Added typed conversion options with runtime mode and formula parser defaults matching project policy."
},
{
"step": 2,
"name": "output-bundle-contract",
"status": "completed",
"summary": "Added deterministic output bundle paths and separated runtime artifact paths from document output."
},
{
"step": 3,
"name": "runtime-cache-policy",
"status": "completed",
"summary": "Added model cache and runtime artifact path policies with explicit offline environment mappings."
}
]
}
-38
@@ -1,38 +0,0 @@
# Step 0: input-normalization-slug
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/0-harness-foundation/index.json
## Task
Implement deterministic input normalization and document slug generation for local PDF paths.
Cover `pathlib` handling for Korean filenames, spaces, relative paths, absolute paths, and long Windows paths. The API should not invoke Marker, Nougat, PyMuPDF, or any conversion logic.
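A deterministic slug could be derived from the file stem alone, as in the sketch below. `document_slug`, the hyphen-joined format, and the 8-character hash suffix are all assumptions for illustration. Long Windows path handling is not addressed here.

```python
import hashlib
import re
import unicodedata
from pathlib import Path


def document_slug(pdf_path: Path) -> str:
    """Build a stable slug from the file stem; the directory part is ignored."""
    if pdf_path.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF path: {pdf_path}")
    # NFC normalization keeps composed Hangul syllables stable across sources.
    stem = unicodedata.normalize("NFC", pdf_path.stem)
    # \w is Unicode-aware for str patterns, so Korean characters are kept.
    cleaned = re.sub(r"[^\w]+", "-", stem).strip("-_").lower() or "document"
    # A short digest separates stems that clean up to the same text.
    digest = hashlib.sha256(stem.encode("utf-8")).hexdigest()[:8]
    return f"{cleaned}-{digest}"
```

Because only the stem is used, the same file reached through a relative path and an absolute path gets the same slug. Whether two different files with the same name should share a slug is a design choice this sketch does not settle.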
## Sprint Contract
- Done means: the core package has a tested function or small module that normalizes input PDF paths and produces stable document slugs.
- Hard thresholds: same input path and options produce the same slug; non-PDF paths fail clearly; Korean and spaced paths are tested; no parser import is introduced.
- Files owned: `src/pdftomd/`, `tests/`, `PROGRESS.md`, `phases/1-core-runtime-contracts/index.json`.
- Dependencies: Phase 0 package skeleton and model contracts.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm `PROGRESS.md` records the handoff and validation result.
3. Update this phase index step to `completed`, `blocked`, or `error`.
## Do Not
- Do not implement PDF parsing.
- Do not write conversion output.
- Do not add UI code.
-38
@@ -1,38 +0,0 @@
# Step 1: conversion-options-config
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/ADR.md
- /phases/1-core-runtime-contracts/step0.md
## Task
Define the typed conversion options and runtime configuration used by CLI, library, parser adapters, renderer, and UI.
Include runtime mode, device behavior, chunk target pages, formula parser mode, Nougat command path, output directory, model cache location, and resume/log options.
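A typed options object along these lines might be a frozen dataclass. This is a sketch with hypothetical field names. The `cuda` and `nougat` defaults are assumptions based on the GPU-by-default and Nougat-for-formulas policy, and the real defaults should be taken from `docs/ARCHITECTURE.md` and `docs/CONVERSION_POLICY.md`.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

RUNTIME_MODES = ("cuda", "auto", "cpu")
FORMULA_PARSERS = ("nougat", "marker")


@dataclass(frozen=True)
class ConversionOptions:
    output_dir: Path
    runtime_mode: str = "cuda"            # assumed default, see lead-in
    formula_parser: str = "nougat"        # assumed default, see lead-in
    chunk_target_pages: int = 20
    nougat_command: Optional[Path] = None  # None means "resolve at runtime"
    model_cache: Optional[Path] = None     # None means "use the default cache root"
    resume: bool = False

    def __post_init__(self) -> None:
        # Validate eagerly so bad options fail before any model is loaded.
        if self.runtime_mode not in RUNTIME_MODES:
            raise ValueError(f"runtime_mode must be one of {RUNTIME_MODES}")
        if self.formula_parser not in FORMULA_PARSERS:
            raise ValueError(f"formula_parser must be one of {FORMULA_PARSERS}")
        if self.chunk_target_pages < 1:
            raise ValueError("chunk_target_pages must be at least 1")
```

Tests can then build options directly, for example `ConversionOptions(output_dir=Path("output"))`, without going through CLI parsing.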
## Sprint Contract
- Done means: conversion options have defaults matching project policy and can be constructed by tests without CLI parsing.
- Hard thresholds: explicit `cuda` fail-fast semantics and `auto` fallback semantics are represented; Nougat remains formula-only; PyQt and hosted API options are not introduced.
- Files owned: `src/pdftomd/`, `tests/`, `PROGRESS.md`, `phases/1-core-runtime-contracts/index.json`.
- Dependencies: Step 0 normalized path/slug contract.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm defaults align with `docs/ARCHITECTURE.md` and `docs/CONVERSION_POLICY.md`.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not add command-line parsing yet.
- Do not initialize CUDA, Marker, or Nougat.
- Do not add external API settings.
-39
@@ -1,39 +0,0 @@
# Step 2: output-bundle-contract
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/1-core-runtime-contracts/step0.md
- /phases/1-core-runtime-contracts/step1.md
## Task
Define deterministic output bundle path rules for chunk Markdown files, image assets, anchors, and runtime artifacts.
This is a contract step. It may include lightweight path helpers and tests, but it should not render Markdown or write parsed document content.
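Lightweight path helpers for this contract could look like the sketch below. The chunk file pattern, the `assets` folder, and the function names are hypothetical; only the `.pdftomd-runtime/<document-slug>/` location is taken from `docs/TOOLCHAIN.md`.

```python
from pathlib import Path


def chunk_markdown_path(output_dir: Path, slug: str, chunk_index: int) -> Path:
    """Path of one chunk Markdown file; zero padding keeps names sortable."""
    return output_dir / slug / f"{slug}-chunk-{chunk_index:03d}.md"


def image_asset_path(output_dir: Path, slug: str, page: int, index: int, suffix: str = ".png") -> Path:
    """Path of one extracted image, named from its page and per-page index."""
    return output_dir / slug / "assets" / f"p{page:04d}-img{index:02d}{suffix}"


def runtime_dir(output_dir: Path, slug: str) -> Path:
    """Logs and resume state live here, outside the Markdown bundle folder."""
    return output_dir / ".pdftomd-runtime" / slug
```

The helpers only compute paths and never touch the filesystem, so tests can assert on them without writing under `output/`.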
## Sprint Contract
- Done means: output directory, chunk file names, image asset names, and runtime log/state locations are modeled and tested.
- Hard thresholds: document output sidecars remain out of scope; runtime logs/state are separated from Markdown bundle output; asset naming is deterministic.
- Files owned: `src/pdftomd/`, `tests/`, `PROGRESS.md`, `phases/1-core-runtime-contracts/index.json`.
- Dependencies: Steps 0 and 1.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm generated path contracts match `docs/ARCHITECTURE.md`.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not implement the renderer.
- Do not write files under `output/` in tests unless using a temp directory.
- Do not create sidecar metadata output.
-39
@@ -1,39 +0,0 @@
# Step 3: runtime-cache-policy
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/TOOLCHAIN.md
- /docs/CONVERSION_POLICY.md
- /phases/1-core-runtime-contracts/step1.md
- /phases/1-core-runtime-contracts/step2.md
## Task
Establish model cache, log path, and resume state policy as typed contracts and documented path helpers.
The result should prepare later CLI/runtime phases to use local model cache paths and offline-preferred model loading.
## Sprint Contract
- Done means: model cache and runtime cache path contracts are tested and documented without downloading models.
- Hard thresholds: no network download is triggered; logs/state remain outside generated Markdown content; environment variable overrides are deterministic.
- Files owned: `src/pdftomd/`, `tests/`, `docs/TOOLCHAIN.md`, `PROGRESS.md`, `phases/1-core-runtime-contracts/index.json`.
- Dependencies: Steps 1 and 2.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm `docs/TOOLCHAIN.md` stays consistent with any cache path decisions.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not download Marker or Nougat weights.
- Do not add hosted storage or cloud cache behavior.
- Do not write warnings into Markdown output.
-26
@@ -1,26 +0,0 @@
{
"project": "PDFtoMD",
"phase": "2-marker-adapter",
"steps": [
{
"step": 0,
"name": "marker-invocation-adapter",
"status": "pending"
},
{
"step": 1,
"name": "ocr-plan-handoff",
"status": "pending"
},
{
"step": 2,
"name": "marker-block-normalization",
"status": "pending"
},
{
"step": 3,
"name": "marker-failure-reporting",
"status": "pending"
}
]
}
-38
@@ -1,38 +0,0 @@
# Step 0: marker-invocation-adapter
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/TOOLCHAIN.md
- /phases/1-core-runtime-contracts/index.json
## Task
Implement the first Marker adapter boundary that invokes Marker through a small internal interface.
Keep this adapter isolated so tests can use fakes without loading large models. Real Marker invocation should be smoke-testable but not required for every unit test.
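One way to keep the boundary narrow is a small protocol plus a fake, as sketched below. None of these names come from Marker itself: `DocumentParser`, `ParseResult`, `FakeParser`, and `parse_chunk` are hypothetical, and the real adapter would wrap Marker's own API behind the same `parse` method.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional, Protocol


@dataclass(frozen=True)
class ParseResult:
    blocks: List[dict] = field(default_factory=list)
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


class DocumentParser(Protocol):
    def parse(self, pdf_path: Path, first_page: int, last_page: int) -> ParseResult: ...


class FakeParser:
    """Test double that returns a canned result without loading any model."""

    def __init__(self, result: ParseResult) -> None:
        self._result = result

    def parse(self, pdf_path: Path, first_page: int, last_page: int) -> ParseResult:
        return self._result


def parse_chunk(parser: DocumentParser, pdf_path: Path, first_page: int, last_page: int) -> ParseResult:
    """Call the parser and turn any exception into a structured failure."""
    try:
        return parser.parse(pdf_path, first_page, last_page)
    except Exception as exc:
        return ParseResult(error=f"{type(exc).__name__}: {exc}")
```

Unit tests can pass a `FakeParser`, while a separate smoke test exercises the real Marker-backed implementation.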
## Sprint Contract
- Done means: Marker invocation is behind a narrow interface and can return structured parse results or clear failures.
- Hard thresholds: Marker remains the primary document parser; Nougat is not used here; unit tests avoid mandatory model downloads; parser errors are structured.
- Files owned: `src/pdftomd/marker_adapter.py`, related tests, `PROGRESS.md`, `phases/2-marker-adapter/index.json`.
- Dependencies: Phase 1 runtime contracts.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm the adapter can be tested without external services.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not parse formulas with Nougat.
- Do not implement Markdown rendering.
- Do not make every test load Marker models.
-38
@@ -1,38 +0,0 @@
# Step 1: ocr-plan-handoff
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/0-harness-foundation/step2.md
- /phases/2-marker-adapter/step0.md
## Task
Connect PyMuPDF page pre-analysis results to the Marker adapter as an OCR/layout handoff plan.
The goal is to preserve page-level OCR decisions without making the entire document scan-only or text-only.
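Page-level OCR flags could be collapsed into page ranges before the handoff, as in this sketch. `ocr_page_ranges` is a hypothetical helper. How the ranges are then passed into Marker configuration depends on Marker's options and is not shown.

```python
from typing import Dict, List, Tuple


def ocr_page_ranges(ocr_flags: Dict[int, bool]) -> List[Tuple[int, int]]:
    """Collapse per-page OCR flags into sorted inclusive ranges of consecutive pages."""
    pages = sorted(page for page, needs_ocr in ocr_flags.items() if needs_ocr)
    ranges: List[Tuple[int, int]] = []
    for page in pages:
        if ranges and page == ranges[-1][1] + 1:
            # Extend the current range when the page is adjacent to it.
            ranges[-1] = (ranges[-1][0], page)
        else:
            ranges.append((page, page))
    return ranges
```

Keeping ranges rather than one document-wide flag preserves the mixed case, where only some pages of a document are scans.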
## Sprint Contract
- Done means: the adapter accepts page-level OCR candidates and passes the relevant intent into Marker configuration or records an explicit unsupported-path fallback.
- Hard thresholds: OCR decisions stay page-aware; PyMuPDF remains pre-analysis only; no OCR logs are inserted into Markdown.
- Files owned: `src/pdftomd/marker_adapter.py`, `src/pdftomd/preanalysis.py` if needed, tests, `PROGRESS.md`, phase index.
- Dependencies: Phase 0 pre-analysis and Step 0 Marker adapter.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm mixed text/scanned sample traits are represented in tests.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not force document-wide OCR when only selected pages need OCR.
- Do not implement reading-order fixes here.
- Do not add a second primary parser.
-39
@@ -1,39 +0,0 @@
# Step 2: marker-block-normalization
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/0-harness-foundation/step1.md
- /phases/2-marker-adapter/step0.md
## Task
Map Marker structured output into the internal block model for headings, paragraphs, lists, tables, figures, captions, and equation candidates.
Prefer structured Marker APIs or JSON-like structures over scraping final Markdown.
## Sprint Contract
- Done means: fake Marker structures and at least one real or recorded sample shape map into internal block types.
- Hard thresholds: semantic block roles are preserved; bounding boxes and page numbers survive where available; formula blocks are only marked as candidates for Phase 3.
- Files owned: `src/pdftomd/marker_adapter.py`, model additions if required, tests, `PROGRESS.md`, phase index.
- Dependencies: Phase 0 models and Step 0 adapter.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm no final Markdown scraping is required for normal block mapping.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not perform Nougat conversion.
- Do not render Markdown.
- Do not discard page or bounding-box metadata without a documented reason.
-38
@@ -1,38 +0,0 @@
# Step 3: marker-failure-reporting
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/2-marker-adapter/step0.md
- /phases/2-marker-adapter/step2.md
## Task
Define structured Marker failure reporting for parser errors, unsupported pages, timeout-like failures, and recoverable partial output.
This prepares later CLI and resume behavior without writing CLI code.
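The failure categories above might be modeled as an enum plus a frozen record, as sketched here. `FailureKind`, `MarkerFailure`, and the log line format are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class FailureKind(Enum):
    PARSER_ERROR = "parser_error"
    UNSUPPORTED_PAGE = "unsupported_page"
    TIMEOUT = "timeout"
    PARTIAL_OUTPUT = "partial_output"


@dataclass(frozen=True)
class MarkerFailure:
    kind: FailureKind
    message: str
    chunk_index: Optional[int] = None
    page_range: Optional[Tuple[int, int]] = None
    fallback_eligible: bool = False

    def log_line(self) -> str:
        """Format one line for stderr or the local log, never for Markdown output."""
        where = []
        if self.chunk_index is not None:
            where.append(f"chunk={self.chunk_index}")
        if self.page_range is not None:
            where.append(f"pages={self.page_range[0]}-{self.page_range[1]}")
        context = " ".join(where) or "context=unknown"
        fallback = "yes" if self.fallback_eligible else "no"
        return f"[{self.kind.value}] {context} fallback={fallback}: {self.message}"
```

Making fallback eligibility an explicit field keeps that decision visible to later CLI and resume code instead of hiding it in message text.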
## Sprint Contract
- Done means: Marker adapter failures are typed, testable, and do not corrupt generated Markdown content.
- Hard thresholds: failures include page/chunk context where available; errors go to runtime reporting paths, not document body; fallback eligibility is explicit.
- Files owned: `src/pdftomd/marker_adapter.py`, error/reporting models, tests, `PROGRESS.md`, phase index.
- Dependencies: Steps 0 and 2.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm failure messages are actionable for CLI and evaluator use.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not silently swallow Marker failures.
- Do not implement resume state here.
- Do not write errors into Markdown chunks.
-26
@@ -1,26 +0,0 @@
{
"project": "PDFtoMD",
"phase": "3-formula-pipeline",
"steps": [
{
"step": 0,
"name": "formula-block-detection",
"status": "pending"
},
{
"step": 1,
"name": "nougat-command-adapter",
"status": "pending"
},
{
"step": 2,
"name": "latex-validation-repair",
"status": "pending"
},
{
"step": 3,
"name": "formula-reference-links",
"status": "pending"
}
]
}
-37
@@ -1,37 +0,0 @@
# Step 0: formula-block-detection
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/2-marker-adapter/step2.md
## Task
Implement formula candidate detection from normalized Marker blocks.
Detect Marker equation blocks and text-pattern candidates while classifying inline versus block formulas based on block role and layout hints.
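The text-pattern side of this detection could start from a conservative heuristic like the one below. It is a sketch with hypothetical names and rules: structured Marker block roles should take precedence where they exist, and the currency test here only covers simple cases such as `$5 and $10`.

```python
import re
from typing import List

# Opening `$` must not be escaped or doubled and must be followed by a
# non-space character; closing `$` must follow a non-space character and
# must not be followed by a digit or another `$`.
_INLINE = re.compile(r"(?<![\\$])\$(?![\s$])(.+?)(?<![\s\\])\$(?![\d$])")
_CURRENCY_LIKE = re.compile(r"^\d[\d,.]*(\s.*)?$")
_MATH_MARKER = re.compile(r"[\\^_=+{}]")


def inline_formula_candidates(text: str) -> List[str]:
    """Return the bodies of `$...$` spans that do not look like currency."""
    found = []
    for match in _INLINE.finditer(text):
        body = match.group(1)
        # A body that starts like a plain number and has no math marker
        # is treated as currency text and skipped.
        if _CURRENCY_LIKE.match(body) and not _MATH_MARKER.search(body):
            continue
        found.append(body)
    return found
```

Block formulas written as `$$ ... $$` are ignored by this pattern on purpose, so the inline and block cases can be tested separately.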
## Sprint Contract
- Done means: formula candidates are represented as internal objects ready for Nougat or Marker fallback.
- Hard thresholds: ordinary currency-like dollar text is not blindly treated as math; inline/block distinction is tested; no Nougat invocation occurs yet.
- Files owned: `src/pdftomd/formulas.py`, tests, `PROGRESS.md`, `phases/3-formula-pipeline/index.json`.
- Dependencies: Phase 2 block normalization.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm tests include inline and block formula candidates.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not call Nougat.
- Do not render Markdown math.
- Do not make regex the only source when structured block role exists.
-38
@@ -1,38 +0,0 @@
# Step 1: nougat-command-adapter
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/TOOLCHAIN.md
- /docs/CONVERSION_POLICY.md
- /phases/3-formula-pipeline/step0.md
## Task
Implement the Nougat formula-only adapter boundary.
The adapter should accept formula candidates and return LaTeX candidates or structured failure results. It should support a configured Nougat command path and be mockable in unit tests.
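The fallback behavior can be sketched without committing to any Nougat command line. In this sketch `run_nougat` is an injected callable, so the real command invocation stays outside the function and tests can pass a stub. `FormulaResult`, `convert_formula`, and the `source` labels are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FormulaResult:
    latex: str
    source: str  # "nougat" or "marker-fallback"
    error: Optional[str] = None


def convert_formula(
    candidate_id: str,
    marker_text: str,
    run_nougat: Callable[[str], str],
) -> FormulaResult:
    """Try Nougat for one formula candidate and keep Marker text on any failure."""
    try:
        latex = run_nougat(candidate_id).strip()
    except Exception as exc:
        return FormulaResult(marker_text, "marker-fallback", f"{type(exc).__name__}: {exc}")
    if not latex:
        return FormulaResult(marker_text, "marker-fallback", "empty Nougat output")
    return FormulaResult(latex, "nougat")
```

Because the Marker source text is always carried into the result, a failed or missing `nougat.exe` cannot discard the original formula string.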
## Sprint Contract
- Done means: Nougat execution is isolated behind a testable command adapter and never becomes the primary document parser.
- Hard thresholds: failures preserve Marker fallback text; tests do not require GPU/model execution by default; command path handling works on Windows.
- Files owned: `src/pdftomd/formulas.py`, optional `src/pdftomd/nougat_adapter.py`, tests, `PROGRESS.md`, phase index.
- Dependencies: Step 0 formula candidates and Phase 1 options.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm `.\venv\Scripts\nougat.exe --help` remains documented as an environment check, not a unit-test requirement.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not parse whole PDFs with Nougat.
- Do not require model downloads for normal unit tests.
- Do not discard Marker source text on failure.
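A minimal sketch of the adapter boundary this step describes, assuming an injectable runner so unit tests never start Nougat. `FormulaResult`, `NougatAdapter`, `run_subprocess`, and the single-argument command line are hypothetical names and choices, not the project's actual API or Nougat's documented invocation.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class FormulaResult:
    """Outcome of one formula conversion attempt (hypothetical model)."""
    ok: bool
    latex: Optional[str]
    fallback_text: str  # Marker source text, kept on success and failure
    error: Optional[str] = None


Runner = Callable[[Sequence[str]], str]


def run_subprocess(args: Sequence[str]) -> str:
    """Default runner: execute the configured command and return stdout."""
    done = subprocess.run(list(args), capture_output=True, text=True, check=True)
    return done.stdout


class NougatAdapter:
    def __init__(self, command: Path, runner: Runner = run_subprocess) -> None:
        self._command = command
        self._runner = runner

    def convert(self, crop: Path, marker_text: str) -> FormulaResult:
        # Assumption: the command receives the cropped formula file as its
        # only argument. Adjust to the real command line once it is verified.
        args = [str(self._command), str(crop)]
        try:
            latex = self._runner(args).strip()
        except (OSError, subprocess.SubprocessError) as exc:
            return FormulaResult(False, None, marker_text, error=str(exc))
        if not latex:
            return FormulaResult(False, None, marker_text, error="empty output")
        return FormulaResult(True, latex, marker_text)
```

In tests, passing `runner=lambda args: "..."` replaces the subprocess call, so no GPU or model download is involved.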
@@ -1,38 +0,0 @@
# Step 2: latex-validation-repair
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/0-harness-foundation/step3.md
- /phases/3-formula-pipeline/step1.md
## Task
Implement LaTeX and Markdown math validation for formula outputs, plus explicit repair helpers for safe cases.
Validation should cover delimiter balance and common `\begin{...}` / `\end{...}` pairs.
## Sprint Contract
- Done means: formula output validation returns actionable diagnostics and tested repairs for narrow, deterministic cases.
- Hard thresholds: validation does not silently mutate math; unrepairable failures fall back to Marker text; delimiter tests include both inline and block math.
- Files owned: `src/pdftomd/formulas.py`, `src/pdftomd/quality.py`, tests, `PROGRESS.md`, phase index.
- Dependencies: Phase 0 quality gates and Step 1 Nougat adapter.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm broken delimiter and environment examples are covered.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not build a broad LaTeX parser from scratch.
- Do not use LLM repair.
- Do not hide validation failures.
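The validation named in this step can be sketched as a pure function that returns diagnostics and never edits its input. This is an illustrative sketch only: `validate_math` is a hypothetical name, and it checks just `$` / `$$` balance and `\begin{...}` / `\end{...}` nesting.

```python
import re
from typing import List

# Unescaped "$" or "$$"; "\$" (a literal dollar sign) is skipped.
_DELIMITER = re.compile(r"(?<!\\)\$\$?")
_ENVIRONMENT = re.compile(r"\\(begin|end)\{([^}]*)\}")


def validate_math(text: str) -> List[str]:
    """Return human-readable diagnostics; an empty list means no issue found."""
    diagnostics: List[str] = []

    open_delimiter = None
    for match in _DELIMITER.finditer(text):
        token = match.group(0)
        if open_delimiter is None:
            open_delimiter = token
        elif token == open_delimiter:
            open_delimiter = None
        else:
            diagnostics.append(
                f"offset {match.start()}: found {token!r} while {open_delimiter!r} is open"
            )
            open_delimiter = None
    if open_delimiter is not None:
        diagnostics.append(f"unclosed math delimiter {open_delimiter!r}")

    stack: List[str] = []
    for match in _ENVIRONMENT.finditer(text):
        kind, name = match.group(1), match.group(2)
        if kind == "begin":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            diagnostics.append(f"offset {match.start()}: unexpected \\end{{{name}}}")
    diagnostics.extend(f"unclosed \\begin{{{name}}}" for name in stack)
    return diagnostics
```

Keeping repair in separate helpers that consume these diagnostics would make every mutation explicit, in line with the "does not silently mutate math" threshold.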
@@ -1,37 +0,0 @@
# Step 3: formula-reference-links
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/3-formula-pipeline/step2.md
## Task
Preserve formula numbering and body references as internal Markdown link targets when confidence is sufficient.
Support common English and Korean reference patterns such as `Eq. (3)` and `식 (5)`.
## Sprint Contract
- Done means: formula anchors and reference rewrites are modeled and tested independently from final Markdown rendering.
- Hard thresholds: low-confidence matches remain plain text; duplicate formula numbers do not create unstable anchors; references never point to missing anchors.
- Files owned: `src/pdftomd/formulas.py`, reference model/tests, `PROGRESS.md`, phase index.
- Dependencies: Steps 0 through 2.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm duplicate and missing reference cases are tested.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not rewrite ambiguous references.
- Do not render final Markdown chunks.
- Do not remove the original formula number text.
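One way to model the reference rewrite in this step, as a sketch under stated assumptions: formula numbers are plain digit strings, anchors take the hypothetical form `#eq-N`, and a number counts as high confidence only when it occurs exactly once among the detected formulas. `link_references` is an illustrative name.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the two patterns named in the task: "Eq. (3)" and "식 (5)".
_REFERENCE = re.compile(r"(?:Eq\.|식)\s*\((\d+)\)")


def link_references(text: str, formula_numbers: Iterable[str]) -> str:
    """Wrap body references in Markdown links when the target is unambiguous."""
    counts = Counter(formula_numbers)

    def rewrite(match: "re.Match[str]") -> str:
        number = match.group(1)
        if counts[number] != 1:
            # Missing or duplicated target: leave the reference as plain text.
            return match.group(0)
        # The original reference text is kept as the link label.
        return f"[{match.group(0)}](#eq-{number})"

    return _REFERENCE.sub(rewrite, text)
```

Because the function only returns a string, it can be tested without any Markdown renderer.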
@@ -1,26 +0,0 @@
{
"project": "PDFtoMD",
"phase": "4-semantic-enrichment",
"steps": [
{
"step": 0,
"name": "reading-order-checks",
"status": "pending"
},
{
"step": 1,
"name": "paragraph-stitching",
"status": "pending"
},
{
"step": 2,
"name": "header-footer-filtering",
"status": "pending"
},
{
"step": 3,
"name": "reference-indexing",
"status": "pending"
}
]
}
@@ -1,38 +0,0 @@
# Step 0: reading-order-checks
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/2-marker-adapter/step2.md
## Task
Create reading-order verification helpers over normalized blocks.
Use page numbers and bounding boxes to detect obvious ordering anomalies in multi-column or inserted-text layouts.
## Sprint Contract
- Done means: reading-order checks produce diagnostics that later enrichment and evaluator steps can use.
- Hard thresholds: checks are deterministic; tests include a multi-column-like fixture; helpers do not reorder content silently.
- Files owned: `src/pdftomd/enrichment.py`, tests, `PROGRESS.md`, `phases/4-semantic-enrichment/index.json`.
- Dependencies: Phase 2 normalized block model.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm diagnostics are actionable and tied to page/block ids.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not override Marker ordering without tests.
- Do not render Markdown.
- Do not call Marker or Nougat.
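A minimal sketch of a reading-order check over normalized blocks. It assumes a hypothetical `Block` model whose bounding boxes use a top-left origin (y grows downward), and it flags only two obvious anomalies: a page number going backward, and a block that starts above its predecessor within the same column. It reports and never reorders.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Block:
    block_id: str
    page: int
    x0: float
    y0: float  # top edge (assumed: y grows downward)
    x1: float
    y1: float


def reading_order_diagnostics(blocks: Sequence[Block]) -> List[str]:
    """Compare each block with its predecessor and describe anomalies."""
    diagnostics: List[str] = []
    for prev, cur in zip(blocks, blocks[1:]):
        if cur.page < prev.page:
            diagnostics.append(
                f"{cur.block_id}: page {cur.page} follows page {prev.page} ({prev.block_id})"
            )
        elif cur.page == prev.page:
            # Horizontal overlap is used as a rough "same column" test.
            same_column = cur.x0 < prev.x1 and prev.x0 < cur.x1
            if same_column and cur.y0 < prev.y0:
                diagnostics.append(
                    f"{cur.block_id}: page {cur.page}, starts above {prev.block_id} in the same column"
                )
    return diagnostics
```

Moving from the bottom of the left column to the top of the right column is not flagged, because the two blocks do not overlap horizontally. Row-wise interleaving of columns would need an additional heuristic.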
@@ -1,37 +0,0 @@
# Step 1: paragraph-stitching
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/4-semantic-enrichment/step0.md
## Task
Implement paragraph stitching for line-fragmented PDF text blocks.
Handle continuation lines and hyphenated line breaks while preserving likely compound words or identifiers when confidence is low.
## Sprint Contract
- Done means: paragraph stitching turns line fragments into coherent paragraph blocks with focused tests.
- Hard thresholds: hyphen joins are tested; low-confidence hyphen cases are preserved; list items and headings are not merged into paragraphs.
- Files owned: `src/pdftomd/enrichment.py`, tests, `PROGRESS.md`, phase index.
- Dependencies: Step 0 checks and normalized block model.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm Korean and English text fixtures remain stable.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not rely only on punctuation rules when bounding-box hints exist.
- Do not merge across tables, figures, or formulas.
- Do not modify source PDF files.
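The hyphen rule in this step might look like the following sketch. It is text-only and ignores bounding-box hints, so it is an assumption about the fallback path rather than the full policy; `stitch_lines` is a hypothetical helper. A trailing hyphen is removed only when the fragment before it is a lowercase alphabetic word and the next line starts with a lowercase letter; every other case keeps the hyphen.

```python
from typing import Iterable


def stitch_lines(lines: Iterable[str]) -> str:
    """Join line fragments of one paragraph into a single string."""
    out = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not out:
            out = line
            continue
        if out.endswith("-"):
            # Last whitespace-separated token without its trailing hyphen.
            last_word = out.rsplit(None, 1)[-1][:-1]
            confident = last_word.isalpha() and last_word.islower() and line[0].islower()
            if confident:
                out = out[:-1] + line  # "conver-" + "sion" -> "conversion"
            else:
                out = out + line       # keep hyphen: "GPU-" + "based" -> "GPU-based"
        else:
            out = out + " " + line
    return out
```

Fragments such as `state-of-the-` fail the alphabetic check and keep their hyphen, which matches the "low-confidence hyphen cases are preserved" threshold.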
@@ -1,37 +0,0 @@
# Step 2: header-footer-filtering
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/4-semantic-enrichment/step1.md
## Task
Detect repeated page headers, footers, and page numbers and separate them from the main Markdown body flow.
The implementation should mark or remove repetitive boilerplate according to policy while keeping enough diagnostics for review.
## Sprint Contract
- Done means: repeated top/bottom page-region text can be identified and excluded from main content in tests.
- Hard thresholds: unique body text is not removed; page number patterns are tested; removal decisions are deterministic.
- Files owned: `src/pdftomd/enrichment.py`, tests, `PROGRESS.md`, phase index.
- Dependencies: Paragraph and block model from earlier steps.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm false-positive protections are tested.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not delete content without a confidence rule.
- Do not write filtered text into sidecar document outputs.
- Do not implement CLI reporting here.
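A sketch of the repetition rule behind this step, assuming the caller has already collected the lines found in the top and bottom regions of each page. `repeated_edge_lines` and the 0.6 page-fraction threshold are hypothetical; digits are normalized so that "Page 1" and "Page 2" count as the same pattern.

```python
import re
from collections import Counter
from typing import Sequence, Set


def _normalize(line: str) -> str:
    """Collapse digit runs so page-number variants compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def repeated_edge_lines(pages: Sequence[Sequence[str]], min_fraction: float = 0.6) -> Set[str]:
    """Return normalized edge-region lines that recur on enough pages."""
    counts: Counter = Counter()
    for lines in pages:
        # Count each pattern at most once per page.
        counts.update({_normalize(line) for line in lines})
    threshold = min_fraction * len(pages)
    return {pattern for pattern, seen in counts.items() if pattern and seen >= threshold}
```

The function only identifies candidates. Whether a candidate is marked or removed stays a separate, policy-driven decision, so unique body text is never dropped by this helper.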
@@ -1,38 +0,0 @@
# Step 3: reference-indexing
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/3-formula-pipeline/step3.md
- /phases/4-semantic-enrichment/step2.md
## Task
Build a reference index for figures, tables, formulas, captions, and body references.
The index should support later Markdown rendering by providing stable anchors and high-confidence link targets.
## Sprint Contract
- Done means: table, figure, and formula references can be resolved or left plain with reasons.
- Hard thresholds: anchors are deterministic; duplicate labels are handled; missing targets do not produce broken links.
- Files owned: `src/pdftomd/enrichment.py`, reference models/tests, `PROGRESS.md`, phase index.
- Dependencies: Formula links and semantic block enrichment.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm figure/table/formula reference fixtures are covered.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not rewrite ambiguous references.
- Do not make anchors depend on nondeterministic ordering.
- Do not render final Markdown here.
@@ -1,26 +0,0 @@
{
"project": "PDFtoMD",
"phase": "5-markdown-rendering-assets",
"steps": [
{
"step": 0,
"name": "markdown-block-renderer",
"status": "pending"
},
{
"step": 1,
"name": "table-renderer-fallbacks",
"status": "pending"
},
{
"step": 2,
"name": "figure-asset-writer",
"status": "pending"
},
{
"step": 3,
"name": "chunk-renderer",
"status": "pending"
}
]
}
@@ -1,38 +0,0 @@
# Step 0: markdown-block-renderer
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/4-semantic-enrichment/index.json
## Task
Implement block-level Markdown rendering for headings, paragraphs, lists, blockquotes, formulas, captions, and simple references.
Renderer tests should use internal block fixtures, not live PDF parsing.
## Sprint Contract
- Done means: core block types render to deterministic Markdown strings with focused tests.
- Hard thresholds: math delimiter validation is applied; renderer does not inject warnings/errors into Markdown; output is stable across runs.
- Files owned: `src/pdftomd/renderer.py`, tests, `PROGRESS.md`, `phases/5-markdown-rendering-assets/index.json`.
- Dependencies: Phase 4 enriched blocks and Phase 3 formula outputs.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm renderer tests are focused, not full snapshots.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not invoke Marker or Nougat.
- Do not implement table/asset file writing in this step.
- Do not add sidecar document outputs.
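A minimal sketch of block-level rendering driven by internal fixtures. `RenderBlock`, the `kind` strings, and the `$$` block-math convention are assumptions for illustration; the real block model comes from earlier phases.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RenderBlock:
    kind: str  # "heading", "paragraph", "list_item", "blockquote", or "formula"
    text: str
    level: int = 1  # only used for headings


def render_block(block: RenderBlock) -> str:
    """Render one block to a Markdown string with no side effects."""
    if block.kind == "heading":
        return "#" * block.level + " " + block.text
    if block.kind == "paragraph":
        return block.text
    if block.kind == "list_item":
        return "- " + block.text
    if block.kind == "blockquote":
        return "\n".join("> " + line for line in block.text.splitlines())
    if block.kind == "formula":
        return "$$\n" + block.text + "\n$$"
    raise ValueError(f"unsupported block kind: {block.kind}")


def render_blocks(blocks: Iterable[RenderBlock]) -> str:
    """Join rendered blocks with blank lines and end with one newline."""
    return "\n\n".join(render_block(block) for block in blocks) + "\n"
```

Unsupported kinds raise instead of emitting a warning into the output, so the Markdown stays free of diagnostics.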
@@ -1,37 +0,0 @@
# Step 1: table-renderer-fallbacks
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/5-markdown-rendering-assets/step0.md
## Task
Implement table rendering policy for Markdown tables, limited HTML tables, and image fallback links.
Use structured table objects and avoid ad hoc string parsing for complex cases where possible.
## Sprint Contract
- Done means: simple tables render as Markdown, complex tables can render as limited HTML or fallback references, and table captions/footnotes are preserved.
- Hard thresholds: tests cover merged-cell-like structures, footnotes, captions, and table fallback decisions; invalid table output is detected by quality gates.
- Files owned: `src/pdftomd/renderer.py`, table models/tests, `PROGRESS.md`, phase index.
- Dependencies: Step 0 renderer and Phase 0 quality gates.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm fallback images are linked but not generated unless a table asset exists.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not fake table content that was not extracted.
- Do not discard captions or footnotes.
- Do not implement full HTML sanitizer scope beyond limited table output.
@@ -1,39 +0,0 @@
# Step 2: figure-asset-writer
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/1-core-runtime-contracts/step2.md
- /phases/5-markdown-rendering-assets/step0.md
## Task
Implement deterministic image/figure asset writing and Markdown image reference generation.
Use hash-based deduplication when asset bytes are available and preserve figure captions and reference anchors.
## Sprint Contract
- Done means: figure assets can be written to temp output bundles with deterministic names and Markdown references.
- Hard thresholds: duplicate images share stored assets where configured; Korean path output is tested; missing assets produce validation failures, not silently broken links.
- Files owned: `src/pdftomd/assets.py`, renderer integration/tests, `PROGRESS.md`, phase index.
- Dependencies: Output bundle contract and renderer.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm tests write only to temporary directories.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not write into real `output/` during tests.
- Do not rename source PDFs.
- Do not drop figure captions.
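Hash-based deduplication with deterministic names can be sketched as follows. The `assets/` subdirectory, the 16-character SHA-256 prefix, and the function names are assumptions for illustration, not the bundle contract from Phase 1.

```python
import hashlib
from pathlib import Path


def write_asset(bundle_dir: Path, data: bytes, suffix: str = ".png") -> Path:
    """Store bytes under a content-derived name; identical bytes share a file."""
    digest = hashlib.sha256(data).hexdigest()[:16]
    assets_dir = bundle_dir / "assets"
    assets_dir.mkdir(parents=True, exist_ok=True)
    target = assets_dir / f"{digest}{suffix}"
    if not target.exists():
        target.write_bytes(data)
    return target


def image_reference(bundle_dir: Path, asset: Path, caption: str) -> str:
    """Build a Markdown image reference relative to the bundle root."""
    relative = asset.relative_to(bundle_dir).as_posix()
    return f"![{caption}]({relative})"
```

A test would pass a `tempfile.TemporaryDirectory()` path, optionally with a Korean directory name, as `bundle_dir`, so nothing is written to the real `output/`.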
@@ -1,39 +0,0 @@
# Step 3: chunk-renderer
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/1-core-runtime-contracts/step2.md
- /phases/5-markdown-rendering-assets/step2.md
## Task
Implement chunk planning and chunk Markdown bundle writing over enriched blocks.
Chunk boundaries should target 20 pages but preserve logical block integrity for paragraphs, tables, figures, and formulas.
## Sprint Contract
- Done means: chunk files with frontmatter can be written deterministically from internal document fixtures.
- Hard thresholds: block integrity is preserved at chunk boundaries; chunk frontmatter includes minimum context; quality gates run on rendered chunks.
- Files owned: `src/pdftomd/chunking.py`, `src/pdftomd/renderer.py`, tests, `PROGRESS.md`, phase index.
- Dependencies: Renderer, assets, and output bundle contracts.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm long-document chunk fixtures cover boundary behavior.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not split blocks in the middle to satisfy exact 20-page counts.
- Do not create document sidecar metadata files.
- Do not implement CLI orchestration here.
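A sketch of chunk planning that treats the page target as soft and never splits a block. `PagedBlock` and `plan_chunks` are hypothetical names; the rule shown (close the current chunk when the next block would push its page span past the target) is one possible policy, not necessarily the project's.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PagedBlock:
    block_id: str
    first_page: int
    last_page: int


def plan_chunks(blocks: Sequence[PagedBlock], target_pages: int = 20) -> List[List[PagedBlock]]:
    """Group blocks in document order; whole blocks only."""
    chunks: List[List[PagedBlock]] = []
    current: List[PagedBlock] = []
    for block in blocks:
        if current:
            span = block.last_page - current[0].first_page + 1
            if span > target_pages:
                # Adding this block would exceed the target: start a new chunk.
                chunks.append(current)
                current = []
        current.append(block)
    if current:
        chunks.append(current)
    return chunks
```

A single block longer than the target ends up in its own oversized chunk rather than being cut in the middle.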
@@ -1,31 +0,0 @@
{
"project": "PDFtoMD",
"phase": "6-cli-runtime-resume",
"steps": [
{
"step": 0,
"name": "cli-entrypoint-options",
"status": "pending"
},
{
"step": 1,
"name": "progress-logging",
"status": "pending"
},
{
"step": 2,
"name": "resume-state",
"status": "pending"
},
{
"step": 3,
"name": "device-oom-policy",
"status": "pending"
},
{
"step": 4,
"name": "model-cache-offline",
"status": "pending"
}
]
}
@@ -1,38 +0,0 @@
# Step 0: cli-entrypoint-options
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /phases/1-core-runtime-contracts/index.json
- /phases/5-markdown-rendering-assets/index.json
## Task
Implement the `python -m pdftomd` CLI entrypoint and option parsing over the existing library API.
Expose input PDF, output directory, formula parser mode, Nougat command, runtime/device, chunk size, logging, and resume options.
## Sprint Contract
- Done means: CLI options map into typed conversion options and can run against a mocked pipeline in tests.
- Hard thresholds: CLI does not duplicate conversion logic; defaults match docs; explicit `cuda` and `auto` modes are represented.
- Files owned: `src/pdftomd/__main__.py`, CLI modules/tests, `README.md` if command docs change, `PROGRESS.md`, phase index.
- Dependencies: Core contracts and renderer pipeline.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm CLI help text shows documented options.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not put parser logic inside CLI parsing code.
- Do not implement PyQt UI.
- Do not silently fall back to CPU in explicit CUDA mode.
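A minimal sketch of mapping CLI arguments into a typed options object with `argparse`. The flag names, the defaults, and the extra `cpu` choice are assumptions for illustration; only a subset of the options listed in the task is shown.

```python
import argparse
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Sequence


@dataclass(frozen=True)
class ConversionOptions:
    input_pdf: Path
    output_dir: Path
    device: str
    chunk_pages: int
    resume: bool


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pdftomd")
    parser.add_argument("input_pdf", type=Path)
    # Hypothetical flags and defaults; the documented ones take precedence.
    parser.add_argument("--output-dir", type=Path, default=Path("output"))
    parser.add_argument("--device", choices=["auto", "cuda", "cpu"], default="auto")
    parser.add_argument("--chunk-pages", type=int, default=20)
    parser.add_argument("--resume", action="store_true")
    return parser


def parse_options(argv: Optional[Sequence[str]] = None) -> ConversionOptions:
    """Parse arguments only; conversion logic stays in the library API."""
    ns = build_parser().parse_args(argv)
    return ConversionOptions(ns.input_pdf, ns.output_dir, ns.device, ns.chunk_pages, ns.resume)
```

Because `parse_options` returns plain data, a test can hand the result to a mocked pipeline without running any conversion.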
@@ -1,37 +0,0 @@
# Step 1: progress-logging
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/6-cli-runtime-resume/step0.md
## Task
Implement progress reporting and stderr/local log behavior for chunk-level conversion.
Progress should summarize chunk success/failure without writing warnings or errors into Markdown content.
## Sprint Contract
- Done means: CLI/runtime tests can observe progress events and log file output in temp locations.
- Hard thresholds: Markdown chunks remain free of warning/error logs; failure summaries include chunk ids; logs use deterministic local paths from Phase 1.
- Files owned: `src/pdftomd/runtime.py`, CLI integration/tests, `PROGRESS.md`, phase index.
- Dependencies: CLI entrypoint and output/cache contracts.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm stderr/log behavior is tested separately from Markdown output.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not write runtime logs inside generated Markdown.
- Do not require a real PDF conversion for progress unit tests.
- Do not create persistent logs outside temp dirs in tests.
@@ -1,37 +0,0 @@
# Step 2: resume-state
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/6-cli-runtime-resume/step1.md
## Task
Implement runtime resume state for successful and failed chunks.
Resume state is a runtime artifact, not a document output sidecar.
## Sprint Contract
- Done means: conversion can skip completed chunks and retry failed chunks using a local state file in tests.
- Hard thresholds: state format is deterministic; stale state is detected; resume does not skip chunks when input/options changed materially.
- Files owned: `src/pdftomd/resume.py`, runtime integration/tests, `PROGRESS.md`, phase index.
- Dependencies: Progress/logging and chunk renderer contracts.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm state files are written only under temp/runtime cache paths in tests.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not treat resume state as part of generated document output.
- Do not skip chunks after parser/version-relevant option changes.
- Do not create hidden global state.
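Stale-state detection can be sketched with a fingerprint over the input bytes and the conversion options. The JSON layout and the names `fingerprint`, `save_state`, and `chunks_to_run` are hypothetical, and the options are assumed to be JSON-serializable.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, Iterable, List


def fingerprint(pdf_bytes: bytes, options: Dict[str, object]) -> str:
    """Hash the input and options; sorted keys keep the result deterministic."""
    payload = json.dumps(options, sort_keys=True).encode("utf-8")
    return hashlib.sha256(pdf_bytes + b"\0" + payload).hexdigest()


def save_state(path: Path, fp: str, completed: Iterable[str]) -> None:
    """Write the resume state as deterministic JSON."""
    state = {"fingerprint": fp, "completed": sorted(completed)}
    path.write_text(json.dumps(state, sort_keys=True, indent=2), encoding="utf-8")


def chunks_to_run(path: Path, fp: str, all_chunks: Iterable[str]) -> List[str]:
    """Skip completed chunks only when the stored fingerprint still matches."""
    chunks = list(all_chunks)
    if not path.exists():
        return chunks
    state = json.loads(path.read_text(encoding="utf-8"))
    if state.get("fingerprint") != fp:
        return chunks  # stale state: input or options changed, redo everything
    done = set(state.get("completed", []))
    return [chunk for chunk in chunks if chunk not in done]
```

Failed chunks are simply absent from `completed`, so they are retried on the next run. The state path is passed in explicitly, which avoids hidden global state.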
@@ -1,39 +0,0 @@
# Step 3: device-oom-policy
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/TOOLCHAIN.md
- /phases/1-core-runtime-contracts/step1.md
## Task
Implement runtime device selection, CUDA fail-fast behavior, auto CPU fallback behavior, and OOM retry policy hooks.
This step should be tested with mocks and small CUDA smoke checks only where safe.
## Sprint Contract
- Done means: runtime policy enforces explicit CUDA fail-fast, auto fallback warning, and configurable OOM retry reductions.
- Hard thresholds: no silent CPU fallback for explicit CUDA; tests do not require exhausting VRAM; GTX 1070 Ti constraints remain documented.
- Files owned: `src/pdftomd/runtime.py`, tests, `docs/TOOLCHAIN.md` if behavior changes, `PROGRESS.md`, phase index.
- Dependencies: Runtime config options.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm CUDA smoke test instructions still work separately.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not intentionally trigger real GPU OOM in tests.
- Do not change PyTorch pins without updating `docs/TOOLCHAIN.md`.
- Do not hide runtime warnings.
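The device policy in this step can be sketched as a pure function that takes CUDA availability as a parameter, which keeps it testable with mocks. `select_device`, `CudaUnavailableError`, `retry_batch_sizes`, and the halving schedule are hypothetical; in real use the availability flag would come from `torch.cuda.is_available()`.

```python
import warnings
from typing import List


class CudaUnavailableError(RuntimeError):
    """Raised when CUDA is requested explicitly but cannot be used."""


def select_device(requested: str, cuda_available: bool) -> str:
    if requested == "cuda":
        if not cuda_available:
            # Explicit CUDA fails fast; there is no silent CPU fallback.
            raise CudaUnavailableError("CUDA was requested explicitly but is not available")
        return "cuda"
    if requested == "auto":
        if cuda_available:
            return "cuda"
        warnings.warn("CUDA is not available; falling back to CPU", RuntimeWarning)
        return "cpu"
    if requested == "cpu":
        return "cpu"
    raise ValueError(f"unknown device mode: {requested}")


def retry_batch_sizes(initial: int, max_retries: int = 3) -> List[int]:
    """Batch sizes to try in order, halving after each OOM (assumed schedule)."""
    sizes = [initial]
    while len(sizes) <= max_retries and sizes[-1] > 1:
        sizes.append(max(1, sizes[-1] // 2))
    return sizes
```

The retry schedule is only a hook: the caller decides what counts as an OOM and when to stop, so tests never need to exhaust VRAM.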
@@ -1,38 +0,0 @@
# Step 4: model-cache-offline
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/TOOLCHAIN.md
- /docs/ARCHITECTURE.md
- /phases/6-cli-runtime-resume/step3.md
## Task
Document and wire model cache/offline behavior for Marker, Nougat, and Hugging Face cache paths.
Add CLI/runtime hooks for environment variables or explicit cache paths without downloading models during tests.
## Sprint Contract
- Done means: users can see how to pre-download models and run offline, and runtime cache paths are configurable.
- Hard thresholds: no test performs network download; docs include Windows commands; cache path policy matches Phase 1.
- Files owned: `src/pdftomd/runtime.py`, `README.md`, `docs/TOOLCHAIN.md`, tests, `PROGRESS.md`, phase index.
- Dependencies: Device/runtime policy and cache contracts.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm offline instructions are clear and do not imply bundled weights.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not download model weights as part of tests.
- Do not commit model caches.
- Do not make online access mandatory for already-cached models.
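A sketch of a runtime hook that prepares an environment for cached, offline runs. `HF_HOME` and `HF_HUB_OFFLINE` are environment variables documented by Hugging Face Hub; whether Marker and Nougat resolve all of their weights through them is an assumption to verify. `offline_env` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, Mapping


def offline_env(base_env: Mapping[str, str], cache_dir: Path, offline: bool = True) -> Dict[str, str]:
    """Return a copy of the environment with cache and offline settings applied."""
    env = dict(base_env)
    env["HF_HOME"] = str(cache_dir)  # root of the Hugging Face cache
    if offline:
        env["HF_HUB_OFFLINE"] = "1"  # ask the hub client not to make network calls
    return env
```

Returning a new mapping instead of mutating `os.environ` keeps tests isolated, and the result can be passed as the `env` argument of `subprocess.run`.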
@@ -1,26 +0,0 @@
{
"project": "PDFtoMD",
"phase": "7-mvp-quality-hardening",
"steps": [
{
"step": 0,
"name": "sample-smoke-conversions",
"status": "pending"
},
{
"step": 1,
"name": "quality-metrics-report",
"status": "pending"
},
{
"step": 2,
"name": "regression-thresholds",
"status": "pending"
},
{
"step": 3,
"name": "mvp-fix-sweep",
"status": "pending"
}
]
}
@@ -1,38 +0,0 @@
# Step 0: sample-smoke-conversions
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/PRD.md
- /docs/CONVERSION_POLICY.md
- /phases/6-cli-runtime-resume/index.json
## Task
Create controlled sample smoke conversion tests for the MVP corpus.
The tests should exercise the end-to-end pipeline on a small selected subset or page range first, then document which full documents are suitable for manual or slower regression runs.
## Sprint Contract
- Done means: at least one text-layer sample and one mixed/scanned-risk sample can be converted in a controlled test path.
- Hard thresholds: tests have runtime bounds; sample selection comes from `samples/metadata.json`; generated output is checked with quality gates.
- Files owned: `tests/`, sample metadata updates if needed, `PROGRESS.md`, `phases/7-mvp-quality-hardening/index.json`.
- Dependencies: CLI/runtime and renderer phases complete.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Record sample coverage and any skipped slow tests in `PROGRESS.md`.
3. Update this phase index.
## Do Not
- Do not make every validation run process all long PDFs if runtime becomes impractical.
- Do not commit generated `output/` bundles.
- Do not weaken quality gates to pass broken output.
@@ -1,37 +0,0 @@
# Step 1: quality-metrics-report
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/PRD.md
- /phases/7-mvp-quality-hardening/step0.md
## Task
Add focused quality metrics for converted Markdown bundles.
Metrics should cover headings, math delimiter balance, LaTeX environment pairs, image links, captions, table parseability, chunk frontmatter, and no-exception conversion.
## Sprint Contract
- Done means: evaluator-friendly quality metrics can be run on sample outputs and produce actionable failure messages.
- Hard thresholds: metrics do not rely on full Markdown snapshots; failures identify file/chunk/block context; reports stay out of generated Markdown.
- Files owned: `src/pdftomd/quality.py`, `tests/`, optional scripts under `scripts/`, `PROGRESS.md`, phase index.
- Dependencies: Step 0 sample smoke conversions and quality gates.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm metrics can be used by `harness-review`.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not create broad snapshot baselines as the primary quality gate.
- Do not write quality reports inside Markdown chunks.
- Do not hide per-chunk failures.
@@ -1,37 +0,0 @@
# Step 2: regression-thresholds
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/PRD.md
- /phases/7-mvp-quality-hardening/step1.md
## Task
Define MVP regression thresholds for the sample corpus.
Thresholds should distinguish mandatory fast validation from slower/manual quality checks.
## Sprint Contract
- Done means: MVP pass/fail criteria are encoded in tests or documented commands and tied to sample metadata traits.
- Hard thresholds: mandatory validation remains runnable on the local machine; slow tests are opt-in; failed quality areas are not masked.
- Files owned: `tests/`, `scripts/`, sample metadata updates if needed, `PROGRESS.md`, phase index.
- Dependencies: Quality metrics report.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm slow tests are documented separately if needed.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not make local validation unusably slow.
- Do not turn all failures into warnings.
- Do not remove sample coverage for Korean paths or formulas.
@@ -1,37 +0,0 @@
# Step 3: mvp-fix-sweep
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/PRD.md
- /phases/7-mvp-quality-hardening/step2.md
## Task
Run a focused MVP stabilization pass based on failing quality metrics and sample smoke tests.
This step should fix only defects revealed by prior acceptance criteria and should avoid feature expansion.
## Sprint Contract
- Done means: MVP fast validation and selected sample smoke conversions pass with documented residual risks.
- Hard thresholds: fixes are test-backed; no new primary parser is introduced; out-of-scope UI/API/LLM features remain out of scope.
- Files owned: failing modules and tests identified by prior phase output, `PROGRESS.md`, phase index.
- Dependencies: Regression thresholds and quality reports.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Record remaining quality risks in `PROGRESS.md`.
3. Update this phase index.
## Do Not
- Do not use this as a broad refactor step.
- Do not add new major features.
- Do not bypass failed quality gates without recording a blocker.
@@ -1,26 +0,0 @@
{
"project": "PDFtoMD",
"phase": "8-release-docs-packaging",
"steps": [
{
"step": 0,
"name": "readme-usage-flow",
"status": "pending"
},
{
"step": 1,
"name": "environment-bootstrap-docs",
"status": "pending"
},
{
"step": 2,
"name": "license-checkpoint",
"status": "pending"
},
{
"step": 3,
"name": "release-checklist",
"status": "pending"
}
]
}
@@ -1,36 +0,0 @@
# Step 0: readme-usage-flow
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /README.md
- /phases/7-mvp-quality-hardening/index.json
## Task
Update README usage flow for the MVP CLI.
Document install, validation, basic conversion, formula parser modes, runtime modes, output layout, resume, and logs.
## Sprint Contract
- Done means: a user can follow README instructions to run the local CLI on Windows after environment setup.
- Hard thresholds: commands match implemented CLI; docs do not promise PyQt or hosted API as MVP; generated output contract is accurate.
- Files owned: `README.md`, docs as needed, `PROGRESS.md`, `phases/8-release-docs-packaging/index.json`.
- Dependencies: MVP quality hardening complete.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
```
## Verification
1. Run the acceptance command.
2. Confirm documented commands are copy-pasteable in PowerShell.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not document unimplemented features as available.
- Do not add marketing-style content.
- Do not include model weights in the repository.
@@ -1,38 +0,0 @@
# Step 1: environment-bootstrap-docs
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/TOOLCHAIN.md
- /requirements.txt
## Task
Document and optionally script the repo-local environment bootstrap flow.
Cover Conda Python 3.11, requirements install, CUDA smoke test, `pip check`, and Nougat help check.
## Sprint Contract
- Done means: environment setup instructions reflect the verified GTX 1070 Ti / torch 2.7.1+cu126 baseline.
- Hard thresholds: dependency pins remain consistent across README, TOOLCHAIN, and requirements; no unverified torch upgrade is introduced.
- Files owned: `README.md`, `docs/TOOLCHAIN.md`, optional `scripts/`, `PROGRESS.md`, phase index.
- Dependencies: MVP CLI docs.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pip check
.\venv\Scripts\nougat.exe --help
```
## Verification
1. Run the acceptance commands where a local environment is available.
2. Explain any skipped environment command in `PROGRESS.md`.
3. Update this phase index.
## Do Not
- Do not replace the single `venv` policy.
- Do not require `uv`.
- Do not change pins without official compatibility verification.
@@ -1,36 +0,0 @@
# Step 2: license-checkpoint
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md
## Task
Add a licensing checkpoint document or section for current personal use and future redistribution/commercial review.
This is not legal advice. It should identify Marker GPL/model-weight concerns and when to revisit them.
## Sprint Contract
- Done means: docs clearly state current personal-use context and future review triggers.
- Hard thresholds: docs do not claim legal conclusions; process/API isolation is described only as a risk mitigation candidate; model weights are not redistributed.
- Files owned: `docs/TOOLCHAIN.md`, `docs/ADR.md`, optional `README.md`, `PROGRESS.md`, phase index.
- Dependencies: Release docs context.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
```
## Verification
1. Run the acceptance command.
2. Confirm license notes are cautious and consistent.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not provide legal advice.
- Do not mark the project commercially safe without review.
- Do not vendor model weights.
@@ -1,36 +0,0 @@
# Step 3: release-checklist
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /README.md
- /docs/TOOLCHAIN.md
## Task
Create the local MVP release checklist.
Include validation, sample smoke conversion, environment checks, offline cache readiness, known limitations, and next phase entry conditions.
## Sprint Contract
- Done means: the repository has a concise checklist for deciding whether the local MVP is ready for personal use.
- Hard thresholds: checklist references real commands; known limitations are explicit; PyQt phase remains separate.
- Files owned: `README.md`, optional `docs/RELEASE_CHECKLIST.md`, `PROGRESS.md`, phase index.
- Dependencies: Prior release docs steps.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
```
## Verification
1. Run the acceptance command.
2. Confirm the checklist can be followed by a fresh agent or user.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not claim production readiness.
- Do not include hosted API release work.
- Do not start PyQt implementation.
@@ -1,26 +0,0 @@
{
"project": "PDFtoMD",
"phase": "9-pyqt-thin-client",
"steps": [
{
"step": 0,
"name": "ui-api-contract",
"status": "pending"
},
{
"step": 1,
"name": "pyqt-shell",
"status": "pending"
},
{
"step": 2,
"name": "ui-progress-resume",
"status": "pending"
},
{
"step": 3,
"name": "ui-packaging-notes",
"status": "pending"
}
]
}

Some files were not shown because too many files have changed in this diff.