add files

김경종
2026-04-30 17:05:19 +09:00
parent f3e01b5a8c
commit 7e985ae94a
135 changed files with 41205 additions and 0 deletions
+30
@@ -0,0 +1,30 @@
{
  "project": "PDFtoMD",
  "phase": "0-harness-foundation",
  "steps": [
    {
      "step": 0,
      "name": "sample-metadata-contract",
      "status": "completed",
      "summary": "Created deterministic samples/metadata.json and metadata contract tests for current sample PDFs."
    },
    {
      "step": 1,
      "name": "core-package-skeleton",
      "status": "completed",
      "summary": "Created importable pdftomd package skeleton, pyproject metadata, and typed core models."
    },
    {
      "step": 2,
      "name": "page-preanalysis-contract",
      "status": "completed",
      "summary": "Added PyMuPDF-only PDF preanalysis with page facts, OCR candidates, and 20-page chunk ranges."
    },
    {
      "step": 3,
      "name": "markdown-quality-gates",
      "status": "completed",
      "summary": "Added focused Markdown quality gates for math, LaTeX, tables, image links, frontmatter, and anchors."
    }
  ]
}
+63
@@ -0,0 +1,63 @@
# Step 0: sample-metadata-contract
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/PRD.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md
## Task
Create the first sample corpus metadata contract without implementing the conversion engine.
The metadata must classify every PDF currently under `samples/` by traits that future regression tests can use:
- text layer quality
- scanned or mixed scanned/text pages
- multi-column or complex layout risk
- formula density
- table density
- figure density
- Korean filename/path coverage
- target regression focus
Use deterministic JSON so future agents can update it with minimal diff noise.
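One way to read "deterministic JSON" is sketched below: fixed entry order, fixed key order, and unescaped non-ASCII text. The top-level list layout and the `path` and `traits` field names are assumptions for illustration, not the schema this step actually settled on.

```python
import json


def dump_metadata(entries: list[dict]) -> str:
    """Serialize sample metadata so repeated runs produce identical text."""
    # Sort entries by relative path so adding one sample changes one region of the diff.
    ordered = sorted(entries, key=lambda entry: entry["path"])
    # sort_keys fixes key order; ensure_ascii=False keeps Korean filenames readable.
    text = json.dumps(ordered, ensure_ascii=False, indent=2, sort_keys=True)
    return text + "\n"


if __name__ == "__main__":
    # Hypothetical entries; the real trait names are up to the step author.
    samples = [
        {"path": "samples/한글-논문.pdf", "traits": {"korean_path": True}},
        {"path": "samples/paper.pdf", "traits": {"korean_path": False}},
    ]
    print(dump_metadata(samples), end="")
```

Writing the result with `encoding="utf-8"` and a trailing newline keeps the file stable across editors and platforms.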
## Sprint Contract
- Done means: `samples/metadata.json` exists, includes every current PDF by exact relative path, and has enough structured fields for future tests to select OCR, layout, formula, table, figure, and Korean-path cases.
- Hard thresholds:
  - Every current `samples/*.pdf` appears exactly once.
  - Metadata is valid UTF-8 JSON.
  - Tests fail if a sample PDF is added without metadata.
  - Tests fail if duplicate sample paths exist in the metadata.
  - No conversion engine code is introduced in this step.
- Files owned:
  - `samples/metadata.json`
  - `tests/test_sample_metadata.py`
  - `PROGRESS.md`
  - `phases/0-harness-foundation/index.json`
- Dependencies:
  - Existing sample PDFs under `samples/`
  - PyMuPDF may be used only for lightweight page count/text/image inspection if needed.
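The "exactly once" thresholds above could be checked roughly as follows. This is a sketch, not the contents of `tests/test_sample_metadata.py`; it assumes the metadata file is a top-level list of objects with a hypothetical `path` field holding paths like `samples/name.pdf`.

```python
import json
from collections import Counter
from pathlib import Path


def check_sample_coverage(samples_dir: Path, metadata_path: Path) -> list[str]:
    """Return readable problems; an empty list means the contract holds."""
    # Decode explicitly as UTF-8 so a wrongly encoded file fails loudly.
    entries = json.loads(metadata_path.read_text(encoding="utf-8"))
    listed = Counter(entry["path"] for entry in entries)
    on_disk = {f"{samples_dir.name}/{pdf.name}" for pdf in samples_dir.glob("*.pdf")}

    problems = []
    for path in sorted(on_disk - set(listed)):
        problems.append(f"missing metadata for {path}")
    for path in sorted(set(listed) - on_disk):
        problems.append(f"metadata entry has no PDF: {path}")
    for path, count in sorted(listed.items()):
        if count > 1:
            problems.append(f"duplicate metadata entry ({count}x): {path}")
    return problems
```

A pytest case can then assert `check_sample_coverage(...) == []`, so a newly added PDF without metadata fails with the offending path in the message.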
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_sample_metadata.py
```
## Verification
1. Run the acceptance commands.
2. Confirm `samples/metadata.json` paths match `samples/*.pdf`.
3. Confirm Korean filenames remain readable in JSON.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
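The index update in the last verification item could be scripted along these lines. This is a sketch under assumptions: the `update_step` helper is hypothetical, and storing the blocked/error reason in the same `summary` field is a guess, since the index shown in this commit only contains completed steps.

```python
import json
from pathlib import Path

ALLOWED_STATUSES = {"completed", "blocked", "error"}


def update_step(index_path: Path, step: int, status: str, note: str) -> None:
    """Set one step's status and its one-line summary or reason."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if not note.strip() or "\n" in note:
        raise ValueError("note must be a non-empty single line")

    index = json.loads(index_path.read_text(encoding="utf-8"))
    for entry in index["steps"]:
        if entry["step"] == step:
            entry["status"] = status
            # Assumption: the reason for blocked/error reuses the summary field.
            entry["summary"] = note
            break
    else:
        raise KeyError(f"step {step} not found in {index_path}")

    # Key order is preserved, so the rewritten file diffs cleanly.
    text = json.dumps(index, ensure_ascii=False, indent=2) + "\n"
    index_path.write_text(text, encoding="utf-8")
```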
## Do Not
- Do not create `src/` or conversion engine modules in this step.
- Do not rename, delete, compress, or rewrite sample PDFs.
- Do not add sidecar output files for converted documents.
- Do not add a new custom agent.
+59
@@ -0,0 +1,59 @@
# Step 1: core-package-skeleton
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /phases/0-harness-foundation/step0.md
- /phases/0-harness-foundation/index.json
## Task
Create the minimal Python package skeleton and internal data contracts needed by later parser, pre-analysis, and renderer steps.
The skeleton should establish importable modules and typed models only. It should not call Marker, Nougat, PyMuPDF, OCR, CUDA, or the filesystem-heavy conversion path yet.
Suggested module boundary:
- `src/pdftomd/__init__.py`
- `src/pdftomd/models.py`
- `tests/test_models.py`
The exact type names may differ if the local design suggests better names, but the contracts must represent document identity, page ranges, block roles, bounding boxes, assets, formulas, tables, figures, and chunk metadata.
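As an illustration of what such contracts might look like, here is a minimal sketch using only the standard library. The class and field names are hypothetical, and the 1-based inclusive page convention is an assumption rather than something the step file specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRange:
    """Inclusive page range; 1-based numbering is an assumed convention."""

    start: int
    end: int

    def __post_init__(self) -> None:
        # Enforce the invariants at construction so invalid ranges never circulate.
        if self.start < 1:
            raise ValueError(f"start must be >= 1, got {self.start}")
        if self.end < self.start:
            raise ValueError(f"end {self.end} precedes start {self.start}")

    @property
    def page_count(self) -> int:
        return self.end - self.start + 1


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in page coordinates."""

    x0: float
    y0: float
    x1: float
    y1: float

    def __post_init__(self) -> None:
        if self.x1 < self.x0 or self.y1 < self.y0:
            raise ValueError("bounding box corners are inverted")


@dataclass(frozen=True)
class ChunkMetadata:
    """Identity of one output chunk; field names are illustrative."""

    document_id: str
    index: int
    pages: PageRange
```

Frozen dataclasses keep the models hashable and free of third-party dependencies, which matches the requirement that they not import any parser or UI library.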
## Sprint Contract
- Done means: future steps have stable importable types for page analysis, block modeling, chunk metadata, and output assets.
- Hard thresholds:
  - Tests cover model construction, deterministic slug/path-relevant fields, and page range invariants.
  - Models do not depend on Marker, Nougat, PyMuPDF, torch, pandas, or PyQt.
  - The package imports on Windows with `.\venv\python.exe`.
  - Public contracts are documented by tests or clear docstrings.
- Files owned:
  - `src/pdftomd/`
  - `tests/test_models.py`
  - `PROGRESS.md`
  - `phases/0-harness-foundation/index.json`
- Dependencies:
  - Step 0 metadata should be complete or explicitly blocked.
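The "no heavy dependencies" threshold can be checked statically, without importing anything. The sketch below scans module source with `ast`; the import names in `FORBIDDEN` are assumptions about how each library is imported (for example `fitz` or `pymupdf` for PyMuPDF), so adjust them to the actual environment.

```python
import ast

# Assumed top-level import names for the libraries the models must avoid.
FORBIDDEN = {"marker", "nougat", "fitz", "pymupdf", "torch", "pandas", "PyQt5", "PyQt6"}


def forbidden_imports(source: str) -> set[str]:
    """Return the forbidden top-level packages imported by a module's source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x import y"; relative imports stay inside the package.
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in FORBIDDEN:
                found.add(root)
    return found
```

A test could read `src/pdftomd/models.py` as text and assert that `forbidden_imports(...)` is empty. Because `ast.walk` visits the whole tree, imports nested inside functions are caught as well.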
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_models.py
```
## Verification
1. Run the acceptance commands.
2. Confirm package imports with `.\venv\python.exe -c "import pdftomd; print(pdftomd.__name__)"`.
3. Confirm no heavy parser/model imports are introduced.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
## Do Not
- Do not implement actual PDF parsing.
- Do not run Marker or Nougat.
- Do not add CLI commands.
- Do not add PyQt UI code.
- Do not widen the output contract beyond `docs/ARCHITECTURE.md`.
+63
@@ -0,0 +1,63 @@
# Step 2: page-preanalysis-contract
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /docs/TOOLCHAIN.md
- /phases/0-harness-foundation/step0.md
- /phases/0-harness-foundation/step1.md
- /phases/0-harness-foundation/index.json
## Task
Implement the lightweight page pre-analysis contract that decides what later conversion steps need to know before Marker runs.
This step should use PyMuPDF only for fast document/page inspection:
- page count
- text length or text density per page
- image count per page
- OCR candidate flag per page
- basic long-document chunk candidates using the 20-page target
The output should be typed using the models from Step 1.
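The inspection described above might be sketched as follows. The `PageFacts` type, the function names, and the 50-character OCR threshold are all assumptions for illustration, not the contents of `src/pdftomd/preanalysis.py`. PyMuPDF is imported lazily under its `fitz` name so the pure helpers stay testable without it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageFacts:
    number: int          # 1-based page number
    text_chars: int      # length of the stripped extracted text
    image_count: int
    ocr_candidate: bool


def is_ocr_candidate(text_chars: int, image_count: int, min_chars: int = 50) -> bool:
    """Flag pages that carry images but almost no extractable text."""
    # The threshold is an illustrative guess; tests should pin whatever value is chosen.
    return image_count > 0 and text_chars < min_chars


def chunk_ranges(page_count: int, target: int = 20) -> list[tuple[int, int]]:
    """Split pages 1..page_count into inclusive ranges of at most `target` pages."""
    if page_count < 0 or target < 1:
        raise ValueError("page_count must be >= 0 and target >= 1")
    return [
        (start, min(start + target - 1, page_count))
        for start in range(1, page_count + 1, target)
    ]


def analyze_pdf(path: str) -> list[PageFacts]:
    """Collect per-page facts with PyMuPDF only; no OCR or GPU code runs."""
    import fitz  # PyMuPDF, imported lazily

    facts = []
    with fitz.open(path) as doc:
        for index, page in enumerate(doc):
            text_chars = len(page.get_text().strip())
            image_count = len(page.get_images(full=True))
            facts.append(
                PageFacts(index + 1, text_chars, image_count,
                          is_ocr_candidate(text_chars, image_count))
            )
    return facts
```

Clamping each range end to `page_count` is what guarantees that chunk candidates never exceed the document length.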
## Sprint Contract
- Done means: given a PDF path, the pre-analysis API returns deterministic page-level facts and chunk candidates without running Marker, Nougat, OCR, or GPU code.
- Hard thresholds:
  - Tests cover at least one text-heavy sample and one mixed/scanned-risk sample from `samples/metadata.json`.
  - Tests cover Korean path handling through `pathlib`.
  - OCR candidate logic is deterministic and documented by tests.
  - Chunk candidates never exceed the document page count.
  - No conversion or Markdown rendering is implemented in this step.
- Files owned:
  - `src/pdftomd/preanalysis.py`
  - model additions in `src/pdftomd/models.py` only if required
  - `tests/test_preanalysis.py`
  - `PROGRESS.md`
  - `phases/0-harness-foundation/index.json`
- Dependencies:
  - Step 0 sample metadata
  - Step 1 package skeleton and models
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_preanalysis.py
```
## Verification
1. Run the acceptance commands.
2. Confirm PyMuPDF is the only PDF inspection dependency used in this step.
3. Confirm the sample metadata traits and test expectations are consistent.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
## Do Not
- Do not call Marker, Nougat, Surya, torch, or OCR.
- Do not write conversion output under `output/`.
- Do not create resume cache or runtime state files.
- Do not implement reading-order reconstruction in this step.
+63
@@ -0,0 +1,63 @@
# Step 3: markdown-quality-gates
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /docs/ADR.md
- /phases/0-harness-foundation/step0.md
- /phases/0-harness-foundation/step1.md
- /phases/0-harness-foundation/step2.md
- /phases/0-harness-foundation/index.json
## Task
Create focused Markdown quality gate functions that later renderer and conversion steps can call.
This step should validate generated Markdown-like strings and asset references without requiring a full PDF conversion. It should prefer structured checks over full snapshot comparison.
Quality gates should cover:
- math delimiter balance for `$...$` and `$$...$$`
- LaTeX `\begin{...}` / `\end{...}` pairs
- image link path existence or modeled asset reference existence
- table parseability for simple Markdown tables
- chunk frontmatter fields required by the output contract
- caption/reference anchor shape where confidence is sufficient
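The first two gates in the list above could be sketched like this. The function names are hypothetical, and the checks are deliberately simple: they skip escaped dollars but do not exclude code spans or fenced blocks, which a fuller gate would need to handle.

```python
import re

ENV_TOKEN = re.compile(r"\\(begin|end)\{([^{}]+)\}")


def check_math_delimiters(markdown: str) -> list[str]:
    """Report unbalanced $...$ and $$...$$ delimiters without modifying the text."""
    problems = []
    # Drop escaped dollars (\$) so literal currency signs are not counted.
    text = re.sub(r"\\\$", "", markdown)
    if text.count("$$") % 2:
        problems.append("unbalanced $$ display-math delimiter")
    if text.replace("$$", "").count("$") % 2:
        problems.append("unbalanced $ inline-math delimiter")
    return problems


def check_latex_environments(markdown: str) -> list[str]:
    """Report \\begin{...} / \\end{...} tokens that do not nest properly."""
    problems = []
    stack = []
    for match in ENV_TOKEN.finditer(markdown):
        kind, name = match.groups()
        if kind == "begin":
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            expected = stack[-1] if stack else "nothing open"
            problems.append(
                f"\\end{{{name}}} at offset {match.start()} does not match ({expected})"
            )
    problems.extend(f"\\begin{{{name}}} is never closed" for name in stack)
    return problems
```

Both functions return a list of messages rather than raising or rewriting the input, so a caller can aggregate failures and the Markdown is never changed silently.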
## Sprint Contract
- Done means: later renderer steps have reusable validation functions and focused pytest coverage for Markdown output risks.
- Hard thresholds:
  - Tests include passing and failing examples for math delimiter checks.
  - Tests include a complex table case where Markdown limitations are represented as an allowed HTML/fallback decision.
  - Tests do not rely on full Markdown snapshot equality.
  - Validation functions do not mutate generated Markdown silently unless an explicit repair function is named and tested.
  - No PDF parsing or renderer implementation is introduced here.
- Files owned:
  - `src/pdftomd/quality.py`
  - model additions in `src/pdftomd/models.py` only if required
  - `tests/test_quality.py`
  - `PROGRESS.md`
  - `phases/0-harness-foundation/index.json`
- Dependencies:
  - Step 1 package skeleton and models
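The complex-table threshold suggests modeling the outcome as a decision rather than a pass/fail flag. The sketch below is one possible shape; `TableDecision` and `classify_table` are hypothetical names. It treats a block wrapped in `<table>...</table>` as an accepted fallback and requires pipe tables to have a delimiter row and a consistent column count, which is stricter than what most Markdown renderers enforce.

```python
import re
from enum import Enum


class TableDecision(Enum):
    MARKDOWN = "markdown"            # parseable as a simple pipe table
    HTML_FALLBACK = "html-fallback"  # complex table kept as HTML on purpose
    INVALID = "invalid"


def _cells(row: str) -> list[str]:
    # Simplification: escaped pipes (\|) inside cells are not handled.
    return [cell.strip() for cell in row.strip().strip("|").split("|")]


def classify_table(block: str) -> TableDecision:
    """Classify a table block without modifying it."""
    text = block.strip()
    lowered = text.lower()
    if lowered.startswith("<table") and lowered.endswith("</table>"):
        return TableDecision.HTML_FALLBACK

    rows = [line for line in text.splitlines() if line.strip()]
    if len(rows) < 2:
        return TableDecision.INVALID
    header, delimiter = _cells(rows[0]), _cells(rows[1])
    if len(delimiter) != len(header):
        return TableDecision.INVALID
    if not all(re.fullmatch(r":?-+:?", cell) for cell in delimiter):
        return TableDecision.INVALID
    if any(len(_cells(row)) != len(header) for row in rows[2:]):
        return TableDecision.INVALID
    return TableDecision.MARKDOWN
```

A test for the complex-table case can then assert that a table with merged cells yields `TableDecision.HTML_FALLBACK` instead of comparing rendered output to a snapshot.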
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests\test_quality.py
```
## Verification
1. Run the acceptance commands.
2. Confirm quality gates are focused assertions, not whole-document snapshots.
3. Confirm failures return actionable messages for evaluator use.
4. Update `PROGRESS.md` with completed work, validation output, and next handoff.
5. Update this phase index step to `completed` with a one-line `summary`, or to `blocked`/`error` with a concrete reason.
## Do Not
- Do not implement Marker/Nougat adapters.
- Do not implement the full Markdown renderer.
- Do not introduce an LLM correction path.
- Do not write warning/error messages into generated Markdown content.