add files

김경종
2026-04-30 17:05:19 +09:00
parent f3e01b5a8c
commit 7e985ae94a
135 changed files with 41205 additions and 0 deletions
@@ -0,0 +1,26 @@
{
"project": "PDFtoMD",
"phase": "4-semantic-enrichment",
"steps": [
{
"step": 0,
"name": "reading-order-checks",
"status": "pending"
},
{
"step": 1,
"name": "paragraph-stitching",
"status": "pending"
},
{
"step": 2,
"name": "header-footer-filtering",
"status": "pending"
},
{
"step": 3,
"name": "reference-indexing",
"status": "pending"
}
]
}
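A minimal sketch of how a harness script might read this phase index to pick the next step. The function name `next_pending_step` is hypothetical and not part of the commit. Only the `"pending"` status value appears in the file, so no other status values are assumed.

```python
from __future__ import annotations

import json


def next_pending_step(index: dict) -> dict | None:
    """Return the lowest-numbered step whose status is "pending", or None."""
    pending = [step for step in index["steps"] if step["status"] == "pending"]
    return min(pending, key=lambda step: step["step"]) if pending else None


# Abbreviated copy of the phase index above, inlined so the sketch is self-contained.
PHASE_INDEX = json.loads("""
{
  "project": "PDFtoMD",
  "phase": "4-semantic-enrichment",
  "steps": [
    {"step": 0, "name": "reading-order-checks", "status": "pending"},
    {"step": 1, "name": "paragraph-stitching", "status": "pending"}
  ]
}
""")
```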
+38
@@ -0,0 +1,38 @@
# Step 0: reading-order-checks
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/2-marker-adapter/step2.md
## Task
Create reading-order verification helpers over normalized blocks.
Use page numbers and bounding boxes to detect obvious ordering anomalies in multi-column or inserted-text layouts.
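As an illustration only, the check could be sketched as below. `Block` and `OrderDiagnostic` are hypothetical stand-ins for the Phase 2 normalized block model, which this step file does not show. The sketch also assumes bounding boxes are `(x0, y0, x1, y1)` with y growing downward. The two heuristics (an upward jump within one column, and a jump back to a column further left) are assumptions about what counts as an "obvious" anomaly.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for the Phase 2 normalized block model."""
    block_id: str
    page: int
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


@dataclass(frozen=True)
class OrderDiagnostic:
    code: str
    page: int
    block_id: str
    previous_block_id: str


def check_reading_order(blocks: list[Block], tolerance: float = 2.0) -> list[OrderDiagnostic]:
    """Compare each block with its predecessor and report anomalies.

    The input list is only read, never reordered.
    """
    diagnostics: list[OrderDiagnostic] = []
    for prev, cur in zip(blocks, blocks[1:]):
        if cur.page < prev.page:
            diagnostics.append(
                OrderDiagnostic("page-regression", cur.page, cur.block_id, prev.block_id)
            )
            continue
        if cur.page != prev.page:
            continue
        # Positive overlap means the two blocks share a column.
        overlap = min(prev.bbox[2], cur.bbox[2]) - max(prev.bbox[0], cur.bbox[0])
        if overlap > 0 and cur.bbox[1] + tolerance < prev.bbox[1]:
            code = "upward-jump"  # later block sits above the earlier one
        elif overlap <= 0 and cur.bbox[2] <= prev.bbox[0]:
            code = "column-regression"  # jumped back to a column on the left
        else:
            continue
        diagnostics.append(OrderDiagnostic(code, cur.page, cur.block_id, prev.block_id))
    return diagnostics
```

Each diagnostic carries the page and both block ids, so a later step can decide what to do with it. The helper itself changes nothing.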
## Sprint Contract
- Done means: reading-order checks produce diagnostics that later enrichment and evaluator steps can use.
- Hard thresholds: checks are deterministic; tests include a multi-column-like fixture; helpers do not reorder content silently.
- Files owned: `src/pdftomd/enrichment.py`, tests, `PROGRESS.md`, `phases/4-semantic-enrichment/index.json`.
- Dependencies: Phase 2 normalized block model.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm diagnostics are actionable and tied to page/block ids.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not override Marker ordering without tests.
- Do not render Markdown.
- Do not call Marker or Nougat.
+37
@@ -0,0 +1,37 @@
# Step 1: paragraph-stitching
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/4-semantic-enrichment/step0.md
## Task
Implement paragraph stitching for line-fragmented PDF text blocks.
Handle continuation lines and hyphenated line breaks. When confidence in a hyphen join is low, keep the hyphen, since the token may be a compound word or an identifier.
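The text-only fallback for the hyphen rule might look like the sketch below. The function name `stitch_lines` and the confidence rule are assumptions: the hyphen is dropped only when a lowercase ASCII word fragment is followed by a lowercase ASCII letter. It does not consult bounding boxes, which should take priority when available. It also cannot tell a true compound such as "well-known" from a wrapped word, so a real implementation would need a stronger signal for those.

```python
from __future__ import annotations


def stitch_lines(lines: list[str]) -> str:
    """Join the line fragments of one paragraph into a single string."""
    out = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not out:
            out = line
        elif out.endswith("-"):
            # Fragment immediately before the trailing hyphen.
            fragment = out.split()[-1][:-1]
            confident = (
                len(fragment) >= 2
                and fragment.isascii()
                and fragment.isalpha()
                and fragment.islower()
                and line[0].isascii()
                and line[0].islower()
            )
            # Low confidence: keep the hyphen (compound word or identifier).
            out = (out[:-1] if confident else out) + line
        else:
            out += " " + line
    return out
```

For example, `["The conver-", "sion pipeline"]` joins to "The conversion pipeline", while `["Run the PDF-", "to-MD tool"]` keeps its hyphen because the fragment before it is not lowercase.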
## Sprint Contract
- Done means: paragraph stitching turns line fragments into coherent paragraph blocks with focused tests.
- Hard thresholds: hyphen joins are tested; low-confidence hyphen cases are preserved; list items and headings are not merged into paragraphs.
- Files owned: `src/pdftomd/enrichment.py`, tests, `PROGRESS.md`, phase index.
- Dependencies: Step 0 checks and normalized block model.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm Korean and English text fixtures remain stable.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not rely only on punctuation rules when bounding-box hints exist.
- Do not merge across tables, figures, or formulas.
- Do not modify source PDF files.
+37
@@ -0,0 +1,37 @@
# Step 2: header-footer-filtering
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/4-semantic-enrichment/step1.md
## Task
Detect repeated page headers, footers, and page numbers and separate them from the main Markdown body flow.
The implementation should mark or remove repetitive boilerplate according to policy while keeping enough diagnostics for review.
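One way to sketch the detection, under stated assumptions: `PageBlock` is a hypothetical block shape, and the 10% page margin and the three-page repetition threshold are illustrative values, not policy. Digit runs are normalized so that "Page 1" and "Page 2" count as the same pattern. The function only returns reasons keyed by block id. Whether to mark or remove is left to the caller.

```python
from __future__ import annotations

import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PageBlock:
    """Hypothetical block shape; y grows downward from the top of the page."""
    block_id: str
    page: int
    text: str
    y0: float
    y1: float
    page_height: float


def _region(block: PageBlock, margin: float = 0.1) -> str | None:
    if block.y1 <= block.page_height * margin:
        return "top"
    if block.y0 >= block.page_height * (1 - margin):
        return "bottom"
    return None  # body region: never a header/footer candidate


def _pattern(text: str) -> str:
    # Collapse whitespace and replace digit runs so page numbers compare equal.
    return re.sub(r"\d+", "#", " ".join(text.split())).casefold()


def find_repeated_boilerplate(blocks: list[PageBlock], min_pages: int = 3) -> dict[str, str]:
    """Map block_id -> reason for every top/bottom block repeated on enough pages."""
    pages_by_key: dict[tuple[str, str], set[int]] = defaultdict(set)
    for block in blocks:
        region = _region(block)
        if region is not None:
            pages_by_key[(region, _pattern(block.text))].add(block.page)

    flagged: dict[str, str] = {}
    for block in blocks:  # input order keeps the result deterministic
        region = _region(block)
        if region is None:
            continue
        pages = pages_by_key[(region, _pattern(block.text))]
        if len(pages) >= min_pages:
            flagged[block.block_id] = f"{region} text repeated on {len(pages)} pages"
    return flagged
```

Because body-region blocks are skipped before any counting, a sentence that happens to repeat in the main text is never flagged by this sketch.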
## Sprint Contract
- Done means: repeated top/bottom page-region text can be identified and excluded from main content in tests.
- Hard thresholds: unique body text is not removed; page number patterns are tested; removal decisions are deterministic.
- Files owned: `src/pdftomd/enrichment.py`, tests, `PROGRESS.md`, phase index.
- Dependencies: Step 1 paragraph stitching and the normalized block model.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm false-positive protections are tested.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not delete content without a confidence rule.
- Do not write filtered text into sidecar document outputs.
- Do not implement CLI reporting here.
+38
@@ -0,0 +1,38 @@
# Step 3: reference-indexing
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/3-formula-pipeline/step3.md
- /phases/4-semantic-enrichment/step2.md
## Task
Build a reference index for figures, tables, formulas, captions, and body references.
The index should support later Markdown rendering by providing stable anchors and high-confidence link targets.
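A rough sketch of such an index follows. All names (`parse_label`, `build_index`, `resolve`) are hypothetical, as are the anchor format (`figure-1`, `figure-1-2`) and the label pattern, which recognizes English labels only. Duplicate labels get a numeric suffix in document order, and a reference is resolved only when exactly one target exists. Otherwise it stays plain with a reason.

```python
from __future__ import annotations

import re

_LABEL = re.compile(r"\b(Figure|Fig\.|Table|Equation|Eq\.)\s*\(?(\d+)\)?", re.IGNORECASE)
_KINDS = {
    "figure": "figure", "fig.": "figure",
    "table": "table",
    "equation": "formula", "eq.": "formula",
}


def parse_label(text: str) -> tuple[str, int] | None:
    """Extract (kind, number) from the first recognizable label in the text."""
    match = _LABEL.search(text)
    if match is None:
        return None
    return _KINDS[match.group(1).lower()], int(match.group(2))


def build_index(captions: list[tuple[str, str]]) -> dict[tuple[str, int], list[tuple[str, str]]]:
    """captions: (block_id, caption_text) in document order.

    Returns label -> [(anchor, block_id), ...]; duplicates get a suffix.
    """
    index: dict[tuple[str, int], list[tuple[str, str]]] = {}
    for block_id, text in captions:
        label = parse_label(text)
        if label is None:
            continue
        targets = index.setdefault(label, [])
        base = f"{label[0]}-{label[1]}"
        anchor = base if not targets else f"{base}-{len(targets) + 1}"
        targets.append((anchor, block_id))
    return index


def resolve(reference_text: str, index: dict) -> tuple[str | None, str]:
    """Return (anchor or None, reason). Only unambiguous references resolve."""
    label = parse_label(reference_text)
    if label is None:
        return None, "no recognizable label"
    targets = index.get(label, [])
    if not targets:
        return None, "missing target"
    if len(targets) > 1:
        return None, "ambiguous: duplicate label"
    return targets[0][0], "resolved"
```

Anchors depend only on the label and its position in document order, so the same input always yields the same anchors.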
## Sprint Contract
- Done means: table, figure, and formula references can be resolved or left plain with reasons.
- Hard thresholds: anchors are deterministic; duplicate labels are handled; missing targets do not produce broken links.
- Files owned: `src/pdftomd/enrichment.py`, reference models/tests, `PROGRESS.md`, phase index.
- Dependencies: Formula links and semantic block enrichment.
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm figure/table/formula reference fixtures are covered.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not rewrite ambiguous references.
- Do not make anchors depend on nondeterministic ordering.
- Do not render final Markdown here.