add files

김경종
2026-04-30 17:05:19 +09:00
parent f3e01b5a8c
commit 7e985ae94a
135 changed files with 41205 additions and 0 deletions
@@ -0,0 +1,37 @@
# Step 2: header-footer-filtering
## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/4-semantic-enrichment/step1.md
## Task
Detect repeated page headers, footers, and page numbers and separate them from the main Markdown body flow.
The implementation should mark or remove repetitive boilerplate according to policy while keeping enough diagnostics for review.
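One way to approach the detection is sketched below. It is a minimal illustration, not the project's implementation: the `Line` record, its `region` field, the function names, and the `min_page_ratio` default of 0.6 are all assumptions, and the real paragraph and block model from earlier steps may look different. Text counts as boilerplate only when the same normalized string appears in the same top or bottom region on enough pages; body-region text is never a candidate.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    """Hypothetical stand-in for the project's block model."""
    page: int
    text: str
    region: str  # "top", "body", or "bottom"


def normalize(text):
    # Collapse whitespace and case so trivial layout differences still match.
    return " ".join(text.split()).lower()


def find_repeated_boilerplate(lines, min_page_ratio=0.6):
    """Return (region, normalized text) keys that repeat across enough pages."""
    pages = {ln.page for ln in lines}
    if len(pages) < 2:
        return set()  # repetition cannot be observed on a single page
    counted = set()
    seen = Counter()
    for ln in lines:
        if ln.region == "body":
            continue  # body text is never a boilerplate candidate
        key = (ln.region, normalize(ln.text))
        if (ln.page, key) not in counted:  # count each key once per page
            counted.add((ln.page, key))
            seen[key] += 1
    threshold = max(2, math.ceil(min_page_ratio * len(pages)))
    return {key for key, n in seen.items() if n >= threshold}


def split_body_and_boilerplate(lines, min_page_ratio=0.6):
    """Split lines into (body, filtered), preserving input order."""
    repeated = find_repeated_boilerplate(lines, min_page_ratio)
    body, filtered = [], []
    for ln in lines:
        key = (ln.region, normalize(ln.text))
        (filtered if key in repeated else body).append(ln)
    return body, filtered
```

Returning the filtered lines instead of discarding them keeps the removed text available as diagnostics for review. The result depends only on the input order and the threshold, so the removal decisions stay deterministic.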
## Sprint Contract
- Done means: repeated top/bottom page-region text can be identified and excluded from main content in tests.
- Hard thresholds: unique body text is not removed; page number patterns are tested; removal decisions are deterministic.
- Files owned: `src/pdftomd/enrichment.py`, tests, `PROGRESS.md`, phase index.
- Dependencies: Paragraph and block model from earlier steps.
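The page-number threshold could be exercised with a small pattern table like the one below. The patterns and the `is_page_number` helper are illustrative assumptions, not a list taken from the conversion policy; the real set should follow whatever formats the source PDFs use.

```python
import re

# Hypothetical pattern set; extend it to match the formats seen in real documents.
PAGE_NUMBER_PATTERNS = (
    re.compile(r"^\d{1,4}$"),                                         # "12"
    re.compile(r"^-\s*\d{1,4}\s*-$"),                                 # "- 12 -"
    re.compile(r"^page\s+\d{1,4}(\s+of\s+\d{1,4})?$", re.IGNORECASE),  # "Page 3 of 10"
    re.compile(r"^\d{1,4}\s*/\s*\d{1,4}$"),                           # "3 / 10"
)


def is_page_number(text):
    """Return True when the whole line looks like a page number."""
    candidate = text.strip()
    return any(p.match(candidate) for p in PAGE_NUMBER_PATTERNS)
```

Every pattern is anchored at both ends, so a body sentence that merely contains a number does not match. A bare number can still be legitimate content, such as a table cell, so a check like this is probably best applied only to lines in the top or bottom page region.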
## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```
## Verification
1. Run the acceptance commands.
2. Confirm that tests cover the false-positive protections, in particular that unique body text is never removed.
3. Update `PROGRESS.md` and this phase index.
## Do Not
- Do not delete content without a confidence rule.
- Do not write filtered text into sidecar document outputs.
- Do not implement CLI reporting here.
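
A confidence rule for the first item above could be as simple as the following sketch. The `FilterDecision` record, the `decide` function, and the 0.6 cutoff are hypothetical names and values chosen for illustration; the actual rule and threshold belong to the conversion policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterDecision:
    """Hypothetical diagnostic record kept for review."""
    text: str
    pages_seen: int
    total_pages: int
    confidence: float
    action: str  # "remove" or "keep"


def decide(text, pages_seen, total_pages, min_confidence=0.6):
    """Remove text only when it recurs on a large enough share of pages."""
    if total_pages < 2:
        confidence = 0.0  # a single page gives no evidence of repetition
    else:
        confidence = pages_seen / total_pages
    action = "remove" if confidence >= min_confidence else "keep"
    return FilterDecision(text, pages_seen, total_pages, confidence, action)
```

Because every decision carries the evidence behind it, a reviewer can audit removals without the filtered text being written into the document outputs.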