PDFToMD/phases/4-semantic-enrichment/step2.md

# Step 2: header-footer-filtering

## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/4-semantic-enrichment/step1.md

## Task
Detect repeated page headers, footers, and page numbers and separate them from the main Markdown body flow.

The implementation should mark or remove repeated boilerplate according to the conversion policy, and keep enough diagnostics (what was filtered, on which page, and why) for review.
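
As a rough illustration of the task, the sketch below separates repeated edge-of-page lines from body text and records a diagnostic for each one. It is only a sketch under stated assumptions: pages are modeled as plain lists of text lines rather than the project's actual block model, and the names `split_boilerplate`, `Filtered`, `edge_lines`, `min_ratio`, and `min_pages` are hypothetical, not taken from `src/pdftomd/enrichment.py`.

```python
import re
from collections import Counter
from dataclasses import dataclass

_DIGITS = re.compile(r"\d+")


def normalize(line: str) -> str:
    """Collapse whitespace, lowercase, and mask digit runs so that
    'Page 3' and 'Page 4' compare equal."""
    return _DIGITS.sub("#", " ".join(line.split())).lower()


@dataclass(frozen=True)
class Filtered:
    """Diagnostic record for one filtered line (hypothetical shape)."""
    page_index: int
    line_index: int
    text: str
    reason: str


def _edge_indices(n_lines: int, edge_lines: int) -> set:
    """Indices of the first and last `edge_lines` lines of a page."""
    top = range(min(edge_lines, n_lines))
    bottom = range(max(n_lines - edge_lines, 0), n_lines)
    return set(top) | set(bottom)


def split_boilerplate(pages, edge_lines=1, min_ratio=0.6, min_pages=3):
    """Return (body_pages, diagnostics).

    A line is treated as boilerplate only if it sits in a page's edge
    region AND its normalized form recurs on at least `min_ratio` of
    all pages. Short documents are left untouched.
    """
    if len(pages) < min_pages:
        return [list(p) for p in pages], []

    # Count each normalized edge line at most once per page.
    counts = Counter()
    for page in pages:
        idx = _edge_indices(len(page), edge_lines)
        counts.update({normalize(page[i]) for i in idx if page[i].strip()})

    threshold = min_ratio * len(pages)
    repeated = {key for key, c in counts.items() if c >= threshold}

    body, diagnostics = [], []
    for p_idx, page in enumerate(pages):
        idx = _edge_indices(len(page), edge_lines)
        kept = []
        for i, line in enumerate(page):
            if i in idx and normalize(line) in repeated:
                diagnostics.append(Filtered(p_idx, i, line, "repeated-edge-line"))
            else:
                kept.append(line)
        body.append(kept)
    return body, diagnostics
```

Pages are processed in order and sets are used only for membership checks, so the same input always yields the same body and the same diagnostics list. Whether filtered lines are dropped or merely tagged is left to the policy; this sketch only separates them.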

## Sprint Contract
- Done means: repeated top/bottom page-region text can be identified and excluded from main content in tests.
- Hard thresholds: unique body text is not removed; page number patterns are tested; removal decisions are deterministic.
- Files owned: `src/pdftomd/enrichment.py`, tests, `PROGRESS.md`, phase index.
- Dependencies: Paragraph and block model from earlier steps.
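
One way the "page number patterns are tested" threshold could be approached is a small, explicit pattern list that tests can enumerate. The patterns and the `is_page_number` helper below are assumptions for illustration; the formats the project actually needs to cover may differ.

```python
import re

# Hypothetical pattern set: bare numbers, "Page N", "N of M", "N / M", "- N -".
_PAGE_NUMBER_PATTERNS = (
    re.compile(r"^(?:page\s+)?\d{1,4}(?:\s*(?:of|/)\s*\d{1,4})?$", re.IGNORECASE),
    re.compile(r"^-\s*\d{1,4}\s*-$"),
)


def is_page_number(line: str) -> bool:
    """True if the whole line, ignoring surrounding whitespace,
    looks like a page number."""
    text = line.strip()
    return any(p.match(text) for p in _PAGE_NUMBER_PATTERNS)
```

A bare number can also be legitimate body content (a table cell, for example), so a check like this would normally be combined with a page-region test rather than used on its own.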

## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```

## Verification
1. Run the acceptance commands.
2. Confirm that tests cover false-positive protections, i.e. unique body text near the top or bottom of a page is not removed.
3. Update `PROGRESS.md` and this phase index.

## Do Not
- Do not delete content without a confidence rule.
- Do not write filtered text into sidecar document outputs.
- Do not implement CLI reporting here.
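
A confidence rule for the first item above could be as simple as a deterministic mapping from how often a line recurs to an action. The sketch below is illustrative only: the `Action` and `Decision` types, the `decide` function, and the threshold values are assumptions, not the project's policy.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    MARK = "mark"
    REMOVE = "remove"


@dataclass(frozen=True)
class Decision:
    action: Action
    confidence: float
    reason: str


def decide(pages_with_line: int, total_pages: int,
           remove_at: float = 0.8, mark_at: float = 0.5,
           min_pages: int = 3) -> Decision:
    """Map the recurrence of an edge-region line to an action.

    Confidence is the fraction of pages on which the line appears.
    Thresholds are placeholder values chosen for this example.
    """
    if total_pages < min_pages:
        # Too little evidence to call anything "repeated".
        return Decision(Action.KEEP, 0.0, "too few pages")
    confidence = pages_with_line / total_pages
    if confidence >= remove_at:
        return Decision(Action.REMOVE, confidence, "recurs on most pages")
    if confidence >= mark_at:
        return Decision(Action.MARK, confidence, "recurs on many pages")
    return Decision(Action.KEEP, confidence, "not repeated enough")
```

Because the rule is a pure function of two counts, the same document always produces the same decisions, and the `reason` field gives reviewers something to inspect.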