PDFToMD/phases/4-semantic-enrichment/step2.md
김경종 7e985ae94a add files
2026-04-30 17:05:19 +09:00

Step 2: header-footer-filtering

Read First

  • /AGENTS.md
  • /PLAN.md
  • /PROGRESS.md
  • /docs/HARNESS.md
  • /docs/IMPLEMENTATION_PLAN.md
  • /docs/CONVERSION_POLICY.md
  • /phases/4-semantic-enrichment/step1.md

Task

Detect repeated page headers, footers, and page numbers and separate them from the main Markdown body flow.

The implementation should mark or remove repetitive boilerplate according to policy while keeping enough diagnostics for review.
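
As a minimal sketch of the task, assuming each page arrives as a plain list of text lines in reading order: the names `filter_repeated_edges`, `FilteredLine`, `edge_lines`, and `min_ratio` are hypothetical and are not taken from `src/pdftomd/enrichment.py`. The sketch normalizes digits so that "Page 1" and "Page 2" count as the same candidate, keeps every removed line as a diagnostic record, and only ever inspects lines at the top or bottom edge of a page. Note that digit normalization can also match numbered headings that open consecutive pages, so a real implementation would need an additional guard for that case.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FilteredLine:
    """Diagnostic record for one removed line (hypothetical structure)."""
    page: int      # 1-based page number
    text: str      # original, unnormalized text
    reason: str    # why the line was filtered


def _normalize(text):
    # Collapse digit runs so "Page 1" and "Page 2" share one key.
    text = re.sub(r"\d+", "#", text.strip().lower())
    return re.sub(r"\s+", " ", text)


def _edge_indexes(lines, edge_lines):
    n = len(lines)
    top = range(0, min(edge_lines, n))
    bottom = range(max(n - edge_lines, 0), n)
    return sorted(set(top) | set(bottom))


def filter_repeated_edges(pages, edge_lines=1, min_ratio=0.6):
    """Split pages into body lines and filtered boilerplate.

    A line is filtered only if it sits in the top/bottom edge region
    and its normalized form appears in the edge region of at least
    `min_ratio` of all pages (and on at least two pages).
    """
    counts = Counter()
    for lines in pages:
        keys = {_normalize(lines[i]) for i in _edge_indexes(lines, edge_lines)}
        keys.discard("")
        counts.update(keys)  # each key counted once per page

    needed = max(2, math.ceil(len(pages) * min_ratio))

    body, filtered = [], []
    for page_no, lines in enumerate(pages, start=1):
        edges = set(_edge_indexes(lines, edge_lines))
        kept = []
        for i, line in enumerate(lines):
            key = _normalize(line)
            if i in edges and key and counts[key] >= needed:
                filtered.append(FilteredLine(page_no, line, "repeated-edge"))
            else:
                kept.append(line)
        body.append(kept)
    return body, filtered
```

The function uses no randomness and iterates pages in order, so the same input always yields the same decisions. Returning the filtered lines separately, rather than discarding them, is one way to keep diagnostics available for review without mixing them into the body flow.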

Sprint Contract

  • Done means: repeated top/bottom page-region text can be identified and excluded from main content in tests.
  • Hard thresholds: unique body text is not removed; page number patterns are tested; removal decisions are deterministic.
  • Files owned: src/pdftomd/enrichment.py, tests, PROGRESS.md, phase index.
  • Dependencies: Paragraph and block model from earlier steps.
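
The "page number patterns are tested" threshold could be exercised against a small matcher such as the one below. This is a sketch under assumptions: `is_page_number` and the pattern list are hypothetical names, and the chosen patterns (bare numbers, "- N -", "Page N", "Page N of M", "N / M") are common conventions rather than patterns specified by this step or by the conversion policy.

```python
import re

# Hypothetical pattern set; each pattern must match the whole stripped line.
_PAGE_NUMBER_PATTERNS = [
    re.compile(r"^\d{1,4}$"),                      # "12"
    re.compile(r"^-\s*\d{1,4}\s*-$"),              # "- 12 -"
    re.compile(                                    # "Page 4", "p. 4", "Page 4 of 10"
        r"^(page|p\.)\s*\d{1,4}(\s*(of|/)\s*\d{1,4})?$",
        re.IGNORECASE,
    ),
    re.compile(r"^\d{1,4}\s*(/|of)\s*\d{1,4}$", re.IGNORECASE),  # "4 / 10"
]


def is_page_number(text):
    """Return True if the whole line looks like a page number."""
    candidate = text.strip()
    return any(p.match(candidate) for p in _PAGE_NUMBER_PATTERNS)
```

Anchoring every pattern at both ends is the false-positive protection here: a body sentence that merely contains a number, such as "Figure 12 shows the layout", does not match.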

Acceptance Criteria

python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests

Verification

  1. Run the acceptance commands.
  2. Confirm false-positive protections are tested.
  3. Update PROGRESS.md and this phase index.

Do Not

  • Do not delete content unless an explicit confidence rule justifies the removal.
  • Do not write filtered text into sidecar document outputs.
  • Do not implement CLI reporting here.