---
name: conversion-architecture
description: Design PDFtoMD conversion architecture, parser boundaries, internal block models, chunk policy, renderer contracts, output structure, logging, and resume behavior. Use when planning or reviewing conversion engine design.
---
# Conversion Architecture
## Workflow
1. Read `AGENTS.md`, `PLAN.md`, `PROGRESS.md`, `docs/ARCHITECTURE.md`, `docs/CONVERSION_POLICY.md`, and `docs/ADR.md`.
2. Keep responsibilities stable:
   - Marker: layout, OCR, reading order, body, headings, tables, figures, captions
   - Nougat: formula-only LaTeX parsing
   - PyMuPDF: page pre-analysis, text-layer quality, page counts, chunk planning
3. Define interfaces and invariants before implementation.
4. Keep output deterministic and chunked under the documented output contract.
5. Record architecture changes in `docs/ADR.md` when decisions change.
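
As an illustration of steps 2 and 3, the sketch below shows one way to pin down the parser boundaries and a couple of invariants before any implementation exists. Every name in it (`Block`, `BlockKind`, `LayoutParser`, `FormulaParser`, `PageAnalyzer`, `check_invariants`) is hypothetical and not taken from the project or from the Marker, Nougat, or PyMuPDF APIs; the real contracts belong in `docs/ARCHITECTURE.md`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Protocol, Sequence


class BlockKind(Enum):
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    TABLE = "table"
    FIGURE = "figure"
    CAPTION = "caption"
    FORMULA = "formula"


@dataclass(frozen=True)
class Block:
    """Hypothetical internal block model shared by every stage."""
    kind: BlockKind
    page: int    # 0-based page index
    order: int   # reading-order position within the document
    text: str    # Markdown or LaTeX payload


class LayoutParser(Protocol):
    """Boundary for the Marker role: layout, OCR, reading order, body blocks."""
    def parse_pages(self, pdf_path: str, pages: Sequence[int]) -> list[Block]: ...


class FormulaParser(Protocol):
    """Boundary for the Nougat role: formula regions to LaTeX only."""
    def parse_formula(self, pdf_path: str, page: int) -> str: ...


class PageAnalyzer(Protocol):
    """Boundary for the PyMuPDF role: page counts and text-layer quality."""
    def page_count(self, pdf_path: str) -> int: ...
    def has_text_layer(self, pdf_path: str, page: int) -> bool: ...


def check_invariants(blocks: Sequence[Block], page_count: int) -> list[str]:
    """Return invariant violations; an empty list means the blocks are valid."""
    errors = []
    orders = [b.order for b in blocks]
    # Assumed invariant: reading order is strictly increasing with no duplicates.
    if orders != sorted(orders) or len(set(orders)) != len(orders):
        errors.append("reading order must be strictly increasing")
    # Assumed invariant: every block points at a page that exists.
    for b in blocks:
        if not 0 <= b.page < page_count:
            errors.append(f"block {b.order} references page {b.page} out of range")
    return errors
```

Keeping each tool behind its own `Protocol` means a review can check that no adapter takes on another tool's responsibility, and `check_invariants` can run in tests without any PDF library installed.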
## Guardrails
- Do not place conversion logic in a future PyQt UI.
- Do not add document sidecars unless explicitly requested.
- Do not let chunking split a paragraph, table, figure, or formula without a fallback plan.
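
The chunking guardrail can be sketched as a planner that only ever cuts between whole blocks. This is a minimal illustration under assumed names (`Chunk`, `plan_chunks`, `max_chars`) and an assumed character-based size limit; the fallback shown here, emitting an oversized block alone and flagging it, is one possible plan, not the documented one in `docs/CONVERSION_POLICY.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    blocks: tuple[str, ...]
    oversized: bool = False  # True when the fallback path was taken


def plan_chunks(blocks: list[str], max_chars: int) -> list[Chunk]:
    """Group whole blocks into chunks of at most max_chars characters."""
    chunks: list[Chunk] = []
    current: list[str] = []
    size = 0
    for block in blocks:
        if len(block) > max_chars:
            # Fallback: flush pending blocks, then emit the block alone, flagged,
            # instead of splitting a paragraph, table, figure, or formula.
            if current:
                chunks.append(Chunk(tuple(current)))
                current, size = [], 0
            chunks.append(Chunk((block,), oversized=True))
            continue
        if current and size + len(block) > max_chars:
            # The next block would not fit, so close the chunk at the boundary.
            chunks.append(Chunk(tuple(current)))
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        chunks.append(Chunk(tuple(current)))
    return chunks
```

Because the planner depends only on its inputs and processes blocks in order, the same block list and limit always yield the same chunks, which fits the deterministic-output requirement in the workflow.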