add files
@@ -0,0 +1,26 @@
{
  "project": "PDFtoMD",
  "phase": "2-marker-adapter",
  "steps": [
    {
      "step": 0,
      "name": "marker-invocation-adapter",
      "status": "pending"
    },
    {
      "step": 1,
      "name": "ocr-plan-handoff",
      "status": "pending"
    },
    {
      "step": 2,
      "name": "marker-block-normalization",
      "status": "pending"
    },
    {
      "step": 3,
      "name": "marker-failure-reporting",
      "status": "pending"
    }
  ]
}
@@ -0,0 +1,38 @@
# Step 0: marker-invocation-adapter

## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/TOOLCHAIN.md
- /phases/1-core-runtime-contracts/index.json

## Task
Implement the first Marker adapter boundary that invokes Marker through a small internal interface.

Keep this adapter isolated so tests can use fakes without loading large models. Real Marker invocation should be smoke-testable but not required for every unit test.
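
One possible shape for that boundary, as a minimal sketch rather than the project's actual design: a `Protocol` that the rest of the code depends on, plus a fake that returns canned blocks. Every name here (`MarkerBackend`, `ParseResult`, `ParseFailure`, `FakeMarkerBackend`, `run_adapter`) is hypothetical, and no real Marker call is shown; a real backend would implement the same `parse` method.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol, Union


@dataclass(frozen=True)
class ParseResult:
    """Structured parse output (hypothetical shape)."""
    source: Path
    blocks: list = field(default_factory=list)


@dataclass(frozen=True)
class ParseFailure:
    """Structured failure returned instead of a raw exception (hypothetical shape)."""
    source: Path
    reason: str


class MarkerBackend(Protocol):
    """The narrow interface: the only surface callers are allowed to use."""

    def parse(self, pdf_path: Path) -> Union[ParseResult, ParseFailure]: ...


class FakeMarkerBackend:
    """Test double that serves canned blocks and never loads a model."""

    def __init__(self, canned: dict):
        self._canned = canned

    def parse(self, pdf_path: Path) -> Union[ParseResult, ParseFailure]:
        blocks = self._canned.get(pdf_path.name)
        if blocks is None:
            return ParseFailure(source=pdf_path, reason="no canned output")
        return ParseResult(source=pdf_path, blocks=list(blocks))


def run_adapter(backend: MarkerBackend, pdf_path: Path) -> Union[ParseResult, ParseFailure]:
    """Invoke the backend and convert unexpected exceptions into ParseFailure."""
    try:
        return backend.parse(pdf_path)
    except Exception as exc:
        return ParseFailure(source=pdf_path, reason=f"{type(exc).__name__}: {exc}")
```

Unit tests would pass `FakeMarkerBackend` to `run_adapter`, while a separately marked smoke test could pass a backend that wraps the real Marker call.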

## Sprint Contract
- Done means: Marker invocation is behind a narrow interface and can return structured parse results or clear failures.
- Hard thresholds: Marker remains the primary document parser; Nougat is not used here; unit tests avoid mandatory model downloads; parser errors are structured.
- Files owned: `src/pdftomd/marker_adapter.py`, related tests, `PROGRESS.md`, `phases/2-marker-adapter/index.json`.
- Dependencies: Phase 1 runtime contracts.

## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```

## Verification
1. Run the acceptance commands.
2. Confirm the adapter can be tested without external services.
3. Update `PROGRESS.md` and this phase index.

## Do Not
- Do not parse formulas with Nougat.
- Do not implement Markdown rendering.
- Do not make every test load Marker models.
@@ -0,0 +1,38 @@
# Step 1: ocr-plan-handoff

## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/0-harness-foundation/step2.md
- /phases/2-marker-adapter/step0.md

## Task
Connect PyMuPDF page pre-analysis results to the Marker adapter as an OCR/layout handoff plan.

The goal is to preserve page-level OCR decisions without making the entire document scan-only or text-only.
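
A page-aware plan could look roughly like the sketch below. It assumes pre-analysis yields a per-page text character count and image coverage fraction; the record types, the thresholds, and the `backend_supports_page_ocr` flag are all illustrative assumptions, not values taken from the project or from Marker's configuration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PageAnalysis:
    """Hypothetical pre-analysis record for one page."""
    page_number: int
    text_chars: int        # extractable text characters found on the page
    image_coverage: float  # fraction of the page area covered by images, 0..1


@dataclass(frozen=True)
class OcrPlan:
    """Per-page OCR decisions, kept separate from any Markdown output."""
    ocr_pages: tuple
    text_pages: tuple
    fallback_note: Optional[str] = None


def build_ocr_plan(pages, min_text_chars=50, min_image_coverage=0.5):
    """Mark a page for OCR only when it looks scanned (thresholds are assumed)."""
    ocr, text = [], []
    for page in pages:
        looks_scanned = (
            page.text_chars < min_text_chars
            and page.image_coverage >= min_image_coverage
        )
        (ocr if looks_scanned else text).append(page.page_number)
    return OcrPlan(ocr_pages=tuple(ocr), text_pages=tuple(text))


def hand_off(plan, backend_supports_page_ocr):
    """Return the plan unchanged, or record an explicit unsupported-path fallback."""
    if backend_supports_page_ocr or not plan.ocr_pages:
        return plan
    note = f"page-level OCR unsupported; pages {list(plan.ocr_pages)} use default handling"
    return replace(plan, fallback_note=note)
```

Because the fallback is stored on the plan rather than raised or logged into the document body, a mixed text/scanned document keeps its per-page decisions even when the backend cannot honor them.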

## Sprint Contract
- Done means: the adapter accepts page-level OCR candidates and passes the relevant intent into Marker configuration or records an explicit unsupported-path fallback.
- Hard thresholds: OCR decisions stay page-aware; PyMuPDF remains pre-analysis only; no OCR logs are inserted into Markdown.
- Files owned: `src/pdftomd/marker_adapter.py`, `src/pdftomd/preanalysis.py` if needed, tests, `PROGRESS.md`, phase index.
- Dependencies: Phase 0 pre-analysis and Step 0 Marker adapter.

## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```

## Verification
1. Run the acceptance commands.
2. Confirm mixed text/scanned sample traits are represented in tests.
3. Update `PROGRESS.md` and this phase index.

## Do Not
- Do not force document-wide OCR when only selected pages need OCR.
- Do not implement reading-order fixes here.
- Do not add a second primary parser.
@@ -0,0 +1,39 @@
# Step 2: marker-block-normalization

## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/ARCHITECTURE.md
- /docs/CONVERSION_POLICY.md
- /phases/0-harness-foundation/step1.md
- /phases/2-marker-adapter/step0.md

## Task
Map Marker structured output into the internal block model for headings, paragraphs, lists, tables, figures, captions, and equation candidates.

Prefer structured Marker APIs or JSON-like structures over scraping final Markdown.
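
The mapping might be sketched as a lookup table over a JSON-like input, as below. The input keys (`block_type`, `text`, `page`, `bbox`) and the type labels in `ROLE_BY_LABEL` are assumptions made for illustration and would need to be checked against the output of the installed Marker version; `BlockRole` and `Block` are likewise hypothetical stand-ins for the internal block model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BlockRole(Enum):
    HEADING = "heading"
    PARAGRAPH = "paragraph"
    LIST = "list"
    TABLE = "table"
    FIGURE = "figure"
    CAPTION = "caption"
    EQUATION_CANDIDATE = "equation_candidate"  # resolved later, not here
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Block:
    role: BlockRole
    text: str
    page: Optional[int] = None
    bbox: Optional[tuple] = None


# Assumed labels; verify against real or recorded Marker output.
ROLE_BY_LABEL = {
    "SectionHeader": BlockRole.HEADING,
    "Text": BlockRole.PARAGRAPH,
    "ListGroup": BlockRole.LIST,
    "Table": BlockRole.TABLE,
    "Figure": BlockRole.FIGURE,
    "Caption": BlockRole.CAPTION,
    "Equation": BlockRole.EQUATION_CANDIDATE,
}


def normalize_block(raw: dict) -> Block:
    """Map one raw structure to a Block, keeping page and bbox when present."""
    role = ROLE_BY_LABEL.get(raw.get("block_type", ""), BlockRole.UNKNOWN)
    bbox = raw.get("bbox")
    return Block(
        role=role,
        text=raw.get("text", ""),
        page=raw.get("page"),
        bbox=tuple(bbox) if bbox is not None and len(bbox) == 4 else None,
    )


def normalize_blocks(raw_blocks: list) -> list:
    """Normalize a sequence of raw structures, preserving their order."""
    return [normalize_block(raw) for raw in raw_blocks]
```

Unrecognized labels map to `BlockRole.UNKNOWN` instead of being dropped, so tests can assert that a new Marker block type shows up rather than vanishing silently.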

## Sprint Contract
- Done means: fake Marker structures and at least one real or recorded sample shape map into internal block types.
- Hard thresholds: semantic block roles are preserved; bounding boxes and page numbers survive where available; formula blocks are only marked as candidates for Phase 3.
- Files owned: `src/pdftomd/marker_adapter.py`, model additions if required, tests, `PROGRESS.md`, phase index.
- Dependencies: Phase 0 models and Step 0 adapter.

## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```

## Verification
1. Run the acceptance commands.
2. Confirm no final Markdown scraping is required for normal block mapping.
3. Update `PROGRESS.md` and this phase index.

## Do Not
- Do not perform Nougat conversion.
- Do not render Markdown.
- Do not discard page or bounding-box metadata without a documented reason.
@@ -0,0 +1,38 @@
# Step 3: marker-failure-reporting

## Read First
- /AGENTS.md
- /PLAN.md
- /PROGRESS.md
- /docs/HARNESS.md
- /docs/IMPLEMENTATION_PLAN.md
- /docs/CONVERSION_POLICY.md
- /phases/2-marker-adapter/step0.md
- /phases/2-marker-adapter/step2.md

## Task
Define structured Marker failure reporting for parser errors, unsupported pages, timeout-like failures, and recoverable partial output.

This prepares later CLI and resume behavior without writing CLI code.
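
A typed failure record covering those four cases could be sketched as follows. The enum members mirror the categories named in the task; the `MarkerFailure` fields, the exception-to-kind mapping, and the fallback-eligibility choices are assumptions for illustration, not the project's policy.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    PARSER_ERROR = "parser_error"
    UNSUPPORTED_PAGE = "unsupported_page"
    TIMEOUT = "timeout"
    PARTIAL_OUTPUT = "partial_output"


@dataclass(frozen=True)
class MarkerFailure:
    kind: FailureKind
    message: str
    page: Optional[int] = None      # page context, where available
    chunk_id: Optional[str] = None  # chunk context, where available
    fallback_eligible: bool = False


def classify_exception(exc, page=None, chunk_id=None):
    """Turn an exception into a typed failure (the mapping is an assumed policy)."""
    if isinstance(exc, TimeoutError):
        kind, eligible = FailureKind.TIMEOUT, True
    elif isinstance(exc, NotImplementedError):
        kind, eligible = FailureKind.UNSUPPORTED_PAGE, True
    else:
        kind, eligible = FailureKind.PARSER_ERROR, False
    message = f"{type(exc).__name__}: {exc}"
    return MarkerFailure(kind, message, page, chunk_id, eligible)


@dataclass
class RunReport:
    """Runtime reporting path: failures live here, never in Markdown chunks."""
    markdown_chunks: list = field(default_factory=list)
    failures: list = field(default_factory=list)

    def record(self, failure: MarkerFailure) -> None:
        self.failures.append(failure)
```

Keeping `failures` and `markdown_chunks` as separate fields makes the "errors go to runtime reporting paths, not document body" threshold directly assertable in tests.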

## Sprint Contract
- Done means: Marker adapter failures are typed, testable, and do not corrupt generated Markdown content.
- Hard thresholds: failures include page/chunk context where available; errors go to runtime reporting paths, not document body; fallback eligibility is explicit.
- Files owned: `src/pdftomd/marker_adapter.py`, error/reporting models, tests, `PROGRESS.md`, phase index.
- Dependencies: Steps 0 and 2.

## Acceptance Criteria
```powershell
python scripts\validate_workspace.py
.\venv\python.exe -m pytest tests
```

## Verification
1. Run the acceptance commands.
2. Confirm failure messages are actionable for CLI and evaluator use.
3. Update `PROGRESS.md` and this phase index.

## Do Not
- Do not silently swallow Marker failures.
- Do not implement resume state here.
- Do not write errors into Markdown chunks.