remove files
-177
@@ -1,177 +0,0 @@
# PDFtoMD Progress

## Current Status
- Date: 2026-04-30.
- Mode: Phase 1 implementation complete; ready for Phase 1 review or Phase 2 handoff.
- Implementation status: Phase 0 and Phase 1 complete.
- Custom agent creation: project-scoped read-only agents created after user approval.
- Persistent multi-agent coordination files were approved by the user.

## Completed
- Read repository instructions and project documents:
  - `AGENTS.md`
  - `docs/PRD.md`
  - `docs/ARCHITECTURE.md`
  - `docs/ADR.md`
  - `docs/UI_GUIDE.md`
- Confirmed project direction:
  - Windows-native local CLI/library engine first.
  - Marker for document structure, reading order, tables, figures, headings, and captions.
  - Nougat only for mathematical expressions/formulas.
  - PyMuPDF for PDF analysis and chunk planning.
  - Markdown chunk output plus image/table assets.
- Recorded detailed conversion policy decisions in `docs/CONVERSION_POLICY.md`.
- Strengthened project documentation with current research and decisions:
  - `AGENTS.md`
  - `README.md`
  - `docs/PRD.md`
  - `docs/ARCHITECTURE.md`
  - `docs/ADR.md`
  - `docs/TOOLCHAIN.md`
  - `docs/UI_GUIDE.md`
- Strengthened `AGENTS.md` multi-agent coordination rules so every new agent reads `PLAN.md` and `PROGRESS.md` first and can identify current goals, assigned scope, completed work, blockers, next work, and conflict risks.
- Created project-scoped Codex extensions under `.codex/`:
  - Agents: `pdf-toolchain-researcher`, `sample-corpus-analyst`, `conversion-architect`, `quality-evaluator`, `formula-pipeline-specialist`, `layout-table-figure-specialist`.
  - Commands: `status`, `env-check`, `sample-audit`, `quality-plan`, `conversion-policy-review`, `model-cache-check`, `phase-draft`.
  - Skills: `pdf-toolchain`, `sample-corpus`, `conversion-architecture`, `formula-quality`, `markdown-quality`, `windows-runtime`.
  - Hooks: strengthened risky command guard, added handoff policy and drift policy hooks.
- Validated `.codex` extension formats:
  - Agent TOML files parsed successfully.
  - `.codex/hooks.json` parsed successfully.
  - Hook Python scripts compiled successfully.
  - All `.codex/skills/*/SKILL.md` files passed `skill-creator` quick validation.
- Confirmed user environment:
  - Windows 10.
  - NVIDIA GeForce GTX 1070 Ti.
  - 8 GB VRAM.
  - NVIDIA driver 577.00.
  - `nvidia-smi` reports CUDA version 12.9 (the highest CUDA version the installed driver supports).
  - User reports CUDA 12.4 installed.
  - Current detected Python: Miniforge Python 3.12.7.
  - Conda is available.
  - `uv` is not available.
- Created repo-local environment:
  - `venv`: Python 3.11.15, unified Marker/PyMuPDF/Pandas/test/Nougat environment.
- Removed previous experimental `venv-nougat` directory after unified `venv` validation passed.
- Verified unified environment:
  - `torch==2.7.1+cu126`
  - `torchvision==0.22.1+cu126`
  - `marker-pdf==1.10.2`
  - `nougat-ocr==0.1.17`
  - `transformers==4.57.6`
  - `albumentations==1.3.1`
  - `fsspec==2026.2.0`
  - `pymupdf==1.27.2.3`
  - `pandas==3.0.2`
  - `pytest==9.0.3`
  - `Pillow==10.4.0`
  - `pypdfium2==4.30.0`
  - `opencv-python-headless==4.11.0.86`
  - `pip check`: passed.
  - CUDA tensor operation on GTX 1070 Ti: passed.
  - `venv\Scripts\nougat.exe --help`: passed.
- Ran earlier repository validation before default Python test discovery was added:
  - `python scripts/validate_workspace.py`: passed at that time with no configured validation commands.
- Confirmed sample PDFs:
  - `samples/2007쉘구조물의유한요소해석에대하여.pdf`: 13 pages, first page text length 3523, first page images 0.
  - `samples/FourNodeQuadrilateralShellElementMITC4.pdf`: 7 pages, first page text length 3269, first page images 0.
  - `samples/MITC공부.pdf`: 13 pages, first page text length 226, first page images 2.
  - `samples/유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf`: 76 pages, first page text length 446, first page images 10.
- Strengthened the project for Anthropic-style Harness Engineering:
  - Added `docs/HARNESS.md` with planner/generator/evaluator roles, file protocol, Sprint Contract template, evaluator hard thresholds, and simplification rules.
  - Added executable phase registry `phases/index.json`.
  - Added first self-contained phase `phases/0-harness-foundation/` with four pending steps:
    - `sample-metadata-contract`
    - `core-package-skeleton`
    - `page-preanalysis-contract`
    - `markdown-quality-gates`
  - Updated `AGENTS.md`, `PLAN.md`, `README.md`, `docs/ARCHITECTURE.md`, and `docs/ADR.md` to reference the Harness workflow.
  - Added `.codex/commands/sprint-contract.md`.
  - Strengthened Harness workflow/review skill guidance to require Sprint Contracts.
  - Updated hooks for simpler Windows-friendly command paths and expanded handoff checks to include `phases/`, `scripts/`, `.agents/`, and `plugins/`.
  - Made `scripts/validate_workspace.py` discover repo-local Python validation by default.
  - Added `scripts/test_validate_workspace.py` and fixed `scripts/test_execute.py` UTF-8 fixture handling on Windows.
- Established the full phase-by-phase implementation roadmap before starting engine implementation:
  - Added `docs/IMPLEMENTATION_PLAN.md`.
  - Expanded `phases/index.json` from Phase 0 only to Phases 0 through 9.
  - Added executable pending step contracts for:
    - `1-core-runtime-contracts`
    - `2-marker-adapter`
    - `3-formula-pipeline`
    - `4-semantic-enrichment`
    - `5-markdown-rendering-assets`
    - `6-cli-runtime-resume`
    - `7-mvp-quality-hardening`
    - `8-release-docs-packaging`
    - `9-pyqt-thin-client`
  - Updated `PLAN.md`, `AGENTS.md`, and `README.md` to point new agents to the full implementation roadmap.
- Implemented Phase 0 Harness foundation:
  - Step 0 `sample-metadata-contract`: added deterministic `samples/metadata.json` and metadata contract tests.
  - Step 1 `core-package-skeleton`: added `pyproject.toml`, importable `src/pdftomd` package, typed model contracts, and model tests.
  - Step 2 `page-preanalysis-contract`: added PyMuPDF-only `analyze_pdf()` preanalysis, deterministic OCR candidate logic, and chunk candidate tests.
  - Step 3 `markdown-quality-gates`: added focused Markdown quality gates and tests for math delimiters, LaTeX environments, image links, tables, chunk frontmatter, and anchors.
  - Parallel work was split by disjoint write scopes: sample metadata/model contracts first, then preanalysis/quality gates.
- Reviewed Phase 0 with `harness-review` criteria:
  - No blocking findings.
  - Architecture boundary remained intact: Marker and Nougat are not invoked in foundation contracts, and PyMuPDF is limited to page pre-analysis.
  - `python scripts\validate_workspace.py` passed before Phase 1 work started.
- Implemented Phase 1 core runtime contracts:
  - Step 0 `input-normalization-slug`: added deterministic PDF path normalization, document identity creation, anchors, and output bundle path contracts.
  - Step 1 `conversion-options-config`: added typed conversion options, runtime modes, and formula parser options without CLI parsing.
  - Step 2 `output-bundle-contract`: added deterministic document bundle paths while keeping runtime artifacts separate from document output.
  - Step 3 `runtime-cache-policy`: added explicit `.models/` default cache policy, `PDFTOMD_MODEL_CACHE` override, Hugging Face offline environment mappings, and runtime artifact paths.
  - Updated `docs/TOOLCHAIN.md` and `.gitignore` for model cache policy.
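The runtime cache policy from Phase 1 Step 3 can be pictured with a minimal sketch. The `.models/` default and the `PDFTOMD_MODEL_CACHE` override come from the notes above; the helper names `resolve_model_cache` and `offline_env`, the `huggingface` subdirectory, and the exact set of Hugging Face variables are assumptions for illustration, not the project's actual code.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Names taken from the progress notes: default cache directory and override variable.
DEFAULT_CACHE_DIRNAME = ".models"
CACHE_ENV_VAR = "PDFTOMD_MODEL_CACHE"


def resolve_model_cache(repo_root: Path, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the model cache directory (hypothetical helper).

    Uses PDFTOMD_MODEL_CACHE when it is set and non-blank,
    otherwise falls back to <repo_root>/.models.
    """
    env = os.environ if env is None else env
    override = env.get(CACHE_ENV_VAR, "").strip()
    if override:
        return Path(override).expanduser()
    return repo_root / DEFAULT_CACHE_DIRNAME


def offline_env(cache_dir: Path) -> dict[str, str]:
    """Build environment variables for offline Hugging Face use.

    HF_HOME, HF_HUB_OFFLINE, and TRANSFORMERS_OFFLINE are real Hugging Face
    settings; which of them the project maps is an assumption here.
    """
    return {
        "HF_HOME": str(cache_dir / "huggingface"),
        "HF_HUB_OFFLINE": "1",
        "TRANSFORMERS_OFFLINE": "1",
    }
```

Passing the environment as a mapping keeps the resolution deterministic in tests: `resolve_model_cache(Path("repo"), {})` yields `repo/.models` without touching the real process environment.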
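The deterministic OCR candidate logic from Phase 0 Step 2 is not spelled out in these notes. As a rough sketch of what a page-level rule could look like, the example below flags a page when its text layer is thin and it carries embedded images. The `PageStats` type, the `is_ocr_candidate` name, and the 500-character threshold are all hypothetical; only the sample figures (first-page text length and image count) come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageStats:
    """Per-page figures of the kind recorded for the sample PDFs."""

    text_length: int  # characters in the extracted text layer
    image_count: int  # embedded images on the page


def is_ocr_candidate(page: PageStats, min_text_chars: int = 500) -> bool:
    """Flag a page for OCR when the text layer is thin and images are present.

    The 500-character threshold is an illustrative assumption,
    not the project's actual rule.
    """
    return page.text_length < min_text_chars and page.image_count > 0


# First-page figures copied from the sample PDF list.
first_pages = {
    "2007쉘구조물의유한요소해석에대하여.pdf": PageStats(3523, 0),
    "FourNodeQuadrilateralShellElementMITC4.pdf": PageStats(3269, 0),
    "MITC공부.pdf": PageStats(226, 2),
    "유한요소해석법을이용한쉘구조물의동적좌굴해석.pdf": PageStats(446, 10),
}
candidates = [name for name, stats in first_pages.items() if is_ocr_candidate(stats)]
```

Under this assumed threshold, the first pages of `MITC공부.pdf` and the 76-page sample would be flagged, while the two text-heavy samples would not.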

## Web Research Notes
- Marker currently supports Markdown/JSON/chunks/HTML output and includes tables, equations, inline math, image extraction, layout, and reading-order functionality.
- Nougat is the intended isolated formula parser candidate; Windows GPU use depends on a correct PyTorch install.
- PyMuPDF remains appropriate for page counting, PDF splitting/chunk planning, and low-level image/page operations.
- PyMuPDF4LLM, Docling, and MinerU are useful comparison baselines but are not the primary parser under the current architecture.
- MathJax notes that `$...$` inline math can conflict with ordinary dollar signs, so delimiter validation is required.
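The delimiter concern in the last note can be illustrated with a small check. This is a minimal sketch, not the project's quality gate: the function name `unbalanced_inline_math` is hypothetical, and checking line by line is a simplification that would misreport inline math wrapped across lines.

```python
import re

# A single "$" that is neither escaped with a backslash nor part of a "$$" pair.
_INLINE_DOLLAR = re.compile(r"(?<![\\$])\$(?!\$)")


def unbalanced_inline_math(markdown: str) -> list[int]:
    """Return 1-based line numbers whose count of inline "$" delimiters is odd."""
    flagged = []
    for number, line in enumerate(markdown.splitlines(), start=1):
        if len(_INLINE_DOLLAR.findall(line)) % 2 == 1:
            flagged.append(number)
    return flagged
```

A line such as `The book costs $5 today.` contains one unescaped dollar sign and is flagged, whereas `$E = mc^2$`, `$$a + b$$`, and `\$5` pass.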

## In Progress
- None.

## Blockers
- None yet.

## Decisions
- Personal-use context lowers immediate licensing risk, but Marker GPL/model license implications must be revisited before redistribution or commercial use.
- Mixed text/scanned PDFs are in scope, with page-level OCR intervention decisions based on lightweight text-layer quality analysis.
- Marker owns layout, reading order, body text, headings, tables, figures, captions, and OCR/layout handling.
- Nougat owns only mathematical expressions and formula blocks, with Marker text fallback on failure.
- Markdown tables are preferred, but limited HTML tables and table-region screenshot fallbacks are allowed for complex tables.
- Figure/table/formula numbers and body references should become internal Markdown links when confidence is sufficient.
- Chunking should prefer logical block boundaries over strict 20-page boundaries when a block would be split.
- Chunk Markdown may include concise frontmatter with core context, but document-output sidecars remain out of scope by default.
- CLI should write warnings/errors to stderr and local logs, not into generated Markdown.
- Resume support may use local runtime state/cache files to skip successful chunks.
- Custom agents will be created later, only one at a time after explicit user approval.
- Planning files are the source of truth for multi-agent coordination.
- Harness phase files now exist. `PLAN.md` remains the overall plan, `PROGRESS.md` remains the handoff state, and `phases/{phase}/index.json` is the phase execution status.
- Each future implementation step should use the `docs/HARNESS.md` planner/generator/evaluator workflow and include a Sprint Contract before code changes.
- Full implementation sequencing is recorded in `docs/IMPLEMENTATION_PLAN.md`; phase files are pending tickets and should not be executed out of dependency order.
- Phase 0 and Phase 1 are complete. `phases/index.json` marks both `0-harness-foundation` and `1-core-runtime-contracts` as completed.
- Main and Nougat dependencies can share one environment when Nougat's loose dependencies are pinned explicitly.
- `torch==2.11.0+cu128` was rejected for this machine because it does not support GTX 1070 Ti `sm_61`.
- `torch==2.7.1+cu126` was selected because it satisfies Marker `torch>=2.7.0` and successfully runs CUDA tensor operations on GTX 1070 Ti.
- `nougat-ocr==0.1.17` requires dependency pins:
  - `transformers==4.57.6`, because `transformers 5.7.0` breaks Nougat imports.
  - `albumentations==1.3.1`, because `albumentations 2.x` breaks Nougat transform initialization.
  - `fsspec==2026.2.0`, because newer `fsspec` conflicts with `datasets`.
  - `pypdfium2==4.30.0`, `opencv-python-headless==4.11.0.86`, and `Pillow==10.4.0`, because Marker/Surya depend on these versions and Nougat can operate with them.
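Because the unified environment depends on these exact pins, a small drift check may be useful. The sketch below compares installed versions against the pins recorded above; the helper names `installed_versions` and `find_pin_mismatches` are hypothetical and not part of the repository.

```python
from importlib import metadata
from typing import Iterable, Mapping, Optional

# Pins copied from the Decisions list.
PINS = {
    "torch": "2.7.1+cu126",
    "nougat-ocr": "0.1.17",
    "transformers": "4.57.6",
    "albumentations": "1.3.1",
    "fsspec": "2026.2.0",
    "pypdfium2": "4.30.0",
    "opencv-python-headless": "4.11.0.86",
    "Pillow": "10.4.0",
}


def installed_versions(names: Iterable[str]) -> dict[str, Optional[str]]:
    """Look up installed distribution versions; None means not installed."""
    found: dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_pin_mismatches(
    installed: Mapping[str, Optional[str]],
    pins: Mapping[str, str] = PINS,
) -> list[str]:
    """Return one message per package that is missing or off its pin."""
    problems = []
    for name, wanted in pins.items():
        actual = installed.get(name)
        if actual is None:
            problems.append(f"{name}: not installed (want {wanted})")
        elif actual != wanted:
            problems.append(f"{name}: found {actual}, want {wanted}")
    return problems
```

Running `find_pin_mismatches(installed_versions(PINS))` inside the `venv` environment would return an empty list when every pin holds.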
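The chunking decision recorded above (prefer logical block boundaries over strict 20-page boundaries) can be sketched as a small planner. This is an illustration under stated assumptions: the function `plan_chunks` and its `blocks` input (inclusive page ranges that should not be split) are hypothetical, and the choice to extend a chunk to the end of a straddling block, rather than cut before it, is one possible policy.

```python
def plan_chunks(
    page_count: int,
    blocks: list[tuple[int, int]],
    target: int = 20,
) -> list[tuple[int, int]]:
    """Return inclusive 1-based (first, last) page ranges for each chunk.

    A chunk normally holds `target` pages. When a block would be split by the
    boundary after `last`, the chunk is extended to the end of that block.
    """
    chunks = []
    first = 1
    while first <= page_count:
        last = min(first + target - 1, page_count)
        # Blocks are visited in start order, so one pass handles chained overlaps.
        for start, end in sorted(blocks):
            if start <= last < end:
                last = min(end, page_count)
        chunks.append((first, last))
        first = last + 1
    return chunks
```

For a 76-page document with one block spanning pages 19 to 22, `plan_chunks(76, [(19, 22)])` returns `[(1, 22), (23, 42), (43, 62), (63, 76)]`: the first chunk grows by two pages so the block stays whole.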

## Next Work
1. Review Phase 1 output with `harness-review` before moving to Phase 2.
2. If review passes, start `phases/2-marker-adapter/step0.md`.
3. Execute phases in order unless `PLAN.md` and `docs/IMPLEMENTATION_PLAN.md` are updated with a clear dependency rationale.
4. Do not create new custom agents unless the user explicitly approves another agent.

## Latest Validation
- `.\venv\python.exe -m pytest scripts\test_validate_workspace.py`: passed, 7 tests.
- `.\venv\python.exe -m py_compile scripts\execute.py scripts\validate_workspace.py .codex\hooks\*.py`: passed.
- JSON parse check for `phases/index.json`, `phases/0-harness-foundation/index.json`, and `.codex/hooks.json`: passed.
- Phase structure check for all `stepN.md` files: passed.
- `.codex/commands/*.md` frontmatter check: passed.
- `python scripts\validate_workspace.py`: passed, 103 tests after Phase 1 implementation.
- `.\venv\python.exe -c "import pdftomd; print(pdftomd.__name__)"`: passed after adding editable package metadata.