remove files
@@ -1,63 +0,0 @@
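# PDFtoMD

PDFtoMD is a local-first conversion engine that turns math-, engineering-, and mechanics-focused PDFs into a bundle of Markdown documents that AI agents can read easily.

The goal is not plain text extraction but structured conversion that preserves the source document's reading order, paragraph flow, equations, tables, figures, captions, and in-text references.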

## Status

- Current phase: Harness foundation planning.
- Implementation: not started.
- Primary target: Windows 10 native CLI/library engine.
- UI: future PyQt thin client.

## Core Direction

- Marker handles document structure, reading order, OCR/layout, body text, tables, figures, headings, and captions.
- Nougat handles only mathematical expressions and formula blocks.
- PyMuPDF handles lightweight page analysis, text-layer quality checks, page counts, chunk planning, and low-level PDF operations.
- Mixed text/scanned PDFs are in scope.
- Output is chunked Markdown plus image/table assets under a document slug directory.
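
The chunked output described in the last bullet can be illustrated with a small, stdlib-only sketch. It does not call Marker, Nougat, or PyMuPDF. The names `Chunk`, `plan_chunks`, and `chunk_filename`, the 10-page chunk size, and the `chunk-001.md` naming pattern are all assumptions made for illustration, not decisions recorded in this repository; in practice the page count would come from PyMuPDF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int       # 1-based chunk number
    first_page: int  # 1-based, inclusive
    last_page: int   # 1-based, inclusive


def plan_chunks(page_count: int, pages_per_chunk: int = 10) -> list[Chunk]:
    """Split a document into consecutive page ranges (hypothetical policy)."""
    if page_count < 0 or pages_per_chunk < 1:
        raise ValueError("page_count must be >= 0 and pages_per_chunk >= 1")
    chunks = []
    starts = range(1, page_count + 1, pages_per_chunk)
    for index, first in enumerate(starts, start=1):
        # The final chunk may be shorter than pages_per_chunk.
        last = min(first + pages_per_chunk - 1, page_count)
        chunks.append(Chunk(index, first, last))
    return chunks


def chunk_filename(slug: str, chunk: Chunk) -> str:
    """Build an assumed output path such as 'sample-doc/chunk-001.md'."""
    return f"{slug}/chunk-{chunk.index:03d}.md"


if __name__ == "__main__":
    # 76 pages matches the long sample document mentioned in this README.
    for chunk in plan_chunks(76):
        print(chunk_filename("sample-doc", chunk), chunk.first_page, chunk.last_page)
```

Keeping the plan as plain data separates the page-range decision from the conversion itself, so a chunking policy could be tested without loading any models.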

## Environment

Use one repo-local Python 3.11 environment.

```powershell
# Create a conda environment at .\venv with Python 3.11
conda create -p .\venv python=3.11 -y
# Install the project dependencies into that environment
.\venv\python.exe -m pip install -r requirements.txt
```

Verified local baseline:
- Windows 10
- NVIDIA GeForce GTX 1070 Ti, 8 GB VRAM
- NVIDIA driver 577.00
- PyTorch `2.7.1+cu126`
- Marker `1.10.2`
- Nougat OCR `0.1.17`

## Verification

```powershell
# Run the Harness workspace validation
python scripts\validate_workspace.py
# Check installed packages for dependency conflicts
.\venv\python.exe -m pip check
# Confirm PyTorch can place a tensor on the CUDA device
.\venv\python.exe -c "import torch; x=torch.ones((1,), device='cuda'); print(torch.__version__, torch.version.cuda, x.item())"
# Confirm the Nougat CLI is installed and starts
.\venv\Scripts\nougat.exe --help
```

By default, `scripts/validate_workspace.py` now discovers repo-local Python validation on its own: it prefers `.\venv\python.exe`, compiles the Harness scripts, and runs `scripts/test_*.py` with pytest. Setting `HARNESS_VALIDATION_COMMANDS` or defining npm scripts overrides this discovery.
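
The discovery order described above could look roughly like the following sketch. This is not the code in `scripts/validate_workspace.py`; `discover_commands` is a hypothetical helper, the one-command-per-line format for `HARNESS_VALIDATION_COMMANDS` is an assumption, and the npm-script override is not modeled at all.

```python
import os
from pathlib import Path


def discover_commands(repo_root: Path, env: dict[str, str]) -> list[list[str]]:
    """Return validation commands as argv lists (illustrative only)."""
    override = env.get("HARNESS_VALIDATION_COMMANDS", "").strip()
    if override:
        # Assumption: one command per line, arguments split on whitespace.
        return [line.split() for line in override.splitlines() if line.strip()]

    venv_python = repo_root / "venv" / "python.exe"
    # Prefer the repo-local interpreter; otherwise use 'python' from PATH.
    python = str(venv_python) if venv_python.is_file() else "python"

    # Byte-compile the scripts directory as a cheap syntax check.
    commands = [[python, "-m", "compileall", "-q", "scripts"]]
    # Only schedule pytest when at least one scripts/test_*.py file exists.
    if any((repo_root / "scripts").glob("test_*.py")):
        commands.append([python, "-m", "pytest", "scripts"])
    return commands


if __name__ == "__main__":
    for argv in discover_commands(Path.cwd(), dict(os.environ)):
        print(" ".join(argv))
```

Returning argv lists instead of shell strings avoids quoting problems with Windows paths, which is one reason a sketch like this might prefer that shape.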

## Important Documents

- `AGENTS.md`: persistent repository instructions.
- `PLAN.md`: multi-agent planning state.
- `PROGRESS.md`: multi-agent progress state.
- `phases/`: executable Harness phase tickets.
- `docs/PRD.md`: product requirements.
- `docs/ARCHITECTURE.md`: engine architecture.
- `docs/CONVERSION_POLICY.md`: detailed conversion decisions.
- `docs/HARNESS.md`: planner/generator/evaluator Harness workflow.
- `docs/IMPLEMENTATION_PLAN.md`: full phase-by-phase implementation roadmap.
- `docs/ADR.md`: architecture decision records.
- `docs/TOOLCHAIN.md`: toolchain and dependency notes.
- `docs/UI_GUIDE.md`: future PyQt UI guidance.

## Sample Corpus

The `samples/` directory is used for quality evaluation and regression tests. The current sample PDFs cover Korean filenames, engineering and mechanics content, formulas, figures, and one long 76-page document.

Before implementation, create a sample metadata mapping file that tags each PDF by text-layer quality, scanned pages, multi-column layout, formula density, table density, figure density, and Korean filename coverage.
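
One possible shape for that metadata mapping is sketched below. The `SampleMetadata` class, its field names, the density labels, and the example filename are all hypothetical; the repository does not yet define a schema, so treat this only as a starting point.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SampleMetadata:
    filename: str
    text_layer_quality: str        # assumed labels: "good", "poor", "none"
    scanned_pages: list[int] = field(default_factory=list)  # 1-based pages
    multi_column: bool = False
    formula_density: str = "low"   # assumed scale: "low", "medium", "high"
    table_density: str = "low"
    figure_density: str = "low"
    korean_filename: bool = False


def has_hangul(name: str) -> bool:
    """True if the name contains a Hangul syllable (U+AC00 to U+D7A3)."""
    return any("\uac00" <= ch <= "\ud7a3" for ch in name)


def to_json(samples: list[SampleMetadata]) -> str:
    # ensure_ascii=False keeps Korean filenames readable in the output.
    return json.dumps([asdict(s) for s in samples], ensure_ascii=False, indent=2)


if __name__ == "__main__":
    name = "역학_예제.pdf"  # hypothetical filename, not an actual sample
    sample = SampleMetadata(
        filename=name,
        text_layer_quality="poor",
        scanned_pages=[3, 4],
        formula_density="high",
        korean_filename=has_hangul(name),
    )
    print(to_json([sample]))
```

Deriving `korean_filename` from the name itself, instead of tagging it by hand, keeps that one field from drifting out of sync with the corpus.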