MultiPhysicsVault/docs/audits/v1.7.2-sss-plus-plan.md
김경종 72dad72703
2026-05-28 10:57:16 +09:00

# v1.7.2 + v1.8.0 "Best Ever Per Priority Research" Plan
**Date:** 2026-05-17
**Branch:** continue on `v1.7.0-compound-vault` (still local-only)
**Goal:** close every honest deduction (v1.7.2 polish) AND add methodology modes (v1.8.0 — compass artifact priority gap 5) to land at 5/7 axes #1 per the original research
**Estimated effort:** 10-12 hours focused work (4-5h v1.7.2 + 6-7h v1.8.0)
**Termination conditions:**
- v1.7.2 ship gate after Phase 6 (verifier + chair clean OR 2-round cap fired)
- v1.8.0 ship gate after Phase 8 (verifier + chair clean OR 2-round cap fired)
- 14h hard time cap; if v1.7.2 takes >6h, defer v1.8.0 to a separate session
---
## 0. Why this plan exists
Three rounds of verifier + chair scrutiny converged on `97/100`:
- Round 1 (initial v1.7.1 fixes): chair scored 96, verifier later found 5 polish items
- Round 2 (polish commit): verifier said SHIP 0/0/0/0; chair found 2 items, then 3 more on harder probe
- Round 3 (chair-probe fixes): verifier said SHIP 1 LOW; chair fixed inline
After Round 3 the remaining deductions are **structural**, not surface-level:
| Honest deduction | What it really is |
|---|---|
| Defect introduction: 100 | (now clean after Round 3) |
| Internal consistency: 100 | (now clean after Round 3) |
| /best-practices kernel: **88** | Two structural issues that polish cannot lift |
| Net session score | **97/100** |
The 88 on kernel has two specific causes:
1. **`+5819 / -30 LOC`** across 41 files since `main`. The kernel says "delete more than you add"; this is the opposite.
2. **Three rounds were needed to converge**. A kernel-disciplined slice would land in one pass.
Plus the broader repo still has:
- **14 MEDIUM** findings open from the v1.7.0 audit
- **10 LOW** findings open from the v1.7.0 audit
A genuine `100/100` requires all of these closed or explicitly deferred with rationale. This plan does that.
---
## 1. Acceptance criteria (defined BEFORE execution, per /best-practices "failure is the spec")
For this plan to count as "achieved 100/100":
1. Final verifier dispatch returns **0 BLOCKER / 0 HIGH / 0 MEDIUM / 0 LOW** on the entire `main..HEAD` diff
2. Final chair adversarial probe (≥10 specific tests, listed in §7) returns **0 functional findings**
3. Net LOC delta `main..HEAD` shows non-trivial deletion: **net additions ≤ +5000** OR **deletions ≥ 200 LOC** (whichever fires; both are honest measures of having pruned something)
4. **Every** M1–M14 + L1–L10 is either CLOSED (with commit SHA in audit doc) or DEFERRED (with one-line milestone + rationale in audit doc). No silent omissions.
5. `make test` stays green throughout
6. Branch remains local until explicit user push authorization
7. `agents/verifier.md` updated with the **git-hygiene cut** + any other self-improvement that emerged from the three-round retrospective
If any of these 7 cannot be met, the plan SHIPS at the achieved score with the gap explicitly documented. No silent shortfalls.
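Criterion 4 lends itself to a mechanical check. The following is a minimal sketch, assuming the audit ledger has been loaded into a plain dict; the `ledger` shape and the names `expected_ids` and `uncovered_findings` are hypothetical and not part of the repo.

```python
def expected_ids():
    """IDs the plan requires coverage for: M1..M14 and L1..L10."""
    return [f"M{i}" for i in range(1, 15)] + [f"L{i}" for i in range(1, 11)]


def uncovered_findings(ledger):
    """Return IDs that are neither CLOSED with a SHA nor DEFERRED with a rationale.

    `ledger` is a hypothetical mapping, for example:
    {"M2": {"status": "CLOSED", "sha": "abc1234"},
     "M13": {"status": "DEFERRED", "rationale": "v2.0 roadmap"}}
    """
    missing = []
    for fid in expected_ids():
        entry = ledger.get(fid, {})
        status = entry.get("status")
        closed = status == "CLOSED" and bool(entry.get("sha"))
        deferred = status == "DEFERRED" and bool(entry.get("rationale"))
        if not (closed or deferred):
            missing.append(fid)  # a silent omission under criterion 4
    return missing
```

An empty return value would correspond to "no silent omissions"; anything else lists the findings that still need a SHA or a rationale.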
---
## 2. Phase 0 — Audit refresh (15 min)
**Goal:** know exactly what's open before touching code.
Steps:
1. Re-read [docs/audits/v1.7.0-audit-2026-05-17.md](v1.7.0-audit-2026-05-17.md) §8.1–§8.4 (BLOCKER/HIGH/MEDIUM/LOW ledgers) in full
2. For each finding M1–M14 + L1–L10, categorize:
   - **SHIPPED-in-v1.7.1**: already closed by an existing commit (mark in audit)
   - **CLOSEABLE-this-session**: small, focused, no scope creep (target for Phase 3)
   - **DEFER-with-rationale**: legitimately bigger or roadmap-tied (target for §6 audit update)
3. Write the categorization to a working scratch file at `docs/audits/v1.7.2-coverage-matrix.md` (deleted at end of plan; intermediate artifact)
**Output:** categorization for all 24 open findings. No code changes.
---
## 3. Phase 1 — Verifier self-improvement (10 min)
**Goal:** close the loop the chair probe revealed (the verifier missed that `hook.log` was not in `.gitignore`).
Steps:
1. Add to `agents/verifier.md` "Specifically check for in EVERY workstream" section, after item 4:
```
5. **Git hygiene** — any new file path written by code in this diff (open files,
log writes, cache writes, temp files) that is NOT already in `.gitignore` →
HIGH. The PostToolUse auto-commit hook stages everything under wiki/, .raw/,
.vault-meta/; an unignored runtime artifact creates a self-pollution loop on
the next hook fire.
6. **Additive-without-pruning** — if `git diff --shortstat main..HEAD` shows
net additions > +500 LOC and deletions < 50 LOC, flag as MEDIUM. Real
feature work adds lines; pure additive cycles with no pruning suggest v_prev
cruft is being retained reflexively.
```
2. Verify YAML frontmatter still parses (`python3 -c "import yaml; yaml.safe_load(open('agents/verifier.md').read().split('---')[1])"`)
3. Commit: `docs(v1.7.2): verifier-agent self-improvement from 3-round retrospective`
**Output:** verifier.md has two new "always check" items; next dispatch catches what this session's verifier missed.
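The one-liner in step 2 splits on every `---` occurrence, which works as long as the frontmatter comes first and no value inside it contains `---`. A slightly stricter, line-based extraction might look like the sketch below; the function name `extract_frontmatter` is hypothetical, and the YAML parse itself is left to the caller.

```python
def extract_frontmatter(text):
    """Return the text between the opening and closing '---' lines, or None."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no frontmatter at the top of the file
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i])
    return None  # opening delimiter without a closing one
```

The returned string can then be passed to `yaml.safe_load` (PyYAML, the same call step 2 already uses); a `None` result would mean the frontmatter block is missing or unterminated.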
---
## 4. Phase 2 — Close the +5819 / -30 LOC ratio (60-90 min)
**Goal:** prune v1.6 code that v1.7 superseded but didn't remove.
Steps:
1. Inventory candidates:
```bash
# Pre-v1.7 references and stale markers in skills/, scripts/, and bin/
grep -rn "v1\.6\|legacy\|deprecated\|TODO\|FIXME" skills/ scripts/ bin/
# Skill sections with "## v1.6 behavior" / "## Before v1.7" headers
grep -rn "^## .*1\.6\|^### .*1\.6" skills/
# Tool references in skills that v1.7 transport supersedes
grep -rn "allowed-tools: .*Edit\|allowed-tools: .*Write" skills/
```
2. For each candidate, decide:
   - **PRUNE**: code or doc that is dead post-v1.7 (e.g., a "v1.6 fallback" path that the v1.7 transport layer makes unreachable, a legacy comment block superseded by `compound-vault-guide.md`)
   - **KEEP**: legitimately current code or doc; add a one-line justification in the working scratch file
3. Apply prunes in clusters (one commit per logical theme, e.g. "prune v1.6 transport assumptions", "prune superseded inline docs")
4. After each prune commit, `make test` must stay green
5. **Acceptance gate**: end-of-Phase-2 `git diff --shortstat main..HEAD` shows **either** net additions ≤ +5000 LOC **or** deletions ≥ 200 LOC
**Failure mode**: if v1.7 genuinely added only new features with zero v1.6 supersession, the +5819 stays as additive and the kernel deduction is irreducible. In that case, **DOCUMENT it explicitly** in audit §10.4 ("`+5819 / -30` is the honest cost of building a substrate; v1.6 had no deprecation surface") and accept the score adjustment. Do not invent prunes to game the metric.
**Output:** N prune commits + scratch file justifying every retained piece of v1.6 code.
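The Phase 2 acceptance gate can be evaluated directly from the `git diff --shortstat` summary line. This is a minimal sketch, assuming "net additions" means insertions minus deletions (the plan does not define it further); `parse_shortstat` and `loc_gate` are hypothetical names.

```python
import re


def parse_shortstat(line):
    """Extract (insertions, deletions) from a `git diff --shortstat` line.

    Git omits a clause entirely when its count is zero, so both are optional.
    """
    ins = re.search(r"(\d+) insertions?\(\+\)", line)
    dele = re.search(r"(\d+) deletions?\(-\)", line)
    return (int(ins.group(1)) if ins else 0,
            int(dele.group(1)) if dele else 0)


def loc_gate(line, max_net=5000, min_deleted=200):
    """True if net additions <= max_net OR deletions >= min_deleted."""
    insertions, deletions = parse_shortstat(line)
    return (insertions - deletions) <= max_net or deletions >= min_deleted


# The starting point quoted in section 0 (+5819 / -30 across 41 files) fails the gate.
print(loc_gate(" 41 files changed, 5819 insertions(+), 30 deletions(-)"))
```

In practice the input line would come from running `git diff --shortstat main..HEAD`, for instance through `subprocess.run(..., capture_output=True, text=True)`.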
---
## 5. Phase 3 — Close the 14 MEDIUM findings (90-120 min)
Walk each finding from the v1.7.0 audit §8.3. Group related fixes; one commit per cluster.
| # | Finding | Plan | Effort | Commit grouping |
|---|---|---|---|---|
| M1 | §3.2 +485/-0 LOC | Addressed in Phase 2 | — | (in Phase 2) |
| M2 | `bm25-index.py` non-ASCII tokenization drops content | Extend regex `[A-Za-z][A-Za-z0-9'\-]*` to `[\w'\-]+` with `re.UNICODE`; add hermetic test with emoji + CJK + Cyrillic + Spanish accented input; verify BM25 ranking changes are sensible | 20 min | C1 |
| M3 | `rerank.py --allow-remote-ollama` error blames user | Improve error: "OLLAMA_URL points off-localhost; either run ollama locally or pass --allow-remote-ollama through retrieve.py (which forwards it here)" | 5 min | C2 |
| M4 | `wiki-lock.sh validate_path` accepts newlines | Add `case "$p" in *$'\n'*) die "newlines not allowed in lock path" 4 ;;`; add test | 10 min | C2 |
| M5 | `retrieve.py import_sibling` no ImportError handling | Wrap in try/except (ImportError, SyntaxError); print friendly error pointing to `bin/setup-retrieve.sh --check` | 10 min | C2 |
| M6 | `contextual-prefix.py` empty body silent | Emit `log(f"WARN: {page_path} has no body content; skipping")` and return cleanly | 5 min | C2 |
| M7 | `rerank.py save_cache()` blocking fcntl on non-flock FS | Add `LOCK_NB` + retry loop (3 attempts, 100ms sleep); fall back to no-cache write with a WARN | 15 min | C2 |
| M8 | `test_retrieve.py` missing `--explain` and `--no-rerank` coverage | Add 2 test cases asserting the JSON shape changes | 15 min | C3 |
| M9 | Bounded-slices: 4 skills touched by both §3.2 and §3.4 | Process note, not a code fix; document in audit §10.3 as PROCESS-ACK | — | (audit-only) |
| M10 | No verifier agents during v1.7 dev | Closed by H4 (3ea443f); mark in audit | — | (audit-only) |
| M11 | Synonym category benchmark tied (60% both pipelines) | Investigate via `benchmark-runner.py --limit 0 --json results.json` then per-query analysis; either tune rerank threshold or document why parity is acceptable | 30 min | C4 |
| M12 | Negative-query precision tied at 40% | Investigate similarly; tune rerank to suppress sub-threshold top results | 20 min | C4 |
| M13 | NotebookLM derivative outputs gap | Defer to v2.0; document in audit §10.5 with explicit roadmap rationale | — | (audit-only) |
| M14 | (verify what this is — read audit §8.3 line for M14) | TBD per content | TBD | TBD |
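For M2, the effect of the regex change can be seen on a small mixed-script sample. The sketch below uses only the two patterns quoted in the table; the sample string and variable names are illustrative assumptions, and it does not reproduce the rest of `bm25-index.py`.

```python
import re

OLD_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9'\-]*")
# In Python 3, str patterns are Unicode-aware by default, so passing
# re.UNICODE explicitly is harmless but not required.
NEW_TOKEN = re.compile(r"[\w'\-]+")

sample = "niño привет 検索 rocket 🚀"

print(OLD_TOKEN.findall(sample))  # ASCII-only: splits "niño", drops Cyrillic and CJK
print(NEW_TOKEN.findall(sample))  # keeps accented, Cyrillic, and CJK tokens whole
```

Two side effects seem worth covering in the hermetic test: `\w` does not match emoji, so emoji are still dropped, and the new pattern no longer requires a leading letter, so a lone `-` or `'` becomes a token and digit-leading tokens such as `3d` are now kept.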
**Commit clusters:**
- **C1** — non-ASCII tokenization (M2)
- **C2** — defensive-input fixes bundle (M3, M4, M5, M6, M7)
- **C3** — test coverage extension (M8)
- **C4** — benchmark tunings (M11, M12)
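Within the C2 bundle, the M7 fix (non-blocking lock, 3 attempts, 100ms sleep) could be sketched as follows. This is an illustration under stated assumptions, not the actual `rerank.py` code: the function names are hypothetical, "fall back to no-cache write" is read here as "skip the cache write and warn", and `fcntl` is Unix-only.

```python
import fcntl
import sys
import time


def try_lock(fd, attempts=3, delay=0.1):
    """Try a non-blocking exclusive flock up to `attempts` times; True on success."""
    for attempt in range(attempts):
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except OSError:
            # BlockingIOError (lock held elsewhere) or a filesystem without flock support
            if attempt < attempts - 1:
                time.sleep(delay)
    return False


def save_cache_sketch(path, payload):
    """Write `payload` under a lock; skip the write with a WARN if no lock is available."""
    with open(path, "a+") as fh:
        if not try_lock(fh.fileno()):
            print(f"WARN: could not lock {path}; skipping cache write", file=sys.stderr)
            return False
        try:
            fh.seek(0)
            fh.truncate()  # replace previous cache contents
            fh.write(payload)
            fh.flush()
        finally:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
    return True
```

The point of `LOCK_NB` is that the call returns immediately instead of hanging, so the worst case is bounded by roughly `(attempts - 1) * delay` seconds.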
After each cluster: `make test` + verifier dispatch on staged diff (dogfooding the updated verifier agent).
**Acceptance gate:** all 14 MEDIUM closed (with commit SHA in audit §8.3) or deferred (with rationale).
---
## 6. Phase 4 — Close the 10 LOW findings (30-45 min)
L1–L10 from audit §8.4. Bundle in single commit `polish(v1.7.2): close 10 LOW findings from v1.7.0 audit`.
Steps:
1. Read audit §8.4 for the actual L1–L10 list (don't list them speculatively here)
2. For each: tiny edit + one-line CHANGELOG bullet
3. Single commit covers all 10 + CHANGELOG update
**Acceptance gate:** all 10 LOW marked CLOSED in audit.
---
## 7. Phase 5 — Documentation refresh + final benchmark (30 min)
Steps:
1. Run `python3 scripts/benchmark-runner.py --json /tmp/v172-bench.json` on **full 50-query corpus** (no `--limit`)
2. Compare to v1.7.0 audit's numbers (54.0% v17 top-1, +39.5% error reduction). Re-tunings in Phase 3 C4 may have shifted these
3. Update audit §6.2 with current numbers + delta-from-baseline
4. Cross-check **every commit SHA** referenced in audit + CHANGELOG against `git log`. Correct any drift
5. Refresh `wiki/hot.md` with v1.7.2 state (the hook auto-commits it by design)
6. Bump `.claude-plugin/plugin.json` + `.claude-plugin/marketplace.json` from `1.7.1` to `1.7.2` if any of Phases 2–4 landed code changes; **don't** bump if only docs + audit changes
7. Add CHANGELOG `[1.7.2]` entry referencing this plan as the source
**Acceptance gate:** every published number is the result of a fresh measurement, not a copy from earlier.
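When recomputing the delta-from-baseline in step 3, it helps to pin down what "error reduction" means. The sketch below assumes it is the relative reduction in top-1 error; the 24.0% baseline used in the example is inferred (it is the value for which this formula turns 54.0% into roughly +39.5%) and is not a number stated in this plan, so check it against the audit before relying on it.

```python
def relative_error_reduction(baseline_top1, new_top1):
    """Fraction of baseline top-1 errors eliminated; inputs are accuracies in [0, 1]."""
    baseline_err = 1.0 - baseline_top1
    if baseline_err <= 0:
        raise ValueError("baseline has no errors to reduce")
    new_err = 1.0 - new_top1
    return (baseline_err - new_err) / baseline_err


# Inferred baseline of 24.0% top-1 against the quoted 54.0% v17 top-1.
print(round(relative_error_reduction(0.24, 0.54) * 100, 1))
```

Publishing both accuracies alongside the derived percentage keeps the number reproducible from the fresh measurement rather than copied forward.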
---
## 8. Phase 6 — Final verification (30 min) + ship gate
**The ship gate is binary: pass or accept the achieved score, no third try.**
Steps:
1. Dispatch verifier agent against entire `main..HEAD` diff (will be ~50 files at this point)
2. Run the **chair adversarial probe** — exactly 10 specific tests:
   1. `git check-ignore` on every file the codebase might write to
   2. `bash -u` on every shell script that uses `${VAR}` references
   3. `python3 -c "import json; json.load(open(f))"` on every JSON file
   4. `yaml.safe_load` on every markdown frontmatter
   5. `make test` 7-suite re-run
   6. `python3 scripts/benchmark-runner.py --limit 5` to verify benchmark harness still runs
   7. `bash bin/setup-retrieve.sh --check` to verify diagnostic path
   8. `git diff --shortstat main..HEAD`: confirm acceptance criterion #3
   9. `grep -c "TODO\|FIXME\|XXX"` on every file changed in `main..HEAD`; must be 0 net additions
   10. Open every doc file changed, verify each commit-SHA reference resolves via `git rev-parse`
3. Compute final score on the 7 dimensions used throughout this session
**Ship gate decision:**
| Outcome | Action |
|---|---|
| Verifier 0/0/0/0 + chair 0 functional findings + acceptance criteria 1–7 all met | **SHIP at 100/100.** Surface to user for push authorization. |
| Either pass finds <5 items, all closeable in <30 min | One MORE iteration allowed. Close, re-verify, ship. |
| Either pass finds ≥5 items OR any item requires >30 min | **Document remaining.** Ship at honest achieved score. Add a v1.7.x backlog entry. |
**Hard rule:** maximum 2 verify-fix rounds after Phase 6. The 3-round recursion of the v1.7.1 cycle taught us that adversarial scrutiny is asymptotic. After 2 more rounds, accept the score.
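The decision table and the 2-round cap can be written down as a small pure function, which makes the edge cases explicit. This is a sketch under several assumptions that the table leaves open: findings from both passes are pooled into one list, the 30-minute limit applies per item, an item estimated at exactly 30 minutes counts as not closeable, and a clean pass with unmet acceptance criteria is documented rather than iterated. The name `ship_gate` and its return labels are hypothetical.

```python
def ship_gate(findings_minutes, criteria_met, rounds_used, max_rounds=2):
    """Decide the Phase 6 outcome.

    findings_minutes: estimated fix time in minutes for each open item.
    criteria_met:     True if acceptance criteria 1-7 all hold.
    rounds_used:      verify-fix rounds already spent after Phase 6.
    """
    if not findings_minutes:
        return "SHIP" if criteria_met else "DOCUMENT"
    closeable = len(findings_minutes) < 5 and all(m < 30 for m in findings_minutes)
    if closeable and rounds_used < max_rounds:
        return "ITERATE"   # close, re-verify, ship
    return "DOCUMENT"      # ship at the achieved score with a backlog entry


print(ship_gate([10, 20], criteria_met=True, rounds_used=0))
```

Encoding the cap as `rounds_used < max_rounds` means the function can never recommend a third round, which is the behavior the hard rule asks for.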
---
## 8b. Phase 7 — v1.8.0 methodology modes (6-7h)
After Phase 6 lands v1.7.2 at an honest 100/100, build methodology modes (the compass artifact's priority gap 5). This moves the "methodology support" axis in audit §9 from TIE to YES (5/7 axes #1).
**Deliverables:**
1. **New skill** `skills/wiki-mode/SKILL.md` (~45 min)
- Triggers: "set vault mode", "switch to PARA", "use LYT", "what's my vault mode", "zettelkasten setup"
- Reads `.vault-meta/mode.json`; falls back to `mode=generic` (v1.6/v1.7 default) when absent
- allowed-tools: Read, Write, Bash
2. **Mode config schema** `.vault-meta/mode.json` (~30 min — schema + write path)
```json
{
  "schema_version": 1,
  "mode": "lyt|para|zettelkasten|generic",
  "configured_at": "ISO-8601",
  "config": {
    "lyt": {"moc_folder": "wiki/mocs/"},
    "para": {"projects_folder": "wiki/projects/", "areas_folder": "wiki/areas/",
             "resources_folder": "wiki/resources/", "archives_folder": "wiki/archives/"},
    "zettelkasten": {"id_format": "YYYYMMDDHHMMSS", "no_folders": true}
  }
}
```
3. **Per-mode templates** `skills/wiki-mode/templates/` (~60 min)
- `lyt/moc-template.md` (Map of Content scaffolding with [[wikilink-cluster]] sections)
- `lyt/atomic-template.md` (atomic note that links into MOCs)
- `para/project-template.md` (active project with status, deadline, next-action)
- `para/area-template.md` (ongoing responsibility, no deadline)
- `para/resource-template.md` (reference material, topic-organized)
- `zettel/atomic-template.md` (atomic claim + supporting sources + parent/child IDs)
- `zettel/_id-format.md` (timestamp-based ID generation recipe)
4. **Skill mode-awareness modifications** (~90 min)
- `skills/wiki-ingest/SKILL.md` — consult `.vault-meta/mode.json`; route source/entity/concept pages to mode-specific folders when mode != generic
- `skills/save/SKILL.md` — same; session notes route to PARA/projects or LYT/MOCs based on mode
- `skills/autoresearch/SKILL.md` — same; research artifacts route appropriately
- All changes preserve v1.7 fallback behavior when mode = generic
5. **Hermetic tests** `tests/test_wiki_mode.sh` + `tests/test_wiki_mode.py` (~60 min)
- Mode config writes correctly under each of 4 modes
- Mode loader returns correct config for each mode
- Routing logic produces correct path for each (mode, content-type) pair
- mode=generic preserves v1.7 routing
- Invalid mode in mode.json triggers explicit error, not silent fallback
- All hermetic; no network, no LLM, no ollama
6. **Opt-in setup script** `bin/setup-mode.sh` (~30 min)
- Interactive: prompts user to pick mode
- Writes `.vault-meta/mode.json`
- Optionally seeds template folders (LYT mocs/, PARA projects+areas+resources+archives/)
- Idempotent; safe to re-run
7. **Documentation** (~45 min)
- `docs/methodology-modes-guide.md` — explains each mode, when to use, migration paths
- `CLAUDE.md` "How to Use" section + new "Methodology Modes (v1.8+)" subsection
- `wiki/references/methodology-modes.md` — short decision tree (which mode for which user)
8. **Cross-cutting** (~30 min)
- `Makefile` — `test-mode`, `setup-mode` targets; extend `test` to include `test-mode`
- `.claude-plugin/{plugin,marketplace}.json` — version 1.7.2 → 1.8.0, description updated
- `.gitignore` — `.vault-meta/mode.json` is host-specific runtime config, MUST be ignored
- `CHANGELOG.md` — new [1.8.0] entry
- `agents/wiki-ingest.md` — note mode-awareness in sub-agent protocol
- `wiki/hot.md` — refresh state
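Two behaviors from the deliverables above are easy to get subtly wrong: falling back to `mode=generic` when `.vault-meta/mode.json` is absent, and raising an explicit error (not a silent fallback) when the file names an invalid mode. A minimal loader sketch, assuming the schema shown in deliverable 2; `load_mode`, `ModeConfigError`, and `VALID_MODES` are hypothetical names.

```python
import json
from pathlib import Path

VALID_MODES = {"lyt", "para", "zettelkasten", "generic"}


class ModeConfigError(ValueError):
    """Raised when mode.json exists but cannot be trusted."""


def load_mode(vault_root):
    """Return the mode config dict; default to generic only when the file is absent."""
    path = Path(vault_root) / ".vault-meta" / "mode.json"
    if not path.exists():
        return {"schema_version": 1, "mode": "generic", "config": {}}
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise ModeConfigError(f"{path} is not valid JSON: {exc}") from exc
    mode = data.get("mode") if isinstance(data, dict) else None
    if not isinstance(mode, str) or mode not in VALID_MODES:
        # An unknown mode is an error, never a silent fallback to generic.
        raise ModeConfigError(f"{path}: invalid mode {mode!r}")
    return data
```

Keeping "file absent" and "file invalid" on separate code paths is what lets the hermetic tests distinguish the v1.7-compatible default from a misconfiguration.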
**Commit ladder (estimated):**
- `feat(v1.8.0): wiki-mode skill + 4 mode templates`
- `feat(v1.8.0): mode-aware routing in wiki-ingest`
- `feat(v1.8.0): mode-aware routing in save + autoresearch`
- `test(v1.8.0): hermetic wiki-mode test suite`
- `feat(v1.8.0): bin/setup-mode.sh opt-in bootstrap`
- `docs(v1.8.0): methodology modes guide + CLAUDE.md update`
- `chore(v1.8.0): version bump 1.7.2 → 1.8.0, CHANGELOG, gitignore`
**Per-commit gates:**
- `make test` green (now 8 suites including test-mode)
- Verifier dispatch on staged diff returns ≤1 LOW (dogfooding per `agents/verifier.md`)
- mode=generic path preserves v1.7 behavior exactly (regression test)
## 8c. Phase 8 — v1.8.0 ship gate (30 min)
Mirror the Phase 6 structure for the v1.8.0 slice. Run the verifier on the entire `main..HEAD` diff, and extend the chair adversarial probe with mode-specific tests:
- Each mode (LYT, PARA, Zettel) can be set + read back
- mode=generic routing matches v1.7 routing byte-for-byte on a sample ingest
- `.vault-meta/mode.json` is gitignored (test by creating + check-ignore)
- `bin/setup-mode.sh` is idempotent (run twice; the second run is a no-op)
Same 2-round cap. If 0/0/0/0 + chair clean: 100/100 SHIP. Else: honest achieved score + v1.8.x backlog.
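The idempotency test has one trap: `configured_at` is a timestamp, so a naive rewrite changes the file on every run even when the mode is unchanged. The sketch below shows one way to make the second run a true no-op. It is written in Python for illustration only (the planned script is `bin/setup-mode.sh`), and `write_mode` is a hypothetical name.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_mode(vault_root, mode, now=None):
    """Write mode.json unless it already records `mode`; return True if written."""
    path = Path(vault_root) / ".vault-meta" / "mode.json"
    if path.exists():
        try:
            current = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            current = {}  # unreadable file: fall through and rewrite it
        if isinstance(current, dict) and current.get("mode") == mode:
            return False  # no-op: configured_at is left untouched
    now = now or datetime.now(timezone.utc)
    payload = {
        "schema_version": 1,
        "mode": mode,
        "configured_at": now.isoformat(),
        "config": {},  # per-mode settings omitted in this sketch
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return True
```

A byte-for-byte comparison of the file before and after the second run is a simple way to assert the no-op in the probe.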
## 9. What this plan deliberately does NOT do (scope guard)
These are NOT in scope because they expand into a different release line:
- **v1.9 multimodal ingest** (YouTube / PDF / EPUB / image OCR)
- **v2.0 derive** (audio / quiz / flashcards / study guide — NotebookLM-class outputs)
- **v2.5+ GUI onramp** (Community Plugin fork)
- **Cross-platform** (macOS / Windows) testing — explicit out-of-scope per v1.7.0 audit §3
- **Performance benchmarking** beyond retrieval accuracy
- **Security audit of dependencies** (Python stdlib only; no third-party packages introduced)
- **Marketing / positioning work**
A `100/100` on the v1.7 line does NOT mean `#1 in the market`. Per v1.7.0 audit §9: market-#1 across all 7 axes requires v1.8 + v2.0 + v2.5 work, not patch work. This plan brings the v1.7 line to honest code-quality `100/100`. That's the prerequisite for the next release lines, not a substitute for them.
---
## 10. Undo plan (per /best-practices "failure is the spec")
If anything in Phases 2–4 causes a regression that isn't caught by the per-commit `make test` gate:
- Revert the specific commit with `git revert <sha>`; do NOT rebase
- Re-run verifier on the revert
- Document the regression in audit §8 as a "FOUND-AND-REVERTED" finding so the lesson sticks
If the v1.7.2 portion of the plan (Phases 0–6) cannot reach the acceptance criteria within 6 hours (1h over its 4-5h budget):
- Stop
- Document the gap explicitly
- Ship at the achieved honest score
- Add a v1.7.3 backlog entry for the remaining items
Phases 0–6 do not mutate the v1.7 features themselves; they only add prunings (Phase 2) and bug-class fixes (Phase 3). The v1.7.1 functional surface is preserved.
---
## 11. Per-phase ship gates (mini-acceptance criteria)
| Phase | Acceptance gate |
|---|---|
| 0 | All 24 findings categorized in scratch file |
| 1 | `agents/verifier.md` parses; 2 new "always check" items added |
| 2 | Net LOC delta meets §1 criterion #3 OR documented as irreducible |
| 3 | All 14 MEDIUM closed-or-deferred per §1 criterion #4 |
| 4 | All 10 LOW closed |
| 5 | Fresh benchmark numbers in audit; all SHAs verified |
| 6 | Verifier + chair both clean (or rounds budget exhausted) |
If a phase fails its gate, the plan does NOT proceed to the next phase. The chair stops, documents what's incomplete, and surfaces to the user for a go/no-go decision on continuing.
---
## 12. Cost-of-failure honest framing
Worst case: 6 hours spent, achieve only `98/100` (some MEDIUMs prove harder than estimated, +5819 stays additive, etc.).
Best case: 4 hours spent, genuinely achieve `100/100` on the v1.7 line, branch ready to push as `v1.7.2`.
Median case: 5 hours spent, `99/100`, all M closed, 1-2 L deferred with rationale, push ready.
**The recursion is the risk.** Three rounds were needed to land at 97. Phase 6's hard 2-round cap protects against that recursion eating the entire weekend. If the cap fires, the gap is documented and we ship at honest <100 with a v1.7.3 backlog.
---
## 13. Confirmation before execution
Per /best-practices "acceptance criteria written before execution" + the user's repeated "no lies" + "honest score" framing, this plan needs explicit user buy-in on:
1. **Scope:** §9 explicitly excludes v1.9 / v2.0 / v2.5+ work. Confirm.
2. **Budget:** 4-5h estimated with a 6h cap for v1.7.2; 10-12h estimated with a 14h hard cap including v1.8.0. Confirm or adjust.
3. **Ship gate posture:** 2-round cap on adversarial scrutiny after Phase 6. Confirm or adjust.
4. **No push:** branch stays local until user authorizes push, even if 100/100 is achieved. Confirm.
If any of these need adjustment, surface that. Otherwise: execute top to bottom.