baram2584/MultiPhysicsVault

김경종 72dad72703

add claude-obsidian

2026-05-28 10:57:16 +09:00

v1.7.2 + v1.8.0 "Best Ever Per Priority Research" Plan

Date: 2026-05-17
Branch: continue on v1.7.0-compound-vault (still local-only)
Goal: close every honest deduction (v1.7.2 polish) AND add methodology modes (v1.8.0, compass artifact priority gap 5) to land at 5/7 axes #1 per the original research
Estimated effort: 10-12 hours focused work (4-5h v1.7.2 + 6-7h v1.8.0)
Termination conditions:

v1.7.2 ship gate after Phase 6 (verifier + chair clean OR 2-round cap fired)
v1.8.0 ship gate after Phase 8 (verifier + chair clean OR 2-round cap fired)
14h hard time cap; if v1.7.2 takes >6h, defer v1.8.0 to a separate session

0. Why this plan exists

Three rounds of verifier + chair scrutiny converged on 97/100:

Round 1 (initial v1.7.1 fixes): chair scored 96, verifier later found 5 polish items
Round 2 (polish commit): verifier said SHIP 0/0/0/0; chair found 2 items, then 3 more on harder probe
Round 3 (chair-probe fixes): verifier said SHIP 1 LOW; chair fixed inline

After Round 3 the remaining deductions are structural, not surface-level:

| Honest deduction | What it really is |
| --- | --- |
| Defect introduction: 100 | (now clean after Round 3) |
| Internal consistency: 100 | (now clean after Round 3) |
| /best-practices kernel: 88 | Two structural issues that polish cannot lift |
| Net session score | 97/100 |

The 88 on kernel has two specific causes:

+5819 / -30 LOC across 41 files since main. The kernel says "delete more than you add"; this is the opposite.
Three rounds were needed to converge. A kernel-disciplined slice would land in one pass.

Plus the broader repo still has:

14 MEDIUM findings open from the v1.7.0 audit
10 LOW findings open from the v1.7.0 audit

A genuine 100/100 requires all of these closed or explicitly deferred with rationale. This plan does that.

1. Acceptance criteria (defined BEFORE execution, per /best-practices "failure is the spec")

For this plan to count as "achieved 100/100":

1. Final verifier dispatch returns 0 BLOCKER / 0 HIGH / 0 MEDIUM / 0 LOW on the entire main..HEAD diff
2. Final chair adversarial probe (≥10 specific tests, listed in §8) returns 0 functional findings
3. Net LOC delta main..HEAD shows non-trivial deletion: net additions ≤ +5000 OR deletions ≥ 200 LOC (either condition satisfies the criterion; both are honest measures of having pruned something)
4. Every M1–M14 + L1–L10 is either CLOSED (with commit SHA in audit doc) or DEFERRED (with one-line milestone + rationale in audit doc). No silent omissions.
5. make test stays green throughout
6. Branch remains local until explicit user push authorization
7. agents/verifier.md updated with the git-hygiene cut + any other self-improvement that emerged from the three-round retrospective

If any of these 7 cannot be met, the plan SHIPS at the achieved score with the gap explicitly documented. No silent shortfalls.

2. Phase 0 — Audit refresh (15 min)

Goal: know exactly what's open before touching code.

Steps:

Re-read docs/audits/v1.7.0-audit-2026-05-17.md §8.1–§8.4 (BLOCKER/HIGH/MEDIUM/LOW ledgers) in full
For each finding M1–M14 + L1–L10, categorize:
- SHIPPED-in-v1.7.1: already closed by an existing commit (mark in audit)
- CLOSEABLE-this-session: small, focused, no scope creep (target for Phase 3)
- DEFER-with-rationale: legitimately bigger or roadmap-tied (target for §6 audit update)
Write the categorization to a working scratch file at docs/audits/v1.7.2-coverage-matrix.md (deleted at end of plan; intermediate artifact)

Output: categorization for all 24 open findings. No code changes.

3. Phase 1 — Verifier self-improvement (10 min)

Goal: close the gap the chair probe revealed (the verifier missed that hook.log was not in .gitignore).

Steps:

Add to agents/verifier.md "Specifically check for in EVERY workstream" section, after item 4:

5. **Git hygiene** — any new file path written by code in this diff (open files,
   log writes, cache writes, temp files) that is NOT already in `.gitignore` →
   HIGH. The PostToolUse auto-commit hook stages everything under wiki/, .raw/,
   .vault-meta/; an unignored runtime artifact creates a self-pollution loop on
   the next hook fire.
6. **Additive-without-pruning** — if `git diff --shortstat main..HEAD` shows
   net additions > +500 LOC and deletions < 50 LOC, flag as MEDIUM. Real
   feature work adds lines; pure additive cycles with no pruning suggest v_prev
   cruft is being retained reflexively.

Verify YAML frontmatter still parses (python3 -c "import yaml; yaml.safe_load(open('agents/verifier.md').read().split('---')[1])")
Commit: docs(v1.7.2): verifier-agent self-improvement from 3-round retrospective

Output: verifier.md has two new "always check" items; next dispatch catches what this session's verifier missed.
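The additive-without-pruning rule in item 6 can be checked mechanically. The sketch below is one possible reading, assuming "net additions" means insertions minus deletions; `parse_shortstat` and `additive_without_pruning` are hypothetical helper names, not functions that exist in this repo.

```python
import re

def parse_shortstat(line: str) -> tuple[int, int]:
    """Return (insertions, deletions) from one `git diff --shortstat` line."""
    ins = re.search(r"(\d+) insertions?\(\+\)", line)
    dels = re.search(r"(\d+) deletions?\(-\)", line)
    return (int(ins.group(1)) if ins else 0,
            int(dels.group(1)) if dels else 0)

def additive_without_pruning(insertions: int, deletions: int) -> bool:
    """True when the diff should be flagged MEDIUM under item 6.

    Assumption: net additions = insertions - deletions.
    """
    return (insertions - deletions) > 500 and deletions < 50

# The shortstat line matching the +5819 / -30 across 41 files quoted in this plan.
line = " 41 files changed, 5819 insertions(+), 30 deletions(-)"
ins, dels = parse_shortstat(line)
print(ins, dels, additive_without_pruning(ins, dels))
```

Git omits the insertions or deletions clause when the count is zero, which is why both matches default to 0.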

4. Phase 2 — Close the +5819 / -30 LOC ratio (60-90 min)

Goal: prune v1.6 code that v1.7 superseded but didn't remove.

Steps:

Inventory candidates:

# Comments referencing pre-v1.7 behavior in skills/
grep -rn "v1\.6\|legacy\|deprecated\|TODO\|FIXME" skills/ scripts/ bin/
# Skill sections with "## v1.6 behavior" / "## Before v1.7" headers
grep -rn "^## .*1\.6\|^### .*1\.6" skills/
# Tool references in skills that v1.7 transport supersedes
grep -rn "allowed-tools: .*Edit\|allowed-tools: .*Write" skills/

For each candidate, decide:
- PRUNE: code or doc that is dead post-v1.7 (e.g., a "v1.6 fallback" path that the v1.7 transport layer makes unreachable, a legacy comment block superseded by compound-vault-guide.md)
- KEEP: legitimately current code or doc; add a one-line justification in the working scratch file
Apply prunes in clusters (one commit per logical theme, e.g. "prune v1.6 transport assumptions", "prune superseded inline docs")
After each prune commit, make test must stay green
Acceptance gate: end-of-Phase-2 git diff --shortstat main..HEAD shows either net additions ≤ +5000 LOC or deletions ≥ 200 LOC
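As a worked check of this gate, here is a minimal sketch that assumes "net additions" means insertions minus deletions. `meets_loc_gate` is a hypothetical name used only for illustration.

```python
def meets_loc_gate(insertions: int, deletions: int,
                   max_net: int = 5000, min_deletions: int = 200) -> bool:
    """Phase 2 gate: net additions <= +5000 OR deletions >= 200."""
    net = insertions - deletions  # assumption: "net" = insertions - deletions
    return net <= max_net or deletions >= min_deletions

# Starting point from this plan: +5819 / -30, so net is 5789 and the gate fails.
print(meets_loc_gate(5819, 30))   # False
# Deleting 170 more lines satisfies the deletions arm even though net stays above 5000.
print(meets_loc_gate(5819, 200))  # True
```

Because the two conditions are joined by OR, the deletions arm is the cheaper one to satisfy from the current starting point.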

Failure mode: if v1.7 genuinely added only new features with zero v1.6 supersession, the +5819 stays as additive and the kernel deduction is irreducible. In that case, DOCUMENT it explicitly in audit §10.4 ("+5819 / -30 is the honest cost of building a substrate; v1.6 had no deprecation surface") and accept the score adjustment. Do not invent prunes to game the metric.

Output: N prune commits + scratch file justifying every retained piece of v1.6 code.

5. Phase 3 — Close the 14 MEDIUM findings (90-120 min)

Walk each finding from the v1.7.0 audit §8.3. Group related fixes; one commit per cluster.

| # | Finding | Plan | Effort | Commit grouping |
| --- | --- | --- | --- | --- |
| M1 | §3.2 +485/-0 LOC | Addressed in Phase 2 | n/a | (in Phase 2) |
| M2 | `bm25-index.py` non-ASCII tokenization drops content | Extend regex `[A-Za-z][A-Za-z0-9'\-]*` to `[\w'\-]+` (`re.UNICODE` is already the default for `str` patterns in Python 3, so the flag is optional); add hermetic test with emoji + CJK + Cyrillic + Spanish accented input (`\w` does not match emoji, so the test should pin down the intended emoji behavior); verify BM25 ranking changes are sensible | 20 min | C1 |
| M3 | `rerank.py --allow-remote-ollama` error blames user | Improve error: "OLLAMA_URL points off-localhost; either run ollama locally or pass --allow-remote-ollama through retrieve.py (which forwards it here)" | 5 min | C2 |
| M4 | `wiki-lock.sh validate_path` accepts newlines | Add `case "$p" in *$'\n'*) die "newlines not allowed in lock path" 4 ;; esac` (the `*` wildcards are needed so a newline anywhere in the path matches); add test | 10 min | C2 |
| M5 | `retrieve.py import_sibling` no ImportError handling | Wrap in try/except (ImportError, SyntaxError); print friendly error pointing to `bin/setup-retrieve.sh --check` | 10 min | C2 |
| M6 | `contextual-prefix.py` empty body silent | Emit `log(f"WARN: {page_path} has no body content; skipping")` and return cleanly | 5 min | C2 |
| M7 | `rerank.py save_cache()` blocking fcntl on non-flock FS | Add `LOCK_NB` + retry loop (3 attempts, 100ms sleep); fall back to no-cache write with a WARN | 15 min | C2 |
| M8 | `test_retrieve.py` missing `--explain` and `--no-rerank` coverage | Add 2 test cases asserting the JSON shape changes | 15 min | C3 |
| M9 | Bounded-slices: 4 skills touched by both §3.2 and §3.4 | Process note, not a code fix; document in audit §10.3 as PROCESS-ACK | n/a | (audit-only) |
| M10 | No verifier agents during v1.7 dev | Closed by H4 (3ea443f); mark in audit | n/a | (audit-only) |
| M11 | Synonym category benchmark tied (60% both pipelines) | Investigate via `benchmark-runner.py --limit 0 --json results.json` then per-query analysis; either tune rerank threshold or document why parity is acceptable | 30 min | C4 |
| M12 | Negative-query precision tied at 40% | Investigate similarly; tune rerank to suppress sub-threshold top results | 20 min | C4 |
| M13 | NotebookLM derivative outputs gap | Defer to v2.0; document in audit §10.5 with explicit roadmap rationale | n/a | (audit-only) |
| M14 | (verify what this is: read audit §8.3 line for M14) | TBD per content | TBD | TBD |
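To make the M2 regex change concrete, the sketch below compares the two patterns from the table on non-ASCII input. The `old_tokens` and `new_tokens` names are hypothetical; how `bm25-index.py` actually applies the pattern (lowercasing, stop words, and so on) is not shown in this plan and is left out here.

```python
import re

# Patterns copied from the M2 row; the surrounding tokenizer is an assumption.
OLD_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9'\-]*")
NEW_TOKEN = re.compile(r"[\w'\-]+")  # str patterns are Unicode-aware by default in Python 3

def old_tokens(text: str) -> list[str]:
    return OLD_TOKEN.findall(text)

def new_tokens(text: str) -> list[str]:
    return NEW_TOKEN.findall(text)

sample = "niño привет 検索 🚀 rocket"
print(old_tokens(sample))  # ['ni', 'o', 'rocket']
print(new_tokens(sample))  # ['niño', 'привет', '検索', 'rocket']
```

Two consequences are worth testing explicitly: emoji are not word characters, so they are still dropped, and an unspaced CJK run becomes a single token. The new pattern also admits tokens that start with a digit or underscore, which the old one rejected.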

Commit clusters:

C1 — non-ASCII tokenization (M2)
C2 — defensive-input fixes bundle (M3, M4, M5, M6, M7)
C3 — test coverage extension (M8)
C4 — benchmark tunings (M11, M12)
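For the M5 item in cluster C2, a guarded sibling import could look roughly like the sketch below. It is illustrative only: the real `import_sibling` signature in `retrieve.py` is not shown in this plan, the exit code 2 is an arbitrary choice, and catching `OSError` (for a missing file) goes beyond the two exception types the table names.

```python
import importlib.util
import sys
from pathlib import Path

def import_sibling(path: Path, module_name: str):
    """Load a sibling script (e.g. a hyphenated filename) as a module, or exit with a hint."""
    try:
        spec = importlib.util.spec_from_file_location(module_name, str(path))
        if spec is None or spec.loader is None:
            raise ImportError(f"cannot build import spec for {path}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # raises SyntaxError on a broken file
        return module
    except (ImportError, SyntaxError, OSError) as exc:
        # OSError is an addition here: a missing file raises FileNotFoundError.
        print(f"ERROR: could not load {path.name}: {exc}\n"
              "Run `bash bin/setup-retrieve.sh --check` to diagnose.",
              file=sys.stderr)
        raise SystemExit(2) from exc
```

Exiting with a message instead of a traceback keeps the failure readable for someone who has not yet run the setup check.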

After each cluster: make test + verifier dispatch on staged diff (eat own dogfood per the new agent).

Acceptance gate: all 14 MEDIUM closed (with commit SHA in audit §8.3) or deferred (with rationale).
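The M7 fix (non-blocking lock, 3 attempts, 100ms sleep, WARN on give-up) can be sketched as follows. This assumes the cache is a JSON file locked with `fcntl.flock`, and it reads "fall back to no-cache write" as skipping the write. Both are assumptions, as is the function signature; the real `save_cache()` in `rerank.py` may differ.

```python
import fcntl
import json
import sys
import time

def save_cache(path: str, cache: dict, attempts: int = 3, delay: float = 0.1) -> bool:
    """Write `cache` as JSON under a non-blocking exclusive lock; True on success."""
    with open(path, "a+", encoding="utf-8") as fh:
        for attempt in range(attempts):
            try:
                # LOCK_NB makes flock raise instead of hanging when the lock is
                # held elsewhere or the filesystem does not support locking.
                fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except OSError:
                if attempt < attempts - 1:
                    time.sleep(delay)
        else:
            print(f"WARN: could not lock {path}; skipping cache write", file=sys.stderr)
            return False
        try:
            fh.seek(0)
            fh.truncate()  # replace previous contents rather than appending
            json.dump(cache, fh)
            fh.flush()
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
    return True
```

Returning a boolean lets the caller continue with an uncached result when the lock cannot be taken.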

6. Phase 4 — Close the 10 LOW findings (30-45 min)

L1–L10 from audit §8.4. Bundle in single commit polish(v1.7.2): close 10 LOW findings from v1.7.0 audit.

Steps:

Read audit §8.4 for the actual L1–L10 list (don't list them speculatively here)
For each: tiny edit + one-line CHANGELOG bullet
Single commit covers all 10 + CHANGELOG update

Acceptance gate: all 10 LOW marked CLOSED in audit.

7. Phase 5 — Documentation refresh + final benchmark (30 min)

Steps:

Run python3 scripts/benchmark-runner.py --json /tmp/v172-bench.json on full 50-query corpus (no --limit)
Compare to v1.7.0 audit's numbers (54.0% v17 top-1, +39.5% error reduction). Re-tunings in Phase 3 C4 may have shifted these
Update audit §6.2 with current numbers + delta-from-baseline
Cross-check every commit SHA referenced in audit + CHANGELOG against git log. Correct any drift.
Refresh wiki/hot.md with v1.7.2 state (the hook auto-commits it by design)
Bump .claude-plugin/plugin.json + .claude-plugin/marketplace.json from 1.7.1 to 1.7.2 if any of Phases 2–4 landed code changes; don't bump if only docs + audit changes
Add CHANGELOG [1.7.2] entry referencing this plan as the source

Acceptance gate: every published number is the result of a fresh measurement, not a copy from earlier.
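When recomputing the delta-from-baseline, it helps to fix the formula first. The sketch below assumes "error reduction" means the relative drop in top-1 error rate. The baseline top-1 is not stated in this plan; 24.0% is used purely as a hypothetical input because it reproduces the quoted 39.5% alongside the 54.0% figure.

```python
def error_reduction(baseline_top1: float, new_top1: float) -> float:
    """Relative reduction in top-1 error rate (assumed definition)."""
    baseline_err = 1.0 - baseline_top1
    new_err = 1.0 - new_top1
    return (baseline_err - new_err) / baseline_err

# Hypothetical baseline of 24.0% top-1 versus the quoted 54.0% for v1.7.
print(round(error_reduction(0.24, 0.54) * 100, 1))  # 39.5
```

If the audit defines the metric differently, the fresh numbers should be reported with that definition instead.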

8. Phase 6 — Final verification (30 min) + ship gate

The ship gate is binary: pass or accept the achieved score, no third try.

Steps:

Dispatch verifier agent against entire main..HEAD diff (will be ~50 files at this point)
Run the chair adversarial probe — exactly 10 specific tests:
1. git check-ignore on every file the codebase might write to
2. bash -u on every shell script that uses ${VAR} references
3. python3 -c "import json; json.load(open(f))" on every JSON file
4. yaml.safe_load on every markdown frontmatter
5. make test 7-suite re-run
6. python3 scripts/benchmark-runner.py --limit 5 to verify benchmark harness still runs
7. bash bin/setup-retrieve.sh --check to verify diagnostic path
8. git diff --shortstat main..HEAD — confirm acceptance criterion #3
9. grep -c "TODO\|FIXME\|XXX" on every file changed in main..HEAD — must be 0 net additions
10. Open every doc file changed, verify each commit-SHA reference resolves via git rev-parse
Compute final score on the 7 dimensions used throughout this session
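Probe test 10 (every commit-SHA reference resolves) can be partly automated. The sketch below is a heuristic under stated assumptions: SHAs are written as 7 to 40 lowercase hex characters containing at least one letter and one digit, which skips dates and plain words but can also miss a rare all-digit abbreviation. The function names are hypothetical.

```python
import re
import subprocess

# Heuristic: 7-40 lowercase hex chars with at least one a-f letter and one digit.
SHA_RE = re.compile(r"\b(?=[0-9a-f]*[a-f])(?=[0-9a-f]*[0-9])[0-9a-f]{7,40}\b")

def sha_candidates(text: str) -> list[str]:
    """Return strings in `text` that look like abbreviated or full commit SHAs."""
    return SHA_RE.findall(text)

def resolves(sha: str, repo: str = ".") -> bool:
    """True if `sha` names a commit in the repository at `repo`."""
    result = subprocess.run(
        ["git", "-C", repo, "rev-parse", "--verify", "--quiet", f"{sha}^{{commit}}"],
        capture_output=True, text=True,
    )
    return result.returncode == 0

print(sha_candidates("Closed by H4 (3ea443f); dated 20260517"))  # ['3ea443f']
```

A full pass would call `resolves` on each candidate from every changed doc file and report the ones that return False.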

Ship gate decision:

| Outcome | Action |
| --- | --- |
| Verifier 0/0/0/0 + chair 0 functional findings + acceptance criteria 1–7 all met | SHIP at 100/100. Surface to user for push authorization. |
| Either pass finds <5 items, all closeable in <30 min | One MORE iteration allowed. Close, re-verify, ship. |
| Either pass finds ≥5 items OR any item requires >30 min | Document remaining. Ship at honest achieved score. Add a v1.7.x backlog entry. |

Hard rule: maximum 2 verify-fix rounds after Phase 6. The 3-round recursion of the v1.7.1 cycle taught us that adversarial scrutiny is asymptotic. After 2 more rounds, accept the score.
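The decision table and the 2-round cap can be expressed as one small function. This is a sketch, not the chair's actual procedure: findings are modeled as a list of estimated fix times in minutes (a hypothetical representation), an item estimated at exactly 30 minutes is treated as too large since the table leaves that case open, and unmet acceptance criteria with no findings fall through to documenting the gap, following §1.

```python
def ship_decision(finding_minutes: list[int], criteria_met: bool,
                  rounds_used: int, max_rounds: int = 2) -> str:
    """Return "SHIP", "ITERATE", or "DOCUMENT" per the ship gate table."""
    if not finding_minutes and criteria_met:
        return "SHIP"
    # Fewer than 5 findings, each closeable in under 30 minutes.
    small = 0 < len(finding_minutes) < 5 and all(m < 30 for m in finding_minutes)
    if small and rounds_used < max_rounds:
        return "ITERATE"
    return "DOCUMENT"  # ship at the honest achieved score plus a backlog entry

print(ship_decision([], True, 0))        # SHIP
print(ship_decision([10, 20], True, 0))  # ITERATE
print(ship_decision([10, 20], True, 2))  # DOCUMENT
```

Encoding the cap as `rounds_used < max_rounds` makes the hard rule impossible to skip by accident.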

8b. Phase 7 — v1.8.0 methodology modes (6-7h)

After Phase 6 lands v1.7.2 at honest 100/100, build methodology modes — the compass artifact's priority gap 5. Closes axis "methodology support" in audit §9 from TIE to YES (5/7 axes #1).

Deliverables:

New skill skills/wiki-mode/SKILL.md (~45 min)
- Triggers: "set vault mode", "switch to PARA", "use LYT", "what's my vault mode", "zettelkasten setup"
- Reads .vault-meta/mode.json; falls back to mode=generic (v1.6/v1.7 default) when absent
- allowed-tools: Read, Write, Bash

Mode config schema .vault-meta/mode.json (~30 min — schema + write path)

{
  "schema_version": 1,
  "mode": "lyt|para|zettelkasten|generic",
  "configured_at": "ISO-8601",
  "config": {
    "lyt": {"moc_folder": "wiki/mocs/"},
    "para": {"projects_folder": "wiki/projects/", "areas_folder": "wiki/areas/",
             "resources_folder": "wiki/resources/", "archives_folder": "wiki/archives/"},
    "zettelkasten": {"id_format": "YYYYMMDDHHMMSS", "no_folders": true}
  }
}

Per-mode templates skills/wiki-mode/templates/ (~60 min)
- lyt/moc-template.md (Map of Content scaffolding with wikilink-cluster sections)
- lyt/atomic-template.md (atomic note that links into MOCs)
- para/project-template.md (active project with status, deadline, next-action)
- para/area-template.md (ongoing responsibility, no deadline)
- para/resource-template.md (reference material, topic-organized)
- zettel/atomic-template.md (atomic claim + supporting sources + parent/child IDs)
- zettel/_id-format.md (timestamp-based ID generation recipe)
Skill mode-awareness modifications (~90 min)
- skills/wiki-ingest/SKILL.md — consult .vault-meta/mode.json; route source/entity/concept pages to mode-specific folders when mode != generic
- skills/save/SKILL.md — same; session notes route to PARA/projects or LYT/MOCs based on mode
- skills/autoresearch/SKILL.md — same; research artifacts route appropriately
- All changes preserve v1.7 fallback behavior when mode = generic
Hermetic tests tests/test_wiki_mode.sh + tests/test_wiki_mode.py (~60 min)
- Mode config writes correctly under each of 4 modes
- Mode loader returns correct config for each mode
- Routing logic produces correct path for each (mode, content-type) pair
- mode=generic preserves v1.7 routing
- Invalid mode in mode.json triggers explicit error, not silent fallback
- All hermetic; no network, no LLM, no ollama
Opt-in setup script bin/setup-mode.sh (~30 min)
- Interactive: prompts user to pick mode
- Writes .vault-meta/mode.json
- Optionally seeds template folders (LYT mocs/, PARA projects+areas+resources+archives/)
- Idempotent; safe to re-run
Documentation (~45 min)
- docs/methodology-modes-guide.md — explains each mode, when to use, migration paths
- CLAUDE.md "How to Use" section + new "Methodology Modes (v1.8+)" subsection
- wiki/references/methodology-modes.md — short decision tree (which mode for which user)
Cross-cutting (~30 min)
- Makefile — test-mode, setup-mode targets; extend test to include test-mode
- .claude-plugin/{plugin,marketplace}.json — version 1.7.2 → 1.8.0, description updated
- .gitignore — .vault-meta/mode.json is host-specific runtime config, MUST be ignored
- CHANGELOG.md — new [1.8.0] entry
- agents/wiki-ingest.md — note mode-awareness in sub-agent protocol
- wiki/hot.md — refresh state
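The loader behavior described in the deliverables (fall back to generic when mode.json is absent, explicit error on an invalid mode) might look like the sketch below. `load_mode` and the shape of the fallback dictionary are assumptions for illustration; only the file path, the four mode names, and `schema_version` come from the schema above.

```python
import json
from pathlib import Path

VALID_MODES = {"lyt", "para", "zettelkasten", "generic"}

def load_mode(vault_root: Path) -> dict:
    """Read .vault-meta/mode.json, defaulting to generic when the file is absent."""
    path = vault_root / ".vault-meta" / "mode.json"
    if not path.exists():
        # Absent config means the v1.6/v1.7 default behavior.
        return {"schema_version": 1, "mode": "generic", "config": {}}
    data = json.loads(path.read_text(encoding="utf-8"))
    mode = data.get("mode") if isinstance(data, dict) else None
    if mode not in VALID_MODES:
        # Explicit error rather than a silent fallback, per the test list.
        raise ValueError(f"invalid mode {mode!r} in {path}; expected one of {sorted(VALID_MODES)}")
    return data
```

Keeping "file absent" and "file invalid" as separate outcomes is what lets the hermetic tests distinguish a fresh vault from a corrupted config.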

Commit ladder (estimated):

feat(v1.8.0): wiki-mode skill + 4 mode templates
feat(v1.8.0): mode-aware routing in wiki-ingest
feat(v1.8.0): mode-aware routing in save + autoresearch
test(v1.8.0): hermetic wiki-mode test suite
feat(v1.8.0): bin/setup-mode.sh opt-in bootstrap
docs(v1.8.0): methodology modes guide + CLAUDE.md update
chore(v1.8.0): version bump 1.7.2 → 1.8.0, CHANGELOG, gitignore

Per-commit gates:

make test green (now 8 suites including test-mode)
Verifier dispatch on staged diff returns ≤1 LOW (eat own dogfood per agents/verifier.md)
mode=generic path preserves v1.7 behavior exactly (regression test)
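The regression gate above depends on routing leaving generic mode untouched. A minimal sketch of that contract for session notes follows, assuming the skill asks a helper for an override folder and keeps its existing v1.7 path when the helper returns `None`. `session_note_folder` is a hypothetical name, the default folders are taken from the schema example, and treating zettelkasten as "no override" is a guess based on `no_folders: true`.

```python
from typing import Optional

def session_note_folder(mode_data: dict) -> Optional[str]:
    """Return a mode-specific folder for session notes, or None to keep v1.7 routing."""
    mode = mode_data.get("mode", "generic")
    config = mode_data.get("config", {})
    if mode == "para":
        return config.get("para", {}).get("projects_folder", "wiki/projects/")
    if mode == "lyt":
        return config.get("lyt", {}).get("moc_folder", "wiki/mocs/")
    # generic (and, by assumption, zettelkasten): no folder override.
    return None

print(session_note_folder({"mode": "generic"}))  # None
print(session_note_folder({"mode": "para", "config": {}}))  # wiki/projects/
```

Returning `None` instead of a generic path means the v1.7 code path runs unchanged, which is the simplest way to satisfy a byte-for-byte regression test.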

8c. Phase 8 — v1.8.0 ship gate (30 min)

Mirror Phase 6 structure for the v1.8.0 slice. Verifier on entire diff main..HEAD. Chair adversarial probe extended with mode-specific tests:

Each mode (LYT, PARA, Zettel) can be set + read back
mode=generic routing matches v1.7 routing byte-for-byte on a sample ingest
.vault-meta/mode.json is gitignored (test by creating + check-ignore)
bin/setup-mode.sh is idempotent (run twice; the second run is a no-op)

Same 2-round cap. If 0/0/0/0 + chair clean: 100/100 SHIP. Else: honest achieved score + v1.8.x backlog.

9. What this plan deliberately does NOT do (scope guard)

These are NOT in scope because they expand into a different release line:

v1.9 multimodal ingest (YouTube / PDF / EPUB / image OCR)
v2.0 derive (audio / quiz / flashcards / study guide — NotebookLM-class outputs)
v2.5+ GUI onramp (Community Plugin fork)
Cross-platform (macOS / Windows) testing — explicit out-of-scope per v1.7.0 audit §3
Performance benchmarking beyond retrieval accuracy
Security audit of dependencies (Python stdlib only; no third-party packages introduced)
Marketing / positioning work

A 100/100 on the v1.7 line does NOT mean #1 in the market. Per v1.7.0 audit §9: market-#1 across all 7 axes requires v1.8 + v2.0 + v2.5 work, not patch work. This plan brings the v1.7 line to honest code-quality 100/100. That's the prerequisite for the next release lines, not a substitute for them.

10. Undo plan (per /best-practices "failure is the spec")

If anything in Phases 2–4 causes a regression that isn't caught by the per-commit make test gate:

Revert the specific commit with git revert <sha>; do NOT rebase
Re-run verifier on the revert
Document the regression in audit §8 as a "FOUND-AND-REVERTED" finding so the lesson sticks

If the entire plan cannot reach the acceptance criteria within 6 hours (1h over budget):

Stop
Document the gap explicitly
Ship at the achieved honest score
Add a v1.7.3 backlog entry for the remaining items

The v1.7.2 phases do not change the v1.7 features themselves; they only prune (Phase 2) and apply bug-class fixes (Phase 3). The v1.7.1 functional surface is preserved.

11. Per-phase ship gates (mini-acceptance criteria)

| Phase | Acceptance gate |
| --- | --- |
| 0 | All 24 findings categorized in scratch file |
| 1 | `agents/verifier.md` parses; 2 new "always check" items added |
| 2 | Net LOC delta meets §1 criterion #3 OR documented as irreducible |
| 3 | All 14 MEDIUM closed-or-deferred per §1 criterion #4 |
| 4 | All 10 LOW closed |
| 5 | Fresh benchmark numbers in audit; all SHAs verified |
| 6 | Verifier + chair both clean (or rounds budget exhausted) |

If a phase fails its gate, the plan does NOT proceed to the next phase. The chair stops, documents what's incomplete, and surfaces to the user for a go/no-go decision on continuing.

12. Cost-of-failure honest framing

Worst case: 6 hours spent, achieve only 98/100 (some MEDIUMs prove harder than estimated, +5819 stays additive, etc.).

Best case: 4 hours spent, genuinely achieve 100/100 on the v1.7 line, branch ready to push as v1.7.2.

Median case: 5 hours spent, 99/100, all M closed, 1-2 L deferred with rationale, push ready.

The recursion is the risk. Three rounds were needed to land at 97. Phase 6's hard 2-round cap protects against that recursion eating the entire weekend. If the cap fires, the gap is documented and we ship at honest <100 with a v1.7.3 backlog.

13. Confirmation before execution

Per /best-practices "acceptance criteria written before execution" + the user's repeated "no lies" + "honest score" framing, this plan needs explicit user buy-in on:

Scope: §9 explicitly excludes v1.9 / v2.0 / v2.5 work. Confirm.
Budget: 4-5h estimated, 6h hard cap for v1.7.2; 10-12h estimated, 14h hard cap including v1.8.0. Confirm or adjust.
Ship gate posture: 2-round cap on adversarial scrutiny after Phase 6. Confirm or adjust.
No push: branch stays local until user authorizes push, even if 100/100 is achieved. Confirm.

If any of these need adjustment, surface that. Otherwise: execute top to bottom.
