# GBrain-Evals Proof Layer
## Positioning
gbrain-evals is the proof layer for Garry Tan's personal-knowledge stack. It does not primarily sell "gbrain exists"; it proves which parts of gbrain matter, under which workloads, against named baselines.
| Layer | What It Proves | Source |
|---|---|---|
| Public benchmark credibility | LongMemEval _s, 500 questions, public dataset, comparison rows against MemPal, Hindsight, Stella, Contriever, BM25, Mastra, Supermemory | gbrain-evals/README.md, docs/comparison-systems.md |
| In-house system proof | BrainBench Cats 1-12 test retrieval, identity, temporal, provenance, prose extraction, latency, compliance, workflows, adversarial robustness, multimodal, MCP contract | gbrain-evals/README.md |
| Adapter neutrality | Any adapter implementing init -> query -> ranked docs can be scored against the same corpus and gold (contract sketched after this table) | eval/runner/types.ts |
| Artifact traceability | Runs can emit transcript, scorecard, judge notes, brain export, entity graph, citations | eval/runner/recorder.ts |
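A minimal TypeScript sketch of that adapter contract, folding in the optional hooks named later in the Methodology table. The authoritative types live in eval/runner/types.ts; every name below is illustrative, not the repo's actual code:

```typescript
// Hypothetical adapter contract; authoritative types are in eval/runner/types.ts.
export interface RawPage { id: string; text: string }
export interface RankedDoc { id: string; score: number }
export type AdapterConfig = Record<string, unknown>;
export type AdapterState = unknown;

export interface Adapter {
  // Ingest the sanitized corpus (gold labels already stripped by the runner).
  init(rawPages: RawPage[], config: AdapterConfig): Promise<AdapterState>;
  // Answer one benchmark query with a ranked document list.
  query(q: string, state: AdapterState): Promise<RankedDoc[]>;
  // Optional lifecycle hooks referenced in the Methodology table.
  snapshot?(state: AdapterState): Promise<unknown>;
  poison?(state: AdapterState): Promise<void>;
  teardown?(state: AdapterState): Promise<void>;
}
```

Anything that implements init and query can be scored on the same corpus and gold; the optional hooks only add observability and robustness coverage.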
Core thesis: gbrain-evals turns gbrain from a personal tool into a falsifiable system. The repo's product is not just benchmark results; it is a repeatable proof apparatus.
## Benchmark Catalog
| Benchmark / Cat | Workload | Current Claim | Method / Metric | Status |
|---|---|---|---|---|
| LongMemEval _s | 500 public long-memory chat questions, ~50 sessions per haystack | gbrain-hybrid 97.60% R@5 | Retrieval recall@5, no QA judge (metric sketches follow this table) | Published |
| BrainBench Cats 1+2 | Relational retrieval over 240-page fictional corpus | gbrain P@5 49.1%, R@5 97.9% | P@5 / R@5 against sealed gold | Published |
| Cat 2 Type Accuracy | Link-type extraction quality | No published claim yet | Per-link F1 / type accuracy, gold _facts vs extracted edges | Shipping |
| Cat 3 Identity | Alias / handle / email resolution | Documented aliases strong; undocumented weak in early report | Recall top-10 | Shipping |
| Cat 4 Temporal | Point/range/recency/as-of questions | 100% in early report | Timeline table + app logic | Shipping |
| Cat 5 Provenance | Claim source attribution | Target citation accuracy >0.90 | Haiku classifier over claims and source pages | Programmatic / baseline-oriented |
| Cat 6 Prose Scale | Auto-link precision under injected prose variants | Baseline-only | Link precision/recall/F1, leak rates | Shipping |
| Cat 7 Perf | Operation latency at 1K/10K pages | P95 target <200ms per query | P50/P95/P99 and throughput | Shipping |
| Cat 8 Skill Compliance | Agent behavior laws | Brain-first, backlinks, citation format, tier escalation | Transcript-derived deterministic metrics | Programmatic |
| Cat 9 Workflows | Meeting/email/prep/briefing/sync workflows | Target >80% pass per workflow | Agent replay + rubric judge | Programmatic |
| Cat 10 Adversarial | 22 hand-crafted robustness cases | No crash / reject unsafe inputs | Programmatic checks | Shipping |
| Cat 11 Multimodal | PDF/audio/HTML ingest fidelity | PDF text >0.95, audio WER <0.15, HTML recall >0.80 | Modality-specific fidelity metrics | Opt-in fixtures |
| Cat 12 MCP Contract | Tool trust boundary / validation | No silent corruption | Valid/boundary/invalid/injection/resource tests | Shipping |
| Cat 13 Conceptual | Conceptual recall | Vector leads: 49.1% nDCG@5 | nDCG@5, P@5, P@1 | Published |
| Cat 13b Source Swamp | Curated source vs chat swamp ranking | v0.22 top-1 93.3%, top-3 100% | Top-k hit, swamp@top | Published |
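For readers tracking the metric names above, here is a self-contained sketch of the three headline retrieval metrics (P@k, R@k, nDCG@k with binary relevance). These are the standard textbook definitions, not the repo's actual scorer:

```typescript
// Precision@k: fraction of the top-k results that are gold-relevant.
function precisionAtK(ranked: string[], gold: Set<string>, k: number): number {
  const hits = ranked.slice(0, k).filter((id) => gold.has(id)).length;
  return hits / k;
}

// Recall@k: fraction of gold documents that appear in the top-k results.
function recallAtK(ranked: string[], gold: Set<string>, k: number): number {
  const topK = new Set(ranked.slice(0, k));
  let hits = 0;
  for (const id of gold) if (topK.has(id)) hits++;
  return hits / gold.size;
}

// nDCG@k (binary relevance): discounted gain normalized by the ideal ordering.
function ndcgAtK(ranked: string[], gold: Set<string>, k: number): number {
  const dcg = ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (gold.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(gold.size, k); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```

The distinction matters for the claims below: R@5 rewards getting the gold into the top five at all, P@5 penalizes padding, and nDCG@5 rewards putting it first.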
## Claims Architecture
| Claim | Evidence | Caveat |
|---|---|---|
| gbrain beats MemPal raw on LongMemEval _s by 1.0 point | 97.60% R@5 vs MemPal raw 96.6% R@5 | MemPal hybrid rerank held-out is 98.4%; MemPal tuned row hits 100% with LLM rerank |
| Vector retrieval is nearly enough for LongMemEval _s | gbrain-vector 97.40% vs hybrid 97.60% | This is conversational memory; not proof for code, entities, or curated-source ranking |
| Query expansion is a null result on LongMemEval | Hybrid and hybrid+expansion both 97.60% | Expansion may still help sparse-vocabulary or domain-jargon cases |
| Graph layer is load-bearing for relational questions | gbrain 49.1% P@5 / 97.9% R@5 vs vector-grep-fusion 17.8% / 65.1% | Conceptual Cat 13 shows graph is neutral, not universal |
| Source-aware ranking reduces chat swamp | Cat 13b v0.22 top-1 93.3%, swamp@top 6.7% | Only 30 queries; two misses are legitimately hard-signal cases |
| Minions beat OpenClaw subagents for deterministic production jobs | Production benchmark: 753ms vs gateway timeout; lab: 10/10 durability vs 0/10 under SIGKILL | These docs benchmark execution substrate, not memory retrieval quality |
## Methodology
| Mechanism | Purpose | Implementation Signal |
|---|---|---|
| Sealed adapter boundary | Prevent adapters from reading gold labels | sanitizePage strips _facts / frontmatter; sanitizeQuery strips gold (sketched after this table) |
| Adapter interface | Score gbrain and external systems on same bar | Adapter.init(rawPages, config), query(q, state), optional snapshot/poison/teardown |
| N-run variance | Catch order-dependence / nondeterminism | BRAINBENCH_N; seeded page shuffle in multi-adapter runner |
| Public vs in-house split | External credibility plus workload-specific proof | LongMemEval public; BrainBench fictional corpora |
| Programmatic Cats | Evaluate agent behavior and workflows where CLI alone is insufficient | Cats 5, 8, 9 omitted from all.ts subprocess list and run via harness |
| Judge contract | Avoid raw prompt-injection text entering the judge | Judge receives structured JudgeEvidence, not raw tool results |
| Flight recorder | Make every run auditable | Transcript, scorecard, judge notes, optional exports/citations |
| Cache fairness | Avoid recomputing embeddings without leaking future data | LongMemEval cache keyed by model, dimensions, SHA-256(text) |
| Resume / sharding | Make long public benchmark runs operationally robust | NDJSON append, worker sharding, wall-clock budget, engine recycle |
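Two of these mechanisms are compact enough to sketch: the sealed boundary (strip gold-bearing fields before anything crosses into adapter code) and the content-addressed embedding cache key from the cache-fairness row. Field names follow the table above; the functions themselves are illustrative, not the runner's actual code:

```typescript
import { createHash } from "node:crypto";

// Sealed boundary: drop gold-bearing fields before pages/queries reach adapters.
function sanitizePage(page: Record<string, unknown>): Record<string, unknown> {
  const { _facts, frontmatter, ...rest } = page;
  return rest;
}

function sanitizeQuery(query: Record<string, unknown>): Record<string, unknown> {
  const { gold, ...rest } = query;
  return rest;
}

// Cache fairness: embedding cache keyed by model, dimensions, and SHA-256(text),
// so a model or dimension change can never serve stale vectors.
function embeddingCacheKey(model: string, dims: number, text: string): string {
  const digest = createHash("sha256").update(text).digest("hex");
  return `${model}:${dims}:${digest}`;
}
```

Note the caveat in Risks below: field-stripping is hygiene, not a security boundary, since a hostile adapter could still read gold files from disk until sandboxing lands.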
## Artifacts
| Artifact | Role |
|---|---|
| docs/benchmarks/*.md | Permanent published scorecards and narratives |
| docs/comparison-systems.md | Living competitor / baseline number registry |
| eval/reports/ | Transient local run output, raw JSON/Markdown |
| transcript.md | Full model/tool/timing trace |
| scorecard.json | Metrics, config card, corpus SHA, seed, adapter (shape sketched below) |
| judge-notes.md | Per-rubric rationale for Cats 5/8/9 |
| brain-export.json | Optional adapter state export |
| entity-graph.json | Optional node/edge artifact for graph scoring |
| citations.json | Optional claim-to-source artifact |
| SVG charts | Inline report visuals for LongMemEval |
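As a shape hint, the scorecard.json fields from the row above could be typed roughly as follows; the authoritative schema is whatever eval/runner/recorder.ts actually emits:

```typescript
// Illustrative shape only; see eval/runner/recorder.ts for the real schema.
interface Scorecard {
  metrics: Record<string, number>;     // e.g. { "recall@5": 0.976 }
  configCard: Record<string, unknown>; // model, adapter flags, run options
  corpusSha: string;                   // content hash of the evaluated corpus
  seed: number;                        // page-shuffle seed for N-run variance
  adapter: string;                     // adapter name/version under test
}
```

Pinning corpus SHA and seed in the scorecard is what makes a run re-checkable rather than merely reported.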
## Personas
| Persona | Job | Why This Layer Matters |
|---|---|---|
| Garry / gbrain maintainer | Prove changes improve memory quality | Prevents intuition-driven changes from shipping without measurable lift |
| External memory-system builder | Submit adapter and compare fairly | Adapter contract gives a neutral arena |
| AI agent builder | Decide retrieval architecture | LongMemEval shows where vector, keyword, hybrid, expansion matter |
| OpenClaw / Ren contributor | Find contribution opportunities | Proof gaps become precise PR candidates |
| Skeptical reader / founder audience | Decide whether gbrain is more than demos | Public reports provide shareable evidence |
| Research-oriented evaluator | Compare systems without metric confusion | Comparison docs distinguish R@k vs QA accuracy |
## JTBD
| JTBD | Current Product Support |
|---|---|
| When I change retrieval, tell me if it actually helped. | Multi-adapter BrainBench scorecards |
| When I claim SOTA-ish memory performance, make it defensible. | LongMemEval public benchmark and comparison table |
| When graph features add complexity, prove they pay rent. | Cats 1+2, type accuracy, graph vs no-graph comparisons |
| When retrieval works in one domain, show me where it fails. | Cat 13 conceptual and LongMemEval per-type breakdown |
| When agents write to memory, verify behavior without mutating state. | Tool bridge dry-run writes and Cat 8 compliance |
| When benchmarks get expensive or flaky, make them resumable. | NDJSON resume, embedding cache, wall budgets, engine recycle (resume pattern sketched below) |
| When an LLM judges output, keep the judge from being poisoned. | Structured evidence contract and poison tagging |
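The resumability row deserves one sketch: the NDJSON append/resume pattern writes one JSON object per completed question, so a restarted run can skip finished work without a database. The file path and record shape here are assumptions, not the repo's actual layout:

```typescript
import { appendFileSync, existsSync, readFileSync } from "node:fs";

// Hypothetical results file; one JSON object per line (NDJSON).
const RESULTS = "eval/reports/longmemeval.ndjson";

// On startup, collect the IDs of questions already scored in a prior run.
function completedIds(): Set<string> {
  if (!existsSync(RESULTS)) return new Set();
  return new Set(
    readFileSync(RESULTS, "utf8")
      .split("\n")
      .filter(Boolean)
      .map((line) => JSON.parse(line).questionId as string),
  );
}

// After each question, append its result; a crash loses at most one line.
function recordResult(questionId: string, recallAt5: number): void {
  appendFileSync(RESULTS, JSON.stringify({ questionId, recallAt5 }) + "\n");
}
```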
## Product Requirements Implied
| Requirement | Priority | Source-Grounded Rationale |
|---|---|---|
| Keep adapter boundary strict and eventually sandboxed | P0 | Current enforcement is soft; comments note Docker sandbox for v2 |
| Expand public benchmarks beyond LongMemEval _s | P0 | README lists _oracle, _m, ConvoMem, LoCoMo as roadmap |
| Add QA accuracy pass for LongMemEval | P1 | Report explicitly says retrieval recall is not QA accuracy |
| Make Cat 5/8/9 harnesses first-class CLI flows | P1 | all.ts skips them because they require runtime inputs |
| Preserve metric hygiene in public claims | P1 | Comparison docs warn QA accuracy and R@k are not directly comparable |
| Keep benchmark reports narrative, not just numbers | P1 | Existing docs explain what each result proves and does not prove |
| Add temporal-aware retrieval ranking | P1 | LongMemEval temporal row is the main underperformance vs MemPal raw |
| Maintain run artifacts as reviewable evidence | P1 | Recorder design exists but depends on adapters opting in |
| Add external adapter contribution docs/tests | P2 | README defines contributor path; adapter interface exists |
| Keep caches content-addressed and model-versioned | P2 | LongMemEval cache design guards fairness and correctness |
## Risks / Caveats
| Risk | Why It Matters |
|---|---|
| Retrieval recall can be mistaken for answer quality | LongMemEval report explicitly says R@5 is not QA accuracy |
| Some competitor rows are metric-mismatched | Mastra / Supermemory are QA accuracy, not R@k |
| Soft sealed-gold boundary is not a security boundary | Code strips fields, but a malicious adapter could read eval/data/gold from disk until sandboxing exists |
| In-house fictional corpora may overfit to gbrain's worldview | Useful for regression, weaker as external proof than LongMemEval |
| Programmatic Cats are less discoverable | Cats 5/8/9 are not run by the main subprocess runner |
| Benchmark narratives can become marketing if caveats decay | Current docs are unusually honest; that needs maintenance discipline |
| LongMemEval _s may under-test gbrain's graph advantage | It is conversational retrieval; Cats 1+2 show graph value elsewhere |
| Caches can create reproducibility confusion | Correctly keyed, but cold vs warm cache affects latency interpretation |
| Production Minions benchmarks are adjacent proof, not eval proof | They prove deterministic job substrate, not memory retrieval quality |
## Implications For Ren / OpenClaw
| Implication | Action |
|---|---|
| Proof should be a first-class product layer, not an afterthought | Ren/OpenClaw should ship scorecards and artifacts alongside features |
| Claims need metric names in the headline | Always distinguish retrieval recall, QA accuracy, pass rate, latency, cost |
| Agent systems need dry-run write surfaces | OpenClaw evals should copy the intent-without-mutation pattern |
| Subagents are not the right unit for deterministic batch work | Use durable jobs for repeatable pipelines; reserve agents for judgment |
| Benchmarks should produce artifacts humans can audit | Transcript + scorecard + config card should become standard |
| Public + private eval split is powerful | Public benchmarks build credibility; fictional/internal corpora test product-specific workflows |
| Metric-mismatch tables are content assets | The comparison docs are defensible thought leadership, not just engineering notes |
| The best contribution lane is proof hardening | CLI harnesses for Cats 5/8/9, sandboxed adapter execution, QA pass for LongMemEval, or temporal retrieval ranking would all strengthen the proof layer |
Bottom line: gbrain-evals is not a side repo. It is the legitimacy engine for the whole Garry Tan system: public proof for outsiders, regression proof for maintainers, and narrative proof for the market.