# GBrain-Evals Proof Layer
## Positioning
gbrain-evals is the proof layer for Garry Tan's personal-knowledge stack. It does not primarily sell "gbrain exists"; it proves which parts of gbrain matter, under which workloads, against named baselines.
| Layer | What It Proves | Source |
|---|---|---|
| Public benchmark credibility | LongMemEval _s, 500 questions, public dataset, comparison rows against MemPal, Hindsight, Stella, Contriever, BM25, Mastra, Supermemory | gbrain-evals/README.md, docs/comparison-systems.md |
| In-house system proof | BrainBench Cats 1-12 test retrieval, identity, temporal, provenance, prose extraction, latency, compliance, workflows, adversarial robustness, multimodal, MCP contract | gbrain-evals/README.md |
| Adapter neutrality | Any adapter implementing init -> query -> ranked docs can be scored against the same corpus and gold (contract sketched after this table) | eval/runner/types.ts |
| Artifact traceability | Runs can emit transcript, scorecard, judge notes, brain export, entity graph, citations | eval/runner/recorder.ts |
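A minimal TypeScript sketch of that adapter contract, folding in the optional hooks named later in the Methodology table. The authoritative types live in eval/runner/types.ts; every name below is illustrative, not the repo's actual code:

```typescript
// Hypothetical adapter contract; authoritative types are in eval/runner/types.ts.
export interface RawPage { id: string; text: string }
export interface RankedDoc { id: string; score: number }
export type AdapterConfig = Record<string, unknown>;
export type AdapterState = unknown;

export interface Adapter {
  // Ingest the sanitized corpus (gold labels already stripped by the runner).
  init(rawPages: RawPage[], config: AdapterConfig): Promise<AdapterState>;
  // Answer one benchmark query with a ranked document list.
  query(q: string, state: AdapterState): Promise<RankedDoc[]>;
  // Optional lifecycle hooks referenced in the Methodology table.
  snapshot?(state: AdapterState): Promise<unknown>;
  poison?(state: AdapterState): Promise<void>;
  teardown?(state: AdapterState): Promise<void>;
}
```

Anything that implements init and query can be scored on the same corpus and gold; the optional hooks only add observability and robustness coverage.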
Core thesis: gbrain-evals turns gbrain from a personal tool into a falsifiable system. The repo's product is not just benchmark results; it is a repeatable proof apparatus.
## Benchmark Catalog
| Benchmark / Cat | Workload | Current Claim | Method / Metric | Status |
|---|---|---|---|---|
| LongMemEval _s | 500 public long-memory chat questions, ~50 sessions per haystack | gbrain-hybrid 97.60% R@5 | Retrieval recall@5, no QA judge (metric sketches follow this table) | Published |
| BrainBench Cats 1+2 | Relational retrieval over 240-page fictional corpus | gbrain P@5 49.1%, R@5 97.9% | P@5 / R@5 against sealed gold | Published |
| Cat 2 Type Accuracy | Link-type extraction quality | No published claim yet | Per-link F1 / type accuracy, gold _facts vs extracted edges | Shipping |
| Cat 3 Identity | Alias / handle / email resolution | Documented aliases strong; undocumented weak in early report | Recall top-10 | Shipping |
| Cat 4 Temporal | Point/range/recency/as-of questions | 100% in early report | Timeline table + app logic | Shipping |
| Cat 5 Provenance | Claim source attribution | Target citation accuracy >0.90 | Haiku classifier over claims and source pages | Programmatic / baseline-oriented |
| Cat 6 Prose Scale | Auto-link precision under injected prose variants | Baseline-only | Link precision/recall/F1, leak rates | Shipping |
| Cat 7 Perf | Operation latency at 1K/10K pages | P95 target <200ms per query | P50/P95/P99 and throughput | Shipping |
| Cat 8 Skill Compliance | Agent behavior laws | Brain-first, backlinks, citation format, tier escalation | Transcript-derived deterministic metrics | Programmatic |
| Cat 9 Workflows | Meeting/email/prep/briefing/sync workflows | Target >80% pass per workflow | Agent replay + rubric judge | Programmatic |
| Cat 10 Adversarial | 22 hand-crafted robustness cases | No crash / reject unsafe inputs | Programmatic checks | Shipping |
| Cat 11 Multimodal | PDF/audio/HTML ingest fidelity | PDF text >0.95, audio WER <0.15, HTML recall >0.80 | Modality-specific fidelity metrics | Opt-in fixtures |
| Cat 12 MCP Contract | Tool trust boundary / validation | No silent corruption | Valid/boundary/invalid/injection/resource tests | Shipping |
| Cat 13 Conceptual | Conceptual recall | Vector leads: 49.1% nDCG@5 | nDCG@5, P@5, P@1 | Published |
| Cat 13b Source Swamp | Curated source vs chat swamp ranking | v0.22 top-1 93.3%, top-3 100% | Top-k hit, swamp@top | Published |
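For readers tracking the metric names above, here is a self-contained sketch of the three headline retrieval metrics (P@k, R@k, nDCG@k with binary relevance). These are the standard textbook definitions, not the repo's actual scorer:

```typescript
// Precision@k: fraction of the top-k results that are gold-relevant.
function precisionAtK(ranked: string[], gold: Set<string>, k: number): number {
  const hits = ranked.slice(0, k).filter((id) => gold.has(id)).length;
  return hits / k;
}

// Recall@k: fraction of gold documents that appear in the top-k results.
function recallAtK(ranked: string[], gold: Set<string>, k: number): number {
  const topK = new Set(ranked.slice(0, k));
  let hits = 0;
  for (const id of gold) if (topK.has(id)) hits++;
  return hits / gold.size;
}

// nDCG@k (binary relevance): discounted gain normalized by the ideal ordering.
function ndcgAtK(ranked: string[], gold: Set<string>, k: number): number {
  const dcg = ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (gold.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(gold.size, k); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```

The distinction matters for the claims below: R@5 rewards getting the gold into the top five at all, P@5 penalizes padding, and nDCG@5 rewards putting it first.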
## Claims Architecture
| Claim | Evidence | Caveat |
|---|---|---|
| gbrain beats MemPal raw on LongMemEval _s by 1.0 point | 97.60% R@5 vs MemPal raw 96.6% R@5 | MemPal hybrid rerank held-out is 98.4%; MemPal tuned row hits 100% with LLM rerank |
| Vector retrieval is nearly enough for LongMemEval _s | gbrain-vector 97.40% vs hybrid 97.60% | This is conversational memory; not proof for code, entities, or curated-source ranking |
| Query expansion is a null result on LongMemEval | Hybrid and hybrid+expansion both 97.60% | Expansion may still help sparse-vocabulary or domain-jargon cases |
| Graph layer is load-bearing for relational questions | gbrain 49.1% P@5 / 97.9% R@5 vs vector-grep-fusion 17.8% / 65.1% | Conceptual Cat 13 shows graph is neutral, not universal |
| Source-aware ranking reduces chat swamp | Cat 13b v0.22 top-1 93.3%, swamp@top 6.7% | Only 30 queries; two misses are legitimately hard-signal cases |
| Minions beat OpenClaw subagents for deterministic production jobs | Production benchmark: 753ms vs gateway timeout; lab: 10/10 durability vs 0/10 under SIGKILL | These docs benchmark execution substrate, not memory retrieval quality |
## Methodology
| Mechanism | Purpose | Implementation Signal |
|---|---|---|
| Sealed adapter boundary | Prevent adapters from reading gold labels | sanitizePage strips _facts / frontmatter; sanitizeQuery strips gold (sketched after this table) |
| Adapter interface | Score gbrain and external systems on same bar | Adapter.init(rawPages, config), query(q, state), optional snapshot/poison/teardown |
| N-run variance | Catch order-dependence / nondeterminism | BRAINBENCH_N; seeded page shuffle in multi-adapter runner |
| Public vs in-house split | External credibility plus workload-specific proof | LongMemEval public; BrainBench fictional corpora |
| Programmatic Cats | Evaluate agent behavior and workflows where CLI alone is insufficient | Cats 5, 8, 9 omitted from all.ts subprocess list and run via harness |
| Judge contract | Avoid raw prompt-injection text entering the judge | Judge receives structured JudgeEvidence, not raw tool results |
| Flight recorder | Make every run auditable | Transcript, scorecard, judge notes, optional exports/citations |
| Cache fairness | Avoid recomputing embeddings without leaking future data | LongMemEval cache keyed by model, dimensions, SHA-256(text) |
| Resume / sharding | Make long public benchmark runs operationally robust | NDJSON append, worker sharding, wall-clock budget, engine recycle |
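Two of these mechanisms are compact enough to sketch: the sealed boundary (strip gold-bearing fields before anything crosses into adapter code) and the content-addressed embedding cache key from the cache-fairness row. Field names follow the table above; the functions themselves are illustrative, not the runner's actual code:

```typescript
import { createHash } from "node:crypto";

// Sealed boundary: drop gold-bearing fields before pages/queries reach adapters.
function sanitizePage(page: Record<string, unknown>): Record<string, unknown> {
  const { _facts, frontmatter, ...rest } = page;
  return rest;
}

function sanitizeQuery(query: Record<string, unknown>): Record<string, unknown> {
  const { gold, ...rest } = query;
  return rest;
}

// Cache fairness: embedding cache keyed by model, dimensions, and SHA-256(text),
// so a model or dimension change can never serve stale vectors.
function embeddingCacheKey(model: string, dims: number, text: string): string {
  const digest = createHash("sha256").update(text).digest("hex");
  return `${model}:${dims}:${digest}`;
}
```

Note the caveat in Risks below: field-stripping is hygiene, not a security boundary, since a hostile adapter could still read gold files from disk until sandboxing lands.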
## Artifacts
| Artifact | Role |
|---|---|
| docs/benchmarks/*.md | Permanent published scorecards and narratives |
| docs/comparison-systems.md | Living competitor / baseline number registry |
| eval/reports/ | Transient local run output, raw JSON/Markdown |
| transcript.md | Full model/tool/timing trace |
| scorecard.json | Metrics, config card, corpus SHA, seed, adapter (shape sketched below) |
| judge-notes.md | Per-rubric rationale for Cats 5/8/9 |
| brain-export.json | Optional adapter state export |
| entity-graph.json | Optional node/edge artifact for graph scoring |
| citations.json | Optional claim-to-source artifact |
| SVG charts | Inline report visuals for LongMemEval |
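As a shape hint, the scorecard.json fields from the row above could be typed roughly as follows; the authoritative schema is whatever eval/runner/recorder.ts actually emits:

```typescript
// Illustrative shape only; see eval/runner/recorder.ts for the real schema.
interface Scorecard {
  metrics: Record<string, number>;     // e.g. { "recall@5": 0.976 }
  configCard: Record<string, unknown>; // model, adapter flags, run options
  corpusSha: string;                   // content hash of the evaluated corpus
  seed: number;                        // page-shuffle seed for N-run variance
  adapter: string;                     // adapter name/version under test
}
```

Pinning corpus SHA and seed in the scorecard is what makes a run re-checkable rather than merely reported.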
## Personas
| Persona | Job | Why This Layer Matters |
|---|---|---|
| Garry / gbrain maintainer | Prove changes improve memory quality | Prevents intuition-driven changes from shipping without measurable lift |
| External memory-system builder | Submit adapter and compare fairly | Adapter contract gives a neutral arena |
| AI agent builder | Decide retrieval architecture | LongMemEval shows where vector, keyword, hybrid, expansion matter |
| OpenClaw / Ren contributor | Find contribution opportunities | Proof gaps become precise PR candidates |
| Skeptical reader / founder audience | Decide whether gbrain is more than demos | Public reports provide shareable evidence |
| Research-oriented evaluator | Compare systems without metric confusion | Comparison docs distinguish R@k vs QA accuracy |
## JTBD
| JTBD | Current Product Support |
|---|---|
| When I change retrieval, tell me if it actually helped. | Multi-adapter BrainBench scorecards |
| When I claim SOTA-ish memory performance, make it defensible. | LongMemEval public benchmark and comparison table |
| When graph features add complexity, prove they pay rent. | Cats 1+2, type accuracy, graph vs no-graph comparisons |
| When retrieval works in one domain, show me where it fails. | Cat 13 conceptual and LongMemEval per-type breakdown |
| When agents write to memory, verify behavior without mutating state. | Tool bridge dry-run writes and Cat 8 compliance |
| When benchmarks get expensive or flaky, make them resumable. | NDJSON resume, embedding cache, wall budgets, engine recycle (resume pattern sketched below) |
| When an LLM judges output, keep the judge from being poisoned. | Structured evidence contract and poison tagging |
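The resumability row deserves one sketch: the NDJSON append/resume pattern writes one JSON object per completed question, so a restarted run can skip finished work without a database. The file path and record shape here are assumptions, not the repo's actual layout:

```typescript
import { appendFileSync, existsSync, readFileSync } from "node:fs";

// Hypothetical results file; one JSON object per line (NDJSON).
const RESULTS = "eval/reports/longmemeval.ndjson";

// On startup, collect the IDs of questions already scored in a prior run.
function completedIds(): Set<string> {
  if (!existsSync(RESULTS)) return new Set();
  return new Set(
    readFileSync(RESULTS, "utf8")
      .split("\n")
      .filter(Boolean)
      .map((line) => JSON.parse(line).questionId as string),
  );
}

// After each question, append its result; a crash loses at most one line.
function recordResult(questionId: string, recallAt5: number): void {
  appendFileSync(RESULTS, JSON.stringify({ questionId, recallAt5 }) + "\n");
}
```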
## Product Requirements Implied
| Requirement | Priority | Source-Grounded Rationale |
|---|---|---|
| Keep adapter boundary strict and eventually sandboxed | P0 | Current enforcement is soft; comments note Docker sandbox for v2 |
| Expand public benchmarks beyond LongMemEval _s | P0 | README lists _oracle, _m, ConvoMem, LoCoMo as roadmap |
| Add QA accuracy pass for LongMemEval | P1 | Report explicitly says retrieval recall is not QA accuracy |
| Make Cat 5/8/9 harnesses first-class CLI flows | P1 | all.ts skips them because they require runtime inputs |
| Preserve metric hygiene in public claims | P1 | Comparison docs warn QA accuracy and R@k are not directly comparable |
| Keep benchmark reports narrative, not just numbers | P1 | Existing docs explain what each result proves and does not prove |
| Add temporal-aware retrieval ranking | P1 | LongMemEval temporal row is the main underperformance vs MemPal raw |
| Maintain run artifacts as reviewable evidence | P1 | Recorder design exists but depends on adapters opting in |
| Add external adapter contribution docs/tests | P2 | README defines contributor path; adapter interface exists |
| Keep caches content-addressed and model-versioned | P2 | LongMemEval cache design guards fairness and correctness |
## Risks / Caveats
| Risk | Why It Matters |
|---|---|
| Retrieval recall can be mistaken for answer quality | LongMemEval report explicitly says R@5 is not QA accuracy |
| Some competitor rows are metric-mismatched | Mastra / Supermemory are QA accuracy, not R@k |
| Soft sealed-gold boundary is not a security boundary | Code strips fields, but a malicious adapter could read eval/data/gold from disk until sandboxing exists |
| In-house fictional corpora may overfit to gbrain's worldview | Useful for regression, weaker as external proof than LongMemEval |
| Programmatic Cats are less discoverable | Cats 5/8/9 are not run by the main subprocess runner |
| Benchmark narratives can become marketing if caveats decay | Current docs are unusually honest; that needs maintenance discipline |
| LongMemEval _s may under-test gbrain's graph advantage | It is conversational retrieval; Cats 1+2 show graph value elsewhere |
| Caches can create reproducibility confusion | Correctly keyed, but cold vs warm cache affects latency interpretation |
| Production Minions benchmarks are adjacent proof, not eval proof | They prove deterministic job substrate, not memory retrieval quality |
## Implications For Ren / OpenClaw
| Implication | Action |
|---|---|
| Proof should be a first-class product layer, not an afterthought | Ren/OpenClaw should ship scorecards and artifacts alongside features |
| Claims need metric names in the headline | Always distinguish retrieval recall, QA accuracy, pass rate, latency, cost |
| Agent systems need dry-run write surfaces | OpenClaw evals should copy the intent-without-mutation pattern |
| Subagents are not the right unit for deterministic batch work | Use durable jobs for repeatable pipelines; reserve agents for judgment |
| Benchmarks should produce artifacts humans can audit | Transcript + scorecard + config card should become standard |
| Public + private eval split is powerful | Public benchmarks build credibility; fictional/internal corpora test product-specific workflows |
| Metric-mismatch tables are content assets | The comparison docs are defensible thought leadership, not just engineering notes |
| The best contribution lane is proof hardening | CLI harnesses for Cats 5/8/9, sandboxed adapter execution, QA pass for LongMemEval, or temporal retrieval ranking would all strengthen the proof layer |
Bottom line: gbrain-evals is not a side repo. It is the legitimacy engine for the whole Garry Tan system: public proof for outsiders, regression proof for maintainers, and narrative proof for the market.