Garry Tan System Map

GBrain-Evals Proof Layer

The benchmark and evidence system that makes gbrain claims falsifiable.

gstack = method · gbrain = continuity · gbrain-evals = proof · YC = network

Positioning

gbrain-evals is the proof layer for Garry Tan's personal-knowledge stack. It does not primarily sell "gbrain exists"; it proves which parts of gbrain matter, under which workloads, against named baselines.

| Layer | What It Proves | Source |
| --- | --- | --- |
| Public benchmark credibility | LongMemEval _s, 500 questions, public dataset, comparison rows against MemPal, Hindsight, Stella, Contriever, BM25, Mastra, Supermemory | gbrain-evals/README.md, docs/comparison-systems.md |
| In-house system proof | BrainBench Cats 1-12 test retrieval, identity, temporal, provenance, prose extraction, latency, compliance, workflows, adversarial robustness, multimodal, MCP contract | gbrain-evals/README.md |
| Adapter neutrality | Any adapter implementing init -> query -> ranked docs can be scored against the same corpus and gold | eval/runner/types.ts |
| Artifact traceability | Runs can emit transcript, scorecard, judge notes, brain export, entity graph, citations | eval/runner/recorder.ts |

Core thesis: gbrain-evals turns gbrain from a personal tool into a falsifiable system. The repo's product is not just benchmark results; it is a repeatable proof apparatus.

Benchmark Catalog

| Benchmark / Cat | Workload | Current Claim | Method / Metric | Status |
| --- | --- | --- | --- | --- |
| LongMemEval _s | 500 public long-memory chat questions, ~50 sessions per haystack | gbrain-hybrid 97.60% R@5 | Retrieval recall@5, no QA judge | Published |
| BrainBench Cats 1+2 | Relational retrieval over 240-page fictional corpus | gbrain P@5 49.1%, R@5 97.9% | P@5 / R@5 against sealed gold | Published |
| Cat 2 Type Accuracy | Link-type extraction quality | Per-link F1 / type accuracy | Gold _facts vs extracted edges | Shipping |
| Cat 3 Identity | Alias / handle / email resolution | Documented aliases strong; undocumented weak in early report | Recall top-10 | Shipping |
| Cat 4 Temporal | Point/range/recency/as-of questions | 100% in early report | Timeline table + app logic | Shipping |
| Cat 5 Provenance | Claim source attribution | Target citation accuracy >0.90 | Haiku classifier over claims and source pages | Programmatic / baseline-oriented |
| Cat 6 Prose Scale | Auto-link precision under injected prose variants | Baseline-only | Link precision/recall/F1, leak rates | Shipping |
| Cat 7 Perf | Operation latency at 1K/10K pages | P95 target <200ms per query | P50/P95/P99 and throughput | Shipping |
| Cat 8 Skill Compliance | Agent behavior laws | Brain-first, backlinks, citation format, tier escalation | Transcript-derived deterministic metrics | Programmatic |
| Cat 9 Workflows | Meeting/email/prep/briefing/sync workflows | Target >80% pass per workflow | Agent replay + rubric judge | Programmatic |
| Cat 10 Adversarial | 22 hand-crafted robustness cases | No crash / reject unsafe inputs | Programmatic checks | Shipping |
| Cat 11 Multimodal | PDF/audio/HTML ingest fidelity | PDF text >0.95, audio WER <0.15, HTML recall >0.80 | Modality-specific fidelity metrics | Opt-in fixtures |
| Cat 12 MCP Contract | Tool trust boundary / validation | No silent corruption | Valid/boundary/invalid/injection/resource tests | Shipping |
| Cat 13 Conceptual | Conceptual recall | Vector leads: 49.1% nDCG@5 | nDCG@5, P@5, P@1 | Published |
| Cat 13b Source Swamp | Curated source vs chat swamp ranking | v0.22 top-1 93.3%, top-3 100% | Top-k hit, swamp@top | Published |
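
The published rows above reduce to a handful of rank metrics computed over a ranked list of retrieved ids against a sealed gold set. A minimal sketch of P@k, R@k, and nDCG@k under binary relevance (illustrative helper names, not gbrain-evals code):

```typescript
// Minimal sketch of the rank metrics in the catalog, assuming binary relevance
// (a retrieved id is either in the gold set or not). Names are illustrative.
function precisionAtK(ranked: string[], gold: Set<string>, k: number): number {
  const hits = ranked.slice(0, k).filter((id) => gold.has(id)).length;
  return hits / k;
}

function recallAtK(ranked: string[], gold: Set<string>, k: number): number {
  const hits = ranked.slice(0, k).filter((id) => gold.has(id)).length;
  return gold.size === 0 ? 0 : hits / gold.size;
}

function ndcgAtK(ranked: string[], gold: Set<string>, k: number): number {
  // DCG over the top-k ranks, discounted by log2(rank + 1).
  const dcg = ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (gold.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  // Ideal DCG: all relevant items stacked at the top of the list.
  const idealHits = Math.min(gold.size, k);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```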

Claims Architecture

| Claim | Evidence | Caveat |
| --- | --- | --- |
| gbrain beats MemPal raw on LongMemEval _s by 1.0 point | 97.60% R@5 vs MemPal raw 96.6% R@5 | MemPal hybrid rerank held-out is 98.4%; MemPal tuned row hits 100% with LLM rerank |
| Vector retrieval is nearly enough for LongMemEval _s | gbrain-vector 97.40% vs hybrid 97.60% | This is conversational memory; not proof for code, entities, or curated-source ranking |
| Query expansion is a null result on LongMemEval | Hybrid and hybrid+expansion both 97.60% | Expansion may still help sparse-vocabulary or domain-jargon cases |
| Graph layer is load-bearing for relational questions | gbrain 49.1% P@5 / 97.9% R@5 vs vector-grep-fusion 17.8% / 65.1% | Conceptual Cat 13 shows graph is neutral, not universal |
| Source-aware ranking reduces chat swamp | Cat 13b v0.22 top-1 93.3%, swamp@top 6.7% | Only 30 queries; two misses are legitimately hard-signal cases |
| Minions beat OpenClaw subagents for deterministic production jobs | Production benchmark: 753ms vs gateway timeout; lab: 10/10 durability vs 0/10 under SIGKILL | These docs benchmark execution substrate, not memory retrieval quality |
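
One way to keep rows like these from drifting is to treat each claim as a typed record that carries its metric name, measured value, baseline, and caveat together, so the headline can never detach from the number behind it. This is a speculative sketch of that idea, not a structure the repo is known to have; the values are taken from the table above, and the evidence path is hypothetical:

```typescript
// Speculative shape for a machine-checkable claims ledger; field names and the
// evidence path are illustrative, not taken from gbrain-evals.
interface Claim {
  headline: string;                         // the sentence people will quote
  metric: "R@5" | "P@5" | "nDCG@5" | "QA accuracy" | "latency_p95_ms";
  value: number;                            // gbrain's measured number
  baseline?: { system: string; value: number };
  caveat: string;                           // the part that must not decay
  evidence: string;                         // path to the scorecard or report
}

const longMemEvalClaim: Claim = {
  headline: "gbrain beats MemPal raw on LongMemEval _s by 1.0 point",
  metric: "R@5",
  value: 97.6,
  baseline: { system: "MemPal raw", value: 96.6 },
  caveat: "MemPal hybrid rerank held-out is 98.4%; tuned row hits 100% with LLM rerank",
  evidence: "docs/benchmarks/longmemeval.md", // hypothetical filename
};
```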

Methodology

| Mechanism | Purpose | Implementation Signal |
| --- | --- | --- |
| Sealed adapter boundary | Prevent adapters from reading gold labels | sanitizePage strips _facts / frontmatter; sanitizeQuery strips gold |
| Adapter interface | Score gbrain and external systems on the same bar | Adapter.init(rawPages, config), query(q, state), optional snapshot/poison/teardown |
| N-run variance | Catch order-dependence / nondeterminism | BRAINBENCH_N; seeded page shuffle in the multi-adapter runner |
| Public vs in-house split | External credibility plus workload-specific proof | LongMemEval public; BrainBench fictional corpora |
| Programmatic Cats | Evaluate agent behavior and workflows where the CLI alone is insufficient | Cats 5, 8, 9 omitted from the all.ts subprocess list and run via harness |
| Judge contract | Avoid raw prompt-injection text entering the judge | Judge receives structured JudgeEvidence, not raw tool results |
| Flight recorder | Make every run auditable | Transcript, scorecard, judge notes, optional exports/citations |
| Cache fairness | Avoid recomputing embeddings without leaking future data | LongMemEval cache keyed by model, dimensions, SHA-256(text) |
| Resume / sharding | Make long public benchmark runs operationally robust | NDJSON append, worker sharding, wall-clock budget, engine recycle |
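
The first two rows of this table describe the adapter contract and the sealed-gold boundary. The authoritative definitions live in eval/runner/types.ts; the sketch below is only an approximation of what that contract implies, and the page/query field details are assumptions:

```typescript
// Rough sketch of the adapter contract described above. The real version is
// eval/runner/types.ts; field names on pages and queries here are guesses.
interface RawPage { id: string; title: string; body: string; _facts?: unknown }
interface EvalQuery { id: string; text: string; gold?: string[] }
type RankedDoc = { id: string; score: number };

interface Adapter {
  init(rawPages: Omit<RawPage, "_facts">[], config: Record<string, unknown>): Promise<void>;
  query(q: Omit<EvalQuery, "gold">, state?: unknown): Promise<RankedDoc[]>;
  snapshot?(): Promise<unknown>;            // optional: export adapter state
  poison?(input: string): Promise<void>;    // optional: adversarial injection hook
  teardown?(): Promise<void>;
}

// Soft sealed-gold boundary: strip labels before anything reaches an adapter.
function sanitizePage(page: RawPage): Omit<RawPage, "_facts"> {
  const { _facts, ...rest } = page;
  return rest;
}

function sanitizeQuery(q: EvalQuery): Omit<EvalQuery, "gold"> {
  const { gold, ...rest } = q;
  return rest;
}
```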

Artifacts

| Artifact | Role |
| --- | --- |
| docs/benchmarks/*.md | Permanent published scorecards and narratives |
| docs/comparison-systems.md | Living competitor / baseline number registry |
| eval/reports/ | Transient local run output, raw JSON/Markdown |
| transcript.md | Full model/tool/timing trace |
| scorecard.json | Metrics, config card, corpus SHA, seed, adapter |
| judge-notes.md | Per-rubric rationale for Cats 5/8/9 |
| brain-export.json | Optional adapter state export |
| entity-graph.json | Optional node/edge artifact for graph scoring |
| citations.json | Optional claim-to-source artifact |
| SVG charts | Inline report visuals for LongMemEval |
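
scorecard.json is the artifact a reviewer is most likely to open first. Its real schema lives in the repo; a hypothetical shape consistent with the fields listed above (metrics, config card, corpus SHA, seed, adapter) might look like:

```typescript
// Hypothetical scorecard.json shape, inferred from the artifact table above;
// not the repo's actual schema.
interface Scorecard {
  adapter: string;                     // which adapter produced this run
  corpusSha: string;                   // SHA of the sealed corpus the run used
  seed: number;                        // page-shuffle seed, for reproducibility
  config: Record<string, unknown>;     // config card: model, dimensions, flags
  metrics: Record<string, number>;     // e.g. { "R@5": 0.976, "P@5": 0.491 }
  artifacts: {
    transcript?: string;               // path to transcript.md
    judgeNotes?: string;               // path to judge-notes.md (Cats 5/8/9)
    entityGraph?: string;              // path to entity-graph.json
    citations?: string;                // path to citations.json
  };
}
```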

Personas

| Persona | Job | Why This Layer Matters |
| --- | --- | --- |
| Garry / gbrain maintainer | Prove changes improve memory quality | Prevents intuition-driven changes from shipping without measurable lift |
| External memory-system builder | Submit an adapter and compare fairly | Adapter contract gives a neutral arena |
| AI agent builder | Decide retrieval architecture | LongMemEval shows where vector, keyword, hybrid, and expansion matter |
| OpenClaw / Ren contributor | Find contribution opportunities | Proof gaps become precise PR candidates |
| Skeptical reader / founder audience | Decide whether gbrain is more than demos | Public reports provide shareable evidence |
| Research-oriented evaluator | Compare systems without metric confusion | Comparison docs distinguish R@k from QA accuracy |

JTBD

| JTBD | Current Product Support |
| --- | --- |
| When I change retrieval, tell me if it actually helped. | Multi-adapter BrainBench scorecards |
| When I claim SOTA-ish memory performance, make it defensible. | LongMemEval public benchmark and comparison table |
| When graph features add complexity, prove they pay rent. | Cats 1+2, type accuracy, graph vs no-graph comparisons |
| When retrieval works in one domain, show me where it fails. | Cat 13 conceptual and LongMemEval per-type breakdown |
| When agents write to memory, verify behavior without mutating state. | Tool-bridge dry-run writes and Cat 8 compliance |
| When benchmarks get expensive or flaky, make them resumable. | NDJSON resume, embedding cache, wall budgets, engine recycle |
| When an LLM judges output, keep the judge from being poisoned. | Structured evidence contract and poison tagging |
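
The resumability row is a simple pattern worth spelling out: append one JSON line per completed question, and on restart skip anything already on disk. A generic sketch using Node's fs API; the file path and field names are illustrative, not gbrain-evals paths:

```typescript
import { appendFileSync, existsSync, readFileSync } from "node:fs";

// Generic NDJSON resume pattern: one line per completed question, so a crashed
// or budget-limited run can restart without redoing finished work.
const RESULTS = "eval/reports/longmemeval-run.ndjson"; // illustrative path

function completedIds(): Set<string> {
  if (!existsSync(RESULTS)) return new Set();
  return new Set(
    readFileSync(RESULTS, "utf8")
      .split("\n")
      .filter(Boolean)
      .map((line) => JSON.parse(line).questionId as string),
  );
}

async function runResumable(
  questions: { id: string }[],
  score: (q: { id: string }) => Promise<number>,
) {
  const done = completedIds();
  for (const q of questions) {
    if (done.has(q.id)) continue;            // resume: skip already-scored work
    const recallAt5 = await score(q);
    appendFileSync(RESULTS, JSON.stringify({ questionId: q.id, recallAt5 }) + "\n");
  }
}
```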

Product Requirements Implied

| Requirement | Priority | Source-Grounded Rationale |
| --- | --- | --- |
| Keep the adapter boundary strict and eventually sandboxed | P0 | Current enforcement is soft; comments note a Docker sandbox for v2 |
| Expand public benchmarks beyond LongMemEval _s | P0 | README lists _oracle, _m, ConvoMem, LoCoMo as roadmap |
| Add a QA accuracy pass for LongMemEval | P1 | Report explicitly says retrieval recall is not QA accuracy |
| Make Cat 5/8/9 harnesses first-class CLI flows | P1 | all.ts skips them because they require runtime inputs |
| Preserve metric hygiene in public claims | P1 | Comparison docs warn QA accuracy and R@k are not directly comparable |
| Keep benchmark reports narrative, not just numbers | P1 | Existing docs explain what each result proves and does not prove |
| Add temporal-aware retrieval ranking | P1 | LongMemEval temporal row is the main underperformance vs MemPal raw |
| Maintain run artifacts as reviewable evidence | P1 | Recorder design exists but depends on adapters opting in |
| Add external adapter contribution docs/tests | P2 | README defines a contributor path; the adapter interface exists |
| Keep caches content-addressed and model-versioned | P2 | LongMemEval cache design guards fairness and correctness |
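
The last requirement mirrors the cache-fairness row in the methodology table: the embedding cache is keyed by model, dimensions, and SHA-256 of the text, so a model or dimension change can never serve stale vectors. A minimal sketch of that keying, with illustrative names and an in-memory stand-in for the real cache:

```typescript
import { createHash } from "node:crypto";

// Minimal sketch of a content-addressed, model-versioned embedding cache key,
// following the "model, dimensions, SHA-256(text)" keying described above.
function embeddingCacheKey(model: string, dimensions: number, text: string): string {
  const digest = createHash("sha256").update(text, "utf8").digest("hex");
  return `${model}:${dimensions}:${digest}`;
}

// Hypothetical usage: check the cache before calling the embedding API.
const cache = new Map<string, number[]>();

async function embedCached(
  model: string,
  dims: number,
  text: string,
  embed: (t: string) => Promise<number[]>,
): Promise<number[]> {
  const key = embeddingCacheKey(model, dims, text);
  const hit = cache.get(key);
  if (hit) return hit;
  const vec = await embed(text);
  cache.set(key, vec);
  return vec;
}
```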

Risks / Caveats

| Risk | Why It Matters |
| --- | --- |
| Retrieval recall can be mistaken for answer quality | LongMemEval report explicitly says R@5 is not QA accuracy |
| Some competitor rows are metric-mismatched | Mastra / Supermemory report QA accuracy, not R@k |
| The soft sealed-gold boundary is not a security boundary | Code strips fields, but a malicious adapter could read eval/data/gold from disk until sandboxing exists |
| In-house fictional corpora may overfit to gbrain's worldview | Useful for regression, weaker as external proof than LongMemEval |
| Programmatic Cats are less discoverable | Cats 5/8/9 are not run by the main subprocess runner |
| Benchmark narratives can become marketing if caveats decay | Current docs are unusually honest; that takes maintenance discipline |
| LongMemEval _s may under-test gbrain's graph advantage | It is conversational retrieval; Cats 1+2 show graph value elsewhere |
| Caches can create reproducibility confusion | Correctly keyed, but cold vs warm cache affects latency interpretation |
| Production Minions benchmarks are adjacent proof, not eval proof | They prove the deterministic job substrate, not memory retrieval quality |

Implications For Ren / OpenClaw

| Implication | Action |
| --- | --- |
| Proof should be a first-class product layer, not an afterthought | Ren/OpenClaw should ship scorecards and artifacts alongside features |
| Claims need metric names in the headline | Always distinguish retrieval recall, QA accuracy, pass rate, latency, and cost |
| Agent systems need dry-run write surfaces | OpenClaw evals should copy the intent-without-mutation pattern |
| Subagents are not the right unit for deterministic batch work | Use durable jobs for repeatable pipelines; reserve agents for judgment |
| Benchmarks should produce artifacts humans can audit | Transcript + scorecard + config card should become standard |
| Public + private eval split is powerful | Public benchmarks build credibility; fictional/internal corpora test product-specific workflows |
| Metric-mismatch tables are content assets | The comparison docs are defensible thought leadership, not just engineering notes |
| The best contribution lane is proof hardening | CLI harnesses for Cats 5/8/9, sandboxed adapter execution, a QA pass for LongMemEval, or temporal retrieval ranking would all strengthen the proof layer |

Bottom line: gbrain-evals is not a side repo. It is the legitimacy engine for the whole Garry Tan system: public proof for outsiders, regression proof for maintainers, and narrative proof for the market.