gbrain Data Model and Runtime Map
Product Frame
gbrain is persistent memory for AI agents. Its job is to let an agent search, cite, update, and reuse knowledge across sessions, repos, machines, and clients.
The public README positions it as "the memory your agent actually keeps between sessions." The codebase backs that with a Postgres/pgvector schema, source-scoped tenancy, page/chunk storage, embeddings, code-symbol metadata, soft delete, effective dates, OAuth-scoped HTTP MCP, and stdio MCP tools.
Runtime Loop
- Source is registered: local repo, wiki, media archive, YC/media corpus, or default brain.
- Content is ingested into pages with source_id, slug, type, title, compiled_truth, timeline, and frontmatter.
- Content is chunked into content_chunks with text, embeddings, token counts, modality, language, symbols, and line ranges.
- Search combines vector, text, graph/code edges, salience, source scope, and recency filters.
- MCP/CLI exposes search, get_page, put_page, sync, code-def, code-refs, code-callers, code-callees, and admin operations.
- Skills run read-enrich-write loops: search the brain, synthesize, update pages, maintain citations/backlinks.
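The ingest step of this loop can be sketched as simple data shapes. This is an illustrative sketch, not the real definitions in gbrain/src/core/types.ts: field names follow the README's description, and the fixed-size chunker stands in for the real token- and structure-aware one.

```typescript
// Illustrative shapes for the ingest step: a page is split into
// retrieval chunks that carry their own token metadata.
// Field names are assumptions modeled on the list above.
interface Page {
  sourceId: string;
  slug: string;
  type: string;
  title: string;
  compiledTruth: string;
}

interface ContentChunk {
  pageSlug: string;
  chunkText: string;
  tokenCount: number;
  modality: "text" | "code" | "image";
}

// Naive fixed-size chunker; the real chunker (versioned via
// chunker_version on sources) is token- and structure-aware.
function chunkPage(page: Page, maxChars = 80): ContentChunk[] {
  const chunks: ContentChunk[] = [];
  for (let i = 0; i < page.compiledTruth.length; i += maxChars) {
    const text = page.compiledTruth.slice(i, i + maxChars);
    chunks.push({
      pageSlug: page.slug,
      chunkText: text,
      tokenCount: Math.ceil(text.length / 4), // rough chars-per-token estimate
      modality: "text",
    });
  }
  return chunks;
}
```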
Core Entities
| Entity | Product Meaning | Important Fields / Behavior |
|---|---|---|
| sources | A logical brain inside the DB | id, name, local_path, federation config, chunker_version, archived flags |
| pages | Canonical knowledge objects | source_id, slug, type, page_kind, title, compiled_truth, timeline, frontmatter |
| content_chunks | Retrieval units | chunk_text, embedding, model, token_count, language, symbol metadata, modality |
| code_edges_chunk | Resolved code graph | from_chunk_id, to_chunk_id, edge_type, symbol identities |
| code_edges_symbol | Unresolved code refs | target symbol known before source definition is imported |
| files | Binary/file storage references | used for images, uploads, and filesystem-backed memory |
| ingest logs/jobs | Operational trace | tells operators what was imported, embedded, transformed, or failed |
| OAuth clients/tokens | Remote trust boundary | scopes read/write/admin and source-specific access |
Page Types
The TypeScript PageType union includes person, company, deal, yc, civic, project, concept, source, media, writing, analysis, guide, hardware, architecture, meeting, note, email, slack, calendar-event, code, image, and synthesis.
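Reconstructed from the list above, the union looks roughly like this; the real definition lives in gbrain/src/core/types.ts and may differ in ordering or members. The runtime guard is an added illustration for validating untyped frontmatter.

```typescript
// PageType union as enumerated above (reconstruction, not the source file).
type PageType =
  | "person" | "company" | "deal" | "yc" | "civic"
  | "project" | "concept" | "source" | "media" | "writing"
  | "analysis" | "guide" | "hardware" | "architecture" | "meeting"
  | "note" | "email" | "slack" | "calendar-event" | "code"
  | "image" | "synthesis";

// A runtime guard is useful when ingesting untyped frontmatter.
const PAGE_TYPES = new Set([
  "person", "company", "deal", "yc", "civic", "project", "concept",
  "source", "media", "writing", "analysis", "guide", "hardware",
  "architecture", "meeting", "note", "email", "slack",
  "calendar-event", "code", "image", "synthesis",
]);

function isPageType(value: string): value is PageType {
  return PAGE_TYPES.has(value);
}
```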
This matters because gbrain is not just note search. It encodes the major objects of founder work:
- people and companies
- meetings and messages
- deals and YC context
- concepts and analysis
- code and architecture
- media and sources
- generated synthesis
Source Scoping
sources is the tenancy layer. Each page belongs to a source. A source can be federated or non-federated:
- federated sources participate in default cross-source search.
- non-federated sources are searched only when explicitly requested.
- archived sources are hidden and later purged.
Product implication: gbrain can support a personal brain, team brain, client brain, repo brain, and public knowledge corpus without mixing all writes into one undifferentiated pile.
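The three scoping rules above can be sketched as a filter. Field names here are assumptions based on the described behavior, not the actual sources schema or config shape.

```typescript
// Illustrative federation filter, mirroring the rules above.
interface Source {
  id: string;
  federated: boolean;
  archived: boolean;
}

// Default (unqualified) search sees only live federated sources;
// a non-federated source appears only when named explicitly.
// Archived sources are hidden in both cases.
function searchableSources(all: Source[], explicit?: string): string[] {
  if (explicit !== undefined) {
    return all
      .filter((s) => s.id === explicit && !s.archived)
      .map((s) => s.id);
  }
  return all.filter((s) => s.federated && !s.archived).map((s) => s.id);
}
```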
Retrieval Model
The schema points to four retrieval modes:
- Semantic: pgvector embeddings on text chunks.
- Text: trigram and full-text search vectors.
- Structural: code graph edges and symbol-aware lookup.
- Time/salience: effective_date, salience_touched_at, emotional_weight.
The "Cathedral II" direction flagged in code comments is important: it moves gbrain from markdown-only memory into code-aware retrieval:
- symbol_name
- symbol_name_qualified
- parent_symbol_path
- doc_comment
- start_line/end_line
- code_edges_chunk and code_edges_symbol
That lets coding agents ask "where is this defined?", "who calls it?", and "what context should I inspect next?" without falling back to broad grep.
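A minimal sketch of those two questions over resolved edges. Column and field names here are assumptions modeled on the code_edges_chunk description above, not the actual schema.

```typescript
// Illustrative symbol-aware lookup over resolved code-graph edges.
type EdgeType = "calls" | "references";

interface CodeEdge {
  fromChunkId: string;
  toChunkId: string;
  edgeType: EdgeType;
}

interface CodeChunk {
  id: string;
  symbolName: string;
  isDefinition: boolean;
}

// "Where is this defined?"
function findDefinition(
  chunks: CodeChunk[],
  symbol: string,
): CodeChunk | undefined {
  return chunks.find((c) => c.symbolName === symbol && c.isDefinition);
}

// "Who calls it?" — walk call edges pointing at the definition chunk.
function findCallers(
  chunks: CodeChunk[],
  edges: CodeEdge[],
  symbol: string,
): CodeChunk[] {
  const def = findDefinition(chunks, symbol);
  if (!def) return [];
  const callerIds = new Set(
    edges
      .filter((e) => e.edgeType === "calls" && e.toChunkId === def.id)
      .map((e) => e.fromChunkId),
  );
  return chunks.filter((c) => callerIds.has(c.id));
}
```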
Trust and Deployment Model
gbrain supports multiple deployment shapes:
- local PGLite for low-friction personal use
- Supabase for shared/cloud memory
- stdio MCP for local agents
- HTTP MCP with OAuth 2.1 for remote clients
- source-scoped OAuth clients for shared brains
The product spec hiding here: persistent memory only works if users can trust the boundary. gbrain's trust boundary is source-aware, client-aware, and operation-aware.
Product Requirements
For gbrain to feel like real agent memory, it must satisfy:
- Durable writes: agent updates survive across sessions.
- Citation discipline: claims point back to pages/sources.
- Search quality: high recall and precision, without fabricating facts the brain does not contain.
- Source separation: no accidental client/repo contamination.
- Toolability: CLI and MCP operations are structured, not natural-language guesses.
- Maintenance: stale pages, orphaned backlinks, missing citations, and dead links get audited.
- Reproducibility: evals show retrieval quality on known corpora.
Open Questions
- Which source-scoping model becomes default for teams: one shared source, many federated personal sources, or project-specific sources?
- How often does emotional_weight materially improve retrieval versus creating tuning complexity?
- Which MCP operations are most used in real agent sessions?
- Does code graph retrieval reduce cost/time enough to become the default for coding agents?
Deep Runtime Evidence Map
This pass goes below the README and treats gbrain as a runtime product. The source basis is gbrain/README.md, gbrain/src/schema.sql, gbrain/src/core/types.ts, gbrain/src/core/operations.ts, gbrain/src/core/search/*, and the MCP/security docs.
Expanded Entity Map
| Entity | Runtime Role | Evidence | Product Meaning |
|---|---|---|---|
| sources | In-DB tenancy namespace. Every page/file/ingest row belongs to a logical source. | gbrain/src/schema.sql | Lets one brain hold wiki, code repos, media corpora, team knowledge, or isolated project memory without global slug collision. |
| pages | Canonical knowledge object. | gbrain/src/schema.sql, gbrain/src/core/types.ts | The memory page: title, compiled truth, timeline, frontmatter, type, soft-delete state, emotional weight, effective date. |
| content_chunks | Retrieval unit for text/code/image. | gbrain/src/schema.sql | Converts pages into searchable chunks with embeddings, FTS vector, symbol metadata, modality, and token count. |
| links | Page graph, backlinks, entity relationships. | gbrain/src/schema.sql, gbrain/src/core/types.ts | The brain wires itself through typed relationships from markdown, frontmatter, and manual edges. |
| timeline_entries | Structured temporal facts. | gbrain/src/schema.sql, gbrain/src/core/types.ts | Lets the brain answer when/what-changed questions beyond vector similarity. |
| code_edges_chunk / code_edges_symbol | Resolved and unresolved code graph edges. | gbrain/src/schema.sql | Powers code-def, code-refs, callers/callees, and two-pass code retrieval. |
| files | Binary/file sidecar index. | gbrain/src/schema.sql | Stores references for images/uploads without stuffing bytes into core page rows. |
| oauth_clients, oauth_tokens, oauth_codes | Remote MCP identity and authorization. | gbrain/src/schema.sql | Defines client identity, grants, scopes, write source, and federated-read set. |
| mcp_request_log | Remote tool-call audit log. | gbrain/src/schema.sql, gbrain/SECURITY.md | Makes remote brain access observable without retaining raw payloads by default. |
| minion_jobs and subagent tables | Durable agent runtime. | gbrain/src/schema.sql | Lets background jobs, subagent loops, tool executions, and rate leases persist. |
| eval_candidates | Real retrieval eval capture. | gbrain/README.md, gbrain/src/schema.sql | Turns real query/search calls into replayable BrainBench-Real examples. |
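The eval_candidates row describes an opt-in capture path: real search calls become replayable examples. A minimal sketch, with names that are illustrative rather than the actual table columns:

```typescript
// Hypothetical eval-capture shape: when enabled, each real search call
// is recorded as a replayable BrainBench-Real example.
interface EvalCandidate {
  query: string;
  retrievedSlugs: string[];
  capturedAt: string;
}

function captureEval(
  log: EvalCandidate[],
  enabled: boolean,
  query: string,
  retrievedSlugs: string[],
): void {
  if (!enabled) return; // capture is explicitly opt-in
  log.push({ query, retrievedSlugs, capturedAt: new Date().toISOString() });
}
```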
Source / Tenant Model
gbrain has two axes:
| Axis | Meaning | Boundary Rule |
|---|---|---|
| Brain | One database: PGLite, Postgres, or Supabase. | Data owner / access-control boundary. |
| Source | Named content repo inside a brain. | Repo, topic, team, client, or workstream boundary inside one DB. |
Key mechanics:
| Mechanic | Runtime Behavior |
|---|---|
| Per-source slug namespace | pages enforces unique (source_id, slug), so different sources can safely contain the same slug. |
| Federation | sources.config.federated=true joins default unqualified search; false requires explicit source selection. |
| Source resolution | Precedence flows through explicit source flag/env/project files/registered local path/default source. |
| Agent citation | Multi-source citations need source-qualified slugs such as [source-id:slug]. |
| OAuth source model | Remote clients get write authority through source_id and separate read authority through federated-read configuration. |
Interpretation: source scoping is not metadata decoration. It is the anti-leak primitive for shared brains. Several code comments in operations/search paths treat missing source propagation as a P0 leak class.
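The source-resolution precedence in the table above reduces to a first-match chain. Argument names here are illustrative, not the actual resolver's signature:

```typescript
// Precedence as described above: explicit flag, then environment, then
// project file, then registered local path, then the default source.
function resolveSource(opts: {
  flag?: string;
  env?: string;
  projectFile?: string;
  registeredPath?: string;
  defaultSource: string;
}): string {
  return (
    opts.flag ??
    opts.env ??
    opts.projectFile ??
    opts.registeredPath ??
    opts.defaultSource
  );
}
```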
Retrieval Pipeline
| Stage | What Happens | Product Reason |
|---|---|---|
| Mode resolution | conservative, balanced, and tokenmax set defaults for cache, intent weighting, token budget, expansion, limit, and reranker use. | Lets operators trade cost, speed, and depth. |
| Intent classification | Query intent influences detail, salience, recency, and RRF weights. | Memory search should adapt to task shape. |
| Keyword path | Always runs first and works without embeddings. | Day-one installs and offline paths still work. |
| Vector path | If an embedding provider exists, query variants are embedded and searched. | Semantic recall covers fuzzy questions. |
| Fusion | Keyword and vector lists merge via weighted reciprocal-rank fusion, then score adjustments. | Combines exact and semantic evidence. |
| Boosts | Backlinks, salience, recency, and exact match affect rank. | Operator memory needs relationships and time, not only similarity. |
| Structural expansion | Optional graph walk via nearSymbol / walkDepth, capped. | Coding agents need symbol adjacency. |
| Dedup | Composite (source_id, slug), text similarity, type diversity, per-page cap, compiled-truth guarantee. | Avoids noisy repeated chunks. |
| Rerank / budget | Optional reranker and token budget enforcement. | Keeps output useful inside model context. |
| Eval capture | search and query can capture retrieved slugs/chunks when enabled. | Converts real usage into benchmark fuel. |
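The fusion stage above merges the keyword and vector rankings with weighted reciprocal-rank fusion. A self-contained sketch; the k constant and weights are illustrative defaults, not gbrain's tuned values:

```typescript
// Weighted reciprocal-rank fusion over two ranked slug lists.
// Each appearance contributes w / (k + rank + 1) to a slug's score.
function fuseRRF(
  keyword: string[], // slugs ranked by the keyword path
  vector: string[],  // slugs ranked by the vector path
  wKeyword = 1.0,
  wVector = 1.0,
  k = 60,
): string[] {
  const score = new Map<string, number>();
  const add = (list: string[], w: number) =>
    list.forEach((slug, rank) => {
      score.set(slug, (score.get(slug) ?? 0) + w / (k + rank + 1));
    });
  add(keyword, wKeyword);
  add(vector, wVector);
  return [...score.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([slug]) => slug);
}
```

A slug that appears in both lists outranks a slug that tops only one, which is exactly the "combine exact and semantic evidence" behavior the table describes.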
Product read: gbrain is not "vector DB with markdown." It is layered retrieval: lexical, vector, graph, temporal, salience, source tenancy, and code-structure expansion.
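The dedup stage of the pipeline can be sketched with two of its rules: the composite (source_id, slug) page key and the per-page cap. Exact-text dropping stands in for the similarity pass; the type-diversity and compiled-truth rules are omitted for brevity, and all names are illustrative.

```typescript
// Dedup sketch over an already-ranked chunk list.
interface RankedChunk {
  sourceId: string;
  slug: string;
  text: string;
}

function dedupe(ranked: RankedChunk[], perPageCap = 2): RankedChunk[] {
  const perPage = new Map<string, number>();
  const seenText = new Set<string>();
  const out: RankedChunk[] = [];
  for (const chunk of ranked) {
    const key = `${chunk.sourceId}:${chunk.slug}`; // composite page key
    const count = perPage.get(key) ?? 0;
    if (count >= perPageCap) continue;      // enforce per-page cap
    if (seenText.has(chunk.text)) continue; // drop exact-duplicate text
    perPage.set(key, count + 1);
    seenText.add(chunk.text);
    out.push(chunk);
  }
  return out;
}
```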
MCP / OAuth / Trust Boundary
| Boundary | Mechanism | Requirement Implied |
|---|---|---|
| Local stdio MCP | gbrain serve exposes tools over stdio. | Local agents get structured brain tools without HTTP setup. |
| HTTP MCP | gbrain serve --http exposes OAuth-backed MCP and admin dashboard. | Remote clients need scoped auth, logs, discovery, and client management. |
| Operation contracts | Tool definitions derive from shared operations. | No hand-maintained schema drift between CLI/MCP/HTTP. |
| Shared dispatch | Stdio and HTTP use the same validation/context/result path. | Transport parity is a correctness and security feature. |
| Remote flag | OperationContext.remote is required for remote/untrusted callers. | Filesystem/tool operations fail closed for remote agents. |
| Source read scope | Read helpers prefer OAuth allowed sources, then context source ID. | Every read path must thread source scope into filters. |
| Scopes | Operations are tagged read/write/admin/local-only. | Remote clients get least-privilege tool access. |
| Local-only ops | Sync/file operations are rejected over HTTP regardless of scope. | Remote agents cannot touch local filesystem surfaces. |
| Logging redaction | MCP params are logged as redacted shape by default. | Admin observability must not become a private-data leak. |
| Network hardening | Loopback default, CORS deny by default, rate limits, proxy warnings. | Personal brains should not become accidentally exposed. |
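Several rows above compose into one fail-closed authorization check: scope tags, the remote flag, and local-only rejection. A sketch under assumed names, not the actual dispatch code:

```typescript
// Illustrative fail-closed authorization for tool dispatch.
type Scope = "read" | "write" | "admin";

interface Operation {
  name: string;
  scope: Scope;
  localOnly: boolean;
}

interface CallContext {
  remote: boolean;        // true for remote/untrusted callers
  grantedScopes: Scope[]; // from the OAuth token
}

function authorize(op: Operation, ctx: CallContext): boolean {
  // Local-only operations are rejected for remote callers
  // regardless of granted scope.
  if (ctx.remote && op.localOnly) return false;
  // Local (stdio, owner-trusted) callers bypass OAuth scope checks.
  if (!ctx.remote) return true;
  // Remote callers need the operation's scope in their grant.
  return ctx.grantedScopes.includes(op.scope);
}
```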
Product Requirements, Ranked
| Priority | Requirement | Why It Exists | Acceptance Signal |
|---|---|---|---|
| P0 | Source isolation must be enforced on every read/write path. | Multi-source brains otherwise leak client/team/repo context. | Same-slug pages across sources remain distinct; every read handler honors source filters. |
| P0 | Remote MCP must be scoped, logged, and local-file-safe. | Remote clients operate outside the owner's OS trust boundary. | OAuth scopes honored, local-only ops rejected, params redacted, rate limits active. |
| P0 | Search must degrade gracefully without embeddings. | Day-one installs may lack provider keys. | search works; query falls back instead of failing. |
| P1 | Retrieval must combine lexical, semantic, graph, recency, salience, and code structure. | Founder/operator memory needs factual, temporal, relationship, and code recall. | Hybrid path returns ranked, deduped, source-aware chunks with metadata. |
| P1 | Every page needs provenance and citation discipline. | Memory must be auditable, not merely plausible. | Compiled truth and timeline carry source citations; conflicts are explicit. |
| P1 | Agent tool schemas must be generated from one operation contract. | MCP/CLI drift breaks clients and causes strict-schema failures. | Tool definitions derive from operations; dispatch is shared. |
| P1 | Code memory must be symbol-aware. | Coding agents need where-defined/who-calls/near-symbol more than broad prose recall. | Code metadata and edge tables populate; two-pass retrieval respects source scope. |
| P2 | Runtime should expose health, eval capture, and replay. | Memory quality needs proof. | eval_candidates captures real calls when explicitly enabled. |
| P2 | Search modes should make cost/quality tunable. | Haiku loops and Opus/tokenmax workflows have different budgets. | Search modes resolve deterministic knobs and cache keys. |
Bottom line: gbrain's product is a source-scoped, citation-aware memory runtime exposed through CLI/MCP. The defensible moat is the combination of typed memory entities, source tenancy, hybrid retrieval, agent-safe operation contracts, and eval-backed quality loops.