gbrain Data Model and Runtime Map
Product Frame
gbrain is persistent memory for AI agents. Its job is to let an agent search, cite, update, and reuse knowledge across sessions, repos, machines, and clients.
The public README positions it as "the memory your agent actually keeps between sessions." The codebase backs that with a Postgres/pgvector schema, source-scoped tenancy, page/chunk storage, embeddings, code-symbol metadata, soft delete, effective dates, OAuth-scoped HTTP MCP, and stdio MCP tools.
Runtime Loop
- Source is registered: local repo, wiki, media archive, YC/media corpus, or default brain.
- Content is ingested into pages with source_id, slug, type, title, compiled_truth, timeline, and frontmatter.
- Content is chunked into content_chunks with text, embeddings, token counts, modality, language, symbols, and line ranges.
- Search combines vector, text, graph/code edges, salience, source scope, and recency filters.
- MCP/CLI exposes search, get_page, put_page, sync, code-def, code-refs, code-callers, code-callees, and admin operations.
- Skills run read-enrich-write loops: search the brain, synthesize, update pages, maintain citations/backlinks.
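The ingest step of this loop can be sketched as simple data shapes. This is an illustrative sketch, not the real definitions in gbrain/src/core/types.ts: field names follow the README's description, and the fixed-size chunker stands in for the real token- and structure-aware one.

```typescript
// Illustrative shapes for the ingest step: a page is split into
// retrieval chunks that carry their own token metadata.
// Field names are assumptions modeled on the list above.
interface Page {
  sourceId: string;
  slug: string;
  type: string;
  title: string;
  compiledTruth: string;
}

interface ContentChunk {
  pageSlug: string;
  chunkText: string;
  tokenCount: number;
  modality: "text" | "code" | "image";
}

// Naive fixed-size chunker; the real chunker (versioned via
// chunker_version on sources) is token- and structure-aware.
function chunkPage(page: Page, maxChars = 80): ContentChunk[] {
  const chunks: ContentChunk[] = [];
  for (let i = 0; i < page.compiledTruth.length; i += maxChars) {
    const text = page.compiledTruth.slice(i, i + maxChars);
    chunks.push({
      pageSlug: page.slug,
      chunkText: text,
      tokenCount: Math.ceil(text.length / 4), // rough chars-per-token estimate
      modality: "text",
    });
  }
  return chunks;
}
```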
Core Entities
| Entity | Product Meaning | Important Fields / Behavior |
|---|---|---|
| sources | A logical brain inside the DB | id, name, local_path, federation config, chunker_version, archived flags |
| pages | Canonical knowledge objects | source_id, slug, type, page_kind, title, compiled_truth, timeline, frontmatter |
| content_chunks | Retrieval units | chunk_text, embedding, model, token_count, language, symbol metadata, modality |
| code_edges_chunk | Resolved code graph | from_chunk_id, to_chunk_id, edge_type, symbol identities |
| code_edges_symbol | Unresolved code refs | target symbol known before source definition is imported |
| files | Binary/file storage references | used for images, uploads, and filesystem-backed memory |
| ingest logs/jobs | Operational trace | tells operators what was imported, embedded, transformed, or failed |
| OAuth clients/tokens | Remote trust boundary | scopes read/write/admin and source-specific access |
Page Types
The TypeScript PageType union includes person, company, deal, yc, civic, project, concept, source, media, writing, analysis, guide, hardware, architecture, meeting, note, email, slack, calendar-event, code, image, and synthesis.
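Reconstructed from the list above, the union looks roughly like this; the real definition lives in gbrain/src/core/types.ts and may differ in ordering or members. The runtime guard is an added illustration for validating untyped frontmatter.

```typescript
// PageType union as enumerated above (reconstruction, not the source file).
type PageType =
  | "person" | "company" | "deal" | "yc" | "civic"
  | "project" | "concept" | "source" | "media" | "writing"
  | "analysis" | "guide" | "hardware" | "architecture" | "meeting"
  | "note" | "email" | "slack" | "calendar-event" | "code"
  | "image" | "synthesis";

// A runtime guard is useful when ingesting untyped frontmatter.
const PAGE_TYPES = new Set([
  "person", "company", "deal", "yc", "civic", "project", "concept",
  "source", "media", "writing", "analysis", "guide", "hardware",
  "architecture", "meeting", "note", "email", "slack",
  "calendar-event", "code", "image", "synthesis",
]);

function isPageType(value: string): value is PageType {
  return PAGE_TYPES.has(value);
}
```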
This matters because gbrain is not just note search. It encodes the major objects of founder work:
- people and companies
- meetings and messages
- deals and YC context
- concepts and analysis
- code and architecture
- media and sources
- generated synthesis
Source Scoping
sources is the tenancy layer. Each page belongs to a source. A source can be federated or non-federated:
- federated sources participate in default cross-source search.
- non-federated sources are searched only when explicitly requested.
- archived sources are hidden and later purged.
Product implication: gbrain can support a personal brain, team brain, client brain, repo brain, and public knowledge corpus without mixing all writes into one undifferentiated pile.
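The three scoping rules above can be sketched as a filter. Field names here are assumptions based on the described behavior, not the actual sources schema or config shape.

```typescript
// Illustrative federation filter, mirroring the rules above.
interface Source {
  id: string;
  federated: boolean;
  archived: boolean;
}

// Default (unqualified) search sees only live federated sources;
// a non-federated source appears only when named explicitly.
// Archived sources are hidden in both cases.
function searchableSources(all: Source[], explicit?: string): string[] {
  if (explicit !== undefined) {
    return all
      .filter((s) => s.id === explicit && !s.archived)
      .map((s) => s.id);
  }
  return all.filter((s) => s.federated && !s.archived).map((s) => s.id);
}
```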
Retrieval Model
The schema points to four retrieval modes:
- Semantic: pgvector embeddings on text chunks.
- Text: trigram and full-text search vectors.
- Structural: code graph edges and symbol-aware lookup.
- Time/salience: effective_date, salience_touched_at, emotional_weight.
The "Cathedral II" direction flagged in code comments is important: it moves gbrain from markdown-only memory into code-aware retrieval:
- symbol_name
- symbol_name_qualified
- parent_symbol_path
- doc_comment
- start_line/end_line
- code_edges_chunk and code_edges_symbol
That lets coding agents ask "where is this defined?", "who calls it?", and "what context should I inspect next?" without falling back to broad grep.
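A minimal sketch of those two questions over resolved edges. Column and field names here are assumptions modeled on the code_edges_chunk description above, not the actual schema.

```typescript
// Illustrative symbol-aware lookup over resolved code-graph edges.
type EdgeType = "calls" | "references";

interface CodeEdge {
  fromChunkId: string;
  toChunkId: string;
  edgeType: EdgeType;
}

interface CodeChunk {
  id: string;
  symbolName: string;
  isDefinition: boolean;
}

// "Where is this defined?"
function findDefinition(
  chunks: CodeChunk[],
  symbol: string,
): CodeChunk | undefined {
  return chunks.find((c) => c.symbolName === symbol && c.isDefinition);
}

// "Who calls it?" — walk call edges pointing at the definition chunk.
function findCallers(
  chunks: CodeChunk[],
  edges: CodeEdge[],
  symbol: string,
): CodeChunk[] {
  const def = findDefinition(chunks, symbol);
  if (!def) return [];
  const callerIds = new Set(
    edges
      .filter((e) => e.edgeType === "calls" && e.toChunkId === def.id)
      .map((e) => e.fromChunkId),
  );
  return chunks.filter((c) => callerIds.has(c.id));
}
```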
Trust and Deployment Model
gbrain supports multiple deployment shapes:
- local PGLite for low-friction personal use
- Supabase for shared/cloud memory
- stdio MCP for local agents
- HTTP MCP with OAuth 2.1 for remote clients
- source-scoped OAuth clients for shared brains
The product spec hiding here: persistent memory only works if users can trust the boundary. gbrain's trust boundary is source-aware, client-aware, and operation-aware.
Product Requirements
For gbrain to feel like real agent memory, it must satisfy:
- Durable writes: agent updates survive across sessions.
- Citation discipline: claims point back to pages/sources.
- Search quality: high recall and precision, without fabricating facts the brain does not contain.
- Source separation: no accidental client/repo contamination.
- Toolability: CLI and MCP operations are structured, not natural-language guesses.
- Maintenance: stale pages, orphaned backlinks, missing citations, and dead links get audited.
- Reproducibility: evals show retrieval quality on known corpora.
Open Questions
- Which source-scoping model becomes default for teams: one shared source, many federated personal sources, or project-specific sources?
- How often does emotional_weight materially improve retrieval versus creating tuning complexity?
- Which MCP operations are most used in real agent sessions?
- Does code graph retrieval reduce cost/time enough to become the default for coding agents?
Deep Runtime Evidence Map
This pass goes below the README and treats gbrain as a runtime product. The source basis is gbrain/README.md, gbrain/src/schema.sql, gbrain/src/core/types.ts, gbrain/src/core/operations.ts, gbrain/src/core/search/*, and the MCP/security docs.
Expanded Entity Map
| Entity | Runtime Role | Evidence | Product Meaning |
|---|---|---|---|
| sources | In-DB tenancy namespace. Every page/file/ingest row belongs to a logical source. | gbrain/src/schema.sql | Lets one brain hold wiki, code repos, media corpora, team knowledge, or isolated project memory without global slug collision. |
| pages | Canonical knowledge object. | gbrain/src/schema.sql, gbrain/src/core/types.ts | The memory page: title, compiled truth, timeline, frontmatter, type, soft-delete state, emotional weight, effective date. |
| content_chunks | Retrieval unit for text/code/image. | gbrain/src/schema.sql | Converts pages into searchable chunks with embeddings, FTS vector, symbol metadata, modality, and token count. |
| links | Page graph, backlinks, entity relationships. | gbrain/src/schema.sql, gbrain/src/core/types.ts | The brain wires itself through typed relationships from markdown, frontmatter, and manual edges. |
| timeline_entries | Structured temporal facts. | gbrain/src/schema.sql, gbrain/src/core/types.ts | Lets the brain answer when/what-changed questions beyond vector similarity. |
| code_edges_chunk / code_edges_symbol | Resolved and unresolved code graph edges. | gbrain/src/schema.sql | Powers code-def, code-refs, callers/callees, and two-pass code retrieval. |
| files | Binary/file sidecar index. | gbrain/src/schema.sql | Stores references for images/uploads without stuffing bytes into core page rows. |
| oauth_clients, oauth_tokens, oauth_codes | Remote MCP identity and authorization. | gbrain/src/schema.sql | Defines client identity, grants, scopes, write source, and federated-read set. |
| mcp_request_log | Remote tool-call audit log. | gbrain/src/schema.sql, gbrain/SECURITY.md | Makes remote brain access observable without retaining raw payloads by default. |
| minion_jobs and subagent tables | Durable agent runtime. | gbrain/src/schema.sql | Lets background jobs, subagent loops, tool executions, and rate leases persist. |
| eval_candidates | Real retrieval eval capture. | gbrain/README.md, gbrain/src/schema.sql | Turns real query/search calls into replayable BrainBench-Real examples. |
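The eval_candidates row describes an opt-in capture path: real search calls become replayable examples. A minimal sketch, with names that are illustrative rather than the actual table columns:

```typescript
// Hypothetical eval-capture shape: when enabled, each real search call
// is recorded as a replayable BrainBench-Real example.
interface EvalCandidate {
  query: string;
  retrievedSlugs: string[];
  capturedAt: string;
}

function captureEval(
  log: EvalCandidate[],
  enabled: boolean,
  query: string,
  retrievedSlugs: string[],
): void {
  if (!enabled) return; // capture is explicitly opt-in
  log.push({ query, retrievedSlugs, capturedAt: new Date().toISOString() });
}
```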
Source / Tenant Model
gbrain has two axes:
| Axis | Meaning | Boundary Rule |
|---|---|---|
| Brain | One database: PGLite, Postgres, or Supabase. | Data owner / access-control boundary. |
| Source | Named content repo inside a brain. | Repo, topic, team, client, or workstream boundary inside one DB. |
Key mechanics:
| Mechanic | Runtime Behavior |
|---|---|
| Per-source slug namespace | pages enforces unique (source_id, slug), so different sources can safely contain the same slug. |
| Federation | sources.config.federated=true joins default unqualified search; false requires explicit source selection. |
| Source resolution | Precedence flows through explicit source flag/env/project files/registered local path/default source. |
| Agent citation | Multi-source citations need source-qualified slugs such as [source-id:slug]. |
| OAuth source model | Remote clients get write authority through source_id and separate read authority through federated-read configuration. |
Interpretation: source scoping is not metadata decoration. It is the anti-leak primitive for shared brains. Several code comments in operations/search paths treat missing source propagation as a P0 leak class.
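The source-resolution precedence in the table above reduces to a first-match chain. Argument names here are illustrative, not the actual resolver's signature:

```typescript
// Precedence as described above: explicit flag, then environment, then
// project file, then registered local path, then the default source.
function resolveSource(opts: {
  flag?: string;
  env?: string;
  projectFile?: string;
  registeredPath?: string;
  defaultSource: string;
}): string {
  return (
    opts.flag ??
    opts.env ??
    opts.projectFile ??
    opts.registeredPath ??
    opts.defaultSource
  );
}
```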
Retrieval Pipeline
| Stage | What Happens | Product Reason |
|---|---|---|
| Mode resolution | conservative, balanced, and tokenmax set defaults for cache, intent weighting, token budget, expansion, limit, and reranker use. | Lets operators trade cost, speed, and depth. |
| Intent classification | Query intent influences detail, salience, recency, and RRF weights. | Memory search should adapt to task shape. |
| Keyword path | Always runs first and works without embeddings. | Day-one installs and offline paths still work. |
| Vector path | If an embedding provider exists, query variants are embedded and searched. | Semantic recall covers fuzzy questions. |
| Fusion | Keyword and vector lists merge via weighted reciprocal-rank fusion, then score adjustments. | Combines exact and semantic evidence. |
| Boosts | Backlinks, salience, recency, and exact match affect rank. | Operator memory needs relationships and time, not only similarity. |
| Structural expansion | Optional graph walk via nearSymbol / walkDepth, capped. | Coding agents need symbol adjacency. |
| Dedup | Composite (source_id, slug), text similarity, type diversity, per-page cap, compiled-truth guarantee. | Avoids noisy repeated chunks. |
| Rerank / budget | Optional reranker and token budget enforcement. | Keeps output useful inside model context. |
| Eval capture | search and query can capture retrieved slugs/chunks when enabled. | Converts real usage into benchmark fuel. |
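The fusion stage above merges the keyword and vector rankings with weighted reciprocal-rank fusion. A self-contained sketch; the k constant and weights are illustrative defaults, not gbrain's tuned values:

```typescript
// Weighted reciprocal-rank fusion over two ranked slug lists.
// Each appearance contributes w / (k + rank + 1) to a slug's score.
function fuseRRF(
  keyword: string[], // slugs ranked by the keyword path
  vector: string[],  // slugs ranked by the vector path
  wKeyword = 1.0,
  wVector = 1.0,
  k = 60,
): string[] {
  const score = new Map<string, number>();
  const add = (list: string[], w: number) =>
    list.forEach((slug, rank) => {
      score.set(slug, (score.get(slug) ?? 0) + w / (k + rank + 1));
    });
  add(keyword, wKeyword);
  add(vector, wVector);
  return [...score.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([slug]) => slug);
}
```

A slug that appears in both lists outranks a slug that tops only one, which is exactly the "combine exact and semantic evidence" behavior the table describes.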
Product read: gbrain is not "vector DB with markdown." It is layered retrieval: lexical, vector, graph, temporal, salience, source tenancy, and code-structure expansion.
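The dedup stage of the pipeline can be sketched with two of its rules: the composite (source_id, slug) page key and the per-page cap. Exact-text dropping stands in for the similarity pass; the type-diversity and compiled-truth rules are omitted for brevity, and all names are illustrative.

```typescript
// Dedup sketch over an already-ranked chunk list.
interface RankedChunk {
  sourceId: string;
  slug: string;
  text: string;
}

function dedupe(ranked: RankedChunk[], perPageCap = 2): RankedChunk[] {
  const perPage = new Map<string, number>();
  const seenText = new Set<string>();
  const out: RankedChunk[] = [];
  for (const chunk of ranked) {
    const key = `${chunk.sourceId}:${chunk.slug}`; // composite page key
    const count = perPage.get(key) ?? 0;
    if (count >= perPageCap) continue;      // enforce per-page cap
    if (seenText.has(chunk.text)) continue; // drop exact-duplicate text
    perPage.set(key, count + 1);
    seenText.add(chunk.text);
    out.push(chunk);
  }
  return out;
}
```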
MCP / OAuth / Trust Boundary
| Boundary | Mechanism | Requirement Implied |
|---|---|---|
| Local stdio MCP | gbrain serve exposes tools over stdio. | Local agents get structured brain tools without HTTP setup. |
| HTTP MCP | gbrain serve --http exposes OAuth-backed MCP and admin dashboard. | Remote clients need scoped auth, logs, discovery, and client management. |
| Operation contracts | Tool definitions derive from shared operations. | No hand-maintained schema drift between CLI/MCP/HTTP. |
| Shared dispatch | Stdio and HTTP use the same validation/context/result path. | Transport parity is a correctness and security feature. |
| Remote flag | OperationContext.remote is required for remote/untrusted callers. | Filesystem/tool operations fail closed for remote agents. |
| Source read scope | Read helpers prefer OAuth allowed sources, then context source ID. | Every read path must thread source scope into filters. |
| Scopes | Operations are tagged read/write/admin/local-only. | Remote clients get least-privilege tool access. |
| Local-only ops | Sync/file operations are rejected over HTTP regardless of scope. | Remote agents cannot touch local filesystem surfaces. |
| Logging redaction | MCP params are logged as redacted shape by default. | Admin observability must not become a private-data leak. |
| Network hardening | Loopback default, CORS deny by default, rate limits, proxy warnings. | Personal brains should not become accidentally exposed. |
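Several rows above compose into one fail-closed authorization check: scope tags, the remote flag, and local-only rejection. A sketch under assumed names, not the actual dispatch code:

```typescript
// Illustrative fail-closed authorization for tool dispatch.
type Scope = "read" | "write" | "admin";

interface Operation {
  name: string;
  scope: Scope;
  localOnly: boolean;
}

interface CallContext {
  remote: boolean;        // true for remote/untrusted callers
  grantedScopes: Scope[]; // from the OAuth token
}

function authorize(op: Operation, ctx: CallContext): boolean {
  // Local-only operations are rejected for remote callers
  // regardless of granted scope.
  if (ctx.remote && op.localOnly) return false;
  // Local (stdio, owner-trusted) callers bypass OAuth scope checks.
  if (!ctx.remote) return true;
  // Remote callers need the operation's scope in their grant.
  return ctx.grantedScopes.includes(op.scope);
}
```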
Product Requirements, Ranked
| Priority | Requirement | Why It Exists | Acceptance Signal |
|---|---|---|---|
| P0 | Source isolation must be enforced on every read/write path. | Multi-source brains otherwise leak client/team/repo context. | Same-slug pages across sources remain distinct; every read handler honors source filters. |
| P0 | Remote MCP must be scoped, logged, and local-file-safe. | Remote clients operate outside the owner's OS trust boundary. | OAuth scopes honored, local-only ops rejected, params redacted, rate limits active. |
| P0 | Search must degrade gracefully without embeddings. | Day-one installs may lack provider keys. | search works; query falls back instead of failing. |
| P1 | Retrieval must combine lexical, semantic, graph, recency, salience, and code structure. | Founder/operator memory needs factual, temporal, relationship, and code recall. | Hybrid path returns ranked, deduped, source-aware chunks with metadata. |
| P1 | Every page needs provenance and citation discipline. | Memory must be auditable, not merely plausible. | Compiled truth and timeline carry source citations; conflicts are explicit. |
| P1 | Agent tool schemas must be generated from one operation contract. | MCP/CLI drift breaks clients and causes strict-schema failures. | Tool definitions derive from operations; dispatch is shared. |
| P1 | Code memory must be symbol-aware. | Coding agents need where-defined/who-calls/near-symbol more than broad prose recall. | Code metadata and edge tables populate; two-pass retrieval respects source scope. |
| P2 | Runtime should expose health, eval capture, and replay. | Memory quality needs proof. | eval_candidates captures real calls when explicitly enabled. |
| P2 | Search modes should make cost/quality tunable. | Haiku loops and Opus/tokenmax workflows have different budgets. | Search modes resolve deterministic knobs and cache keys. |
Bottom line: gbrain's product is a source-scoped, citation-aware memory runtime exposed through CLI/MCP. The defensible moat is the combination of typed memory entities, source tenancy, hybrid retrieval, agent-safe operation contracts, and eval-backed quality loops.