Graph-Based Research Operating System
Part 3 of the human–AI research partnership series.
TL;DR. A research project, once large enough, is a graph. A graph alone is storage, not power. The power comes from treating the graph the way an operating system treats a filesystem: an ontology that classifies every asset into a small MECE set of leaf types (claim, literature, advice, prose, mathematic, data, design choice, code, output) plus one container; a handful of read primitives (path-to-root, nearest-of-type, neighborhood-ball) that do the structural traversal; and a decomposition layer sitting on top that lets a researcher ask a long, compound question and get back a structured set of typed answers — the equivalent of a question-specific Wikipedia page assembled on demand. Downstream skills (advisors, referees, proofreaders) consume that page as preflight context. The graph stops being a bookkeeping system and starts being the substrate the project runs on.
From Graph to Operating System
The previous post made the case that a research project should be modelled as a graph. Findings, fronts, measures, data, prose, literature — each a node, each connected by explicit edges. That argument stops at the filesystem layer. It says where things should live. It does not say what the system should do with them.
A filesystem is not an operating system. The OS is what arrives when you
build abstractions over the filesystem that let arbitrary applications ask
useful questions without re-reading every file. POSIX gave Unix a stable
contract: every file is a stream of bytes, every device is a file, every
process inherits standard descriptors. The contract is what makes grep
composable with wc composable with sort — none of those tools knew about
each other when they were written, and none of them have to.
A research project deserves the same treatment. The graph is the filesystem. What we need is the contract.
The Ontology: Nine Leaves, One Container
The contract starts with classification. Every asset in a research project must answer to exactly one type — otherwise the system spends its life disambiguating. After cycling through three concrete projects (one analytical theory paper, two empirical papers in accounting), the type set that holds up is small:
                    ANCHOR / CLAIM
                    ──────────────
                         claim
                           │
                           │ (subtype: finding | front)
                           │
     ┌─────────────────────┼─────────────────────┐
     │                     │                     │
 KNOWLEDGE            PROSE-SIDE            EXECUTABLE
 ─────────            ──────────            ──────────
 literature           prose                 data
 advice               mathematic            design_choice
                      (notation,            code
                       equation,            output
                       theorem,             (magnitude,
                       proof)                figure, table)
                           │
                           │ (container; can hold any leaf)
                           ▼
                           ri
Nine leaf types, one container. Each leaf carries an optional subtype:
field for the cases where flavors matter inside the leaf (a mathematic
node is one of {notation, equation, theorem, proof}; an output node is
one of {magnitude, figure, table}). Each leaf carries points_to: — a
file path or a file-plus-line-range slice — pointing at the
source-of-truth artifact. The card itself is a thin shim. The actual
content lives in the manuscript, the codebase, the dataset.
The container ri (research iteration) is the only type that holds
heterogeneous sub-cards. Lifecycle is implied by location: a card inside
an ri/ folder is uncommitted (sandbox); a card outside is committed.
No status field, no lifecycle attribute — the directory tree carries the
information.
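To make that concrete, here is a minimal sketch of reading a card, assuming cards are markdown files with YAML frontmatter. The example path, the helper name load_card, and any field beyond type, subtype, and points_to are illustrative, not the actual schema.

from pathlib import Path
import yaml  # assumes PyYAML is installed

def load_card(path: Path) -> dict:
    """Parse a card's YAML frontmatter; the body behind points_to stays on disk."""
    _, frontmatter, body = path.read_text().split("---", 2)
    card = yaml.safe_load(frontmatter)
    # Lifecycle is implied by location, not stored as a field:
    # any card under an ri/ folder is an uncommitted sandbox card.
    card["committed"] = "ri" not in path.parts
    return card

# Hypothetical card path, for illustration only.
card = load_card(Path("graph/claim/matched_sample_effect.md"))
print(card["type"], card.get("subtype"), card["points_to"], card["committed"])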
The point of the ontology is not aesthetics. The point is that every
elementary research question maps to exactly one leaf type — and that
mapping is what makes machine-driven retrieval tractable. "What does the
prose say about ICW?" → prose. "Which prior papers cite this
mechanism?" → literature. "How does the code build the matched
sample?" → code. "What were we trying in that abandoned iteration last
month?" → ri. The leaf type is the routing decision; the rest is
bookkeeping.
The Three Layers of the OS
Infrastructure: the graph as filesystem
The infrastructure layer is what the previous post described — the
directory layout under graph/, the frontmatter conventions, the
edge-type registry, the index-building scripts. It is the
machine-readable counterpart to the manuscript. It serializes nothing
new; it indexes what's already there. The infrastructure is necessary
but inert. Without consumers, it does not earn its keep.
System calls: individual query primitives
The system-call layer is a small, fixed set of read primitives that walk the graph. Three are sufficient at this scale:
- path_to_root(anchor) — shortest typed-edge chain from the anchor back to the project's thesis root. Answers "how does this connect to the main story?"
- nearest_of_type(anchor, type, k) — the k nearest nodes of a specific leaf type. Answers "which prior papers / advisories / RIs are most relevant here?"
- ball(anchor, hops) — the neighborhood within N hops, with per-type quotas so one type doesn't crowd out the others. Answers "give me the working set around this node."
Each primitive is a few hundred lines of Python. There is no embedding, no vector store, no semantic similarity. Edges are explicit. A node either is connected or it isn't. The primitives compose with each other and with the underlying card layer; they do not compose with themselves. This is the contract that downstream applications target.
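A minimal sketch of what those primitives can look like over an explicit adjacency index. The toy NODES/EDGES shape is an assumption for illustration; the real package walks the sharded JSON index, but the logic is the same: plain breadth-first traversal, no scoring model anywhere.

from collections import deque

# Toy index: node id -> leaf type, plus explicit directed edges.
NODES = {"root": "claim", "f1": "claim", "p1": "prose", "c1": "code"}
EDGES = {"root": ["f1"], "f1": ["p1", "c1"]}

def neighbors(node: str) -> set[str]:
    """Undirected view over explicit edges: connected or not, nothing fuzzy."""
    out = set(EDGES.get(node, []))
    out |= {src for src, dsts in EDGES.items() if node in dsts}
    return out

def path_to_root(anchor: str, root: str = "root") -> list[str] | None:
    """Shortest edge chain from the anchor back to the thesis root."""
    queue, seen = deque([[anchor]]), {anchor}
    while queue:
        path = queue.popleft()
        if path[-1] == root:
            return path
        for nxt in neighbors(path[-1]) - seen:
            seen.add(nxt)
            queue.append(path + [nxt])
    return None

def nearest_of_type(anchor: str, leaf_type: str, k: int) -> list[str]:
    """The k nearest nodes of one leaf type, by hop distance (BFS order)."""
    queue, seen, hits = deque([anchor]), {anchor}, []
    while queue and len(hits) < k:
        node = queue.popleft()
        if node != anchor and NODES[node] == leaf_type:
            hits.append(node)
        for nxt in neighbors(node) - seen:
            seen.add(nxt)
            queue.append(nxt)
    return hits

def ball(anchor: str, hops: int, quota: int = 5) -> dict[str, list[str]]:
    """Everything within N hops, capped per type so no type crowds the rest out."""
    seen, frontier = {anchor}, {anchor}
    for _ in range(hops):
        frontier = {n for f in frontier for n in neighbors(f)} - seen
        seen |= frontier
    kept: dict[str, list[str]] = {}
    for node in sorted(seen - {anchor}):
        bucket = kept.setdefault(NODES[node], [])
        if len(bucket) < quota:  # per-type quota
            bucket.append(node)
    return kept

print(path_to_root("c1"))                 # ['c1', 'f1', 'root']
print(nearest_of_type("f1", "prose", 1))  # ['p1']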
User space: composing primitives into answers
The user-space layer is where the interesting work happens. A researcher rarely asks an elementary question. They ask: "Tell me about the dataset for the divestiture-with-vs-without-special-dividends sample. How does the prose describe it, how does the code build it, what defines treat-vs-control events?" — a single utterance composed of three or four elementary questions, each with a different leaf type-root.
The user-space layer's job is to decompose. Read the long question; identify the elementary sub-questions; assign each one a leaf type and a fuzzy anchor term; resolve the anchor against the manifest; dispatch the right system call; collect the results; write a synthesis. The output is a structured artifact — call it a question-specific Wikipedia page — that captures both the decomposition (so the researcher can audit it) and the synthesis (so they can read the answer first).
This decomposition layer is what turns a pile of read primitives into a usable system. Without it, every researcher must hand-translate their question into a sequence of CLI calls. With it, the researcher states intent in their own language, and the OS handles the routing.
The Decomposition Layer in Practice
The decomposition layer trusts the language model to do something the language model is good at: read intent, identify the relevant types, and produce a structured plan. It does not trust the language model to do something it is bad at: structural traversal, exact recall, unbiased ranking. Those are the system calls, and they remain deterministic Python.
A compound question collapses into a list:
# User asked: "tell me about the divestiture sample — prose, code,
# treat vs control events."
- question: "How does the prose describe sample/treatment/control?"
  target_type: prose
  anchor_term: "data and empirical strategy"
  mode: nearest_of_type
- question: "Which RIs build the matched sample?"
  target_type: ri
  anchor_term: "divestiture matched sample"
  mode: ball
- question: "How does the code build the dataset?"
  target_type: code
  anchor_term: "prep_data divestiture"
  mode: nearest_of_type
- question: "What defines the corporate events?"
  target_type: data
  anchor_term: "divestiture county panel"
  mode: nearest_of_type
Four elementary questions run in parallel. Each returns a sub-graph.
A synthesis paragraph stitches them together. The full artifact lands
on disk at md/gqa/decomp_<topic>_<date>.md — re-anchorable in
future sessions, citable from prose, auditable per element.
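For concreteness, the dispatch step might look like the following sketch, assuming the primitives from the earlier sketch are in scope. The stub resolver and the file name are illustrative stand-ins.

import yaml

def resolve_anchor(term: str) -> str:
    """Stub: the real resolver does two-phase scoring over the manifest."""
    return "f1"

DISPATCH = {
    "nearest_of_type": lambda q: nearest_of_type(
        resolve_anchor(q["anchor_term"]), q["target_type"], k=3
    ),
    "ball": lambda q: ball(resolve_anchor(q["anchor_term"]), hops=2),
    "path_to_root": lambda q: path_to_root(resolve_anchor(q["anchor_term"])),
}

with open("plan.yaml") as fh:
    plan = yaml.safe_load(fh)  # the four-element list shown above

results = [
    {"question": q["question"], "subgraph": DISPATCH[q["mode"]](q)}
    for q in plan
]
# The model then writes the synthesis artifact
# (md/gqa/decomp_<topic>_<date>.md) from `results`.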
The MECE-by-type-root premise is doing real work here. Because every
elementary question routes to exactly one leaf type, the decomposer
cannot waffle between candidates. "Code that builds the dataset" is
a code question, not a data question and not a prose question.
The ontology forces the model to commit, which is also what makes the
decomposition reproducible across runs.
Downstream: Skills That Read the Graph
The decomposition layer's first internal user is not the researcher directly. It is the family of advisor / referee / proofreading skills that already exist in the workspace. Before this layer existed, an advisor skill would either:
- Re-read CLAUDE.md and a status file every session (which were already drowning under accumulated context), or
- Ask the researcher to manually paste in the relevant prior context.
Both are unreliable. The first goes stale; the second forgets. The
fix is a graph preflight at the front of each downstream skill: if
the project has graph infrastructure (the loaded CLAUDE.md
mentions it) AND the user's question references project-specific
terms (a named measure, finding, RI, prose section, advisory, paper,
or theorem), dispatch the decomposition layer as a sub-agent, save
the artifact, read it into context, and then answer.
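A sketch of that gate, with both conditions reduced to their simplest form; the helper name and the string checks are illustrative stand-ins for the real skill logic.

def should_preflight(claude_md: str, question: str, project_terms: set[str]) -> bool:
    """Dispatch the decomposition sub-agent only when both conditions hold."""
    has_graph_infra = "graph/" in claude_md  # loaded CLAUDE.md mentions the graph
    names_a_term = any(t in question for t in project_terms)  # measure, finding, RI...
    return has_graph_infra and names_a_term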
The advisor skill answers from grounded evidence rather than from persona alone. The referee skill cites the actual prior advisories and prior iterations. The proofreader knows which findings the paragraph in front of it is supposed to support. None of these skills know how the graph is stored. None of them know what decomposition is. They invoke a single sub-agent and read a single markdown file. The OS handles the rest.
The same artifact also informs the researcher — not just the agent. When a project hits a decision point — should we keep pushing this front, or promote it; should we revise the measure, or hold; should we cut Table 3, or defend it? — the artifact is the briefing. The researcher reads the synthesis; the synthesis links to the elementary answers; the elementary answers link to the actual artifacts. Three minutes from question to grounded answer, instead of thirty.
Implementation Details
The infrastructure side is one Python package: a schema validator with type-specific functions, a universal node writer with idempotent upsert semantics (create / no-op / list-merge / scalar-conflict / body-stacking on conflict), three traversal primitives, an anchor resolver with two-phase scoring (frontmatter scan first, body-grep fallback), a decomposition entry point. About 2,000 lines of deterministic code. The frontmatter contract is YAML; the bodies are markdown; the index is sharded JSON. Nothing exotic.
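A sketch of the two-phase resolver's shape: frontmatter scan first, body-grep only as a fallback. The word-overlap scoring and the card fields are assumptions; the real scorer is richer.

def resolve_anchor(term: str, cards: list[dict]) -> str | None:
    words = set(term.lower().split())
    # Phase 1: score frontmatter (title plus aliases) by word overlap.
    scored = []
    for card in cards:
        fm = " ".join([card["title"], *card.get("aliases", [])]).lower()
        overlap = len(words & set(fm.split()))
        if overlap:
            scored.append((overlap, card["id"]))
    if scored:
        return max(scored)[1]
    # Phase 2: body-grep fallback, first card whose body mentions any word.
    for card in cards:
        if any(w in card["body"].lower() for w in words):
            return card["id"]
    return None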
The contract that holds the whole thing together is one writer.
Every path that creates or modifies a graph card goes through one
function. There is no parallel implementation in another script.
There is no way to write a card that bypasses schema validation.
There is no way to silently overwrite a hand-edited body — when a
new write conflicts with an existing one, the old body is demoted
under a ## Earlier versions / ### YYYYMMDD header and the
researcher is informed. The contract makes the system robust to the
single most common cause of graph rot: divergent code paths writing
the same data with different conventions.
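A sketch of those upsert semantics, assuming a card is a flat dict with a body field. Whether a scalar conflict halts or merely warns the researcher is a detail the sketch papers over by raising.

import datetime

def upsert(existing: dict | None, incoming: dict) -> dict:
    if existing is None:
        return dict(incoming)                      # create
    merged = dict(existing)
    for key, new in incoming.items():
        old = merged.get(key)
        if old is None or old == new:
            merged[key] = new                      # new field / no-op
        elif isinstance(old, list) and isinstance(new, list):
            merged[key] = old + [v for v in new if v not in old]  # list-merge
        elif key == "body":
            # Body-stacking: demote the old body under a dated header.
            stamp = datetime.date.today().strftime("%Y%m%d")
            merged[key] = f"{new}\n\n## Earlier versions\n\n### {stamp}\n\n{old}"
        else:
            raise ValueError(f"scalar conflict on {key!r}: {old!r} vs {new!r}")
    return merged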
The application side is a slash-command skill: a SKILL.md file that teaches the language model when to decompose, how to route elementary questions to leaf types, how to dispatch the primitives, and how to write the synthesis artifact. The skill is invoked either directly by the researcher ("ask the graph: …") or by other skills as a preflight. Both paths share the same underlying function calls; only the context differs.
The boundary between deterministic and probabilistic is the design discipline that matters most. The model decomposes (reasoning). The Python resolves anchors and traverses edges (structure). The model writes the synthesis (reasoning). The Python ensures the write is idempotent (structure). Each layer plays to its strength; neither is asked to do the other's job.
Build the OS, Then the Applications Follow
A graph is data. An ontology is contract. A small set of read primitives is the system call surface. A decomposition layer is the shell. Downstream skills are the applications. Strip any one of those out and what's left is either inert storage or a pile of ad-hoc scripts. Put them together and the project becomes addressable: any question with a named anchor turns into a query; any query produces a structured artifact; any artifact informs both the agent and the researcher.
The result is not a tool. It is a substrate. The substrate is what lets the next paper take half the elapsed time of the last one — not because the model got faster, but because the project's own history is finally available to be queried at the speed of research, not at the speed of memory.
Build the operating system. The applications follow.