
A Research Project is a Graph, Not a Document

Part 2 of a series on human–AI research partnerships. Part 1: The File System Is the Collaboration.


TL;DR. I'm working on a smart /compact that compresses the entire repo-level collaboration history — every chat, every commit, every advisor remark, every cited paper, every table, every prose paragraph — into one big graph. The manuscript is {prose + findings}; a finding is {method, data, analysis, table} linked to its supporting paper citations; advisor notes, literature, research-iterations (RIs), tables, and prose are all interwoven nodes in the same graph, connected by typed edges.

graph LR
    PR["📝 paragraph"] -.->|discusses| F["📌 finding"]
    F -.->|uses| M["🧪 measure"]
    M -.->|"justified by<br/>Citation (Author YYYY)"| LIT["📚 paper"]
    TBL["📊 Table X"] -->|evidences| F

    classDef p fill:#FBEAF0,stroke:#72243E,color:#4B1528
    classDef f fill:#EAF3DE,stroke:#27500A,color:#173404
    classDef m fill:#ED93B1,stroke:#72243E,color:#4B1528
    classDef l fill:#FAEEDA,stroke:#854F0B,color:#412402
    classDef t fill:#F1EFE8,stroke:#444441,color:#2C2C2A
    class PR p
    class F f
    class M m
    class LIT l
    class TBL t

Is this GraphRAG?

Structurally, yes — Microsoft's GraphRAG is the closest mainstream analog. But unlike GraphRAG (and most "graph + RAG" systems), this approach uses no embeddings, no vector search, no BERT. Pure structural traversal: an edge either exists or it doesn't. Nodes are authored as you work, not extracted post-hoc by an LLM. The LLM consumes the subgraph; it never builds it. The intellectual ancestor is less "RAG with extra structure" and more symbolic knowledge graph + git — Roam-style backlinks with research-project semantics on top.

Once the project is a graph, any question is effectively a subgraph query. Claude Code reads only the nodes that subgraph touches — and has 100% of the information it needs, with zero noise from the rest of the repo. No more 60-page PDFs feeding into a context window; no more LLM attention fraying across 5,000 lines of code.

The fix is to stop treating the project as a document and start treating it as a graph: findings, fronts, measures, instruments, data, advisor remarks, and literature each become discrete Markdown-file nodes connected by explicit edges, with a single mechanical operation (git mv from fronts/ to findings/) that promotes working hypotheses into validated claims. The manuscript is then just one serialization of the graph, AI subagents traverse a handful of files instead of inhaling the whole repo, and orphan detection becomes a daily report rather than a silent worry.


When the Checklist Breaks

In the previous post, I argued that human–AI research partnerships need the file system as the collaboration medium — tacit knowledge moved out of the lead author's head and into plain Markdown (CLAUDE.md, STATUS.md) that both parties can read, write, and audit. That model holds beautifully for the first six months. Then it cracks.

By the time an empirical paper has survived three rounds of robustness checks, two pivots in identification, and an R&R from a top journal, the project contains thirty-plus measures, eight active hypotheses, two hundred cited papers, and a 60-page manuscript. A flat checklist cannot answer the questions you actually need to ask: Which paragraph relies on Doyle, Ge & McVay (2007)? Does the alternative ICW measure we coded last week link to any finalized table? If we drop Hypothesis 3, which subsections of §4 become orphaned? These are graph queries, and you cannot answer them by scrolling.

The deeper trap is that we let the manuscript dictate the project's structure — folders named after sections, files named after subsections, status updates organized by chapter. But the prose is just the narrative wrapper. The epistemic content of the research — claims, variables, data — has an identity completely independent of the LaTeX subsection it currently lives in. Delete the paragraph that describes a finding and the finding remains true; it just lost its vehicle for communication. Once you separate truth from narrative, a different architecture emerges underneath: a network of nodes and edges.


The Nodes: Mapping the Intellectual Territory

In a graph-based research project, every concept becomes a discrete, addressable node — a single Markdown file at a known path.

1. Findings. A finding is a claim, not a table. "ICW firms have 23% higher one-year-ahead crash risk" is a finding; the table that demonstrates it is an artifact of the finding. Treating findings as propositions (rather than PDFs) is what makes the edges work — a paragraph cites a claim, and the claim points to the table that backs it.

2. Fronts. Active, in-progress investigations. A front is a working hypothesis with code attached and evidence accumulating. Fronts are explicitly distinct from findings because the workflow that turns one into the other — promotion — is the kinetic mechanism of the entire system (more on this below).

3. The Methodology Triad: Data, Instruments, Measures. This is the engine room. The handwritten sketch that motivated this whole framework specifically broke this triad apart, because each piece has different dependencies:

  • Data — the raw inputs (a Compustat panel, an Audit Analytics extract, a hand-collected sample).
  • Instruments / Surveys — the protocols or formulas used to extract meaning (an F-score specification, a PCAOB inspection protocol, a sentiment classifier).
  • Measures — the operationalized variables produced by applying an instrument to data (an ICW_dummy column in your panel).

The triad is connected by directed edges: an Instrument applied to Data yields a Measure. Treating the three as a single "methodology" lump is exactly the move that hides project debt.

4. Literature. Foundational papers, theories, canonical models. Each cited paper is a node, not a line in a .bib file.

5. Prose. The actual LaTeX subsections. These are nodes too — but they are consumers of the graph, not its core content. The manuscript decomposes into a shallow tree of sections, each becoming a node:

graph LR
    PROSE["📝 PROSE<br/><i>the manuscript</i>"]
    PROSE --> ABS["Abstract"]
    PROSE --> INT["Introduction"]
    PROSE --> BG["Background / Hypotheses"]
    PROSE --> DSM["Data / Survey / Measure"]
    PROSE --> RES["Results"]
    PROSE --> CON["Conclusion"]

    classDef parent fill:#F4C0D1,stroke:#72243E,stroke-width:2px,color:#4B1528
    classDef leaf fill:#FBEAF0,stroke:#72243E,color:#4B1528
    classDef dsm fill:#ED93B1,stroke:#72243E,stroke-width:2px,color:#4B1528
    class PROSE parent
    class ABS,INT,BG,RES,CON leaf
    class DSM dsm

A directory tree makes this concrete:

graph/
├── findings/
│   ├── F01_icw_predicts_crash.md
│   └── F03_robustness_alt_measures.md
├── fronts/
│   ├── RI_internal_control/_node.md
│   └── RI_audit_partner/_node.md
├── measures/
│   ├── M_icw_dummy.md
│   └── M_crash_risk_ncskew.md
├── data/
│   ├── D_audit_analytics.md
│   └── D_crsp_compustat.md
├── instruments/
│   └── I_doyle_icw_classifier.md
└── papers/
    ├── doyle_ge_mcvay_2007.md
    └── kim_li_zhang_2011.md

Every node is a small Markdown file with a YAML frontmatter declaring its edges. The repository becomes a queryable knowledge base, not a pile of .tex and .do files.
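
To make both halves of that sentence concrete, here is what a single node file might look like. The frontmatter field names (consumes_data, operationalized_from, justified_by) are illustrative, not a fixed schema; use whatever edge vocabulary your project settles on.

```markdown
---
id: M_icw_dummy
type: measure
consumes_data: [D_audit_analytics]
operationalized_from: [I_doyle_icw_classifier]
justified_by: [doyle_ge_mcvay_2007]
---

<!-- body: the variable definition, construction notes, and a pointer
     to the code that builds the column -->
```

And "queryable" can start as small as a loader that walks the tree and parses each frontmatter. A minimal sketch, assuming list-valued edge fields like those above:

```python
from pathlib import Path
import yaml  # pip install pyyaml

# Illustrative edge vocabulary; extend to match your own frontmatter.
EDGE_FIELDS = {"consumes_data", "operationalized_from", "justified_by",
               "uses_measures", "references_findings", "cites"}

def load_graph(root: str = "graph") -> dict:
    """Read every node file under root; return {node_id: {meta, path, edges}}."""
    nodes = {}
    for path in Path(root).rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        if not text.startswith("---"):
            continue  # not a node file
        meta = yaml.safe_load(text.split("---", 2)[1])
        edges = [(field, target)
                 for field in EDGE_FIELDS
                 for target in (meta.get(field) or [])]
        nodes[meta["id"]] = {"meta": meta, "path": path, "edges": edges}
    return nodes
```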

The two epistemic states of the project — what we believe versus what we are still investigating — are kept structurally distinct:

graph TB
    ROOT(["<b>A research project<br/>is a graph</b>"])
    ROOT --> FINDINGS["📌 FINDINGS<br/><i>validated claims</i>"]
    ROOT --> FRONTS["🔬 FRONTS<br/><i>working hypotheses</i>"]

    FINDINGS --> F1["F01_icw_predicts_crash"]
    FINDINGS --> F2["F03_robustness_alt_measures"]

    FRONTS --> FR1["RI_internal_control"]
    FRONTS --> FR2["RI_audit_partner"]

    classDef root fill:#F1EFE8,stroke:#444441,stroke-width:3px,color:#2C2C2A
    classDef parentFind fill:#C0DD97,stroke:#27500A,stroke-width:2px,color:#173404
    classDef parentFront fill:#F5C4B3,stroke:#712B13,stroke-width:2px,color:#4A1B0C
    classDef findNode fill:#EAF3DE,stroke:#27500A,color:#173404
    classDef frontNode fill:#FAECE7,stroke:#712B13,stroke-dasharray:5 3,color:#4A1B0C
    class ROOT root
    class FINDINGS parentFind
    class FRONTS parentFront
    class F1,F2 findNode
    class FR1,FR2 frontNode

The Edges: Preserving the "Why" and the "How"

Nodes alone are just a database. The power of the graph lives in the edges — the strict logical dependencies between nodes.

In a graph-mediated workflow, every claim has a structural line back to its origin:

  • A Prose paragraph cites a Finding.
  • That Finding uses a Measure.
  • That Measure consumes Data and is operationalized from an Instrument.
  • That Instrument is justified by a Literature node.

Citations stop being a monolithic list at the bottom of a manuscript. They become bidirectional edges. You don't just know that you cited Doyle, Ge & McVay (2007); you know exactly which methodological choice the paper justifies, which measure inherits from it, and which paragraph relies on the resulting finding. Walk the graph in either direction and the chain of reasoning is fully traced.
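
Walking backwards requires no extra authoring: store each edge once, then index it from both ends. A minimal sketch, building on the hypothetical load_graph shown earlier:

```python
from collections import defaultdict

def reverse_index(nodes: dict) -> dict:
    """Map each node id to the (source, edge_type) pairs that point at it."""
    incoming = defaultdict(list)
    for src, node in nodes.items():
        for edge_type, target in node["edges"]:
            incoming[target].append((src, edge_type))
    return incoming

# "Which nodes rely on Doyle, Ge & McVay (2007)?" becomes a dict lookup:
#   incoming = reverse_index(load_graph())
#   incoming["doyle_ge_mcvay_2007"]   # e.g. [("M_icw_dummy", "justified_by")]
```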

This is the structural answer to the question every co-author and every referee eventually asks: why did you do it this way? The why is no longer a memory the lead author has to keep alive. It is an edge in a file.

A toy fragment of the cross-cutting edge layer — promotion, citation, justification, and use — looks like this (the node names and the finding below are fabricated for illustration; the two cited papers are real and listed after the diagram):

graph TB
    FR1["🔬 RI_dq_coc_implied<br/><i>(front)</i>"]
    F1["📌 F01_dq_reduces_coc<br/><i>(finding)</i>"]
    DSM["📝 Data / Survey / Measure<br/><i>(prose/measure)</i>"]
    RES["📝 Results<br/><i>(prose)</i>"]
    BG["📝 Background<br/><i>(prose)</i>"]
    P1["📚 chen_et_al_2015_dq"]
    P2["📚 gebhardt_et_al_2001_coc"]

    %% Promotion Lifecycle
    FR1 ==>|"git mv → PROMOTION"| F1

    %% Epistemic Dependencies (A relies on B)
    RES -->|"references_findings"| F1
    F1 -->|"uses_measures"| DSM

    %% Literature Justifications
    BG -.->|"cites"| P1
    DSM -.->|"justified_by (Cost of Capital)"| P2
    DSM -.->|"justified_by (Disclosure Quality)"| P1

    %% Styling
    classDef findNode fill:#EAF3DE,stroke:#27500A,stroke-width:2px,color:#173404
    classDef frontNode fill:#FAECE7,stroke:#712B13,stroke-width:2px,stroke-dasharray:5 3,color:#4A1B0C
    classDef proseNode fill:#FBEAF0,stroke:#72243E,color:#4B1528
    classDef proseDsm fill:#ED93B1,stroke:#72243E,stroke-width:2px,color:#4B1528
    classDef litNode fill:#FAEEDA,stroke:#854F0B,color:#412402

    class F1 findNode
    class FR1 frontNode
    class RES,BG proseNode
    class DSM proseDsm
    class P1,P2 litNode

    linkStyle 0 stroke:#993C1D,stroke-width:4px,color:#993C1D

Solid arrows show structural dependencies; dashed arrows show citation/justification edges; the thick coral arrow is the promotion action — the kinetic step where a front becomes a finding.

The two 📚 literature nodes correspond to:

  • chen_et_al_2015_dq — Chen, S., Miao, B., & Shevlin, T. (2015). A New Measure of Disclosure Quality: The Level of Disaggregation of Accounting Data in Annual Reports. Journal of Accounting Research, 53(5), 1017–1054. https://doi.org/10.1111/1475-679X.12094
  • gebhardt_et_al_2001_coc — Gebhardt, W. R., Lee, C. M. C., & Swaminathan, B. (2001). Toward an Implied Cost of Capital. Journal of Accounting Research, 39(1), 135–176. https://doi.org/10.1111/1475-679X.00007

The toy finding being illustrated — higher disclosure quality lowers the implied cost of capital — uses the disaggregation measure of Chen et al. (2015) and the implied-cost-of-capital construct of Gebhardt, Lee, and Swaminathan (2001).


Promotion: The Workflow That Makes the Graph Move

A graph that never updates is a diagram. The mechanism that makes this system kinetic — that distinguishes a research workflow from a knowledge-management hobby — is promotion.

A front is a working hypothesis: code is attached, evidence is accumulating, the result is provisional. A finding is a validated claim: the evidence has held up across the robustness suite, the magnitude is stable, the lead author is willing to defend it.

Promotion is the moment one becomes the other. In file-system terms it is a single mechanical action:

graph/fronts/RI_internal_control/_node.md
  →  graph/findings/F04_icw_predicts_crash.md

Every edge that pointed at the front rewires to point at the finding. The front file is archived but not deleted — its history is the audit trail of how the claim became defensible.

Promotion is the act that distinguishes exploring from knowing. Most research projects never make this distinction explicit, which is why their authors lose track of which results they actually believe. Making promotion a literal git mv forces the question to be answered in the open, on a specific date, in a specific commit.
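
As a sketch, the entire promotion can be one atomic operation. The paths mirror the example above; the rewiring step is a deliberately naive string replace, which assumes node ids are unique and appear only as references:

```python
import subprocess
from pathlib import Path

def promote(front_path: str, finding_path: str, old_id: str, new_id: str) -> None:
    """Promote a front to a finding: git mv the node, then rewire its edges."""
    subprocess.run(["git", "mv", front_path, finding_path], check=True)
    for path in Path("graph").rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        if old_id in text:  # every edge that pointed at the front now points at the finding
            path.write_text(text.replace(old_id, new_id), encoding="utf-8")
    subprocess.run(
        ["git", "commit", "-am", f"PROMOTION: {old_id} -> {new_id}"], check=True
    )

# promote("graph/fronts/RI_internal_control/_node.md",
#         "graph/findings/F04_icw_predicts_crash.md",
#         "RI_internal_control", "F04_icw_predicts_crash")
```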


What the Graph Buys You

Restructuring a project this way is not architectural cosplay. It changes how you and your AI agents interact with the work.

1. Automated Epistemic Auditing

When every concept is a node connected by edges, orphan detection becomes trivial. Run a script over the graph and surface:

  • Measures with no used_by_finding edge — engineered code that produces nothing the paper relies on. Cut it.
  • Findings with no cited_by_prose edge — results that exist in your tables folder but appear nowhere in the manuscript. Either write them up or stop maintaining them.
  • Measures with no justified_by edge to literature — empirical choices that have no precedent in the cited record. The exact gap a referee will find.

This is the audit that the lead author normally performs by reading the manuscript end-to-end and silently worrying. The graph turns it into a daily report.
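
A minimal version of that report, reusing the hypothetical load_graph and reverse_index sketches from earlier (the node types and edge names are the illustrative ones used above; rename to taste):

```python
def orphan_report(nodes: dict) -> list:
    """Flag nodes missing the incoming or outgoing edges the paper's logic requires."""
    incoming = reverse_index(nodes)
    report = []
    for node_id, node in nodes.items():
        kind = node["meta"].get("type")
        incoming_types = {etype for _, etype in incoming[node_id]}
        outgoing_types = {etype for etype, _ in node["edges"]}
        # "no used_by_finding edge", seen from the measure, is a missing
        # incoming uses_measures edge
        if kind == "measure" and "uses_measures" not in incoming_types:
            report.append(f"ORPHAN measure: {node_id} feeds no finding")
        # "no cited_by_prose edge" is a missing incoming references_findings edge
        if kind == "finding" and "references_findings" not in incoming_types:
            report.append(f"ORPHAN finding: {node_id} appears in no prose")
        if kind == "measure" and "justified_by" not in outgoing_types:
            report.append(f"UNJUSTIFIED measure: {node_id} has no literature edge")
    return report
```

Wire it into a cron job or a pre-commit hook and the report arrives daily instead of living in the lead author's head.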

2. Targeted AI Subagents

Feed a 60-page PDF and a 5,000-line codebase to an LLM and its attention will fray. Hallucination rates climb. Details silently drop out of the context window. The standard failure mode of AI-assisted writing is that the agent has too much context, not too little.

A graph inverts this. When a subagent is dispatched to draft §4.2 (Robustness Tests), it does not read the whole repo. It reads graph/findings/F03.md, follows the uses_measures edge to M_icw_dummy.md, follows consumes_data to D_audit_analytics.md, follows justified_by to papers/doyle_ge_mcvay_2007.md, and stops. Four files. Maybe three thousand tokens. No hallucination surface, because every fact it could state is grounded in a node it just read.

The graph is the index. The subagent is a graph traversal with a writing assignment at the end of it.
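
The traversal itself is a short breadth-first walk with a hard cap. A sketch, again assuming the hypothetical loader from earlier:

```python
def context_bundle(nodes: dict, start: str, max_nodes: int = 8) -> list:
    """Follow outgoing edges from a node; collect the files a subagent should read."""
    seen, queue, files = {start}, [start], []
    while queue and len(files) < max_nodes:
        node = nodes[queue.pop(0)]
        files.append(node["path"])
        for _, target in node["edges"]:
            if target in nodes and target not in seen:
                seen.add(target)
                queue.append(target)
    return files

# context_bundle(load_graph(), "F03_robustness_alt_measures")
# -> the finding, its measure, the underlying data node, and the justifying
#    paper: four files, then stop.
```

The max_nodes cap bounds the subagent's context by construction rather than by hoping the model ignores the rest of the repo.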

3. STATUS.md Becomes the Index, Not the State

A reasonable concern after the previous post: doesn't this contradict the "everything in STATUS.md" model?

It does not. STATUS.md was the right tool when the project was a checklist. Once the project has thirty measures and eight fronts, STATUS.md is not deprecated — it is promoted, in exactly the same sense as a front being promoted to a finding. It stops being the state of the project and becomes the rolling report that the graph generates: which fronts are active, which findings are stable, which orphans were detected this week, which next steps the lead author should pick up.

The five-milestone tree from the previous post still anchors the report. The contents of each milestone now come from a graph query rather than a hand-edited bullet list.


The Manuscript as a Serialization

The file system has been the universal interface for technical collaboration for fifty years. The previous post argued that human–AI research partnerships should use that same interface, extended into the parts of the project that previously lived only in researchers' heads.

This post is the natural next step. Once the file system is the collaboration medium, the structure of that file system stops being a flat checklist and becomes a graph. Findings link to measures link to data link to instruments link to literature. Fronts get promoted into findings by a single mechanical rewrite. AI subagents traverse the graph instead of inhaling the whole repository. Orphan detection runs on a schedule.

A research project is not a document you write. It is a graph of interconnected truths that you build over years.

The final manuscript is just one possible serialization of that graph — the one optimized for a human reader to consume in three hours, in order, from abstract to conclusion. There are other serializations: a slide deck, a referee response letter, a discussant's comment, a job market talk. Each is a different walk of the same underlying graph.

Build the graph first. The serializations follow.