Skills Working Together: The Application Layer of a Human-AI Research OS

Part 4 of the Human-AI Research Partnerships series.

A Friday Night in 2016

It is Friday, 11pm, in my PhD office. The building is empty. Everyone left hours ago — some to bars, some to families, some to sleep. I am still here, overwhelmed and absorbed by a problem I cannot solve alone.

I have written three pages of notes. I have read the same paragraph in Ohlson (1995) four times. I think I see something — a connection between the measurement approach in my data and a theoretical restriction I had not considered — but I am not sure. I need someone to push back, to ask the hard question, to tell me whether this is insight or self-deception.

My next meeting with my advisor is Tuesday. Until then, I do what PhD students have always done: I wait. I read more papers. I second-guess. I lose momentum. The cycle is weekly at best: explore on my own, meet with the advisor, get feedback, go back and revise. A good week produces one real iteration.


A Friday Night in 2026

It is Friday, 11pm. I type:

/prof-storming jobs

Within seconds, Steve Jobs — a pseudonym for a well-known, widely cited full professor at a top research university, someone whose work has shaped multiple subfields over decades — asks me two questions. Or rather, the skill that channels him does. Not generic questions. Questions grounded in the 43 papers I have compiled from Jobs's work: papers spanning disclosure regulation, causal inference methodology, capital market consequences, valuation, and real effects of accounting — published across nearly four decades in the Journal of Accounting and Economics, the Journal of Accounting Research, The Accounting Review, the Review of Accounting Studies, the Review of Financial Studies, the Journal of Financial Economics, the Quarterly Journal of Economics, and the American Economic Review.

The skill has read every section of those papers. Not summaries — the actual methodology sections, the actual identification arguments, the actual robustness discussions. When it asks "why are you not using the specification from Cook and Jobs (2019, §3.2) here?" it is citing a real paragraph it loaded thirty seconds ago.

I answer. It pushes back. I revise my argument. It points to a tension with something Jobs wrote in 2014 about exactly this measurement problem. I had forgotten that paper existed. The skill did not forget — it indexed all 43 papers by topic when I onboarded the advisor, and it retrieved the relevant sections the moment my question matched the topic cluster.

Twenty minutes later, I have a sharper hypothesis, two new robustness tests to run, and a clear sense of why my original intuition was half-right and half-wrong.

The weekly iteration — explore, advise, improve — just happened in twenty minutes instead of seven days.


What Changed

The difference between 2016 and 2026 is not that I work harder or that AI is "smart." The difference is architectural. Over the past two months (since about March 14, 2026), I have built a collection of 50+ skills, small composable programs that each do one thing well, and wired them together into a system where the full research lifecycle can run at the speed of thought rather than the speed of scheduling.

This post is about what those skills do together. Parts 1 through 3 of this series built the infrastructure: the file system as collaboration medium, the research graph as data model, and the operating system layer that makes the graph queryable. This post — Part 4 — is the application layer. The place where a researcher actually lives.

[Figure: the skill stack in three layers. Each skill family is tagged with its autonomy level.]

APPLICATION LAYER: what the researcher touches

  • Advisors (augmented judgment): prof-storming, prof-officehour, prof-writing, prof-add-advisor
  • Empirical (augmented execution): research-iteration, empirical, stata, diffindiff-guide
  • Writing (augmented judgment): prof-writing, language-editing, wordcraft-abstract
  • Review (couldn't do alone): editorial-review ×7, econfin-feedback, tar-reviewer2

SERVICE LAYER: acquisition, processing, formatting

  • Literature (full delegation): lit-* (9 skills), scholar, gs-get-pdf, local-zotero
  • Output (augmented execution): latex-document, nature-figure, drawio, slidev, reg-table
  • Domain Experts (couldn't do alone): penman-valuation, wooldridge, enders, lean4-prover
  • Utility (full delegation): claude-doc, manuscript-audit, run-sas-via-ssh

INFRASTRUCTURE LAYER: state, memory, continuity (full delegation, runs beneath every session)

  • Session: session-start, wrap-session, handoff, big-picture-questions
  • Graph: init-graphs, graph-query, add-to-graph
  • Pipeline: ssot-management, extract-qa, repo-to-collab

The Researcher's Lifecycle

A research project, from first idea to published paper, requires the researcher to do roughly ten things:

  1. Get oriented — remember where you left off
  2. Find and read literature — know what exists
  3. Think through research design — develop hypotheses, choose identification
  4. Run empirical tests — estimate, diagnose, iterate
  5. Track decisions and provenance — remember why you did what you did
  6. Write prose — turn results into argument
  7. Get feedback — simulate or seek peer review
  8. Format outputs — tables, figures, slides
  9. Verify everything — citations, numbers, cross-references
  10. Onboard collaborators — bring someone up to speed

In 2016, I did all ten alone or with weekly advisor input. In 2026, each maps to a skill family — and the families talk to each other.


The Advisors Who Never Sleep

The /prof-storming skill is not a chatbot pretending to be a professor. It is a retrieval-augmented Socratic advisor built on a library of real papers.

When I onboard an advisor — say, Steve Jobs — the system processes every paper through parallel agents that extract verbatim sections, classify them by topic, and build an index. For Jobs, that is 43 papers across 9 topic clusters: valuation models, disclosure theory, causal inference, regulatory effects, measurement, financial reporting, capital markets, archival methods, and structural modeling. Each cluster contains the exact paragraphs where Jobs made arguments, chose specifications, or defended decisions.

When I invoke /prof-storming jobs, the skill:

  1. Asks 2-4 clarifying questions (Socratic, not generic)
  2. Retrieves the sections most relevant to my specific question
  3. Delivers grounded advice — citing specific papers and sections
  4. Produces concrete action items and skill handoffs
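
The retrieval step in that list can be sketched roughly as follows. This is a minimal illustration in Python, not the skill's actual implementation: the `Section` record, the keyword-overlap matching, and the sample library are all assumptions made for the example, and a real system would likely use a richer matching method.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    paper: str    # hypothetical label for a paper in the advisor library
    locator: str  # section reference, e.g. "§3.2"
    topic: str    # topic cluster assigned at onboarding
    text: str     # verbatim section text


def build_index(sections):
    """Group sections by topic cluster; done once, when the advisor is onboarded."""
    index = defaultdict(list)
    for section in sections:
        index[section.topic].append(section)
    return index


def retrieve(index, question, topic_keywords, limit=3):
    """Return up to `limit` sections from topics whose keywords occur in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    hits = []
    for topic, keywords in topic_keywords.items():
        if words & keywords:
            hits.extend(index.get(topic, []))
    return hits[:limit]


def cite(section):
    """Format a grounded citation: paper plus section locator."""
    return f"{section.paper}, {section.locator}"


# Hypothetical library and keyword map, for illustration only.
LIBRARY = [
    Section("Paper A", "§3.2", "causal inference", "(verbatim text)"),
    Section("Paper B", "§2", "measurement", "(verbatim text)"),
]
KEYWORDS = {
    "causal inference": {"identification", "instrument", "causal"},
    "measurement": {"measurement", "proxy", "error"},
}

index = build_index(LIBRARY)
hits = retrieve(index, "Is measurement error driving my result?", KEYWORDS)
print([cite(h) for h in hits])  # ['Paper B, §2']
```

The point of the design is that every piece of advice carries a paper and a section locator, so a citation can be checked against the source rather than taken on trust.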

The result is not "what would a generic professor say." It is "what would Steve Jobs say, given that he wrote this exact thing in JAR 2017 §4.2 about this exact problem."

I have four advisors onboarded. I can consult any of them at any hour.


The Literature Machine

Finding papers used to mean hours on Google Scholar, downloading PDFs one by one, reading abstracts to decide relevance, and manually maintaining a bibliography.

Now I say /scholar and an orchestrator dispatches seven specialized sub-agents in parallel: one searches my local collection, one scouts Google Scholar, one downloads PDFs, one converts them to readable format, one updates references.bib, one verifies citations against source text, and one manages the catalog.
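
The fan-out pattern described above could look something like the sketch below. It assumes the seven sub-agents are independent enough to run concurrently. The agent labels and the `run_agent` stub are hypothetical stand-ins, not the real skill names or their behavior.

```python
import asyncio

# Hypothetical labels for the seven sub-agents described in the text.
SUB_AGENTS = [
    "search-local-collection",
    "scout-google-scholar",
    "download-pdfs",
    "convert-to-readable",
    "update-references-bib",
    "verify-citations",
    "manage-catalog",
]


async def run_agent(name, query):
    """Stand-in for one sub-agent; a real one would call a tool or a model."""
    await asyncio.sleep(0)  # yield control, simulating I/O-bound work
    return {"agent": name, "query": query, "status": "done"}


async def orchestrate(query):
    """Dispatch every sub-agent concurrently and collect results in dispatch order."""
    tasks = [run_agent(name, query) for name in SUB_AGENTS]
    return await asyncio.gather(*tasks)


def scholar(query):
    """Synchronous entry point, loosely mirroring the /scholar command."""
    return asyncio.run(orchestrate(query))


results = scholar("disclosure regulation")
print(len(results))  # 7
```

In practice some of these steps depend on others (a PDF has to be downloaded before it can be converted), so a real orchestrator would presumably chain those stages per paper and parallelize across papers.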

The literature family — nine skills total — handles the full pipeline from "I need papers on X" to "here are the relevant citations, verified against the source PDFs, added to your bibliography." What used to take a weekend now takes minutes.


The Empirical Engine

/research-iteration is a four-stage loop: advise, estimate, build table, write prose. Each stage calls other skills:

  • The advise stage consults the advisor personas for specification guidance
  • The estimate stage runs Stata regressions via /empirical
  • The table stage formats output to journal standards via /reg-table
  • The prose stage drafts results sections in a named writing style

Between stages, the researcher reviews and redirects. The AI proposes; the human decides. But the execution — writing the do-file, running the regression, formatting the table, drafting the paragraph — happens in minutes, not days.
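
One way to picture the loop with its review gates is the sketch below. It is an assumption about the control flow, not the skill's code: `run_stage` and `review` are hypothetical callbacks standing in for the AI's execution and the researcher's decision.

```python
STAGES = ["advise", "estimate", "table", "prose"]


def research_iteration(run_stage, review, max_rounds=3):
    """Run the four stages in order, pausing for human review after each.

    run_stage(stage, notes) -> draft output for that stage
    review(stage, draft)    -> ("accept", None) or ("redo", notes)
    """
    outputs = {}
    for stage in STAGES:
        notes = None
        for _ in range(max_rounds):
            draft = run_stage(stage, notes)
            decision, notes = review(stage, draft)
            if decision == "accept":
                outputs[stage] = draft
                break
        else:
            # The researcher never accepted this stage; stop rather than proceed.
            raise RuntimeError(f"stage {stage!r} not accepted after {max_rounds} rounds")
    return outputs


def demo_run(stage, notes):
    """Stub executor: returns a labelled draft, noting any redirect it received."""
    return f"{stage} draft" + (f" (revised: {notes})" if notes else "")


def demo_review(stage, draft):
    """Stub reviewer: redirects the estimate stage once, accepts everything else."""
    if stage == "estimate" and "revised" not in draft:
        return "redo", "cluster by firm"
    return "accept", None


result = research_iteration(demo_run, demo_review)
print(result["estimate"])  # estimate draft (revised: cluster by firm)
```

No stage's output moves forward until the human accepts it, which is what "the AI proposes; the human decides" means in control-flow terms.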

When I suspect my difference-in-differences design has a staggered-adoption problem, I invoke /diffindiff-guide and get a diagnostic grounded in Baker et al.'s (2025) Practitioner's Guide. When I need to think through an instrumental variables strategy, /wooldridge answers in the voice of the textbook: precise, assumption-driven, conditions stated explicitly.


The Memory Layer

The hardest problem in a multi-year research project is not computation. It is forgetting.

Why did we drop that control variable in March? What was the referee's concern about endogeneity, and how did we address it? Which script produces Table 3, and what data does it require?

The graph family, three skills that maintain a structured knowledge graph of every finding, decision, measure, literature citation, and piece of advice, solves this. When I return to a project after two months away, I type /session-start and the system tells me exactly where I left off: what milestones are complete, what blockers exist, what the next action is.

When I ask "why did we switch from OLS to IV in the main specification?" the graph answers from the advice node that recorded that decision, linked to the advisor session that recommended it, linked to the literature node that justified the instrument.
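
A provenance query like that one amounts to following links between nodes. The sketch below assumes a very simple schema (a node has a kind, a summary, and a list of supporting node ids). The schema, the ids, and the `explain` helper are illustrative, not the actual graph format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    kind: str      # e.g. "advice", "session", "literature"
    summary: str
    links: list = field(default_factory=list)  # ids of supporting nodes


def explain(graph, node_id):
    """Follow supporting links from a node and return the provenance chain."""
    chain, seen, stack = [], set(), [node_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the graph
        seen.add(current)
        node = graph[current]
        chain.append(f"{node.kind}: {node.summary}")
        stack.extend(reversed(node.links))
    return chain


# Hypothetical three-node graph for the OLS-to-IV question.
nodes = [
    Node("advice-1", "advice", "switch main specification from OLS to IV", ["session-1"]),
    Node("session-1", "session", "advisor session that recommended IV", ["lit-1"]),
    Node("lit-1", "literature", "paper justifying the instrument"),
]
graph = {n.id: n for n in nodes}

for line in explain(graph, "advice-1"):
    print(line)
# advice: switch main specification from OLS to IV
# session: advisor session that recommended IV
# literature: paper justifying the instrument
```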

Nothing is lost. The weekly iteration between 2016-me and my advisor — where half the meeting was spent re-establishing context — is gone. Context is always available, instantly, because it lives in the graph.


The Referee Gauntlet

Before submitting a paper, I used to have one option: send it and wait six months for referee reports.

Now I invoke /ear-editorial-review or /car-editorial-review or any of the seven journal-specific review skills. Each dispatches a panel of 6-8 simulated reviewers — an editor-in-chief for desk-reject screening, associate editors with relevant expertise, field experts, anonymous referees, and a manuscript auditor.

The reports come back in minutes. They are not always right, but they are almost always useful: they catch the gap in your robustness section, the uncited paper from 2023 that does something similar, the identification concern a real Reviewer 2 would raise. I revise before submitting. The hit rate at journals improves.

This is something I genuinely could not do alone. No amount of self-editing replicates the experience of hostile-but-fair external readers. Having seven journal-calibrated panels available on demand is a capability that did not exist at any price in 2016.


The Autonomy Spectrum

Not every skill works the same way. I think of them on a spectrum:

Full delegation — the AI does the task end-to-end; I review the output. Session infrastructure, literature acquisition, graph maintenance, citation verification. These are tasks where human judgment adds little to the execution — the value is in having them done consistently and fast.

Augmented judgment — I direct; the AI provides substance and pushback. Advisor personas, research design iteration, prose writing. These are tasks where I need to think, but thinking alone is slower and worse than thinking with a grounded interlocutor.

Couldn't do alone — capabilities beyond what a solo researcher can achieve. Simulated referee panels across seven journals. Formal theorem proving in Lean 4. Processing twelve papers through parallel agents simultaneously. These are not "faster versions of what I already did." They are things I simply did not do before because they were impossible for one person.
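
As a rough sketch, the spectrum can be written down as a lookup from skill family to autonomy level, using the labels from the layer diagram earlier in this post (which also carries a fourth label, augmented execution). The enum and dictionary names are my own illustration, not part of the skill collection.

```python
from enum import Enum


class Autonomy(Enum):
    FULL_DELEGATION = "full delegation"
    AUGMENTED_EXECUTION = "augmented execution"
    AUGMENTED_JUDGMENT = "augmented judgment"
    COULDNT_DO_ALONE = "couldn't do alone"


# Family-to-level mapping as labelled in the layer diagram.
FAMILY_AUTONOMY = {
    "advisors": Autonomy.AUGMENTED_JUDGMENT,
    "empirical": Autonomy.AUGMENTED_EXECUTION,
    "writing": Autonomy.AUGMENTED_JUDGMENT,
    "review": Autonomy.COULDNT_DO_ALONE,
    "literature": Autonomy.FULL_DELEGATION,
    "output": Autonomy.AUGMENTED_EXECUTION,
    "domain experts": Autonomy.COULDNT_DO_ALONE,
    "utility": Autonomy.FULL_DELEGATION,
    "session": Autonomy.FULL_DELEGATION,
    "graph": Autonomy.FULL_DELEGATION,
    "pipeline": Autonomy.FULL_DELEGATION,
}


def families_at(level):
    """List the skill families tagged with a given autonomy level, alphabetically."""
    return sorted(f for f, a in FAMILY_AUTONOMY.items() if a is level)


print(families_at(Autonomy.COULDNT_DO_ALONE))  # ['domain experts', 'review']
```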


The Circle Shrinks

In 2016, the research iteration cycle was weekly. Explore on my own. Meet the advisor. Get feedback. Go revise. Repeat.

In 2026, the cycle is hourly. Sometimes faster. I can move from question to literature to hypothesis to estimation to results to prose to simulated review in a single extended session. Not because any individual step is instant — thinking still takes time — but because the dead time between steps has collapsed. There is no waiting for the advisor's calendar, no waiting for the RA to run the code, no waiting for the co-author to read the draft.

The constraint has shifted from access to judgment. The bottleneck is no longer "can I get feedback on this?" It is "is my question good enough to deserve feedback?" That is a better bottleneck to have.

[Figure: the research cycle, with the researcher at the center (hourly cycle). Orient (session-start), Read (lit-*, scholar), Think (prof-*), Test (empirical), Track (graph), Write (prose), Review (editorial), Format (latex, figures), Verify (lit-verify), Onboard (repo-to-collab). 2016: weekly iteration, 7 days per cycle. 2026: hourly iteration, 1–2 hrs per cycle.]

What This Is Not

This is not a story about AI replacing researchers. Every skill in the collection requires a human who knows what question to ask, what result to trust, and what direction to take. The AI does not choose the research question. It does not decide whether an identification strategy is credible. It does not know whether a result is economically meaningful or statistically trivial.

What it does is remove the friction between having a thought and testing it. Between needing feedback and getting it. Between knowing what to write and having it written well.

The researcher is still the researcher. The work is still the work. But the dead time — the waiting, the forgetting, the rebuilding of context, the logistical overhead of being a single person trying to do ten different jobs — that part is largely gone.


Previous: Part 3: Graph-Based Research Operating System.