At Airtree, we’re building agents to automate parts of our investment process. Company research, meeting preparation, deal analysis, routine due diligence formatting. The kind of knowledge work that consumes most of a venture investor’s week. We’re about three months in, and the decisions we’re navigating are the same ones every team building agent products is facing right now.
This piece is about those decisions. What questions to ask first, what the options look like, and what tradeoffs you’ll hit if you’re designing a system of AI agents for knowledge work today.
Do you actually need agents?
Anthropic’s “Building Effective Agents” guide, based on their work with dozens of teams, opens with a warning: find the simplest solution possible, and only increase complexity when needed, which might mean not building agentic systems at all. If I want to summarise a research paper, I don’t need an agent.
Agents earn their place when a task requires branching or iteration. Our company research workflow needs to be an agent: it searches multiple data sources, decides which leads to follow based on what it finds, loops back when initial results are thin, and synthesises across sources.
What changed in twelve months
A year ago, the orchestration decision was primarily about frameworks. LangGraph or CrewAI? AutoGen or something custom? Teams spent weeks evaluating options and debating architecture.
That debate has largely resolved itself. The frameworks increasingly overlap in what they offer. LangGraph has fine-grained state management. CrewAI suits team-like coordination. Pick the one that fits your stack and move on. The practical consensus from every credible source we’ve read is the same: a single agent with a ReAct loop and well-chosen tools handles most real-world tasks. Only add multi-agent complexity when you have clear evidence a single agent can’t meet your quality bar.
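To make that concrete, here’s roughly what a single-agent ReAct loop looks like against Anthropic’s Messages API. This is a sketch, not our production code: the tool definition, the stubbed dispatcher, and the model name are all illustrative.

```python
import anthropic

client = anthropic.Anthropic()

# Illustrative tool definition -- in practice these wrap your data sources.
TOOLS = [
    {
        "name": "search_companies",
        "description": "Search internal and external company databases.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]

def run_tool(name: str, args: dict) -> str:
    # Dispatch to your real integrations here; stubbed for the sketch.
    return f"(stub result for {name} with {args})"

def react_loop(task: str, max_steps: int = 20) -> str:
    """One agent: reason, call a tool, observe the result, repeat until done."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-5",  # any tool-capable model
            max_tokens=2048,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # The model decided it has enough; return its final text.
            return next(b.text for b in response.content if b.type == "text")
        # Execute each requested tool and feed the results back as observations.
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent did not finish within max_steps")
```

Everything interesting about the agent lives in the tools and the prompt, not in this loop. That’s the point: the loop is boring, so spend your effort elsewhere.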
But the bigger change happened underneath the frameworks: protocols. Anthropic’s Model Context Protocol has become the standard for connecting agents to tools and data. OpenAI and Google both adopted it, and every major coding tool supports it. MCP, OpenAI’s AGENTS.md convention (already in 60,000+ repositories), and Block’s goose were donated to the newly formed Agentic AI Foundation under the Linux Foundation in December 2025; Google’s Agent-to-Agent protocol moved to the Linux Foundation earlier in the year. The protocol layer is converging under open governance. A year ago, connecting an agent to a new data source meant writing custom integration code. Now you look for an MCP server.
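In practice, talking to an MCP server is a few lines with the official Python SDK. This sketch uses the reference filesystem server; the tool name and arguments depend on the server you connect to, so treat them as illustrative:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# The reference filesystem server; any MCP server is wired up the same way.
server = StdioServerParameters(
    command="npx",
    args=["-y", "@modelcontextprotocol/server-filesystem", "/data/research"],
)

async def main() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover the server's tools instead of hand-wiring integrations.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])
            # Tool names and schemas vary by server.
            result = await session.call_tool(
                "read_file", {"path": "/data/research/notes.md"}
            )
            print(result.content)

asyncio.run(main())
```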
Teams are also now treating architecture as a product decision, not just a technical one. Anthropic, Google, and Microsoft have all published strikingly similar pattern taxonomies. The right pattern depends entirely on your task structure. A plan-and-execute split, where a capable model creates a plan and cheaper models execute the steps, can cut costs by an order of magnitude. A router that triages simple queries to a fast model and escalates complex ones keeps average costs low without sacrificing quality on the hard problems. These are product decisions disguised as architecture decisions, and they compound. The wrong pattern can make a profitable product uneconomic at scale.
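A minimal version of that router pattern looks like the sketch below. The model names are illustrative placeholders for whatever cheap/capable pair you use; the economics come from the cost and latency gap between them.

```python
import anthropic

client = anthropic.Anthropic()

# Illustrative model names; substitute your own cheap/capable pair.
FAST_MODEL = "claude-haiku-4-5"
CAPABLE_MODEL = "claude-sonnet-4-5"

def answer(query: str) -> str:
    """Triage with the cheap model; escalate only what it flags as complex."""
    triage = client.messages.create(
        model=FAST_MODEL,
        max_tokens=5,
        system="Reply with exactly one word, SIMPLE or COMPLEX, describing this request.",
        messages=[{"role": "user", "content": query}],
    )
    label = triage.content[0].text.strip().upper()
    model = FAST_MODEL if label == "SIMPLE" else CAPABLE_MODEL
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": query}],
    )
    return response.content[0].text
```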
Context is the real engineering problem
The hardest part of building our system isn’t choosing the framework or the pattern. It’s context management.
Anthropic’s multi-agent research system found that token usage explained 80% of performance variance. That’s a single finding from one system built for information retrieval, where search breadth naturally drives results. For tasks where the bottleneck is reasoning quality or multi-step planning, model capability still dominates. But for our use case – research agents synthesising across data sources – it matches what we’re experiencing. Our agents get meaningfully better when we focus on curating what goes into the context window rather than experimenting with different models.
The production teams doing this well have adopted a pattern called progressive disclosure, letting agents discover context incrementally rather than receiving everything upfront. Cursor’s dynamic context discovery, where tool outputs and definitions load on-demand, cuts token usage by 46.9%. Vercel deleted 80% of their agent’s tools and watched their worst-case query go from failing in 724 seconds over 100 steps to completing in 141 seconds over 19. They’d had too many tools to begin with, which is part of the point. The improvement from removing context rather than adding it was striking.
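One way to implement progressive disclosure (a sketch of the general idea, not Cursor’s actual mechanism): start the agent with a couple of meta-tools and let it pull full definitions into context only when it needs them. All names here are hypothetical.

```python
# The full catalog lives outside the context window; the agent sees only
# names and one-line summaries until it asks for more.
TOOL_CATALOG = {
    "search_filings": {
        "summary": "Search regulatory filings by keyword.",
        "schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    # ...dozens more entries in a real catalog
}

META_TOOLS = [
    {
        "name": "list_tools",
        "description": "List available tool names with one-line summaries.",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "load_tool",
        "description": "Load the full definition of a named tool into context.",
        "input_schema": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        },
    },
]

active_tools = list(META_TOOLS)  # the agent starts with only the meta-tools

def handle_meta_tool(name: str, args: dict) -> str:
    if name == "list_tools":
        return "\n".join(f"{n}: {t['summary']}" for n, t in TOOL_CATALOG.items())
    if name == "load_tool":
        spec = TOOL_CATALOG[args["name"]]
        # Only now does the full schema enter the agent's tool list.
        active_tools.append({
            "name": args["name"],
            "description": spec["summary"],
            "input_schema": spec["schema"],
        })
        return f"{args['name']} is now available."
    raise ValueError(f"unknown meta-tool: {name}")
```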
Model capability made simple scaffolds viable, but within that range, the scaffold determines whether the product works. Manus has rewritten their agent framework four times since launch, and each rewrite focused on shaping context more precisely. Their heuristic is a good one: if your scaffold is getting more complex while models get better, something is wrong.
How do you know it’s working?
This is where most teams, including us, underinvest.
A coding agent can run tests. A customer support agent can measure resolution rates. But how do you evaluate whether a research agent’s synthesis is actually good? Whether it found the right sources? Whether it missed something important?
We don’t have this solved. What we’re exploring is a combination of automated checks and regular human review of a sample of outputs. The automated checks catch mechanical failures: did the agent use the data sources it was supposed to, did it hit the right APIs, is the output structured correctly. The human review catches quality failures that no automated check would surface. Neither alone is sufficient.
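For illustration, the mechanical layer can be as simple as schema validation plus assertions over the tool-call trace. The CompanyBrief schema and required-tool set below are invented for the example:

```python
from pydantic import BaseModel, ValidationError

class CompanyBrief(BaseModel):
    company: str
    summary: str
    sources: list[str]

REQUIRED_TOOLS = {"search_filings", "search_news"}  # per-workflow expectations

def mechanical_checks(raw_output: str, tool_calls: list[str]) -> list[str]:
    """Cheap automated checks; synthesis quality still goes to human review."""
    failures = []
    missing = REQUIRED_TOOLS - set(tool_calls)
    if missing:
        failures.append(f"agent never called: {sorted(missing)}")
    try:
        brief = CompanyBrief.model_validate_json(raw_output)
        if not brief.sources:
            failures.append("brief cites no sources")
    except ValidationError as exc:
        failures.append(f"output schema invalid: {exc}")
    return failures
```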
The human-in-the-loop design is arguably the hardest product question in the whole system. Observability, at least, is maturing: 89% of respondents to LangChain’s survey (which skews toward AI-forward teams) have it in place, and OpenTelemetry’s GenAI semantic conventions are becoming the standard for instrumentation. But observability only tells you what happened. The design questions are what should trigger human intervention, and what that intervention experience feels like.
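Instrumenting an LLM call with those conventions looks roughly like this. The attribute names come from the GenAI semantic conventions, which are still marked experimental, so pin the semconv version you target:

```python
import anthropic
from opentelemetry import trace

client = anthropic.Anthropic()
tracer = trace.get_tracer("research-agent")

def traced_chat(model: str, messages: list) -> anthropic.types.Message:
    # Span and attribute names follow the OTel GenAI semantic conventions.
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", model)
        response = client.messages.create(
            model=model, max_tokens=1024, messages=messages
        )
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        return response
```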
Stripe reviews every AI-generated pull request before merge. That works for code, where review is asynchronous and the cost of a bad merge is high. It doesn’t transfer to workflows where the agent needs to act in real time. We’re still working through where our review gates should sit for different workflows. When should an agent surface findings for a human investor to evaluate, and when can it act on its own? The right intervention rate depends on the cost of errors in your domain. For customer service routing, best-in-class teams hit single-digit escalation rates. For investment research, we deliberately keep humans in the loop. Getting the triggers right is product craft, not engineering.
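To make the shape of those triggers concrete, here’s a deliberately simple sketch. The fields and thresholds are invented, and the right values depend entirely on the error costs in your domain:

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    summary: str
    confidence: float  # agent's self-reported confidence, 0-1
    actions: list[str] = field(default_factory=list)  # side effects it wants

AUTO_PUBLISH = 0.9  # placeholder thresholds, tuned per workflow
DISCARD = 0.3

def triage(finding: Finding) -> str:
    if finding.actions:
        return "queue_for_review"  # anything with side effects gets a human
    if finding.confidence >= AUTO_PUBLISH:
        return "auto_publish"
    if finding.confidence < DISCARD:
        return "discard"
    return "queue_for_review"
```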
What we’re still figuring out
The honest answer is that we’re early. Our research agent produces a company brief in about three minutes that used to take an analyst a few hours. But the system is rough, and we can’t yet quantify the full ROI with confidence.
Context engineering is harder than we expected, and the tooling lags the need. Understanding what your agent sees at each step, and why it made the decisions it did, requires custom instrumentation, and we haven’t yet found tooling that does this well for knowledge work.
Security is another dimension. Agents that access internal data, make API calls, and take actions on behalf of users widen the attack surface: prompt injection, data leakage, and over-permissioned tools are all risks that need deliberate mitigation.
The models keep getting better. Stripe’s one-shot “minions,” generating over 1,300 pull requests a week with no multi-turn reasoning, wouldn’t have been possible two years ago. As models improve, simpler patterns start handling harder problems. The system you design today may need less scaffolding in six months.
We think the design of AI agent systems for knowledge work is one of the most important product problems of 2026. The framework layer is commoditising. The protocol layer is standardising. What isn’t commoditised yet is domain-specific context engineering and the product design of human-agent collaboration. Claude Cowork is a first step to solving that problem, but it’s still rudimentary.
The lessons from building our own system are proving more useful than any “thesis development” we’ve done. If you’re building in this space, or if you’ve figured out evals for knowledge-work agents, we’d love to hear from you.