When Airwallex started building AirDev, their internal AI coding platform, they ran into a problem that no off-the-shelf tool could solve. Generic coding assistants don’t understand Airwallex’s conventions for propagating configuration changes across environments. They don’t know the patterns for creating new API endpoints, or how infrastructure updates need to be replicated across a global deployment.
The models could write code. They couldn’t write Airwallex code.
The Airwallex team has a line that captures the ambition precisely: “the agent’s code should be indistinguishable from code written by a human engineer familiar with the repository.” “Familiar” means an engineer who has absorbed months of institutional context about how this particular codebase functions.
The frontier models are extraordinary. But generic tools built on those models can produce code that is plausible rather than correct. The companies seeing real results are the ones that have built the deepest context layer around whatever model they happen to be using.
What context actually looks like
Stripe built its own platform as a fork of Block’s open-source Goose agent, customising the orchestration flow to interleave agent loops with deterministic code for git operations, linters, testing, and so on. The result is that “minion” runs mix the creativity of an agent with the assurance that they will always complete Stripe-required steps. When a Stripe engineer asks an agent to make a change, the agent doesn’t just see the code. It sees why the code is structured that way, which related systems it touches, and which conventions apply to that specific subdirectory. The agent consumes “coding agent rule files” that change depending on which part of the codebase you’re working in: the rules for the payments core are different from the rules for the dashboard frontend.
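To make the pattern concrete, here is a minimal sketch of that orchestration shape: a bounded loop that alternates creative agent steps with deterministic gates, and a rule loader that layers per-directory rule files from repository root down to the target directory. All names here (`AGENT_RULES.md`, `minion_run`, the lint and test commands) are illustrative assumptions, not Stripe’s actual implementation.

```python
from pathlib import Path
import subprocess


def load_rule_files(target_dir: Path) -> list[str]:
    """Collect rule files from the repo root down to the target directory.
    Deeper (more specific) rules apply on top of broader ones.
    The file name 'AGENT_RULES.md' is a hypothetical convention."""
    rules = []
    for directory in [*reversed(target_dir.parents), target_dir]:
        candidate = directory / "AGENT_RULES.md"
        if candidate.is_file():
            rules.append(candidate.read_text())
    return rules


def deterministic_gate(cmd: list[str]) -> bool:
    """A non-negotiable step (linter, test suite, git hook) that runs as
    plain code outside the agent loop and must pass before the run ends."""
    return subprocess.run(cmd).returncode == 0


def minion_run(task: str, target_dir: Path, agent_step) -> bool:
    """Interleave agent steps with deterministic checks, so the run always
    completes the required steps no matter what the agent produces.
    `agent_step` is a placeholder for the actual model call."""
    context = "\n\n".join(load_rule_files(target_dir))
    for _ in range(5):                        # bounded retry loop
        agent_step(task, context)             # agent edits the code
        if (deterministic_gate(["ruff", "check", str(target_dir)])
                and deterministic_gate(["pytest", str(target_dir)])):
            return True                       # all required gates passed
    return False                              # gates never passed; escalate
```

The key design choice is that the gates are ordinary code, not instructions to the model: the agent can be as creative as it likes inside the loop, but the run only succeeds when the deterministic checks say so.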
Coinbase gave its agent, internally called “Claudebot,” the same tools and context as its human engineers. Uber baked context into specialised tooling for code quality, testing, and UI patterns. Goldman Sachs embedded Anthropic engineers inside the bank for six months to learn why their systems were built the way they were.
The context that matters most is that institutional knowledge about why things are the way they are. A generic model might suggest simplifying a piece of code, not knowing the complexity exists because of a regulatory requirement discovered during an audit three years ago.
When context goes wrong
The optimistic version of this story is that context makes AI output correct. The less-discussed version is that context, done poorly, makes AI output confidently wrong in ways that are very hard to catch.
A rule file that encodes an outdated convention will produce code that looks right because it matches what the codebase used to look like. And because the output came from a well-contextualised agent, engineers will review it with more trust and less scrutiny than they’d apply to code from a generic tool.
Beyond this, deeply contextualised agents are very good at producing more of the same. But when a team needs to rearchitect a system or challenge an assumption baked into the codebase for years, a context-rich agent might actually resist that change.
Will this last?
Every organisation has proprietary knowledge that will never appear in a model’s training data. Goldman’s trade accounting workflows. Stripe’s payment routing edge cases. Airwallex’s deployment conventions. The context layer bridges a gap the model cannot bridge on its own.
But models are getting better at inferring context with every generation. GPT-3 needed elaborate prompt engineering. GPT-4 needed less. Opus 4.6 needs even less. Today you might need 400 MCP tools to get an agent to produce Stripe-quality code. In two years, you might need the agent to read a few key documents and have access to the codebase itself. In five years, the model might infer most conventions directly from the code.
Who this applies to
The examples I’ve discussed are almost all large engineering organisations with the scale to build custom platforms and amortise the cost. A ten-person startup doesn’t need to invest engineering time in custom context infrastructure.
The right answer for most startups is to use the best off-the-shelf tools available and invest in the habits that make those tools effective. Write clear documentation. Maintain consistent coding conventions. Structure your codebase so it’s legible to both humans and machines.
At Airtree, we’re watching closely for the tooling that makes this easier, particularly for growth-stage companies that sit between off-the-shelf tools and full custom platforms. If you’re building in this space, we’d love to hear from you.