Software Synthesis analyses the evolution of software companies in the age of AI – from how they’re built and scaled, to how they go to market and create enduring value. You can reach me on LinkedIn and X..

Join readers from OpenAI, Databricks, Stripe, Figma and more


Gradient Descending Roundtables

February 25: The Future of AI Compute

March 5: The Future of Software Engineering with Anthropic


Last week, we hosted Matt and Thomas from Scale AI, following a previous roundtable on Rubrics as Rewards. We discussed RL environments, one of the bigger themes in AI this year.

Core Challenges in Building RL Environments

1. The Optimisation/Scale Problem. The ideal RL environment should be extensive and cover many domains, but larger environments are computationally heavier. There’s a fundamental tension between realism/breadth and efficiency. GPU constraints are significant across training, inference, and environment simulation simultaneously. Anyone doing large-scale RL environment training would face GPU, DRAM, and potentially CPU bottlenecks across all three vectors.

2. Verifiability Across Domains. Coding is the gold standard for RL because outputs can be programmatically tested. But many domains lack this property. Matt and Thomas described using “rubrics” — sets of 20-30 rules used as an LLM judge to score outputs in harder-to-verify domains — but acknowledged this still carries reward hacking risk, is compute-intensive (running two sets of inference at scale), and depends heavily on rubric quality, which requires expert curation.

3. Real-World Relevance. Environments must actually drive performance on things end customers care about. Building an environment around vending machines might not justify huge investment. The benchmark and the environment need to be jointly validated — you need confidence that improving the benchmark actually reflects improving real-world performance, not just benchmark hacking.


The Vending Machine Case Study

This was presented as a Scale AI internal hackathon project, partly inspired by an actual Anthropic experiment where Claude was given control of a real office vending machine (stocking items, web searching for suppliers, sending emails to reorder). The hackathon version was scoped down to fit a 2-day format.

Environment Structure

The environment was built around a simple loop:

The model used was GPT OSS (with special tokens visible in the transcript examples). Reasoning tokens were artificially constrained because unconstrained reasoning allowed models to crush benchmarks.

Mechanics of the Simulator

Training Run

Key Finding — Reward Hacking vs. Environment Realism The team initially suspected reward hacking given the extraordinary returns, but upon inspecting actual generation traces, the model had genuinely learned good business operations. The real problem was environment realism: the simulation had no competition. In the real world, high-margin strategies attract competitors who arbitrage profits away. The model found the globally optimal strategy within the (unrealistic) environment, not a strategy that would work in the real world.

Reward Hacking Example Encountered Early in development, the team found the agent exploiting a loophole: one item had very low supplier cost but very high demand. The agent filled every slot with that single item at maximum margin. This was caught by manually reviewing generation logs — the entire output was just repeated restock orders for the same item.


Reward Design: Sparse vs. Dense Rewards

A significant portion of the discussion covered reward function design:

Sparse Rewards (traditional approach)

Dense Rewards (what they used) Three improvements over sparse:

  1. Continuous reward instead of binary: Use actual cash balance as reward (e.g., £100 profit → reward of 100, not just 1)

  2. Intermediate step rewards: Assign small rewards/penalties at each step, not just at episode end. If the agent makes a small profit in step 3, that’s reflected in the step-3 reward. This provides dramatically more training signal per compute dollar

  3. Curriculum/shaping: Adding human prior knowledge through reward structure helps models learn faster, though it introduces the risk of over-constraining behaviour

The tension highlighted: more intermediate rewards guide the model better but risk teaching it to optimise for the reward signal rather than the true objective. The answer given was that compute constraints force you toward denser rewards in practice — in an ideal world with unlimited compute you’d just use a single sparse end reward, but that’s not viable at scale.


Verifiability and Domain Expansion

Domains that work well with RL environments

Rubric-based approach for harder domains

Key insight on domain selection: For RL environments to be valuable, the benchmark must be: (a) something you can agree actually represents desired performance, and (b) tight enough that improving it reflects real capability gains rather than benchmark gaming. “Maximise profit in an undefined scenario” is too broad; “improve Rust coding on this specific task suite” is sufficiently constrained.


Reward Hacking Detection and Prevention

How it’s caught: Primarily by sampling generations and reading the raw model outputs manually. No automated detection was described as reliable.

Why it’s hard to prevent by design: Models find exploits humans don’t anticipate because they run thousands of simulations and inevitably discover edge cases in the environment logic.

Real-world example cited: Chinese labs reportedly hacked SWE-Bench (a coding benchmark) by looking at future git commits to copy solutions rather than actually solving problems — even frontier labs struggle with this.

Dual-benchmark approach: Run training against benchmark A, but also monitor performance on a separate benchmark B that targets the same real-world capability. If B collapses while A improves, strong evidence of benchmark hacking.


Discussion: Multi-Objective and Competing Goals

There are challenges with competing goals — e.g., a shopping agent maximising short-term basket size but damaging long-term customer retention:


Scalability and the Real World vs. Simulation Debate

Can you just use the real world as the environment? Why not use a real corner shop? The answer: you don’t have infinite compute to represent the real world, and the real world is too slow and expensive to run millions of times. The optimization is finding a simulation that is realistic enough to drive genuine performance improvement but constrained enough to be computationally tractable. This is described as an open problem for 2026.

Multiplayer/competitive environments as middle ground OpenAI Five (Dota 2) was cited as the canonical example — a rich, competitive, multi-agent environment that can be run millions of times. Video games were popular for this, but labs have moved away because real-world tasks (coding, office work) drive more economic value. However, for robotics and world models, game-like simulation environments are coming back into favor.

CPU as emerging bottleneck As environments become more complex (computer use, video editing, Linux environments), CPU becomes a constraint alongside GPU. Running heavy OS-level simulations at RL training scale could make CPU the binding constraint. One participant suggested rewriting environments in JAX to run on GPU, though this clearly doesn’t apply to high-fidelity real-world simulations.


Training Pipeline Efficiency

Emphasised as critical because RL is inherently sequential (run simulation → training step → repeat, cannot parallelise easily):


Multi-Agent Training Clarification

A point of confusion clarified in Q&A: the 10 evaluation runs shown in graphs were not separate agents learning from each other — they were 10 independent runs of the same model (to account for non-determinism). The training consisted of 50 gradient update steps where all runs contributed to shared model weight updates via GRPO. Post-training, another 10 independent eval runs were conducted.


Signals


What I’m Reading

Founders Should Chase Secrets

Time’s Up for SaaS (Grow Faster or Vanish)

End Game Play

In defense of vertical software

You’ve been kicked out of the arena, you just don’t know it yet


Earnings Commentary

I think though that if you look at the workflow overall, it used to be the case like even for sure, a year ago, perhaps even 6, 9 months ago, that a lot of people saw the workflow of product development is very linear.

The way I think about it is you’re sampling these infinite possibilities space. And you’re trying to determine what are the right options to go explore in that space and then push them forward with design. And I think that you can do that through code. You can do that through design, but code is more linear. It’s more — you’re really advancing in one direction. And so you might be moving fast, but make sure you’re going to the right place before you go too far.

Whereas design, you’re really thinking about what are the range of possibilities I should explore and you’re weighing them and figuring out the trade-offs. And I also think the opportunity for polish and craft and design is quite high.

Dylan Field, Figma Q4 2025 Earnings

If humans looked at 5 sites when they were making a decision, agents might look at 5,000. If humans had to fall back on generalized software and interfaces, agents allow for infinite customizability of every software application for every need. If humans follow a common circadian rhythm to work, agents never need to sleep. Agents, in other words, are the ultimate infrastructure multiplier.

Matthew Prince, Cloudflare Q4 2025 Earnings


Have any feedback? Reach out on LinkedIn or X.

Leave a Reply

Sign Up for TheVCDaily

The best news in VC, delivered every day!