Hi there,

Welcome to the 135th edition of Heartcore Insights, curated with 🖤 by the Heartcore Team.

If you missed the past newsletters, you can catch up here. Now, let’s dive in!


Reinforcement Learning is Re-writing AI, and Europe has a Seat at the Table

For most of the past half-decade, the story of AI progress was simple: make the model bigger, feed it more data, and performance improves. But recent frontier models are posting smaller benchmark gains despite eye-watering compute budgets, and high-quality training data is running out. As Anthropic CEO Dario Amodei put it in 2025: “Two years ago, we thought there was this fundamental obstacle around reasoning. Turned out just to be RL.”

Reinforcement learning is the second scaling axis that pre-training alone couldn’t provide, generating its own training signal through feedback rather than consuming ever more human-written data.

A 70-year-old Idea whose Moment has Arrived

RL isn’t new. Richard Bellman laid its foundations in the 1950s, and Sutton and Barto formalised temporal-difference learning in the 1980s, work that earned them the Turing Award in 2025 (and humbled Dwarkesh in a podcast appearance last year).

DeepMind’s 2013 Atari paper reignited it, AlphaGo shocked the world in 2016, and AlphaFold earned the 2024 Nobel Prize in Chemistry. Still, these were domain-specific wins.

The real unlock came when RL collided with language models. Standard language models learn by predicting the next word: sophisticated pattern matching at enormous scale. That works well, but it has a ceiling. You can’t pattern-match your way to solving a hard math problem you’ve never seen before, and pre-training, by itself, teaches imitation rather than reasoning.

RL changes the equation fundamentally. At its core, RL is about trial and error. An agent tries things, gets rewarded or penalised at the end of its run (its behaviour weighted up or down accordingly), and gradually learns what works.
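The trial-and-error loop described above can be sketched in a few lines. Below is a toy multi-armed bandit: an epsilon-greedy agent pulls arms, observes rewards, and nudges its value estimates toward what actually paid off. The payout probabilities, episode count and epsilon are illustrative assumptions, not drawn from any real system.

```python
import random

# Toy trial-and-error loop: an epsilon-greedy agent learns which of
# three slot-machine arms pays best. (Hypothetical payout probabilities.)
TRUE_PAYOUT = [0.2, 0.5, 0.8]   # hidden reward probability per arm

def pull(arm, rng):
    """Environment step: reward 1 with the arm's hidden probability, else 0."""
    return 1.0 if rng.random() < TRUE_PAYOUT[arm] else 0.0

def train(episodes=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    value = [0.0, 0.0, 0.0]     # agent's estimated value of each arm
    counts = [0, 0, 0]
    for _ in range(episodes):
        # Explore occasionally; otherwise exploit the current best estimate.
        if rng.random() < epsilon:
            arm = rng.randrange(3)
        else:
            arm = max(range(3), key=lambda a: value[a])
        reward = pull(arm, rng)
        counts[arm] += 1
        # Incremental average: nudge the estimate toward the observed reward.
        value[arm] += (reward - value[arm]) / counts[arm]
    return value

print(train())  # the estimate for arm 2 should end up close to 0.8
```

The agent is never told which arm is best; the reward signal alone steers it there, which is the same principle that scales up to language-model post-training.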

Instead of imitating human text, models trained with RL learn to reason toward correct answers, verified against objective outcomes like code that compiles, proofs that check out, or answers that match a ground truth. This approach, called Reinforcement Learning from Verifiable Rewards (RLVR), generates training signals automatically, with no human annotators needed. That means less dependence on curated data and less reliance on ever-larger pretrained models. Things get trickier for non-verifiable tasks, where the fallback is typically a judge model trained via RLHF, RLAIF or similar methods, but the principle remains.
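To make “verifiable reward” concrete, here is a minimal sketch of two reward functions of the kind RLVR relies on: an exact-match check against a known answer, and a check that generated code runs and passes a unit test. The task format, function names and 0/1 reward values are illustrative assumptions, not any lab’s actual API.

```python
# Sketch of verifiable rewards: score a model's output against an
# objective check, with no human annotator in the loop.

def math_reward(model_answer: str, ground_truth: str) -> float:
    """Exact-match check against a known correct answer."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def code_reward(model_code: str, test_snippet: str) -> float:
    """Reward 1.0 if the generated code executes and passes the given test."""
    try:
        namespace = {}
        exec(model_code, namespace)        # does the code even run?
        exec(test_snippet, namespace)      # does it pass the unit test?
        return 1.0
    except Exception:
        return 0.0

# A completion that defines a working function earns full reward:
good = "def square(x):\n    return x * x"
bad = "def square(x):\n    return x + x"
check = "assert square(4) == 16"
print(code_reward(good, check), code_reward(bad, check))  # 1.0 0.0
```

Because the checker, not a human, produces the signal, this kind of reward can be computed millions of times during training, which is exactly what makes RLVR a scaling axis.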

From Research Labs to Enterprise

Europe has the chance to be a leader in the space:

Paris-based Adaptive ML has built a dedicated RLOps platform already deployed by AT&T and SK Telecom.

Mistral’s Magistral research confirmed that a moderately sized model fine-tuned with a single phase of RL on the right domain can consistently outperform larger general-purpose models. A 14B-parameter model trained with RL on biological reasoning can outperform GPT-4-class models on those tasks at a fraction of the inference cost.

DeepMind’s RL system cut Google’s data centre cooling costs by 40%.

London’s InstaDeep (acquired by BioNTech for £562M) identified 12 of 13 WHO-flagged COVID variants two months ahead of official designation.

Wayve raised $1.2B in early 2026 at an $8.6B valuation, training autonomous vehicles with end-to-end RL across 500+ cities.

Mistral AI raised €2B on the strength of its RL-based reasoning models. On the emerging side, Isomorphic Labs is applying RLVR to drug discovery, a flood of startups are addressing “RL for code generation” thanks to its tight, verifiable feedback loops, and recommendation systems are being rebuilt around long-horizon reward signals rather than clicks.

And countless more very promising early-stage companies across Europe.

Emerging RL Categories Worth Watching

RL for code has already gone mainstream thanks to its tight feedback loops and clear, verifiable objectives, making it one of the cleanest domains for RL to shine. Beyond the obvious use cases, several newer RL applications are gaining serious traction:

Multi-agent RL, where multiple agents learn to coordinate in real time, is moving into logistics, energy grids and autonomous fleets.

Scientific discovery is arguably the most exciting frontier. RLVR-trained agents are now automating multi-step research tasks (hypothesis generation, experimental design, data analysis) using computational verification as the reward signal.

Personalisation and recommendation systems are getting an RL refresh too. Rather than optimising for click-through rates, next-generation systems are training agents with long-horizon reward signals (user retention, satisfaction and real-world outcomes), producing qualitatively different behaviour from traditional recommendation algorithms. The next step is real-world simulation, fed as behavioural context into world models.
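The difference between click-optimised and long-horizon objectives comes down to what gets summed. A standard discounted return weights engagement across many sessions, so a policy that wins the first click but churns users can score worse than a steadier one. The session reward numbers below are hypothetical, purely to illustrate the arithmetic.

```python
# Contrast a click-optimised outcome with a long-horizon return.
# All reward values are hypothetical.

def discounted_return(rewards, gamma=0.95):
    """Standard discounted return: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Per-session rewards for two hypothetical policies:
clickbait = [1.0, 0.2, 0.0, 0.0]   # big first click, then the user churns
long_term = [0.5, 0.5, 0.5, 0.5]   # steadier engagement across sessions

print(discounted_return(clickbait))  # ≈ 1.19
print(discounted_return(long_term))  # ≈ 1.85
```

Optimising the discounted sum rather than the first term is the whole shift: the retention-friendly policy wins the horizon even though it loses the opening click.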

Bottom line: RL is having a moment (again, or still?), and Europe should be on the main stage! As always, if you are building in this space, we’d love to talk.

~ Bodi Tent, Associate, Heartcore Capital



🇪🇺 Notable European early-stage rounds

🇺🇸 Notable US early-stage rounds

🔭 Notable later stage rounds

🖤 Heartcore News
