Happy Sunday and welcome to Investing in AI! Be sure to check out our AI in NYC podcast if you want to follow the weekly news and analysis on what’s happening in the AI space. We also now have a paid version of this newsletter with a weekly bottom-up stock analysis of how AI will impact a specific business, so please upgrade if you want to read those.

In my last post, I argued that SRAM stagnation is creating a “density wall” that threatens on-device AI. The problem is straightforward: the models we want to run don’t fit in the memory we can afford to put on edge devices.

But what if we’re thinking about scaling wrong?

The traditional playbook says: want a smarter model? Build a bigger one. More parameters, more weights, more memory required. If your edge device has 16MB or 32MB of SRAM, you’re physically locked out of frontier-class intelligence. Game over.

Test-time scaling flips this equation. Instead of making models bigger, you make them think longer.

Trading Thinking Time for Model Size

Here’s the core insight: a smaller model that fits in limited SRAM can achieve the performance of a much larger one by doing more work at inference time.

Instead of a single “one-shot” forward pass to generate an answer, the model performs multiple internal reasoning steps. It searches through solution paths, checks its own logic, and iterates toward better answers. The technical implementations vary—chain-of-thought prompting, tree search, best-of-N sampling—but the principle is the same: substitute compute cycles for parameter count.
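To make the trade concrete, here’s a toy best-of-N sketch. Everything in it is a stand-in: `noisy_small_model` simulates a small model’s single forward pass as a noisy guess, and `score` simulates a verifier or reward model. The point it illustrates is real, though: memory use is constant in N, while answer quality improves with more samples.

```python
import math
import random

def noisy_small_model(prompt: str, rng: random.Random) -> str:
    """Stand-in for one forward pass of a small model:
    each sample is a noisy guess at the right answer (here, 42)."""
    return str(42.0 + rng.gauss(0, 5))

def score(answer: str) -> float:
    """Stand-in verifier: higher is better. A real system might use
    a learned reward model or a self-consistency vote."""
    try:
        return -abs(float(answer) - 42.0)
    except ValueError:
        return -math.inf

def best_of_n(prompt: str, n: int, seed: int = 0) -> str:
    """Sample n candidate answers and keep the highest-scoring one.
    Memory footprint is constant; only compute grows with n."""
    rng = random.Random(seed)
    candidates = [noisy_small_model(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

# Same model, same memory -- more samples buys a better expected answer.
one_shot = best_of_n("what is 6*7?", n=1)
deliberate = best_of_n("what is 6*7?", n=64)
```

With a shared seed, the one-shot answer is simply the first of the 64 candidates, so the best-of-64 answer can never be worse under the verifier. That monotonicity is the whole appeal: more thinking time never costs memory.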

This is a profound shift. For the first time, “intelligence” becomes partially decoupled from “model size.” And that decoupling is exactly what edge deployment needs.

How Test-Time Scaling Helps the SRAM Problem

Test-time scaling alleviates the edge memory crisis in three specific ways.

Fixed memory footprint. You can take a 3B or 7B parameter model—which fits comfortably in the memory of a modern smartphone or edge gateway—and give it the reasoning depth of a 70B parameter model. The memory requirement stays constant; only the execution time increases. This is the fundamental trade: time for space.

Reduced quantization pressure. In my previous post, I described how edge designers are forced into aggressive 4-bit or even 2-bit quantization to fit models on-device, often sacrificing accuracy and nuance. Test-time scaling offers a release valve. Because the model can “correct” its own logic through multiple passes, you can use more aggressive compression without losing as much accuracy as a one-shot model would. The iterative reasoning process catches errors that quantization introduces.
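The arithmetic behind that pressure is simple. A sketch, using an illustrative 7B-parameter model (the parameter count is an example, not a claim about any specific device):

```python
def model_bytes(n_params: float, bits_per_weight: int) -> int:
    """Storage needed for a model's weights at a given quantization width."""
    return int(n_params * bits_per_weight / 8)

# A 7B-parameter model: 16-bit weights need 14 GB of storage;
# 4-bit quantization cuts that to 3.5 GB -- a 4x compression that
# normally costs accuracy, which iterative reasoning can help win back.
fp16_size = model_bytes(7e9, 16)  # 14_000_000_000 bytes
int4_size = model_bytes(7e9, 4)   # 3_500_000_000 bytes
```

The 4x savings is why edge designers reach for aggressive quantization in the first place; test-time scaling changes how much accuracy that compression costs, not how much memory it saves.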

Dynamic resource allocation. Not every query requires frontier-level intelligence. “What’s the weather?” doesn’t need 30 seconds of deep reasoning. “Diagnose this engine fault from sensor data” might. Test-time scaling enables the device to dynamically match compute effort to task complexity. Simple queries get instant, low-power responses. Complex queries get extended reasoning. This avoids the need for a massive, always-heavy memory architecture sized for worst-case workloads.
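A minimal sketch of that routing idea, with everything assumed: the keyword heuristic and token budgets are placeholders (a production system would more likely use a small learned classifier or the model’s own uncertainty):

```python
def reasoning_budget(query: str, max_tokens: int = 4096) -> int:
    """Toy difficulty router: allocate a large reasoning-token budget
    to queries flagged as complex, a tiny one to everything else."""
    hard_markers = ("diagnose", "prove", "plan", "analyze")
    if any(marker in query.lower() for marker in hard_markers):
        return max_tokens  # extended chain of thought
    return 64              # instant, low-power response
```

Simple queries stay cheap; only the hard ones pay for deep reasoning, so the device never has to be provisioned for worst-case workloads on every request.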

The New Bottleneck: The KV Cache

If this sounds too good to be true, that’s because there’s a catch.

While test-time scaling solves the parameter storage problem, it introduces a new memory challenge. As a model “thinks” longer, it generates more intermediate tokens—the chain of thought. These tokens must be stored in a part of memory called the KV cache (key-value cache), which holds the attention state the model needs to remain coherent across its reasoning steps.

And here’s where SRAM comes back to haunt us.

If the model thinks too long, the KV cache can grow so large that it overflows the available on-chip SRAM. When that happens, the device is forced to swap data to slower external DRAM. This swap triggers exactly what edge deployment is trying to avoid: a massive spike in power consumption from data movement.
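A back-of-the-envelope sketch shows how quickly this happens. The cache stores a key and a value vector per layer, per KV head, per token; the layer and head counts below are assumptions for a generic 3B-class transformer, not figures for any specific chip or checkpoint:

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_val: int = 2) -> int:
    """KV cache size: keys + values for every layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

def max_tokens_in_sram(sram_bytes: int, n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_val: int = 2) -> int:
    """How many tokens the model can 'think' before the cache overflows SRAM."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val
    return sram_bytes // per_token

# Assumed 3B-class config: 28 layers, 8 KV heads of dim 128, fp16 cache.
# With 32 MiB of SRAM, the cache overflows after only a few hundred tokens.
budget = max_tokens_in_sram(32 * 1024**2, 28, 8, 128)
```

Under these assumptions the budget comes out to roughly 300 tokens of working context before the device must spill to DRAM, which is why long reasoning chains and tight on-chip memory are in direct tension.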

So test-time scaling doesn’t eliminate the memory constraint—it transforms it. You’re no longer bottlenecked by how many parameters you can store. You’re bottlenecked by how long you can reason before your working memory overflows.

Is It a Fix?

Test-time scaling is a force multiplier, not a cure.

It allows us to extract frontier-level intelligence from edge-sized memory. That’s genuinely transformative for what’s deployable on phones, wearables, and embedded devices. But it shifts the bottleneck rather than eliminating it—from silicon area (how big is my chip?) to energy-per-inference (how much battery did that “thought” cost?).

For edge AI, this is a favorable trade. Silicon area is a hard physical constraint that gets worse with every process node. Compute cycles are something we can optimize, manage dynamically, and improve with better algorithms.

Three Early Applications

As investors get excited about edge AI, this SRAM constraint shapes which opportunities will work first. We should look for the kind that benefit from test-time scaling: remote applications that aren’t latency sensitive.

Here are three opportunities I see where edge AI may gain traction first, which can give you a framework for discovering and evaluating more.

1. Agricultural AI – Systems that monitor crops and support farming operations rarely need low latency, so they can apply test-time scaling algorithms. The biological processes being monitored happen over days or weeks. If a solar-powered edge node takes 30 seconds to “reason” through a complex visual scene to determine if a specific spot on a leaf is a fungus or just bird droppings, it doesn’t matter.

2. Predictive Maintenance – Industrial systems that deploy AI to analyze factory machinery are another good opportunity. Mechanical failure usually gives off subtle “tells” long before a catastrophic break. A small model using test-time scaling can perform deep “System 2” thinking to distinguish between normal wear-and-tear and a developing structural crack.

3. Privacy-Focused Personal Digital Assistants – This is “batch” processing for your personal life. If you ask your device at 10:00 PM to “summarize the key takeaways from my meetings today,” you don’t necessarily need the answer in 200ms. If it takes 2 minutes of local, private “reasoning” while your phone sits on the charger, that’s an acceptable trade-off.

When investors think about edge AI, they think about the use cases that would most benefit from the technology. But they rarely understand the technical constraints that control where it can be adopted successfully and where it can’t. My goal was to give you a mental model to evaluate that so you can make better investment decisions about the upcoming edge AI boom.

The SRAM wall is still real. But test-time scaling gives us a ladder to climb over it—as long as we’re willing to wait for the answer. The early opportunities will be the ones that can tolerate slightly higher latency so that small models can upgrade their performance by thinking longer.

Thanks for reading.

