People ask me to explain my investing thesis in AI more often than I’d expect. Up until now I’ve kept it deliberately vague — describing myself as opportunistic, backing ideas as they surfaced. That’s still true in practice. But I think it’s worth being more precise about what I’m actually looking for.
I break it into two buckets. The first is frontier applications of software — places where AI can meaningfully unlock or accelerate work in domains that have historically resisted software. Manufacturing, construction, science, chip design. Places where the bottleneck is real-world complexity, not compute. The second bucket is what I’d call frontier AI itself — weirder, earlier, more speculative bets on ideas that could represent genuine step-function changes in how models are built. Moonshots, essentially.
JEPA (Joint Embedding Predictive Architecture) lives firmly in the second bucket. And because it has become a growing topic of conversation, I figured it was worth writing down what it actually is.
What is JEPA?
JEPA is a model architecture that learns by predicting abstract representations of the world rather than reconstructing raw data. To understand why that distinction matters, think about your mother for a second.
What came to mind wasn’t a pixel-perfect image. It wasn’t her voice rendered in lossless audio. It was something compressed and abstract — a feeling, a posture, a set of associations your brain has accumulated over decades and folded into a single retrievable idea. Your brain doesn’t store raw footage. It stores meaning.
Most models don’t work that way. Understanding why — and understanding what JEPA does instead — requires a short detour into how models actually learn.
The standard pipeline
Every modern generative model follows roughly the same structure.
Raw data comes in — pixels, tokens, audio waveforms. An encoder processes that data and compresses it into a latent representation: a dense numerical vector that captures the essential features of the input. This is the model’s internal language. It’s abstract, high-dimensional, and lives nowhere the human eye can see.
From there, a decoder takes that latent representation and reconstructs something — an image, a sentence, a prediction. The model learns by comparing its reconstruction against the original input. Predicted a slightly wrong pixel? Adjust. Wrong word? Adjust. The error signal flows backward through the network, and the model gets better at reconstruction over time.
This is the loop: raw data → encoder → latent space → decoder → reconstruction → error signal → repeat.
It works. GPT, Claude, Gemini, Stable Diffusion — all of these are, at their core, raw-data prediction machines trained on massive quantities of data: the next token in a sequence, the denoised pixels of an image. The learning comes from the prediction error at the output.
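To make that loop concrete, here is a minimal sketch of a reconstruction-style training step in PyTorch. The toy dimensions and the mean-squared-error loss are stand-ins for illustration, not the actual recipe behind any of the models above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy dimensions, purely illustrative.
INPUT_DIM, LATENT_DIM = 784, 64

encoder = nn.Sequential(nn.Linear(INPUT_DIM, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM))
decoder = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(), nn.Linear(256, INPUT_DIM))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def reconstruction_step(x):
    """raw data -> encoder -> latent -> decoder -> reconstruction -> error signal."""
    z = encoder(x)                 # compress the input into a latent vector
    x_hat = decoder(z)             # reconstruct the raw input from that vector
    loss = F.mse_loss(x_hat, x)    # error is measured in raw-data space: every pixel counts
    opt.zero_grad()
    loss.backward()                # the error signal flows backward through the network
    opt.step()
    return loss.item()

# One training step on a fake batch of flattened 28x28 "images".
loss = reconstruction_step(torch.rand(32, INPUT_DIM))
```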
The limitation is subtle but important. When a model trains to reconstruct raw data, it has to model everything — including all the noise, texture, and surface variation that carries no semantic meaning. A model predicting missing image patches has to care about lighting gradients and film grain, because those details exist in the reconstruction target. The model can’t distinguish signal from noise because it’s penalized for getting either wrong.
The result is models that are capable but inefficient — models that have to internalize enormous quantities of surface detail to develop a useful understanding of what things mean.
What JEPA does
JEPA stands for Joint Embedding Predictive Architecture. It removes the decoder.
The pipeline becomes: raw data → encoder → latent space → predictor → predicted latent representation.
There is no reconstruction step. The model never tries to generate pixels or tokens. Instead, it takes a context (some portion of the input) and predicts what the latent representation of a target would look like — not the target itself.
The model’s goal is simple: minimize the distance between what it predicted the representation would be and what the representation actually is. Two vectors. The distance between them. That’s the whole training signal.
Because the prediction target is already abstract — already encoded — the model has no incentive to model noise. Pixel-level variation doesn’t survive the encoder. What survives is structure, causality, and semantics. The model learns to predict meaning, and the learning signal comes entirely from whether it got the meaning right.
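What does that look like in practice? Below is an equally minimal sketch of a latent-prediction training step in PyTorch. It follows the general shape of published JEPA variants (a context encoder, a slowly updated target encoder, a predictor, and a distance loss in latent space), but the toy dimensions, the MSE distance, and the EMA coefficient are illustrative choices rather than a faithful reproduction of any specific paper:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 64

context_encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM))
predictor       = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM))

# Target encoder: a slowly updated copy of the context encoder. No gradients flow into it.
target_encoder = copy.deepcopy(context_encoder)
for p in target_encoder.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

def jepa_step(context_view, target_view, ema=0.996):
    """raw data -> encoder -> latent -> predictor -> predicted latent. No decoder anywhere."""
    z_context = context_encoder(context_view)       # encode the visible part of the input
    z_pred = predictor(z_context)                    # predict what the target's representation will be
    with torch.no_grad():
        z_target = target_encoder(target_view)       # encode the target; already abstract
    loss = F.mse_loss(z_pred, z_target)              # distance between two vectors: the whole training signal
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Drift the target encoder toward the context encoder (exponential moving average).
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1 - ema)
    return loss.item()

# One step on fake data: the "context" is one view of an input, the "target" another.
loss = jepa_step(torch.rand(32, 784), torch.rand(32, 784))
```

The stop-gradient and the slow-moving target encoder are there to guard against collapse: without some asymmetry between the two encoders, the easiest way to make the distance zero is for both of them to map every input to the same vector.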
Think about what it takes to catch a football. Once the ball is in the air, the person catching it isn’t stopping to process the full sensory scene — wind speed, trajectory, distance, the ground beneath his feet — before deciding where to go. He reads the play, feels where the ball is heading, and runs to that spot before it arrives. The catch happens because he predicted an abstract future state and moved toward it. JEPA is trying to build that same capacity — the ability to predict what comes next without reconstructing what it looks like.
Where JEPA fits in a world model
Despite the name, JEPA is a component, not a complete cognitive architecture.
A world model is a full cognitive architecture. It includes a state representation (where am I and what’s happening), a predictor (what comes next), a memory (what has happened before), a critic (is this good or bad), and an actor (what should I do). These components work together to let a model perceive, anticipate, remember, evaluate, and act.
JEPA handles state and prediction. It answers the question: given a representation of the current situation, what representation should I expect next? It’s the part of the architecture that models how the world evolves — what causes what, what follows from what.
A chess player perceives the board, anticipates opponent moves, recalls prior games, evaluates positions, and selects a move. JEPA is the part of that player that understands how pieces move and can build an internal model of where the game is going. That’s foundational — everything else in a world model depends on a good predictor — but it’s one piece of a larger architecture.
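To make the framing concrete, here is a purely illustrative skeleton of how those five components might slot together, with a JEPA-style encoder and predictor at the center. The class, its interfaces, and the placeholder modules are my own sketch, not a published design:

```python
import torch
import torch.nn as nn

class WorldModelAgent(nn.Module):
    """Illustrative skeleton only: one way the components described above might fit together.
    The JEPA piece is the encoder + predictor pair; everything else is a stub."""

    def __init__(self, obs_dim=784, latent_dim=64, action_dim=8):
        super().__init__()
        # State representation + prediction: the part JEPA provides.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.predictor = nn.Linear(latent_dim + action_dim, latent_dim)  # what comes next, if I do this
        # The other world-model components, stubbed out.
        self.memory = []                                # what has happened before
        self.critic = nn.Linear(latent_dim, 1)          # is this (imagined) state good or bad
        self.actor = nn.Linear(latent_dim, action_dim)  # what should I do

    def act(self, observation):
        z = self.encoder(observation)                            # where am I, what's happening
        self.memory.append(z.detach())                           # remember it
        action = self.actor(z)                                   # propose an action
        z_next = self.predictor(torch.cat([z, action], dim=-1))  # anticipate the next state, in latent space
        value = self.critic(z_next)                              # evaluate that imagined future
        return action, value

agent = WorldModelAgent()
action, value = agent.act(torch.rand(1, 784))
```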
The reason this framing matters is that “world model” gets thrown around loosely. JEPA’s proponents are making a specific claim: that prediction in latent space, grounded in semantics rather than raw data, is the right substrate for building models that can reason about how the world works. Whether the rest of the world model architecture gets built on top of that substrate, and what it looks like when it does, is a separate and still-open question.
Why any of this matters
The choice of what to predict during training is a choice about what the model learns to understand.
Reconstruction-based training produces models that are fluent in their output medium — text, images, audio. The knowledge is real, but it’s organized around the surface properties of the data rather than its causal structure. That’s useful for generation. It’s less useful for planning, for physical reasoning, for understanding what will happen if you do something.
It’s also worth noting that most models people loosely call “world models” today — LLMs, VLMs — are still fundamentally token predictors. They have to think in words. Even when a vision-language model is processing an image, it’s eventually routing everything through language to reason about it. That’s not nothing, but it’s a constraint. It means the model has to narrate the world before it can act on it: like the receiver from the football analogy stopping to describe the ball’s trajectory in words instead of running to where it will land.
Latent prediction is a bet that there’s a more efficient path to understanding the world — one that involves learning to anticipate meaning directly, without the detour through reconstruction. It’s also a bet that this kind of prediction is closer to how intelligence actually works: not by rendering the world in full detail, but by maintaining a compact model of it and projecting forward.
Your brain doesn’t reconstruct your mother when you think of her. It accesses a representation. JEPA is trying to build models that learn representations the same way — by predicting them, rather than generating their raw form.
Yann LeCun and his team at Meta’s FAIR lab have been the loudest proponents of this approach, and his applied research spinout AMI Labs is now working to build on it commercially. But they’re not the only ones. Last year I backed a company called Primate AI, founded specifically to take JEPA from research into production — building the kind of models that can reason about the physical world in real time. Robotics is the obvious application, but the surface area is much larger than that. Any domain where a model needs to act on an incomplete picture of the world — rather than generate a description of it — is fair game.
At some point I’ll write about a company I backed called After Thought, which is working at the intersection of Neuro-Symbolic AI and computational cognitive science — building reasoning models that reason the way humans do, not just ones that approximate the output. It ties directly back to everything covered here.