In this episode, we talk with Stefano Ermon, Stanford professor, co-founder & CEO of Inception AI, and co-inventor of DDIM, FlashAttention, DPO, and score-based/diffusion models, about why diffusion-based language models may overtake the autoregressive paradigm that dominates today's LLMs.
We start with the fundamentals: what diffusion models actually are, and why iterative refinement (starting from noise and progressively denoising) offers structural advantages over autoregressive generation.
From there, we dive into the technical core of diffusion LLMs. Stefano explains how discrete diffusion works on text, why masking is just one of many possible noise processes, and how the mathematics of score matching carries over from the continuous image setting with surprising elegance.
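For readers who want the intuition in code, here is a minimal sketch of the masking ("absorbing-state") variant of discrete diffusion that the episode touches on. The denoiser below is a random stub standing in for a trained network, and the linear masking schedule, MASK_ID, and step counts are illustrative assumptions, not the specifics of any model discussed.

```python
# Toy sketch of masking-based discrete diffusion on token sequences.
# Everything here is illustrative: the "denoiser" is a random stub,
# and MASK_ID / the schedule are assumptions for demonstration only.
import numpy as np

VOCAB_SIZE = 100
MASK_ID = VOCAB_SIZE          # extra "absorbing" token outside the vocab
rng = np.random.default_rng(0)

def forward_mask(tokens, t):
    """Forward noise process: independently replace each token with
    MASK_ID with probability t (t=0: clean text, t=1: all masks)."""
    corrupt = rng.random(tokens.shape) < t
    return np.where(corrupt, MASK_ID, tokens)

def denoiser(tokens):
    """Placeholder for a trained model: returns per-position logits
    over the vocabulary. A real model would condition on the full
    (partially masked) sequence."""
    return rng.standard_normal((len(tokens), VOCAB_SIZE))

def reverse_unmask(length, num_steps=4):
    """Reverse process: start from all masks; at each step commit the
    model's predictions for a share of the still-masked positions,
    in parallel, until nothing is masked."""
    x = np.full(length, MASK_ID)
    for step in range(num_steps):
        preds = denoiser(x).argmax(axis=-1)
        masked = np.flatnonzero(x == MASK_ID)
        # Unmask an equal share of the remaining positions each step.
        k = int(np.ceil(len(masked) / (num_steps - step)))
        chosen = rng.choice(masked, size=k, replace=False)
        x[chosen] = preds[chosen]
    return x

clean = rng.integers(0, VOCAB_SIZE, size=16)
print(forward_mask(clean, t=0.5))   # forward corruption
print(reverse_unmask(length=16))    # reverse (generative) direction
```

The point visible in reverse_unmask is that several positions are committed per step, in parallel, which is where the inference advantage discussed next comes from.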
A major theme is the inference advantage. Because diffusion models produce multiple tokens in parallel, they can be dramatically faster than autoregressive models at inference time. Stefano argues this fundamentally shifts the cost-quality Pareto frontier, and that the speedup becomes especially powerful in RL-based post-training, where generating rollouts dominates the compute budget.
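As a back-of-the-envelope illustration of that argument (all numbers hypothetical, not Mercury benchmarks): an autoregressive decoder pays one forward pass per generated token, while a diffusion decoder pays one pass per denoising step, independent of sequence length.

```python
# Hypothetical forward-pass counts, purely to illustrate the scaling
# argument; these are assumed numbers, not measured benchmarks.
seq_len = 256            # tokens to generate
diffusion_steps = 16     # assumed number of denoising iterations

ar_passes = seq_len              # one sequential pass per token
diff_passes = diffusion_steps    # one parallel pass per step

print(f"autoregressive: {ar_passes} passes, diffusion: {diff_passes} passes "
      f"({ar_passes / diff_passes:.0f}x fewer)")
```

Each diffusion pass touches every position, so per-pass cost can be higher; the realized speedup depends on how well the hardware exploits the extra parallelism, a point the hardware segment of the episode returns to.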
We also discuss Inception AI's Mercury II model, which Stefano describes as best-in-class for latency-constrained tasks like voice agents and code completion.
In the final part, we get into broader questions: why transformers work so well, research advice for PhD students, whether recursive self-improvement is imminent, the real state of AI coding tools, and Stefano's journey from academia to startup founder.
TIMESTAMPS
0:12 – Introduction
1:08 – Origins of diffusion models: from GANs to score-based models in 2019
3:13 – Diffusion vs. autoregressive: the typewriter vs. editor analogy
4:43 – Speed, creativity, and quality trade-offs between the two approaches
7:44 – Temperature and sampling in diffusion LLMs — why it's more subtle than you think
9:56 – Can diffusion LLMs scale? Inception AI and Gemini Diffusion as proof points
11:50 – State space models and hybrid transformer architectures
13:03 – Scaling laws for diffusion: pre-training, post-training, and test-time compute
14:33 – Ecosystem and tooling: what transfers and what doesn't
16:58 – From images to text: how discrete diffusion actually works
19:59 – Theory vs. practice in deep learning
21:50 – Loss functions and scoring rules for generative models
23:12 – Mercury II and where diffusion LLMs already win
26:20 – Creativity, slop, and output diversity in parallel generation
28:43 – Hardware for diffusion models: why current GPUs favor autoregressive workloads
30:56 – Optimization algorithms and managing technical risk at a startup
32:46 – Why do transformers work so well?
33:30 – Research advice for PhD students: focus on inference
34:57 – Recursive self-improvement and AGI timelines
35:56 – Will AI replace software engineers? Real-world experience at Inception
37:54 – Professor vs. startup founder: different execution, similar mission
39:56 – The founding story of Inception AI — from ICML Best Paper to company
42:30 – The researcher-to-founder pipeline and big funding rounds
45:02 – PhD vs. industry in 2026: the widening financial gap
47:30 – The industry in 5-10 years: Stefano's outlook
Music:
"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed
About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.