Interconnects

Nathan Lambert
Latest episode

153 episodes

    Frontier post-training recipe review with Finbarr Timbers

    June 16, 2026 | 56 min
    As I’ve been recapping fundamentals of post-training to wrap up my RLHF / Post-training book, I knew I needed to get Finbarr Timbers back on the podcast to talk about the state of play. Over the last few months we’ve had many discussions on what we’d need to do to take an Olmo-style recipe to the frontier, supported by Finbarr’s extensive reading of recent model technical reports.
    To prepare for this, I put together a summary slide deck on the key post-training recipes, both historical (the path from InstructGPT to today) and current (the key open frontier models). This deck is summarized below as the technical summary, but we do spend 20-35 minutes on it in the podcast, so watching on YouTube is likely the best experience for this one.
    I previously interviewed Finbarr in December of 2024, shortly after the release of o1 and Tülu 3 (and before he joined Ai2) on the “We are so back” era of RL.
    Chapters:
    * 00:00 Introduction & Olmo reflections
    * 06:28 Post-train recipes review (history)
    * 23:00 2026’s model recipes (MiMo Flash, DeepSeek V4, GLM 5, Kimi K2.6, etc.)
    * 39:05 Open-ended post-training discussions
    * 48:22 Career advice in the LLM race
    Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other Interconnects interviews, go here.
    For more educational post-training videos, see the course I’m putting together.
    Technical Summary
    These are notes cleaned up from a slide-deck created with AI assistance — mostly useful as a discussion topic and reference.
    The shape of a post-training recipe has changed more in the last year than in the prior three.
    * 2022–2023 (InstructGPT): one pipeline — SFT → reward model → RL.
    * 2024 (Llama 3, Tülu 3, etc.): open recipes formalize SFT → DPO → RL with verifiable rewards. Closed recipes use many stages of RLHF.
    * 2025 (DeepSeek R1): reasoning RL (R1) makes large-scale RL the centerpiece.
    * 2026 (MiMo Flash V2): recipes fragment into many specialist models that are merged back into one.
    The new thing: MOPD
    Multi-teacher On-Policy Distillation (MOPD) is the pattern showing up across the 2026 frontier.
    * Train N domain-specialist teachers (each: SFT, then RL on the relevant domains).
    * Train one general student by sampling its own trajectories (this is the final post-trained model).
    * On each rollout, minimize reverse-KL to the relevant teacher’s output distribution, token by token.
    Lineage: MiMo Flash v2 introduced it → DeepSeek V4 & Nemotron 3 Ultra scale it to >10 teachers.
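    A minimal sketch of the per-token objective described above, assuming full-vocabulary logits are available from both the student and the routed teacher. The function names, the `teacher_logits_by_domain` layout, and the toy three-token vocabulary are illustrative assumptions, not taken from the MiMo or DeepSeek reports; production systems may estimate the KL from sampled tokens rather than summing over the vocabulary.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def mopd_loss(student_logits, teacher_logits_by_domain, domain):
    """Mean per-token reverse KL against the teacher chosen for this rollout.

    student_logits: per-token logit vectors for a trajectory the student sampled.
    teacher_logits_by_domain: hypothetical dict mapping a domain name to that
        teacher's logits, scored on the same student trajectory.
    """
    teacher_logits = teacher_logits_by_domain[domain]
    per_token = [reverse_kl(s, t) for s, t in zip(student_logits, teacher_logits)]
    return sum(per_token) / len(per_token)


# Toy rollout: two token positions, vocabulary of three tokens.
student = [[2.0, 0.5, 0.1], [0.3, 1.2, 0.0]]
teachers = {
    "math": [[2.5, 0.2, 0.0], [0.1, 1.5, 0.1]],
    "code": [[0.0, 0.0, 3.0], [3.0, 0.0, 0.0]],
}
print(mopd_loss(student, teachers, "math"))  # small: student is close to the math teacher
print(mopd_loss(student, teachers, "code"))  # larger: student disagrees with the code teacher
```

    The expectation is taken under the student's own distribution, which is what makes the objective on-policy: the student is corrected on the states it actually visits, and the routing step only decides which teacher supplies the target.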
    Why did MOPD emerge?
    * RL got expensive and conflict-prone. Mixing math, code, and agentic RL in one run eventually trades capabilities off against each other.
    * Specialists are cheap to make / organizationally scalable. SFT-then-RL on a single domain is well understood and parallelizable. As post-training becomes more complex, scaling it across organizations is a big win.
    * On-policy distillation matured. Literature and know-how continued to emerge through the RLVR renaissance.
    Sources: DeepSeek V4 §5.1, MiMo-V2-Flash
    Key historical recipes
    InstructGPT (Mar. 2022) — the canonical 3 steps · paper
    * SFT on human demonstrations
    * Reward model trained on human comparisons
    * PPO against the reward model
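    The reward-model step is usually written as a pairwise (Bradley-Terry) loss on the comparison data. A minimal sketch, assuming the reward model has already produced one scalar score for the preferred completion and one for the rejected completion; `pairwise_rm_loss` is an illustrative name, not code from the paper.

```python
import math


def pairwise_rm_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a stable softplus form."""
    diff = reward_chosen - reward_rejected
    # softplus(-diff) = max(-diff, 0) + log(1 + exp(-|diff|)) avoids overflow.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


print(pairwise_rm_loss(0.0, 0.0))  # model is indifferent between the two completions
print(pairwise_rm_loss(3.0, 0.0))  # model agrees with the human label: low loss
print(pairwise_rm_loss(0.0, 3.0))  # model disagrees with the human label: high loss
```

    Only the difference between the two scores matters, so the reward scale is unconstrained; that is one reason reward values are typically normalized before the PPO stage.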
    Llama 2 (Jul. 2023) — multi-stage RLHF · paper · interconnects recap
    * SFT, then iterative RLHF over multiple rounds
    * Each round: rejection sampling → PPO
    * Two reward models — separate helpfulness and safety
    Llama 3 (Jul. 2024) — a complex multi-stage recipe with simpler optimizers · paper · interconnects recap
    * Per round: reward model → sample K per prompt → rejection sampling → SFT → DPO
    * No online RL — the RM only filters; run over 6 rounds, best models seed the next
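    The "sample K per prompt, then let the RM filter" step can be sketched as best-of-K selection. This is a minimal illustration with toy stand-ins; `generate` and `score` are hypothetical callables representing the policy sampler and the reward model, not an API from the Llama 3 paper.

```python
def rejection_sample(prompts, generate, score, k=4):
    """Return one (prompt, best_completion) pair per prompt.

    generate(prompt, i) -> str: hypothetical sampler; `i` indexes the K draws.
    score(prompt, completion) -> float: hypothetical reward-model call.
    """
    sft_pairs = []
    for prompt in prompts:
        candidates = [generate(prompt, i) for i in range(k)]
        best = max(candidates, key=lambda c: score(prompt, c))
        sft_pairs.append((prompt, best))
    return sft_pairs


# Toy stand-ins so the sketch runs without a model: longer answers score higher.
def toy_generate(prompt, i):
    return prompt + " answer" + "!" * i


def toy_score(prompt, completion):
    return float(len(completion))


print(rejection_sample(["Q1", "Q2"], toy_generate, toy_score, k=3))
```

    The selected pairs become SFT data for the next round, so the reward model shapes the policy only through which samples survive, never through a gradient.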
    Tülu 3 (Nov. 2024) — simple three-stage post-training · paper · interconnects recap
    Curated prompts → SFT → DPO → RLVR (RL with verifiable rewards — the acronym was coined in this paper).
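    The RLVR stage swaps a learned reward model for a programmatic check. A minimal sketch for a math-style prompt, assuming the final answer is wrapped in `\boxed{...}`; that convention and the function name are assumptions for illustration, not the exact verifier from the Tülu 3 paper.

```python
import re


def verifiable_reward(completion, ground_truth):
    """Return 1.0 if the last \\boxed{...} answer matches the reference, else 0.0."""
    answers = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if not answers:
        return 0.0  # no parseable final answer, so no reward
    return 1.0 if answers[-1].strip() == ground_truth.strip() else 0.0


print(verifiable_reward("2 + 2 = \\boxed{4}", "4"))  # 1.0
print(verifiable_reward("I think it is 4", "4"))      # 0.0
print(verifiable_reward("\\boxed{5}", "4"))           # 0.0
```

    The reward is binary and depends only on the checkable outcome, which is what distinguishes this stage from RLHF against a learned preference model.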
    OLMo 3 (Dec. 2025) — a reasoning update to the Tülu 3 recipe · paper · interconnects recap
    DeepSeek R1 (Jan. 2025) — RL as the centerpiece · paper · interconnects recap
    The recipe:
    * R1-Zero — pure RL (GRPO) on the base, no SFT; used to seed reasoning behaviors for the full run, not a separate product
    * R1 — cold-start SFT → reasoning RL → rejection-sampling SFT → final RL → distill to dense
    * A big change in recipes: Large-scale RLVR as the primary driver, SFT to distill and refine RL behaviors
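    GRPO drops the learned value function and scores each sampled completion relative to the other completions for the same prompt. A minimal sketch of that group-relative advantage, assuming one scalar reward per completion; the `eps` term and the choice of population standard deviation are implementation details assumed here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt, scored by a verifier (1 = correct, 0 = wrong).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# All-correct or all-wrong groups carry no learning signal: every advantage is 0.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

    Because the baseline is the group mean, prompts the model always solves or never solves contribute nothing, which is one reason prompt difficulty filtering matters in these recipes.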
    DeepSeek evolution after V3
    * V3 · Dec ‘24 — SFT + GRPO RL.
    * R1 · Jan ‘25 — multi-stage RL; reasoning emerges.
    * V3.1 · Aug ‘25 — hybrid think / non-think in one model.
    * V3.2 · Dec ‘25 — 6 specialists via RL → SFT distillation → one mixed GRPO.
    * V4 · Apr ‘26 — 10+ domain experts → MOPD.
    2026 style recipes!
    MiMo Flash v2 (Jan. 2026) — where MOPD started · paper
    Stages: Stage 1 SFT → Stage 2 train ~6 domain-specialist teachers (with older style post-training recipes) → Stage 3 MOPD into a single student.
    First clean articulation of multi-teacher on-policy distillation as the consolidation step — replaces a single monolithic RL stage with distill-from-specialists.
    Nemotron 3 Ultra (Jun. 2026) — two rounds, many teachers · paper
    Stages: SFT → multi-teacher on-policy distillation, run over two iterations, with >10 teachers spanning reasoning, code, math, and agentic domains.
    Novel: multi-round MOPD across different domains — distill, then re-distill from refreshed teachers.
    MAI-Thinking-1 (Jun. 2026) — closer to R1 than V4 · announcement
    Stages: mid-trained base → 3 specialist RL “climbs” (e.g. STEM) → trace-distillation SFT to consolidate the climbs → a final RL climb → MAI-Thinking-1.
    Closer to DeepSeek R1 than to V4 — multi-stage RL with trace-distillation SFT to consolidate, not on-policy MOPD. Not the only lab without MOPD!
    Kimi K2.5 (Jan. 2026) — agentic, multimodal · paper · blog
    Stages: text-only SFT → joint text–vision RL across coding, vision, reasoning, agentic tasks. (No mention of MOPD.)
    GLM-5 (Feb. 2026) — staged RL by capability · paper
    Stages: Base → SFT → Reasoning RL → Agentic RL → General RL.
    Transcript
    00:00:00 Nathan Lambert: Hello, we are back on an Interconnects conversation. I don’t really say I do interviews. People criticize me ‘cause I interrupt the guests too much. ‘Cause I’m not a good interviewer, but I’m here to entertain people. Um, this is also fun for me because I’m trying to make, like, a post-training course, and it kind of fits as, uh, in the advanced end of this.
    So it’s kind of a crossover between Interconnects content and other stuff that I’ve been spending my time on this summer. So I’m happy to welcome Finbarr back. I think... Are you the first return guest? I haven’t checked.
    00:00:37 Finbarr Timbers: Oh, wow.
    00:00:37 Nathan Lambert: Um, Finbarr and I worked on this sort of post-training recipe stuff for a while at AI2. Um, I left recently. This is one of Finbarr’s last days at AI2. It’s already been announced. It’s not a spoiler here. So we’re gonna kind of reflect on some things on building post-training recipes for OLMO. Um, then we have a little, like, review slide deck and notes on the kind of state and evolution of frontier post-training recipes over time, which is pretty interesting because there’s, what is it, like two to four kind of canonical recipes that there has been.
    So it’s kind of interesting when you see the field converge on something new, which it’s doing right now with multi-teacher on policy distillation. For some reason, that’s a bit of a mouthful. It is a long acronym. And then we’ll just kind of end with various discussion points on post-training and what we’re up to. So, happy to give you the floor if you have any hot takes you wanna start with to get people to, draw people in. Otherwise, I think, uh, I’m excited to kind of reflect on this, ‘cause I know you’ve been reading a ton of papers recently and kind of prep, laying some of this groundwork.
    00:01:43 Finbarr Timbers: Well, yeah. I mean, today is my last day at AI2, so it- it’s ki- it feels very appropriate to be, to be talking to you as you’re the one who recruited me to AI2. So, uh, yeah, that’s pretty special, and it’s great to be, uh, yeah, the, the first repeat guest. I feel honored, uh, to be back on. So yeah, thanks, uh, for having me.
    00:02:03 Nathan Lambert: Yeah. Do we wanna start with OLMO? I think that-
    00:02:05 Finbarr Timbers: Sure
    00:02:06 Nathan Lambert: ... people... I think I, uh, need to do this carefully, but I’ve talked about OLMO-3’s post-training many times to people. I haven’t done this in a very direct way on the podcast, but I would say that post-training OLMO-3 to make this reasoning model was a major accomplishment for many individuals to do this. But also, the complexity of what we were doing was pushing against the limits of AI2’s organizational capacity, and a lot of modern post-training is, like, your ability to wrangle compute data into a work stream.
    And in order to do that in a complicated way, you really are wrangling an org chart. And that’s like part of why it’s like OLMO-3 was, by its nature, pretty late as a reasoning model. It was, like, a pretty rigid reasoning model, and that’s, like, partially reflected in the recipe being pretty simple. But then when you, like, compare it to all these new recipes with tool use and multi-teacher distillation and all of this, it’s just like a, a, a fork in the road where it’s like you could do this very simple thing and make a strong recipe, but it is not representative of what all the frontier labs are doing.
    And I think that that kind of fork in being able to say that things are similar happened kind of after Tülu 3, where Tülu 3, I think, was also much simpler with this three-stage SFT, DPO, RL recipe. But that simpler recipe was probably closer in outcome to what the labs are doing, but now doing that sort of three-stage recipe for a reasoning model, and especially a tool use, like, agent model, just doesn’t really apply. And that’s the point. That’s why I think the point of this podcast is to be like, what are they doing to make these, like, true frontier models, and then shed some light on how it contrasts to the more a- like, open academic ones.
    00:03:56 Finbarr Timbers: Well, actually, I think that’s interesting. What was the proce- so, you know, I, I only, um, came around for Olmo 3. I wasn’t around for the earlier, um, versions. What was the process like to go from Tülu 3 to Olmo 2? Because, like, y- just looking on, on arXiv, um, I think Tülu 3 came out in November of ‘24, and then Olmo 2 came out in December of, of ‘24.
    00:04:22 Nathan Lambert: We just applied the recipe.
    00:04:24 Finbarr Timbers: Yeah. I, I mean, so, so I think that actually, like, yeah, and then, you know, um, DeepSeek R1 came out in January, end of January ‘25, and, you know, Olmo 3 was then released in October. Was it October or November of ‘25? Like, I think-
    00:04:39 Nathan Lambert: I think November.
    00:04:41 Finbarr Timbers: Yeah, November. Yeah, right. It was November. So it’s-
    00:04:43 Nathan Lambert: It was like do or die with Thanksgiving.
    00:04:45 Finbarr Timbers: I remember that. Uh, yeah, ‘cause Canadian Thanksgiving had, had already happened-
    00:04:50 Nathan Lambert: Yeah
    00:04:50 Finbarr Timbers: ... which, yeah, I was happy. Um, but, uh, like, like I think it was, sure, maybe it was late, but I think it was only late by a few months. Like, it’s, it’s actually, like, you know, if I think of my past experience with model turnaround times, like a nine-month model turnaround, you know, from R1 coming out, like that’s actually, that’s not bad. I think, you know, something like six months would’ve been nicer, but-
    00:05:12 Nathan Lambert: I, I think it’s slow ‘cause we didn’t re- it would be fast if we had rebuilt the R1 recipe. But what we did was we, like, ported reasoning into our existing recipe-
    00:05:21 Finbarr Timbers: Yeah. Okay
    00:05:22 Nathan Lambert: ... which is a simpler task, but has, like, a lower ceiling, in my opinion. Where it’s like the DeepSeek and the newer style recipes, which we’ll talk about, I think they just have a much higher ceiling in how much you can keep hill climbing them. Or they’re just, like, more prescri- more pedagogical of what the frontier is doing. Like, for the size models that OLMO was, which was like 7 to 30B, I’m not sure that doing this DeepSeek style RL first recipe is actually useful.
    00:05:52 Finbarr Timbers: Uh, well, I, yeah, I think that’s a good point. And I mean, I think that’s really reflected in what we see in the research where you s- you know, you obviously you see the big, uh, the step change and you know how quickly things are improving when, you know, R1 comes out. So, like, I think that’s a great point, and it really does seem to saturate, or to, to not saturate, sorry, with, with compute. Um-
    00:06:11 Nathan Lambert: Yeah. Um, shall we just do the slide deck? We’re throwing around, like, recipe-
    00:06:15 Finbarr Timbers: Sure. Yeah, let’s do it
    00:06:16 Nathan Lambert: ... names. Like, I feel like it might be useful to just do it because a lot of people probably want to follow but don’t exactly know. I’m, I’m gonna share, I’m gonna share a screen. So people listening, it might be useful to either, you can pull this slide deck up on your phone and click through it. It’s not super information dense, but you can also just watch it on YouTube. All of this will be linked.
    Generally, this is just like a quick survey on how frontier recipes have evolved. We’ll go through the history quickly and then talk about what is currently happening and kind of probably interleave the Olmo discussion we were having. Uh, okay. There’s a bunch of canonical recipes we’ll talk about. This is where I got the two to four number. I think the recipes are like InstructGPT, which is what coined the initial RLHF with this like three-stage idea, which took a while to get people to move on from, which was like SFT, reward model, and RL.
    And I see, like, Llama 3 and Tülu 3 as kind of practical implementations of that with, with other tricks of the trade. So those two could potentially be merged together. It’s just like kind of pre- and post-ChatGPT moment. And then the two most recent canonical recipes that we’ll cover in this I would say are like DeepSeek-R1, which is the shift to doing like reasoning focused and bigger RL stages than this kind of SFT focus from before, and then MiMo Flash and some of the new models from 2026 which add this distillation element.
    00:07:42 Finbarr Timbers: Well, and, and I think it’s worth pointing out too that it’s not just MiMo Flash, like it was kind of a consistent theme. Like you saw this with DeepSeek, th-they referenced it in, uh, the V3 paper and then it’s, you know, it’s Kimi K2.5, it’s GLM-5. Like it’s all of these papers, you know, start talking about this specialist, um, RL stage.
    00:08:03 Nathan Lambert: Yeah. I think there’s a debate on how we draw it and whether or not distillation is... If you’re, if you have distillation as a technique, as a key milestone, then they were, the Xiaomi was the first and, but it’s kind of a march over time where you kind of see them change, and we’ll, we’ll go through this. I don’t, I don’t need to interrupt.
    00:08:23 Finbarr Timbers: When you say distillation, I do think it’s important to distinguish between the straight up like, you know, distillation of the leading closed models and, you know, distillation of these domain specific models where, you know, I, I, I suspect that the, you know, the, the Chinese labs are doing both.
    00:08:41 Nathan Lambert: Yeah.
    00:08:41 Finbarr Timbers: But, you know, a lot of what they’re do, you know, but a, a lot of what they’re doing is this, um, training these domain specific models like, you know, a math model, a coding model, uh, you know, logic model, whatever, and then distilling those models back in and not just distilling from... So when we’re talking about distillation, it’s not just distilling from the leading closed models.
    00:09:01 Nathan Lambert: Yeah. It’s a pain. I agree. The distillation term is horribly overloaded. Um, there’s a review slide. Do we need to review multi-teacher on policy distillation? It might be too complicated to need to do it. We could come back to it. I think I kind of want to just go through the actual models, and then we could use the supporting slides as needed. Um, this famous InstructGPT three-step thing, I think many people have heard of it, but this is what constituted post-training at the time of ChatGPT coming out, so it’s kind of important grounding of this human supervised SFT data, mostly human supervised preference rankings to make a reward model and then do RL on that, and the model gets better.
    And it’s pretty interesting how all of these have been kind of phased out, at least in terms of what we know openly, where they’re, we don’t use that much human demonstration data for SFT. There’s likely some human preference data still in the loop, but I would guess that synthetic has a much bigger role, and there are reward models, but they’re like not the cl- key RL target anymore. So in four years, most, almost all the canonical pieces have been moved on. And like this evolution is kind of within there. I think the early models after InstructGPT, like Llama 2, um, even Llama 3, these are pretty similar, which is like you’re starting to break down this recipe with different tools like rejection sampling, DPO, some increased iterations. I think increased iterations is just that there was more incentive to squeeze more out of the models, and they just like broke things down more, where InstructGPT seemed like a bit more open-ended research where this kind of cleanness was fine. So-
    00:10:48 Finbarr Timbers: Well, I think that’s interesting, uh, with respect to how much everything has scaled, uh, right? Because, you know, InstructGPT was before ChatGPT was, was released, and so, you know, it’s something, like just the complexity of what was done is that which a small team or even a single team could do. But then when you start looking at, you know, Llama 3, like it just starts to be a more complicated process and, you know, where you start to have a lot more, you know, specialized data and there’s, you know, a lot more, you know, room for scale and for kind of money and complexity be poured in.
    00:11:25 Nathan Lambert: Yeah. It’s like, uh, both for-profit and nonprofit efforts to do post-training want me to advise them, and I’m like, “I don’t really know how I’m gonna give you advice unless I’m spending twenty hours a week look, understanding the details of your recipe,” ‘cause it’s like, well, I can’t really give you a one sentence thing of do X without understanding all the complexities of the model and the post-training process that go into it. Which makes it, like makes it hard from kind of like a transparency point of view. Even if it’s fully detailed, it’s definitely still hard to modify and study.
    00:12:00 Finbarr Timbers: Absolutely.
    00:12:02 Nathan Lambert: So then like Tülu 3 at AI2, a lot of this was we’re trying to beat the results of this Llama 3 post-training, which is pretty complicated, but we don’t have the ability to scale the organization as far. So I, I, I think that’s a big reason why the actual workflow is a lot simpler, where we have three clear stages that are doing slightly different things, and they build on each other. And that’s like... It’s never stated very explicitly in these papers on like how the org chart impacts the recipe, but I would, I, I think it’s a very strong signal within the, at least the delta between the fully open work and the kind of partially open work that you get from industry.
    00:12:43 Finbarr Timbers: Yeah, absolutely. And, and I think especially as we’ll see with the domain-specific models, like that’s like really clear, like something where you could really easily scale up your org chart to-
    00:12:54 Nathan Lambert: Yeah
    00:12:54 Finbarr Timbers: ... build that up.
    00:12:56 Nathan Lambert: Yeah. And I threw Olmo 3 in after this, after the Tülu 3 slide, mostly just to show that the recipe was so similar to Tülu 3, and the org chart hadn’t really changed. Like we didn’t have more ability to scale, and like there was a, a little bit of separation between the model types, between like the think and the instruct models. But like without a major reinvent- like a major org change, it was just kind of stuck in this and do the best you can with it.
    00:13:22 Finbarr Timbers: Yeah. Absolutely.
    00:13:23 Nathan Lambert: Be- because like the real big change was this with DeepSeek R1. They, I had never seen this plot before, but they had this plot, maybe they added it for the Nature version of the paper, where they kind of show their recipe, where they like take the base model, they do R1-Zero, and then they sample from the R1-Zero to like filter prompts, and then they use that as SFT. This is like going through this. They use that as SFT for the next version of the model to create like a development internal RL DeepSeek-R1, and then they do this like repeated sampling to train multiple RL versions and kind of distill, distill in the sense of, of clarify and refine the reasoning behavior of the model before going through the final pipeline, which again is a mix of, um, reasoning and non-reasoning SFT into a bigger RL run. And-
    00:14:11 Finbarr Timbers: Well, and I think this is really interesting because it starts to show, I mean, first of all, the, the complexity here. We’re starting to use, um, yeah, like synthetic data as this primary input here, but it’s not just like, you know, it’s trying to elicit, you know, specific behaviors, and it’s this kind of like industrial process, um, instead of like this, you know, it’s not as much of an elegant research recipe. It’s more like, you know, we train a model, and then we use it as best we can, and we keep iterating. Um, but I think the other thing that’s interesting is, is we’re starting to see here the SFT serving as the cold start. First of all, where, where that’s, you know, I think before SFT was more of like a generally useful stage, whereas here its, its primary purpose is this, this cold start for RL.
    And then the other interesting bit is, you know, DPO, uh, starts to disappear at this point from the leading recipes. I mean, Olmo 3 still does it, but you know, basically everyone else does away with it and just, you know, has the preferences included, um, as in, as a reward model or, you know, at so- at some way, um, in the reward bit of the RL stage. And so that’s a really interesting change, where the, the supervised part of post-training is just, you know, massively deprioritized.
    00:15:27 Nathan Lambert: Yeah. So my hypothesis for the dropping of DPO on these models is that, uh, as, as you’re doing like a cleaner recipe, essentially the need falls away. Versus if you look at Olmo, which is taking tons of potential gains by refining your model on outputs of strong open weight models, like largely Qwen and DeepSeek is the training data for the SFT of Olmo 3. Uh, and like the delta between that SFT data and the base model is still pretty big in the probability distributions. So DPO kind of helps further refine and clean up that distribution in a way that kind of has very rough edges. And but when you have a more refined, like industrial process on post-training, th-that will, that potential benefit will be harder to gain. Something interesting that I didn’t fully con-confirm before this is, for example, NVIDIA used to also be on this DPO train with their smaller Nemotron models.
    And, and I would guess that potentially like D- Nemotron Ultra would not. But it’s, and, and that’s because they’re at much further down this development tree and using on pol- like these more on policy methods for creating the SFT data. And their model, I would guess, will become kind of more robust out of distribution and like have weird, less weird rough edges before because of it. So that’s kind of my hypothesis on DPO, and people that use DPO will be looked down upon. But it’s like if you’re trying to bootstrap a recipe off the ground and just take gains where you can, I still think it’ll work for a lot of people in a kind of compute efficiency standpoint.
    00:17:05 Finbarr Timbers: Yeah. I mean, I think generally, uh, there’s something interesting with the, the preference tuning that, yeah, like maybe, um, it isn’t being given the proper, um, respect that it deserves. ‘Cause o-one of the interesting bits about the Nemotron 3 Super paper was that they saw pr- they, they do a, a traditional RLHF stage in their RL, which has also, you know, fallen out of fashion in development, and they see pretty massive gains with it. So I think some of these changes are more, you know, driven by what’s in fashion rather than perhaps like a fully rigorous, you know, set of ablations.
    00:17:41 Nathan Lambert: It’s very remarkable to me that the preferences loss function can do so much for these models. Like the models have so much potential there, and it’s just, it’s really a contrastive loss on pretty granular feedback. And they learn all sorts of things. Like they’ll, they’ll get better at math and code, or their reasoning strategies will be refined. And so I, I... That’s remarkable to me. I think there will still be funny research on like using preference-based losses with verifiable outputs. Like, I, I think all this would work. Like DPO on verifiable rewards and stuff like this, it’s just kind of intellectually less appealing.
    00:18:19 Finbarr Timbers: Yeah. Well, I think that’s, uh, you know, that’s where I thought that the, uh, delta learning, um, hypothesis style, uh, DPO, like what Olmo-3 did, where you, um, where the, the preference, you create these synthetic preferences by having like strong, by like bigger and smaller models of the same family, like is where you get your preferences from. I thought that was a really interesting signal because it, it seems really analogous to some of the work, some of the guidance stuff that we see in diffusion models, like how you have the classifier-free guidance, which has something similar, and there, there were very similar results there, which showed that you could have the--
    But like one signal they used was further along in training versus earlier in training models as like, uh, a source of, of signal that you could guide along. And, and that worked quite well. And so I suspect that these signals, um, for, for preferences in that way, like that they could actually be more robust, but because, you know, some of the largest labs don’t have to do that, perhaps we’re not citing them as much.
    00:19:18 Nathan Lambert: Yeah. Or they don’t tell us. Like, to continue this, it’s kind of cool to look at-- So the DeepSeek models have kind of gone through this, what I would call like l- closer to Llama recipes to DeepSeek-R1, which is d- like most definitively the canonical recipe for reasoning models, and then continue to change closer to this multi-teacher format. So if you look at the V3 paper, um, before R1, they do something remarkably similar to a Tülu 3 type thing, where they have a mix of SFT and then they use it ver-- like this RL on verifiable rewards. They didn’t call it that, or their paper wasn’t out at the time. And so they did this before R1 came out, which was just kind of a less reasoning-focused model and used the same tools but with a different ratio of implementation weight.
    00:20:07 Finbarr Timbers: And, and what’s interesting is that this comes out basically at the same time as Tülu 3, and it’s a very similar, Tülu 3 and Olmo 2. It’s a very similar recipe, just done with more compute.
    00:20:16 Nathan Lambert: Yeah. Yeah. And then we have this R1, which we’ve just talked about at length in January, which is a month later. They have a few more releases through this. They have some updates to their V3 and R1 models, which have dates, which are largely the same recipe. And then the next documented change in their recipe was V3.1, which is when they merged this thinking and non-thinking into one model, which everybody that does this says, has said that it has been hell to train in. But you kind of need it from a serving perspective, and it’s obvious that long term, at least obvi- it’s obvious to me that long term all the models will be reasoning models, and you’ll just have reasoning models that are very efficient based on the gains that are there.
    So this is kind of a needed change that they made. And then in December of 2025, they released V3.2, which is when there’s kind of meaningful changes to their recipe, and they’re talking about this expert creation with separate mini recipes, and then using that within their kind of R1 data process to do SFT data and then like a big RL run at the end with GRPO. So it took about a year for this, uh, like kind of evolution of the R1 style recipe to land in their models. And I think this, this is like a very big complexity step that isn’t represented in something like Olmo-3, and it’s kind of where you can see a fork in the recipes over time as like they, it, they become way more industrial and scaled at these frontier labs.
    00:21:46 Finbarr Timbers: Yeah. And I think, you know, another one good thing here, just from a historical note, is that I think it was with the 0324 release where they updated the original V3 paper. So, you know, V3 comes out before R1, then R1 comes out, and then after R1 comes out, they actually go back and update the V3 paper, maybe getting ready for the Nature submission or, or, or something.
    00:22:07 Nathan Lambert: Yeah.
    00:22:07 Finbarr Timbers: Um, and they make a reference there to say like, “Oh, you know, something you could do is you could train these domain specialist models and then combine them.” Uh, and then, you know, that later becomes kind of what, you know, the more of a priority as they talk about in V3.2.
    00:22:21 Nathan Lambert: That’s a fun note. Yeah. And then most recently in April ‘26 is this V4 model, which has even more experts. They add this new loss function for multi-teacher on-policy distillation, which I said follows Xiaomi. And this is kind of a microcosm of the arc that the whole industry went through, at least the people who share what their post-training details are, of realizing how core RL is, changing the recipe around scaled RL, and then figuring out how to kind of scale to more domains in the scaled RL format without just like grinding to a halt in operational complexity.
    00:22:58 Finbarr Timbers: Yeah.
    00:23:00 Nathan Lambert: So then kind of the next stage of this is these, what I call 2026-style recipes, which are all these models that are doing this multi-teacher, um, infusion of knowledge. And then some of them are using on-policy distillation and some are not. It’ll be one of the key things to see is like how crucial is this on-policy distillation to really keeping up at the frontier. So the paper that kind of, that named this term was the MiMo Flash v2 paper. I think the model was released in December and the paper in January, which a lot of things will look similar to this, um, kind of RL, large RL style recipe. But with this large RL run is more, is where the on-policy distillation comes in. So for, I c- this is probably a better time to explain. I have this great, great little feature.
    So this is like the summary of what multi-teacher on-policy distillation is. Generally, it fits within an RL framework where you have the model you are training, the, like the general model, sample its own trajectories, and then you route the trajectories to various expert models you have trained. And each kind of sample is trained with this distillation KL loss to match the tokens of that expert. And people have, multiple models have shown that this type of supervision is really useful for the models. You could combine it with other RL losses, such as verifiable rewards, which for example, Sasha Rush gave a good mini spiel on that and how they use that with Composer, which is a, a video that I really recommend people watch as well. But the, the key of it is that it is a different loss function, but it plays very nicely in the RL frameworks that people are already using. So they use these teachers-
    00:24:45 Finbarr Timbers: Just RL, like it’s, it’s, like if you-
    00:24:47 Nathan Lambert: Yeah
    00:24:47 Finbarr Timbers: ... actually implement it, you know, I’m talking with some of the people at Ai2 about implementing it now. And it’s like you take your RL setup, and then you just, you know, you, you have some very, your, uh, set of tweaks on the, the learner to actually implement this. So it’s quite straightforward.
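    A minimal sketch of the loss described above, not any lab's actual implementation. It assumes a reverse KL, KL(student || teacher), computed over the full vocabulary at every position of a trajectory the student sampled itself. The conversation only says "distillation KL loss", so the KL direction, the full-vocabulary form, and the names `reverse_kl_per_token`, `multi_teacher_opd_loss`, and the `"domain"` / `"student_logits"` keys are all assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary (last) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def reverse_kl_per_token(student_logits, teacher_logits):
    # KL(student || teacher) at each position; inputs have shape [seq_len, vocab].
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    return (np.exp(s_logp) * (s_logp - t_logp)).sum(axis=-1)

def multi_teacher_opd_loss(batch, teachers):
    # batch: list of dicts, one per student-sampled trajectory, with
    #   "domain" (routing key) and "student_logits" ([seq_len, vocab]).
    # teachers: dict mapping domain -> callable that scores the same
    #   trajectory and returns teacher logits of the same shape (hypothetical API).
    losses = []
    for traj in batch:
        teacher_logits = teachers[traj["domain"]](traj)  # route to the domain expert
        per_token = reverse_kl_per_token(traj["student_logits"], teacher_logits)
        losses.append(per_token.mean())
    return float(np.mean(losses))
```

    In a real RL stack this term would sit in the learner next to, or in place of, the policy-gradient loss, which is consistent with the "set of tweaks on the learner" framing in the conversation.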
    00:25:02 Nathan Lambert: Yeah, so this is a fancy diagram that makes it more complicated than it needs to be, but it’s also a very nice diagram, which shows the various, um, domain teachers that they have, search agent, code agent, math, reasoning, safety, and how they put these together. And the, the experts are used both for SFT data and then this final supervision. And the recipe for the experts would look something like this DeepSeek recipe, which is complicated on its own, which is like make a very good reasoning model that is good at one thing.
    00:25:29 Finbarr Timbers: Well, and I think it is complicated, but it’s also like if you, if you think about being the actual researcher like working on it, it’s like, you know, you have a base model, and then you have an RL set up, and you know, you’re just constantly updating both and then rerunning RL. So, you know, the, the most complicated like, uh, part of it is just, you know, writing down the history and tracing everything. But it’s kind of like a very natural, organic way, uh, for the r- the RL to evolve through, you know, iterative experimentation.
    00:25:57 Nathan Lambert: Yeah. So like once you have a recipe, you’re progressively tinkering with each part, and it’s, it’s fairly stable, but it’s hard to rebuild from scratch. So like we’ll see how, see how long the recipe shape lasts, but it’ll probably be order of years. Um, another big one in this like also shared a lot of details on this on policy distillation approach was Nemotron-3 Ultra, which is obviously exciting to me to have a, like a US-made model that is very strong performance, and NVIDIA released a lot of datasets with it.
    But they, they also talked about a lot of their very n- n- like implementation details of what was hard with on policy distillation. I, like I have notes somewhere on this. They do this thing where they have two rounds of on policy distillation, as they found it to be better to integrate some teachers one after another. And the paper has a lot more details. I’ve, I, I don’t wanna go scroll through the paper, but we could also do this. Did you have any o- other impressions? Like I have the, we have this other doc we can pull up that-
    00:27:01 Finbarr Timbers: Oh
    00:27:01 Nathan Lambert: ... also you might have had other details on it.
    00:27:03 Finbarr Timbers: Yeah. Well, I think something else, um, that, that is worth, um, you know, contrasting the, the paper to is the Nemotron-3 Super paper. ‘Cause in the Nemotron-3 Super paper, they had a similar complicated recipe, but they did multiple rounds of RL. Like there they had three rounds of RLVR, followed by a round of, um, software engineering RL, and then followed by an RLHF stage. So it was, it, it was really interesting to see them go from doing that, like, you know, one of the most complicated, um, RL setups or in terms of, you know, successive stages, uh, that I’ve seen, to then, you know, this setup where it’s still complicated, but it’s a lot, um, you know, it’s a lot con- conceptually a lot simpler.
    00:27:54 Nathan Lambert: Yeah. I, I pulled the paper up. It’s gonna be hard for me to... I, like I had highlighted a few details. The, the interesting parts are kind of around the, um, various NVIDIA details on all the teachers. There’s just so many details in their paper on training-
    00:28:10 Finbarr Timbers: Yeah
    00:28:10 Nathan Lambert: ... all the teachers. I think, okay, so I have some of it. I have some of this up. It’s like I have an interesting quote that’s like, “One key finding from our trials of doing on-policy, multi-teacher on-policy distillation is that teacher models trained with substantially different training pipelines cannot be effectively combined through a straightforward on-policy distillation merge, resulting in suboptimal performance.” So it’s like they’d have to do some cross-teacher alignment, um, to make sure that they’re actually similar, which I feel like could become a whole, uh, organizational nightmare. It’s like they say, “We hypothesize that when the teacher and student are trained on different SFT data, they acquire different reasoning behaviors and induce different output distributions. This distribution mismatch can cause student-generated trajectories to be out of distribution for the teacher, reducing the quality and reliability of the supervision signals provided by the teacher.”
    00:29:00 Finbarr Timbers: Yeah, that’s interesting actually because there was a paper, uh, I, I can’t remember the name of it, but there was a paper that I read, um, recently which claimed that what you need to do is constantly... So, so you know, you know, one thing you could do, which was kind of the, the obvious thing to do, is you, you take your base model, right? You do, um, whatever general SFT that you’re doing, and then you take, you do, you know, a bunch of RL, you train domain-specific agents, you train them, you know, all the way until they’ve converged or until you’ve run out of money.
    Uh, and then you take these final experts, and then you do some sort of, you know, on policy distillation to combine them into your, your final model. Um, but with the paper, and I’ll, I’ll try to find it and then give it to you, um, see if we can share it. What they claimed was that you need to, um, instead of using the converged model, you need to do it in like successive stages with like the in-progress model. So if, you know, you train your RL for like a thousand steps, you need to, you can’t use the, you know, the thousand step checkpoint to, for the on policy distillation. You have to do it in stages, and first use the, you know, two hundred and fifty step checkpoint and the five hundred checkpoint and, you know, gradually bring that base model like up to speed or else there’s gonna be too much divergence, and the, the KL divergence will just be like too, um, too distinct-
    00:30:17 Nathan Lambert: Yeah
    00:30:18 Finbarr Timbers: ... to learn from.
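    The staged idea described here can be sketched as a simple schedule over teacher checkpoints. This is a sketch under assumptions, not the unnamed paper's method: the evenly spaced checkpoints, and the `load_teacher` and `distill` callables, are hypothetical placeholders for whatever checkpoint loading and distillation step a real pipeline uses.

```python
def staged_teacher_checkpoints(total_steps, num_stages):
    # Evenly spaced teacher checkpoints, ending at the final one.
    # Even spacing is an assumption; the conversation only gives 250 / 500 / 1000 as examples.
    if num_stages < 1 or total_steps < num_stages:
        raise ValueError("need 1 <= num_stages <= total_steps")
    return [total_steps * k // num_stages for k in range(1, num_stages + 1)]

def distill_in_stages(student, load_teacher, distill, total_steps, num_stages):
    # Distill from progressively later teacher checkpoints so the student
    # never has to match a teacher that is too far from its own distribution.
    for step in staged_teacher_checkpoints(total_steps, num_stages):
        teacher = load_teacher(step)        # hypothetical: fetch the RL checkpoint at `step`
        student = distill(student, teacher) # hypothetical: one round of on-policy distillation
    return student

print(staged_teacher_checkpoints(1000, 4))  # [250, 500, 750, 1000]
```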
    00:30:19 Nathan Lambert: Yeah. So essentially the last state-- sentence in this paragraph I had read most of is literally like, “We encountered this issue in practice because the teacher and student models were developed in parallel.”
    00:30:29 Finbarr Timbers: Yeah.
    00:30:29 Nathan Lambert: It’s like they’re like, “This is a problem because of it’s, like, hard to do everything at once.” Which is w- this is the type of thing where having research in it would be so great, and I think NVIDIA could release some of the teachers so that people could just like-
    00:30:45 Finbarr Timbers: Yeah. That’d be great
    00:30:45 Nathan Lambert: ... if you have the teachers and you have the intermediate model stage, you could do the problem of, like, just studying multi-teacher on policy distillation from the starting point and understanding the training dynamics.
    00:30:57 Finbarr Timbers: Yeah.
    00:30:57 Nathan Lambert: Which is the type of thing we would want to do at Olmo. We just haven’t scaled our recipe to this point yet.
    00:31:03 Finbarr Timbers: Yeah, absolutely.
    00:31:04 Nathan Lambert: So I will keep encouraging NVIDIA to do this.
    00:31:07 Finbarr Timbers: That’d be great. NVIDIA-
    00:31:08 Nathan Lambert: I think, uh-
    00:31:08 Finbarr Timbers: ... listen.
    00:31:10 Nathan Lambert: They, they listen. The other side of things is a bunch of models released in 2026 that do not do this multi-teacher on-policy distillation, and they also don’t do nearly as many teachers. So I would say that this, like, Microsoft model, which I don’t say this as a diss, it’s, like, hard to get a new team off the ground, is they went for a simpler approach to try to get a solid model, and it has three more general experts combined w- via SFT and then, like, a longer RL run. So it looks a lot more like DeepSeek R1, but I suspect that what they will do next is make finer-grained teachers and see if they need to switch to on-policy distillation.
    00:31:48 Finbarr Timbers: Yeah. And I think, you know, in one of our, um, group chats, you described the MAI thinking model as a conservative recipe. A-and I think that’s a really good description of it. Like they, you know, the, the team came up with this conservative recipe, and then I think that they did a really great job of actually executing on it. ‘Cause I think, you know, if you try to make too many changes at once, it’s really easy for the recipe to collapse under its own complexity, and I’ve seen that a bunch of times, you know, across my career.
    Try to make too many changes and, you know, it all goes poorly. So I thought that was, um, a really good choice on their part. I, I also think that, uh, it’s not super clear to me, may-maybe you’ve seen some papers on this that I haven’t seen, but it’s not super clear to me how well the trace distillation SFT does or, you know, h- how much better on-policy distillation is versus the trace distillation SFT.
    00:32:41 Nathan Lambert: Yeah. It’s like what’s, what is the relative magnitude in the final performance?
    00:32:45 Finbarr Timbers: Yeah.
    00:32:45 Nathan Lambert: So the Nemotron Ultra paper has a table on how far the on policy distillation goes relative to the teacher, and they also have the starting point. So I guess that’s a potential way to do this. Here, I could, I could just pull this up. Let me switch.
    00:33:00 Finbarr Timbers: Oh, sure.
    00:33:04 Nathan Lambert: So I, I had this open, but in a different tab. Okay. Here’s, here’s this paper. This is page twenty-seven, which is the paragraph I just read, and then it also has this kind of-
    00:33:17 Finbarr Timbers: Oh, fascinating
    00:33:18 Nathan Lambert: ... it is a great table. I spent a while looking at this earlier. So essentially, it’s like where they get after SFT-
    00:33:24 Finbarr Timbers: Wow
    00:33:24 Nathan Lambert: ... on each of the benchmarks on the general model. And then I think... Okay, so the sort of gains over the RLVR student recovery of the specialty student. So I need to make sure... Okay, so it denotes the initial student checkpoint, where RLVR denotes the s- initial student checkpoint, and then the multi-teacher on policy distillation. So I’m not sure what this SFT column can figure out, but you could see the kind of like where the teacher is relative to on policy distillation. I think this is like the closest information we have on the relative performance gains.
    00:33:59 Finbarr Timbers: Yeah. That’s fascinating because the DeepSeek, I forget which one, maybe it was V3.2 paper claims or, or maybe it was, um, R1 actually claims that you can domain-specific... That, that, you know, doing the general stage, uh, captures the performance, uh, of it. But, you know, that, that doesn’t really seem to be... A-a-and yeah, a-a-and then so, you know, doing the domain-specific distilling in, and then doing a general stage on top of that captures the original performance. But that doesn’t seem to be the case here. Like, you know, the, the gap maybe isn’t huge, but there is still, most of the time, there’s a pretty big... There, there’s like, you know, a significant gap, even if it’s not huge. So that’s really interesting.
    00:34:42 Nathan Lambert: Yeah. I wish this table and text was clearer. It’s like I literally can’t fully parse it. It’s like RLVR denotes the initial student checkpoint, and then OPD denotes the checkpoint after first and second iterations. It’s like, what is the checkpoint that was used at the start of on policy distillation?
    00:35:01 Finbarr Timbers: I think it was the RLVR one, so that they do a general SFT stage, and then they do an RLVR stage that covers the non-teacher, the, the areas that where they don’t have specialized models. Then they do MOPD.
    00:35:15 Nathan Lambert: Yeah. And then that makes sense with this recovery rate, which is like final model minus RLVR, which would be like the gains for the OPD relative to the teacher minus RLVR, which would be like what gains you needed to still cover.
    00:35:31 Finbarr Timbers: Yeah.
    00:35:32 Nathan Lambert: And like what, what gains the teacher could potentially give you. So more research like this. Happy to see some of it a- out there. I’m gonna switch back.
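    The recovery rate as read out in the conversation can be written as a one-line ratio. This is a sketch of that reading only; it has not been checked against the paper's exact definition, and the function name and the scores in the usage line are illustrative, not numbers from the table.

```python
def recovery_rate(rlvr_score, opd_score, teacher_score):
    # Fraction of the teacher's advantage over the RLVR starting checkpoint
    # that the student recovers after on-policy distillation:
    #   (final model - RLVR) / (teacher - RLVR)
    gap = teacher_score - rlvr_score
    if gap == 0:
        raise ValueError("teacher and RLVR checkpoint score the same; rate is undefined")
    return (opd_score - rlvr_score) / gap

# Illustrative numbers only: the student starts at 60, the teacher is at 80,
# and the distilled student reaches 75, so it recovers 15 of the 20 available points.
print(recovery_rate(60.0, 75.0, 80.0))  # 0.75
```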
    00:35:43 Finbarr Timbers: Yeah. Something I found interesting about the, um, the, uh, both the Nemotron papers and then the MAI thinking paper is that they don’t talk as much about some of the more detailed, um, post-training decisions that have shown some pretty strong gains in, um, some of the other papers. Like I, I believe it was GLM five where they talk about doing a difficulty curriculum and a difficulty filtering stage.
    00:36:11 Nathan Lambert: Yeah.
    00:36:12 Finbarr Timbers: And that’s just not something that’s really talked about in these other papers. They’re saying they, they don’t, you know, uh, I think it was Kimi K2.5 used a temperature. It’s kind of funny. So Kimi K2.5 and GLM five both have temperature schedules, uh, and they both claim the exact opposite thing. So one of them says you have to start with a high temperature and go low. The other one says you have to have a low temperature and go high. And, uh, y- I don’t know. And then so, you know, you don’t see that discussion, uh, I, I don’t think in some of the other papers, which is kind of interesting.
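    To make the two opposite claims concrete, here is a minimal sketch of a sampling-temperature schedule over an RL run. The linear shape, the function name, and the 1.0 and 0.5 endpoints are assumptions for illustration; neither paper's actual schedule is reproduced here.

```python
def linear_temperature(step, total_steps, start, end):
    # Linearly interpolate the sampling temperature from `start` to `end`
    # over the run, clamping steps outside [0, total_steps].
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)

# "High then low": explore broadly early, sharpen later.
high_to_low = [linear_temperature(s, 100, 1.0, 0.5) for s in (0, 50, 100)]
# "Low then high": the opposite direction, with the same endpoints swapped.
low_to_high = [linear_temperature(s, 100, 0.5, 1.0) for s in (0, 50, 100)]
print(high_to_low)  # [1.0, 0.75, 0.5]
print(low_to_high)  # [0.5, 0.75, 1.0]
```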
    00:36:40 Nathan Lambert: Yeah. I, I still think the Chinese labs are much more willing to share, like really, really nitty-gritty tech details. The NVIDIA paper is like mostly a list of like methods to create a teacher or like-
    00:36:51 Finbarr Timbers: Yeah
    00:36:51 Nathan Lambert: ... domain-specific teachers, which is useful, but I think like I was less... It’s like less of a fun read. They’re like, there’s 15 pages of different domains, so I’m like, “Okay, I don’t, like I don’t need this.” Yeah, like Kimi K2.5 and, uh, GLM 5 actually have like more similar recipes, which are also on the simpler side, which is like you create this SFT stage, and then you do RL. The RL might be staged. Um, there’s not this on-policy distillation. There’s a bit less talk on how many experts they have and what their domains of expertise are. I think it, it’s obvious, like you have to take all this with a grain of salt, and it’s like what, how they decided to present the information is like a big factor in this. And then like they might actually be closer in reality and then it just wasn’t described in a certain way.
    00:37:44 Finbarr Timbers: I, I think another interesting bit is that you see the Chinese labs, uh, all seem to be converging towards sparse attention, whereas, uh, we don’t see the, you know, whereas the American labs, at least NVIDIA and, you know, Ai2 seem to be more converging towards hybrid attention. Uh, like N- uh, the NVIDIA Ne- Nemotron Ultra used the Mamba, um, attention, whereas, you know, we see, you know, DeepSeek sparse attention and then the MiMo, eh, MSA, whatever that stands for, MiMo Sparse Attention. So I, I think that’s, uh, an interesting divergence.
    00:38:20 Nathan Lambert: Yeah. I am not the person to ask, but I agree.
    00:38:23 Finbarr Timbers: [laughs]
    00:38:23 Nathan Lambert: It’s like I... Like I, I often get asked of like, this is to, to... Don’t, we’ll avoid the full rabbit hole, but I often get asked like, “Are the Chinese labs more efficient?” And I’m like, “I don’t really know how I’m gonna give you advice unless I’m spending twenty hours a week look, understanding the details of your recipe,” ‘cause it’s like, well, I can’t really give you a one sentence thing of do X without understanding all the complexities of the model and the post-training process that go into it. Which makes it, like makes it hard from kind of like a transparency point of view. Even if it’s fully detailed, it’s definitely still hard to modify and study.
    00:38:42 Finbarr Timbers: Yeah
    00:38:42 Nathan Lambert: ... like if you make a GPT model 1% more efficient, you’re making like fat stacks of profit. Like, I think that’s like a more effective market mechanism, but-
    00:38:53 Finbarr Timbers: And then-
    00:38:53 Nathan Lambert: The Chinese lab-
    00:38:54 Finbarr Timbers: You know-
    00:38:54 Nathan Lambert: Yeah
    00:38:55 Finbarr Timbers: ... if you make, you know, serving ChatGPT more efficient, Sam Altman can say, “Hey, here’s a bunch of stock.” Like, so yeah.
    00:39:02 Nathan Lambert: Yeah. But, uh-
    00:39:03 Finbarr Timbers: Um
    00:39:03 Nathan Lambert: ... they do great, like the Chinese labs do great research.
    00:39:05 Finbarr Timbers: Absolutely.
    00:39:05 Nathan Lambert: I just think it’s kind of a bit different. Okay, we can move into more open-ended stuff here.
    00:39:12 Finbarr Timbers: Sure.
    00:39:12 Nathan Lambert: I think that we have like a bunch of docu... We have th- a bunch of things in a document here. I’m sure more will come up. How do you think about open models and kind ‘cause i- it just doesn’t strike me that there’s this, like, you know, I think that there’s a large business to providing... Well, actually that’s not even super clear. There’s, you know, we’ve seen a number of companies providing, you know, RL fine-tuning services, you know, RL as a service. We’ve seen a lot of companies try to provide fine-tuning as a service, and, you know, none of them have really taken off. Like, I think OpenAI has started to shut down, I think they shut down their RL fine-tuning. I think they might be shutting down their fine-tuning. May be wrong about that.
    00:45:51 Nathan Lambert: Well, it’s like Cursor used Fireworks for their actual training run, and I’m like, I don’t really know all the details of this, but Cursor does something for fat- I think like fast weight tran- or Fireworks does-
    00:46:01 Finbarr Timbers: Yeah
    00:46:01 Nathan Lambert: ... a fast weight transfer and other things to make it so that they can scale their RL inference compute very nicely. So that’s one type of it. I don’t know how big of a long tail that business is, but also I think Tinker is a better business than most people expected. It makes some real amount of money. It’s like in the hierarchy, I think selling compute, not the best business.
    00:46:23 Finbarr Timbers: Yeah.
    00:46:23 Nathan Lambert: Selling inference, great business. And Tinker-like APIs, if you can’t transition it into selling tokens, is somewhere in between the two, where they could take some amount of margin that’ll be slightly higher than just selling the compute. And they obviously get a margin by having, like, they get compute at a cheaper rate than their customers-
    00:46:43 Finbarr Timbers: Yeah
    00:46:43 Nathan Lambert: ... and that’s like part of the margin they’re taking. But I don’t see it being as nice as inference, so it’s kind of existential for them to make it so that these fine-tuning APIs feed into an inference business pretty nicely.
    00:46:56 Finbarr Timbers: Yeah.
    00:46:56 Nathan Lambert: Because then you can be somewhat locked in on you train the model on our infrastructure. You actually can own the model weights, but the training dynamics to inference mismatch is perfect because you trained exactly on our inference engine, and are gonna get what you want out of it.
    00:47:11 Finbarr Timbers: Yeah. And it also helps a lot with utilization because you can then, you know, utilize it. You, you can share that utilization across a lot of clients. So I think it makes a lot of sense. I think it’s probably a better model for a lot of, um, users. Like, I think of academic users, like it probably makes way more sense to do this. Or, you know, for that matter, if you’re, you know, as, uh, uh, starting a new, um, ar- you know, post-training lab now, as you know, I, I know a few people, um, who are. Like, I think that’s where it, it probably makes a lot of sense to start with something like the Tinker API, and then, you know, at some point if you wanna try and capture that margin, maybe then you try to do something more custom. But if you, if you can use something like that, like that’s great, and the economics are just, you know, fundamentally more sustainable. I or, you know, they’re better for you rather than trying to, you know, g- go to CoreWeave or whoever and say, or Serv scale and say, “Hey, I need, you know, 10,000 networked, uh, GB200s,” you know? That’s just a very expensive, um, thing to do, especially if you can’t keep it running all the time.
    00:48:14 Nathan Lambert: Yeah. Do you have a, do you have any more hot takes on post-training before I ask you some more general things?
    00:48:22 Finbarr Timbers: Uh, well, something I’m, I’m generally interested in and, you know, I, I’m the wrong person to, to speak to about it. I’d love to talk to someone who’s maybe a, a, a capital allocator, like who’s, you know, deciding or a compute allocator who’s deciding where to put, uh, compute or, you know, where to hire team members. Um, because I’m kind of curious how, uh, the high-level decisions are made allocating resources between pre-training and post-training. Uh, ‘cause, you know, what I kind of have seen as, as a general trend is, is that you see a lot of papers where there’s, you know, more focus put on one or the other. Uh, like I think... So, so yeah, so that’s something kind of interesting to me is how people who are, you know, making this decision, how, how they’re making that decision and how they’re thinking about it.
    00:49:10 Nathan Lambert: Yeah. It’s like the hardest decision to get out of labs. I’ve like, I used to spend time trying to get them to share more, but I, I think it’s like such a sensitive decision to where they see progress coming. Like they’re making that decision ba- allocating compute based on where they think the most progress is and what the like return on investment is. So if you go to Anthropic and they’re like, “Here’s where our percent, here’s our distributions,” it’s like, okay, that’s where labs see their bets and/or where they see they are weak.
    And it’s like you invest more compute in the pro- to make progress in the area that you are interested in, which I always think makes a lot of the open research kind of boring right now, is like the people that get compute are just way more likely to succeed as academics and researchers, which is a horrible equilibrium for the world, but kind of realistically true. I, I, I don’t know how to make a lot of that. I wanted to ask you how you feel about the craze that people have to cash in on making money and join a lab before the ladder gets pulled up, and what people should be optimizing for in their careers in face of meaningful opportunity costs.
    00:50:18 Finbarr Timbers: Yeah. I think it’s, well, that’s actually very, very timely. Uh, but yeah, no, I, I think that that’s, um, really important to, to talk about. I mean, I think it’s always worth focusing on whether what you’re doing and spending time on is gonna be generally valuable or if it, if it’s like a really short-term exploitation type thing in, in the, you know, RL like explore versus exploit setup. I, I mean, something that I’ve seen throughout my career has been often the places that pay the most, um, are also the places where you’re doing the most interesting work, right? Like, you know, if, if you’re gonna go work at OpenAI or, you know, Anthropic or the frontier labs, like they pay a lot of money. They also have a lot of resources, so you’re gonna make a lot of money and learn a lot.
    Um, uh, so I think it’s worth trying to decide i- is that the, is the opportunity that you’re doing that or is the, is the opportunity like, you know, in 2021 or 2022 or whatever, where you might say, you know, I was at DeepMind at the time and it’s like, okay, do I work at DeepMind, which paid a lot less than like crypto? Should I go just, you know, work in crypto and try to, you know, mint NFTs or whatever? I think that would’ve been a mistake, but, you know, trying to figure out, um, if you’re gonna be able to do interesting work is really important and also, you know, try to figure out if you’re going to be able to, you know, push forward science. You know, if, if what you’re doing is more just saying, going to, you know, data vendors and saying, you know, “Okay, you know, we, I need a bunch of data to do whatever.” And then, you know, they, they give you a bunch of data, you train a model, you say it’s good or bad or whatever.
    You know, I don’t think that’s as interesting and, and I don’t think you’re gonna learn a lot even though that’s, you know, work that would probably drive model progress for it. I think if you’re able to, you know, make, focus more on the science and make more scientific conclusions, I think that can be, you know, a lot better for your long-term career. And I think that’s where places like Ai2 and the other, um, academic research labs, you know, Marin is doing a really great job of this. Um, I think that’s where you can have a lot of impact in that they don’t have the budget to go and buy a lot of data, and so that lever just really isn’t, um, open to them to pull. And so they have to focus on science and driving innovation, and that’s where you can see things like the Olmix, uh, paper, which I thought was a really excellent, uh, sc- you know, scientific paper, but also, you know, meaningfully, I think, advanced, uh, the state of the art.
    00:52:32 Nathan Lambert: Yeah. No, mostly this is grounded in visiting the Bay Area, and every time I go I’m like, “Holy s**t, what is going on here?” Like all these very junior people are like have way too much dread about their, uh, opportunity cost and both of us aren’t based in the Bay Area, so I feel-
    00:52:46 Finbarr Timbers: No
    00:52:46 Nathan Lambert: ... somewhat removed from it, which gives me a little bit more time to pause and be like, what exactly is the right thing to optimize for? I per- I-- it’s easy for me to say as somebody that’s established, but I think there’s opportunity for a lot of people to just, if they have conviction on something, to try to go and do it and not just follow everybody that goes down the funnel of joining one of the established labs or the Neo labs where I don’t hear from many people that join as a junior person at these places and end up with very high responsibility. Like they’re contributing to something that matters or they’re around a cool group of people, but I don’t hear from that many people that are like, “Wow, I am doing the highest leverage stuff and the most interesting things.”
    00:53:30 Finbarr Timbers: Well, I think that, you know, it’s kind of funny for, for me to say this as I, my career has been more on, on the opportunistic, uh, side of things. Um, but you know, twice now, uh, I’ve been at organizations where, um, I, I’ve been working... So, you know, at, at DeepMind, uh, I, I was part of the Alberta office where DeepMind had, you know, acquihired the, uh, computer poker research group from the University of Alberta. And so, you know, this was a group of people who were really invested in, uh, computational game theory and g- you know, poker playing, um, algorithms. And they were all in on that and, you know, they, they were all in on that to the point that, you know, they were one of the two leading, uh, labs in the field and, um, were, you know, b-because they were so strong at this, they were then, you know,
    DeepMind came and, you know, acquihired them and, and they all joined and they, you know, did quite well from that, um, acquisition there. And then, you know, you know, I joined later because I was, uh, you know, interested in, in working with them and doing game theory and stuff. But you know, it was this group of people who had this conviction that what they were doing was really important and, you know, it worked out quite well for them. And then, you know, the same thing at Ai2, where at Ai2, you know, there was all of these people who were really interested in, uh, NLP research, you know, even before language models. Like we see people like, you know, like Kyle a-and Dirk I think were both at Ai2 for like almost a, a decade.
    Like they had these really long tenures, um, and then they did really well and then, you know, they’ve, they’ve since had some, you know, strong, um, opportunities, uh, coming out of that with, with, um, yeah, some of the opportunities that have been available to them. And I, and I think that the consistent theme there has been that, you know, if you have high conviction that what you’re doing is important and interesting, then like it, it’s not a mistake to follow that and to, you know, try to become really strong, um, in that area.
    00:55:15 Nathan Lambert: Yeah. I mostly think it’s good for the world to have a di- more diverse set of approaches.
    00:55:19 Finbarr Timbers: Yeah.
    00:55:19 Nathan Lambert: It’ll be interesting to see what the neo labs actually produce if, if they can manage to do things that are diverse. My personal idea is that they’re so big now that most of them need to end up doing something that is somewhat similar, which is-
    00:55:33 Finbarr Timbers: Yeah
    00:55:34 Nathan Lambert: ... hard, but like they need to keep risking the comp- they effectively need to risk their $20 billion valuations to do something interesting that’s not just gonna be like squashed by an OpenAI or Anthropic side project.
    00:55:48 Finbarr Timbers: Yeah, absolutely. And I think it’s tough because when you’re raising, when you’re, you know, you have these huge seed rounds and you’re raising, you know, 200 million or, you know, a billion dollars or whatever, then it’s like you have to pretty quickly show results to be able to-
    00:56:01 Nathan Lambert: Yeah
    00:56:01 Finbarr Timbers: ... you know, grow off of that.
    00:56:04 Nathan Lambert: Yeah. So a to-be-continued conversation.
    00:56:11 Nathan Lambert: Any last words? I don’t, I don’t need to stretch it on if we don’t have anything to add to our conversation.
    00:56:16 Finbarr Timbers: No, I, I think this was pretty good. I think it was really great, uh, getting a chance to catch up and talk about some of this stuff. You know, I, I’ve been reading all of these papers and thinking about all the different recipes, so it’s great to get to, um, to chat about it and put it out into the ether. So yeah, thanks for having me on.
    00:56:31 Nathan Lambert: Yeah, thanks for coming back. We’ll talk soon.
    00:56:33 Finbarr Timbers: Sounds good.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
  • Interconnects

    Claude Fable 5 and new AI safety fables

    09/06/2026 | 12min
    Edit Jun. 11: Anthropic changed their silent model manipulation of AI research queries to also use a classifier like the other safety domains. This addresses a key concern I had in the mistreatment of “safety” in the release, and props to Anthropic for a quick change, but it does not fully address the trust that has been broken. I shared more reflections here.
    Today, Anthropic released their Claude Fable 5 model to consumer and enterprise audiences. This is the general-access variant of their Mythos-class models. With it, Anthropic rolled out a series of safety measures — some explicitly called out to users and some modifying the model without telling the user. It should be less surprising than it is that the next major step in AI capabilities came with heavier-handed safety measures indicating Anthropic’s intention to protect, or entrench, their current lead.
    The unevenly applied safety policies that Anthropic has rolled out are on track to become a classic cautionary fable about how narrow and self-fulfilling notions of safety and control rarely work out.
    The smartest model in the world
    Before digging into the nuance of the safety facts, it is important to establish the quality of this model. First, the quality of the model sets the stakes of today, as these safety features are meaningfully changing the shape of access to frontier AI, something which has never happened with the modern LLMs we know. Second, the capabilities point to this story only accelerating. Recursive self-improvement isn’t quite the right mental model of progress from here, but Claude Fable 5 should make it very clear that there are no immediate walls in training LLMs.
    To start — Claude Fable 5 is definitely the smartest model available to the general public — a remarkable leap on pretty much every relevant benchmark of the day — at only 2X the price of current Opus models (which is still less than GPT 5.5 Pro’s variant). This alone is a seminal moment for the field. To have a model iteration take such a substantial step in capabilities, a few years into the post-ChatGPT LLM race, is astounding. There’s no clear breakthrough associated with this model, such as inference-time scaling or RL, and public wisdom is that this is achieved by advances across the whole stack (of course, we can’t know for sure — it’s not documented). This is a major technical achievement and the employees who built the model should be very proud of their work.
    This model was delayed 2+ months after it was done training before it was publicly available. Given the competitive dynamics of the AI economy, the smarter version of this model is already well underway.
    To continue, the benchmarks for the model are below.
    An asterisk on these scores is that these aren’t necessarily the scores that the public will get, as some of the prompts will be downgraded to Opus 4.8 with the current safety filters on the model.
    This is the type of jump in benchmark scores where I don’t even need to substantially test the model to know it’s an incredible tool. Remember that Anthropic is also the AI lab with the track record of caring the least about benchmarks (in particular, when compared to OpenAI and Gemini). Recall a comment I made in June of 2025:
    This is a different path for the industry and will take a different form of messaging than we’re used to. More releases are going to look like Anthropic’s Claude 4, where the benchmark gains are minor and the real world gains are a big step. There are plenty of more implications for policy, evaluation, and transparency that come with this. It is going to take much more nuance to understand if the pace of progress is continuing, especially as critics of AI are going to seize the opportunity of evaluations flatlining to say that AI is no longer working.
    Clearly, a few pieces of the progress dynamics have changed, but that’s a post for another day. I’ve written multiple posts about new models this year, specifically on how it’s hard to trust benchmarks (and partially because the benchmarks don’t move that much). Altogether, this is a major validation for AI-savvy workers who realized they’re likely never going to write meaningful code again and need to develop new workflows around agents.
    Interconnects AI is a reader-supported publication. Consider becoming a subscriber.

    Smarter models spawn new safety games
    There are multiple pieces of safety tooling associated with this release, including but not limited to required data-retention policies and added prompt filters. Throughout this analysis it is particularly important to be precise and clear as to which of these pieces are causing harm, and why single elements being out of place in an otherwise comprehensive policy are so damning for the overall safety process.
    For their focus areas of cybersecurity, targeted model distillation, and research biology, Anthropic details new safety classifiers in their blog post:
    Fable 5 comes with a new set of classifiers: separate AI systems that detect potential misuse, including jailbreak attempts, and prevent the main model (in this case Fable 5) from responding. We’ve been running classifiers on our models for some time, and Fable 5’s classifiers are an extension of this previous work with extra coverage.
    When Fable’s classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs. Opus 4.8 is a highly capable model in its own right: a response that falls back to Opus is a far better experience than an outright refusal from Fable. Our early data shows that more than 95% of Fable sessions involve no fallback at all—for those sessions, Fable 5’s performance is effectively the same as that of Mythos 5.
    Examples of the primary cybersecurity and biology safety filters, which tell the users explicitly when they’re triggered, are already proliferating online and appear quite sensitive. These can be a frustrating experience for users, but Anthropic is definitely within its rights to do this and is intellectually consistent in doing so.
    The damaging part of the safety story falls below the fold in the Claude Fable 5 & Claude Mythos 5 System Card:
    We have also added safeguards related to frontier LLM development. As discussed in Section 6.1 of our February 2026 Risk Report, we are concerned about the risks of accelerating the overall pace of AI development, though we remain uncertain about the severity of these risks. In particular, our concern is with—as we wrote then—“accelerating other AI developers in building powerful AI systems that pose similar risks to the ones ours pose - without necessarily having commensurate safeguards.”
    In light of the ability of recent models to accelerate their own development, we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.
    Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).
    Anthropic documents how this will impact only a small percentage of users, which is true. I focus on that small number of users, those supporting AI’s diffusion and understanding outside of the few frontier labs, as a crucial mechanism for the continued safety of the technology.
    Anthropic is documenting how the proliferation of AI capabilities is a concern to them, but they are solving it by misleading their users. An AI model that gets less intelligent automatically without notifying me is categorically misaligned AI. The next step on this line — not that Anthropic did it, but they could — is to have a model silently manipulate a workplace when it thinks it is an unsafe use for AI. Second, the implementation here is more complicated than was documented for cybersecurity or biology — modifying the model itself or the data presented to it, all without notifying the user.
    The duality of these policies is extremely confusing and creates a strong inconsistency that casts doubt over their safety policies. This “safety” measure reads as being far more about maintaining their competitive position. Again, if all of the safety policies took one form, this would be far more cogent and easier to support intellectually.
    Anthropic has been very vocal about their concern over distillation attacks, particularly from Chinese actors. Their claims are not transparent enough with the facts, or with context as to why they can’t prevent the behavior, to be fully believable. Despite the limited information, in the broader AI and DC communities, there have been serious discussions about taking action against the Chinese model builders on the grounds of said distillation.
    On the point of distillation, my hypothesis is that API builders don’t have an easy time preventing hacks or jailbreaking because it’s a deeply grounded property of reasoning models to want to output the reasoning traces, and it would make the model far less intelligent to fully patch the behavior. This is based on a few assumptions:
    * Chinese labs are not just showing up as customers to Anthropic’s API and paying for tokens in the intended input-output form. If the Chinese labs are paying for intended use behaviors, despite being banned by the terms and conditions, I don’t have a lot of sympathy for the frontier labs manifesting policy actions against this.
    * Reasoning traces are disproportionately effective at seeding behavior in downstream models.
    * Leading labs work very hard to patch the pipeline of these jailbreaks.
    So, my logical conclusion is that the model companies would have to weaken their economic position to fully protect their IP. If this is the case, Anthropic would get a lot more sympathy from the AI research community by being transparent. It would also be far easier to have informed policy discussions, and not rely on me proposing Occam’s razor explanations for what the API jailbreaking looks like.
    Building these safeguards is not something that Anthropic should do alone. Safety research should be built on common understanding and information sharing across both labs and public research efforts.
    If the exact safety procedures were actually the top line item to the company — a true non-negotiable for the leadership — they wouldn’t permit the model to be released with an unclearly implemented safety filter in one of their areas of focus (frontier AI training). I am asking — why isn’t there a classifier to downgrade AI research requests? This is a mix of transparent and reasonable safety policies with quietly rolled-out market entrenchment tactics.
    I personally cannot trust the best AI model in the world to work in my professional domains building models, which I’ve constructed entirely out of a passion for making sure the transition to very powerful AI systems goes well for society. This inevitably will feel like a declaration of superiority by the Anthropic leadership.
    The control problem and open-source as the only answer
    All of the actions Anthropic is taking, including calling out smaller Chinese companies for distillation, are well within their rights. In fact, many people already expected the leading frontier models to be withheld from users so that labs can protect their IP. Today’s actions miss the big picture that AI will always be an ecosystem, and cultivating an us-against-them dynamic between the leading company and the other players is structurally unstable.
    Remember, this is at a time when the AI ecosystem is seeing the first stirrings of violence against AI leaders — and I’ve heard from many people that they don’t expect it to abate. I wish I knew how to engage more to prevent this, and I see myself in the non-profit sector as someone who can hopefully independently represent AI to broader stakeholders.
    I believe there was something misread, or at least misunderstood here, by the Anthropic leadership having a narrowly cultivated worldview around AI. An overwhelming sentiment I had today was one of obligation and confusion. I shared how I don’t really want to have to go to bat against Anthropic, but they’ve just been unnecessarily antagonistic to China, then not so subtly to open weight models, and now more broadly to open AI research.
    I understand that Anthropic has a specific view of AI, but such a powerful technology will never have its final equilibrium be one of singular control by a private company. Anthropic showcased this earlier this year in the spat between the Department of Defense and themselves — which points to a long-term equilibrium where the government will either want AI to be controlled by them or to be open. This made me believe that an open ecosystem is a far safer outcome.
    Many of these events make me feel that Anthropic’s leadership has a culture by which they can’t help but speedrun through these issues — going head to head with existing power structures. This adds substantial uncertainty into an AI ecosystem at a time when it is very much not needed.
    Collectively, the last week could be seen as a major rallying point for a new open-source ecosystem in the U.S. Nvidia released their first flagship model last week — Nemotron 3 Ultra — and these actions from Anthropic have galvanized a unanimous motivation and concern among my peers building open models. We need intelligence that we can trust, that we can modify, and that we can control.
    The American open-source ecosystem has its feet underneath it and keeps being given more reasons to fight for its leadership, right from the hands of the companies it directly undercuts. That’s the moral of this fable.


  • Interconnects

    Farewell Ai2

    02/06/2026 | 15min
    I’m departing the Allen Institute for AI (Ai2), where I got the great privilege to work on the Olmo models, to grow, to learn, and to have broad lasting impacts. This post is an attempt to reflect on why what we did was influential, despite obviously being far from the frontier in performance (even when within size buckets), and how this reflects on various paths to impact in AI today.
    To start, I shared the following note with the company yesterday:
    Dear Ai2.
    As many of you know, today is my last day working at Ai2.
    I joined Ai2 largely as an accident. I met Luca at ICML 2023 in Hawaii and realized I could level up my open post-training work dramatically if I got the chance to join. When I got an offer it was an absolute no-brainer, it was such a welcoming and exciting environment.
    It has been a wonderful ride that has transformed my life, and I couldn’t be prouder of the work we did together. Ai2 has a wonderful scientific culture at its core and I’m excited to see this continue. I feel very lucky to have been here and that I personally have benefited massively from everyone who has worked so hard to cultivate that culture and environment. It is and has been a team effort. This includes all the people whose longest interactions with me were brief chats at the coffee machine. I drew so much energy and excitement from all the different ways people at Ai2 showed up for the mission.
    I’ve already thanked much of the OE team directly, but I wanted to thank everyone else that went into this. Legal, IT, Comms, and the Office team all do a great job enabling and leveling up our research work. It’s often work that is forgotten, outside of the limelight, or remembered at the last minute, but it all has been crucial to achieving our goals. I’m excited to keep visiting the wonderful Northlake space in the coming years.
    Even though I’m leaving, I’m more excited than ever about Ai2’s mission. Ai2 operates in such a rare niche between academia and industry, where we can explore and influence the most important technology of our lifetime. Doing this openly is the best way to ensure the technology diffuses safely to everyone who may benefit. Ai2 needs to stay as ambitious as possible, trying to influence the cutting edge of AI and the biggest issues of the field. Do not shy away from these challenges – AI needs independent voices as it only becomes more geopolitical, socially disruptive, and central to the economy.
    I will still be working in this space, working to make the open ecosystem better coordinated and more useful.
    So as I go off to try something new, don’t be strangers. I’ll always be reachable at nathan@natolambert.com and will still live in Seattle for most of the year.
    Nathan
    I have loved and will still love Ai2. Ai2 has a deep culture of caring about the research process, the outputs that get shared, and most importantly the people who do the work. This is why the institution creates countless wonderful people that go and spread the gospel throughout the research community. This core culture will remain through the rebuild, and there are plenty of resources to do impactful research across the spectrum of AI.
    In the last two years of my time at Ai2 I’ve done so much meaningful work. Of course Olmo is at the top and has been my priority, but making time for consistent practice here on Interconnects, weekend cram sessions for ATOM, and also the fun RLHF book make for a list that makes me wonder how I did it all. I was obviously obsessed with work, but not in a way that made me lose sleep or lose my overall wellness. It was the right long-term approach.
    This impressive list is one where I was ruthless in saying no to things that didn’t matter and got all my work out to see the light of day. I had no medium-sized projects that didn’t succeed in the last few years. It makes me wonder if I wasn’t taking enough risk. It shows you can truly do so much with your time, and it’s actually harder to find the right problems and environment to do it. Many people are in environments where their work never becomes public or they’re forced to change topics consistently.
    From zero to hero
    To start, I’d like to do a short recap of my path to Ai2 to show that Ai2 was just as much a growth story for me as an execution story.
    I studied electrical engineering in undergrad, focusing on linear systems math and microelectronics.
    I was admitted to the UC Berkeley EECS Ph.D. program to study microelectromechanical systems (MEMS).
    I showed up at Berkeley in August of 2017 and realized AI was obviously the thing I should be doing. I asked the likes of Sergey Levine or Pieter Abbeel if they could advise me – they said no.
    I threw all my energy into learning what I could about AI. I got a break to get advised by one of Sergey’s post-docs in 2018 or 2019. I went all in on that, I fought for funding, I fought to have an AI paper.
    This process worked out by the end of my Ph.D. in 2022: I had access to the Berkeley AI Research (BAIR) building and collaborations in the department. It was a bumpy road.
    I wanted to go to industry research, to get a nice paying job with intellectual freedom, something like FAIR or Google Brain at the time. HuggingFace was the only job that fit that bill, it was easy to say yes to.
    I joined HuggingFace in May of 2022 and wasted my time at the company until ChatGPT was released. I used my RL background to write a blog post on RLHF which went viral. HuggingFace decided it would be good for me to form a team around this success.
    In 2023 I learned NLP and about language models. I had a lot of fun and built an initial community. I got burned out by working remote with a huge time difference. I met Luca Soldaini at ICML in Hawaii, where I was giving a tutorial on RLHF, and they told me Ai2 was hiring.
    I got the job at Ai2 largely because of my excitement and how I was saying I wanted to do a lot of stuff that sounded cool to them but no one was likely to do (RL related things). My interviews were far from a sure thing – this is a great job to land!
    I started at Ai2 in October of 2023. I worked remotely for a while. I was doing normal research, I made the first reward model evaluation, RewardBench. It was a solid success, but nothing like how the pretraining team was getting ready to release the first Olmo.
    I helped coach Ai2 on how to release models well, helping the Tülu 2 project land (the first model to do DPO well, publicly at the 70B scale).
    The first Olmo was released in early 2024, I squeaked onto the papers just by trying to be helpful and doing some basic post-training. I was already good at paying attention to which projects are actually important.
    That summer I started rounding everyone up to do a “big frontier post-training project.” This became Tülu 3, one of my favorite projects ever released, in fall of 2024. The goal was to beat Llama 3’s post-training with their own base model. The team morale was incredibly high and the execution was so timely, allowing us to coin the term Reinforcement Learning with Verifiable Rewards (RLVR) in the paper.
    The crazy lengths I went to in order to get the Tülu 3 and Olmo 2 post-training done had me sending 40% more Slack messages than anyone at the company and got me the award “The Cat Herder.”
    2025 was a much simpler year. We were too slow to react to reasoning models, given we had been doing similar stuff with Tülu 3, but sometimes that happens.
    Originally we wanted to release Olmo 3 by June or July of 2025. That obviously didn’t happen, but we got the slim chance to train a bigger model, and it really landed. We threaded the needle.
    Since Olmo 3 was released, it was clear that some changes were coming and I personally never got a big post-training project off the ground after that. Many other people managed great work in the spring of 2026.
    This all leaves me here today showing you that only about half of my story at Ai2 is what I was known widely for, and the rest was building momentum. It often takes a year of building relationships and direction before really big successes can happen in a career.
    I was just about a nobody when I joined Ai2 and I got to join a team that was willing to learn from the skills I had brought from HuggingFace. With how media works, I often think I get more recognition than I deserve for Ai2’s success.
    The likes of Tülu 3, Olmo 2, and Olmo 3 felt like generational team efforts. The amount of personal successes and breakthroughs that happened for those projects is immense – and to sustain them over such a long time period is incredibly hard to replicate. The sum far exceeded the individual parts.
    I’ve heard many times in the last few months how people wouldn’t know about Ai2 if it wasn’t for my writing. Statements like this are overblown, but they are partially true and reiterate how crucial building relationships and getting the word out is today.
    When you write a plan that is feasible, the world bends towards that plan. When you convince people it’s going to happen it only becomes more likely. Vision and compelling explanations are among the items in shortest supply in the tech industry. Often building the thing is easy and explaining it is hard. If no one knows about your work, the value is often close to 0. So much of building reputation is about building relationships with people who will receive your work.
    Reflecting on all of this, I’ve had a shockingly linear path through my career to incremental success. I would expect the first 10 years of most careers to be in search of finding one opportunity as good as Ai2, and you will not always be able to seize it. There are some ways to create more opportunities.
    I’ve discussed before how a large part of my rise is down to many more senior and more established scientists being drawn into the closed ecosystems at the same time as an immense swell in interest for AI. This created a power vacuum that I, and a few other prominent scientists that I think form my “generation”, got to grow rapidly into.
    The role of public scientists
    With my work at Ai2 and Interconnects, I summarize my role and mission as trying to accomplish three things:
    * Provide clarity in the evolution of frontier models. This is easiest when the science has caught up, but even applying a scientific lens to how the models are changing is very useful to building trust in the broader AI ecosystem.
    * Create a vibrant and diverse open (model) ecosystem. This is crucial to mitigating some risks of AI, particularly concentration of power and myopia in studying frontier safety, which have motivated me for 3-4 years now. The risks haven’t abated.
    * Build institutions that create people and ideas that further the above missions, and generally mission-driven individuals that are willing to advocate for and build a future they believe in. AI is a grand problem, and not one that I can solve alone, so I need to build brands to rise through the noise and attract like-minded people.
    At my best, I have many avenues for impact. I help open researchers work on impactful problems – not wasting the precious compute and time they have during the AI boom. I help policymakers know what is true. I build models that people use. I tell stories that make people smile. I keep the list wide so that I can stay motivated.
    I see all of this continuing, and have been thinking about the broader impacts of this repeatedly over the last few months. Hearing that Andrej Karpathy was joining Anthropic prompted me to finally share more of my opinions:
    For a long time, academic researchers being at the cutting edge of new technologies has been a great social equilibrium. Neutral, unbiased technologists have been the people to spread new ideas to the world.
    As AI research takes off in velocity, it is also going behind closed doors. The tech industry has sowed distrust, and now they are the ones trying to tell the world about incredible changes coming. It’s a big loss to a form of social contract in America.
    There’s been a history of scientists helping society understand new technologies. There is a public service in the culture of science that I want to see continue.
    It’s being exacerbated by feelings of FOMO, especially financially driven, where I’m seeing many people who previously wanted to be professors -- and likely still do deep down -- feel a need to conform and chase money, in a pocket of industry. I get it, I grapple with this.
    For those with a safety net, there will be great returns to some who choose to zag, and try to build something good, for people who need something different. For me, this is building interesting, fully-open models, to show what you can do with a variety of open weight sizes.
    Yes, AI’s immediate future is dictated by the frontier, but its long-term trajectory still deeply includes academic institutions and open science. Knowledge will always diffuse, but to whom?
    As of today, I think China is positioned to be the global home of AI research in a few years. The home of research is where ideas are accessible, spread rapidly, and are nurtured. The U.S. seems to be unwinding many institutions and relationships.
    The largest returns go to people who build something differentiated, at least in reputation, and a lot of people are not being shown that this path exists.
    To elaborate on this, I don’t fault any of the individuals who are going to industry today. I’ve been very close to doing this myself in the past weeks of job searching, or rather job exploring. It’s a systemic problem where scientists cannot easily get the support to take bold stances, especially stances that are designed around the public good.
    To go a step further and say that only the research within closed, frontier labs matters is very myopic. Yes, there’s a sort of research you can only do with vast compute resources, and it will directly impact the most revolutionary tools of the day. But I see the relative opportunity to do good elsewhere as higher for plenty of people.
    Open research will always be the standard that sets the language people use to understand AI. It’ll always be how the next generation is trained – even if it’s behind what industry has built. It’ll be the ecosystem where new long-shot ideas are built. Without investing in this open ecosystem, all of these cycles will be kneecapped.
    At the end of the day, so much of my role now is just showing the path to impact in this domain. To show how clever, mid-sized open models can impact real problems in the world. To show how policy-makers and educators need open research to structure the rest of society around AI. This is a fun role too! It would be very sad for me to see this light diminish ever further, into the lightest embers of a fire that looks almost entirely out.
    Even if the pace of research were to slow further, if the folks remaining like myself got financial offers they can’t refuse for their families’ sake, the torch of open research will never fully go out. It’s core to how science is taught and done. There is a next generation coming; they are just looking for guidance and role models.
    What’s next
    I see the best Ai2 work as research infrastructure. Building recipes in public gives countless researchers the ability to ask very specific questions of training processes. We need these researchers in the broader community, as Ai2 could never answer all the interesting questions themselves. One of my great joys in recent months has been visiting a top ML university and hearing so many graduate students say they’re building on Olmo. This is how the world should work!
    Going forward, I still plan to operate in similar spaces, fighting for open science, imagining what the future of the open model ecosystem can be, and doing my best to make the social transition to an AI-native era smooth. I’m most excited by how you can train medium-sized open models on specific tasks so that they become useful tools that complement the frontier models while massively winning on price. I want to invest in the ecological diversity of open models and coordination across builders.
    Unsurprisingly given my past focus areas, I’m watching the pace of releases from all labs, open and closed, and how they’re hillclimbing on super ripe new post-training veins (on-policy distillation, agentic workflows, etc.). It’s clear that fully-open post-training recipes are about as far behind as they ever have been and are falling further behind. I’d like to fix this. It’s not 100% clear yet if I will this year, but I’ll try.
    To do this best and to execute, mostly personally, I needed a new start and fresh perspectives. I’ll be carefully building what I’m doing next over the next few months and am eager to share more about it when I can. One of my close teammates at Ai2 shared this quote with me in a farewell card, and I found it very apt for where I’m going next.
    The object of life is not to be on the side of the majority, but to escape finding oneself in the ranks of the insane. — Marcus Aurelius
    Thank you all for your continued support.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
  • Interconnects

    Open and closed models are on different exponentials

    01/06/2026 | 7min
    The largest debate that’ll define the future balance of power between the open and closed AI model ecosystems is primarily economic: whether users of AI will continue to pay dramatically more, i.e. large margins, for the top closed models. Early 2026 is a seminal time for the AI industry, as coding agents have shown the first area where a huge AI market will continue to pay a substantial premium for better intelligence.
    The other side of this dichotomy is the inevitable decay of API businesses at these same labs. These labs will realize they need to protect their best models, rolling them out later in APIs to protect token supply, avoid distillation, and stick to use-cases with higher margins. All of these effects will be clearly visible on 5-10 year timelines, because in the near term, markets, prices, margins, and demand will be dictated by a rapid buildout of compute (supply-limited in the near term) and mass subsidization of tokens (through continued investment in new AI companies).
    The core of this argument rests on the obvious habit changes that are setting in with coding agents past the Opus 4.5 and Codex 5.2 thresholds. People are not making this switch because they are lazy, but because their net output is obviously higher when using an agent as an implementation aid for complex knowledge work. People who rely on coding agents to work will always pay more for the best rather than settle for good enough. There are so many ways to make the product better: speed, intelligence, specialized models, etc.
    I would pay $2000/month for the tools today, especially knowing they’ll get much better. At the same time, it is likely that many companies are forcing agents and usage onto people that actually will get very little out of them in their current form, which helps the AI buildout (or bubble) continue.
    The best closed labs (right now this list is just Anthropic and OpenAI, but it’s reasonable to expect Google to catch up) will always make the most efficient models for intelligence at a given cost. Building models is a mass capital investment of talent, data, and compute. These systems, a combination of model weights, harnesses, tools, and serving infrastructure, have massive returns on integration (where open models are designed to work across many, diverse serving situations). These integration benefits, the integration of hardware and new forms of software, can be expressed in any possible way of making models better.
    The models in the near future may saturate on benchmark scores, but if that intelligence ceiling really is a cap on utility then the labs will optimize utility per second or per watt, serving users in another way. Improving the models is possible in every direction — there have been no walls in progress. We’re early in the mass buildout of intelligence, which involves harnessing the physical world to build numerous datacenters, organizing many AI researchers so that a large team can contribute to one model, and of course solving many small, low-level puzzles that unlock performance. Every indication is that there is still meaningful performance to be unlocked and the closed labs are the best set up to extract it.
    The collective wisdom of the labs is that making the models smarter, in terms of the frontier of absolute intelligence, has the most value. This is the right call to me because it unlocks large new markets. Optimizing models at a fixed intelligence level locks in markets, expands accessibility over time, and increases return on investment for users (while potentially lowering margins for selling intelligence).
    Many people are making this bet that models will keep getting better and are learning to work well in these harnesses, even though some workflows are still a bit clunky. This is the right bet. These people will all continue to use the absolute best models available. It’s like buying an iPhone as a consumer. You could get an Android and suffer from a bunch of paper cuts to save money, but why would you? The returns to performance are even higher in the workplace, which drives pricing power.
    In this mental model, the frontier labs, as businesses, will look like new, reimagined forms of a mix of Apple and Microsoft. The Apple side is that they’re selling an integrated, extremely hard to replicate technology. The Microsoft side is selling high-leverage subscriptions across the economy. In 5-10 years I expect both OpenAI and Anthropic to be valued in the $2-10T range. The true frontier labs will be an oligopoly that looks like the cloud market today.
    Interconnects AI is a reader-supported publication. Consider becoming a subscriber.

    On the other side of this equation is the open model economy. This isn’t to say that the frontier labs will dominate all aspects of AI use. Yes, I expect OpenAI and Anthropic to be the most representative companies of the AI boom (new companies, alongside Nvidia of course), but the collective value capture around open models will be far bigger overall; it’s just that the revenue and margins will be shared across a wide stack of companies.
    Many businesses want to switch to open models but the models today are not good enough in out-of-distribution tasks. Eventually open model builders will stop chasing Claude and GPT on the Artificial Analysis index and fill this niche. This fork could be driven by economic factors, where they no longer have the revenue to support the growing R&D costs for continuing to scale models. It can also be driven by pure demand, where certain AI solutions can only exist at the low price points present in open models. Where closed labs are an oligopoly, open model builders and users will be far more diverse and numerous. The total market value will dramatically exceed the cumulative value of OpenAI and Anthropic.
    Open models are by their nature not integrated, so they will rely on multiple companies coordinating to serve them. Each of these layers will have alternatives, driving prices down to commodity pricing. These low, predictable prices will be where many enterprises enter to build in-house agents and tools for niche tasks. The predominant mode of deployment here is that an enterprise finds a model that hits a sufficient performance threshold on a task of interest and does not replace the model later (setup costs are high). As customizing models becomes easier, as we are again seeing in the emerging open model finetuning stack (Tinker, Fireworks, Prime Intellect, etc.), this market becomes even bigger.
    What this will look like in the coming years is a steady rise in the proportion of open model inference across the entrenched hyper-scale clouds of Google, Amazon, and Microsoft and the new AI infrastructure companies such as Together, Fireworks, and OpenRouter, when compared to OpenAI and Anthropic.
    The key is that the open and closed model economies are operating on different exponentials. I still believe that progress will continue at a fast pace across the entire ecosystem, but claims of recursive self improvement (RSI) giving the closed labs an unassailable advantage are overblown. New forms of products like background agents can support both these open and closed models.
    The closed models hit incredible product-market fit with the current agents, starting their integrated exponential by monetizing the top end of the knowledge work. The open model economy will take far longer, but it will also be far more satisfying to follow, as it tracks the broader diffusion of AI into the entire economy and world.


  • Interconnects

    Some ideas for what comes next, May 2026

    26/05/2026 | 9min
    As the years of AI progress go by, they’ve been accompanied by a slowly rising tide of consequence. Models are getting more capable, how we work is changing quickly, and the economics of AI are becoming real, just as real-world risks come to the forefront. 2026 is the first year where I don’t think there’ll be any breaks from this. The hard part to prepare for is that there’s a good chance things just continue to ratchet up from here – more disruption, more surprises, more stakes.
    On my end, there’s been a growing list of topics that are very fateful to how I see the current state of AI, but I haven’t even gotten to write about them (at least not from all the angles I want to)! All of these are closely related to the implications of different models reaching new capability levels and how I use that to infer what may come next.
    1. Open models haven’t had their true agent moment like Opus 4.5
    The time gap between open and closed models is very often discussed, but the reality is that we have a nice time-gating that’s independent of debatable benchmarks: whether open-weight models become super useful in agentic harnesses. The Opus 4.5 in Claude Code moment of December 2025 was so loud and obvious that if open models hit this performance level for price points as low as $5/month, there will be an explosion in usage.
    Right now we are about 5-6 months in with no equivalent open model. I suspect the robustness of the best closed frontier models that I write about could make this moment take a good amount longer, say closer to 12+ months. In this time, Claude Code and Codex may seem like different categories of products. In the standard flurry of new, state-of-the-art open models from a variety of labs, benchmarks will definitely keep climbing, but the open-closed gap should become more interpretable as real-world use becomes the real litmus test.
    2. Gemini still doesn’t have a meaningful competitor for Claude Code and Codex
    The best exclamation point I can offer to reinforce my prediction that open models are further behind than the benchmarks claim is that even the mighty Google doesn’t have a clear competitor for Claude Code and Codex. I’m sure the Gemini team is pushing very hard on this.
    I still need to do a lot more testing on Gemini 3.5 Flash, but reading reviews makes it clear that it’s not a substitute for how I’m working today. The Gemini team may not be explicitly specializing for Google’s existing products (search, YouTube, etc.), but the model seems to suit them. If Google doesn’t have a powerful tool here soon, I don’t expect the open model labs to either. The open models are going to be used more for automated, enterprise agents and low-cost domains, rather than being the driving tool of modern knowledge work. This will feed directly into the economic engine of funding future models, where the agents like Claude Code and Codex are the current best path to massive AI revenue growth.
    On AI Proem with Grace Shao, I discussed how the current environment is quietly driving labs in China to specialize. This is central to my expectation that open models will specialize over the next few years instead of competing with OpenAI, Anthropic, and Google.

    3. I don’t expect an open-weights Mythos this year
    While I don’t think Mythos is a general “god model” that will crush the competition in every domain, I do think it’s a remarkable technical achievement in software engineering and cybersecurity. Mythos is obviously a watershed moment for those fields. Having spoken to most of the Chinese labs – particularly those with the most prominent, large, open MoE models like Kimi, Z.ai, DeepSeek, and Qwen – I think they’re heavily resource limited and don’t have an immediate path to scaling up training processes like the big labs in the U.S. The more corporate labs, such as Alibaba and ByteDance, come with more resources but also have more conservative stances on safety and security. Mythos is a bellwether of the massive acceleration in training and research compute available to the largest American companies.
    Epoch AI recently had a nice piece on the compute available to various labs (~Google 25%, Meta 11%, OpenAI 11%, Anthropic 6%). All of these numbers are vastly higher than any Chinese lab.
    4. American open models are slowly gaining steam
    Nvidia with Nemotron, Google with Gemma, Arcee AI, and others are slowly stabilizing the open model ecosystem in the U.S. There’s a lot that’s hard to measure here, especially in the rise of local agents like OpenClaw and Hermes, but there are adoption numbers of American models that we haven’t seen since Llama 3. Gemma 4’s models are all tying or outperforming the equivalently sized Qwen 3.5/3.6 models, where Qwen has for years now been the default open model at these sizes. These Qwen 3.5/3.6 models have been tricky to get working in a lot of post-training research, partially due to architecture/tooling and partially, it seems likely, due to modeling (i.e. some training decision makes the model hard to finetune). I’ve heard few complaints about Gemma, but that could also be because Gemma is not yet the researcher default.
    There's a simple reality that we've seen recently with models like GPT-OSS, Nemotron 3, and now Gemma 4, that if a model is in the right range of benchmarks and released by an American lab with a truly permissive license, it'll get a large amount of adoption (in this cycle, recall that Gemma 4 adopted the Apache 2.0 License, changing from one with use-case restrictions on earlier Gemmas). This early phase of American growth in open models is establishing key brands directly with developers. The consensus is that more neolabs like Reflection and Thinking Machines are likely to participate in this space, but being too patient will lose the time when new agentic workflows and enterprise relationships are built.
    5. Anthropic and OpenAI are just getting up to speed in model iterations
    I expect the rest of this year to be a ruthless competition between these two flagship companies. I’m at an interesting balance where I think GPT 5.5 is a bit smarter of a model and I love the Codex App, so I’m structuring much of my work to be possible there. At the same time, for a lot of writing-related and broader surface area tasks I really still love Claude. These models are rapidly changing how we work: I run Codex from my phone while doing other things, am setting up automated open model analysis jobs on the back of agents, and expect to be able to scale the research side of Interconnects widely.
    AI is beginning to drive companies to the two extremes in the scaling era. The biggest companies will be way bigger than ever, using resources and mass talent to have sustained progress at the frontier of raw AI capabilities. On the other side, tiny businesses like Interconnects thrive by using agents to refine, present, and sell niche expertise. The mass social job displacement that’ll come is going to reduce employability for various knowledge workers that don’t fit into either of these extremes for the raw technical side (big or small companies), while sustaining and maybe even amplifying careers that interface directly with humans (e.g. doctors) or other power structures with means to sustain themselves (law/government).
    6. More existing power structures will assert themselves on AI
    Just in the last few days while writing this, the Pope released a document of over 40,000 words on where AI is going, and China expanded personnel movement restrictions on top AI researchers across industry. At the same time, the U.S. has designated Anthropic a supply chain risk and continues to use its models for national security. The list of news like this is only going to grow. Existing power structures are realizing there’s a finite time window for them to assert themselves in the AI dynamic, an intuition that could be mapped to influence going down as AI models get more powerful. This intuition is potentially dangerous, as it sets up meaningful conflict over who controls the technology (as I discussed with Dean Ball after the Anthropic-DoW spat).
    Next: Where technical becomes social
    The acceleration of these largely technical and power trends is going to put more pressure on the social and political anti-AI sentiments within the U.S. This is currently the most obvious barrier to continued AI development and beneficial diffusion. Reflecting on this, many people in the tech discourse get too focused on the details, where, yes, a lot of data-center detractors are making genuinely wrong factual claims in defense of their position.
    The real position that a large swath of Americans has is that they have a voice in saying no to the current trend — by not granting permission to build data centers. This is a voice that they haven’t been granted by the tech industry that changed the face of the global economy and power structures in the last few decades.
    This is setting us up for a challenging year ahead for the industry. The labs are aggregating and concentrating talent to peak levels. There are few neutral messengers to communicate the reality of AI to the public. The frontier labs’ leadership is largely gearing up to IPO and stay ahead in the capabilities race. With the status quo, there are few actions to unwind this path toward social conflict.
    It takes individuals in the AI ecosystem to zag and go against the groupthink of needing to make your wealth today, of needing to be at a lab to do impactful work, and so on. I’m personally continuing to bet on this, by trying to make a vibrant and diverse open model ecosystem supported by clear, unbiased information. If you agree with this and have been watching from the sidelines, it’s a good time to get involved, before the situation spirals into something uncontrollable.


About Interconnects
Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories. www.interconnects.ai
v8.10.0| © 2007-2026 radio.de GmbH
Generated: 6/16/2026 - 11:28:24 PM