In mid-January, I made a few predictions about AI developments in 2025. In short, I emphasized that this year would likely focus on so-called AI agents, and that the big question would be how “thinking” models like OpenAI’s o1 would continue to evolve. I was skeptical that we would see agents this year that could handle a wide variety of tasks with extensive autonomy. My reasoning was twofold: first, the inference costs (i.e., the “thinking phase” before the final response) of such models were considerable (or at least they used to be ...), and second, granting a lot of autonomy could pose significant safety and security risks.
Well, looking back, the timing of my predictions turned out to be quite interesting—just a few days later, and I wouldn’t have dared to stick my neck out so far. Today, I’d like to briefly discuss what happened over the past few weeks and how it has changed my perspective. I strongly believe that regularly recording concrete expectations is valuable for holding ourselves accountable and avoiding the self-deception that often occurs when we unconsciously keep moving the goalposts.
Just one day after my post, the very integration I had speculated about—AI agents tailored to specific Office products—was introduced. Actually, I had assumed we would stay on this plateau for a while, simply because it’s easy money for companies like Microsoft and for OpenAI as a partner. But as the “Stargate” project presented a week later (together with Donald Trump) shows, industry confidence is currently at a whole different level: the conviction that we are quite close to AGI (Artificial General Intelligence) is being communicated increasingly boldly. For now, Microsoft is just a temporary partner, as long as language models can’t replace entire jobs or even entire companies—but supposedly, that will soon change.
Indeed, intense efforts are underway. I openly admit that I did not expect to see such progress in the month following my blog post. For instance, on January 23, OpenAI made “Operator” accessible to Pro users. Operator is an agent that can browse the internet autonomously and complete tasks online. People still have to intervene occasionally, and not everything works smoothly yet, but it’s safe to assume that before long, we’ll have AI agents capable of autonomously—and relatively reliably—purchasing items for us and booking our travel. The question is: Have we now entered a realm where human jobs are threatened, or will this simply save us a bit of time on our smartphones?
This is precisely where we should address the elephant in the room, the one that’s been there since the first line of this post: DeepSeek-R1, a language model from China made publicly accessible on January 20, which also simulates a longer chain of thought before responding—just like OpenAI’s o1 and o3. This development has caused a real shockwave—both in the industry and in public discourse. Why? Initially, it was said that R1 was at least on par with OpenAI’s o1, if not better, despite being trained at a fraction of the cost. What’s more, it’s freely available to users directly on the website. At that time, OpenAI was still charging $200 for a Pro subscription to work with such a model. But the cost revolution doesn’t stop there: if you don’t just want to chat with the model but want to integrate it into your own software via an API, you pay by the input and output tokens. For R1, those costs are now 95% lower than for o1! This factor is highly relevant for potential use in business—even if R1 is less efficient than o1 and therefore consumes more “thinking” tokens. Another reason R1 has made such a splash is that its inference process has been made visible to users (in somewhat polished form, of course). It’s just incredibly compelling to watch the model wrestle with finding the right answer—one can hardly help empathizing with it, much as one might when watching a student struggle during an oral exam.
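To make the pricing argument concrete, here is a minimal back-of-the-envelope sketch in Python. All prices and token counts are purely hypothetical placeholders rather than any provider’s actual price list; the only point is that a steep per-token discount can outweigh a less efficient inference process that emits several times as many “thinking” tokens.

```python
# Back-of-the-envelope comparison of API costs for two reasoning models.
# All prices and token counts below are hypothetical placeholders.

def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost of one request in USD, given prices per million tokens."""
    return (input_tokens / 1_000_000) * price_in_per_m + \
           (output_tokens / 1_000_000) * price_out_per_m

# Hypothetical scenario: model B charges ~95% less per token than model A,
# but "thinks" longer and therefore produces three times as many output tokens.
cost_a = request_cost(2_000, 4_000,  price_in_per_m=15.0, price_out_per_m=60.0)
cost_b = request_cost(2_000, 12_000, price_in_per_m=0.75, price_out_per_m=3.0)

print(f"Model A: ${cost_a:.3f} per request")      # about $0.27
print(f"Model B: ${cost_b:.3f} per request")      # about $0.04
print(f"B is {1 - cost_b / cost_a:.0%} cheaper")  # still roughly 86% cheaper
```

In other words, even if the cheaper model burns through three times as many tokens, the bill per request in this toy example still drops by a factor of about seven.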
Is the hype justified?
First, let’s clarify that reasoning models have definitely brought about a breakthrough. Anyone who hasn’t yet engaged with them can get a sense of how they differ from classic language models by trying the following experiment: Ask an “older” language model like GPT-4 how many trips are necessary for a farmer and a goat to cross a river in a boat that can carry two living beings. Very often, it will give you the wrong number or even start talking about a wolf that wasn’t mentioned in the prompt. The simple reason is that, in its training data, this scenario usually appears in the form of a classic riddle, and the language model—merely predicting the statistically most likely next words—has extracted patterns that make it guess incorrectly. If you try the same with R1, you’ll be impressed to see that it doesn’t just spit out a (wrong) answer immediately; it explores various lines of reasoning to avoid errors before ultimately arriving at the correct result with confidence.
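If you would like to reproduce this little experiment programmatically rather than in the chat window, a minimal sketch could look like the following. It assumes DeepSeek’s OpenAI-compatible endpoint, the deepseek-reasoner model name for R1, and a reasoning_content field carrying the visible chain of thought, as documented at the time of writing; treat these names as assumptions and check the current documentation before relying on them.

```python
# Minimal sketch: ask DeepSeek-R1 the farmer-and-goat question via its
# OpenAI-compatible API. Base URL, model name, and the `reasoning_content`
# field reflect DeepSeek's documentation at the time of writing and may change.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",   # placeholder, not a real key
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-reasoner",  # R1
    messages=[{
        "role": "user",
        "content": ("A farmer wants to cross a river with a goat. The boat "
                    "can carry two living beings. How many trips are needed?"),
    }],
)

message = response.choices[0].message
# The simulated chain of thought is returned separately from the final answer.
print("Chain of thought:\n", getattr(message, "reasoning_content", "(not exposed)"))
print("Final answer:\n", message.content)
```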
Given how hard it is not to anthropomorphize these models (“it thinks!”) or empathize with them (“you can do it!”), it’s important to note that this has not changed the fundamental question of whether machines can be conscious. We’re still working with a computer metaphor for the brain, from which we’ve reverse-engineered powerful software based on the idea of neural networks. But that certainly doesn’t say anything about whether these processes should be interpreted as “thinking,” let alone whether they shed light on human thinking or consciousness. The new Nobel laureate and AI pioneer Geoffrey Hinton, for example, commits a surprisingly simplistic fallacy in a recent interview: asked whether AI systems might “think” in a way analogous to our own cognition, he essentially replies (fairly paraphrased): “Yes. Because these models are the best model of how we think.” Considering how in recent years philosophy of mind has found the “hard problem of consciousness” so intractable that panpsychism has enjoyed a surprising resurgence, such naïve statements are rather astonishing—yet they pop up as soon as AI is mentioned.
With that side note on how misleading it can be to conceptualize the underlying processes, let’s return to R1 and ask how good it really is. Without question, it produces texts that we’d normally associate with fairly high human intelligence. The only issue is what kind of human thinking it is simulating or could replace.
So, how good is R1?
Answer: Not quite as impressive as initial media reports suggested. I was skeptical of those claims right away because the visible “thought protocols” initially reminded me of what you might get if you forced a model like GPT-3.5 to provide a chain of thought—i.e., prompted it to respond step by step. But part of the issue might also have been my expectations: it doesn’t actually take that much to avoid certain logical mistakes. A year ago, I might have claimed that large language models would have difficulty producing complex reasoning without integrating them with other logic-based AI systems. But it seems a lot of the errors language models make can be avoided with relatively simple tricks. In other words, you don’t need a lot of oversight between the separate steps in the chain of thought to stop the answer from veering off track. The inference process of R1 does indeed evoke older chain-of-thought experiments, precisely because each step only requires minimal nudging in the right direction. This also gives us a much better idea of what’s likely going on inside OpenAI’s o1. It does have a demystifying effect—I had imagined it was more complicated.
What’s particularly noteworthy about how R1 was trained is that it relied almost exclusively on reinforcement learning (aside from a later fine-tuning step), with no human examples of logically coherent reasoning. That means: if, in its chain of thought, the model took a route that led to the right answer, that route was positively reinforced. As a result, it developed strategies on its own that work across a wide range of queries! Nobody explicitly taught it: “Pay attention to possible mix-ups in names.” It “realized” on its own that mixing up names often leads to wrong answers.
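To give a rough idea of what “positively reinforced” means in practice: according to DeepSeek’s own description, the reward is largely rule-based and scores only the outcome (is the final answer verifiably correct, is the output well-formed?) rather than grading individual reasoning steps. The following is a deliberately simplified illustration of such an outcome-based reward, not DeepSeek’s actual training code; the tag format and the weights are my own assumptions.

```python
# Highly simplified sketch of an outcome-based (rule-based) reward, in the
# spirit of R1's training as publicly described. Tag format and weights are
# illustrative assumptions, not DeepSeek's actual implementation.
import re

def reward(model_output: str, reference_answer: str) -> float:
    score = 0.0

    # Format reward: the model is expected to wrap its reasoning and answer
    # in dedicated tags so the final answer can be checked automatically.
    well_formed = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>",
                            model_output, flags=re.S)
    if well_formed:
        score += 0.1  # small bonus for sticking to the required format

        # Accuracy reward: only the final answer is compared; no human
        # grades the individual reasoning steps in between.
        if well_formed.group(1).strip() == reference_answer.strip():
            score += 1.0

    return score

# A trajectory whose chain of thought happens to end in the right answer
# receives a higher reward and is therefore reinforced.
good = "<think>Goat and farmer fit in the boat together ...</think><answer>1</answer>"
bad  = "<think>First the wolf, then ...</think><answer>7</answer>"
print(reward(good, "1"), reward(bad, "1"))   # -> 1.1 0.1
```

Everything else, including the strategies the model develops to earn that reward, emerges from the optimization itself.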
At this point, it bears emphasizing that these strategies don’t always lead to the correct result. Personally, I quickly noticed that R1 gets confused fairly easily when engaged in more advanced logic. Others have shown that when it comes to meta-linguistic explanations, R1 clearly trails behind o1. I was somewhat relieved to see that other reviewers also concluded the hype around R1 was overstated; for a moment, I’d begun to doubt my own judgment (and given the nasty cold I was dealing with, my head wasn’t exactly clear).
[Update from 02/14/2025: At this point, here comes a somewhat confusing update that you’re welcome to skip. However, it also demonstrates how quickly, and how confusingly, things are developing right now. That’s why I’m making the post’s revision process completely transparent here. Let’s start with how I initially understood OpenAI’s first reasoning model.
At the beginning, I thought the most likely explanation for o1’s success was that tokens weren’t generated purely sequentially, but rather—put simply—the space of possibilities was being searched in parallel processes for particularly “well-formed” probability trees. That assumption partly stemmed from a lack of imagination on my part. I couldn’t conceive of how the entire system (at the time, I was only able to test o1-preview and had to trust official reports about o1) could work so effectively with just a single, linearly produced chain of thought.
I wasn’t alone in these suspicions; OpenAI fueled them by being tight-lipped about the details. Yet we probably should have been more skeptical, because it also seems intuitively questionable whether algorithms like MCTS would be effective given the vast array of possibilities in natural language. I definitely noticed o1-preview’s issues with rhyming. That seemed only partially compatible with the idea of parallel processes.
The fact that R1 still performed relatively well with just one simple chain of thought apparently convinced many experts that OpenAI, too, was “just cooking with water” (that is, not doing anything fundamentally more sophisticated under the hood). I ended up being convinced as well, as was evident in the first version of this post. My own experiments via the API contributed to this shift: with many tasks, o1 turned out to be fairly close to R1 in terms of the number of “thinking tokens.” Before that, I had only been able to use o1-preview, where the difference was large enough to suggest different mechanisms might be at work.
After reading this post, however, André Oksas pointed me to Noam Brown’s work, which I only knew superficially. Based on that work, he concluded that OpenAI’s reasoning models weren’t just trained differently from DeepSeek-R1 but also worked differently. Accordingly, o1 actually includes an additional search algorithm, undisclosed in its specifics, that allows for a particularly targeted inference process.
I picked up that tip here in an update and highlighted what it would imply: on the one hand, it would explain the qualitative difference between R1 and o1; on the other, it would underscore how impressive it is that a comparatively simple chain of thought can come so close to the performance of OpenAI’s inference process, which remains opaque in its details.
In another note, André Oksas then informed me that a new OpenAI publication explicitly mentions a single chain of thought. It seems that o3, in particular, exhibits astonishing emergent mechanisms for correcting its own responses, although these processes are still carried out sequentially. Another absolutely astonishing aspect of this new model is how proficient o3 is at programming—significantly better, in fact, than a specialized o1 variant that was fine-tuned for coding. Once again, superior general “intelligence” trumps highly specialized applications.
But back to the main text. Long story short: The way my post was originally written still holds up.]
The impact on Nvidia’s stock price (since people speculated that you might not need so much computing power after all to achieve major AI developments) was so significant that conspiracy theories were bound to emerge. If all it takes to briefly crash stock prices is releasing a language model with allegedly miraculous capabilities, then for someone with enough capital that’s a major opportunity to make money, and training such a model can quickly pay for itself. The markets’ vulnerability to questionable information is definitely one lesson to take away from this episode.
That said, R1 is indeed remarkable in its abilities—and above all in its efficiency. It is also noteworthy how far one can get with comparatively simple reinforcement learning. I think we should brace ourselves to see further impressive simulations of complex thought processes emerge from fairly primitive training methods. This is also suggested by a recent paper that used a very different strategy to get an open-source model to match o1’s performance in math: the authors curated 1,000 carefully selected examples of logical reasoning and simply had the model “pause” instead of giving an immediate answer. The fact that a simple “Wait!” plus training costs under 50 USD were enough to achieve results on par with o1 in math is, in my opinion, just as remarkable as R1’s breakthrough.
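The “pause” trick from that paper, which the authors call budget forcing, is conceptually simple: when the model tries to close its thinking phase too early, the end-of-thinking marker is withheld and the word “Wait” is appended, so generation continues inside the reasoning section. The sketch below only illustrates that control flow; the generate function, the delimiter string, and the budgets are placeholders, not the paper’s actual code.

```python
# Conceptual sketch of "budget forcing" as described in the s1 paper:
# if the model wants to stop thinking before a minimum number of rounds,
# append "Wait" and let it keep reasoning. `generate` is a stand-in for
# whatever decoding function your model stack provides.

END_OF_THINKING = "</think>"  # placeholder delimiter for the reasoning phase


def generate(prompt: str, stop: str, max_tokens: int) -> str:
    """Stand-in for a real LLM call; replace with your model's decoder."""
    return " ...some partial reasoning... "


def think_with_budget(prompt: str, min_rounds: int = 2,
                      tokens_per_round: int = 512) -> str:
    reasoning = ""
    for round_idx in range(min_rounds):
        chunk = generate(prompt + reasoning,
                         stop=END_OF_THINKING,
                         max_tokens=tokens_per_round)
        reasoning += chunk
        # Instead of letting the model close its thinking phase right away,
        # nudge it to reconsider; this alone measurably improves accuracy.
        if round_idx < min_rounds - 1:
            reasoning += "Wait,"
    # After the forced extra rounds, the model may finish normally.
    return reasoning + END_OF_THINKING


print(think_with_budget("Question: how many trips does the farmer need?"))
```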
There’s another notable aspect of R1 we haven’t mentioned yet—it is open-source and can be run offline on one’s own hardware, which means zero token costs. We really need to let that sink in: a slimmed-down variant of R1—and thus a model at least close to o1-preview or o3-mini—can run on any home PC today! I warned exactly a year ago in an interview about a development like this—and even now, I see no political or societal preparations for a scenario in which criminals could use such models to support their activities. Yes, it’s worth pointing out that the Chinese model gives predictable answers about Taiwan. But the far more fundamental threat to democratic structures likely comes from how much mischief one could theoretically cause with so much simulated cognitive power, right?
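To make the “runs on any home PC” claim concrete: with a local runner such as Ollama, one of the distilled R1 variants can be pulled and queried in a few lines, entirely offline and without token costs. This is a sketch under the assumption that an Ollama server is running locally and that a deepseek-r1 tag of a size suitable for your hardware has been pulled beforehand (e.g., with ollama pull deepseek-r1:7b); exact model tags and memory requirements may differ.

```python
# Minimal local sketch using the `ollama` Python package (pip install ollama).
# Assumes the Ollama server is running and a distilled "deepseek-r1" model
# of a suitable size has already been pulled. No API keys, no token costs.
import ollama

response = ollama.chat(
    model="deepseek-r1:7b",  # assumed tag; choose a size your hardware can handle
    messages=[{"role": "user",
               "content": "How many trips does the farmer need with the goat?"}],
)

# The distilled variants emit their reasoning inside <think>...</think> tags,
# followed by the final answer.
print(response["message"]["content"])
```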
Society is certainly facing huge challenges. This is due to the aforementioned safety concerns but also because the race for AGI is now officially on. You can see this from how OpenAI responded to the DeepSeek-R1 hype—namely, in a state of near panic. What’s relevant here is not so much the mockery on social media about the accusation that DeepSeek had helped itself to OpenAI’s copyrighted material. That’s a sideshow. The real point is how hastily OpenAI then released o3-mini to users—on January 31, with surprising flaws in the user interface.
Just a few days later came the “Deep Research” feature, which gave Pro users full access to o3’s capabilities combined with a very robust internet research function. That put OpenAI back in the headlines, dominating the conversation again—and rightly so. (Their attempt to get back on top with Operator hadn’t really succeeded.) Deep Research is genuinely something new: it can produce reports on questions that—under certain conditions (e.g., availability of high-quality online sources)—reach a quality comparable to a text produced by a highly specialized human (someone with at least a Master’s degree, in most cases) who also has advanced reading competencies (closer to a doctoral level) and plenty of time. (I plan to write a separate post on this soon.)
So now the possibility of an economic revolution feels very pressing. Sam Altman, CEO of OpenAI, estimates that Deep Research could take over a single-digit percentage of all economically valuable tasks. Granted, the question of resources—whether there’s enough computing power for AI to take over the economy, despite the steady drop in costs—remains unanswered. Nor is the infrastructure in place for companies to replace employees en masse with AI agents. It’s also doubtful whether our societies would even have the political will—or, if so, the political competence—to allow that.
But in principle, OpenAI has demonstrated with Deep Research that many tasks performed by highly qualified humans could be automated in the near future. I had thought this was more of a mid-term prospect, with the real question being how effectively, efficiently, and safely various AI agents could work together in teams—an issue that is crucial for automating entire companies. Yet we are going to face those questions sooner than I expected; I suspect they will become relevant for certain industries this year. And as far as fundamental upheaval goes, the only major open question for the foreseeable future is how AI will be integrated into robotics over the coming years...
We really do need to think. While we still can. With systems like Deep Research at our service, our willingness and capacity to push beyond mental comfort zones might decline. At least, that’s the direction indicated by studies of ChatGPT’s role in education. It may sound pessimistic, but I notice the same tendency in myself. And that worries me in a time when such fundamental societal decisions are being made.
Who we are as humans and what we want our future to look like should be at the forefront of our minds. We should be ready to really think hard about it—because there’s a lot at stake. It’s not just about the “10 best AI tools” that someone might be pitching on LinkedIn. It’s about the kind of society we want to live in and what steps we need to take to ensure we don’t lose control of this development.
A bit of regulation at the EU level isn’t going to help much. The forces in play are simply too fundamental. Anyone familiar with the ideological assumptions of leading figures in AI development—including the ethical implications—won’t want to rely on sluggish government measures if the goal is to keep certain sectors of life AI-free or exclusively human.
That is, if you even want to pursue such a thing. In any case, I think we’re overlooking the fact that many key proponents of AGI development are committed far more to altruistic ideals than to any specific democratic form of governance. The experiments currently exploring possible alternatives are anything but subtle—and so far, they’re running surprisingly smoothly. Elon Musk’s issue, for instance, isn’t just with too much bureaucracy. Clearly, it’s also about testing how much traditional social institutions will bend when it comes to restraining rapid development (for the “good of humanity”!).
And we’re simply not giving it any thought.