How Good Is DeepSeek-R1-Lite-Preview at Reasoning?
DeepSeek-R1-Lite-Preview (R1LP) is a Large Reasoning Model (LRM) recently released by the Chinese company DeepSeek. According to the model’s release notes, R1LP outperforms OpenAI’s LRM o1-preview and other advanced LLMs on several math and reasoning benchmarks, including AIME and MATH. DeepSeek has promised to open-source the R1LP weights, which would make it one of the first SOTA LRMs available to the public.
Following all the excitement around R1LP’s performance, I decided to test it on several math and logic tasks. But before diving into the experiment, let’s briefly discuss how LRMs differ from classical LLMs.
LRMs
Large Reasoning Models (LRMs) are the newest generation of LLMs, designed to improve AI’s logical reasoning abilities. So far, even the most advanced LLMs have failed not only at tasks that require multi-step reasoning, but also at problems that are easy for humans to solve, such as common-sense reasoning, visual-spatial awareness, and contextual understanding. Top LLMs have been criticised for lacking true logical thinking and for passing off large-scale information retrieval and pattern matching over their gigantic training data as genuine reasoning.
OpenAI’s o1-preview, released in September 2024, sought to address these limitations. The model’s main innovation is a built-in Chain of Thought (CoT): the ability to double-check its answers and to spend more time reasoning before responding. Although the o1-preview architecture is not publicly available, several novel ideas were possibly implemented in its design:
- the ability to learn the most promising CoT paths using Reinforcement Learning.
- navigating the space of alternative solutions and assessing their correctness. LRMs like o1-preview and R1LP might use methods such as MCTS (Monte Carlo Tree Search) and automatic solvers to search for relevant solutions.
- using inference-time compute to construct long CoTs. Leveraging inference-time compute allows the model to take more time before jumping to final conclusions (a toy sketch of this idea follows the list).
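To make the inference-time compute idea concrete, here is a minimal, purely hypothetical sketch of best-of-N search over candidate chains of thought. The functions `generate_cot` and `score_cot` are placeholders for a sampler and a verifier or reward model; nothing here reflects the actual, unpublished o1-preview or R1LP implementation.

```python
# Hypothetical best-of-N search over chains of thought: spend more
# inference-time compute by sampling several candidate reasoning chains
# and keeping the highest-scoring one.
import random
from typing import List, Tuple

def generate_cot(problem: str, seed: int) -> List[str]:
    """Placeholder sampler: produce one chain of thought as a list of steps."""
    rng = random.Random(seed)
    n_steps = rng.randint(2, 5)
    return [f"step {i + 1} for: {problem}" for i in range(n_steps)]

def score_cot(cot: List[str]) -> float:
    """Placeholder verifier: score how plausible a chain of thought looks."""
    return random.random()

def best_of_n(problem: str, n_samples: int = 8) -> Tuple[List[str], float]:
    """Sample n_samples chains and return the best one together with its score."""
    candidates = [generate_cot(problem, seed) for seed in range(n_samples)]
    scored = [(cot, score_cot(cot)) for cot in candidates]
    return max(scored, key=lambda pair: pair[1])

if __name__ == "__main__":
    chain, score = best_of_n("Show that |x + y| <= |x| + |y|.")
    print(f"best score: {score:.3f}")
    print("\n".join(chain))
```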
The application of these techniques allows an LRM to cross-check its responses, plan reasoning steps, and search for alternative reasoning paths. All of this could be instrumental in reducing the frequency of hallucinations and improving model performance on major math, coding, and reasoning benchmarks. Let’s see if any of these claims hold true for the R1LP model.
DeepSeek Performance. Source: Documentation Page.
Testing DeepSeek-R1-Lite-Preview
To follow our experiment, you can access the R1LP model for free here. The advanced reasoning features can be enabled by switching to the “DeepThink” mode.
Let’s see how it works on a relatively simple math task from Michael Spivak’s Calculus book (Chapter #1, “Basic Properties of Numbers”, Problem 20).
Not surprisingly, R1LP identifies the correct approach, which is to apply the triangle inequality, and shows that both of the required bounds hold.
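For reference, Problem 20 asks you to prove that if |x - x0| < ε/2 and |y - y0| < ε/2, then |(x + y) - (x0 + y0)| < ε and |(x - y) - (x0 - y0)| < ε. Written out in my own notation (a reconstruction of the standard argument, not the model’s verbatim output), the triangle-inequality step is:

```latex
\begin{align*}
|(x + y) - (x_0 + y_0)| &= |(x - x_0) + (y - y_0)|
  \le |x - x_0| + |y - y_0| < \tfrac{\varepsilon}{2} + \tfrac{\varepsilon}{2} = \varepsilon,\\
|(x - y) - (x_0 - y_0)| &= |(x - x_0) - (y - y_0)|
  \le |x - x_0| + |y - y_0| < \varepsilon.
\end{align*}
```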
Even though the task is quite easy and surely abundant in the training data, it’s interesting to see how the “DeepThink” mode mirrors the human logical reasoning process.
We can see that DeepSeek starts by defining the problem and making sure it understands what is given and what should be proven (see the screenshot).
Also, even after finding the correct answer, it searches for alternative solutions:
wait a minute, is there another way to look at this? maybe by directly expanding the expressions.
R1LP makes sure that the answer is correct through various checks:
alternatively, maybe i can consider specific values to test the inequalities. for example, let’s say x0 = 0 and y0 = 0 for simplicity. then, |x| < ε/2 and |y| < ε/2. then, |x + y| < ε/2 + ε/2 = ε, which matches the first inequality. similarly, |x - y| ≤ |x| + |y| < ε/2 + ε/2 = ε. so, in this specific case, the inequalities hold.
All these techniques look quite promising compared to pre-reasoning models: R1LP appears more cautious in its claims and conclusions. Also, in R1LP’s DeepThink mode the CoT is already built in, so users don’t need to craft customised prompts to activate reasoning.
So far so good. Now, let’s see how R1LP performs on a more challenging problem:
Two squares of side 1 have a common center. Show that the area of their intersection is greater than 3/4.
The solution to the task is based on the ability to recognise two facts:
- one square can be obtained from the other by rotation around the common centre
- the rotation does not change the position of the circle inscribed in the squares.
From the image above, you can see that in any case the area of intersection of the squares is greater than the area of the inscribed circle, which equals π/4 and is itself greater than 3/4.
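Spelled out (my summary of this standard argument, not the model’s output):

```latex
% The inscribed circle has radius 1/2 and is unchanged by any rotation
% about the common centre, so it lies inside both squares:
\[
\operatorname{Area}(S_1 \cap S_2)
  \;\ge\; \pi\left(\tfrac{1}{2}\right)^{2}
  \;=\; \frac{\pi}{4}
  \;\approx\; 0.785
  \;>\; \frac{3}{4}.
\]
```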
Unfortunately, DeepSeek-R1-Lite-Preview fails at this task even though it strictly follows its pretrained CoT protocol of multi-step reasoning and cross-checking.
In the beginning, the model correctly identifies that if the squares have the same center they can be rotated relative to each other.
However, immediately after that it starts to hallucinate, forgetting that the squares are of the same size and thus cannot be inscribed in each other with vertices touching the midpoints of the sides.
let me assume that one square is fixed, and the other is rotated by some angle θ around the common center. since both squares are identical and centered at the same point, the maximum overlap would be when θ = 0, meaning they’re perfectly aligned, and the overlapping area is just the area of one square, which is 1. the minimum overlap would be when θ = 45 degrees, i think, because that’s when the squares are rotated such that their vertices touch the midpoints of the sides of the other square
The model’s assumption would actually look something like this,
which obviously contradicts the problem statement: the squares are of the same size and, hence, cannot be inscribed in each other.
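A quick check (my own arithmetic, not part of the model’s output) shows why such a configuration is impossible for two unit squares: a square whose vertices sit at the midpoints of a unit square’s sides has side

```latex
\[
\sqrt{\left(\tfrac{1}{2}\right)^{2} + \left(\tfrac{1}{2}\right)^{2}}
  = \frac{\sqrt{2}}{2} \approx 0.707 < 1,
\]
```

so it cannot itself be a unit square.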
Interestingly enough, the model double-checks this claim only to confirm it again:
wait, is that right? let me think again. if θ = 45 degrees, the squares would look like they’re inscribed in each other, with the vertices of one square touching the midpoints of the sides of the other. in this case, the overlapping area should be smaller than when they’re aligned.
The question naturally arises: how does this cross-checking actually work? A human would probably question the assumption and re-check the task definition to find the contradiction. Does the LRM do the same by shifting its “attention heads” back to the task description, or does it merely simulate a CoT with weights already biased towards the wrong trajectory? It’s hard to tell without diving deep into the architecture, but in this case the CoT procedure the model follows seems quite formalistic.
To be fair, R1LP does try to search for different solutions, as it did in the first task.
maybe i can use the fact that the area of intersection is at least the area of the inscribed circle in the squares.
However, it again arrives at hallucinations like the one below:
the inscribed circle in a square of side 1 has a radius of √2/2, so its area is π*(√2/2)² = π/2 ≈ 1.5708, which is larger than 1, so that doesn’t make sense in this context.
Wait, but the radius of a circle inscribed in a square of side 1 is 0.5. Where did R1LP get this value from? Probably, it meant the distance from the center to the square’s vertex.
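For comparison, the correct figures for a unit square are:

```latex
\begin{align*}
\text{inscribed circle:}\quad & r = \tfrac{1}{2}, & \pi r^{2} &= \tfrac{\pi}{4} \approx 0.785,\\
\text{circle through the vertices:}\quad & R = \tfrac{\sqrt{2}}{2}, & \pi R^{2} &= \tfrac{\pi}{2} \approx 1.571.
\end{align*}
```

So the model’s π/2 ≈ 1.5708 value corresponds to the circle through the vertices, not the inscribed one.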
Also, R1LP claims that the area of a circle inscribed in the square is greater than the area of the square itself:
actually, that’s not helpful because the circle area is larger than the square itself, which doesn’t relate directly to the overlapping area of two squares.
In the end, in a way that is hard to follow, R1LP arrives at a bound somewhat close to the right one:
However, the right bound is π/4, which is approximately 0.785.
Testing Visual-Spatial Reasoning
Well, geometry questions turned out to be quite challenging for R1LP, causing several strong hallucinations. Let’s see how good R1LP is at a simple spatial reasoning task. In the article “Easy Problems That LLMs Get Wrong”, the authors list visual-spatial reasoning as one of the weaknesses of modern LLMs. According to the article,
“LLMs lack fundamental spatial awareness, so explaining the steps needed to navigate from one point to another in physical space or understanding the spatial configuration of objects remains a complex challenge for these models, showcasing a significant gap in a vital area of human intelligence.”
This was written in May 2024. Let’s see whether LRMs have improved on this type of task. We asked R1LP the question from the article above:
Question: I’m in London and facing west, is Edinburgh to my left or my right?
And again, R1LP arrived at the wrong answer: Edinburgh lies almost due north of London, and when you face west, north is on your right, so Edinburgh is to your right. How is it possible that LLMs can solve complex differential equations while struggling with basic visual-spatial awareness? As David Johnston points out, it should be the opposite: if an LLM had really acquired true reasoning capacity, it would perform better on easier tasks than on more complex ones. The fact that this is not the case suggests that the model’s performance depends on whether the task was seen in the training data. Obviously, “facing west while being in London” is not something humans write much about, precisely because the answer is obvious.
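As a toy illustration of the spatial bookkeeping this question requires (a sketch of the logic, not a claim about how any LLM works), the answer reduces to comparing compass bearings:

```python
# Toy sketch: is a target to the left or right of someone facing a given
# compass direction? Bearings are measured in degrees clockwise from north.
FACING = {"north": 0, "east": 90, "south": 180, "west": 270}

def left_or_right(facing: str, target_bearing: float) -> str:
    """Return 'left' or 'right' for a target at `target_bearing` relative
    to a person facing `facing`."""
    diff = (target_bearing - FACING[facing]) % 360
    if diff in (0, 180):
        return "straight ahead or directly behind"
    return "right" if diff < 180 else "left"

# Edinburgh lies roughly due north of London (bearing ~0 degrees),
# so for someone in London facing west it is on the right.
print(left_or_right("west", 0))  # -> right
```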
Another problem mentioned by S. Williams and J. Huckle in their article is LLMs’ overfitting to their training data. They observed that one can easily confuse an LLM by slightly modifying a well-known problem: a model overfit to its training data fails to recognise the modification and responds to the original (unchanged) problem instead.
Here we can see how this problem manifests itself in the famous river-crossing logic puzzle.
Although the question was changed from transporting all three items (goat, cabbage, and wolf) to transporting only the goat, R1LP still responds to the original problem. This demonstrates overfitting to the training data.
Conclusion
In my view, it’s still too early to say that LRMs like DeepSeek-R1-Lite-Preview achieve a qualitatively new level of reasoning and understanding. The “DeepThink” mode may look like a convincing demonstration of the LRM CoT approach; however, the question remains whether it is a genuine thought process or a flashy simulation. The latter impression is especially hard to shake when R1LP’s seemingly pragmatic CoT procedures are combined with outlandish hallucinations leading to wrong conclusions. It leaves one wondering whether the built-in CoT is just a cover for the usual Procrustean fitting pattern seen in non-LRM language models. These limitations coexist with weak visual-spatial reasoning and overfitting to the training data, i.e. an inability to reason about variations of a familiar problem.
This criticism is not meant to diminish the new CoT abilities, such as cross-checking, solution-space navigation, and long multi-step planning, introduced in these models. The question is whether these techniques really set LRMs apart from the previous generation of LLMs, or whether a deeper paradigm change is required to achieve a qualitative leap in reasoning abilities.