A team of nine researchers at Sina Weibo has introduced VibeThinker-3B, a compact language model that reportedly matches or exceeds much larger systems from Google DeepMind, OpenAI, Anthropic, and DeepSeek on several reasoning benchmarks.
The 3-billion-parameter model scored 94.3 on AIME 2026, matching the performance range of DeepSeek V3.2, which has 671 billion parameters, and beating Gemini 3 Pro’s score of 91.7.
With a test-time scaling method called Claim-Level Reliability Assessment, VibeThinker-3B’s AIME 2026 score increased to 97.1.
Benchmark Results
VibeThinker-3B scored 91.4 on AIME 2025, 89.3 on HMMT 2025, 93.8 on BruMO 2025, and 76.4 on IMO-AnswerBench.
In coding tests, it achieved an 80.2 Pass@1 score on LiveCodeBench v6 and a 96.1% acceptance rate on unseen LeetCode weekly and biweekly contests held between late April and late May 2026.
It also scored 93.4 on IFEval for instruction following.
The model passed 123 of 128 first-attempt LeetCode submissions, outperforming GPT-5.2, Doubao Seed 2.0 Pro, Kimi K2.5, and Claude Opus 4.6 under the same evaluation conditions.
Much Smaller Than Rivals
VibeThinker-3B has around 224 times fewer parameters than DeepSeek V3.2.
GLM-5 has 744 billion parameters, while Kimi K2.5 exceeds one trillion. By comparison, VibeThinker-3B is small enough to run on a consumer laptop.
The researchers argue that verifiable reasoning tasks, such as mathematics and coding, can be compressed into smaller models more effectively than broad factual knowledge.
They call this the Parametric Compression-Coverage Hypothesis.
Knowledge Limitations
The model does not match large general-purpose systems across every area.
It scored 70.2 on GPQA-Diamond, compared with 91.9 for Gemini 3 Pro and 87.0 for Claude Opus 4.5.
The researchers said this supports their argument that compact models can perform strongly on verifiable reasoning tasks without replacing larger models that provide broader knowledge coverage.
Training Process
VibeThinker-3B is based on Alibaba’s Qwen2.5-Coder-3B and was improved through a four-stage post-training process.
The first stage used supervised fine-tuning on mathematics, coding, STEM reasoning, dialogue, and instruction-following data before shifting toward harder, longer reasoning problems.
Training samples with reasoning traces shorter than 5,000 tokens were removed, along with problems that the earlier VibeThinker-1.5B could solve more than 75% of the time.
The second stage used reinforcement learning across mathematics, coding, and STEM tasks through MaxEnt-Guided Policy Optimization.
Instead of gradually expanding the context window, the researchers used a single 64,000-token window because progressive expansion reduced performance at the 3B scale.
A separate Long2Short Math RL stage rewarded shorter correct solutions to reduce unnecessary verbosity.
The third stage distilled successful reasoning traces from reinforcement-learning checkpoints back into a unified model.
The final stage applied reinforcement learning to instruction-following tasks using rule-based checks and reward models.
Benchmark Concerns
The results quickly attracted attention, but they also raised concerns that the model may have been optimized too heavily for benchmarks.
Some users reported weak performance on practical coding questions, including difficulty with commonly used development tools.
Others questioned why the researchers did not test the model on broader software-engineering benchmarks.
The researchers said the training data underwent strict benchmark decontamination, including filtering for overlapping text.
The recent LeetCode contests provide stronger protection against data leakage because they took place after any likely training cutoff.
However, user reports still suggest a gap between benchmark scores and practical performance.
Open-Source Release
The model was released under the MIT License, with its weights available through Hugging Face and ModelScope.
Within the first day, developers had already produced GGUF quantized versions and derivative models.
The paper received 62 upvotes on Hugging Face’s daily papers page, while the model repository gained 130 likes and the GitHub project reached 685 stars.
Weibo’s AI Work
Sina Weibo is better known for its social media platform than for frontier AI research.
However, VibeThinker-3B is the company’s second major open-source AI release in seven months.
VibeThinker-1.5B, released in November 2025, reportedly beat the original DeepSeek R1 on several mathematics benchmarks.
The team said its post-training cost was $7,800, compared with an estimated $294,000 for DeepSeek R1.
Smaller Models Could Play A Bigger Role
The researchers do not claim that VibeThinker-3B can replace large general-purpose models.
Instead, they argue that small models could handle reasoning work while larger systems provide factual knowledge in hybrid AI systems.
Such an approach could reduce the cost of deploying advanced reasoning and make strong mathematical and coding capabilities available on devices with limited hardware.
The key question is whether the model’s benchmark performance can translate into reliable real-world use.
Stay Connected with ProPakistani
Get the latest tech news, telecom insights, and product launches wherever you prefer.
Add ProPakistani to Preferred Sources and see more of our stories in Google Search and Top Stories.
