Sky-T1-32B-Preview: an open-source LLM that outperforms OpenAI's o1
UC Berkeley's Sky-T1-32B-Preview in detail
The first big Generative AI release is here: UC Berkeley's open-source LLM, Sky-T1-32B-Preview, has beaten OpenAI's o1 on math and coding benchmarks.
What is Sky-T1–32B-Preview?
It is a 32-billion-parameter reasoning model designed to excel in mathematical reasoning and coding tasks.
Key features
1. High-Performance Reasoning
- Competitive Performance: Matches the performance of advanced models like o1-preview on popular reasoning and coding benchmarks.
- Dual-Domain Expertise: Excels in both mathematical reasoning (e.g., AIME, MATH datasets) and coding tasks (e.g., APPS, TACO datasets).
2. Cost-Effective Fine-Tuning
- Affordable Training: Fine-tuned for less than $450, demonstrating that high-level reasoning capabilities can be achieved efficiently and affordably.
- Efficient Training: Completed in 19 hours on 8 H100 GPUs using DeepSpeed Zero-3 offload.
3. Fully Open-Source
- Transparency: All resources, including data, code, model weights, and technical reports, are open-sourced to empower the academic and open-source communities.
- Reproducibility: Provides a single repository for data curation, training, and evaluation, making it easy to replicate and build upon.
4. Advanced Data Curation
- High-Quality Data: Trained on a curated set of 17K examples spanning math, coding, science, and puzzles.
- Rejection Sampling: Ensures data quality by discarding samples whose answers are incorrect and reformatting outputs for better parsing.
- Balanced Data Mixture: Combines challenging math problems (e.g., NuminaMATH) and complex coding tasks (e.g., APPS, TACO) to enhance reasoning across domains.
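The rejection-sampling step above can be sketched in a few lines: sample several candidate solutions per problem and keep only those whose final answer matches the ground truth. This is an illustrative toy, not the team's pipeline; the `Answer:` format and the `toy_sampler` standing in for the LLM are assumptions.

```python
# Toy sketch of rejection sampling for data curation: keep only samples
# whose extracted final answer matches the ground truth.
import random


def extract_answer(solution: str) -> str:
    # Assume the final answer follows an "Answer:" marker (hypothetical format).
    return solution.rsplit("Answer:", 1)[-1].strip()


def rejection_sample(problem, ground_truth, sampler, n_samples=4):
    """Return accepted (problem, solution) pairs with a correct final answer."""
    accepted = []
    for _ in range(n_samples):
        sol = sampler(problem)
        if extract_answer(sol) == ground_truth:
            accepted.append((problem, sol))
    return accepted


def toy_sampler(problem):
    # Stand-in for the LLM: right roughly half the time.
    ans = "408" if random.random() < 0.5 else "407"
    return f"Step-by-step reasoning... Answer: {ans}"


random.seed(0)
kept = rejection_sample("What is 17 * 24?", "408", toy_sampler, n_samples=8)
print(len(kept))  # only the correct-answer samples survive
```

In the real pipeline the checker is domain-specific (exact match or equivalence for math, test execution for code), but the accept/reject structure is the same.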
5. Model Architecture
- Base Model: Fine-tuned from Qwen2.5-32B-Instruct, an open-source model without inherent reasoning capabilities.
- Training Details: Trained for 3 epochs with a learning rate of 1e-5 and a batch size of 96.
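The reported setup (3 epochs, learning rate 1e-5, global batch size 96, 8x H100 with DeepSpeed ZeRO-3 offload) can be summarized as a small config sketch. Field names mirror common Hugging Face `TrainingArguments` options, but this is an illustration, not the team's actual launch script; the per-device batch size is an assumption.

```python
# Training-setup sketch from the reported numbers; not the official script.
NUM_GPUS = 8          # H100s used in the reported 19-hour run
GLOBAL_BATCH = 96     # total sequences per optimizer step
PER_DEVICE_BATCH = 4  # assumption; tune to fit GPU memory

# Gradient accumulation bridges the per-device and global batch sizes.
grad_accum = GLOBAL_BATCH // (NUM_GPUS * PER_DEVICE_BATCH)

config = {
    "model_name_or_path": "Qwen/Qwen2.5-32B-Instruct",  # base model
    "num_train_epochs": 3,
    "learning_rate": 1e-5,
    "per_device_train_batch_size": PER_DEVICE_BATCH,
    "gradient_accumulation_steps": grad_accum,
    "deepspeed": "zero3_offload.json",  # DeepSpeed ZeRO-3 with offload
    "bf16": True,
}

# Sanity check: per-device batch x GPUs x accumulation = global batch.
print(NUM_GPUS * PER_DEVICE_BATCH * grad_accum)  # → 96
```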
Some key findings
While developing the model, the team uncovered two major insights:
Model Size Matters: Smaller models (7B, 14B) showed limited improvements, while the 32B model delivered significant gains.
Data Mixture Matters: Balancing math and coding data is crucial for optimal performance in both domains.
Performance and metrics
Math Tasks: Sky-T1 slightly outperforms o1 on MATH500 and significantly on AIME2024.
Coding Tasks: While o1 excels in easier coding tasks (LiveCodeBench-Easy), Sky-T1 performs better on harder tasks (Medium and Hard).
General Knowledge: o1 has an advantage on GPQA-Diamond.
How to use Sky-T1-32B-Preview?
The model weights are published on Hugging Face.
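A minimal usage sketch follows. The Hub repo id `NovaSky-AI/Sky-T1-32B-Preview` is an assumption to verify before use, and actual generation needs a large GPU, so the runnable part below is a framework-free prompt-building helper; the `transformers` calls are shown in comments. Since the model inherits Qwen2.5's chat template, in practice you would use `tokenizer.apply_chat_template` rather than hand-rolling the string.

```python
# Sketch: prompting Sky-T1-32B-Preview. The chat-marker strings here are
# illustrative placeholders, not the model's real special tokens.
def build_prompt(question: str) -> str:
    """Wrap a user question in a simple chat-style prompt ending on the
    assistant turn, so generation continues as the assistant."""
    system = "You are a helpful assistant that reasons step by step."
    return f"<|system|>\n{system}\n<|user|>\n{question}\n<|assistant|>\n"


prompt = build_prompt("What is 17 * 24?")
print(prompt)

# With transformers installed and enough GPU memory, generation looks like:
#
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("NovaSky-AI/Sky-T1-32B-Preview")
#   model = AutoModelForCausalLM.from_pretrained(
#       "NovaSky-AI/Sky-T1-32B-Preview", torch_dtype="auto", device_map="auto")
#   ids = tok.apply_chat_template(
#       [{"role": "user", "content": "What is 17 * 24?"}],
#       add_generation_prompt=True, return_tensors="pt").to(model.device)
#   out = model.generate(ids, max_new_tokens=512)
#   print(tok.decode(out[0], skip_special_tokens=True))
```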
Conclusion
UC Berkeley's Sky-T1-32B-Preview marks a significant milestone in the open-source AI landscape, outperforming OpenAI's o1 on key math and coding benchmarks. This fully open-sourced 32-billion-parameter reasoning model not only delivers strong results but also sets a new standard for affordability and accessibility. By releasing the data, code, and model weights, the team has empowered the broader community to innovate and build on this work. Sky-T1 exemplifies how open science can democratize cutting-edge AI advancements for all.