Qwen 2.5 Coder 32B: Is This Best Open Weight Model Better than GPT-4o and Claude 3.5 Sonnet

Rifx.Online
Programming , Machine Learning , Generative AI
20 Nov, 2024

On November 11, Alibaba announced its most advanced coding model to date: Qwen 2.5-Coder-32B-Instruct. But that’s not all, it’s actually part of a whole family of coding models! In addition to the 32B model, there are versions with 0.5B, 1.5B, 3B, 7B, and 14B parameters. Before writing this article, I reviewed feedback from many of YouTubers, writers, and technical experts, and the consensus has been overwhelmingly positive. Today, we’ll explore whether it really lives up to the hype.

I’ve been using ChatGPT, Gemini, and Claude for a while, and I can confidently say that Claude is the best at coding and understanding complex tasks. To be honest, Gemini is in a different league altogether. Fun fact: I once won a machine learning competition two years ago using Bard (now rebranded as Gemini) because everyone else was using ChatGPT. Bard’s performance gave me a competitive edge back then. Today, I believe Gemini might be the best vision model available, but overall, I’m not a big fan.

To evaluate the model’s performance in coding, we should cover the following three key aspects: Code Generation, Code Repair, and Code Reasoning. Additionally, we need to consider how many programming languages the model supports and its alignment with human preferences.

Overview

Before testing out Alibaba’s model, let’s take a look at the performance results they have announced.

Based on their internal benchmarks, it even outperforms GPT-4 and comes very close to Claude 3.5. However, I personally don’t place too much weight on internal benchmarks alone, but the results seem very interesting according to those who have tried it.

You can find more details about the performance here:

I’m testing the model locally using Ollama:

You can also try out the model on the demo page of Hugging Face:

Code Generation

Many people have already covered how well it performs in web and design:

So, we will focus on testing it with some Python challenges. Let’s start with a medium-level LeetCode problem:

Prompt

The problem description: Longest Substring Without Repeating Characters

Answer

class Solution(object):
    def lengthOfLongestSubstring(self, s):
        """
        :type s: str
        :rtype: int
        """
        # Dictionary to store the last seen index of each character
        char_index_map = {}
        # Initialize the start of the current window and the maximum length
        start = 0
        max_length = 0
        
        for end in range(len(s)):
            # If the character is already in the map and its index is within the current window
            if s[end] in char_index_map and char_index_map[s[end]] >= start:
                # Move the start to the right of the last occurrence of the current character
                start = char_index_map[s[end]] + 1
            
            # Update the last seen index of the current character
            char_index_map[s[end]] = end
            
            # Update the maximum length of the substring found so far
            max_length = max(max_length, end - start + 1)
        
        return max_length

Results

Not the best, but it gets things done and passes all the tests. Can it solve a hard problem? (Note that the model could have been trained on these problems already, but it’s still impressive.)

Prompt

Problem description: Regular Expression Matching

Answer

class Solution(object):
    def isMatch(self, s, p):
        """
        :type s: str
        :type p: str
        :rtype: bool
        """
        m, n = len(s), len(p)
        
        # Create a 2D array to store the results of subproblems
        dp = [[False] * (n + 1) for _ in range(m + 1)]
        
        # Base case: empty string matches empty pattern
        dp[0][0] = True
        
        # Handle patterns like a*, a*b*, etc., which can match an empty string
        for j in range(1, n + 1):
            if p[j - 1] == '*':
                dp[0][j] = dp[0][j - 2]
        
        # Fill the dp table
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if p[j - 1] == '.' or p[j - 1] == s[i - 1]:
                    dp[i][j] = dp[i - 1][j - 1]
                elif p[j - 1] == '*':
                    dp[i][j] = dp[i][j - 2]  # '*' matches zero occurrences
                    if p[j - 2] == '.' or p[j - 2] == s[i - 1]:
                        dp[i][j] = dp[i][j] or dp[i - 1][j]  # '*' matches one or more occurrences
        
        # The result is in dp[m][n]
        return dp[m][n]

Results

After testing it on 10–20 problems on LeetCode, I have the impression that it was trained on all problems, so it’s not the best way to evaluate it. However, it’s still worth sharing. Additionally, it always chooses to provide a Python solution, even when not specifically asked for one. But let’s try something different. I asked the model to optimize the code to minimize resource consumption as much as possible.

I didn’t expect that, but when I asked the model to improve the performance, it even fails in the simplest cases.

Let’s now move to the next test, Code Repair. I will be providing the model with some code containing errors generated by Claude. I asked the model to generate some hard cases to fix, and let’s see how it performs.

Code Repair

Prompt

The following code contains several bugs. Please identify and fix them, 
and provide an explanation for each bug you find.
class AsyncTaskQueue:
    def __init__(self):
        self.tasks = []
        self.running = False
        self.lock = threading.Lock()
        self.workers = []

    async def add_task(self, task):
        async with self.lock:  
            self.tasks.append(task)
            if not self.running:
                self.start_processing()

    async def start_processing(self):
        self.running = True
        while self.tasks:
            task = self.tasks.pop(0)
            worker = asyncio.create_task(self.process_task(task))
            self.workers.append(worker)
            if len(self.workers) >= 3:
                
                done, pending = await asyncio.wait(self.workers)
                self.workers = list(pending)
        self.running = False

    async def process_task(self, task):
        result = await task()
        return result

Answer

Certainly! Let's go through the provided code and identify the issues, then correct them. Here are the main problems and their fixes:

Issues Identified
Threading Lock Usage with Asyncio:
The threading.Lock is not suitable for use with asyncio. Instead, you should use asyncio.Lock.
Async Context Manager for Lock:
The async with statement is used incorrectly with threading.Lock. It should be used with asyncio.Lock.
Starting Processing:
The start_processing method should be called in a non-blocking manner. Since start_processing is an async method, it should be awaited properly.
Task Management:
The start_processing method should handle task processing more efficiently. The current implementation might not handle task completion correctly.
Worker Management:
The workers list should be managed more carefully to ensure that tasks are completed before marking the queue as not running.
Corrected Code
Here's the corrected version of the code:


import asyncio
import threading

class AsyncTaskQueue:
    def __init__(self):
        self.tasks = []
        self.running = False
        self.lock = asyncio.Lock()
        self.workers = []

    async def add_task(self, task):
        async with self.lock:
            self.tasks.append(task)
            if not self.running:
                self.running = True
                asyncio.create_task(self.start_processing())

    async def start_processing(self):
        while True:
            async with self.lock:
                if not self.tasks:
                    self.running = False
                    break
                task = self.tasks.pop(0)
            
            worker = asyncio.create_task(self.process_task(task))
            self.workers.append(worker)
            
            if len(self.workers) >= 3:
                done, pending = await asyncio.wait(self.workers, return_when=asyncio.FIRST_COMPLETED)
                self.workers = list(pending)

    async def process_task(self, task):
        result = await task()
        return result

## Example usage
async def example_task():
    await asyncio.sleep(1)
    print("Task completed")
    return "Result"

async def main():
    queue = AsyncTaskQueue()
    await queue.add_task(example_task)
    await queue.add_task(example_task)
    await queue.add_task(example_task)
    await queue.add_task(example_task)
    await queue.add_task(example_task)
    await queue.add_task(example_task)

asyncio.run(main())
Explanation of Fixes
Asyncio Lock:
Replaced threading.Lock with asyncio.Lock to ensure compatibility with asynchronous operations.
Async Context Manager for Lock:
Used async with self.lock: to properly manage the lock in an asynchronous context.
Starting Processing:
Changed the logic to start processing in a non-blocking manner by using asyncio.create_task(self.start_processing()) inside add_task.
Task Management:
Modified start_processing to continuously check for tasks and process them until no tasks are left. It breaks out of the loop when there are no more tasks and sets self.running to False.
Worker Management:
Ensured that workers are managed correctly by using asyncio.wait with return_when=asyncio.FIRST_COMPLETED to handle task completion efficiently.
Example Usage
The example usage demonstrates how to add tasks to the AsyncTaskQueue and run them asynchronously. The main function adds several tasks to the queue, and they are processed concurrently with a maximum of 3 workers at a time.

This corrected and optimized code should perform better and be more reliable.

Results and feedback

Claude evaluates the model’s response as 7.5/10. with this feedback:

While the model provided a functionally correct solution, it delivered a “tutorial-grade” fix rather than a “production-grade” implementation. This indicates that for complex concurrent programming tasks, human expertise is still crucial for robust system design.

After experimenting with the model, it gets the job done, but often requires follow-up prompts to improve the response. While it doesn’t struggle with understanding the task at hand and can correct most critical bugs, it still lacks some of the strengths seen in models like ChatGPT and Claude, particularly in areas such as error handling, resource management, and optimization. I believe these may be the model’s key weaknesses.

Specialized LLMs: A New Era in AI

It’s fascinating to witness the emergence of relatively small, domain-specific language models that can run locally. I believe we’re entering an exciting new era of specialized LLMs, and this is just the beginning. While the concept isn’t new, we’re finally seeing models that truly excel in specific domains.

The combination of RAG (Retrieval-Augmented Generation) and specialist LLMs could define the AI landscape in the coming months/years. We’re likely to see more powerful coding-focused models from industry leaders like OpenAI and Anthropic. Programming is perhaps one of the most natural domains for specialized AI, and we might soon see even more focused models, imagine LLMs specifically optimized for DevOps or front-end development!

Don’t feel overwhelmed by these rapid advances. Yes, it can be somewhat daunting to see LLMs mastering skills that traditionally took years to develop. Coding, which has challenged humanity for decades, is being transformed before our eyes. But rather than seeing this as an endpoint, we should view it as an opportunity for growth and innovation.

Whether the current wave of LLM advances slows down in the coming years or this is merely the beginning of a longer journey, our response should remain the same: stay curious, keep learning, and never stop innovating. The future of technology is being written right now, and we all have a part to play in shaping it.