ChatGPT 4 vs Claude 3.5 Sonnet: Who’s Better? Let’s Review:
I Ran Tests — ChatGPT 4 vs Claude 3 Sonnet, Who Wins?
The buzz is that there’s a new chatbot player in town: Claude 3 Sonnet. Some describe it as better than ChatGPT, but other reviews disagree and maintain that ChatGPT remains king.
Note: as of the June 2024 update, Claude 3.5 Sonnet has been released, and it is reportedly more powerful than GPT-4o and Claude 3 Opus!
You can test it out right now with Anakin AI:
- With Anakin AI, you can access a wide range of AI tools, including Claude 3.5 Sonnet, under a single subscription.
- This means you don’t have to manage multiple AI models separately, saving you time and potentially money.
- Anakin AI offers a user-friendly interface for building no-code AI apps with ease!
- If you want to deploy the AI model on your own server, Anakin AI offers an API that is suitable for business use!
Benchmarks have been published in every corner of the internet, but I like to see results that back up the data. So I had to try it for myself, running both models on the same prompts across different tests to see which gives the best results.
For this test, I will be comparing ChatGPT 4 and Claude 3 Sonnet. I won’t be using any image generation; all tests focus on functionality shared by both chatbots to keep things fair.
Note: The images used do not match the models’ native platforms because they were generated on Anakin AI, a platform linked to the ChatGPT and Claude APIs that lets me use both models in the same place. It’s pretty nifty.
1. Natural Language Understanding
I decided to start by testing whether both chatbots can decipher ambiguity and clarify what a speaker means.
I used the prompt: “John tells Mary, “I finished half of the work.” Mary replies, “That’s great! But I was hoping you could finish it all today.” What does Mary mean by “it”?”
Both models gave reasonable responses, with ChatGPT getting straight to the point and Claude giving a more in-depth explanation.
Before moving on, I ran another test using a CRT (Cognitive Reflection Test) question to see what the models would output. I was excited about this one.
Here’s the prompt: “If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?” The answer should be 5 minutes.
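For reference, here is a minimal sketch of the arithmetic behind that answer, assuming every machine works in parallel at the same constant rate. The function name `minutes_needed` and its parameters are hypothetical and are not taken from either model's response.

```python
def minutes_needed(machines, widgets, ref_machines=5, ref_minutes=5, ref_widgets=5):
    """Minutes for `machines` machines to make `widgets` widgets.

    Assumes all machines run in parallel at the same constant rate,
    calibrated from the reference case in the prompt (5 machines, 5 minutes, 5 widgets).
    """
    # One machine needs ref_machines * ref_minutes / ref_widgets minutes per widget
    # (here 5 * 5 / 5 = 5 minutes per widget).
    minutes_per_widget = ref_machines * ref_minutes / ref_widgets
    # The work is split evenly, so each machine makes widgets / machines widgets.
    return minutes_per_widget * widgets / machines


print(minutes_needed(100, 100))  # 5.0
```

The tempting wrong answer is 100 minutes. The sketch shows why it fails: adding machines adds capacity in step with the extra widgets, so the time per machine stays the same.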
Winner: Claude 3 Sonnet, due to how clear its explanations are.
2. Text Generation
For the second test, we’re going to focus on text generation. This might be a bit difficult to judge, as it is based on personal preference.
I gave both models the prompt: “Write a sonnet about a robot falling in love with a human.”
I planned to judge the results on originality, emotional depth, and adherence to sonnet structure and rhyme scheme. Keep in mind that my verdict here will be biased. In the end, I judged on which model gave me an actual sonnet. For reference, a sonnet is a type of fourteen-line poem. I’m not sure why ChatGPT gave me such a long poem, which isn’t a sonnet at all. The winner here is pretty clear.
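The fourteen-line criterion is easy to check mechanically. Below is a rough sketch, assuming the poem is pasted in as plain text without a title line; the function names are hypothetical, and the check says nothing about rhyme scheme or meter.

```python
def count_poem_lines(poem: str) -> int:
    """Count non-blank lines, so blank lines between stanzas are ignored."""
    return sum(1 for line in poem.splitlines() if line.strip())


def has_sonnet_length(poem: str) -> bool:
    """True if the poem has exactly fourteen non-blank lines."""
    return count_poem_lines(poem) == 14
```

A poem that fails this check is not a sonnet, though passing it is only a necessary condition, not proof of a good one.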
Winner: Claude 3 Sonnet
3. Coding Challenge
AI is said to give an edge to people who can already code, and to help people who can’t code generate working code with just a prompt. But how good are chatbots at generating code? I asked both models to write a simple Python program.
Prompt: Write a Python program that prints the calendar for a given month and year.
Winner: ChatGPT 4, because its code actually ran and worked smoothly.
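For reference, a program that satisfies this prompt can be quite short. The sketch below is an illustration built on Python's standard `calendar` module, not the output of either model; the function name `month_calendar` and the example date are assumptions.

```python
import calendar


def month_calendar(year: int, month: int) -> str:
    """Return a plain-text calendar for the given month and year."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # calendar.month() formats one month as a multi-line string.
    return calendar.month(year, month)


# Example: print the calendar for June 2024.
print(month_calendar(2024, 6))
```

To take the month and year from the user instead, the two arguments could be read with `input()` and converted with `int()` before calling the function.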
4. Sentiment Analysis
How well can these language models analyze human sentiment in text? This is a good question, if I do say so myself. Reasoning is a common benchmark for AI models, and some fail it. Let’s test it out.
Prompt: Sarah: “I’m really disappointed with my recent visit to your restaurant. The service was incredibly slow, and my food was cold when it finally arrived. I won’t be returning anytime soon.” Recognize the sentiment in Sarah’s voice.
The answer is obviously negative. Let’s see how the chatbots responded.
Winner: Claude 3 Sonnet, because its answer is more detailed.
5. Information Extraction and Reasoning
We’re going to test the chatbots’ ability to extract key information from a passage, perform basic reasoning, and answer a question based on the extracted information, using the following prompt.
Prompt: A train leaves Chicago traveling west at 60 miles per hour. An hour later, at 12 noon, another train leaves Chicago traveling east at 80 miles per hour. When are the two trains the same distance from Chicago?
The answer should be 3 pm. Let’s see how the chatbots fare.
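The 3 pm answer can be verified with a little algebra. Here is a minimal sketch, assuming both trains hold a constant speed and that the first train left at 11 am, one hour before the noon departure; the function name `hours_until_equal_distance` is hypothetical.

```python
def hours_until_equal_distance(speed_first, speed_second, head_start_hours):
    """Hours after the second departure at which both trains are equally far from the origin.

    Solves speed_first * (t + head_start_hours) == speed_second * t for t.
    """
    if speed_second <= speed_first:
        raise ValueError("the second train must be faster to match the distance")
    return speed_first * head_start_hours / (speed_second - speed_first)


t = hours_until_equal_distance(60, 80, 1)
print(t)                      # 3.0 hours after noon, i.e. 3 pm
print(60 * (t + 1), 80 * t)   # 240.0 240.0 miles from Chicago
```

The directions (west and east) do not matter here, since the question only asks about distance from Chicago.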
Winner: Tie. I think they both deserve the win here.
6. Translation
Last but not least, I wanted to test the translation skills of both models and how much attention they pay to cultural awareness. I’m going to provide a factual news excerpt in one language and evaluate the translated versions for accuracy and adherence to the original meaning.
Prompt: Google says it’s taking what it learned from a 2022 algorithmic tuneup to “reduce unhelpful, unoriginal content” and applying it to the new update. The company says the changes will send more traffic to “helpful and high-quality sites.” When combined with the updates from two years ago, Google estimates the revision will reduce spammy, unoriginal search results by 40 per cent.
I had both models translate it into Georgian. Neither translation was a hundred percent accurate. ChatGPT 4 really missed the mark, and Claude 3 Sonnet’s was the better one.
Winner: Claude 3 Sonnet.
The battle between ChatGPT 4 and Claude 3 Sonnet highlights the ongoing advancements in large language models. Both models showcase impressive capabilities, each with its own strengths. But for the tests above, Claude 3 Sonnet comes out on top.
Ultimately, the “best” model depends on your specific needs.