Anthropic, the company behind the Claude series of models, has released Claude 3.5 Sonnet. It arrives at a time when most of us have accepted GPT-4o as the default best model for the majority of tasks such as reasoning and summarization. Anthropic makes the bold claim that its model sets the new “industry standard” for intelligence. It is also available for free on claude.ai if you wish to give it a spin. So, we got excited and wanted to test the model and compare it against GPT-4o.

This article starts with an overview of the features released with Claude 3.5 Sonnet and then tests it against GPT-4o on code generation, as well as logical and mathematical reasoning tasks.

Main Features

The model comes with three main features or novelties that lead Anthropic to claim it beats GPT-4o in most tasks:

Improved vision tasks. The model boasts state-of-the-art performance on 4 out of 5 vision tasks, as per Anthropic's published results.
2x speed. Compared to GPT-4o or its own predecessors like Claude Opus, Claude Sonnet boasts 2x generation speed.
Artifacts: a new UI for tasks like code generation and animation.

Let's dive deeper into the features and compare them with the long-reigning king of LLMs, GPT-4o.

Getting Started

To get started, we have to be logged into the claude.ai website. Artifacts is an experimental feature, so it has to be enabled manually: go to feature preview and enable Artifacts from there. Once enabled, the model shows a dedicated window on the side for tasks that need it, like coding or animations.

Vision Tasks: Visual Reasoning

To test the improved visual reasoning ability, we uploaded two plots to Claude 3.5 Sonnet and asked the question, “What can you make out from this data?”

Figure: Plots as images for testing visual reasoning

The response from Claude Sonnet was astounding. It precisely summarised deep learning progress, saying, “This data illustrates rapid progress in deep learning architectures and model scaling, showing a trend towards larger, more powerful models”. We got a similar response from GPT-4o as well.

So, to get a better understanding of which is better, we compared both models systematically on four tasks: coding, coding with UI, logical reasoning, and math reasoning.

Versus GPT-4o: Which is best?

Now that we have seen an overview, let's dive deep and take the model for a ride. Let's test code generation, logical reasoning, and mathematical reasoning.

Code Generation

For code generation, I asked both models to generate code for playing the well-known Sudoku game. I gave both models the exact same prompt, “write python code to play the sudoku game.” With this prompt, both Claude 3.5 and GPT-4o generated code that we can interact with only from the command prompt. This is expected, as we did not ask for UI code.

Some initial observations:

Both models churn out bug-free code.
Claude generates code with a feature to choose the difficulty level, but GPT-4o doesn’t!
On speed of code generation, Claude beats GPT-4o without a doubt.
GPT-4o tends to generate code with unnecessary packages.

Code Generation with UI

As interacting with the command prompt is not for everyone, I wanted the models to generate code with a UI. For this, I modified the prompt to “write code to play a sudoku game”. This time, I removed “python” from the prompt, as I felt that including it would nudge the model to produce just the backend code. As expected, Claude 3.5 did produce a functional UI this time. Though the UI was not completely robust or appealing, it was functional. GPT-4o, unfortunately, did not produce a similar UI. It still generated code that is played from the command prompt.

Puzzle 1: Logical Reasoning

For the first puzzle, I asked the below question:

Jane went to visit Jill. Jill is Jane’s only husband’s mother-in-law’s only husband’s only daughter’s only daughter. what relation is Jane to Jill?

Both models came up with a sequence of reasoning steps and answered the question correctly. So it is a tie between Claude 3.5 and GPT-4o in this case.

Puzzle 2: Logical Reasoning

For the second puzzle, I asked the below question:

Which of the words is least like the others. The difference has nothing to do with vowels, consonants or syllables. MORE, PAIRS, ETCHERS, ZIPPER

The two models followed different reasoning steps and arrived at different answers. Claude reasoned that zipper is the only word that can function as both a noun and a verb, while the others are either just nouns or adjectives, so it identified ZIPPER as the answer. GPT-4o, on the other hand, identified MORE, reasoning that it is not a concrete object or a specific type of person. All this indicates that the prompt needs to be more specific, so we call this one a tie.

Puzzle 3: Math reasoning

Let's move on to a well-known visual reasoning puzzle that can be solved with a formula. I gave the below figure along with the below prompt as input to both models.
The below 3 circles all have blue dots on their circumference which are connected by straight lines. The first circle has two blue dots separating it into two regions. Given a circle with 7 dots place anywhere on its circumference, what is the maximum number of regions the circle can be divided into?

In this case, GPT-4o came up with the correct answer of 57. Claude 3.5 came up with 64, which is incorrect. Both models gave logical reasoning steps for how they arrived at their answers. The formatting of the math formulas in GPT-4o is preferable to that of Claude 3.5.

Our Verdict

Based on our tests, we conclude that the winner in code generation tasks, be it pure backend code or GUI code, is Claude 3.5 Sonnet. It's a close tie on logical reasoning tasks. But when it comes to mathematical reasoning tasks, GPT-4o still leads the way and Claude is yet to catch up. In terms of generation speed, Claude is without doubt the winner, as it churns out text or code much faster than GPT-4o. Check out our video if you wish to compare the speed of text generation in real time.

Shout Out

If you liked this article, why not follow me on Twitter, where I share research updates from top AI labs every single day of the week? Also, please subscribe to my YouTube channel, where I explain AI concepts and papers visually.
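
A note on Puzzle 3: the maximum number of regions for n points on a circle has a known closed form (often called Moser's circle problem), C(n,4) + C(n,2) + 1, which gives 57 for n = 7. The short Python sketch below is not taken from either model's output, and the function name max_circle_regions is just an illustrative choice. It also prints the doubling pattern 2^(n-1), which matches the true count only up to five points and would give 64 at n = 7. That may be how an answer of 64 arises, though this is a guess and not something we verified from Claude's response.

```python
from math import comb


def max_circle_regions(n: int) -> int:
    """Maximum number of regions when n points on a circle are joined by all chords.

    Uses the closed form C(n, 4) + C(n, 2) + 1 (Moser's circle problem).
    """
    return comb(n, 4) + comb(n, 2) + 1


# Compare the true maximum with the tempting doubling pattern 2^(n-1).
for n in range(1, 8):
    print(n, max_circle_regions(n), 2 ** (n - 1))
```

Running it shows the two columns agree for n = 1 to 5 and first diverge at n = 6 (31 versus 32).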


Claude 3.5 Sonnet vs GPT-4o — An honest review
