In my previous benchmarks [1, 2], I showed that LLMs can successfully solve most Leetcode problems. However, they are better at well-known problems than at novel ones. This can be explained by training-data contamination: solutions to well-known problems are likely included in training data (partially confirmed by recent OpenAI comments regarding SWE Bench [3]).

The original SWE Bench and SWE Bench Verified use Python. I also use Python, but additionally Go, C#, JavaScript, Bash, and others occasionally. So I was naturally interested: how do LLM results vary across languages? My assumption was that models perform better with more popular languages, given the larger volume of publicly available code. That assumption turned out to be likely correct. This aligns with findings from SWE-bench Multilingual, which observed similar performance drops on non-Python languages in real-world software engineering tasks.

However, real-world issues involve additional complexity: tooling, libraries, pipelines, and so on. I wanted to verify the pattern using a cleaner setup. Leetcode problems isolate the language itself, since the underlying algorithms are largely language-agnostic. This is what makes the finding more surprising: even when the logic doesn't change, the language you write it in still affects whether the model gets it right.

Benchmark

As in my previous benchmarks, I used the Leetcode online judge to verify LLM skills at solving algorithmic problems. This time, I experimented with four different languages, with different levels of popularity.

Languages

There are about 20 languages supported by Leetcode for algorithmic problems at the time of writing. Leetcode doesn't provide language stats explicitly, but users post their solutions, and the platform provides stats for those posted solutions, so I was able to derive language popularity.
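Conceptually, the derivation is just a frequency count over the language tags of sampled posted solutions. A minimal sketch of that aggregation (the solution list here is hypothetical; the real shares come from scraped Leetcode per-problem solution stats):

```python
from collections import Counter

# Hypothetical sample: language tags of solutions posted under a few problems.
# In the real benchmark, these tags come from Leetcode solution listings.
posted_solutions = ["C++", "Java", "Python3", "C++", "Java", "Rust", "C++", "Python3"]

counts = Counter(posted_solutions)
total = sum(counts.values())

# Share of published solutions per language, in percent.
shares = {lang: round(100 * n / total, 2) for lang, n in counts.most_common()}
print(shares)  # {'C++': 37.5, 'Java': 25.0, 'Python3': 25.0, 'Rust': 12.5}
```

Applied to a sample of real problems, this yields the percentages in the table below.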
The sample below is based on a few random problems, not the whole Leetcode database.

| Language | Published solutions, % |
|---|---|
| C++ | 26.21% |
| Java | 25.60% |
| Python3 | 17.81% |
| Python | 7.99% |
| JavaScript | 6.68% |
| C | 6.45% |
| Go | 2.17% |
| C# | 2.12% |
| TypeScript | 1.44% |
| Swift | 0.86% |
| Kotlin | 0.74% |
| Rust | 0.65% |
| PHP | 0.43% |
| Ruby | 0.36% |
| Dart | 0.25% |
| Scala | 0.16% |
| Elixir | 0.05% |
| Racket | 0.03% |

I picked four languages. Java and Python3, as two of the most popular. (Leetcode distinguishes between Python 3 and 2; the differences between them are minimal, and solutions for version 2 will almost always work for version 3.) Then Rust, which has 50 times fewer published solutions, but whose popularity is rapidly rising in the engineering community, making it an interesting case. And finally Elixir, a niche language with just a handful of solutions. The popularity of these four on Leetcode correlates with the TIOBE index, though it does not match precisely.
| Language | TIOBE Rating, % |
|---|---|
| Python | 21.8 |
| Java | 8.12 |
| Rust | 1.32 |
| Elixir | 0.19 |

Additionally, I looked up the number of public GitHub repos for those four:

| Language | GitHub Repos, Millions |
|---|---|
| Java | 20.20 |
| Python | 26.50 |
| Rust | 1.00 |
| Elixir | 0.12 |

In short, Java and Python3 represent the most common programming languages, with millions of public projects, and I expected LLMs to handle them very well. Elixir is at the opposite end of the scale, with orders of magnitude less available code, so LLMs' abilities may diminish with it. Rust is somewhere in the middle: clearly popular, but can LLMs handle it well?

Problem Set

I picked 100 problems, published between Oct 2025 and Feb 2026.

| Easy | Medium | Hard | Total |
|---|---|---|---|
| 15 | 59 | 26 | 100 |

The intention was to get recent problems, likely "unseen" by LLMs. It is known that solutions for older, and especially popular, problems get into models' training sets.

Models

The models used in the benchmark are listed in the table below, with all non-default parameters specified. Release and knowledge-cutoff dates are taken from the vendors' official documentation and provided for reference.
| Vendor | Model | Release date | Knowledge cutoff date | "Reasoning" | Parameters |
|---|---|---|---|---|---|
| Anthropic | claude-sonnet-4-5-20250929 | Sep 2025 | Jul 2025 | No | temperature = 0.0, max_tokens = 4096 |
| Google | gemini-3-flash-preview | Dec 2025 | unknown | Yes | temperature = 0.0 |
| Google | gemini-2.5-flash | Apr 2025 | unknown | Yes | temperature = 0.0 |
| xAI | grok-code-fast-1-0825 | Aug 2025 | unknown | Yes | seed = 42 |
| OpenAI | gpt-5-mini | Aug 2025 | May 2024 | Yes | seed = 42 |

All models except Gemini 3 Flash (Preview) were released earlier than the oldest problem in the dataset (Oct 2025).

The benchmark aimed to be as deterministic and reproducible as possible; therefore, parameters such as temperature and seed were used. However, none of the tested models guarantees fully deterministic output, which should be kept in mind when reproducing these results.

All models support "reasoning" or "thinking" modes by default, except Claude Sonnet 4.5. Other model features ("tools") such as web search were not enabled, even where supported.

Results

A problem is considered "accepted" or "solved" if the solution was accepted by the online judge. All other outcomes, such as "wrong answer" or "time limit exceeded," are simply "not accepted," without further differentiation.

| Model | python3 | java | 𝝙 vs python3 | rust | 𝝙 vs python3 | elixir | 𝝙 vs python3 |
|---|---|---|---|---|---|---|---|
| claude-sonnet-4-5-20250929 | 50% | 52% | +2 | 51% | +1 | 35% | -15 |
| gemini-2.5-flash | 82% | 82% | +0 | 77% | -5 | 39% | -43 |
| gemini-3-flash-preview | 84% | 93% | +9 | 78% | -6 | 83% | -1 |
| gpt-5-mini | 93% | 94% | +1 | 80% | -13 | 63% | -30 |
| grok-code-fast-1-0825 | 73% | 65% | -8 | 65% | -8 | 30% | -43 |
The results show a clear drop for Elixir across most models. But are these differences statistically meaningful?

To assess whether differences in pass rates between languages are statistically significant, I used a two-proportion z-test. For two languages each tested on N = 100 problems, the minimum detectable difference at p = 0.05 is given by 1.96 × √(2p̄(1−p̄)/N), where p̄ is the average acceptance rate across the two languages. Taking Python as a baseline, the Python-Java and Python-Rust gaps are non-significant for all models (thresholds of roughly 11.7 and 12.3 percentage points, respectively). The Python-Elixir gap, however, well exceeds its threshold of roughly 13.4 percentage points for all models except Gemini 3 Flash Preview, indicating that they handle Elixir significantly worse.

Database Problems

Interestingly, this pattern holds for SQL as well. I had a collection of 321 Leetcode database problems, published from 2015 to 2025.

| Easy | Medium | Hard | Total |
|---|---|---|---|
| 114 | 142 | 65 | 321 |

I used the same five LLMs as in the algorithmic benchmark, but with only two languages: MySQL and Oracle SQL. Though those two implementations are mostly interchangeable, there are subtle differences.
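The significance threshold from the two-proportion z-test described above is easy to compute directly. A short Python sketch (the helper name is mine, not part of the benchmark tooling):

```python
import math

def min_detectable_diff(p_bar: float, n: int, z: float = 1.96) -> float:
    """Minimum detectable difference between two proportions at p = 0.05,
    for two samples of size n with pooled average pass rate p_bar."""
    return z * math.sqrt(2 * p_bar * (1 - p_bar) / n)

# Algorithmic benchmark: N = 100 problems per language;
# pooled Python/Elixir average pass rate across models is ~0.632.
print(round(min_detectable_diff(0.632, 100) * 100, 1))  # 13.4 (pp)

# Database benchmark: N = 321 problems, average pass rate ~0.82.
print(round(min_detectable_diff(0.82, 321) * 100, 1))   # 5.9 (pp)
```

Gaps of 30-43 percentage points, as observed for Elixir with several models, are therefore far beyond what sampling noise on 100 problems can explain.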
For Oracle SQL, there are 15 times fewer published solutions on Leetcode than for MySQL. TIOBE and GitHub don't provide statistics for these two, since they are SQL dialects rather than standalone programming languages. Given that most problems predate the models' knowledge-cutoff dates, contamination is possible and should be kept in mind when interpreting these results.

| Model | MySQL | Oracle SQL | 𝝙 |
|---|---|---|---|
| claude-sonnet-4-5-20250929 | 87.5% | 76.3% | -11.2 |
| gemini-2.5-flash | 86.6% | 67.9% | -18.7 |
| gemini-3-flash-preview | 95.6% | 85.7% | -9.9 |
| gpt-5-mini | 89.1% | 79.4% | -9.7 |
| grok-code-fast-1-0825 | 80.4% | 66.7% | -13.7 |

With N = 321 problems and average pass rates around 82%, the significance threshold is approximately 6 percentage points, which means every tested model shows a significantly higher acceptance rate for MySQL.

Conclusion

We can see that LLM performance on coding problems correlates with language popularity.
This is perhaps surprising: algorithmic problems are largely language-agnostic, so one might expect the underlying logic to transfer across languages. Yet the data shows otherwise: the language you write in matters, even when the algorithm itself does not change. Models perform better in Python and Java, the most widely used languages, than in Elixir, a niche one. The same trend holds for SQL problems, where LLMs do better in MySQL than in Oracle SQL. The most likely explanation is training-data density: more popular languages generate more code examples, giving models more material to learn from.

The practical implication is straightforward: if you rely on LLMs for coding assistance, your language choice matters, potentially as much as your model choice. Working with uncommon languages means accepting meaningfully weaker AI support, though Gemini 3 Flash Preview is a notable exception, showing near-uniform results across all tested languages for algorithmic problems.

However, the exact shape of the popularity-performance relationship is not clear. Rust, despite having far fewer public repositories and published Leetcode solutions, showed no statistically significant difference.

Several directions would be worth exploring. First, expanding the problem set would allow the Rust finding to be confirmed or ruled out. Second, testing additional languages such as Scala, Dart, or Racket would help establish the popularity-performance relationship more precisely. And, as LLMs continue to evolve, it will be worth tracking whether the gap for niche languages narrows over time.

Links

Dataset used for this benchmark: https://huggingface.co/datasets/whiskwhite/leetcode-complete

Tool used for prompting and submitting solutions: https://github.com/whisk/leetgptsolver