Comparing LLMs' Coding Abilities Across Programming Languages

Written by alexsvetkin | Published 2026/03/08
Tech Story Tags: llms | leetcode | java | python | rust | elixir | oracle | hackernoon-top-story

TL;DR: LLMs solve algorithmic problems significantly worse in niche languages than in popular ones like Python and Java. The same pattern holds for SQL: MySQL outperforms Oracle SQL across all tested models.

In my previous benchmarks [1, 2], I showed that LLMs can successfully solve most Leetcode problems. However, they are better at solving well-known problems than novel ones. A likely explanation is training-data contamination: solutions to well-known problems tend to end up in training corpora (partially confirmed by recent OpenAI comments regarding SWE Bench [3]).

The original SWE Bench and SWE Bench Verified use Python. I mostly work in Python too, but occasionally also in Go, C#, JavaScript, Bash, and others. So I was naturally curious: how do LLM results vary across languages? My assumption was that models perform better with more popular languages, given the larger volume of publicly available code. As it turned out, that assumption is likely correct.

This aligns with findings from SWE-bench Multilingual, which observed similar performance drops on non-Python languages in real-world software engineering tasks. However, real-world issues involve additional complexity: tooling, libraries, pipelines, and so on. I wanted to verify the pattern using a cleaner setup. Leetcode problems isolate the language itself, since the underlying algorithms are largely language-agnostic. This is what makes the finding more surprising: even when the logic doesn't change, the language you write it in still affects whether the model gets it right.

Benchmark

As in my previous benchmarks, I used the Leetcode online judge to verify LLMs' ability to solve algorithmic problems. But this time, I experimented with four languages with different levels of popularity.

Languages

At the time of writing, Leetcode supports about 20 languages for algorithmic problems. Leetcode doesn't publish language stats explicitly, but users post their solutions, and the platform provides stats for those posted solutions, so I was able to estimate language popularity. The estimate is based on a few random problems, not the whole Leetcode database.

| Language | Published solutions, % |
|---|---|
| C++ | 26.21% |
| Java | 25.60% |
| Python3 | 17.81% |
| Python | 7.99% |
| JavaScript | 6.68% |
| C | 6.45% |
| Go | 2.17% |
| C# | 2.12% |
| TypeScript | 1.44% |
| Swift | 0.86% |
| Kotlin | 0.74% |
| Rust | 0.65% |
| PHP | 0.43% |
| Ruby | 0.36% |
| Dart | 0.25% |
| Scala | 0.16% |
| Elixir | 0.05% |
| Racket | 0.03% |

I picked four languages. Java and Python3 are two of the most popular. (Leetcode distinguishes between Python 3 and 2; the differences are minimal, and solutions for version 2 will almost always work for version 3.) Then I picked Rust, which has 50 times fewer published solutions, but whose popularity is rapidly rising in the engineering community, making it an interesting case. And finally, Elixir, a niche language with just a handful of solutions.

The popularity of these four on Leetcode correlates with the TIOBE index, though it does not match precisely.

| Language | TIOBE Ratings, % |
|---|---|
| Python | 21.8 |
| Java | 8.12 |
| Rust | 1.32 |
| Elixir | 0.19 |

Additionally, I looked up the number of public GitHub repos for those four:

| Language | GitHub Repos, Millions |
|---|---|
| Java | 20.20 |
| Python | 26.50 |
| Rust | 1.00 |
| Elixir | 0.12 |

In short, Java and Python3 represent the most common programming languages, with millions of public projects, and I expected LLMs to handle them very well. Elixir is on the opposite end of the scale, with orders of magnitude less available code, so LLMs' abilities may diminish with it. Rust sits somewhere in the middle: clearly popular, but can LLMs handle it well?

Problem Set

I picked 100 problems, published between Oct 2025 and Feb 2026.

| Easy | Medium | Hard | Total |
|---|---|---|---|
| 15 | 59 | 26 | 100 |

The intention was to get recent problems, likely "unseen" by LLMs. It is known that solutions to older, and especially popular, problems end up in the models' training sets.

Models

The models used in the benchmark are listed in the table below, with all non-default parameters specified. Release and knowledge cutoff dates are taken from the vendors' official documentation and provided for reference.

| Vendor | Model | Release date | Knowledge cutoff date | "Reasoning" | Parameters |
|---|---|---|---|---|---|
| Anthropic | claude-sonnet-4-5-20250929 | Sep 2025 | Jul 2025 | No | temperature = 0.0, max_tokens = 4096 |
| Google | gemini-3-flash-preview | Dec 2025 | unknown | Yes | temperature = 0.0 |
| Google | gemini-2.5-flash | Apr 2025 | unknown | Yes | temperature = 0.0 |
| xAI | grok-code-fast-1-0825 | Aug 2025 | unknown | Yes | seed = 42 |
| OpenAI | gpt-5-mini | Aug 2025 | May 2024 | Yes | seed = 42 |

All models, except Gemini 3 Flash (Preview), were released earlier than the oldest problem in the dataset (Oct 2025).

The benchmark aimed to be as deterministic and reproducible as possible; therefore, parameters such as "temperature" or "seed" were used. However, none of the models tested guarantee fully deterministic output. This should be kept in mind when reproducing these results.

All models support "reasoning" or "thinking" modes by default, except for Claude Sonnet 4.5. Other model features (or "tools") like web search were not enabled, even if supported.

Results

A problem is considered "accepted" or "solved" if the solution was accepted by the online judge. All other outcomes, like "wrong answer" or "time limit exceeded," are simply "not accepted" without any differentiation.

| Model | python3 | java | Δ vs. python3 | rust | Δ vs. python3 | elixir | Δ vs. python3 |
|---|---|---|---|---|---|---|---|
| claude-sonnet-4-5-20250929 | 50% | 52% | +2 | 51% | +1 | 35% | -15 |
| gemini-2.5-flash | 82% | 82% | +0 | 77% | -5 | 39% | -43 |
| gemini-3-flash-preview | 84% | 93% | +9 | 78% | -6 | 83% | -1 |
| gpt-5-mini | 93% | 94% | +1 | 80% | -13 | 63% | -30 |
| grok-code-fast-1-0825 | 73% | 65% | -8 | 65% | -8 | 30% | -43 |

The results show a clear drop for Elixir across most models. But are these differences statistically meaningful?

To assess whether differences in pass rates between languages are statistically significant, I used a two-proportion z-test. For two languages each tested on N=100 problems, the minimum detectable difference at p=0.05 is given by 1.96×√(2p̄(1-p̄)/N), where p̄ is the average acceptance rate across the two languages.

Taking Python as a baseline, the Python-Java and Python-Rust gaps are non-significant for all models (thresholds ~11.7pp and ~12.3pp, respectively).

The Python-Elixir gap, however, well exceeds its threshold of ~13.4pp for all models except Gemini 3 Flash Preview, indicating that they handle Elixir significantly worse.
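The thresholds above can be reproduced in a few lines of Python. This is a minimal sketch of the minimum-detectable-difference formula from the two-proportion z-test; the per-language averages are computed from the results table.

```python
import math

def min_detectable_diff(p_bar: float, n: int, z: float = 1.96) -> float:
    """Minimum detectable difference at p = 0.05 for a two-proportion z-test,
    with n problems per language and p_bar the average acceptance rate."""
    return z * math.sqrt(2 * p_bar * (1 - p_bar) / n)

# Average acceptance rates across the five models, from the results table.
python3 = 0.764  # mean of 50, 82, 84, 93, 73 (%)
java    = 0.772  # mean of 52, 82, 93, 94, 65 (%)
rust    = 0.702  # mean of 51, 77, 78, 80, 65 (%)
elixir  = 0.500  # mean of 35, 39, 83, 63, 30 (%)

for name, rate in [("java", java), ("rust", rust), ("elixir", elixir)]:
    p_bar = (python3 + rate) / 2
    print(f"python3 vs {name}: threshold = {min_detectable_diff(p_bar, 100):.1%}")
# python3 vs java: threshold = 11.7%
# python3 vs rust: threshold = 12.3%
# python3 vs elixir: threshold = 13.4%
```

Note that the threshold grows as p̄ moves toward 50%, which is why the Python-Elixir comparison has the widest threshold of the three.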

Database Problems

Interestingly, this pattern holds for SQL as well. I had a collection of 321 Leetcode database problems, published from 2015 to 2025.

Easy

Medium

Hard

Total

114

142

65

321

I used the same five LLMs as in the algorithmic benchmark, but with only two SQL dialects: MySQL and Oracle SQL. Though the two are mostly interchangeable, there are subtle differences.

For Oracle SQL, there are 15 times fewer published solutions on Leetcode than for MySQL. TIOBE and GitHub don't provide statistics for either, since they are query dialects rather than general-purpose programming languages.

Given that most problems predate the models' knowledge cutoff dates, contamination is possible and should be kept in mind when interpreting these results.

| Model | MySQL | Oracle SQL | Δ |
|---|---|---|---|
| claude-sonnet-4-5-20250929 | 87.5% | 76.3% | -11.2 |
| gemini-2.5-flash | 86.6% | 67.9% | -18.7 |
| gemini-3-flash-preview | 95.6% | 85.7% | -9.9 |
| gpt-5-mini | 89.1% | 79.4% | -9.7 |
| grok-code-fast-1-0825 | 80.4% | 66.7% | -13.7 |

With N=321 problems and average pass rates around 82%, the significance threshold is approximately 6 percentage points. That means every tested model shows a significantly higher acceptance rate for MySQL.
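As a sanity check, the same two-proportion z-test can be applied per model rather than on the aggregate. A minimal Python sketch, with the pass rates copied from the results table:

```python
import math

def threshold(p_bar: float, n: int, z: float = 1.96) -> float:
    """Minimum detectable difference at p = 0.05 for a two-proportion z-test."""
    return z * math.sqrt(2 * p_bar * (1 - p_bar) / n)

# (MySQL, Oracle SQL) pass rates per model, from the results table.
results = {
    "claude-sonnet-4-5-20250929": (0.875, 0.763),
    "gemini-2.5-flash":           (0.866, 0.679),
    "gemini-3-flash-preview":     (0.956, 0.857),
    "gpt-5-mini":                 (0.891, 0.794),
    "grok-code-fast-1-0825":      (0.804, 0.667),
}

for model, (mysql, oracle) in results.items():
    gap = mysql - oracle
    t = threshold((mysql + oracle) / 2, 321)
    print(f"{model}: gap = {gap:+.1%}, threshold = {t:.1%}, "
          f"significant = {gap > t}")
```

Running this shows the MySQL-Oracle gap exceeds the per-model threshold for every model, consistent with the aggregate result.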

Conclusion

We can see that LLM performance on coding problems correlates with language popularity. This is perhaps surprising: algorithmic problems are largely language-agnostic, so one might expect the underlying logic to transfer across languages. Yet the data shows otherwise: the language you write in matters, even when the algorithm itself does not change.

In Python and Java, the most widely used languages, models perform far better than in Elixir, a niche language. The same trend holds for SQL problems, where LLMs do better with MySQL than with Oracle SQL.

The most likely explanation is training data density: more popular languages generate more code examples, giving models more material to learn from.

The practical implication is straightforward: if you rely on LLMs for coding assistance, your language choice matters, potentially as much as your model choice. Working with uncommon languages means accepting meaningfully weaker AI support, though Gemini 3 Flash Preview is a notable exception, showing near-uniform results across all tested languages for algorithmic problems.

However, the exact shape of the popularity-performance relationship is unclear. Rust, despite having far fewer public repositories and published Leetcode solutions, showed no statistically significant difference.

Several directions would be worth exploring. First, expanding the problem set would allow the Rust finding to be confirmed or ruled out. Second, testing additional languages such as Scala, Dart, or Racket would help establish the popularity-performance relationship more precisely. And, as LLMs continue to evolve, it will be worth tracking whether the gap for niche languages narrows over time.

Dataset used for this benchmark:

https://huggingface.co/datasets/whiskwhite/leetcode-complete

Tool used for prompting and submitting solutions:

https://github.com/whisk/leetgptsolver


Written by alexsvetkin | Software Engineer and Team Lead
Published by HackerNoon on 2026/03/08