In my previous benchmarks [1, 2], I showed that LLMs can successfully solve most Leetcode problems. However, they are better at well-known problems than at novel ones. This can be explained by training-data contamination: solutions to well-known problems are likely included in training data (partially confirmed by recent OpenAI comments regarding SWE Bench [3]).

The original SWE Bench and SWE Bench Verified use Python. I also use Python, but additionally Go, C#, JavaScript, Bash, and others occasionally. So I was naturally interested: how do LLM results vary across languages? My assumption was that models perform better with more popular languages, given the larger volume of publicly available code. That assumption turned out to be likely correct. This aligns with findings from SWE-bench Multilingual, which observed similar performance drops on non-Python languages in real-world software engineering tasks.

However, real-world issues involve additional complexity: tooling, libraries, pipelines, and so on. I wanted to verify the pattern using a cleaner setup. Leetcode problems isolate the language itself, since the underlying algorithms are largely language-agnostic. This is what makes the finding more surprising: even when the logic doesn't change, the language you write it in still affects whether the model gets it right.

Benchmark

As in my previous benchmarks, I used the Leetcode online judge to verify LLM skills at solving algorithmic problems. This time, I experimented with four different languages, with different levels of popularity.

Languages

There are about 20 languages supported by Leetcode for algorithmic problems at the time of writing. Leetcode doesn't provide language stats explicitly, but users post their solutions, and the platform provides stats for those posted solutions, so I was able to derive language popularity.
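Conceptually, the derivation is just a frequency count over the language tags of sampled posted solutions. A minimal sketch of that aggregation (the solution list here is hypothetical; the real shares come from scraped Leetcode per-problem solution stats):

```python
from collections import Counter

# Hypothetical sample: language tags of solutions posted under a few problems.
# In the real benchmark, these tags come from Leetcode solution listings.
posted_solutions = ["C++", "Java", "Python3", "C++", "Java", "Rust", "C++", "Python3"]

counts = Counter(posted_solutions)
total = sum(counts.values())

# Share of published solutions per language, in percent.
shares = {lang: round(100 * n / total, 2) for lang, n in counts.most_common()}
print(shares)  # {'C++': 37.5, 'Java': 25.0, 'Python3': 25.0, 'Rust': 12.5}
```

Applied to a sample of real problems, this yields the percentages in the table below.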
The sample below is based on a few random problems, not the whole Leetcode database.

| Language | Published solutions, % |
|---|---|
| C++ | 26.21% |
| Java | 25.60% |
| Python3 | 17.81% |
| Python | 7.99% |
| JavaScript | 6.68% |
| C | 6.45% |
| Go | 2.17% |
| C# | 2.12% |
| TypeScript | 1.44% |
| Swift | 0.86% |
| Kotlin | 0.74% |
| Rust | 0.65% |
| PHP | 0.43% |
| Ruby | 0.36% |
| Dart | 0.25% |
| Scala | 0.16% |
| Elixir | 0.05% |
| Racket | 0.03% |

I picked four languages. Java and Python3, as two of the most popular. (Leetcode distinguishes between Python 3 and 2; the differences between them are minimal, and solutions for version 2 will almost always work for version 3.) Then Rust, which has 50 times fewer published solutions, but whose popularity is rapidly rising in the engineering community, making it an interesting case. And finally Elixir, a niche language with just a handful of solutions. The popularity of these four on Leetcode correlates with the TIOBE index, though it does not match precisely.
| Language | TIOBE Rating, % |
|---|---|
| Python | 21.8 |
| Java | 8.12 |
| Rust | 1.32 |
| Elixir | 0.19 |

Additionally, I looked up the number of public GitHub repos for those four:

| Language | GitHub Repos, Millions |
|---|---|
| Java | 20.20 |
| Python | 26.50 |
| Rust | 1.00 |
| Elixir | 0.12 |

In short, Java and Python3 represent the most common programming languages, with millions of public projects, and I expected LLMs to handle them very well. Elixir is at the opposite end of the scale, with orders of magnitude less available code, so LLMs' abilities may diminish with it. Rust is somewhere in the middle: clearly popular, but can LLMs handle it well?

Problem Set

I picked 100 problems, published between Oct 2025 and Feb 2026.

| Easy | Medium | Hard | Total |
|---|---|---|---|
| 15 | 59 | 26 | 100 |

The intention was to get recent problems, likely "unseen" by LLMs. It is known that solutions for older, and especially popular, problems get into models' training sets.

Models

The models used in the benchmark are listed in the table below, with all non-default parameters specified. Release and knowledge-cutoff dates are taken from the vendors' official documentation and provided for reference.
| Vendor | Model | Release date | Knowledge cutoff date | "Reasoning" | Parameters |
|---|---|---|---|---|---|
| Anthropic | claude-sonnet-4-5-20250929 | Sep 2025 | Jul 2025 | No | temperature = 0.0, max_tokens = 4096 |
| Google | gemini-3-flash-preview | Dec 2025 | unknown | Yes | temperature = 0.0 |
| Google | gemini-2.5-flash | Apr 2025 | unknown | Yes | temperature = 0.0 |
| xAI | grok-code-fast-1-0825 | Aug 2025 | unknown | Yes | seed = 42 |
| OpenAI | gpt-5-mini | Aug 2025 | May 2024 | Yes | seed = 42 |

All models except Gemini 3 Flash (Preview) were released earlier than the oldest problem in the dataset (Oct 2025).

The benchmark aimed to be as deterministic and reproducible as possible; therefore, parameters such as temperature and seed were used. However, none of the tested models guarantees fully deterministic output, which should be kept in mind when reproducing these results.

All models support "reasoning" or "thinking" modes by default, except Claude Sonnet 4.5. Other model features ("tools") such as web search were not enabled, even where supported.

Results

A problem is considered "accepted" or "solved" if the solution was accepted by the online judge. All other outcomes, such as "wrong answer" or "time limit exceeded," are simply "not accepted," without further differentiation.

| Model | python3 | java | 𝝙 vs python3 | rust | 𝝙 vs python3 | elixir | 𝝙 vs python3 |
|---|---|---|---|---|---|---|---|
| claude-sonnet-4-5-20250929 | 50% | 52% | +2 | 51% | +1 | 35% | -15 |
| gemini-2.5-flash | 82% | 82% | +0 | 77% | -5 | 39% | -43 |
| gemini-3-flash-preview | 84% | 93% | +9 | 78% | -6 | 83% | -1 |
| gpt-5-mini | 93% | 94% | +1 | 80% | -13 | 63% | -30 |
| grok-code-fast-1-0825 | 73% | 65% | -8 | 65% | -8 | 30% | -43 |
The results show a clear drop for Elixir across most models. But are these differences statistically meaningful?

To assess whether differences in pass rates between languages are statistically significant, I used a two-proportion z-test. For two languages each tested on N = 100 problems, the minimum detectable difference at p = 0.05 is given by 1.96 × √(2p̄(1−p̄)/N), where p̄ is the average acceptance rate across the two languages. Taking Python as a baseline, the Python-Java and Python-Rust gaps are non-significant for all models (thresholds of roughly 11.7 and 12.3 percentage points, respectively). The Python-Elixir gap, however, well exceeds its threshold of roughly 13.4 percentage points for all models except Gemini 3 Flash Preview, indicating that they handle Elixir significantly worse.

Database Problems

Interestingly, this pattern holds for SQL as well. I had a collection of 321 Leetcode database problems, published from 2015 to 2025.

| Easy | Medium | Hard | Total |
|---|---|---|---|
| 114 | 142 | 65 | 321 |

I used the same five LLMs as in the algorithmic benchmark, but with only two languages: MySQL and Oracle SQL. Though those two implementations are mostly interchangeable, there are subtle differences.
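The significance threshold from the two-proportion z-test described above is easy to compute directly. A short Python sketch (the helper name is mine, not part of the benchmark tooling):

```python
import math

def min_detectable_diff(p_bar: float, n: int, z: float = 1.96) -> float:
    """Minimum detectable difference between two proportions at p = 0.05,
    for two samples of size n with pooled average pass rate p_bar."""
    return z * math.sqrt(2 * p_bar * (1 - p_bar) / n)

# Algorithmic benchmark: N = 100 problems per language;
# pooled Python/Elixir average pass rate across models is ~0.632.
print(round(min_detectable_diff(0.632, 100) * 100, 1))  # 13.4 (pp)

# Database benchmark: N = 321 problems, average pass rate ~0.82.
print(round(min_detectable_diff(0.82, 321) * 100, 1))   # 5.9 (pp)
```

Gaps of 30-43 percentage points, as observed for Elixir with several models, are therefore far beyond what sampling noise on 100 problems can explain.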
For Oracle SQL, there are 15 times fewer published solutions on Leetcode than for MySQL. TIOBE and GitHub don't provide statistics for these two, since they are SQL dialects rather than standalone programming languages. Given that most problems predate the models' knowledge-cutoff dates, contamination is possible and should be kept in mind when interpreting these results.

| Model | MySQL | Oracle SQL | 𝝙 |
|---|---|---|---|
| claude-sonnet-4-5-20250929 | 87.5% | 76.3% | -11.2 |
| gemini-2.5-flash | 86.6% | 67.9% | -18.7 |
| gemini-3-flash-preview | 95.6% | 85.7% | -9.9 |
| gpt-5-mini | 89.1% | 79.4% | -9.7 |
| grok-code-fast-1-0825 | 80.4% | 66.7% | -13.7 |

With N = 321 problems and average pass rates around 82%, the significance threshold is approximately 6 percentage points, which means every tested model shows a significantly higher acceptance rate for MySQL.

Conclusion

We can see that LLM performance on coding problems correlates with language popularity.
This is perhaps surprising: algorithmic problems are largely language-agnostic, so one might expect the underlying logic to transfer across languages. Yet the data shows otherwise: the language you write in matters, even when the algorithm itself does not change. Models perform better in Python and Java, the most widely used languages, than in Elixir, a niche one. The same trend holds for SQL problems, where LLMs do better in MySQL than in Oracle SQL. The most likely explanation is training-data density: more popular languages generate more code examples, giving models more material to learn from.

The practical implication is straightforward: if you rely on LLMs for coding assistance, your language choice matters, potentially as much as your model choice. Working with uncommon languages means accepting meaningfully weaker AI support, though Gemini 3 Flash Preview is a notable exception, showing near-uniform results across all tested languages for algorithmic problems.

However, the exact shape of the popularity-performance relationship is not clear. Rust, despite having far fewer public repositories and published Leetcode solutions, showed no statistically significant difference.

Several directions would be worth exploring. First, expanding the problem set would allow the Rust finding to be confirmed or ruled out. Second, testing additional languages such as Scala, Dart, or Racket would help establish the popularity-performance relationship more precisely. And, as LLMs continue to evolve, it will be worth tracking whether the gap for niche languages narrows over time.

Links

Dataset used for this benchmark: https://huggingface.co/datasets/whiskwhite/leetcode-complete

Tool used for prompting and submitting solutions: https://github.com/whisk/leetgptsolver