OpenAI recently released GPT-5.2, which posts superior benchmark results. However, online chatter suggests that OpenAI may have used far more tokens and compute for the benchmark runs than ordinary users ever get, which some consider "cheating" on the tests. If everything else is equal, is GPT-5.2 actually on par with Gemini 3 Pro? Here we try to find out.

## The "Cheating" Controversy: Compute & Tokens

The core of the controversy lies in inference-time compute. "Cheating" in this context refers to OpenAI benchmarking with a configuration that is significantly more powerful (and more expensive) than what is available to standard users or what is typical for a "fair" comparison.

- **"xhigh" vs. "medium" effort:** Reports indicate that OpenAI's published benchmark results were generated using an "xhigh" reasoning-effort setting. This mode lets the model generate a massive number of internal "thought" tokens (reasoning steps) before producing an answer (see the API sketch below).
- **The issue:** Standard ChatGPT Plus users reportedly only have access to "medium" or "high" effort modes. The "xhigh" mode used for benchmarks consumes vastly more tokens and compute, effectively brute-forcing higher scores by letting the model "think" for much longer (sometimes 30-50 minutes for complex tasks) than a standard interaction allows.
- **Inference scaling:** This leverages the observation that letting a model generate more tokens at inference (test) time significantly improves performance. Critics argue that comparing GPT-5.2's "xhigh" scores against Gemini 3 Pro's standard outputs is misleading, because it pits a "maximum compute" scenario against a "standard usage" scenario.
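To make the effort distinction concrete, here is a minimal sketch of how a reasoning-effort level is typically selected through OpenAI's Python SDK (Responses API). The `gpt-5.2` model id and the `xhigh` effort value are taken from the reports cited below; the exact parameter shape and which effort levels a given account can actually request are assumptions, not verified details.

```python
# Minimal sketch (assumptions: "gpt-5.2" model id and "xhigh" effort value as
# reported; parameter shape follows OpenAI's current Responses API for
# reasoning models and may differ in practice).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(prompt: str, effort: str = "medium") -> str:
    """Send one prompt at a given reasoning-effort level and return the answer."""
    response = client.responses.create(
        model="gpt-5.2",               # hypothetical model id for illustration
        reasoning={"effort": effort},  # "medium", "high", or reportedly "xhigh"
        input=prompt,
    )
    # Higher effort means more hidden "thought" tokens before the visible
    # answer, i.e. more latency and more billed output tokens.
    print(response.usage)
    return response.output_text


# Same question, two very different compute budgets:
# ask("Solve this ARC-style grid puzzle: ...", effort="medium")
# ask("Solve this ARC-style grid puzzle: ...", effort="xhigh")
```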
## Benchmark Comparison (GPT-5.2 vs. Gemini 3 Pro)

When the massive compute boost is factored in, GPT-5.2 does post higher scores, but the gap narrows or reverses when conditions are scrutinized.

| Benchmark | GPT-5.2 (Thinking/Pro) | Gemini 3 Pro | Context |
| --- | --- | --- | --- |
| ARC-AGI-2 | 52.9% | ~31.1% | Measures abstract reasoning. GPT-5.2's score is heavily reliant on the "Thinking" process. |
| GPQA Diamond | 92.4% | 91.9% | Graduate-level science. The scores are effectively tied (within margin of error). |
| SWE-Bench Pro | 55.6% | N/A | Real-world software engineering. GPT-5.2 sets a new SOTA here. |
| SWE-Bench Verified | 80.0% | 76.2% | A more established coding benchmark. The models are roughly comparable here. |

- **Private benchmarks:** Some independent evaluations (e.g., restricted "private benchmarks" mentioned in discussions) suggest that Gemini 3 Pro actually outperforms GPT-5.2 in areas like creative writing, philosophy, and tool use once the "gaming" of public benchmarks is removed.

## Are They "On Par"?

Yes, and Gemini 3 Pro may even be superior in "base" capability. If "everything is equal" (meaning both models are restricted to the same amount of inference compute, i.e., thinking time), the general consensus implies they are highly comparable, with different strengths.

**Gemini 3 Pro advantages:**

- **Base intelligence:** Appears to have stronger fundamental capability in long-context understanding (a massive context window), theoretical reasoning, and creative tasks without needing excessive "thinking" time.
- **Cost efficiency:** For many tasks, it achieves similar results with less compute, and thus lower cost and latency (see the cost sketch below).

**GPT-5.2 advantages:**

- **Agentic workflow:** With the "Thinking" mode enabled (high compute), it excels at complex, multi-step agent and coding tasks (SWE-Bench). It is effectively "tuned" to use extra compute to solve harder problems.
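To illustrate why the cost and latency gap matters, here is a rough back-of-the-envelope sketch. The $1.75 per 1M input tokens figure comes from the Simon Willison write-up cited below; the output-token price and the per-query token counts are made-up placeholders. Since hidden reasoning tokens are billed as output tokens, a larger thinking budget multiplies the per-query cost.

```python
# Back-of-the-envelope cost comparison for one benchmark-style query.
# Assumptions: $1.75/1M input tokens (as reported); the output price and the
# token counts are illustrative placeholders, not published numbers.

def cost_per_query(input_tokens: int, output_tokens: int,
                   in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of a single request, given per-million-token prices."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000


# Hidden reasoning tokens count as output tokens, so a longer thinking budget
# raises the per-query cost even when the visible answer stays the same length.
medium = cost_per_query(2_000, 5_000, in_price_per_m=1.75, out_price_per_m=10.0)
xhigh = cost_per_query(2_000, 60_000, in_price_per_m=1.75, out_price_per_m=10.0)

print(f"medium effort: ~${medium:.3f} per query")
print(f"xhigh effort:  ~${xhigh:.3f} per query ({xhigh / medium:.0f}x more)")
```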
## Conclusions

The claim that they are "on par" is accurate. If you strip away the "xhigh" compute advantage OpenAI used in its benchmarks, Gemini 3 Pro is likely equal to or slightly ahead of GPT-5.2 in raw model intelligence. GPT-5.2's "superiority" on benchmarks largely comes from its ability to spend significantly more time and compute processing a single prompt.

Based on the verification performed, here is the compiled list of sources regarding the GPT-5.2 release, the Gemini 3 Pro comparison, and the associated benchmarking controversy.

## References

### 1. Official Release Announcements

- **OpenAI – System Card Update:** openai.com/index/gpt-5-system-card-update-gpt-5-2/
- **Google – The Gemini 3 Era:** blog.google/products/gemini/gemini-3/

### 2. Benchmark Performance & Technical Analysis

- **R&D World – Comparative Analysis**
  - Title: "How GPT-5.2 stacks up against Gemini 3.0 and Claude Opus 4.5"
  - Verified details: Validates the 52.9% score on ARC-AGI-2 (Thinking mode) vs. Gemini 3 Pro's ~31.1%, and confirms that GPT-5.2's lead in abstract reasoning is heavily tied to the "Thinking" process.
  - Source: rdworldonline.com/how-gpt-5-2-stacks-up
- **Vellum AI – Deep Dive**
  - Title: "GPT-5.2 Benchmarks"
  - Verified details: Verifies the 92.4% score on GPQA Diamond, noting it is effectively tied with Gemini 3 Pro's 91.9% (within the margin of error) but marketed as a "win" by OpenAI.
  - Source: vellum.ai/blog/gpt-5-2-benchmarks
- **Simon Willison's Weblog**
  - Title: "GPT-5.2"
  - Verified details: Technical breakdown of the API pricing ($1.75/1M input tokens) and the distinction between the "Instant" and "Thinking" API endpoints.
  - Source: simonwillison.net/2025/Dec/11/gpt-52/

### 3. The "Cheating" & Compute Controversy

- **Reddit (r/LocalLLaMA & r/Singularity)**
  - Threads: "GPT-5.2 Thinking evals" & "OpenAI drops GPT-5.2 'Code Red' vibes"
  - Verified details: These community discussions are the primary source of the "cheating" allegations. Users identified that OpenAI's benchmarks used the "xhigh" (extra-high) reasoning effort, a setting that uses significantly more tokens and time than the "medium" or "high" settings available to standard users or used in Gemini's standard benchmark runs.
  - Sources: reddit.com/r/singularity/comments/1pk4t5z/gpt52_thinking_evals/ and reddit.com/r/ChatGPTCoding/comments/1pkq4mc/
- **InfoQ News**
  - Title: "OpenAI's New GPT-5.1 Models are Faster and More Conversational" (contextual coverage including 5.2)
  - Verified details: Discusses the introduction of the "xhigh" reasoning-effort level and the trade-offs between benchmark scores and actual user latency/cost.
  - Source: infoq.com/news/2025/12/openai-gpt-51/