The ongoing rivalry between two leading AI systems, ChatGPT and Gemini, has been brought into focus by recent benchmark evaluations. While both platforms continue to evolve, the results indicate that ChatGPT currently leads in several areas, most notably reasoning, problem-solving, and abstract thinking.
Benchmark tests are vital for comparing AI systems because they provide measurable insights into capabilities. One noteworthy benchmark is GPQA Diamond, which assesses PhD-level reasoning in subjects such as physics, chemistry, and biology. Its questions cannot be answered by simple recall; they require multi-step reasoning. According to the latest results, ChatGPT-5.2 scored 92.4%, slightly ahead of Gemini 3 Pro at 91.9%. For context, a typical PhD graduate is expected to score around 65%, while non-expert humans average 34%.
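To make those percentages concrete, a GPQA-style score is simply the share of graded questions the model answers correctly. The snippet below is a minimal, hypothetical sketch of that calculation in Python; it is not the official GPQA grading harness, and the `Item` structure and example data are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class Item:
    question_id: str
    model_answer: str    # the letter the model chose, e.g. "B"
    correct_answer: str  # the answer key, e.g. "B"

def accuracy(items: list[Item]) -> float:
    """Fraction of graded items where the model's choice matches the key."""
    if not items:
        return 0.0
    correct = sum(1 for it in items if it.model_answer == it.correct_answer)
    return correct / len(items)

# Hypothetical run: 3 of 4 items correct -> 75.0%
graded = [
    Item("q1", "A", "A"),
    Item("q2", "C", "C"),
    Item("q3", "B", "D"),
    Item("q4", "D", "D"),
]
print(f"{accuracy(graded):.1%}")  # 75.0%
```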
Moving to software engineering, the SWE-Bench Pro (Private Dataset) benchmark evaluates an AI’s ability to resolve real coding issues sourced from GitHub. ChatGPT-5.2 resolved approximately 24% of these challenges, while Gemini managed only around 18%. This benchmark is particularly rigorous: because its dataset is non-public, models cannot have seen the issues during training, and its tasks are harder than those in simpler coding assessments, where leading AIs typically resolve around 75% of issues.
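The figures here are resolution rates: a task counts as resolved only if the model’s proposed patch applies cleanly and the project’s tests for that issue then pass. The sketch below illustrates that pass/fail criterion under those assumptions; it is not the actual SWE-Bench Pro harness, and the repository path, patch file, and test command are placeholders.

```python
import subprocess
from pathlib import Path

def is_resolved(repo: Path, patch_file: Path, test_cmd: list[str]) -> bool:
    """SWE-Bench-style criterion (assumed): the patch must apply cleanly AND
    the issue's test suite must pass afterwards for the task to count."""
    applied = subprocess.run(["git", "-C", str(repo), "apply", str(patch_file)])
    if applied.returncode != 0:
        return False
    tests = subprocess.run(test_cmd, cwd=repo)
    return tests.returncode == 0

# Resolution rate over a hypothetical list of per-task outcomes.
outcomes = [True, False, False, True, False]
print(f"resolution rate: {sum(outcomes) / len(outcomes):.0%}")  # 40%
```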
The third significant benchmark is ARC-AGI-2, introduced in March 2025. Designed to assess an AI’s abstract reasoning abilities, it requires identifying patterns from only a handful of examples. ChatGPT-5.2 Pro achieved 54.2%, again ahead of Gemini: Gemini 3 Pro scored 31.1%, and even a refined version of Gemini reached only 54%.
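To illustrate what identifying patterns from limited examples means in practice: an ARC-style task presents a few input-to-output grids that demonstrate a hidden rule and asks the solver to apply that rule to a new grid, with credit given only for an exact match. The toy example below is invented for demonstration and is far simpler than real ARC-AGI-2 items; the mirrored-row rule and the stand-in solver are assumptions, not part of the benchmark.

```python
# Toy ARC-style task (invented): a few input -> output grid pairs demonstrate
# a hidden rule, and the test input must be transformed the same way.
# Grading is exact match on the output grid.
Grid = list[list[int]]

train_pairs: list[tuple[Grid, Grid]] = [
    ([[1, 0], [0, 2]], [[0, 1], [2, 0]]),  # hidden rule here: mirror each row
    ([[3, 4], [5, 0]], [[4, 3], [0, 5]]),
]
test_input: Grid = [[7, 0], [0, 9]]
expected_output: Grid = [[0, 7], [9, 0]]

def solve(grid: Grid) -> Grid:
    """Stand-in solver that hard-codes the mirrored-row rule."""
    return [list(reversed(row)) for row in grid]

print(solve(test_input) == expected_output)  # True -> task counts as solved
```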
These benchmarks reflect critical aspects of AI performance, highlighting ChatGPT’s strengths in reasoning and problem-solving. Although AI outputs can vary due to their stochastic nature, the consistency shown in these tests offers a clearer picture of capabilities compared to subjective comparisons based solely on user preference.
Despite ChatGPT’s impressive results in these benchmarks, it is worth noting that Gemini excels in other areas, such as the user-preference evaluations conducted on platforms like LMArena, where Gemini ranks higher than ChatGPT. This showcases the distinct strengths of each system.
As the AI landscape evolves rapidly, these benchmark results are subject to change with new releases from both OpenAI and Google. The ongoing competition will likely yield further advancements, but for now, the data indicates that ChatGPT holds a slight edge in critical reasoning and problem-solving tasks. This analysis reinforces the importance of benchmark testing in understanding the capabilities of AI systems and guiding users in their selections.
