GPT-5.1 Crowned Best AI Model in Andrej Karpathy’s LLM Council Experiment

In a groundbreaking experiment that has captured the attention of the artificial intelligence community, Andrej Karpathy, a prominent AI researcher and founder of Eureka Labs, recently unveiled a novel evaluation method called the “LLM Council.” This innovative approach allows multiple language models to anonymously assess each other’s responses to user queries, creating a unique environment for peer review among AI systems. The results of this experiment have sparked discussions about the capabilities of leading AI models, particularly OpenAI’s GPT-5.1, which emerged as the top performer in this comparative analysis.

The LLM Council experiment is structured around a three-step process designed to facilitate unbiased evaluations among competing language models. First, a user query is dispatched to several models, including OpenAI’s GPT-5.1, Google’s Gemini 3.0, Anthropic’s Claude, and xAI’s Grok. Each model generates its response independently, so no answer is influenced by the others. In the second step, these responses are presented side by side to each model without revealing their authorship. This anonymity is crucial, as it encourages models to judge the quality of the responses based solely on their content rather than on preconceived notions about the models themselves.
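To make the flow concrete, here is a minimal sketch of the first two steps in Python. The `ask` helper and the model names are placeholders introduced for illustration, not the actual API or identifiers used in Karpathy’s implementation; they stand in for whatever client code would call each provider.

```python
import random

# Placeholder helper: sends a prompt to the named model through whatever
# client library you use and returns its text reply. Not a real API.
def ask(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up real model clients here")

# Illustrative council roster; these labels are stand-ins, not official model IDs.
COUNCIL = ["gpt-5.1", "gemini-3.0", "claude", "grok"]

def gather_responses(query: str) -> dict[str, str]:
    """Step 1: every council member answers the query independently."""
    return {model: ask(model, query) for model in COUNCIL}

def anonymize(responses: dict[str, str]) -> tuple[list[str], list[str]]:
    """Step 2 prep: shuffle the answers and strip author names so that
    reviewers judge content alone. Returns (shuffled texts, author order)."""
    items = list(responses.items())
    random.shuffle(items)
    authors = [model for model, _ in items]
    texts = [text for _, text in items]
    return texts, authors
```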

Once the models have ranked the responses based on criteria such as accuracy and insight, a “chairman model” takes on the role of synthesizing the feedback. This model combines the rankings and critiques from the participating models to produce a final consensus answer. The result is a collaborative output that reflects the collective judgment of the models involved, rather than a single entity’s perspective.
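Continuing the sketch above, the review and synthesis steps might look like the following. The wording of the ranking prompt and the choice of chairman are assumptions made for illustration rather than details drawn from Karpathy’s code.

```python
def review_prompt(query: str, texts: list[str]) -> str:
    """Step 2: ask a reviewer to rank the anonymized answers on accuracy and insight."""
    numbered = "\n\n".join(f"Response {i + 1}:\n{t}" for i, t in enumerate(texts))
    return (
        f"Question: {query}\n\n{numbered}\n\n"
        "Rank these responses from best to worst on accuracy and insight, "
        "with a brief justification for each."
    )

def council_round(query: str, chairman: str = "gpt-5.1") -> str:
    responses = gather_responses(query)      # step 1: independent answers
    texts, _authors = anonymize(responses)   # step 2: hide authorship
    reviews = [ask(model, review_prompt(query, texts)) for model in COUNCIL]

    # Step 3: the chairman model merges the candidate answers and the peer
    # reviews into a single consensus reply.
    synthesis = (
        f"Question: {query}\n\nCandidate answers:\n\n" + "\n\n".join(texts) +
        "\n\nPeer reviews:\n\n" + "\n\n".join(reviews) +
        "\n\nUsing the reviews and rankings, write one final consensus answer."
    )
    return ask(chairman, synthesis)
```

In practice the chairman could be any of the council members, or a separate model held out from the voting; the structure of the pipeline matters more than that specific choice.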

Karpathy’s findings from the LLM Council experiment were both surprising and enlightening. Despite recent benchmarks suggesting that Google’s Gemini 3.0 had overtaken OpenAI in overall capability and reasoning tests, GPT-5.1 consistently ranked highest across various queries. This outcome raises intriguing questions about how well standardized benchmarks align with evaluations conducted through peer review among the models themselves.

One of the most fascinating aspects of the experiment was the willingness of the models to recognize and praise each other’s strengths. Karpathy noted that the models often selected GPT-5.1’s responses as superior, highlighting their perceived depth and insightfulness. Conversely, Claude was frequently rated as the least effective model, with the reviewing models noting its tendency to give overly terse responses that lacked the nuance found in its competitors’ outputs.

However, Karpathy also provided a nuanced perspective on the results. While he acknowledged that GPT-5.1 was often lauded for its comprehensive answers, he personally found it to be somewhat verbose. He expressed a preference for the more concise and processed style of Gemini 3.0, indicating that subjective assessments of AI performance can vary significantly among human evaluators. This observation underscores the complexity of evaluating AI models, as different users may prioritize different qualities in responses.

The implications of the LLM Council experiment extend beyond mere rankings. It introduces a new paradigm for evaluating language models, one that emphasizes collaborative judgment and peer assessment over traditional benchmarking methods. This approach could pave the way for more sophisticated evaluations of AI systems, allowing researchers and developers to gain deeper insights into the strengths and weaknesses of various models.

Moreover, the experiment revealed an interesting phenomenon: when models were informed that a particular answer originated from GPT-5.1, they often adjusted their own responses accordingly. This behavior suggests a level of deference among AI models, where the recognition of a superior output leads to self-correction and refinement. Such dynamics raise important questions about the nature of collaboration and competition among AI systems, as well as the potential for collective intelligence in machine learning.

As the AI landscape continues to evolve, the findings from the LLM Council experiment highlight the importance of fostering environments where models can learn from one another. By encouraging models to engage in constructive criticism and collaborative evaluation, researchers can unlock new avenues for improving AI performance and understanding the intricacies of language processing.

In addition to the technical insights gained from the experiment, Karpathy’s work also serves as a reminder of the broader implications of AI development. As organizations like OpenAI and Google race to advance their models, the question of how we evaluate and compare these systems becomes increasingly critical. The LLM Council experiment offers a glimpse into a future where AI models are not only assessed based on their individual capabilities but also through their interactions with one another.

Furthermore, the experiment raises ethical considerations regarding the deployment of AI systems in real-world applications. If models can influence each other’s outputs, what does this mean for the integrity of information generated by AI? Ensuring that AI systems maintain high standards of accuracy and reliability will be paramount as they become more integrated into society.

As the discourse surrounding AI continues to grow, the LLM Council experiment stands out as a significant contribution to our understanding of language models. It challenges conventional wisdom about model performance and opens up new avenues for research and exploration. By embracing collaborative evaluation methods, the AI community can foster innovation and drive progress in ways that benefit both developers and users alike.

In conclusion, Andrej Karpathy’s LLM Council experiment represents a pivotal moment in the evaluation of language models. By allowing AI systems to anonymously judge each other’s responses, the experiment not only highlighted the strengths of OpenAI’s GPT-5.1 but also underscored the importance of collaborative assessment in the field of artificial intelligence. As we move forward, it will be essential to continue exploring innovative evaluation methods that reflect the complexities of AI interactions and the diverse needs of users. The future of AI is not just about building better models; it’s about creating systems that can learn from one another and adapt to the ever-changing landscape of human knowledge and communication.