Scientists reveal the alien logic of AI: hyper-rational but stumped by simple concepts
A new study suggests that artificial intelligence systems approach strategic decision-making far closer to mathematical optimization than human players do, often outperforming humans in games that require iterative reasoning. While these large language models can adapt to complex rules and specific competitive scenarios, they differ fundamentally from human cognition in that they fail to identify certain logical shortcuts known as dominant strategies. The findings appear in the Journal of Economic Behavior & Organization.
Large language models are advanced artificial intelligence systems designed to process and generate text based on vast datasets. These models are increasingly integrated into economic workflows, ranging from market analysis to automated negotiation agents. As these tools become more prevalent in settings that involve social interaction and competition, it becomes necessary to understand how their decision-making processes compare to human behavior.
Previous psychological and economic research indicates that humans often rely on bounded rationality, meaning their strategic thinking is limited by cognitive capacity and time. Iuliia Alekseenko, Dmitry Dagaev, Sofiia Paklina, and Petr Parshakov conducted this study to determine if artificial intelligence mirrors these human limitations or operates with a distinct form of logic. The authors are affiliated with HSE University, the University of Lausanne, and the New Economic School.
“This study was motivated by a growing debate about whether large language models can meaningfully serve as substitutes for human decision-makers in economic and behavioral research. While recent work has shown that LLMs can replicate outcomes in some classic experiments, it remains unclear how they reason strategically and whether their behavior truly resembles human bounded rationality,” the researchers told PsyPost.
“We focused on the beauty contest game because it is one of the most extensively studied tools for measuring strategic thinking and iterative reasoning in humans, with decades of experimental evidence across different populations and settings. This made it an ideal benchmark for a direct comparison between human behavior and AI-generated decisions.”
“More broadly, we were motivated by a real-world concern: AI systems are increasingly used in strategic environments such as markets, forecasting, and negotiation. Understanding whether AI models reason like humans, better than humans, or simply differently is crucial for predicting how they may influence outcomes when interacting with people.”
The researchers utilized a classic game theory experiment known as the “beauty contest” or “Guess the Number” game. In this game, participants simultaneously choose an integer between 0 and 100. The winner is the player whose chosen number is closest to a specific fraction of the average of all chosen numbers.
A common version sets the target at two-thirds of the average. If all players chose numbers randomly, the average would be 50 and the target would be about 33. A sophisticated player anticipates this and chooses 33. If all players are equally sophisticated, they will all choose 33, making the new target about 22. This reasoning process repeats iteratively until it reaches 0, the theoretical Nash equilibrium.
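As a rough illustration of that chain of reasoning (and not code from the study itself), each additional step of thinking multiplies the previous guess by the target fraction, so guesses shrink toward zero:

```python
# Illustrative sketch: level-k reasoning in the two-thirds-of-the-average
# beauty contest. A level-0 player guesses 50 (the midpoint); each higher
# level best-responds to the level below by multiplying by the fraction.
def level_k_guess(k, fraction=2/3, level0=50.0):
    guess = level0
    for _ in range(k):
        guess *= fraction
    return guess

for k in range(6):
    print(f"level {k}: guess ~ {level_k_guess(k):.1f}")
# level 0: 50.0, level 1: 33.3, level 2: 22.2, ... approaching the
# Nash equilibrium of 0 as the depth of reasoning grows.
```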
To test the capabilities of artificial intelligence, the authors employed five prominent large language models: GPT-4o, GPT-4o-Mini, Gemini-2.5-flash, Claude-Sonnet-4, and Llama-4-Maverick. The researchers replicated sixteen distinct scenarios from classic behavioral economics papers. These scenarios varied the number of players, the target fraction, and the aggregation method used to determine the winner.
The study gathered 50 responses from each model for every scenario to ensure statistical reliability. The temperature parameter for the models was fixed at 1.0 to allow for variability similar to a diverse group of human participants.
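The paper does not reproduce its querying code, but a collection loop of this kind might look something like the sketch below, which assumes the OpenAI Python client and a simplified, hypothetical prompt; the other models would be queried through their own interfaces in the same way.

```python
# Hypothetical sketch of the data-collection setup described above:
# 50 independent completions per scenario at temperature 1.0.
# The prompt wording here is illustrative, not the study's actual text.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are playing a game against other players. Each player picks an "
    "integer between 0 and 100. The winner is whoever is closest to "
    "two-thirds of the average of all chosen numbers. What number do you "
    "choose? Answer with a single integer."
)

guesses = []
for _ in range(50):  # 50 responses per scenario, as in the study
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=1.0,  # fixed at 1.0 to allow human-like variability
        messages=[{"role": "user", "content": PROMPT}],
    )
    guesses.append(response.choices[0].message.content)
```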
The study first replicated an experiment originally conducted by Rosemarie Nagel in 1995. The artificial agents played a version of the game where the target was either one-half or two-thirds of the average. In the scenario where the target was one-half, human participants typically chose numbers averaging around 27.
The artificial intelligence models consistently chose lower numbers. For example, the Llama model averaged a guess of 2.00, while Claude Sonnet averaged 12.72. This pattern persisted in the two-thirds variation. While humans averaged 36.73, the models provided mean guesses ranging from 2.80 to 22.24. This suggests that the models engaged in more steps of iterative reasoning than the average human participant.
The researchers also replicated a study by Duffy and Nagel from 1997 to see how the models handled different winning criteria. In this set of experiments, the winner was determined by being closest to one-half of the median, mean, or maximum of the chosen numbers. Human players tend to choose higher numbers when the target is based on the maximum.
The large language models successfully replicated this comparative static. When the target function changed to the maximum, models like Claude Sonnet and GPT-4o shifted their guesses upward significantly. This indicates that the models are capable of recognizing how changes in the rules should theoretically impact the optimal strategy.
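For a concrete, purely illustrative example of how the aggregation rule moves the target, consider one hypothetical set of guesses scored three different ways:

```python
# Illustrative only: how the winning target changes with the aggregation
# rule, for one made-up set of guesses and a target fraction of 1/2.
from statistics import mean, median

guesses = [10, 20, 30, 40, 80]
fraction = 0.5

for name, agg in [("mean", mean), ("median", median), ("max", max)]:
    print(f"{name}: target = {fraction * agg(guesses)}")
# mean:   0.5 * 36 = 18.0
# median: 0.5 * 30 = 15.0
# max:    0.5 * 80 = 40.0  -> a maximum-based target rewards higher guesses
```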
A separate set of experiments focused on two-player games, initially studied by Grosskopf and Nagel in 2008. In a two-player game where the target is two-thirds of the average, choosing 0 is a weakly dominant strategy. This means that choosing 0 is never worse than any other option and is sometimes strictly better.
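A brute-force check, offered here as an illustrative sketch rather than the authors' own analysis, makes the dominance argument concrete: guessing 0 never loses, while any other guess can.

```python
# Illustrative check that guessing 0 is weakly dominant in the two-player
# game with a target of two-thirds of the average: it never does worse
# than any alternative, and it strictly beats any positive guess when the
# opponent also guesses 0.
def outcome(x, y, fraction=2/3):
    """+1 if x beats y, 0 for a tie, -1 if x loses."""
    target = fraction * (x + y) / 2
    dx, dy = abs(x - target), abs(y - target)
    return (dx < dy) - (dx > dy)

for y in range(101):            # every possible opponent guess
    for x in range(1, 101):     # every alternative to guessing 0
        assert outcome(0, y) >= outcome(x, y)   # 0 is never worse
for x in range(1, 101):
    assert outcome(0, 0) > outcome(x, 0)        # and strictly better vs y = 0

print("Guessing 0 never loses; every other guess can lose.")
```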
Despite this mathematical certainty, the models failed to identify the dominant strategy explicitly. The researchers analyzed the reasoning text generated by the models and found no instances where a model correctly explained the concept of a dominant strategy in this context. While the models played low numbers, they arrived at their decisions through probabilistic reasoning rather than by solving the game logically.
“Two things stood out,” the researchers said. “First, we were surprised by how consistently AI models behaved more strategically than humans across very different experimental settings. Second, and more unexpectedly, even the most advanced models failed to explicitly identify a simple dominant strategy in a two-player game, revealing an important gap between sophisticated-looking reasoning and basic game-theoretic logic.”
“Across many settings, AI models behaved much more strategically than humans, often choosing values far closer to the theoretical benchmark, which would meaningfully alter outcomes in real strategic interactions. At the same time, these effects highlight differences rather than superiority, since AI also shows clear limitations in recognizing simple dominant strategies.”
The researchers further investigated whether models could simulate specific human traits, replicating work by Brañas-Garza and colleagues. The prompts were adjusted to describe the artificial agent as having either high or low cognitive reflection scores. When instructed to act as an agent with high cognitive reflection, the models chose lower numbers. When instructed to act as an agent with low cognitive reflection, they chose higher numbers.
This alignment matches the behavioral patterns observed in actual human subjects. The models demonstrated a similar ability to simulate emotional states. When prompted to experience anger, the models chose higher numbers, mirroring findings from Castagnetti and colleagues that showed anger inhibits deep strategic reasoning in humans.
The researchers also examined the effect of model size on performance using the Llama family of models. They tested versions of the model ranging from 1 billion to 405 billion parameters. A clear correlation emerged between model size and strategic behavior.
The smaller models produced guesses that deviated substantially from the Nash equilibrium, often matching or exceeding human averages. The largest models produced results much closer to zero. This implies that as artificial intelligence systems scale in complexity, their behavior in strategic settings tends to converge toward the theoretical mathematical optimum rather than typical human behavior.
“A key takeaway is that modern AI systems can reason strategically and adapt to different situations, but they do not think in the same way humans do,” the researchers told PsyPost. “In our experiments, AI models consistently behaved in a more strategic and calculation-driven manner than people, even compared to well-educated or expert human participants.”
“At the same time, the study shows that AI reasoning is not simply a more advanced version of human reasoning. Despite their sophistication, the models failed to identify a basic dominant strategy in a simple two-player game, highlighting important limitations and blind spots.”
“For the average reader, this means that AI decisions should not be interpreted as direct predictions of human behavior. When AI systems are used in settings that involve judgment, competition, or social interaction, they may push outcomes in directions that differ from what we would expect if only humans were involved.”
There are some limitations to the study’s findings. The artificial agents were not playing for real financial incentives, which is a standard component of behavioral economics experiments with humans. The absence of a tangible reward could influence the depth of reasoning the models employ. Additionally, the study relied on specific phrasing in the prompts to simulate the experimental conditions. While robustness checks with paraphrased prompts showed consistent results, the models exhibited some sensitivity to how the task was framed.
“A common misinterpretation would be to conclude that AI thinks like humans or can be used as a direct proxy for human decision-making,” the researchers noted. “Our results show that while AI can perform well in strategic tasks, its reasoning patterns differ in important ways, and these differences can meaningfully affect outcomes. The key caveat is that strong performance in a task does not necessarily imply human-like cognition.”
“Our next step is to extend this approach to a wider set of strategic games that capture different cognitive demands, such as coordination, cooperation, and dominance reasoning. Ultimately, our goal is to build a systematic benchmark that compares human and AI behavior across multiple economic and psychological games, allowing researchers to better understand where AI aligns with human reasoning and where it diverges.”
The study, “Strategizing with AI: Insights from a beauty contest experiment,” was authored by Iuliia Alekseenko, Dmitry Dagaev, Sofiia Paklina, and Petr Parshakov.
