Richard L. McCormick Interim President | Stony Brook University
A recent study led by Tuhin Chakrabarty, an assistant professor in the Department of Computer Science at Stony Brook University, alongside researchers from Columbia University, has revealed that the New York Times word game 'Connections' poses a significant challenge for Large Language Models (LLMs) in abstract reasoning tasks. Despite AI's proficiency in games like chess, this research highlights its limitations when faced with 'Connections.'
The study examined how LLMs responded to over 400 games of 'Connections,' finding that even the most advanced model tested, Claude 3.5 Sonnet, could only fully solve 18% of them. The game's format involves a 4x4 grid with 16 words that players must group into four clusters based on shared characteristics. For instance, words such as ‘Followers,’ ‘Sheep,’ ‘Puppets,’ and ‘Lemmings’ are grouped as ‘Conformists.’
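The format described above can be sketched as a small data model: a hidden solution of four labeled groups, the 16-word grid a player or model actually sees, and a scorer that counts exact group matches (a puzzle counts as fully solved only when all four groups match). This is an illustrative sketch, not code from the study; only the "Conformists" group comes from the article, and the other three groups are invented placeholders.

```python
# Minimal sketch of a 'Connections' puzzle. Only "Conformists" is from the
# article; the other three groups are hypothetical placeholders.
PUZZLE = {
    "Conformists": {"Followers", "Sheep", "Puppets", "Lemmings"},
    "PlaceholderA": {"A1", "A2", "A3", "A4"},
    "PlaceholderB": {"B1", "B2", "B3", "B4"},
    "PlaceholderC": {"C1", "C2", "C3", "C4"},
}

def grid(puzzle):
    """The 16 words a player (or an LLM) sees, with group labels hidden."""
    return sorted(word for group in puzzle.values() for word in group)

def score(puzzle, guessed_groups):
    """Count how many guessed four-word groups exactly match a hidden group.

    The puzzle is fully solved only when all four guesses match."""
    hidden = [frozenset(g) for g in puzzle.values()]
    matched = 0
    for guess in guessed_groups:
        g = frozenset(guess)
        if g in hidden:
            matched += 1
            hidden.remove(g)  # each hidden group can be matched only once
    return matched
```

Under this scoring, the study's 18% full-solve rate for Claude 3.5 Sonnet corresponds to `score` returning 4 on fewer than one game in five.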
Chakrabarty explained that the game's difficulty lies in its deliberate ambiguity: “While the task might seem easy to some, many of these words can be easily grouped into several other categories.” This design makes the game particularly intriguing.
Further findings indicate that LLMs perform better with semantic relations but struggle with multiword expressions and complex knowledge involving word form and meaning. The research evaluated five different LLMs — Google’s Gemini 1.5 Pro, Anthropic’s Claude 3.5 Sonnet, OpenAI’s GPT-4o, Meta’s Llama 3.1 405B, and Mistral Large 2 — across numerous 'Connections' games, and compared them against human performance on a subset of the games.
The conclusion drawn from the results is clear: while LLMs show partial success in solving these puzzles, "their performance was far from ideal."
For more details on this study's findings, visit the AI Innovation Institute website.