The headline writes itself and, predictably, wrote itself badly in most places it appeared. A study published in Scientific Reports in January 2026 compared the performance of several large language models, including GPT-4, Claude, and Gemini, against more than 100,000 human participants on a standard test of divergent thinking. Some AI models scored above the human average. This is one study, not settled consensus, and the task being measured is considerably more specific than “creativity.”
What the study actually found is more interesting than the summary.
And the part that the researchers themselves identify as most significant is not the part that got attention.
What the Divergent Association Task measures
The primary tool used in the study is the Divergent Association Task, developed by Jay Olson of the University of Toronto Mississauga, who is a co-author of the paper. The test is simple to explain. Participants are asked to produce ten words that are as semantically different from each other as possible. The greater the semantic distance between the words, the higher the score. Someone who produces “galaxy, fork, freedom, algae, harmonica, quantum, nostalgia, velvet, hurricane, photosynthesis” scores higher than someone who lists words that cluster in meaning.
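To make the scoring idea concrete, here is a minimal Python sketch. It assumes the score is the mean pairwise cosine distance between word vectors, scaled by 100. The description above does not spell out the exact formula or the embedding model, so `cosine_distance`, `divergence_score`, and the tiny hand-made `TOY_EMBEDDINGS` table are hypothetical illustrations, not the study's implementation.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def divergence_score(words, embeddings):
    """Mean pairwise cosine distance between the words' vectors, times 100.

    Assumption: this formula is an illustrative stand-in for a
    semantic-distance score, not a confirmed copy of the study's method.
    """
    vectors = [embeddings[w] for w in words]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two words to measure distance")
    return 100.0 * sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy 3-dimensional vectors invented for this example. Real word
# embeddings have hundreds of dimensions and are learned from text.
TOY_EMBEDDINGS = {
    "cat": (1.0, 0.1, 0.0),
    "dog": (0.9, 0.2, 0.0),
    "kitten": (1.0, 0.0, 0.1),
    "galaxy": (0.0, 1.0, 0.0),
    "fork": (0.0, 0.0, 1.0),
}

if __name__ == "__main__":
    clustered = divergence_score(["cat", "dog", "kitten"], TOY_EMBEDDINGS)
    spread = divergence_score(["cat", "galaxy", "fork"], TOY_EMBEDDINGS)
    print(f"clustered words: {clustered:.1f}")
    print(f"spread-out words: {spread:.1f}")
```

In this toy setup the list of near-synonyms scores close to zero, while the list of unrelated words scores close to 100, which is the behaviour the task is built around.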
The task takes two to four minutes to complete. It is designed to be accessible enough to administer to large populations online, which is how the 100,000-participant dataset was assembled. The researchers also validated it against longer, more established creativity assessments, and found that performance on the DAT correlates with performance on those other tests. In other words, it is measuring something real about divergent thinking, not merely word knowledge or vocabulary breadth.
GPT-4, at its maximum temperature setting (temperature is a parameter that controls how unpredictable and daring the model’s outputs are), exceeded the creativity scores of 72 percent of human participants. This is the finding that generated most of the coverage. It is genuine. It is also one result on one task, in one dataset, under one configuration.
What the study actually emphasises
The lead researcher, Professor Karim Jerbi from the Department of Psychology at the Université de Montréal, and his co-first authors, postdoctoral researcher Antoine Bellemare-Pépin (Université de Montréal) and PhD candidate François Lespinasse (Concordia University), gave considerable space in their paper and in the institutional press release to a finding that received far less coverage: the most creative humans still clearly outperform the best AI systems tested.
The average performance of the most creative half of participants exceeds that of all AI models tested. The top 10 percent of human participants open a wider gap still. When the study extended beyond the DAT to richer creative tasks, including haiku composition, short story writing, and movie plot summaries, the most skilled human participants retained a clear advantage. “Even the best AI systems still fall short of the levels reached by the most creative humans,” Professor Jerbi said in the study’s press release.
The headline, then, is accurate at one level and misleading at another. AI did beat the average human on this test. It did not beat the best humans. And the distribution matters: exceeding the average human score does not mean the ceiling of human creativity has been reached or surpassed. It means only that the middle of the distribution has been passed.
Temperature and prompting as levers
One of the more technically interesting findings in the paper concerns how AI creativity can be modified. The temperature setting, which controls the randomness of the model’s outputs, has a measurable effect on DAT scores. At low temperature, models produce predictable, conventional outputs. At higher temperature, they generate more varied and semantically distant associations.
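The effect of temperature can be sketched in a few lines of Python. This assumes the common scheme in which a model's raw scores (logits) are divided by the temperature before a softmax turns them into sampling probabilities. The paper's description here does not say how each tested model applies the setting, so the function names and the four example logits below are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one candidate index according to the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Hypothetical scores for four candidate next words, most likely first.
    logits = [4.0, 2.0, 1.0, 0.5]
    rng = random.Random(0)
    for t in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(logits, t)
        picks = [sample_index(logits, t, rng) for _ in range(10)]
        print(t, [round(p, 3) for p in probs], picks)
```

At low temperature nearly all of the probability sits on the top-scoring candidate, so the output is predictable. At higher temperature the probabilities flatten and less likely candidates are chosen more often, which is the mechanism behind the more varied associations.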
Prompting strategy also affects the result. The researchers found that instructions framed around etymology, encouraging the model to draw on word origins and structure, produced less obvious associations and higher creativity scores. This is a finding about the interaction between human instruction and AI output, not simply about what AI can produce on its own. The creativity demonstrated in those high-scoring responses was not independent of the human choices that elicited it.
This complicates the competition framing that the headline implies. If the score can be raised or lowered substantially depending on how a human instructs the system, then the output is not a clean measure of the AI’s creative capacity in isolation. It is a measure of the human-AI system. That is not a minor distinction.
The misread the researchers anticipate
In the statement accompanying the paper, Professor Jerbi directly addressed what he described as a “misleading sense of competition.” He said: “Generative AI has above all become an extremely powerful tool in the service of human creativity: it will not replace creators, but profoundly transform how they imagine, explore, and create, for those who choose to use it.”
This is a measured position, neither alarmed nor dismissive, and it is consistent with what the data actually show. The study does not demonstrate that AI is generally more creative than humans. It demonstrates that AI can now exceed the average human performance on a specific and well-validated measure of divergent linguistic creativity, while remaining below the performance of the most creative humans both on that measure and on more complex creative tasks.
The commissioned headline characterises the researchers’ central concern as “the gap between average and exceptional is now the only question that matters.” That framing is sharper than what Jerbi’s team stated, but it captures the direction of their argument. What the study closes off is the claim that AI is simply incapable of creative output. What it opens up is the question of where, exactly, the gap between AI and the best human creative work lies, and whether it is narrowing.
What the study cannot resolve
The Divergent Association Task is a test of one component of creativity: the ability to generate semantically diverse associations quickly. It does not measure sustained creative work, the kind that involves revision, judgment, emotional investment, aesthetic risk, and the accumulation of a voice over years. It does not measure the conditions under which creative work becomes meaningful to other people.
These distinctions matter when the conversation turns, as it inevitably does, to whether AI will displace creative workers. The study’s data are about DAT performance and short creative writing samples assessed by raters. They are not data about whether AI can produce a body of work that holds up over time, or sustains the attention of a reader across something longer than a haiku. The researchers are careful about this. Their conclusions are scoped to what their data show. What the study adds to the broader conversation is a well-constructed benchmark and a large, validated dataset. It shows that the question “can AI be creative?” has stopped being useful. The question now is: creative in what way, measured how, compared to which humans, and for what purpose.