
AI Beats Average Humans At Creativity Test, But Creative Geniuses Still Reign Supreme

StudyFinds Analysis | January 21, 2026
A robotic hand painting (Credit: paulista/Shutterstock)

Top-tier creativity remains elusive to AI. Models can’t help but repeat ‘safe’ ideas over and over.

In A Nutshell

AI creativity can be tuned, but it has limits. Adjusting settings helps, yet no current model matches the originality of highly creative people.

AI can outperform the average person on a standard creativity test, which asks participants to list words that are as different from each other as possible.

The most creative humans still outperform every AI tested, creating a clear gap between top human creativity and today’s machines.

AI tends to repeat the same “safe” ideas, while people naturally vary their responses. Raising an AI’s randomness setting reduces this repetition and boosts scores.

ChatGPT can now best the average person when it comes to creative tasks, according to recent research. That being said, if you’re among the most creative humans, your job is probably safe.


Researchers from the University of Montreal ran the largest direct comparison between human and machine creativity to date, pitting 100,000 people against nine of the world’s most advanced AI systems. The results? GPT-4 scored higher than typical humans on a standard creativity test. Google’s Gemini Pro matched average human performance.

While all of that may be a bit distressing for biological beings reading this, it isn’t time to throw in the towel on human creativity just yet. When the AI systems were stacked against the top 10% of creative people, every AI model failed to measure up.

The test itself was deceptively simple: name 10 words as different from each other as possible. Someone who writes “car, dog, tree” shows less creative range than someone who comes up with “microscope, volcano, whisper.” The further apart the words are in meaning, the higher the creativity score.
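For readers curious how that kind of score can be computed, here is a minimal sketch: it assumes pre-trained GloVe word vectors loaded through the gensim library and scores a list by the average pairwise cosine distance between its words. This is an illustration of the general idea, not the study’s exact scoring pipeline.

```python
# Minimal sketch: score a word list by how far apart the words sit in
# embedding space (average pairwise cosine distance). Illustrative only;
# the study's exact scoring pipeline may differ.
import itertools
import numpy as np
import gensim.downloader as api

# Pre-trained GloVe vectors (an assumption; any word-embedding model would do)
model = api.load("glove-wiki-gigaword-300")

def divergence_score(words):
    """Average pairwise cosine distance between the words, scaled by 100."""
    vectors = [model[w.lower()] for w in words if w.lower() in model]
    distances = []
    for a, b in itertools.combinations(vectors, 2):
        cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        distances.append(1.0 - cosine_similarity)
    return 100.0 * float(np.mean(distances))  # higher = more divergent word choices

print(divergence_score(["car", "dog", "tree"]))                # relatively low
print(divergence_score(["microscope", "volcano", "whisper"]))  # noticeably higher
```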

“The persistent gap between the best-performing humans and even the most advanced LLMs indicates that the most demanding creative roles in industry are unlikely to be supplanted by current artificial intelligence systems,” the researchers wrote in their paper, published in Scientific Reports.

The Repetition Problem Nobody Expected

Despite beating average humans overall, GPT-4 kept using the same words over and over. The word “microscope” appeared in 70% of its responses. “Elephant” showed up 60% of the time. GPT-4-turbo was even worse, dropping “ocean” into more than 90% of its answers.

Humans? The most common word was “car” at just 1.4%. Then “dog” at 1.2% and “tree” at 1.0%. Real people naturally avoid repeating themselves. AI tends to fall back on the same high-probability words unless you adjust the settings.

The research team, led by Antoine Bellemare-Pepin and François Lespinasse, tested whether they could fix this. They adjusted something called “temperature,” which is essentially a dial that controls how random or predictable the AI’s word choices are. After the temperature was increased, GPT-4 stopped repeating itself so much. Its creativity scores jumped, reaching a level higher than 72% of all human participants.
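To see why a higher temperature reduces repetition, here is a toy sketch of temperature-scaled sampling: the model’s raw scores (logits) are divided by the temperature before being turned into probabilities, so higher values flatten the distribution and give lower-probability words a better chance of being picked. The words and numbers below are made up for illustration, not actual model output.

```python
# Toy illustration of the temperature dial: divide the logits by T before
# the softmax. Higher T -> flatter distribution -> less repetition.
import numpy as np

def token_probabilities(logits, temperature=1.0):
    """Temperature-scaled softmax over a set of candidate words."""
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    return exp / exp.sum()

words = ["ocean", "microscope", "elephant", "whisper", "volcano"]
logits = [5.0, 4.0, 3.5, 1.0, 0.5]        # hypothetical scores, not real model output

for t in (0.7, 1.0, 1.5):
    probs = token_probabilities(logits, temperature=t)
    print(f"T={t}:", {w: round(float(p), 2) for w, p in zip(words, probs)})
```

Running it shows the top word’s share of the probability shrinking as the temperature rises, which is the repetition-reducing effect the researchers exploited.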

That’s useful for anyone trying to get better creative output from ChatGPT. But it also reveals something fundamental: AI creativity is a setting you can turn up or down, not an inherent capability.

This study is good news for artists worried about AI taking their livelihood. Truly unique creativity remains a purely human trait. (Credit: Nicoleta Ionescu/Shutterstock)

When Newer Doesn’t Mean Better

OpenAI released GPT-4-turbo after the original GPT-4, presumably as an improvement. On this creativity test, though, it performed worse. Much worse.

The researchers found that newer versions don’t automatically get more creative; sometimes they get less so. One possible explanation, they suggest, is that newer versions are optimized for speed and cost, potentially trading creativity for efficiency.

Another noteworthy finding: Vicuna, a smaller open-source model, beat several larger, more expensive commercial alternatives. Bigger doesn’t mean more creative either.

The 100,000-Person Experiment

The study pulled participants from the United States, United Kingdom, Canada, Australia, and New Zealand: all English speakers balanced for age and gender. Everyone took the same test: list 10 unrelated words.

Researchers then fed identical instructions to nine different AI models, collecting 500 responses from each. They tested everything from household names like GPT-4 and Claude to lesser-known open-source models like Pythia and StableLM.

The team also pushed beyond simple word lists. They had the AI write haikus, movie synopses, and short fiction stories, then measured how diverse the ideas were. GPT-4 consistently beat GPT-3.5 on creative writing. However, human writers still produced work with greater variety and originality, especially in poetry and plot summaries.

What This Actually Means

If you’re a professional writer, designer, or artist, this research suggests you’re not about to be replaced. AI can match, and sometimes exceed, what an average person produces. But the best human creators operate on a different level entirely.

That gap matters. Most companies don’t hire average creators for their most demanding work. They hire the top performers, the people who can generate truly original ideas. Current AI can’t touch that tier.

For everyone else using ChatGPT to brainstorm or draft content, there’s a practical takeaway: if you want more creative results, raise the model’s temperature setting, an option available when calling these systems through their APIs (a value between 1.0 and 1.5 usually works well). You’ll get less repetition and more diverse outputs.
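For those using the API rather than the chat interface, temperature is set per request. Here is a minimal sketch, assuming the OpenAI Python client (v1.x); the model name, prompt, and value of 1.3 are placeholders, not recommendations from the study.

```python
# Minimal sketch: requesting a higher temperature via the OpenAI Python
# client. Model name, prompt, and the 1.3 value are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",                      # placeholder model name
    temperature=1.3,                    # higher = less repetitive, more varied
    messages=[{
        "role": "user",
        "content": "Name 10 words that are as different from each other as possible.",
    }],
)

print(response.choices[0].message.content)
```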

The researchers made their testing framework public so other scientists can benchmark new AI models as they’re released. For now, though, the ceiling is clear. Artificial intelligence has learned to mimic average human creativity, but exceptional human creativity remains in a category of its own.


Paper Summary

Study Limitations

Architecture details weren’t available for some commercial models, limiting conclusions about which specific features boost creative performance. The test measured how different words are from each other in meaning, which captures one aspect of creativity but not everything—someone could express novel ideas using similar words. The exact training data for models like GPT-4 and Claude remains undisclosed, so researchers can’t determine whether these systems had previous exposure to the creativity test.

Funding and Disclosures

A.B. received support from a Fonds de Recherche du Québec-Société et Culture doctoral grant (274043). K.J. received funding from Canada Research Chairs program (950-232368), a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (2021-03426), and Strategic Research Clusters Program from the Fonds de recherche du Québec–Nature et technologies (2023-RS6-309472). J.O. received support from a Canadian Institutes of Health Research postdoctoral fellowship. F.L. and Y.H. received Courtois-Neuromod scholarships. F.L. is currently supported by the Social Science and Humanities Research Council of Canada doctoral fellowship and the Applied AI Institute of Concordia University. K.W.M. is a Senior Research Scientist at Google DeepMind but conducted this work independently. All other authors declared no competing interests.

Publication Details

Bellemare-Pepin, A., Lespinasse, F., Thölke, P., Harel, Y., Mathewson, K., Olson, J.A., Bengio, Y., and Jerbi, K. Department of Psychology, Université de Montréal; Music Department, Concordia University; Department of Sociology and Anthropology, Concordia University; Mila (Quebec AI Research Institute); Department of Psychology, University of Toronto Mississauga; Department of Computer Science and Operations Research, Université de Montréal; UNIQUE Center (Quebec Neuro-AI Research Center). “Divergent Creativity in Humans and Large Language Models,” published on January 21, 2026 in Scientific Reports. Corresponding author: Karim Jerbi (karim.jerbi@umontreal.ca). The protocol for human data collection received approval from the University of Toronto Research Ethics Board (#45872) and exemption from the Harvard University Institutional Review Board (IRB21-0991).

Called "brilliant," "fantastic," and "spot on" by scientists and researchers, our acclaimed StudyFinds Analysis articles are created using an exclusive AI-based model with complete human oversight by the StudyFinds Editorial Team. For these articles, we use an unparalleled LLM process across multiple systems to analyze entire journal papers, extract data, and create accurate, accessible content. Our writing and editing team proofreads and polishes each and every article before publishing. With recent studies showing that artificial intelligence can interpret scientific research as well as (or even better) than field experts and specialists, StudyFinds was among the earliest to adopt and test this technology before approving its widespread use on our site. We stand by our practice and continuously update our processes to ensure the very highest level of accuracy. Read our AI Policy (link below) for more information.
