AIhub.org
 

If AI image generators are so smart, why do they struggle to write and count?


17 July 2023




AI image produced using the prompt ‘hyper-realistic ten hands on a picture with text saying hello’. Midjourney, author provided.

By Seyedali Mirjalili, Torrens University Australia

Generative AI tools such as Midjourney, Stable Diffusion and DALL-E 2 have astounded us with their ability to produce remarkable images in a matter of seconds.

Despite their achievements, however, there remains a puzzling disparity between what AI image generators can produce and what we can. For instance, these tools often won’t deliver satisfactory results for seemingly simple tasks such as counting objects and producing accurate text.

If generative AI has reached such unprecedented heights in creative expression, why does it struggle with tasks even a primary school student could complete?

Exploring the underlying reasons helps shed light on the complex numerical nature of AI, and the nuance of its capabilities.

AI’s limitations with writing

Humans can easily recognise text symbols (such as letters, numbers and characters) written in many different fonts and styles of handwriting. We can also produce text in different contexts, and understand how context can change meaning.

Current AI image generators lack this inherent understanding. They have no true comprehension of what any text symbols mean. These generators are built on artificial neural networks trained on massive amounts of image data, from which they “learn” associations and make predictions.

Combinations of shapes in the training images are associated with various entities. For example, two inward-facing lines that meet might represent the tip of a pencil, or the roof of a house.

But when it comes to text and quantities, the associations must be incredibly accurate, since even minor imperfections are noticeable. Our brains can overlook slight deviations in a pencil’s tip, or a roof – but not as much when it comes to how a word is written, or the number of fingers on a hand.

As far as text-to-image models are concerned, text symbols are just combinations of lines and shapes. Since text comes in so many different styles – and since letters and numbers are used in seemingly endless arrangements – the model often won’t learn how to effectively reproduce text.

AI-generated image produced in response to the prompt ‘KFC logo’. Imagine AI.

The main reason for this is insufficient training data. AI image generators require much more training data to accurately represent text and quantities than they do for other tasks.

The tragedy of AI hands

Issues also arise when dealing with smaller objects that require intricate details, such as hands.

Two AI-generated images produced in response to the prompt ‘young girl holding up ten fingers, realistic’. Shutterstock AI.

In training images, hands are often small, holding objects, or partially obscured by other elements. This makes it challenging for AI to associate the term “hand” with the exact representation of a human hand with five fingers.

Consequently, AI-generated hands often look misshapen, have too many or too few fingers, or are partially covered by objects such as sleeves or purses.

We see a similar issue when it comes to quantities. AI models lack a clear understanding of quantities, such as the abstract concept of “four”.

As such, an image generator may respond to a prompt for “four apples” by drawing on learning from myriad images featuring many quantities of apples – and return an output with the incorrect amount.

In other words, the huge diversity of associations within the training data impacts the accuracy of quantities in outputs.
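The effect can be illustrated with a deliberately simplified sketch. The toy "training set" below is invented for illustration, and counting captions is only an analogy: real image generators do not look up counts in a table. It simply shows how often the number four appears in images whose captions mention apples, compared with images whose captions contain the exact phrase "four apples".

```python
from collections import Counter

# Hypothetical toy "training set": (caption, number of apples actually in the image).
# Every caption and count here is made up for illustration.
TRAINING_SET = [
    ("four apples on a table", 4),
    ("apples in a bowl", 7),
    ("a few apples", 3),
    ("four apples and a knife", 4),
    ("fresh apples at a market", 30),
    ("apples on a tree", 12),
    ("four apples, freshly picked", 5),  # a loosely captioned image
    ("some apples", 4),
]


def count_distribution(keyword, data):
    """Share of each apple count among images whose caption contains keyword."""
    counts = Counter(n for caption, n in data if keyword in caption)
    total = sum(counts.values())
    return {n: c / total for n, c in sorted(counts.items())}


# Associating only the word "apples" with images spreads the probability
# over every quantity that appears in the data.
by_object = count_distribution("apples", TRAINING_SET)

# Matching the exact phrase narrows the spread, but imprecise captions
# still leave some chance of the wrong quantity.
by_phrase = count_distribution("four apples", TRAINING_SET)

print("share of images with exactly 4 apples, given 'apples':", by_object[4])
print("share of images with exactly 4 apples, given 'four apples':", by_phrase[4])
```

In this made-up data, the quantity named in the prompt is far from guaranteed under either association, which mirrors the mismatch described above.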

Three AI-generated images produced in response to the prompt ‘5 soda cans on a table’. Shutterstock AI.

Will AI ever be able to write and count?

It’s important to remember that text-to-image and text-to-video conversion is a relatively new concept in AI. Current generative platforms are “low-resolution” versions of what we can expect in the future.

With advancements being made in training processes and AI technology, future AI image generators will likely be much more capable of producing accurate visualisations.

It’s also worth noting most publicly accessible AI platforms don’t offer the highest level of capability. Generating accurate text and quantities demands highly optimised and tailored networks, so paid subscriptions to more advanced platforms will likely deliver better results.

Seyedali Mirjalili, Professor, Director of Centre for Artificial Intelligence Research and Optimisation, Torrens University Australia

This article is republished from The Conversation under a Creative Commons license. Read the original article.




The Conversation is an independent source of news and views, sourced from the academic and research community and delivered direct to the public.
