about

resources

events

contribute

republishing

☰

ΑΙhub.org

Why ChatGPT struggles with math

by University of Waterloo

07 November 2024

What flaw did you discover in ChatGPT’s ability to do math?

As I explained in a recent post on X, the latest reasoning variant of ChatGPT o1, struggles with large-digit multiplication, especially when multiplying numbers beyond nine digits. This is a notable improvement over the previous ChatGPT-4o model, which struggled even with four-digit multiplication, but it’s still a major flaw.

Is OpenAI's o1 a good calculator? We tested it on up to 20×20 multiplication—o1 solves up to 9×9 multiplication with decent accuracy, while gpt-4o struggles beyond 4×4. For context, this task is solvable by a small LM using implicit CoT with stepwise internalization. 1/4 pic.twitter.com/et5DB9bhNL

— Yuntian Deng (@yuntiandeng) September 17, 2024

What implications does this have regarding the tool’s ability to reason?

Large-digit multiplication is a useful test of reasoning because it requires a model to apply principles learned during training to new test cases. Humans can do this naturally. For instance, if you teach a high school student how to multiply nine-digit numbers, they can easily extend that understanding to handle ten-digit multiplication, demonstrating a grasp of the underlying principles rather than mere memorization.

In contrast, LLMs often struggle to generalize beyond the data they have been trained on. For example, if an LLM is trained on data involving multiplication of up to nine-digit numbers, it typically cannot generalize to ten-digit multiplication.

As LLMs become more powerful, their impressive performance on challenging benchmarks can create the perception that they can “think” at advanced levels. It’s tempting to rely on them to solve novel problems or even make decisions. However, the fact that even o1 struggles with reliably solving large-digit multiplication problems indicates that LLMs still face challenges when asked to generalize to new tasks or unfamiliar domains.

Why is it important to study how these LLMs “think”?

Companies like OpenAI haven’t fully disclosed the details of how their models are trained or the data they use. Understanding how these AI models operate allows researchers to identify their strengths and limitations, which is essential for improving them. Moreover, knowing these limitations helps us understand which tasks are best suited for LLMs and where human expertise is still crucial.

University of Waterloo

AIhub is supported by:

Introducing the NASA Onboard Artificial Intelligence Research (OnAIR) platform: an interview with Evana Gizzi

Lucy Smith 03 Jul 2025

Find out about the OnAIR platform, some of the particular challenges of deploying AI-based solutions in space, and how the tool has been used so far.

An interview with Nicolai Ommer: the RoboCupSoccer Small Size League

Lucy Smith 01 Jul 2025

We caught up with Nicolai to find out more about the Small Size League, how the auto referees work, and how teams use AI.

Forthcoming machine learning and AI seminars: July 2025 edition

Lucy Smith 30 Jun 2025

A list of free-to-attend AI-related seminars that are scheduled to take place between 1 July and 31 August 2025.

monthly digest

What is vibe coding? A computer scientist explains what it means to have AI write computer code − and what risks that can entail

The Conversation 19 Jun 2025

Until recently, most computer code was written, at least originally, by human beings. But with the advent of GenAI, that has begun to change.

Why ChatGPT struggles with math

What flaw did you discover in ChatGPT’s ability to do math?

What implications does this have regarding the tool’s ability to reason?

Why is it important to study how these LLMs “think”?

Related posts :

Introducing the NASA Onboard Artificial Intelligence Research (OnAIR) platform: an interview with Evana Gizzi

An interview with Nicolai Ommer: the RoboCupSoccer Small Size League

Forthcoming machine learning and AI seminars: July 2025 edition

AIhub monthly digest: June 2025 – gearing up for RoboCup 2025, privacy-preserving models, and mitigating biases in LLMs

RoboCupRescue: an interview with Adam Jacoff

Making optimal decisions without having all the cards in hand

Exploring counterfactuals in continuous-action reinforcement learning

What is vibe coding? A computer scientist explains what it means to have AI write computer code − and what risks that can entail

↑