
When it comes to advanced math, ChatGPT is no star student

AI's ability to handle math depends on what exactly you ask it to do.

While learning high-level mathematics is no easy feat, teaching math concepts can often be just as tricky. That may be why many teachers are turning to ChatGPT for help. According to a recent Forbes article, 51 percent of teachers surveyed stated that they had used ChatGPT to help teach, with 10 percent using it daily. ChatGPT can help relay technical information in more basic terms, but it may not always provide the correct solution, especially for upper-level math. An international team of researchers tested what the software could manage by providing the generative AI program with challenging graduate-level mathematics questions. While ChatGPT failed on a significant number of them, its correct answers suggested that it could be useful for math researchers and teachers as a type of specialized search engine.

Portraying ChatGPT’s math muscles

The media tends to portray ChatGPT’s mathematical intelligence as either brilliant or incompetent. “Only the extremes have been emphasized,” explained Frieder Simon, a University of Oxford PhD candidate and the study’s lead author. For example, ChatGPT aced Psychology Today’s Verbal-Linguistic Intelligence IQ Test, scoring 147 points, but failed miserably on Accounting Today’s CPA exam. “There’s a middle [road] for some use cases; ChatGPT is performing pretty well [for some students and educators], but for others, not so much,” Simon elaborated.

On high school and undergraduate-level math tests, ChatGPT performs well, ranking in the 89th percentile on the SAT math test. It even received a B on technology expert Scott Aaronson’s quantum computing final exam. But different tests may be needed to reveal the limits of ChatGPT’s capabilities.

“One thing media have focused on is ChatGPT’s ability to pass various popular standardized tests,” stated Leah Henrickson, a professor of digital media at the University of Queensland. “These are tests that students spend literally years preparing for. We’re often led to believe that these tests evaluate our intelligence, but more often than not, they evaluate our ability to recall facts. ChatGPT can pass these tests because it can recall facts that it has picked up in its training.”

Simon and his research team proposed a unique set of upper-level math questions to assess whether ChatGPT also had test-taking and problem-solving skills. “[Previous studies looked at] if the output has been correct or incorrect,” Simon added. “And we wanted to go beyond this and have implemented a much more fine-grained methodology where we can really assess how ChatGPT fails, if it does fail, and in what way it fails.” To create a more complex testing system, the researchers compiled prompts from several fields into a larger problem set they called GHOSTS.

Creating GHOSTS

The GHOSTS data set takes its name from the six types of math problems the researchers tested on ChatGPT: grad text, holes-in-proofs, Olympiad problem-solving, symbolic integration, math, and search-engine aspects. These are skills that researchers, graduate-level educators, and students use routinely. Simon explained: “We wanted to make a holistic comparison of different mathematical reasoning. Previous data sets were always somewhat similar. They were mostly composed of these word problems, where you have a small problem formulated at the high school level, or maybe undergraduate, but nothing at the graduate level.”

The GHOSTS data set drew questions from a graduate-level math textbook, posed fill-in-the-blank proof questions and notoriously difficult Olympiad-style problems, and asked ChatGPT to symbolically integrate functions, work through more standard graduate-level analyses, and define certain math concepts. The researchers ran over 700 prompts through the generative AI program and analyzed ChatGPT’s answers to find where things went wrong.

When prompted to explain how it reached its answers, ChatGPT often presented unusual or unexpected reasoning; even when it got the correct answer, it did so by traveling outside the bounds of standard practice. Students learn a standard form of math reasoning (such as the mnemonic SOHCAHTOA, which encodes that sine is opposite over hypotenuse, cosine is adjacent over hypotenuse, and tangent is opposite over adjacent), so ChatGPT’s convoluted route to an answer may confuse students, especially in more basic math classes.

“ChatGPT is fantastic for learning, and I use it all the time,” Simon added. “But there’s a big part to this [where] you have to know enough domain knowledge to verify it.” Simon and his team suggest that ChatGPT should only be used as a learning aid by more advanced math learners. As Simon explained, mature learners “know enough to check the output. If you ask [ChatGPT] for a proof, you have to be confident enough in your own abilities to follow the mathematical proof and to spot any gaps in it.” For less mature learners, Simon warned that ChatGPT could be “dangerous” to use independently, as they may not have enough experience to validate the math.

Other experts, like Dr. Gerardo Adesso from the University of Nottingham, agree. “[ChatGPT] can also make some silly numerical or logic mistakes that any human would spot right away,” he said. “That’s why one should always double-check its outputs before trusting them blindly. ChatGPT is not a magic tool that can solve any math problem, but it can be a helpful companion to give you some hints and suggestions.”
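For readers curious what that kind of double-check can look like in practice, here is a minimal sketch using the Python library SymPy as a stand-in for any computer algebra system; the integral and the candidate answer are invented for illustration and are not prompts from the GHOSTS set.

    # Double-checking a chatbot's symbolic integration answer with SymPy
    # (hypothetical example; SymPy stands in for any computer algebra system).
    import sympy as sp

    x = sp.symbols("x")
    integrand = x * sp.exp(x)             # question posed: integrate x*e^x dx
    chatbot_answer = (x - 1) * sp.exp(x)  # candidate antiderivative to verify

    # If differentiating the candidate and subtracting the integrand simplifies
    # to zero, the answer is a valid antiderivative (up to a constant).
    check = sp.simplify(sp.diff(chatbot_answer, x) - integrand)
    print("verified" if check == 0 else "does not check out")

A routine like this catches exactly the “silly numerical or logic mistakes” Adesso warns about before they make it into a homework set or a paper.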

Finding the best and worst math use cases

In their paper (which is in the process of being published), the researchers list the top three best and worst use cases for ChatGPT from their GHOSTS data set. Because ChatGPT is an LLM (large language model), it is more proficient at analyzing language than equations. So it shouldn’t be a surprise that the generative AI program failed when it came to pure math questions, such as integration, but was best at defining math concepts. The researchers also found particular math tasks that ChatGPT couldn’t perform, such as finding the area of geometric figures. In those cases, Simon recommended using other software, like Wolfram Alpha (which has a ChatGPT plugin), to handle more equation-focused problems.

The researchers suggest that, while ChatGPT wasn’t proficient in upper-level math, it would be incredibly useful as a math-based search engine for researchers, educators, and even coders. “It’s fairly inaccurate in many points, but these points are still useful to read because they give you some bits of information that you can hold onto that will point you to the next website or article,” Simon added. “This speeds up the learning process. In the classic coding loop, you have a question, ask it online, wait a few hours, and then get an answer. With this iteration, it is almost instant.” As Simon highlighted, with coding, students can immediately test ChatGPT’s recommended solution and verify whether it works.

When used as a math-based search engine for academic researchers, ChatGPT can save significant time and energy. The research team emphasized that this use case may be especially helpful for physicists, computer scientists, and even engineers who use different mathematical concepts in their studies.

For the general audience, Thomas Lukasiewicz, a professor of computer science at Oxford and the last author of the study, believes that comparing the best and worst use cases for ChatGPT could help the media dispel misconceptions about its math capabilities. As he explained, the media could “[show] where it’s good and where it’s bad. This could be how to illustrate capabilities, which we’ve also done in our paper at the end, looking at that [as a model] to see it in action on some concrete examples.”
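In that spirit, here is one hypothetical concrete example of the near-instant coding loop Simon describes: paste in the function a chatbot suggests, then immediately run it against cases whose answers you already know. The function and test cases below are invented for illustration and do not come from the study.

    # Hypothetical "ask, paste, verify" loop for a chatbot-suggested function.
    def solve_quadratic(a, b, c):
        """Return the real roots of a*x^2 + b*x + c = 0 (chatbot-suggested stand-in)."""
        disc = b * b - 4 * a * c
        if disc < 0:
            return []
        root = disc ** 0.5
        return sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)})

    # Immediate verification against known answers, no waiting for a forum reply.
    assert solve_quadratic(1, -3, 2) == [1.0, 2.0]  # x^2 - 3x + 2 has roots 1 and 2
    assert solve_quadratic(1, 0, 1) == []           # x^2 + 1 has no real roots
    assert solve_quadratic(1, -2, 1) == [1.0]       # repeated root at x = 1
    print("All checks passed")

If an assertion fails, the learner knows right away that the suggestion needs another round of iteration.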

How to improve ChatGPT’s math skills

Large language models are in constant development; this work was done prior to the release of ChatGPT based on GPT-4, so it’s possible that the current version already performs better at math problems. But Simon and other researchers are also suggesting ways the software’s math skills could be improved, possibly to the point of proficiency in advanced math. “ChatGPT could improve its math abilities by learning from more and better data, especially from higher-level math, and at the same time, one could get better answers by suitable prompt engineering,” Adesso stated. “ChatGPT could also benefit from integrating with other systems that can better handle formal and symbolic math natively.”

Simon and Lukasiewicz hope to start a community-based initiative to make more sophisticated math data available to ChatGPT. “Once there are better methodologies out there, we’d like to make a leaderboard to allow people to submit their own prompts and do their own ratings based on this better methodology that assists them in doing that,” Simon added. “This will be the biggest impact because it can also help with the data-gathering process, which is the bottleneck, I would say, in mathematics, because you cannot outsource that.”

Other experts, like Henrickson, suspect that users could get more out of ChatGPT by asking bigger-picture questions about the technology itself. “We can decide for ourselves when AI meets our needs and doesn’t. To make those decisions, though, we need to have at least a basic understanding of how these systems work rather than just focusing on the output they generate,” she said. “By thinking through questions like these, we can make more informed choices about what technologies we use to complement our own insight and why.”

Kenna Hughes-Castleberry is the science communicator at JILA (a joint physics research institute between the National Institute of Standards and Technology and the University of Colorado Boulder) and a freelance science journalist. Her writing focuses on quantum physics, quantum technology, deep technology, social media, and the diversity of people in these fields, particularly women and people from minority ethnic and racial groups. Follow her on LinkedIn or visit her website.