GSM-Symbolic: Analyzing LLM Limitations in Mathematical Reasoning and Potential Solutions | by Alexander Watson | Oct, 2024

What the Paper on LLM Reasoning Got Right, and What It Missed.

Co-authors: Alex Watson, Yev Meyer, Dane Corneil, Maarten Van Segbroeck (Gretel.ai)

Source: Gretel.ai

Large language models (LLMs) have recently made significant strides in AI reasoning, including mathematical problem-solving. However, a recent paper titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models" by Mirzadeh et al. raises questions about the true capabilities of these models when it comes to mathematical reasoning. We have reviewed the paper and found it to be a valuable contribution to the ongoing discussion about AI capabilities and limitations; however, our analysis suggests that its conclusions may not fully capture the complexity of the issue.

The authors introduce GSM-Symbolic, an enhanced benchmark derived from the popular GSM8K dataset. This new benchmark allows for the generation of diverse question variants, enabling a more nuanced evaluation of LLMs' performance across various setups. The study's large-scale analysis of 25 state-of-the-art open and closed models provides significant insights into how these models behave when faced with mathematical reasoning tasks.
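To make the idea concrete, here is a minimal sketch of how a symbolic template works: proper names and numeric values in a GSM8K-style problem become variables, and the ground-truth answer is computed from them. The problem wording, names, and ranges below are hypothetical illustrations, not taken from the benchmark itself.

```python
import random

# A hypothetical GSM8K-style problem turned into a symbolic template:
# names and numbers are placeholders, the answer is derived from them.
TEMPLATE = ("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
            "{name} then gives away {z} apples. How many apples remain?")

def make_variant(rng):
    """Instantiate the template with fresh names/numbers; return (question, answer)."""
    name = rng.choice(["Sophie", "Liam", "Ava"])
    x, y = rng.randint(5, 50), rng.randint(5, 50)
    z = rng.randint(1, x + y)  # constrain so the answer stays non-negative
    return TEMPLATE.format(name=name, x=x, y=y, z=z), x + y - z

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```

Each draw yields a new instantiation of the same underlying problem, which is what lets GSM-Symbolic measure accuracy as a distribution over variants rather than a single number.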

Figure 1: GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models (Source: Mirzadeh et al., GSM-Symbolic Paper)

One of the most surprising findings is the high variability in model performance across different instantiations of the same question. All models exhibit "significant variability in accuracy" when tested on GSM-Symbolic. This variability raises concerns about the reliability of currently reported metrics on the GSM8K benchmark, which relies on single point-accuracy responses.

Figure 3: GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models (Source: Mirzadeh et al., GSM-Symbolic Paper)

Not all models are created equal. Llama-3-8b and GPT-4o are clear outliers in that they do not exhibit as significant a drop on the new benchmark as other models like gemma-2-9b, phi-3, phi-3.5, and mathstral-7b. This observation suggests two important points:

  1. Llama-3-8b and GPT-4o generally demonstrate a more robust understanding of mathematical concepts, although they are still not immune to performance variations.
  2. The training data for Llama-3-8b and GPT-4o likely has not been contaminated (or at least not to the same extent) with GSM8K data. In this context, data contamination refers to the unintentional inclusion of test or benchmark data in a model's training set, leading to artificially inflated model performance during evaluation. If contamination had occurred, as the authors hypothesize for some models, we would expect to see very high performance on GSM8K but significantly lower performance on even slight variations of those problems.

These findings highlight an opportunity for improvement through the use of synthetic data, where properly designed synthetic datasets can address both of these points for anyone training models:

  1. To mitigate potential data contamination issues, there's no need to use the original GSM8K data in training when high-quality synthetic versions can be generated (blog link). These synthetic datasets retain the mathematical reasoning challenges of GSM8K without reusing the exact problems or solutions, thus preserving the integrity of the model's evaluation.
  2. Even more importantly, it's possible to generate synthetic data that surpasses the quality of both the OpenAI GSM8K and Apple GSM-Symbolic datasets. This approach can lead to a more robust understanding of mathematical concepts, addressing the performance variability observed in current models.

The authors show that LLMs are more sensitive to changes in numerical values than to changes in proper names within problems, suggesting that the models' understanding of the underlying mathematical concepts may not be as robust as previously thought. As the complexity of questions increases (measured by the number of clauses), the performance of all models degrades and the variance of their performance increases. This highlights the importance of using diverse data in training, which is something synthetics can help with. As the authors demonstrate, there is logically no reason why an AI model should perform worse on a given set of problems after a simple change in numbers or a slight variation in the number of clauses.

Figure 4: GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models (Source: Mirzadeh et al., GSM-Symbolic Paper)

Perhaps the most concerning finding is the introduction of GSM-NoOp, a dataset designed to challenge the reasoning capabilities of LLMs. By adding seemingly relevant but ultimately inconsequential information to problems, the authors observed substantial performance drops across all models, up to 65% for some. The authors propose that this points to current LLMs relying more on a kind of pattern matching than true logical reasoning.
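A minimal sketch of the No-Op idea follows. The wording is illustrative, loosely inspired by the paper's kiwi example rather than copied from the benchmark: an inconsequential clause is inserted before the final question, leaving the correct answer unchanged.

```python
# Inject a seemingly relevant but inconsequential ("No-Op") clause into a
# word problem just before the final question, in the spirit of GSM-NoOp.
BASE = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does Oliver have?")
DISTRACTOR = "Five of the kiwis were a bit smaller than average. "

def with_noop(problem: str, distractor: str) -> str:
    """Insert the distractor sentence before the final question sentence."""
    head, sep, question = problem.rpartition(". ")
    return f"{head}{sep}{distractor}{question}"

print(with_noop(BASE, DISTRACTOR))
# The correct answer is unchanged (44 + 58 = 102); a model that subtracts
# the five smaller kiwis is doing surface pattern matching, not reasoning.
```

The point of the benchmark is that the transformation above is answer-preserving by construction, so any accuracy drop is attributable purely to the distractor.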

Figure 6: GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models (Source: Mirzadeh et al., GSM-Symbolic Paper)

While the GSM-Symbolic study provides valuable insights into the performance of LLMs on mathematical reasoning tasks, it's important to critically examine the paper's conclusions. The authors argue that the observed limitations suggest LLMs are not capable of true logical reasoning. However, this interpretation may be oversimplifying a complex issue.

The paper's argument that LLMs rely on pattern matching rather than reasoning seems less definitive when examined closely. It's clear that these models are not perfect reasoners; if they were, they'd achieve 100% accuracy on GSM8K. But the leap from imperfect performance to a lack of reasoning capability is not necessarily justified.

There are at least two possible explanations for why LLMs, like humans, sometimes get questions wrong:

  1. The model tries to strictly pattern match a problem to something it has seen before, and fails if it can't.
  2. The model tries to follow a logical program but has a certain (compounding) probability of making an error at each step, as expected given that it literally samples tokens.

The paper seems to lean towards explanation (1), but doesn't make a convincing case for why this should be preferred over explanation (2). In fact, (2) is more akin to human-like reasoning and potentially more interesting from a research perspective.

Let's examine each main finding of the paper through this critical lens:

GSM-Symbolic Performance

The GSM-Symbolic approach is a valuable methodology for dataset expansion, validating the potential of synthetic data generation techniques like those used by Gretel. However, it's worth noting that model performance doesn't completely collapse on these new variants; it just gets somewhat worse. If the models were strictly pattern matching, we would expect performance to drop to near zero on these new variants. The observed behavior seems more consistent with a model that can generalize to some extent but makes more errors on unfamiliar problem structures.

Even human experts are not infallible. On the MATH benchmark, for instance, former math olympians typically scored 18/20 or 19/20, making small arithmetic errors. This suggests that error-prone reasoning, rather than a lack of reasoning capability, may be a more accurate description of both human and LLM performance.

Varying Difficulty

The paper's findings on performance degradation with increasing question complexity are consistent with the idea of compounding errors in a multi-step reasoning process. As the number of steps increases, so does the probability of making an error at some point in the chain. This behavior is observed in human problem-solving as well and doesn't necessarily indicate a lack of reasoning ability.
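This compounding-error view makes a simple quantitative prediction: if each reasoning step independently fails with probability p, whole-chain accuracy decays geometrically with step count. A toy model (the 3% per-step error rate is an illustrative assumption, not a value measured in the paper):

```python
def chain_accuracy(p_step_error: float, n_steps: int) -> float:
    """Probability that an n-step chain is fully correct when each step
    independently fails with probability p_step_error."""
    return (1.0 - p_step_error) ** n_steps

# Even a modest per-step error rate produces a visible accuracy drop as the
# number of clauses (and hence reasoning steps) grows:
for n in (2, 4, 8):
    print(f"{n} steps: {chain_accuracy(0.03, n):.3f}")
```

Under this toy model, degradation with added clauses is exactly what we would expect from an imperfect but genuine reasoner, with no need to invoke pure pattern matching.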

GSM-NoOp Challenge

The GSM-NoOp results may not be as directly related to reasoning capability as the paper suggests. In real-world scenarios, we typically assume that all information provided in a problem statement is relevant. For instance, in the example question in Figure 7, a reasonable human might infer (as the LLMs did) that the size of the kiwis was only mentioned because they were discarded.

Figure 7: GSM-Symbolic: Example GSM No-Op question. (Source: Mirzadeh et al., GSM-Symbolic Paper)

The ability to discern relevant information from irrelevant information, especially when the irrelevant information is inserted with the intent to mislead (i.e., it is seemingly relevant), is a separate skill from pure mathematical reasoning.

The authors include a follow-up experiment (NoOp-NoOp) in which the models are implicitly "warned" of the misleading intent: they use few-shot examples that also contain irrelevant information. The subset of models evaluated in this experiment still shows a drop in performance. Several follow-up experiments could serve to better understand the phenomenon:

  1. Expand the NoOp-NoOp experiment to more models;
  2. Measure how well models perform when explicitly warned in the prompt that some information may be irrelevant;
  3. Fine-tune models on synthetic training examples that include irrelevant information in addition to examples that contain only relevant information.

While the paper by Mirzadeh et al. highlights important limitations in current LLMs, at Gretel we have developed datasets that address many of the challenges identified in the paper:

  1. Synthetic GSM8K Dataset: Available on HuggingFace at gretelai/synthetic-gsm8k-reflection-405b, this dataset focuses on generating more complex, multi-step reasoning versions of problems than those in the original human-generated dataset from OpenAI. It incorporates advanced prompting techniques, including Reflection and other cognitive models, to capture detailed reasoning processes. This approach has shown significant improvements, particularly for very hard problems, demonstrating its potential to enhance AI's ability to handle complex, multi-step reasoning tasks. As covered in our blog, Gretel's synthetic data created using these techniques achieved a 92.3% win-rate on problem complexity and an 82.7% win-rate on educational value over the standard Llama 3.1 405B parameter model outputs, as judged by GPT-4o, demonstrating that LLM reasoning can be further unlocked with more sophisticated training data examples and prompting techniques than the basic Chain-of-Thought used in the paper.
Source: https://gretel.ai/blog/teaching-ai-to-think-a-new-approach-with-synthetic-data-and-reflection

2. Synthetic Text-to-SQL Dataset: Generated by Gretel to help improve LLMs' ability to interact with SQL-based databases/warehouses & lakes, and available at gretelai/synthetic_text_to_sql, this dataset has proven highly effective in improving model performance on Text-to-SQL tasks. When used to fine-tune CodeLlama models, it led to 36%+ improvements on the BIRD benchmark, a challenging cross-domain Text-to-SQL evaluation platform. Further supporting the theory that today's LLMs are trained on data that is too simple and therefore encourages memorization, a single epoch of fine-tuning the Phi-3 and Llama 3.1 models on this dataset yielded a 300%+ improvement on BIRD benchmark problems labeled "very hard".

These results demonstrate that high-quality synthetic data can be a powerful tool in addressing the limitations of current LLMs in complex reasoning tasks.

In conclusion, the GSM-Symbolic paper provides valuable insights into the current limitations of LLMs in mathematical reasoning tasks. However, its conclusions should be approached critically. The observed behavior of LLMs can be interpreted in multiple ways, and it's possible that the paper's emphasis on pattern matching over reasoning oversimplifies a more complex issue.

The limitations identified by the study are real and significant. The variability in performance, sensitivity to numerical changes, and struggles with irrelevant information all point to areas where current LLMs can be improved.

However, as demonstrated by more advanced models such as GPT-4o and Llama 3.1 above, by synthesizing diverse, challenging problem sets that push the boundaries of what AI models can handle, we can develop LLMs that exhibit more robust, human-like reasoning capabilities.

  1. I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar. GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. 2024.