This ChatGPT prompt and its corresponding (incorrect) response have recently been shared and re-posted on LinkedIn numerous times. They have been held up as solid proof that AGI is simply not there yet. Further re-posts also pointed out that re-arranging the prompt to “Which one is bigger: 9.11 or 9.9?” ensures a correct answer, which further emphasizes the brittleness of LLMs.
After evaluating both prompts against a random group of ChatGPT users, we found that in both cases the answer is incorrect about 50% of the time. As some users have correctly pointed out, there is a subtle ambiguity in the question: are we referring to a mathematical inequality between two real numbers, to two dates (e.g. September 11 vs September 9), or to two sub-sections in a document (e.g. chapter 9.11 vs 9.9)?
We decided to perform a more controlled experiment using the OpenAI APIs. This way we have full control over both the system prompt and the user prompt, and we can take sampling uncertainty out of the equation as far as possible, e.g. by setting the temperature low.
The final results are very interesting!
Our hypotheses can be stated as follows:
- Given the same prompt, without any additional context, and with the temperature kept close to zero, we should nearly always obtain the same output, with stable log probabilities. While people refer to LLMs as “stochastic”, for a given input an LLM should always generate the same output; the “hallucinations” or variance come from the sampling mechanism outside of the LLM, which we can dampen significantly by setting a very low temperature value (see the illustrative sketch after this list).
- Based on our random user tests with ChatGPT, we would expect both the original prompt and the re-worded version to produce an incorrect answer 50% of the time; in other words, without further disambiguation or context, we would not expect one prompt to perform better than the other.
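To make the first hypothesis concrete, here is a minimal toy sketch (not part of our experiment code) of how temperature rescales a model’s output distribution before sampling; the logit values are made up purely for illustration. As the temperature approaches zero, nearly all probability mass collapses onto the highest-scoring token, so sampling becomes close to deterministic.

```python
import numpy as np

def temperature_softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Turn raw logits into a sampling distribution at a given temperature."""
    scaled = logits / temperature
    scaled -= scaled.max()          # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# Toy logits for three hypothetical candidate answers
logits = np.array([2.0, 1.5, 0.1])

for t in (1.0, 0.5, 0.1):
    print(t, temperature_softmax(logits, t).round(3))
# At t=1.0 the distribution is relatively flat; at t=0.1 almost all of the
# probability mass sits on the single highest-scoring candidate.
```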
For our experiment design, we do the following:
- We conduct a number of experiments, starting with the original prompt, followed by a series of “interventions”
- For each experiment/intervention, we execute 1,000 trials
- We use OpenAI’s most advanced GPT-4o model
- We set the temperature to 0.1 to essentially eliminate the randomness caused by sampling; we experiment with both a random seed and a fixed seed
- To gauge the “confidence” of the answer, we collect the log probability and calculate the linear probability of the answer in each trial; we then plot the Kernel Density Estimate (KDE) of the linear probabilities across the 1,000 trials for each experiment (a simplified sketch of a single trial follows this list)
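The full experiment code is linked below; the following is only a simplified sketch of a single trial, assuming the official `openai` Python SDK. The way the linear probability is derived here (summing the returned token log probabilities and exponentiating) is an illustrative choice and may differ in detail from the linked code.

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_trial(user_prompt: str,
              system_prompt: str | None = None,
              seed: int | None = None) -> tuple[str, float]:
    """Run one trial and return the answer text and its linear probability."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.1,   # dampen sampling randomness
        seed=seed,         # fixed seed in the "fixed seed" runs, None otherwise
        logprobs=True,     # return per-token log probabilities
    )
    choice = response.choices[0]
    answer = choice.message.content
    # Sum token log probabilities and exponentiate to get a linear probability.
    total_logprob = sum(t.logprob for t in choice.logprobs.content)
    return answer, math.exp(total_logprob)

# Example: 1,000 trials of the original prompt (experiment "A").
results = [run_trial("9.11 or 9.9 — which one is bigger?") for _ in range(1_000)]
```

The linear probabilities collected this way can then be fed into a KDE plot (e.g. via `seaborn.kdeplot`) to visualize the “confidence” distribution for each experiment.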
The full code for our experimental design is available here.
The user prompt is set to “9.11 or 9.9 — which one is bigger?”.
In line with what social media users have reported, GPT-4o gives the correct answer only 55% of the time ☹️. The model is also not very certain: over a large number of trials, its “confidence” in the answer is around 80%.
In the re-worded user prompt, no additional context/disambiguation is provided, but the wording is slightly changed to: “Which one is bigger, 9.11 or 9.9?”
Amazingly, and contrary to our ChatGPT user tests, the correct answer is reached 100% of the time across 1,000 trials. Furthermore, the model exhibits very high confidence in its answer 🤔.
There has been significant recent work on inducing improved “reasoning” capabilities in LLMs, with chain-of-thought (CoT) prompting being the most popular approach. Huang et al. have published a very comprehensive survey on LLM reasoning capabilities.
As such, we modify the original user prompt by also telling the LLM to explain its reasoning. Interestingly enough, the probability of a correct answer improves to 62%; however, the answers come with even higher uncertainty.
The final experiment is the same as experiment “C”; however, we instead bootstrap the system prompt by telling the LLM to “explain its reasoning”. Incredibly, we now see the correct answer 100% of the time, with very high confidence. We see identical results if we use the re-worded user prompt as well.
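The exact instruction wording is in the linked code; the snippet below is only an illustrative sketch of how the two configurations differ, and the system-prompt wording shown is our assumption. Each message list would be sent 1,000 times via the API call sketched earlier.

```python
# Experiment "C": the reasoning instruction is appended to the user prompt.
messages_c = [
    {"role": "user",
     "content": "9.11 or 9.9 — which one is bigger? Explain your reasoning."},
]

# Experiment "D": the reasoning instruction is bootstrapped into the system prompt.
messages_d = [
    {"role": "system", "content": "Explain your reasoning."},
    {"role": "user", "content": "9.11 or 9.9 — which one is bigger?"},
]
```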
What started off as a simple experiment to validate some of the statements seen on social media ended up yielding some very interesting findings. Let’s summarize the key takeaways:
- For an identical prompt, with both the temperature set very low (essentially eliminating sampling uncertainty) and a fixed seed value, we still see very large variance in log probabilities. Slight variance could be explained by hardware precision, but variance this large is very difficult to explain. It indicates that either (1) the sampling mechanism is a LOT more complicated, or (2) there are additional layers/models upstream beyond our control.
- Consistent with prior literature, simply instructing the LLM to “explain its reasoning” improves its performance.
- There is clearly distinct handling of the system prompt versus the user prompt. Bootstrapping a task in the system prompt, rather than the user prompt, seems to result in significantly better performance.
- We can clearly see how brittle prompts can be. The key takeaway here is that we should always aim to provide disambiguation and clear context in our prompts.