While building my own LLM-based application, I found many prompt engineering guides, but few equivalent guides for figuring out the temperature setting.
Of course, temperature is a simple numerical value, while prompts can get mind-blowingly complex, so it may feel trivial as a product decision. Still, choosing the right temperature can dramatically change the nature of your outputs, and anyone building a production-quality LLM application should choose temperature values with intention.
In this post, we'll explore what temperature is and the math behind it, its potential product implications, and how to choose the right temperature for your LLM application and evaluate it. By the end, I hope you'll have a clear plan of action for finding the right temperature for every LLM use case.
What is temperature?
Temperature is a number that controls the randomness of an LLM's outputs. Most APIs limit the value to a range from 0 to 1, or something similar, to keep the outputs within semantically coherent bounds.
From OpenAI’s documentation:
"Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic."
Intuitively, it's like a dial that adjusts how "explorative" or "conservative" the model is when it spits out an answer.
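In practice, temperature is just a request parameter. Here's a minimal sketch of setting it, assuming the OpenAI Python SDK (the model and prompt below are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Suggest a gift for a pickleball fan."}],
    temperature=0.2,  # lower values: more focused, more deterministic outputs
)
print(response.choices[0].message.content)
```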
What do these temperature values mean?
Personally, I find the math behind temperature very interesting, so I'll dive into it. But if you're already familiar with the innards of LLMs, or you're just not interested, feel free to skip this section.
You probably know that an LLM generates text by predicting the next token after a given sequence of tokens. In its prediction process, it assigns probabilities to all possible tokens that could come next. For example, if the sequence passed to the LLM is "The giraffe ran over to the…", it might assign high probabilities to words like "tree" or "fence" and lower probabilities to words like "house" or "book".
But let's back up a bit. How do these probabilities come to be?
These probabilities usually come from raw scores, known as logits, which are the result of many, many neural network calculations and other machine learning techniques. These logits are gold: they contain all the valuable information about which tokens could be chosen next. The problem with logits is that they don't fit the definition of a probability. They can be any number, positive or negative, like 2, or -3.65, or 20. They're not necessarily between 0 and 1, and they don't necessarily all add up to 1 like a nice probability distribution.
So, to make these logits usable, we need a function that transforms them into a clean probability distribution. The function typically used here is called the softmax, and it's essentially an elegant equation that does two important things:
- It turns all the logits into positive numbers.
- It scales the logits so they add up to 1.
![](https://towardsdatascience.com/wp-content/uploads/2025/02/1_0exHZ6IptxPE3OdpKzIRSQ.webp)
The softmax function works by taking each logit, raising e (around 2.718) to the power of that logit, and then dividing by the sum of all those exponentials. The highest logit still gets the highest numerator, which means it gets the highest probability. But other tokens, even those with negative logit values, still get a chance.
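To make that concrete, here's a minimal sketch of the plain softmax in Python (the logit values are made up, echoing the examples above):

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits)     # raise e to each logit: every result is positive
    return exps / exps.sum()  # normalize so the probabilities add up to 1

logits = np.array([2.0, -3.65, 0.5])  # made-up raw scores for three candidate tokens
print(softmax(logits))  # ~[0.82, 0.00, 0.18]: the highest logit gets the highest probability,
                        # but even the negative logit keeps a nonzero chance
```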
Now here's where temperature comes in: temperature modifies the logits before the softmax is applied. The formula for softmax with temperature is:
![](https://towardsdatascience.com/wp-content/uploads/2025/02/1_DFbjtZFdwmuMjSgHIgTooA.webp)
When the temperature is low, dividing the logits by T makes the values larger and more spread out. The exponentiation then makes the highest value much larger than the others, so the probability distribution becomes more uneven. The model has a higher chance of picking the most probable token, resulting in a more deterministic output.
When the temperature is high, dividing the logits by T makes all the values smaller and closer together, spreading the probability distribution out more evenly. This means the model is more likely to pick less probable tokens, increasing randomness.
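Here's a rough sketch of that effect in Python, again with made-up logits, showing how dividing by the temperature before the softmax reshapes the distribution:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature  # T < 1 spreads the logits apart, T > 1 pulls them together
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(logits, 0.2))  # ~[0.99, 0.01, 0.00]: nearly deterministic
print(softmax_with_temperature(logits, 1.0))  # ~[0.66, 0.24, 0.10]: plain softmax
print(softmax_with_temperature(logits, 2.0))  # ~[0.50, 0.30, 0.19]: flatter, more random
```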
How to choose temperature
Of course, the best way to choose a temperature is to play around with it. I believe any temperature, like any prompt, should be substantiated with example runs and evaluated against other possibilities. We'll discuss that in the next section.
But before we dive into that, I want to highlight that temperature is a crucial product decision, one that can significantly influence user behavior. It may seem rather straightforward to choose: lower for more accuracy-based applications, higher for more creative applications. But there are tradeoffs in both directions, with downstream consequences for user trust and usage patterns. Here are some subtleties that come to mind:
- Low temperatures can make the product feel authoritative. More deterministic outputs can create the illusion of expertise and foster user trust. However, this can also produce gullible users. If responses are always confident, users might stop critically evaluating the AI's outputs and just blindly trust them, even when they're wrong.
- Low temperatures can reduce decision fatigue. If you see one strong answer instead of many options, you're more likely to take action without overthinking. This might lead to easier onboarding or lower cognitive load while using the product. Conversely, high temperatures could create more decision fatigue and lead to churn.
- High temperatures can encourage user engagement. The unpredictability of high temperatures can keep users curious (think variable rewards), leading to longer sessions or increased interactions. Conversely, low temperatures might create stagnant user experiences that bore users.
- Temperature can affect the way users refine their prompts. When answers are unexpected at high temperatures, users may be driven to clarify their prompts. At low temperatures, users may instead be forced to add more detail or expand on their prompts in order to get new answers.
These are broad generalizations, and of course every specific application has many more nuances. But in most applications, temperature can be a powerful variable to adjust in A/B testing, something to consider alongside your prompts.
Evaluating different temperatures
As developers, we're used to unit testing: defining a set of inputs, running those inputs through a function, and getting a set of expected outputs. We sleep soundly at night when we make sure that our code is doing what we expect it to do and that our logic satisfies some clear-cut constraints.
The promptfoo package lets you perform the LLM-prompt equivalent of unit testing, but there's some added nuance. Because LLM outputs are non-deterministic and often designed for more creative tasks than strictly logical ones, it can be hard to define what an "expected output" looks like.
Defining your "expected output"
The simplest evaluation tactic is to have a human rate how good they think some output is, according to some rubric. For outputs where you're looking for a certain "vibe" you can't express in words, this will probably be the most effective method.
Another simple evaluation tactic is to use deterministic metrics: things like "does the output contain a certain string?", "is the output valid JSON?", or "does the output satisfy this JavaScript expression?". If your expected output can be expressed in these ways, promptfoo has your back.
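As a rough sketch, a hypothetical promptfoo test case using a couple of these deterministic assertion types might look like this (the variable and the expected values are made up):

```yaml
tests:
  - vars:
      topic: "giraffes"
    assert:
      - type: contains      # does the output contain a certain string?
        value: "giraffe"
      - type: javascript    # does the output satisfy this JavaScript expression?
        value: output.length < 500
```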
A more interesting, AI-age evaluation tactic is to use LLM-graded checks. These essentially use LLMs to evaluate your LLM-generated outputs, and they can be quite effective when used properly. Promptfoo offers these model-graded metrics in several forms. The full list is here, and it contains assertions ranging from "is the output relevant to the original query?" to "compare the different test cases and tell me which one is best!" to "where does this output rank on this rubric I defined?".
Example
Let's say I'm making a consumer-facing application that comes up with creative gift ideas, and I want to empirically determine which temperature I should use with my main prompt.
I'd want to evaluate metrics like relevance, originality, and feasibility within a certain budget, and make sure I'm picking the temperature that best optimizes those factors. If I'm comparing GPT-4o-mini's performance with temperatures of 0 vs. 1, my test file might start like this:
providers:
  - id: openai:gpt-4o-mini
    label: openai-gpt-4o-mini-lowtemp
    config:
      temperature: 0
  - id: openai:gpt-4o-mini
    label: openai-gpt-4o-mini-hightemp
    config:
      temperature: 1
prompts:
  - "Come up with a one-sentence creative gift idea for a person who is {{persona}}. It should cost under {{budget}}."

tests:
  - description: "Mary - attainable, under budget, original"
    vars:
      persona: "a 40 year old woman who loves natural wine and plays pickleball"
      budget: "$100"
    assert:
      - type: g-eval
        value:
          - "Check if the gift is actually attainable and reasonable"
          - "Check if the gift is likely under $100"
          - "Check if the gift would be considered original by the average American adult"
  - description: "Sean - answer relevance"
    vars:
      persona: "a 25 year old man who rock climbs, goes to raves, and lives in Hayes Valley"
      budget: "$50"
    assert:
      - type: answer-relevance
        threshold: 0.7
I'll probably want to run the test cases repeatedly to compare the effects of temperature changes across multiple same-input runs. In that case, I'd use the repeat param, like:
promptfoo eval --repeat 3
![](https://towardsdatascience.com/wp-content/uploads/2025/02/1_N7gmNKtkilshi76E3WRmIg.webp)
Conclusion
Temperature is a simple numerical parameter, but don't be deceived by its simplicity: it can have far-reaching implications for any LLM application.
Tuning it just right is key to getting the behavior you want. Too low, and your model plays it too safe; too high, and it starts spouting unpredictable responses. With tools like promptfoo, you can systematically test different settings and find your Goldilocks zone: not too cold, not too hot, but just right.