The Temperature Concept Explained in Detail
To understand temperature, look at the final layer of a Transformer architecture during next-token prediction.

Step-by-Step Workflow

Logit Generation: For a given context, the final linear layer of the model outputs a vector of unnormalized raw scores, called logits (z), with one entry for every token in the vocabulary. If the vocabulary has 50,000 tokens, z is a 50,000-dimensional vector of real numbers.

Temperature Application: Before the logits are passed to the Softmax function, every logit in the vector is divided by the temperature parameter (T).

Softmax Normalization: The scaled logits are then passed through the Softmax function, which converts them into a valid probability distribution (p) whose entries sum to 1.

Mathematically, the probability p_i of selecting the i-th token from the vocabulary is defined as:

$$p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}$$

Where:
z_i is the raw logit score for token i.
T is the temperature parameter (T > 0).
The sum over j runs over every token in the vocabulary.

How Different T Values Alter the Math

When $T \to 0$ (Greedy Decoding / Deterministic): As T approaches zero, the gap between the highest logit and every other logit grows without bound after division. Because Softmax exponentiates its inputs, the token with the highest raw logit receives a probability approaching 1.0 while all others approach 0.0. T = 0 itself would mean dividing by zero, so implementations typically treat it as a special case and simply pick the highest-logit token. Sampling then becomes deterministic.

When T = 1.0 (Default Behavior): The logits are untouched. The probability distribution is exactly the one the model learned to produce during training, with whatever confidence, well calibrated or not, the model has in each token.

When T > 1.0 (High Variance / High Risk): Dividing raw logits by a number greater than 1 compresses their differences. For instance, if two logits are 8.0 and 4.0, dividing by T = 2.0 turns them into 4.0 and 2.0. The gap shrinks from 4.0 to 2.0, so the first token's odds advantage over the second falls from e^4 (about 54.6 to 1) to e^2 (about 7.4 to 1). After the Softmax step, the resulting probability distribution is flatter (higher entropy), and low-probability tokens get a non-trivial chance of being selected.
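The workflow above (divide the logits by T, then apply Softmax) can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation: the function names `softmax_with_temperature` and `entropy` are hypothetical, and the three-token logit vector is made up, reusing the 8.0 and 4.0 values from the example and adding a third logit of 1.0.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        # T = 0 would divide by zero; callers should use argmax instead.
        raise ValueError("temperature must be > 0")
    scaled = [z / temperature for z in logits]
    # Subtract the largest scaled logit before exponentiating. Softmax is
    # shift-invariant, so the result is unchanged, but overflow is avoided.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; a flatter distribution has higher entropy."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Assumed toy vocabulary of three tokens (illustrative values only).
logits = [8.0, 4.0, 1.0]

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: probs={[round(p, 4) for p in probs]} "
          f"entropy={entropy(probs):.4f}")
```

Running the loop shows the top token's probability approaching 1 at T = 0.1 and the distribution flattening (entropy rising) at T = 2.0. The max-subtraction step is a common numerical-stability trick and is an addition of this sketch; it does not change the formula.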
LIMITATIONS, TRADE-OFFS, & COMPOSITION

The post correctly highlights that high temperature leads to compounding errors. Because LLMs generate text autoregressively (token by token), every sampled token is appended to the context, so selecting a "risky," low-probability token at step N changes the input the model conditions on at step N+1. Once generation leaves the high-probability path, the logits at later steps are computed for a context the model would rarely have produced on its own, and they can shift substantially. A single bizarre token choice can therefore derail the rest of the generation: the model keeps producing fluent-sounding text, but it is now continuing from a premise that makes little sense.
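The compounding effect can be illustrated with a toy autoregressive loop. This is a sketch under loud assumptions, not a real language model: `toy_next_logits` is a hypothetical stand-in whose logits depend only on the previous token, tokens 0 to 3 form an intended cycle, and token 4 is an invented "off-track" token that, once emitted, mostly leads to itself again. All logit values are made up for illustration.

```python
import math
import random

VOCAB_SIZE = 5
OFF_TRACK = 4  # hypothetical token that marks a derailed generation


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # shift for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def toy_next_logits(context):
    """Toy 'model': logits depend only on the last token in the context."""
    last = context[-1]
    logits = [0.0] * VOCAB_SIZE
    if last == OFF_TRACK:
        logits[OFF_TRACK] = 6.0        # a derailed context favors staying derailed
    else:
        logits[(last + 1) % 4] = 6.0   # strongly prefer the next on-track token
        logits[OFF_TRACK] = 2.0        # weak but nonzero pull off track
    return logits


def generate(temperature, steps, rng):
    """Sample `steps` tokens, feeding each sampled token back as context."""
    context = [0]
    for _ in range(steps):
        probs = softmax_with_temperature(toy_next_logits(context), temperature)
        token = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        context.append(token)  # the choice at step N shapes step N+1
    return context


def derail_rate(temperature, steps=20, trials=2000, seed=0):
    """Fraction of sampled sequences that emit OFF_TRACK at least once."""
    rng = random.Random(seed)
    derailed = sum(
        OFF_TRACK in generate(temperature, steps, rng) for _ in range(trials)
    )
    return derailed / trials


for t in (0.5, 1.0, 2.0):
    print(f"T={t}: derail rate over 20 steps = {derail_rate(t):.3f}")
```

In this toy setup the per-step chance of leaving the path is small at low temperature, but it is paid again at every step, so the share of sequences that derail somewhere in 20 tokens climbs steeply as T rises. Real models are far more complex, and the exact rates here are a property of the made-up logits.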