AIhub.org
 

Improving calibration by relating focal loss, temperature scaling, and properness


by Viacheslav Komisarenko
28 November 2024




In machine learning classification tasks, achieving high accuracy is only part of the goal; it’s equally important for models to express how confident they are in their predictions – a concept known as model calibration. Well-calibrated models provide probability estimates that closely reflect the true likelihood of outcomes, which is critical in domains like healthcare, finance, and autonomous systems, where decision-making relies on trustworthy predictions. A key factor influencing both the accuracy and calibration of a model is the choice of the loss function during training. The loss function guides the model on how to learn from data by penalizing prediction errors in a particular way. In this blog post, we will explore how to choose a loss function to achieve good calibration, focusing on the recently proposed focal loss and trying to understand why it leads to relatively well-calibrated models.

For multiclass classification problems, commonly used losses include cross-entropy (log-loss) and mean squared error (Brier score). These loss functions are strictly proper, meaning they are uniquely minimized when the predicted probabilities exactly match the ground truth probabilities. To illustrate properness, consider a training set where 10 instances have identical feature values, with 8 being actual positives and 2 negatives. When predicting the probability of a positive outcome for a new instance with the same features, it is logical to predict 0.8, and that’s precisely what proper losses incentivize the model to predict.
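The 10-instance example above can be checked directly. The following sketch (plain NumPy, not from the paper) confirms that both cross-entropy and the Brier score are minimized at the empirical frequency 0.8:

```python
import numpy as np

# Ten instances with identical features: 8 positives, 2 negatives.
y = np.array([1] * 8 + [0] * 2, dtype=float)

def mean_log_loss(p):
    # Cross-entropy of predicting probability p for every instance.
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def mean_brier(p):
    # Brier score (mean squared error on probabilities).
    return np.mean((p - y) ** 2)

# Both strictly proper losses are minimized at the true frequency 0.8.
grid = np.linspace(0.01, 0.99, 9801)
best_log = grid[np.argmin([mean_log_loss(p) for p in grid])]
best_brier = grid[np.argmin([mean_brier(p) for p in grid])]
print(best_log, best_brier)  # both ≈ 0.8
```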

In practice, models trained with strictly proper losses such as cross-entropy and Brier score are indeed well-calibrated, but only on the training set. On the test set, these models are often overconfident. This means that the predicted probabilities are systematically higher than the true likelihoods; the model assigns excessive confidence to its predictions, potentially resulting in misguided decisions based on these inflated probabilities.

A common approach to reducing overconfidence in a model’s predictions on a test set is to apply post-hoc calibration. This involves learning a transformation on a validation set that adjusts the model’s predicted probabilities to better align with true probabilities without altering the model itself.

Temperature scaling is one of the simplest post-hoc calibration methods. It introduces a single scalar parameter called the temperature that scales the logits (pre-softmax outputs) before applying the softmax function to obtain probabilities. The temperature is learned by minimizing some loss (typically a proper loss such as cross-entropy) of the scaled predictions on the validation set.
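As a minimal sketch of this procedure (not from the paper; the grid-search range is an illustrative choice), the single temperature parameter can be fitted by minimizing cross-entropy on held-out logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    # Cross-entropy (a proper loss) of the temperature-scaled predictions.
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

def fit_temperature(val_logits, val_labels):
    # Single scalar parameter: grid search for the T that minimizes
    # validation cross-entropy, as described above.
    grid = np.linspace(0.25, 5.0, 476)
    losses = [nll(val_logits, val_labels, T) for T in grid]
    return grid[int(np.argmin(losses))]
```

For an overconfident model (logits inflated relative to the true log-probabilities), the fitted temperature comes out above 1, softening the predictions.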

Some loss functions can inherently produce quite well-calibrated performance on the test set without applying additional calibration methods. One such example is focal loss, which, despite not being a strictly proper loss function, has been observed to yield relatively well-calibrated probability estimates in practice. Understanding the reasons behind this phenomenon was a primary motivation for our research.

The focal loss, which incorporates a single non-negative hyperparameter \gamma, has the following expression:

(1)   \begin{equation*}     FL_{\gamma} (p) = -\log(p) \cdot (1-p)^\gamma \end{equation*}

Here, the term -\log(p) is identical to the cross-entropy loss, while the multiplicative term (1-p)^\gamma reduces the loss contribution from easy examples (those the model predicts correctly with high confidence) and puts more focus on difficult, misclassified instances. The argument p denotes the model’s predicted probability for the true class label.
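Equation (1) translates directly into code. As a small illustration (not from the paper), compare how focal loss with \gamma = 2 down-weights an easy, confidently-correct example relative to cross-entropy:

```python
import numpy as np

def focal_loss(p, gamma=2.0):
    # Eq. (1): cross-entropy -log(p) times the modulating factor (1-p)^gamma,
    # where p is the predicted probability of the true class.
    return -np.log(p) * (1.0 - p) ** gamma

# Easy example (true class predicted with probability 0.9) vs a hard one (0.1).
for p in (0.9, 0.1):
    ce = -np.log(p)
    fl = focal_loss(p)
    print(f"p={p}: cross-entropy={ce:.3f}, focal={fl:.3f}, ratio={fl/ce:.2f}")
```

The ratio focal/cross-entropy equals (1-p)^\gamma, so the easy example keeps only 1% of its cross-entropy loss while the hard example keeps 81%.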

In our paper, we showed that the focal loss can be decomposed into the composition of a proper loss L(p) and a confidence-raising transformation \phi(p) such that:

(2)   \begin{equation*}    FL_{\gamma} (p) = L(\phi(p)) \end{equation*}

Here, p represents the whole predicted probability vector. We will refer to \phi(p) as the focal calibration map. We showed that \phi(p) is a fixed, confidence-raising transformation (it depends only on the parameter \gamma). By confidence-raising, we mean that the highest probability in the predicted vector is mapped to an even higher value.

Figure 1 shows the proper part of the focal loss together with the focal calibration map:

Figure 1: Proper part of the focal loss and focal calibration map

Note that the proper part of the focal loss is much steeper than both the focal and cross-entropy losses. Also, technically, negative values of \gamma result in valid but not particularly meaningful expressions.

In the binary case, the focal calibration map has the following expression:

(3)   \begin{equation*}     \hat{p}(q) = \frac{1}{1 + \bigl(\frac{1-q}{q}\bigr)^{\gamma} \frac{(1-q)-\gamma q \cdot \log(q)}{q-\gamma (1-q) \cdot \log(1-q)}} \end{equation*}

Surprisingly, this expression, when applied to the logit before the sigmoid step, i.e. \hat{p}\bigl(\frac{1}{1+\exp(-s)}\bigr), is very close to the temperature scaling transformation:

Figure 2: Focal calibration and temperature scaling comparison

Moreover, we showed that the two transformations are not identical, but the focal calibration map with parameter \gamma can be bounded from below and above by two temperature scaling transformations with parameters T=\frac{1}{\gamma+1} and T=\frac{1}{\gamma+1 - \frac{\log(\gamma+1)}{2}}.

Figure 3: Focal calibration (with respect to logit before sigmoid step) bounded between two temperature scaling transformations
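As a numerical sketch (plain NumPy, with \gamma = 2 as an arbitrary choice, not code from the paper), the binary focal calibration map from Eq. (3) can be checked against the two bounding temperature scaling transformations:

```python
import numpy as np

def focal_calibration_map(q, gamma):
    # Binary focal calibration map \hat{p}(q) from Eq. (3).
    num = (1 - q) - gamma * q * np.log(q)
    den = q - gamma * (1 - q) * np.log(1 - q)
    return 1.0 / (1.0 + ((1 - q) / q) ** gamma * num / den)

def temp_scaled_sigmoid(s, T):
    # Temperature scaling of the logit s before the sigmoid step.
    return 1.0 / (1.0 + np.exp(-s / T))

gamma = 2.0
s = np.linspace(-6.0, 6.0, 241)        # logits
q = 1.0 / (1.0 + np.exp(-s))           # sigmoid probabilities
focal = focal_calibration_map(q, gamma)

# The two bounding temperatures stated in the text.
T1 = 1.0 / (gamma + 1.0)
T2 = 1.0 / (gamma + 1.0 - np.log(gamma + 1.0) / 2.0)
lo = np.minimum(temp_scaled_sigmoid(s, T1), temp_scaled_sigmoid(s, T2))
hi = np.maximum(temp_scaled_sigmoid(s, T1), temp_scaled_sigmoid(s, T2))
print(np.all(lo - 1e-9 <= focal) and np.all(focal <= hi + 1e-9))  # True
```

The check also makes the confidence-raising property visible: for positive logits, the mapped probability exceeds the original sigmoid probability.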

We believe that the focal loss decomposition into proper loss and confidence-raising calibration map could explain its well-calibrated performance on the test set.
Models trained with proper losses typically achieve good calibration on the training set but tend to be overconfident on the test set, and we would expect similar behaviour for the proper part of the focal loss (with respect to the probabilities after the focal calibration map). However, looking at the predicted probabilities before the confidence-raising transformation is applied, we would expect slightly underconfident probabilities on the training set and better-calibrated ones on the test set.

In the multiclass setting, focal calibration and temperature scaling are related but not as closely as in the binary case. To leverage the calibration properties of both transformations, we propose combining them into a new calibration technique called focal temperature scaling. This approach involves tuning two parameters – the temperature T and the focal parameter \gamma – to balance overconfidence and underconfidence in the model’s predictions. By optimizing these parameters on the validation set, we aim to achieve improved calibration performance.
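The combination can be sketched in the binary case as a joint search over (T, \gamma). This is an illustrative reconstruction, not the paper's implementation: the order of composition (temperature scaling first, then the focal calibration map) and the search grids are assumptions.

```python
import numpy as np

def focal_map(q, gamma):
    # Binary focal calibration map from Eq. (3); gamma = 0 gives the identity.
    num = (1 - q) - gamma * q * np.log(q)
    den = q - gamma * (1 - q) * np.log(1 - q)
    return 1.0 / (1.0 + ((1 - q) / q) ** gamma * num / den)

def focal_temperature_scale(logits, T, gamma):
    # Assumed composition: temperature-scale the logit, then apply the
    # focal calibration map to the resulting probability.
    q = 1.0 / (1.0 + np.exp(-logits / T))
    return focal_map(q, gamma)

def fit_focal_temperature(val_logits, val_labels):
    # Tune both parameters jointly by minimizing validation cross-entropy.
    best, best_loss = (1.0, 0.0), np.inf
    for T in np.linspace(0.5, 3.0, 26):
        for gamma in np.linspace(0.0, 3.0, 31):
            p = np.clip(focal_temperature_scale(val_logits, T, gamma),
                        1e-12, 1 - 1e-12)
            loss = -np.mean(val_labels * np.log(p)
                            + (1 - val_labels) * np.log1p(-p))
            if loss < best_loss:
                best, best_loss = (T, gamma), loss
    return best
```

Because the grid contains the identity transformation (T = 1, \gamma = 0), the fitted pair can never be worse on the validation set than leaving the predictions unchanged.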

Experiments on CIFAR-10, CIFAR-100, and TinyImageNet suggest that focal temperature scaling clearly outperforms standard temperature scaling and can be applied on top of other calibration methods, such as AdaFocal, to potentially enhance calibration even further.

Summary

  1. We showed how the decomposition of focal loss into a proper loss and a confidence-raising map can explain its good calibration on the test set.
  2. We showed a surprising connection between the focal calibration map and temperature scaling, especially in the binary case.
  3. We highlighted how composing this confidence-raising map with temperature scaling can improve model calibration.

This work was accepted at ECAI 2024. Read the paper in full:





Viacheslav Komisarenko is a PhD student at the University of Tartu
