Methodology
This page explains how MCGrad works, from the mathematical foundations of multicalibration to the specific algorithm design that makes it scalable and safe for production. For additional details, see the full paper [1].
1. Introduction: Beyond Global Calibration
A machine learning model is calibrated if its predicted probabilities match the true frequency of outcomes. When a model predicts 0.8 for a set of events, 80% of those events should actually occur.
However, global calibration is often insufficient. A model can be calibrated on average while being systematically miscalibrated for specific segments (sometimes called protected groups), for example overestimating risk for one demographic while underestimating it for another. These local miscalibrations cancel each other out in the global average, masking the problem.
Multicalibration addresses this by requiring the model to be calibrated not just globally, but simultaneously across all meaningful segments (e.g., defined by age, region, device, etc.). Multicalibration has several application areas, including matchings [2].
2. Formal Definitions
Let $X$ be the input features and $Y \in \{0, 1\}$ be the binary target. A predictor $f(X)$ estimates $P(Y = 1 \mid X)$. We use the capital $F$ to denote the logit of $f$, i.e., $F(X) = \log \frac{f(X)}{1 - f(X)}$.
Calibration
A predictor $f$ is perfectly calibrated if, for all $p \in [0, 1]$:
$$P(Y = 1 \mid f(X) = p) = p.$$
In practice, miscalibration is measured using metrics like ECCE (Estimated Cumulative Calibration Error) [3], which quantifies the deviation between predictions and outcomes without relying on arbitrary binning.
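To make the idea concrete, here is a minimal Python sketch of a binning-free, cumulative-difference calibration statistic in the spirit of ECCE. It is not the exact estimator from [3]; the function name and the synthetic data are illustrative only.

```python
import numpy as np

def cumulative_calibration_gap(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Maximum absolute cumulative gap between outcomes and predictions.

    Sort examples by predicted probability, accumulate the per-example
    residuals (y - p), and report the largest absolute value of that
    running sum, normalized by the sample size. A well-calibrated model
    keeps this running sum close to zero everywhere.
    """
    order = np.argsort(y_pred)
    residuals = y_true[order] - y_pred[order]
    cumulative = np.cumsum(residuals) / len(y_true)
    return float(np.max(np.abs(cumulative)))

# Example: a model that systematically over-predicts shows a large gap.
rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.9, size=10_000)
y = rng.binomial(1, p * 0.8)          # true rate is lower than predicted
print(cumulative_calibration_gap(y, p))
```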
Multicalibration
We define a collection $\mathcal{S}$ of segments (e.g., "users in US", "users on Android"). A predictor $f$ is multicalibrated with respect to $\mathcal{S}$ if it is calibrated on every segment $S \in \mathcal{S}$.
Formally, for all segments $S \in \mathcal{S}$ and all prediction values $p$, the conditional probability should match the prediction:
$$P(Y = 1 \mid f(X) = p,\; X \in S) = p.$$
This is equivalent to requiring the expected residual to be zero:
$$\mathbb{E}[\,Y - f(X) \mid f(X) = p,\; X \in S\,] = 0.$$
This residual formulation connects directly to the gradient-based approach in MCGrad (see Insight 3 below).
Existing algorithms often require manual specification of $\mathcal{S}$, i.e., the segments to calibrate. MCGrad relaxes this requirement: you need only provide the features, and MCGrad implicitly calibrates across all segments definable by decision trees on those features.
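For intuition, the following sketch shows what checking calibration against a hand-specified collection $\mathcal{S}$ looks like in practice. The DataFrame columns and segment definitions are hypothetical, and the check only inspects the overall residual per segment rather than the residual within every prediction interval.

```python
import numpy as np
import pandas as pd

# Illustrative data: columns and segment definitions are hypothetical.
df = pd.DataFrame({
    "country": ["UK", "US", "UK", "US"] * 2500,
    "os": ["Android", "iOS"] * 5000,
    "y": np.random.default_rng(1).binomial(1, 0.3, size=10_000),
    "pred": np.random.default_rng(2).uniform(0.1, 0.5, size=10_000),
})

# A hand-specified collection S of segments (boolean masks over the data).
segments = {
    "UK users": df["country"] == "UK",
    "Android users": df["os"] == "Android",
    "UK Android users": (df["country"] == "UK") & (df["os"] == "Android"),
}

# Mean residual E[Y - f(X)] per segment: multicalibration asks this to be
# (near) zero for every segment, and within every prediction range.
for name, mask in segments.items():
    residual = (df.loc[mask, "y"] - df.loc[mask, "pred"]).mean()
    print(f"{name:>20s}: mean residual = {residual:+.4f}")
```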
3. The MCGrad Algorithm
MCGrad achieves multicalibration by iteratively training a Gradient Boosted Decision Tree (GBDT) model. The core innovation follows three key insights.
Insight 1: Segments as Decision Trees
A segment in data is typically defined by intersections of attributes (e.g., "People aged 35–50 in the UK").
- Categorical attributes are simple equivalences: `country == 'UK'`.
- Numerical attributes are intervals, i.e., intersections of inequalities: `35 <= age <= 50`, equivalent to `age >= 35 AND age <= 50`.
This structure corresponds exactly to a leaf in a Decision Tree. Any meaningful segment of the data can be represented (or approximated) by a leaf in a sufficiently deep tree.
Implication: The abstract space of all possible segments can be replaced with the space of Decision Trees. Instead of iterating over infinite segments, we search for decision trees.
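As an illustration of the "segments are leaves" idea (not part of MCGrad itself), the sketch below fits a shallow scikit-learn regression tree to synthetic residuals. The miscalibrated segment shows up as the leaf with a clearly non-zero mean residual; the feature names and the planted miscalibration are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 10_000

# Illustrative features: age (numerical) and a country code (categorical,
# integer-encoded for simplicity; 0 plays the role of 'UK').
age = rng.uniform(18, 80, size=n)
country = rng.integers(0, 5, size=n)
X = np.column_stack([age, country])

# Synthetic residuals: the segment "aged 35-50 in country 0" is
# systematically under-predicted by +0.05; everything else is calibrated.
residual = np.where((age >= 35) & (age <= 50) & (country == 0), 0.05, 0.0)
residual = residual + rng.normal(0, 0.02, size=n)   # noise

# A shallow regression tree on the residuals: its leaves are exactly
# intersections of attribute equalities/intervals, i.e., segments.
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=200).fit(X, residual)
leaf_id = tree.apply(X)
for leaf in np.unique(leaf_id):
    mask = leaf_id == leaf
    print(f"leaf {leaf}: n={mask.sum():5d}, mean residual={residual[mask].mean():+.3f}")
```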
Insight 2: Augmenting the Feature Space
Multicalibration requires checking calibration for every segment and every prediction (e.g., "Prediction equals 0.8"). Without loss of generality, this definition can be relaxed to intervals: "Prediction is in the range [0.7, 0.9]". Instead of treating "prediction intervals" as a separate constraint, the prediction itself becomes a feature.
Implication: By training trees on the augmented features $(X, f(X))$, the tree learning algorithm naturally finds splits like:
```
if country == 'UK' AND prediction >= 0.7 AND prediction <= 0.9: ...
```
This split automatically isolates a specific segment in a specific prediction range. No special logic for "intervals" is needed—the decision tree handles it natively.
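Concretely, the augmentation is a single column append; the helper name below is illustrative.

```python
import numpy as np

def augment_features(X: np.ndarray, base_pred: np.ndarray) -> np.ndarray:
    """Append the base model's prediction f(X) as an extra feature column.

    With this column present, tree splits such as
    "prediction >= 0.7 AND prediction <= 0.9" isolate prediction ranges
    exactly like splits on any other numerical feature.
    """
    return np.column_stack([X, base_pred])
```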
Insight 3: Transformation to Regression
With the first two insights, we have a target (the residual $Y - f(X)$) and an augmented feature space $(X, f(X))$. The goal is to train a model that finds regions in this space where the residual is non-zero.
Gradient Boosted Decision Trees (GBDTs) work by iteratively training trees to predict the negative gradient of a loss function. For binary classification using the log-loss $\ell$ (written in terms of the logit $F$), the negative gradient is mathematically identical to the residual:
$$-\frac{\partial \ell}{\partial F(X)} = Y - f(X).$$
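For completeness, this identity follows by writing the per-example log-loss in terms of the logit $F$ with $\sigma$ the sigmoid:
$$\ell(F, Y) = -\bigl[\,Y \log \sigma(F) + (1 - Y)\log\bigl(1 - \sigma(F)\bigr)\,\bigr] = \log\bigl(1 + e^{F}\bigr) - Y F,
\qquad
\frac{\partial \ell}{\partial F} = \frac{e^{F}}{1 + e^{F}} - Y = \sigma(F) - Y = f(X) - Y,$$
so the negative gradient is exactly the residual $Y - f(X)$.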
Implication: No specialized "multicalibration algorithm" is needed. A standard GBDT trained to minimize log-loss on the augmented features suffices (see the sketch after this list):
- The GBDT automatically searches for trees that predict the residual $Y - f(X)$.
- If it finds a tree with a non-zero prediction, it has found a miscalibrated segment.
- The boosting update then corrects the model in that exact direction.
- When the boosting algorithm can no longer find trees that improve the loss, no efficiently discoverable segments remain miscalibrated.
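A minimal sketch of one such round using LightGBM's standard training API on prediction-augmented features. It assumes a feature matrix `X`, binary labels `y`, and base-model logits `F_base`; the hyperparameter values are illustrative, and this is a sketch rather than the reference MCGrad implementation.

```python
import numpy as np
import lightgbm as lgb

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_one_round(X, y, F_base, num_trees=50):
    """One round (sketch): fit a log-loss GBDT on prediction-augmented
    features, starting from the base model's logits."""
    f_base = sigmoid(F_base)
    X_aug = np.column_stack([X, f_base])          # Insight 2: prediction as a feature

    params = {
        "objective": "binary",                    # log-loss => trees fit Y - f (Insight 3)
        "learning_rate": 0.05,                    # illustrative values
        "min_sum_hessian_in_leaf": 10.0,          # min-Hessian guardrail (see Section 4)
        "verbosity": -1,
    }
    train_set = lgb.Dataset(X_aug, label=y, init_score=F_base)
    booster = lgb.train(params, train_set, num_boost_round=num_trees)

    # LightGBM's predict() does not include init_score, so the raw output
    # is exactly the additive logit correction h(X, f(X)).
    correction = booster.predict(X_aug, raw_score=True)
    return booster, F_base + correction           # updated logits
```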
Iterative Process
Correcting calibration in one region can create miscalibration in other regions. This occurs because the post-processed model makes different predictions than the base model, so conditioning on intervals over the new predictions is not the same as conditioning on intervals over the original $f(X)$. Therefore, MCGrad proceeds in rounds:
Algorithm:
- Initialize: Start with the base model's logits, $F_0(X) = F(X)$.
- Iterate: For round $t = 1, 2, \ldots$:
  - Train a GBDT $h_t$ to predict the residual $Y - f_{t-1}(X)$ using features $(X, f_{t-1}(X))$.
  - Update the model: $F_t(X) = F_{t-1}(X) + h_t(X, f_{t-1}(X))$.
  - Update predictions: $f_t(X) = \sigma(F_t(X)) = 1 / (1 + e^{-F_t(X)})$.
- Converge: Stop when performance on a validation set no longer improves.
This recursive structure ensures progressive elimination of calibration errors until the model is multicalibrated.
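Below is a simplified sketch of this outer loop with validation-based stopping. The names and hyperparameters are illustrative, the logit rescaling step described in Section 4 is omitted for brevity, and this is not the reference implementation.

```python
import numpy as np
import lightgbm as lgb

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, F):
    # Per-example log-loss in terms of the logit F: log(1 + e^F) - y*F.
    return float(np.mean(np.logaddexp(0.0, F) - y * F))

def mcgrad_like_rounds(X_tr, y_tr, F_tr, X_va, y_va, F_va,
                       max_rounds=10, trees_per_round=50):
    """Multi-round post-processing loop (sketch): each round refits on
    features augmented with the *current* predictions, and validation
    loss decides when to stop."""
    boosters = []
    best_va_loss = log_loss(y_va, F_va)           # round 0 = base model

    for _ in range(max_rounds):
        # Augment with the current round's predictions.
        aug_tr = np.column_stack([X_tr, sigmoid(F_tr)])
        aug_va = np.column_stack([X_va, sigmoid(F_va)])

        params = {"objective": "binary", "learning_rate": 0.05,
                  "min_sum_hessian_in_leaf": 10.0, "verbosity": -1}
        booster = lgb.train(params,
                            lgb.Dataset(aug_tr, label=y_tr, init_score=F_tr),
                            num_boost_round=trees_per_round)

        cand_F_tr = F_tr + booster.predict(aug_tr, raw_score=True)
        cand_F_va = F_va + booster.predict(aug_va, raw_score=True)

        va_loss = log_loss(y_va, cand_F_va)
        if va_loss >= best_va_loss:               # early-stopping guardrail:
            break                                 # keep the previous (possibly base) model
        best_va_loss = va_loss
        F_tr, F_va = cand_F_tr, cand_F_va
        boosters.append(booster)

    return boosters, F_tr, F_va
```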
4. Scalability and Safety Design
MCGrad is designed for production systems serving millions of predictions. It includes several optimizations:
Efficient Implementation
Rather than custom boosting logic, MCGrad delegates to LightGBM, a highly optimized GBDT library. This ensures:
- Speed: Training and inference are extremely fast.
- Scalability: Large datasets and high-dimensional feature spaces are handled efficiently.
Logit Rescaling
GBDTs use a small learning rate to prevent overfitting, but this can result in under-confident updates that require many trees to correct. MCGrad adds a rescaling step after each round: it learns a single scalar multiplier to scale the update optimally.
This approach drastically reduces the number of rounds needed for convergence.
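A sketch of one way such a scalar could be fit, by minimizing log-loss over a single multiplier with SciPy. The function name, bounds, and use of `minimize_scalar` are illustrative assumptions, not MCGrad's exact procedure.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_update_scale(y, F_prev, h, lo=0.0, hi=20.0):
    """Find a single scalar alpha so that F_prev + alpha * h minimizes
    the log-loss (sketch of a post-round rescaling step)."""
    def loss(alpha):
        F = F_prev + alpha * h
        # log(1 + e^F) - y*F, computed stably.
        return float(np.mean(np.logaddexp(0.0, F) - y * F))
    result = minimize_scalar(loss, bounds=(lo, hi), method="bounded")
    return result.x

# Usage (illustrative): rescale the round's logit correction h before applying it.
# alpha = fit_update_scale(y_train, F_prev, h_train)
# F_new = F_prev + alpha * h_train
```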
Safety Guardrails
Post-processing methods risk overfitting, potentially harming the base model. MCGrad prevents this through:
- Early Stopping: Validation loss is tracked after every round. Training stops immediately when the loss ceases to improve. If the first round already harms performance, MCGrad returns the original model $f$ unchanged.
- Min-Hessian Regularization: Augmenting the data with the previous round's model necessarily gives rise to regions of the augmented feature space that are particularly prone to overfitting. As the predicted probabilities approach 0 or 1, the log-loss Hessian in these regions becomes small. Enforcing a minimum Hessian per leaf ensures that these regions are no longer considered (a small numeric illustration follows this list).
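The numeric illustration below shows why a minimum-Hessian constraint targets exactly these regions: for log-loss in logit space, each example contributes $f(X)\,(1 - f(X))$ to a leaf's Hessian sum, which vanishes as predictions approach 0 or 1. The values are purely illustrative; in LightGBM the corresponding knob is `min_sum_hessian_in_leaf`.

```python
import numpy as np

# Per-example Hessian of the log-loss with respect to the logit: f * (1 - f).
# Near-certain predictions contribute almost nothing to a leaf's summed
# Hessian, so a minimum-Hessian threshold keeps trees from "correcting"
# regions with predictions close to 0 or 1 on very weak evidence.
f = np.array([0.001, 0.01, 0.1, 0.5, 0.9, 0.99, 0.999])
hessian = f * (1.0 - f)
for p, h in zip(f, hessian):
    print(f"f = {p:5.3f}  ->  hessian f(1-f) = {h:.6f}")
```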
References
[1] Tax, N., Perini, L., Linder, F., Haimovich, D., Karamshuk, D., Okati, N., Vojnovic, M., & Apostolopoulos, P. A. MCGrad: Multicalibration at Web Scale. Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (KDD 2026). DOI: 10.1145/3770854.3783954
[2] Baldeschi, R. C., Di Gregorio, S., Fioravanti, S., Fusco, F., Guy, I., Haimovich, D., Leonardi, S., Linder, F., Perini, L., Russo, M., & Tax, N. Multicalibration yields better matchings. arXiv preprint, 2025.
[3] Arrieta-Ibarra, I., Gujral, P., Tannen, J., Tygert, M., & Xu, C. Metrics of calibration for probabilistic predictions. Journal of Machine Learning Research, 23(351), 1-54, 2022.