Measuring Learning Efficiency via a Practical Information-Theoretic Approximation

Subhadeep Roy

The objective of this note is to develop a practical method for measuring the learning efficiency of predictive models relative to the intrinsic structure present in a dataset. In many machine learning settings, model performance is evaluated solely by accuracy or loss metrics, which makes it difficult to distinguish whether poor performance arises from model limitations or from the inherent unpredictability of the target variable given the available inputs. Information theory provides a principled way to characterize this distinction through the mutual information \(I(X ; Y)\) which quantifies the amount of information the input variables \(X\) contain about the target \(Y\). However, direct estimation of mutual information in high-dimensional settings is typically intractable. This work proposes a practical approximation that uses negative log-likelihood (NLL) under a strong and well-calibrated reference model as a proxy for the conditional entropy \(H(Y | X)\). This enables us to estimate a reasonable lower bound on the intrinsic learnability of a dataset, and a corresponding notion of model learning efficiency as the fraction of available information that a given model successfully extracts from that data.

1.1 Scope and ongoing work

The present formulation focuses on static datasets where samples are assumed to be independently drawn from a fixed distribution. This setting allows a clean estimation of intrinsic learnability via conditional entropy. Extending this framework to time-varying data where observations arrive sequentially and may exhibit temporal dependence is an active direction of ongoing work. In such settings, learnability becomes a property of the underlying stochastic process rather than a fixed dataset, requiring a causal formulation based on entropy-rates and history-dependent predictions.

2 Prior Art

Our formulation builds on classical information theory, where mutual information \(I(X;Y)=H(Y)-H(Y|X)\) characterizes the maximum information about labels that can, in principle, be extracted from inputs (Cover and Thomas 2006). Log-loss provides a strictly proper scoring rule for probabilistic prediction (Gneiting and Raftery 2007), and modern work on calibration evaluates the alignment between predicted probabilities and empirical outcomes (Guo et al. 2017).

Recent work on scaling laws studies how predictive performance varies with model size, data volume, and compute (Kaplan et al. 2020; Hoffmann et al. 2022). These approaches identify regimes of diminishing returns and characterize compute-optimal model sizes for a given dataset. Related work also introduces data quality and redundancy as factors that influence effective scaling behavior (Subramanyam et al. 2025).

Complementary perspectives examine tradeoffs between compression and prediction at the representation level (Tishby et al. 2000; Grünwald 2007), as well as systems-level efficiency, where architectural and algorithmic optimizations reduce memory and compute requirements during inference (Alizadeh et al. 2024).

Existing work does not explicitly separate the intrinsic information available in a dataset from the information actually extracted by a model. This distinction is central to our framework, which focuses on quantifying extraction efficiency relative to the dataset’s inherent predictability, rather than optimizing performance under compute or data constraints.

In practice, when intrinsic limits are unknown, strong models or ensembles are commonly used as empirical reference points. For example, teacher–student distillation frameworks use high-capacity models to define target behavior (Hinton et al. 2015), and scaling-law analyses evaluate performance across model families (Kaplan et al. 2020) rather than against a theoretical bound, which is often computationally intractable.

3 Limitations of Standard Model Evaluation

Modern machine learning evaluation typically measures model performance using metrics such as accuracy, cross-entropy loss, or F1 score on a held-out dataset. While these metrics quantify predictive performance, they do not distinguish between two fundamentally different causes of error: model inefficiency and intrinsic unpredictability of the data. A model may perform poorly either because it fails to capture structure that exists in the data, or because the input variables simply do not contain sufficient information to reliably predict the target variable. In practice, these two scenarios are indistinguishable when evaluation relies solely on predictive accuracy.

This ambiguity becomes particularly problematic when evaluating new architectures, algorithms, or benchmarks. A low-performing model might suggest architectural deficiencies, but it could equally indicate that the dataset itself contains high levels of noise or irreducible uncertainty. Conversely, high performance may simply reflect a dataset with low intrinsic entropy rather than a particularly capable model.

Information theory provides a natural way to formalize this distinction. The mutual information \(I(X ; Y)\) measures how much information the input variables \(X\) contain about the target \(Y\) and therefore represents the maximum predictive information that any model could extract from the data. While theoretical limits of this kind are well understood in principle, they are rarely estimated in practice for real-world datasets.

As a result, models are typically evaluated only by their observed performance rather than by how close that performance is to the information limit imposed by the dataset.

To make this ambiguity precise, consider the commonly used F1 score for binary classification. Let \(\hat{Y} = f(X) \in \{0,1\}\) denote the predicted label. The F1 score is defined as the harmonic mean of precision and recall and depends only on the joint distribution \(p(\hat{Y}, Y)\). As such, it evaluates predictive performance purely in terms of agreement between predicted and true labels, without reference to the input variables \(X\). This leads to a fundamental limitation: the F1 score does not capture how much information about \(Y\) is actually available in \(X\). Consequently, two qualitatively different scenarios may yield identical F1 scores. In the first, the dataset may be intrinsically unpredictable, with \(H(Y \mid X) \approx H(Y)\) and \(I(X; Y) \approx 0\), so that even the Bayes-optimal classifier performs poorly. In the second, the dataset may contain substantial predictive structure, with \(H(Y \mid X) \ll H(Y)\), but the model fails to approximate the true conditional distribution. In both cases, the observed F1 score may be low, yet the underlying causes are fundamentally different. Moreover, the F1 score lacks a notion of a dataset-dependent performance ceiling: the maximum achievable F1 depends on the unknown distribution \(p(Y \mid X)\) and is therefore not accessible in practice. As a result, standard evaluation metrics measure what a model achieves, but not how close it operates to the information-theoretic limits imposed by the dataset.

4 Dataset Learnability

A central question in predictive modeling is how much information about the target variable is actually present in the input data. Intuitively, if the inputs contain little information about the target, then even an ideal model cannot achieve high predictive accuracy. Conversely, if the inputs contain substantial information about the target, then poor model performance indicates that the model has failed to extract structure that exists in the data. We refer to the intrinsic predictability of a dataset as its learnability.

Let the dataset consist of input variables \(X\) and target variable \(Y\) jointly distributed according to an unknown distribution \(p(X,Y)\). This formulation assumes a static dataset setting where samples are independently drawn from this distribution. The marginal distributions are denoted by \(p(X)\) and \(p(Y)\), and the conditional distribution of the target given the input is \(p(Y|X)\). The entropy of the target variable is defined as

\[H(Y) = - \sum_{y} p(y)\log p(y)\] which measures the total uncertainty in the labels before observing the input. The conditional entropy of the target given the input is

\[H(Y|X) = - \mathbb{E}_{p(X,Y)}[\log p(Y|X)]\] which measures the residual uncertainty in the target after observing the input. The reduction in uncertainty obtained by observing the input is quantified by the mutual information

\[I(X;Y) = H(Y) - H(Y|X)\] which measures how much information the input variables contain about the target. We define the learnability of the dataset as the fraction of label uncertainty that can, in principle, be resolved using the input:

This normalization ensures that learnability lies in the interval \([0,1]\). When \(L(X,Y)=0\), the inputs contain no information about the target, and prediction is fundamentally impossible. When \(L(X,Y)=1\), the target is fully determined by the inputs.

In practice, computing \(L(X,Y)\) requires estimating the conditional entropy \(H(Y|X)\), which depends on the true conditional distribution \(p(Y|X)\). For realistic datasets this distribution is unknown and generally intractable to estimate directly.

To obtain a practical approximation, we instead train a high-capacity golden model \(G\) whose purpose is to approximate the true conditional distribution. Let the predictive distribution of this model be denoted by \(q_G(Y|X)\), where

Under this approximation, the conditional entropy can be estimated using the negative log-likelihood of the golden model:

4.1 Estimating the Posterior Density of the Data Labels

Let the dataset be \[\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{n}, \qquad y_i \in \{1,\dots,K\}.\] The probabilistic predictor \(G\) returns a categorical distribution over labels: \[G(x) \equiv q_G(\cdot \mid x)\in\Delta^{K-1}, \qquad q_G(k\mid x)\ge 0,\quad \sum_{k=1}^K q_G(k\mid x)=1.\] (Throughout, logarithms are base-2 (units: bits). Switching to natural logs yields units of nats.)

4.1.1 Empirical Label Entropy

Let the empirical class probabilities be \[\hat p(k) \;=\; \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{y_i=k\}.\] The empirical label entropy is \[\hat H(Y) \;=\; -\sum_{k=1}^{K}\hat p(k)\log_2 \hat p(k),\] with the convention \(0\log 0 \equiv 0\).

4.1.2 Empirical Negative Log-Likelihood - NLL

For a single example \((x_i,y_i)\), the negative log-likelihood under model \(G\) is \[\ell_i(G) \;=\; -\log_2 q_G(y_i\mid x_i).\] Here \(q_G(y_i \mid x_i)\) denotes the probability assigned by the model \(G\) to the true label \(y_i\) given input \(x_i\). In general this probability is typically strictly less than 1 in stochastic or noisy settings (the probability mass function is smeared around the true label). The per-sample negative log-likelihood \(\ell_i(G)\) therefore measures the information (in bits) associated with the event that the model assigns probability \(q_G(y_i \mid x_i)\) to the observed label. The empirical NLL over a dataset \(\mathcal{D}\) is the sample mean \[\widehat{\mathrm{NLL}}_G(\mathcal{D}) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell_i(G) \;=\; -\frac{1}{n}\sum_{i=1}^{n}\log_2 q_G(y_i\mid x_i).\] \(\widehat{\mathrm{NLL}}_G\) as estimated above can serve as an upper bound to the true conditional entropy as: \[\widehat{\mathrm{NLL}}_G \geq H(Y | X)\] Consequently, learnability, expressed as the normalized information contained in the dataset can be approximated by the golden model \(G\) as :

\[\boxed{ L_G(X,Y) \approx 1 - \frac{\widehat{\mathrm{NLL}}_G }{\hat H(Y) } }\] Because the golden model only approximates the true posterior distribution, the resulting estimate constitutes a lower bound on the true learnability ceiling. The gap between the estimated learnability and the true value depends on how well the golden model approximates \(p(Y|X)\), and therefore on the representational capacity and training quality of \(G\).

Evaluation protocol. To approximate generalization, \(\widehat{\mathrm{NLL}}_G\) should be computed on held-out data (validation/test or out-of-fold predictions), not on the training set.

5 Model Efficiency

Noting, that the above framework solely captures the learnability of the dataset (with \(G\) being an oracle estimator - used offline for evaluation), we now discuss the canonical efficiency of a given model under test: \(F\). We pass the model \(F\) through the above framework to calculate the normalized mutual information between the true labels \(Y\) and the labels \(F(X)\) predicted by \(F\) as: \[\begin{equation} I_n(F(X);Y) = 1 - \frac{H(Y\mid F(X))}{H(Y)} \approx 1 - \frac{\widehat{\mathrm{NLL}}_F}{\hat H(Y)} \end{equation}\] and finally normalize it to calculate the efficiency factor as:

\[\begin{equation} \boxed{ \eta(f)\;\coloneqq\;\frac{I_n(F(X);Y)}{ L_G(X,Y)}. } \end{equation}\]

6 Interpretation and Edge Cases

This section discusses how the proposed quantities behave under several important regimes and clarifies how to interpret the resulting metrics.

6.1 Interpretation of Learnability

The dataset learnability \(L_G(X,Y)\) measures the fraction of label entropy that can be explained by the input variables.

6.2 Interpretation of Model Efficiency

measures the fraction of available information that the model \(F\) extracts from the dataset.

6.3 Noiseless Targets

If both the golden model \(G\) and the evaluated model \(F\) perfectly capture the mapping, their NLL approaches zero and the estimated mutual information approaches the label entropy:

6.4 Irreducible Label Noise

even for the Bayes-optimal predictor. In this setting the negative log-likelihood cannot approach zero, and the achievable predictive performance is fundamentally limited by the noise level of the dataset.

6.5 Random Labels

In this regime any model, regardless of complexity, performs no better than random guessing. This case serves as a useful sanity check for the framework.

6.6 Overconfident Models

The NLL metric penalizes models that assign high confidence to incorrect predictions. An overconfident but inaccurate model may therefore achieve poor NLL even if its classification accuracy appears reasonable. This behavior is desirable because the proposed framework measures the information conveyed by the full predictive distribution rather than only the top predicted label.

6.7 Probabilistic Predictions

The framework requires models to produce probabilistic outputs \(q_F(y|x)\). Hard-label predictions discard uncertainty information and prevent accurate estimation of conditional entropy. For deterministic classifiers, probability calibration techniques may be required before computing NLL-based quantities.

6.8 Extension to Time-Varying Data (Work in Progress)

In many real-world systems, data is generated sequentially and exhibits temporal dependence. Examples include sensor streams, financial time series, and interactive systems. In such settings, the assumption of independently sampled \((X,Y)\) pairs no longer holds, and the notion of learnability must be generalized.

A natural extension is to define learnability over sequences \((X_{1:T}, Y_{1:T})\) , where prediction at time \(t\) depends on the history \(X_{1:t}\). The corresponding intrinsic uncertainty is governed by conditional entropy rates rather than single-step conditional entropy. This leads to a time-indexed notion of learnability that varies along the trajectory and reflects the evolving predictability of the process. Developing a practical approximation of this sequential learnability, analogous to the NLL-based estimator proposed here, remains an open direction for future work.

7 Validation on Synthetic Dataset

This section describes the validation plan for the learnability framework and demonstrate that negative log-likelihood (NLL) can be used to:

We generate an ensemble of synthetic datasets with varying degrees of learnability and evaluate the framework on the ensemble. Since the dataset is synthetic, we have full knowledge of the true posterior density of labels which gives us the true underlying learnability of the ensemble. We will then run a high-powered oracle model (\(G\)) to estimate the learnability of the ensemble, which will serve as a tight lower bound on the true learnability. This lower bound from the oracle model will then be used as a baseline for computing the efficiency of a library of models under test.

7.1 Dataset Construction

Let \(X \in \mathbb{R}^d\) denote the feature vector. We partition the feature space into subsets with different distributions to induce heterogeneous learnability:

\[\begin{align*} X = [X^{(1)}, X^{(2)}], \quad X^{(1)} \in \mathbb{R}^{d_1}, \; X^{(2)} \in \mathbb{R}^{d_2}, \; d_1 + d_2 = d \end{align*}\]

The feature subsets are drawn as: \[\begin{align*} X^{(1)} &\sim \mathcal{N}(0, I_{d_1}) \\ X^{(2)} &\sim \mathcal{D} \quad \text{(non-Gaussian or structured distribution)} \end{align*}\]

where \(\mathcal{D}\) can be chosen to control feature informativeness (e.g., heavy-tailed, bounded, or low-variance distributions).

The latent score is defined as: \[\begin{align*} t(X) = \beta\, w^\top X \end{align*}\] where \(w\) is a binary vector used to mask out a randomly selected feature subset. The conditional label distribution is: \[\begin{align*} p(y=1 \mid X) = \sigma(t(X)) = \frac{1}{1 + e^{-t(X)}} \end{align*}\]

The binary label is then sampled as: \[y \sim \mathrm{Bernoulli}(p(y=1\mid X)).\]

7.1.0.1 Controlling Learnability

This construction allows independent control over signal strength, sparsity, and feature distribution, enabling a richer spectrum of learnability regimes beyond isotropic Gaussian settings.

8 Experimental Results

8.1 Controlled Learnability using Synthetic Dataset

The dataset is generated as described in section 7.1 with the following parameters to control the learnability:

Models. We train fixed-depth MLPs with width ranging from 32 to 1024, thereby scaling capacity while keeping architecture constant.

Efficiency measures the fraction of extractable information captured by the model.

8.1.1 Result 1: Same Accuracy, Different Information Extraction

Figure 1 plots efficiency as a function of dataset learnability for models of different sizes.

This indicates that two models with identical accuracy can still differ in their probabilistic alignment with the underlying data, particularly in low-signal regimes. We interpret this as metric decoupling where accuracy reflects correctness of discrete decisions while efficiency reflects how much of the data’s available signal is captured in the predictive distribution.

8.1.2 Result 2: Capacity Scaling Beyond Data-Limited Regimes

Figure 2 shows model scaling behavior as capacity increases for datasets with varying levels of intrinsic learnability.

Observation: Once model capacity exceeds the dataset’s effective learnability, further increases in capacity do not improve the amount of information extracted from the data, even though predictive accuracy remains largely unchanged.

This indicates that there exists a data-dependent capacity threshold beyond which, additional model capacity yields diminishing or negative returns in probabilistic fidelity, does not improve predictive accuracy, and reaches an efficiency-floor after optimal model calibration. We refer to this regime as capacity-information saturation:

Importantly, this behavior is not reflected in accuracy-based metrics, which remain stable across model sizes.

8.2 Avazu Click Through Rate (CTR) Prediction - Work In Progress

In this section we evaluate the framework on Avazu CTR prediction, a real-world tabular classification task with unknown intrinsic learnability. Since the true learnability is unavailable, we approximate a practical reference ceiling using the best calibrated held-out NLL among a family of strong tabular models, including gradient-boosted trees and TabTransformer. We then compare a small family of MLPs of increasing width against this empirical ceiling.

9 Interpretation: Capacity–Information Saturation

When model capacity increases beyond the point at which the available signal is already well captured, further scaling does not increase the amount of extractable information. Instead, it produces diminishing returns in probabilistic fidelity and, in low-learnability regimes, may reduce efficiency even after calibration.

Formally, \[NLL(f) = H(Y|X) + \mathbb{E}_X D_{KL}\!\left(P^*(Y|X)\,\|\,f(Y|X)\right)\] This decomposition makes clear that held-out NLL differs from the intrinsic uncertainty \(H(Y|X)\) by a mismatch term measuring how far the model’s predictive distribution is from the true conditional distribution. In our experiments, calibration reduces a substantial part of this mismatch, but a residual gap remains in low-signal regimes as model size increases.

The resulting picture is not one of unbounded collapse, but of capacity–information saturation: beyond a dataset-dependent threshold, additional capacity does not yield additional extracted signal and may lead to lower efficiency.

10 Implications

Modern model selection is often dominated by accuracy-based metrics. These results show that:

In this sense, the framework is not primarily a claim about model failure, but a tool for identifying when additional capacity is no longer justified by the information content of the dataset.

References

Alizadeh, Keivan, Iman Mirzadeh, Dmitry Belenko, et al. 2024. “LLM in a Flash: Efficient Large Language Model Inference with Limited Memory.” arXiv Preprint arXiv:2312.11514.

Cover, Thomas M., and Joy A. Thomas. 2006. Elements of Information Theory. 2nd ed. Wiley.

Gneiting, Tilmann, and Adrian E. Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association.

Grünwald, Peter D. 2007. The Minimum Description Length Principle. MIT Press.

Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. “On Calibration of Modern Neural Networks.” Proceedings of the International Conference on Machine Learning.

Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv Preprint arXiv:1503.02531.

Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, et al. 2022. “Training Compute-Optimal Large Language Models.” arXiv Preprint arXiv:2203.15556.

Kaplan, Jared, Sam McCandlish, Tom Henighan, et al. 2020. “Scaling Laws for Neural Language Models.” arXiv Preprint arXiv:2001.08361.

Subramanyam, A., Y. Chen, and R. L. Grossman. 2025. “Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining.” arXiv Preprint arXiv:2510.03313.

Tishby, Naftali, Fernando C. Pereira, and William Bialek. 2000. “The Information Bottleneck Method.” arXiv Preprint Physics/0004057.

1 Objective