PAC Learning
Summary and Contributions
A central question in the theory of machine learning concerns the identification of classes of data distributions, or data models, for which one can provide computationally efficient learning algorithms with provable statistical learning guarantees. In the context of the probably approximately correct (PAC) learning model, arguably a cornerstone of the theory of machine learning, the two most widely studied settings, namely the realizable and (fully) agnostic settings, represent somewhat extreme tradeoffs between computational efficiency and statistical modeling power. The realizable setting, as originally proposed by Valiant (1984), often admits computationally efficient learning algorithms, but makes the restrictive statistical assumption that examples are labeled by a deterministic target function (from some known function class); the (fully) agnostic setting, on the other hand, allows for fully general joint probability distributions over labeled examples, but often fails to admit computationally efficient learning algorithms. Consequently, there has been much interest in intermediate PAC learning models that both allow for some stochasticity in the labels and admit computationally efficient learning algorithms with finite sample complexity bounds. Examples of such models include random classification noise (RCN), Massart noise, probabilistic concepts, (univariate) generalized linear models (GLMs), and single index models (SIMs); most of this work has focused on binary classification problems.

In the NeurIPS 2025 paper below, we study a broad range of learning problems under what we call realizable-statistic models (RSMs), in which labels may be stochastic but some vector-valued statistic of the conditional label distribution is assumed to lie in a known function class. RSMs form a flexible family of models that interpolates between the realizable and fully agnostic settings, and that recovers several previously studied models as special cases.

We show that for a broad range of RSM learning problems in which the statistic of interest can be accurately estimated via a convex ‘strongly proper composite’ surrogate loss, minimizing this convex surrogate yields a computationally efficient learning algorithm with finite sample complexity bounds. We then apply this result to show that various commonly used (and in some cases, not so commonly used) convex surrogate risk minimization algorithms yield computationally efficient learning algorithms with finite sample complexity bounds for a variety of RSM learning problems, including binary classification, multiclass classification, multi-label prediction, and subset ranking. All of our results are distribution-independent with respect to the distribution over the domain/instance space.
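For readers less familiar with these notions, one way to formalize the two main ingredients is sketched below. This is our own loose paraphrase of the setup described above, roughly following the literature on (strongly) proper composite losses; the symbols s, F, φ, ψ, and λ are notational choices made for this sketch, and the paper's exact definitions may differ in details.

```latex
% Realizable-statistic model (RSM): labels may be stochastic, but a chosen
% vector-valued statistic s of the conditional label distribution is realized
% by some function f^* from a known class F.
\[
  s\big(P(Y \mid X = x)\big) \;=\; f^*(x)
  \quad \text{for some } f^* \in \mathcal{F}, \ \text{for (almost) all } x .
\]

% A lambda-strongly proper composite surrogate phi with invertible link psi:
% the surrogate regret of a score u under a conditional label distribution p
% controls the squared error in estimating the statistic s(p).
\[
  \mathbb{E}_{Y \sim p}\!\big[\phi(Y, u)\big]
  \;-\; \inf_{u'} \, \mathbb{E}_{Y \sim p}\!\big[\phi(Y, u')\big]
  \;\ge\; \frac{\lambda}{2}\,\big\| s(p) - \psi^{-1}(u) \big\|_2^2 .
\]
```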
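As a concrete, purely illustrative instance of the general recipe (minimize a convex surrogate, then map the learned scores back through the inverse link to an estimate of the statistic), the minimal Python sketch below specializes it to binary classification with the logistic loss, a convex proper composite surrogate whose link is the logit function. The synthetic data, linear scoring class, and optimizer are assumptions made for the example; this is not the paper's actual algorithm or experimental setup.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic RSM-style data: labels are stochastic, but the statistic of interest,
# here the conditional probability eta(x) = P(y = +1 | x), is the sigmoid of a
# linear function of x (a hypothetical data model chosen for illustration).
n, d = 2000, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
eta = 1.0 / (1.0 + np.exp(-X @ w_star))       # true conditional label probabilities
y = np.where(rng.random(n) < eta, 1.0, -1.0)  # stochastic +/-1 labels

def logistic_surrogate_risk(w):
    """Empirical risk under the logistic loss phi(y, u) = log(1 + exp(-y u)),
    a convex proper composite surrogate with logit link."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))

# Convex surrogate risk minimization over linear scoring functions.
res = minimize(logistic_surrogate_risk, x0=np.zeros(d), method="L-BFGS-B")
w_hat = res.x

# Invert the link (the sigmoid is the inverse of the logit) to estimate the
# statistic, i.e., the conditional probability of the positive class, then
# threshold at 1/2 to obtain a classifier.
eta_hat = 1.0 / (1.0 + np.exp(-X @ w_hat))
y_pred = np.where(eta_hat >= 0.5, 1.0, -1.0)

print("mean squared error in estimating eta:", np.mean((eta_hat - eta) ** 2))
print("0-1 classification error:", np.mean(y_pred != y))
```

The same surrogate-then-invert-the-link pattern plausibly carries over, with different surrogates and links, to the other prediction problems mentioned above; the details of those applications are in the paper itself.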
Relevant Publications
- Shivani Agarwal.
Efficient PAC learning for realizable-statistic models via convex surrogates.
In Advances in Neural Information Processing Systems (NeurIPS), 2025.
[pdf]
- Mingyuan Zhang and Shivani Agarwal.
Bayes consistency vs. H-consistency: The interplay between surrogate loss functions and the scoring function class.
In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Spotlight paper.
[pdf]