
10 - Information-Theoretic Stability and Generalization

Published online by Cambridge University Press: 22 March 2021

Edited by Miguel R. D. Rodrigues, University College London
and Yonina C. Eldar, Weizmann Institute of Science, Israel

Summary

Machine-learning algorithms can be viewed as stochastic transformations that map training data to hypotheses. Following Bousquet and Elisseeff, we say such an algorithm is stable if its output does not depend too much on any individual training example. Since stability is closely connected to generalization capabilities of learning algorithms, it is of interest to obtain sharp quantitative estimates on the generalization bias of machine-learning algorithms in terms of their stability properties. We describe several information-theoretic measures of algorithmic stability and illustrate their use for upper-bounding the generalization bias of learning algorithms. Specifically, we relate the expected generalization error of a learning algorithm to several information-theoretic quantities that capture the statistical dependence between the training data and the hypothesis. These include mutual information and erasure mutual information, and their counterparts induced by the total variation distance. We illustrate the general theory through examples, including the Gibbs algorithm and differentially private algorithms, and discuss strategies for controlling the generalization error.
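To make the kind of result described above concrete, the following is a representative bound in the style of Xu and Raginsky (cited in the references below); the notation and the sub-Gaussian assumption are introduced here for illustration and are not part of the summary. Suppose a learning algorithm maps a training set $S$ of $n$ i.i.d. examples drawn from a distribution $\mu$ to a hypothesis $W$, and suppose the loss $\ell(w, Z)$ is $\sigma$-sub-Gaussian under $Z \sim \mu$ for every fixed $w$. Then the expected generalization error is controlled by the mutual information between the training data and the hypothesis:

$$\bigl|\,\mathbb{E}\,\mathrm{gen}(\mu, P_{W\mid S})\,\bigr| \;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(S; W)}.$$

In words, an algorithm whose output retains little information about the training sample cannot overfit by much; this is the sense in which, for example, the Gibbs algorithm with a small inverse-temperature parameter or a differentially private mechanism admits small $I(S;W)$ and hence a small generalization bias.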

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2021


References

Vapnik, V. N., Statistical learning theory. Wiley, 1998.
Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K., “Learnability, stability, and uniform convergence,” J. Machine Learning Res., vol. 11, pp. 2635–2670, 2010.
Taylor, J. and Tibshirani, R. J., “Statistical learning and selective inference,” Proc. Natl. Acad. Sci. USA, vol. 112, no. 25, pp. 7629–7634, 2015.
Devroye, L. and Wagner, T., “Distribution-free performance bounds for potential function rules,” IEEE Trans. Information Theory, vol. 25, no. 5, pp. 601–604, 1979.
Rogers, W. H. and Wagner, T. J., “A finite sample distribution-free performance bound for local discrimination rules,” Annals Statist., vol. 6, no. 3, pp. 506–514, 1978.
Ron, D. and Kearns, M., “Algorithmic stability and sanity-check bounds for leave-one-out cross-validation,” Neural Computation, vol. 11, no. 6, pp. 1427–1453, 1999.
Bousquet, O. and Elisseeff, A., “Stability and generalization,” J. Machine Learning Res., vol. 2, pp. 499–526, 2002.
Poggio, T., Rifkin, R., Mukherjee, S., and Niyogi, P., “General conditions for predictivity in learning theory,” Nature, vol. 428, no. 6981, pp. 419–422, 2004.
Kutin, S. and Niyogi, P., “Almost-everywhere algorithmic stability and generalization error,” in Proc. 18th Conference on Uncertainty in Artificial Intelligence (UAI), 2002, pp. 275–282.
Rakhlin, A., Mukherjee, S., and Poggio, T., “Stability results in learning theory,” Analysis and Applications, vol. 3, no. 4, pp. 397–417, 2005.
Dwork, C., McSherry, F., Nissim, K., and Smith, A., “Calibrating noise to sensitivity in private data analysis,” in Proc. Theory of Cryptography Conference, 2006, pp. 265–284.
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A., “Generalization in adaptive data analysis and holdout reuse,” arXiv:1506.02629, 2015.
Dwork, C. and Roth, A., “The algorithmic foundations of differential privacy,” Foundations and Trends in Theoretical Computer Sci., vol. 9, nos. 3–4, pp. 211–407, 2014.
Kairouz, P., Oh, S., and Viswanath, P., “The composition theorem for differential privacy,” in Proc. 32nd International Conference on Machine Learning (ICML), 2015, pp. 1376–1385.
Raginsky, M., Rakhlin, A., Tsao, M., Wu, Y., and Xu, A., “Information-theoretic analysis of stability and bias of learning algorithms,” in Proc. IEEE Information Theory Workshop, 2016.
Xu, A. and Raginsky, M., “Information-theoretic analysis of generalization capability of learning algorithms,” in Proc. Conference on Neural Information Processing Systems, 2017.
Strasser, H., Mathematical theory of statistics: Statistical experiments and asymptotic decision theory. Walter de Gruyter, 1985.
Polyanskiy, Y. and Wu, Y., “Dissipation of information in channels with input constraints,” IEEE Trans. Information Theory, vol. 62, no. 1, pp. 35–55, 2016.
Verdú, S. and Weissman, T., “The information lost in erasures,” IEEE Trans. Information Theory, vol. 54, no. 11, pp. 5030–5058, 2008.
Hoeffding, W., “Probability inequalities for sums of bounded random variables,” J. Amer. Statist. Assoc., vol. 58, no. 301, pp. 13–30, 1963.
Boucheron, S., Lugosi, G., and Massart, P., Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
Cover, T. M. and Thomas, J. A., Elements of information theory, 2nd edn. Wiley, 2006.
Bassily, R., Nissim, K., Smith, A., Steinke, T., Stemmer, U., and Ullman, J., “Algorithmic stability for adaptive data analysis,” in Proc. 48th ACM Symposium on Theory of Computing (STOC), 2016, pp. 1046–1059.
McAllester, D., “PAC-Bayesian model averaging,” in Proc. 1999 Conference on Learning Theory, 1999.
McAllester, D., “A PAC-Bayesian tutorial with a dropout bound,” arXiv:1307.2118, 2013, http://arxiv.org/abs/1307.2118.
Russo, D. and Zou, J., “Controlling bias in adaptive data analysis using information theory,” in Proc. 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016, pp. 1232–1240.
Jiao, J., Han, Y., and Weissman, T., “Dependence measures bounding the exploration bias for general measurements,” in Proc. IEEE International Symposium on Information Theory, 2017, pp. 1475–1479.
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A., “Preserving statistical validity in adaptive data analysis,” in Proc. 47th ACM Symposium on Theory of Computing (STOC), 2015.
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A., “Generalization in adaptive data analysis and holdout reuse,” in Proc. 28th Annual Conference on Neural Information Processing Systems (NIPS), 2015.
Bassily, R., Moran, S., Nachum, I., Shafer, J., and Yehudayoff, A., “Learners that use little information,” in Proc. Conference on Algorithmic Learning Theory (ALT), 2018.
Vershynin, R., High-dimensional probability: An introduction with applications in data science. Cambridge University Press, 2018.
Devroye, L., Györfi, L., and Lugosi, G., A probabilistic theory of pattern recognition. Springer, 1996.
Buescher, K. L. and Kumar, P. R., “Learning by canonical smooth estimation. I. Simultaneous estimation,” IEEE Trans. Automatic Control, vol. 41, no. 4, pp. 545–556, 1996.
Cohen, J. E., Kemperman, J. H. B., and Zbǎganu, G., Comparisons of stochastic matrices, with applications in information theory, statistics, economics, and population sciences. Birkhäuser, 1998.
Polyanskiy, Y., “Channel coding: Non-asymptotic fundamental limits,” Ph.D. dissertation, Princeton University, 2010.
Sason, I. and Verdú, S., “f-divergence inequalities,” IEEE Trans. Information Theory, vol. 62, no. 11, pp. 5973–6006, 2016.
McSherry, F. and Talwar, K., “Mechanism design via differential privacy,” in Proc. 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2007.
Polyanskiy, Y. and Wu, Y., “Lecture notes on information theory,” Lecture Notes for ECE563 (UIUC) and 6.441 (MIT), 2012–2016, http://people.lids.mit.edu/yp/homepage/data/itlectures_v4.pdf.
Verdú, S., “The exponential distribution in information theory,” Problems of Information Transmission, vol. 32, no. 1, pp. 86–95, 1996.
Raginsky, M., “Strong data processing inequalities and Φ-Sobolev inequalities for discrete channels,” IEEE Trans. Information Theory, vol. 62, no. 6, pp. 3355–3389, 2016.
Goodfellow, I., Bengio, Y., and Courville, A., Deep learning. MIT Press, 2016.
Feldman, V. and Steinke, T., “Calibrating noise to variance in adaptive data analysis,” in Proc. 2018 Conference on Learning Theory, 2018.
Cuff, P. and Yu, L., “Differential privacy as a mutual information constraint,” in Proc. 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2016, pp. 43–54.
