Information Bottleneck and Representation Learning

doi:10.1017/9781108616799.012

11 - Information Bottleneck and Representation Learning

Published online by Cambridge University Press: 22 March 2021

Pablo Piantanida and Leonardo Rey Vega

Edited by Miguel R. D. Rodrigues and Yonina C. Eldar

Miguel R. D. Rodrigues, University College London
Yonina C. Eldar, Weizmann Institute of Science, Israel

Summary

A grand challenge in representation learning is the development of computational algorithms that learn the explanatory factors of variation behind high-dimensional data. Representation models (encoders) are often chosen to optimize performance on training data, whereas the real objective is to generalize well to other (unseen) data. This chapter provides an overview of fundamental concepts in statistical learning theory and of the information-bottleneck principle. These serve as the mathematical basis for the technical results, which give an upper bound on the generalization gap of the cross-entropy risk. When this bound, treated as a penalty term and scaled by a suitable multiplier, is minimized jointly with the empirical cross-entropy risk, the problem is equivalent to optimizing the information-bottleneck objective with respect to the empirical data distribution. This result provides a connection between mutual information and generalization, and helps to explain why noise injection during the training phase can improve the generalization ability of encoder models and enforce invariances in the resulting representations.
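As a rough illustration of the information-bottleneck objective mentioned in the summary, the sketch below evaluates the standard Lagrangian I(X;U) - beta * I(U;Y) for a discrete toy distribution. It assumes the textbook form of the objective due to Tishby, Pereira, and Bialek; the exact formulation used in the chapter is not shown on this page. The function names, the toy distribution, and the value of beta are all hypothetical choices made for illustration.

```python
import numpy as np


def mutual_information(p_ab):
    """Mutual information I(A;B) in nats for a joint pmf given as a 2-D array."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal of A, shape (n, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal of B, shape (1, m)
    mask = p_ab > 0                          # skip zero-probability cells
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))


def ib_lagrangian(p_xy, enc, beta):
    """I(X;U) - beta * I(U;Y) for a stochastic encoder enc[x, u] = q(u|x).

    Assumes the Markov chain U - X - Y, i.e. U depends on Y only through X.
    """
    p_xy = np.asarray(p_xy, dtype=float)
    enc = np.asarray(enc, dtype=float)
    p_x = p_xy.sum(axis=1)
    p_xu = p_x[:, None] * enc                # p(x, u) = p(x) q(u|x)
    p_uy = enc.T @ p_xy                      # p(u, y) = sum_x q(u|x) p(x, y)
    return mutual_information(p_xu) - beta * mutual_information(p_uy)


# Toy data (hypothetical): X uniform on {0, 1, 2, 3}, label Y = X // 2.
p_xy = np.array([[0.25, 0.0],
                 [0.25, 0.0],
                 [0.0, 0.25],
                 [0.0, 0.25]])

identity_enc = np.eye(4)                                    # U = X, no compression
merge_enc = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])     # U = X // 2
constant_enc = np.ones((4, 1))                              # U carries no information

beta = 2.0
scores = {
    "identity": ib_lagrangian(p_xy, identity_enc, beta),
    "merge": ib_lagrangian(p_xy, merge_enc, beta),
    "constant": ib_lagrangian(p_xy, constant_enc, beta),
}
best = min(scores, key=scores.get)
print(scores, best)
```

In this toy case the encoder that merges inputs sharing a label attains the lowest value, since it discards the part of X that is irrelevant to Y while keeping everything relevant.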

Keywords

learning, supervised learning, unsupervised learning, deep learning, generalization, information bottleneck

Type: Chapter
Information: Information-Theoretic Methods in Data Science, pp. 330–358

DOI: https://doi.org/10.1017/9781108616799.012

Publisher: Cambridge University Press

Print publication year: 2021
