
9 - Universal Clustering

Published online by Cambridge University Press: 22 March 2021

Miguel R. D. Rodrigues, University College London
Yonina C. Eldar, Weizmann Institute of Science, Israel

Summary

Clustering is a general term for techniques that, given a set of objects, aim to select those that are closer to one another than to the rest, according to a chosen notion of closeness. It is an unsupervised-learning problem since objects are not externally labeled by category. Much effort has been expended on finding natural mathematical definitions of closeness and then developing/evaluating algorithms in these terms. Many have argued that there is no domain-independent mathematical notion of similarity but that it is context-dependent; categories are perhaps natural in that people can evaluate them when they see them. Some have dismissed the problem of unsupervised learning in favor of supervised learning, saying it is not a powerful natural phenomenon. Yet, most learning is unsupervised. We largely learn how to think through categories by observing the world in its unlabeled state. Drawing on universal information theory, we ask whether there are universal approaches to unsupervised clustering. In particular, we consider instances wherein the ground-truth clusters are defined by the unknown statistics governing the data to be clustered.
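The summary above frames universal clustering as grouping data whose ground-truth clusters are determined by the unknown statistics that generated the data. As a concrete, minimal sketch (not the chapter's own algorithm), the snippet below reduces each discrete sequence to its empirical distribution and greedily merges sequences whose empirical distributions are close; the function names, the choice of Jensen-Shannon divergence, and the threshold are illustrative assumptions.

```python
# Minimal sketch: cluster sequences by the closeness of their empirical
# distributions (their "types"), without ever seeing the generating sources.
# All names, the divergence, and the threshold are illustrative choices.
import numpy as np

def empirical_distribution(seq, alphabet_size):
    """Empirical distribution (type) of a discrete integer-valued sequence."""
    counts = np.bincount(seq, minlength=alphabet_size)
    return counts / counts.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability vectors."""
    m = 0.5 * (p + q)
    def kl(a, b):
        a = np.clip(a, eps, None)
        b = np.clip(b, eps, None)
        return np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cluster_by_type(sequences, alphabet_size, threshold=0.05):
    """Greedy clustering: join a sequence to the first earlier sequence whose type is close."""
    types = [empirical_distribution(s, alphabet_size) for s in sequences]
    labels = [-1] * len(sequences)
    next_label = 0
    for i, p in enumerate(types):
        for j in range(i):
            if js_divergence(p, types[j]) < threshold:
                labels[i] = labels[j]
                break
        if labels[i] == -1:
            labels[i] = next_label
            next_label += 1
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two unknown sources over a 4-letter alphabet; the clusterer never sees them.
    sources = [np.array([0.7, 0.1, 0.1, 0.1]), np.array([0.1, 0.1, 0.1, 0.7])]
    seqs = [rng.choice(4, size=2000, p=sources[k % 2]) for k in range(6)]
    print(cluster_by_type(seqs, alphabet_size=4))  # e.g., [0, 1, 0, 1, 0, 1]
```

With enough samples per sequence, the empirical distributions of sequences drawn from the same source concentrate, so even a fixed threshold separates the sources without labels; the chapter asks when such statistics-driven clustering can be made universal, i.e., valid without knowing the sources in advance.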

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2021


