An Introductory Guide to Fano’s Inequality with Applications in Statistical Estimation

doi:10.1017/9781108616799.017

Chapter 16

Published online by Cambridge University Press: 22 March 2021

Jonathan Scarlett and Volkan Cevher

Edited by Miguel R. D. Rodrigues and Yonina C. Eldar


Miguel R. D. Rodrigues, University College London
Yonina C. Eldar, Weizmann Institute of Science, Israel


Summary

Information theory plays an indispensable role in the development of algorithm-independent impossibility results, both for communication problems and for seemingly distinct areas such as statistics and machine learning. While numerous information-theoretic tools have been proposed for this purpose, the oldest one remains arguably the most versatile and widespread: Fano’s inequality. In this chapter, we provide a survey of Fano’s inequality and its variants in the context of statistical estimation, adopting a versatile framework that covers a wide range of specific problems. We present a variety of key tools and techniques used for establishing impossibility results via this approach, and provide representative examples covering group testing, graphical model selection, sparse linear regression, density estimation, and convex optimization.

Keywords

Fano’s inequality; graphical model selection; group testing; estimation; optimization; sampling

Type: Chapter
Information: Information-Theoretic Methods in Data Science, pp. 487–528

DOI: https://doi.org/10.1017/9781108616799.017

Publisher: Cambridge University Press

Print publication year: 2021

