4 - Information-Theoretic Bounds on Sketching

Published online by Cambridge University Press:  22 March 2021

Miguel R. D. Rodrigues, University College London
Yonina C. Eldar, Weizmann Institute of Science, Israel

Summary

Approximate computation methods with provable performance guarantees are becoming important and relevant tools in practice. In this chapter we focus on sketching methods designed to reduce data dimensionality in computationally intensive tasks. Sketching can often provide better space, time, and communication complexity at the cost of only a small loss in accuracy. This chapter discusses the role of information theory in sketching methods for solving large-scale statistical estimation and optimization problems. We investigate fundamental lower bounds on the performance of sketching and, by exploring these lower bounds, obtain trade-offs between computation and accuracy. We employ Fano's inequality and metric entropy to derive fundamental lower bounds on the accuracy of sketching, paralleling the information-theoretic techniques used in statistical minimax theory.
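As a concrete illustration of the kind of dimensionality reduction the chapter studies, the toy example below compresses an overdetermined least-squares problem with a random Gaussian sketching matrix and compares the sketched solution with the exact one. It is a minimal sketch under assumed dimensions and a plain Gaussian embedding, not the specific constructions or bounds developed in the chapter; the sketch size m is an illustrative parameter governing the accuracy/computation trade-off.

```python
# Minimal illustration (assumed setup, not the chapter's construction):
# solve a least-squares problem exactly and from an m x n Gaussian sketch.
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 5000, 50, 400              # n samples, d features, sketch size m << n
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y = A @ x_true + 0.1 * rng.standard_normal(n)

# Exact least-squares solution: roughly O(n d^2) work.
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

# Gaussian sketch S with i.i.d. N(0, 1/m) entries, so E[S^T S] = I_n;
# the sketched problem has only m rows, costing O(m d^2) after forming S A.
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_sk, *_ = np.linalg.lstsq(S @ A, S @ y, rcond=None)

print("relative error of sketched solution:",
      np.linalg.norm(x_sk - x_ls) / np.linalg.norm(x_ls))
```

Increasing m typically shrinks the relative error at the price of more computation, which is the kind of accuracy-versus-computation trade-off that the chapter's information-theoretic lower bounds address.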

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2021
