
Neural network approximation

Published online by Cambridge University Press: 04 August 2021

Ronald DeVore
Affiliation:
Department of Mathematics, Texas A&M University, College Station, TX 77843, USA. E-mail: [email protected]
Boris Hanin
Affiliation:
Department of Operations Research and Financial Engineering, Princeton University, NJ 08544, USA. E-mail: [email protected]
Guergana Petrova
Affiliation:
Department of Mathematics, Texas A&M University, College Station, TX 77843, USA. E-mail: [email protected]

Abstract


Neural networks (NNs) are the method of choice for building learning algorithms. They are now being investigated for other numerical tasks such as solving high-dimensional partial differential equations. Their popularity stems from their empirical success on several challenging learning problems (computer chess/Go, autonomous navigation, face recognition). However, most scholars agree that a convincing theoretical explanation for this success is still lacking. Since these applications revolve around approximating an unknown function from data observations, part of the answer must involve the ability of NNs to produce accurate approximations.

This article surveys the known approximation properties of the outputs of NNs with the aim of uncovering the properties that are not present in the more traditional methods of approximation used in numerical analysis, such as approximations using polynomials, wavelets, rational functions and splines. Comparisons are made with traditional approximation methods from the viewpoint of rate distortion, i.e. error versus the number of parameters used to create the approximant. Another major component in the analysis of numerical approximation is the computational time needed to construct the approximation, and this in turn is intimately connected with the stability of the approximation algorithm. So the stability of numerical approximation using NNs is a large part of the analysis put forward.
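
To make the error-versus-parameters viewpoint concrete, here is a minimal Python sketch (not from the article; the target function sin and the uniformly spaced knots are arbitrary illustrative choices) that records the sup-norm error of a classical continuous piecewise linear interpolant as the number of parameters grows. NN approximation is compared against exactly such error-versus-parameter curves.

```python
# Minimal sketch (illustrative only): "rate distortion" here means sup-norm
# error versus number of parameters, measured for continuous piecewise linear
# interpolation of a smooth target f on [0, 1] with uniformly spaced knots.
import numpy as np

f = np.sin                               # stand-in target; any smooth f works
x_fine = np.linspace(0.0, 1.0, 100_001)  # dense grid for estimating the error

for n in [4, 8, 16, 32, 64]:             # roughly n parameters per approximant
    knots = np.linspace(0.0, 1.0, n + 2)
    approx = np.interp(x_fine, knots, f(knots))  # piecewise linear interpolant
    err = np.max(np.abs(f(x_fine) - approx))     # sup-norm error on the grid
    print(f"n = {n:3d}   error ~ {err:.2e}")     # expect roughly O(n^-2) decay
```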

The survey, for the most part, is concerned with NNs using the popular ReLU activation function. In this case the outputs of the NNs are piecewise linear functions on rather complicated partitions of the domain of the target function f into cells that are convex polytopes. When the architecture of the NN is fixed and the parameters are allowed to vary, the set of output functions of the NN is a parametrized nonlinear manifold. It is shown that this manifold has certain space-filling properties, leading to an increased ability to approximate (better rate distortion) but at the expense of numerical stability. This space filling presents the numerical method with the challenge of finding best, or at least good, parameter choices when constructing the approximation.
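
As a rough illustration of the piecewise linear structure just described, the following Python sketch (the width-4 network and its weights are made-up values, not taken from the article) evaluates a one-hidden-layer ReLU network on a fine grid of [0, 1] and locates the points where the slope of the output changes, exhibiting the continuous piecewise linear output and its breakpoints.

```python
# Minimal sketch (assumed weights, not from the paper): a one-hidden-layer
# ReLU network on [0, 1] produces a continuous piecewise linear output.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical parameters of a width-4 network f(x) = c . relu(W x + b) + d.
W = np.array([1.0, 2.0, -1.5, 3.0])    # input weights
b = np.array([-0.2, -1.0, 0.9, -2.1])  # biases
c = np.array([0.7, -0.4, 1.1, 0.5])    # output weights
d = 0.3                                # output bias

def net(x):
    # x: 1-D array of inputs; returns the network output at each point.
    return relu(np.outer(x, W) + b) @ c + d

# Evaluate on a fine grid and compute the slope of each grid segment.
x = np.linspace(0.0, 1.0, 10_001)
y = net(x)
slopes = np.diff(y) / np.diff(x)

# Each ReLU kink (at x = -b_i / W_i) shows up as one or two neighbouring
# grid points where the segment slope changes.
breaks = x[1:-1][np.abs(np.diff(slopes)) > 1e-8]
print("approximate breakpoints:", np.round(breaks, 3))
```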

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2021. Published by Cambridge University Press
