Hostname: page-component-745bb68f8f-g4j75 Total loading time: 0 Render date: 2025-01-11T06:40:32.019Z Has data issue: false hasContentIssue false

Neural network approximation

Published online by Cambridge University Press:  04 August 2021

Ronald DeVore
Affiliation:
Department of Mathematics, Texas A&M University, College Station, TX77843, USA E-mail: [email protected]
Boris Hanin
Affiliation:
Department of Operations Research and Financial Engineering, Princeton University, NJ08544, USA E-mail: [email protected]
Guergana Petrova
Affiliation:
Department of Mathematics, Texas A&M University, College Station, TX77843, USA E-mail: [email protected]
Rights & Permissions [Opens in a new window]

Abstract

Core share and HTML view are not available for this content. However, as you have access to this content, a full PDF is available via the ‘Save PDF’ action button.

Neural networks (NNs) are the method of choice for building learning algorithms. They are now being investigated for other numerical tasks such as solving high-dimensional partial differential equations. Their popularity stems from their empirical success on several challenging learning problems (computer chess/Go, autonomous navigation, face recognition). However, most scholars agree that a convincing theoretical explanation for this success is still lacking. Since these applications revolve around approximating an unknown function from data observations, part of the answer must involve the ability of NNs to produce accurate approximations.

This article surveys the known approximation properties of the outputs of NNs with the aim of uncovering the properties that are not present in the more traditional methods of approximation used in numerical analysis, such as approximations using polynomials, wavelets, rational functions and splines. Comparisons are made with traditional approximation methods from the viewpoint of rate distortion, i.e. error versus the number of parameters used to create the approximant. Another major component in the analysis of numerical approximation is the computational time needed to construct the approximation, and this in turn is intimately connected with the stability of the approximation algorithm. So the stability of numerical approximation using NNs is a large part of the analysis put forward.

The survey, for the most part, is concerned with NNs using the popular ReLU activation function. In this case the outputs of the NNs are piecewise linear functions on rather complicated partitions of the domain of f into cells that are convex polytopes. When the architecture of the NN is fixed and the parameters are allowed to vary, the set of output functions of the NN is a parametrized nonlinear manifold. It is shown that this manifold has certain space-filling properties leading to an increased ability to approximate (better rate distortion) but at the expense of numerical stability. The space filling creates the challenge to the numerical method of finding best or good parameter choices when trying to approximate.

Type
Research Article
Creative Commons
Creative Common License - CCCreative Common License - BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2021. Published by Cambridge University Press

References

Adams, R. A. and Fournier, J. J. F. (2003), Sobolev Spaces, Elsevier.Google Scholar
Ali, M. and Nouy, A. (2020), Approximation of smoothness classes by deep ReLU networks. Available at arXiv:2007.15645v1.Google Scholar
Allen-Zhu, Z., Li, Y. and Song, Z. (2019), A convergence theory for deep learning via over-parameterization, in Proceedings of the 36th International Conference on Machine Learning (ICML 2019) (Chaudhuri, K. and Salakhutdinov, R., eds), Vol. 97 of Proceedings of Machine Learning Research, PMLR, pp. 242252.Google Scholar
Arora, R., Basu, A., Mianjy, P. and Mukherjee, A. (2018a), Understanding deep neural networks with rectified linear units, in 6th International Conference on Learning Representations (ICLR 2018). Available at https://openreview.net/forum?id=B1J_rgWRW.Google Scholar
Arora, S., Ge, R., Neyshabur, B. and Zhang, Y. (2018b), Stronger generalization bounds for deep nets via a compression approach, in Proceedings of the 35th International Conference on Machine Learning (ICML 2018) (Dy, J. and Krause, A., eds), Vol. 80 of Proceedings of Machine Learning Research, PMLR, pp. 254263.Google Scholar
Bach, F. (2017), Breaking the curse of dimensionality with convex neural networks, J. Mach. Learn. Res. 18, 153.Google Scholar
Balestriero, R. and Baraniuk, R. (2021), Mad Max: Affine spline insights into deep learning, Proc. IEEE 109, 704727.CrossRefGoogle Scholar
Barron, A. R. (1993), Universal approximation bounds for superpositions of a sigmoidal function, IEEE Trans. Inform. Theory 39, 930945.CrossRefGoogle Scholar
Barron, A. R. (1994), Approximation and estimation bounds for artificial neural networks, Mach. Learn. 14, 115133.CrossRefGoogle Scholar
Bartlett, P. L., Foster, D. J. and Telgarsky, M. (2017), Spectrally-normalized margin bounds for neural networks, in Advances in Neural Information Processing Systems 30 (NIPS 2017) (Guyon, I. et al., eds), Curran Associates, pp. 62406249.Google Scholar
Bartlett, P. L., Harvey, N., Liaw, C. and Mehrabian, A. (2019), Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks, J. Mach. Learn. Res. 20, 117.Google Scholar
Bartlett, P. L., Long, P. M., Lugosi, G. and Tsigler, A. (2020), Benign overfitting in linear regression, Proc. Natl Acad. Sci. 117, 3006330070.CrossRefGoogle ScholarPubMed
Bennett, C. and Sharpley, R. (1990), Interpolation of Operators, Academic Press.Google Scholar
Benyamini, Y. and Lindenstrauss, J. (2000), Geometric Nonlinear Functional Analysis 1, Vol. 48 of Colloquium Publications, American Mathematical Society.CrossRefGoogle Scholar
Bergh, J. and Lofstrom, (1976), Interpolation Spaces: An Introduction, Springer.CrossRefGoogle Scholar
Binev, P., Cohen, A., Dahmen, W., DeVore, R., Petrova, G. and Wojtaszczyk, P. (2011), Convergence rates for greedy algorithms in reduced basis methods, SIAM J. Math. Anal. 43, 14571472.CrossRefGoogle Scholar
Binev, P., Cohen, A., Dahmen, W., DeVore, R., Petrova, G. and Wojtaszczyk, P. (2017), Data assimilation in reduced modeling, SIAM/ASA J. Uncertain. Quantif. 5, 129.CrossRefGoogle Scholar
Bölcskei, H., Grohs, P., Kutyniok, G. and Petersen, P. (2019), Optimal approximation with sparsely connected deep neural networks, SIAM J. Math. Data Sci. 1, 845.CrossRefGoogle Scholar
Bousquet, O., Boucheron, S. and Lugosi, G. (2005), Theory of classification: A survey of some recent advances, ESAIM Probab. Statist. 9, 323375.Google Scholar
Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A. and Vandergheynst, P. (2017), Geometric deep learning: Going beyond Euclidean data, IEEE Signal Process. Mag. 34, 1842.CrossRefGoogle Scholar
Buffa, A., Maday, Y., Patera, A. T., Prud’homme, C. and Turinici, G. (2012), A priori convergence of the greedy algorithm for the parameterized reduced basis, Math. Model. Numer. Anal. 46, 595603.CrossRefGoogle Scholar
Carl, B. (1981), Entropy numbers, s-numbers, and eigenvalue problems, J. Funct. Anal. 41, 290306.CrossRefGoogle Scholar
Chizat, L., Oyallon, E. and Bach, F. (2019), On lazy training in differentiable programming, in Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (Wallach, H. et al., eds), Curran Associates, pp. 29372947.Google Scholar
Cohen, A., DeVore, R., Petrova, G. and Wojtaszczyk, P. (2020), Optimal stable nonlinear approximation. Available at arXiv:2009.09907.Google Scholar
Csiskos, M., Kupavskii, A. and Mustafa, N. (2019), Tight lower bounds on the VC-dimension of geometric set systems, J. Mach. Learn. Res. 20, 18.Google Scholar
Cybenko, G. (1989), Approximation by superpositions of a sigmoidal function, Math. Control Signals Systems 2, 303314.CrossRefGoogle Scholar
Daubechies, I., DeVore, R., Foucart, S., Hanin, B. and Petrova, G. (2019), Nonlinear approximation and (deep) ReLU networks. Available at arXiv:1905.02199 (to appear in Constr. Approx.).Google Scholar
de Boor, C. (1978), A Practical Guide to Splines, Vol. 27 of Applied Mathematical Sciences, Springer.CrossRefGoogle Scholar
DeVore, R. A. (1998), Nonlinear approximation, in Acta Numerica, Vol. 7, Cambridge University Press, pp. 51150.Google Scholar
DeVore, R. A. and Popov, V. A. (1988), Interpolation of Besov spaces, Trans. Amer. Math. Soc. 305, 397414.CrossRefGoogle Scholar
DeVore, R. A. and Scherer, K. (1979), Interpolation of linear operators on Sobolev spaces, Ann. of Math. 189, 583599.CrossRefGoogle Scholar
DeVore, R. A. and Sharpley, R. C. (1993), Besov spaces on domains in RD, Trans. Amer. Math. Soc. 335, 843864.Google Scholar
DeVore, R. A. and Temlyakov, V. N. (1996), Some remarks on greedy algorithms, Adv. Comput. Math. 5, 173187.CrossRefGoogle Scholar
DeVore, R. A., Howard, R. and Micchelli, C. (1989), Optimal non-linear approximation, Manuscripta Math. 4, 469478.CrossRefGoogle Scholar
DeVore, R. A., Kyriazis, G., Leviatan, D. and Tikhomirov, V. (1993), Wavelet compression and nonlinear n-widths, Adv. Comput. Math. 1, 197214.CrossRefGoogle Scholar
DeVore, R. A., Oskolkov, K. I. and Petrushev, P. P. (1997), Approximation by feed-forward neural networks, Ann. Numer. Math. 4, 261287.Google Scholar
DeVore, R. A., Petrova, G. and Wojtaszczyk, P. (2011), Approximation of functions of few variables in high dimensions, Constr. Approx. 33, 125143.CrossRefGoogle Scholar
DeVore, R. A., Petrova, G. and Wojtaszczyk, P. (2013), Greedy algorithms for reduced basis in Banach spaces, Constr. Approx. 37, 455466.CrossRefGoogle Scholar
Du, S. S., Lee, J., Li, H., Wang, L. and Zhai, X. (2019a), Gradient descent finds global minima of deep neural networks, in Proceedings of the 36th International Conference on Machine Learning (ICML 2019) (Chaudhuri, K. and Salakhutdinov, R., eds), Vol. 97 of Proceedings of Machine Learning Research, PMLR, pp. 16751685.Google Scholar
Du, S. S., Zhai, X., Poczos, B. and Singh, A. (2019b), Gradient descent provably optimizes over-parameterized neural networks, in 7th International Conference on Learning Representations (ICLR 2019). Available at https://openreview.net/forum?id=S1eK3i09YQ.Google Scholar
Dym, N., Sober, B. and Daubechies, I. (2020), Expression of fractals through neural network functions, IEEE J. Selected Areas Inform. Theory 1, 5766.CrossRefGoogle Scholar
Dziugaite, G. K. and Roy, D. M. (2017), Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data, in Workshop on Principled Approaches to Deep Learning (ICML 2017).Google Scholar
, W. E and Wang, Q. (2018), Exponential convergence of the deep neural network approximation for analytic functions, Sci. China Math. 61, 17331740.CrossRefGoogle Scholar
, W. E, Ma, C. and Wu, L. (2019), The Barron space and the flow-induced function spaces for neural network models. Available at arXiv:1906.08039.Google Scholar
Elbrächter, D., Perekrestenko, D., Grohs, P. and Bölcskei, H. (2019), Deep neural network approximation theory. Available at arXiv:1901.02220.Google Scholar
Frazier, M. W., Jawerth, B. and Weiss, G. (1991), Littlewood–Paley Theory and the Study of Function Spaces, Vol. 79 of CBMS Regional Conference Series in Mathematics, American Mathematical Society.CrossRefGoogle Scholar
Ghorbani, B., Mei, S., Misiakiewicz, T. and Montanari, A. (2021), Linearized two-layers neural networks in high dimension, Ann. Statist. 49, 10291054.CrossRefGoogle Scholar
Gribonval, R., Kutyniok, G., Nielsen, M. and Voigtlaender, F. (2019), Approximation spaces of deep neural networks. Available at arXiv:1905.01208.Google Scholar
Gühring, I., Raslan, M. and Kutyniok, G. (2020), Expressivity of deep neural networks. Available at arXiv:2007.04759.Google Scholar
Hanin, B. (2019), Universal function approximation by deep neural nets with bounded width and ReLU activations, Mathematics 7, 992.CrossRefGoogle Scholar
Hanin, B. and Rolnick, D. (2019), Deep ReLU networks have surprisingly few activation patterns, in Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (Wallach, H. et al., eds), Curran Associates, pp. 361370.Google Scholar
Hata, M. (1986), Fractals in mathematics, in Patterns and Waves: Qualitative Analysis of Nonlinear Differential Equations (Nishida, T., Mimura, M. and Fujii, H., eds), Vol. 18 of Studies in Mathematics and Its Applications, Elsevier, pp. 259278.CrossRefGoogle Scholar
He, J., Li, L., Xu, J. and Zheng, C. (2020), ReLU deep neural networks and linear finite elements, Comput. Math 38, 502527.Google Scholar
Hebb, D. O. (1949), The Organization of Behavior: A Neuropsychological Theory, Wiley, Chapman & Hall.Google Scholar
Hornik, K., Stinchcombe, M., White, H. et al. (1989), Multilayer feedforward networks are universal approximators, Neural Networks 2, 359366.CrossRefGoogle Scholar
Jacot, A., Gabriel, F. and Hongler, C. (2018), Neural tangent kernel: Convergence and generalization in neural networks, in Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (Bengio, S. et al., eds), Curran Associates, pp. 85718580.Google Scholar
Klusowski, J. and Barron, (2018), Approximation by combinations of ReLU and squared ReLU ridge functions with l1 and l0 controls, IEEE Trans. Inform. Theory 64, 76497656.CrossRefGoogle Scholar
Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012), ImageNet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems 25 (NIPS 2012) (Pereira, F. et al., eds), Curran Associates, pp. 10971105.Google Scholar
LeCun, Y., Bengio, Y. and Hinton, G. (2015), Deep learning, Nature 521 (7553), 436444.CrossRefGoogle ScholarPubMed
Liu, C., Zhu, L. and Belkin, M. (2020), Toward a theory of optimization for over-parameterized systems of non-linear equations: The lessons of deep learning. Available at arXiv:2003.00307.Google Scholar
Lorenz, G. G., Makovoz, Y. and Golitschek, M. von (1996), Constructive Approximation: Advanced Problems, first edition, Springer.CrossRefGoogle Scholar
Lu, J., Shen, Z., Yang, H. and Zhang, S. (2020), Deep network approximation for smooth functions. Available at arXiv:2001.03040.Google Scholar
Maiorov, V. (1999), On best approximation by ridge functions, J. Approx. Theory 99, 6894.CrossRefGoogle Scholar
Makovoz, Y. (1996), Random approximants and neural networks, J. Approx. Theory 85, 98109.CrossRefGoogle Scholar
Mhaskar, H. N. and Poggio, T. (2020), Function approximation by deep networks, Commun. Pure Appl. Anal. 19, 40854095.CrossRefGoogle Scholar
Montufar, G. F., Pascanu, R., Cho, K. and Bengio, Y. (2014), On the number of linear regions of deep neural networks, in Advances in Neural Information Processing Systems 27 (NIPS 2014) (Ghahramani, Z. et al., eds), Curran Associates, pp. 29242932.Google Scholar
Ongie, G., Willett, R., Soudry, D. and Srebro, N. (2020), A function space view of bounded norm infinite width ReLU nets: The multivariate case, in 8th International Conference on Learning Representations (ICLR 2020). Available at https://openreview.net/forum?id=H1lNPxHKDH.Google Scholar
Opschoor, J., Petersen, P. and Schwab, C. (2019a), Deep ReLU networks and high-order finite element methods, SAM, ETH Zürich.CrossRefGoogle Scholar
Opschoor, J., Schwab, C. and Zech, J. (2019b), Exponential ReLU DNN expression of holomorphic maps in high dimension. SAM Research Report, ETH Zürich.Google Scholar
Parhi, R. and Nowak, R. D. (2020), Banach space representer theorems for neural networks and ridge splines. Available at arXiv:2006.05626v2.Google Scholar
Peetre, J. (1976), New Thoughts on Besov Spaces, Vol. 1 of Duke University Mathematics Series, Mathematics Department, Duke University.Google Scholar
Petersen, P. (2020), Neural network theory. Available at http://pc-petersen.eu/Neural_Network_Theory.pdf.Google Scholar
Petersen, P. and Voigtlaender, F. (2018), Optimal approximation of piecewise smooth functions using deep ReLU neural networks, Neural Networks 108, 296330.CrossRefGoogle ScholarPubMed
Petrushev, P. (1988), Direct and converse theorems for spline and rational approximation and Besov spaces, in Function Spaces and Applications, Vol. 1302 of Lecture Notes in Mathematics, Springer, pp. 363377.CrossRefGoogle Scholar
Petrushev, P. (1998), Approximation by ridge functions and neural networks, SIAM J. Math. Anal. 30, 155189.CrossRefGoogle Scholar
Pinkus, A. (1999), Approximation theory of the MLP model in neural networks, in Acta Numerica, Vol. 8, Cambridge University Press, pp. 143195.Google Scholar
Pinkus, A. (2012), N-widths in Approximation Theory, Vol. 7 of A Series of Modern Surveys in Mathematics, Springer Science & Business Media.Google Scholar
Rosenblatt, F. (1958), The perceptron: A probabilistic model for information storage and organization in the brain, Psychological Review 65, 386.CrossRefGoogle Scholar
Savarese, P., Evron, I., Soudry, D. and Srebro, N. (2019), How do infinite width bounded norm networks look in function space?, in Proceedings of the 32nd Conference on Learning Theory (COLT 2019) (Beygelzimer, A. and Hsu, D., eds), Vol. 99 of Proceedings of Machine Learning Research, PMLR, pp. 26672690.Google Scholar
Schmidt-Hieber, J. (2020), Nonparametric regression using deep neural networks with ReLU activation, Ann. Statist. 48, 18751897.Google Scholar
Shen, Z., Yang, H. and Zhang, S. (2019), Nonlinear approximation via compositions, Neural Networks 119, 7484.CrossRefGoogle ScholarPubMed
Siegel, J. W. and Xu, J. (2020), High order approximation rates for neural networks with ReLU k activation functions. Available at arXiv:2012.07205.Google Scholar
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016), Mastering the game of Go with deep neural networks and tree search, Nature 529 (7587), 484489.CrossRefGoogle ScholarPubMed
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. et al. (2017), Mastering the game of Go without human knowledge, Nature 550 (7676), 354359.CrossRefGoogle ScholarPubMed
Stanley, R. P. et al. (2004), An introduction to hyperplane arrangements, Geometric Com-binatorics 13, 389496.CrossRefGoogle Scholar
Stein, E. M. (1970), Singular Integrals and Differentiability Properties of Functions, Princeton University Press.Google Scholar
Telgarsky, M. (2016), Representation benefits of deep feedforward networks, JMLR: Workshop and Conference Proceedings 49, 123.Google Scholar
Traub, J. and Wozniakowski, H. (1980), A General Theory of Optimal Algorithms, Academic Press.Google Scholar
Unser, M. (2020), A unifying representer theorem for inverse problems and machine learning, Found. Comput. Math. doi:10.1007/s10208-020-09472-x.CrossRefGoogle Scholar
Vapnik, V. (1989), Statistical Learning Theory, Wiley-Interscience.Google Scholar
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K. et al. (2016), Google’s neural machine translation system: Bridging the gap between human and machine translation. Available at arXiv:1609.08144.Google Scholar
Yarotsky, D. (2017), Error bounds for approximations with deep ReLU networks, Neural Networks 94, 103114.CrossRefGoogle ScholarPubMed
Yarotsky, D. (2018), Optimal approximation of continuous functions by very deep ReLU networks, in 31st Conference on Learning Theory (COLT 2018) (Bubeck, S. et al., eds), Vol. 75 of Proceedings of Machine Learning Research, PMLR, pp. 639649.Google Scholar
Zaslavsky, T. (1975), Facing Up To Arrangements: Face-Count Formulas for Partitions of Space by Hyper-Planes, American Mathematical Society.Google Scholar
Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2017), Understanding deep learning requires rethinking generalization, in 5th International Conference on Learning Representations (ICLR 2017). Available at https://openreview.net/forum?id=Sy8gdB9xx.Google Scholar