J. Mach. Learn. , 1 (2022), pp. 141-246.

Published online: 2022-07

Category: Theory

[*An open-access
article; the PDF
is free to any online user.*]

Cited by

- BibTex
- RIS
- TXT

Although gradient descent (GD) optimization methods in combination with rectified linear unit
(ReLU) artificial neural networks (ANNs) often supply an impressive performance in real world learning
problems, till this day it remains – in all practically relevant scenarios – an open problem of research to rigorously prove (or disprove) the conjecture that such GD optimization methods do converge in the training of
ANNs with ReLU activation.

In this article we study fully-connected feedforward deep ReLU ANNs with an arbitrarily large number of
hidden layers and we prove convergence of the risk of the GD optimization method with random initializations in the training of such ANNs under the assumption that the unnormalized probability density function $p: [a, b]^d → [0, ∞)$ of the probability distribution of the input data of the considered supervised learning problem is piecewise polynomial, under the assumption that the target function $f: [a, b]^d → \mathbb{R}^{\delta}$ (describing the
relationship between input data and the output data) is piecewise polynomial, and under the assumption that
the risk function of the considered supervised learning problem admits at least one regular global minimum.
In addition, in the special situation of shallow ANNs with just one hidden layer and one-dimensional input we
also verify this assumption by proving in the training of such shallow ANNs that for every Lipschitz continuous target function there exists a global minimum in the risk landscape. Finally, in the training of deep ANNs
with ReLU activation we also study solutions of gradient flow (GF) differential equations and we prove by
proving that every non-divergent GF trajectory converges with a polynomial rate of convergence to a critical
point (in the sense of limiting Fréchet subdifferentiability).

Our mathematical convergence analysis builds up on ideas from our previous article [S. Eberle, A. Jentzen,
A. Riekert, & G. Weiss, Existence, uniqueness, and convergence rates for gradient flows in the training of
artificial neural networks with ReLU activation. arXiv:2108.08106 (2021)], on tools from real algebraic geometry such as the concept of semi-algebraic functions and generalized Kurdyka-Łojasiewicz inequalities, on tools
from functional analysis such as the Arzelà–Ascoli theorem on the relative compactness of uniformly bounded
and equicontinuous sequences of continuous functions, on tools from nonsmooth analysis such as the concept
of limiting Fréchet subgradients, as well as on the fact that the set of realization functions of shallow ReLU
ANNs with fixed architecture forms a closed subset of the set of continuous functions revealed in [P. Petersen,
M. Raslan, & F. Voigtlaender, Topological properties of the set of functions generated by neural networks of
fixed size. Found. Comput. Math. 21 (2021), no. 2, 375–444].

Although gradient descent (GD) optimization methods in combination with rectified linear unit
(ReLU) artificial neural networks (ANNs) often supply an impressive performance in real world learning
problems, till this day it remains – in all practically relevant scenarios – an open problem of research to rigorously prove (or disprove) the conjecture that such GD optimization methods do converge in the training of
ANNs with ReLU activation.

In this article we study fully-connected feedforward deep ReLU ANNs with an arbitrarily large number of
hidden layers and we prove convergence of the risk of the GD optimization method with random initializations in the training of such ANNs under the assumption that the unnormalized probability density function $p: [a, b]^d → [0, ∞)$ of the probability distribution of the input data of the considered supervised learning problem is piecewise polynomial, under the assumption that the target function $f: [a, b]^d → \mathbb{R}^{\delta}$ (describing the
relationship between input data and the output data) is piecewise polynomial, and under the assumption that
the risk function of the considered supervised learning problem admits at least one regular global minimum.
In addition, in the special situation of shallow ANNs with just one hidden layer and one-dimensional input we
also verify this assumption by proving in the training of such shallow ANNs that for every Lipschitz continuous target function there exists a global minimum in the risk landscape. Finally, in the training of deep ANNs
with ReLU activation we also study solutions of gradient flow (GF) differential equations and we prove by
proving that every non-divergent GF trajectory converges with a polynomial rate of convergence to a critical
point (in the sense of limiting Fréchet subdifferentiability).

Our mathematical convergence analysis builds up on ideas from our previous article [S. Eberle, A. Jentzen,
A. Riekert, & G. Weiss, Existence, uniqueness, and convergence rates for gradient flows in the training of
artificial neural networks with ReLU activation. arXiv:2108.08106 (2021)], on tools from real algebraic geometry such as the concept of semi-algebraic functions and generalized Kurdyka-Łojasiewicz inequalities, on tools
from functional analysis such as the Arzelà–Ascoli theorem on the relative compactness of uniformly bounded
and equicontinuous sequences of continuous functions, on tools from nonsmooth analysis such as the concept
of limiting Fréchet subgradients, as well as on the fact that the set of realization functions of shallow ReLU
ANNs with fixed architecture forms a closed subset of the set of continuous functions revealed in [P. Petersen,
M. Raslan, & F. Voigtlaender, Topological properties of the set of functions generated by neural networks of
fixed size. Found. Comput. Math. 21 (2021), no. 2, 375–444].

*Journal of Machine Learning*.

*1*(2). 141-246. doi:10.4208/jml.220114a