Key points are not available for this paper at this time.
We consider the problem of training a shallow neural network with quadratic activation functions and the generalization power of such trained networks. Assuming that the samples are generated by a full rank matrix Formula: see text of the hidden network node weights, we obtain the following results. We establish that all full-rank approximately stationary solutions of the risk minimization problem are also approximate global optimums of the risk (in-sample and population). As a consequence, we establish that, when trained on polynomially many samples, the gradient descent algorithm converges to the global optimum of the risk minimization problem regardless of the width of the network when it is initialized at some value Formula: see text, which we compute. Furthermore, the network produced by the gradient descent has a near zero generalization error. Next, we establish that initializing the gradient descent algorithm below Formula: see text is easily achieved when the weights of the ground truth matrix Formula: see text are randomly generated and the matrix is sufficiently overparameterized. Finally, we identify a simple necessary and sufficient geometric condition on the size of the training set under which any global minimizer of the empirical risk has necessarily zero generalization error. Funding: The research of E. C. Kizildag is supported by Columbia University, with the Distinguished Postdoctoral Fellowship in Statistics. Support from the National Science Foundation Grant DMS-2015517 is gratefully acknowledged.
Building similarity graph...
Analyzing shared references across papers
Loading...
Mathematics of Operations Research
Massachusetts Institute of Technology
Columbia University
Yale University
Add This Paper to Your Research Feed
Any time a new paper drops it will be there.
Gamarnik et al. (Tue,) studied this question.