arXiv:1810.02281v1 [cs.LG] 4 Oct 2018

A CONVERGENCE ANALYSIS OF GRADIENT DESCENT FOR DEEP LINEAR NEURAL NETWORKS

Sanjeev Arora
Princeton University and Institute for Advanced Study
arora@cs.princeton.edu

Nadav Cohen
Institute for Advanced Study
cohennadav@ias.edu

Noah Golowich
Harvard University
ngolowich@college.harvard.edu

Wei Hu
Princeton University
huwei@cs.princeton.edu

ABSTRACT

We analyze speed of convergence to global optimum for gradient descent training a deep linear neural network (parameterized as x ↦ W_N ⋯ W_1 x) by minimizing the ℓ2 loss over whitened data. Convergence at a linear rate is guaranteed when the following hold: (i) dimensions of hidden layers are at least the minimum of the input and output dimensions; (ii) weight matrices at initialization are approximately balanced; and (iii) the initial loss is smaller than the loss of any rank-deficient solution. The assumptions on initialization (conditions