1 code implementation • 21 Sep 2023 • Philip M. Long, Peter L. Bartlett
Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value.
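To see why $2/\eta$ is the natural threshold, here is a minimal sketch (a one-dimensional quadratic, not the neural networks studied in the paper) showing that gradient descent is stable exactly when the curvature stays below $2/\eta$:

```python
# Gradient descent on f(w) = 0.5 * a * w^2 updates w <- (1 - eta * a) * w,
# so the iterates shrink when a < 2/eta and blow up when a > 2/eta.
eta = 0.1                           # step size, so the threshold is 2/eta = 20
for a in [15.0, 19.0, 21.0]:        # curvatures below and above the threshold
    w = 1.0
    for _ in range(50):
        w -= eta * a * w            # one gradient step
    print(f"curvature {a:5.1f}: |w| after 50 steps = {abs(w):.3g}")
```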
no code implementations • 21 Apr 2023 • Peter L. Bartlett, Philip M. Long
We apply this result, together with techniques due to Haussler and to Benedek and Itai, to obtain new upper bounds on packing numbers in terms of this scale-sensitive notion of dimension.
no code implementations • 4 Oct 2022 • Peter L. Bartlett, Philip M. Long, Olivier Bousquet
We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems.
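A minimal sketch of the SAM update rule on a toy objective (the quadratic loss, step size, and perturbation radius here are illustrative, not taken from the paper): the gradient is evaluated at a nearby worst-case point rather than at the current iterate.

```python
import numpy as np

def sam_step(w, grad_fn, eta=0.1, rho=0.05):
    """One SAM step: ascend to w + rho * g/||g||, then descend using the
    gradient computed at that perturbed point."""
    g = grad_fn(w)
    w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)
    return w - eta * grad_fn(w_adv)

# Toy loss L(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sam_step(w, lambda v: v)
print(w)   # driven toward the minimizer at the origin
```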
1 code implementation • 19 Sep 2022 • Niladri S. Chatterji, Philip M. Long
We bound the excess risk of interpolating deep linear networks trained using gradient flow.
no code implementations • 8 Dec 2021 • Philip M. Long, Rocco A. Servedio
Van Rooyen et al. introduced a notion of convex loss functions being robust to random classification noise, and established that the "unhinged" loss function is robust in this sense.
no code implementations • 6 Oct 2021 • Niladri S. Chatterji, Philip M. Long
We prove a lower bound on the excess risk of sparse interpolating procedures for linear regression with Gaussian data in the overparameterized regime.
no code implementations • 25 Aug 2021 • Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett
The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data.
no code implementations • 21 May 2021 • Philip M. Long
The Neural Tangent Kernel (NTK) is the wide-network limit of a kernel defined using neural networks at initialization, whose embedding is the gradient of the output of the network with respect to its parameters.
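A minimal sketch of the empirical (finite-width) version of this kernel for a hypothetical one-hidden-layer network: the embedding of an input is the gradient of the network output with respect to all parameters, and the kernel value for two inputs is the inner product of their embeddings.

```python
import numpy as np

def ntk_embedding(x, W, v):
    """Gradient of f(x) = v . tanh(W x) with respect to (W, v), flattened."""
    h = np.tanh(W @ x)
    dW = np.outer(v * (1.0 - h ** 2), x)      # d f / d W
    dv = h                                    # d f / d v
    return np.concatenate([dW.ravel(), dv])

def empirical_ntk(x1, x2, W, v):
    return ntk_embedding(x1, W, v) @ ntk_embedding(x2, W, v)

rng = np.random.default_rng(0)
W = rng.normal(size=(200, 5)) / np.sqrt(5)    # random initialization
v = rng.normal(size=200) / np.sqrt(200)
x, xp = rng.normal(size=5), rng.normal(size=5)
print(empirical_ntk(x, xp, W, v))
```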
no code implementations • 9 Feb 2021 • Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett
We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence.
no code implementations • 4 Dec 2020 • Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett
We study the training of finite-width two-layer smoothed ReLU networks for binary classification using the logistic loss.
no code implementations • 16 Oct 2020 • Peter L. Bartlett, Philip M. Long
We consider bounds on the generalization performance of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data.
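A minimal sketch of the object being bounded, with illustrative dimensions: in the over-parameterized regime the least-norm regressor (computable via the pseudoinverse) fits the noisy training data exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                              # d >> n: over-parameterized
X = rng.normal(size=(n, d))
y = X[:, 0] + 0.1 * rng.normal(size=n)      # noisy labels from a simple signal

w = np.linalg.pinv(X) @ y                   # minimum-norm interpolating solution
print("max training residual:", np.max(np.abs(X @ w - y)))   # essentially zero
print("||w||:", np.linalg.norm(w))
```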
no code implementations • 25 Apr 2020 • Niladri S. Chatterji, Philip M. Long
We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification.
no code implementations • ICLR 2020 • Difan Zou, Philip M. Long, Quanquan Gu
We further propose modified identity input and output transformations, and show that a $(d+k)$-wide neural network is sufficient to guarantee the global convergence of GD/SGD, where $d$ and $k$ are the input and output dimensions, respectively.
no code implementations • 1 Feb 2020 • Niladri S. Chatterji, Peter L. Bartlett, Philip M. Long
We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove an information theoretic lower bound on the number of stochastic gradient queries of the log density needed.
no code implementations • 26 Jun 2019 • Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler
Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction.
no code implementations • ICLR 2020 • Philip M. Long, Hanie Sedghi
We prove bounds on the generalization error of convolutional networks.
no code implementations • 7 Jan 2019 • Philip M. Long, Hanie Sedghi
We analyze the joint probability distribution on the lengths of the vectors of hidden variables in different layers of a fully connected deep network, when the weights and biases are chosen randomly according to Gaussian distributions, and the input is in $\{ -1, 1\}^N$.
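A minimal simulation sketch of the quantities studied: the lengths of the hidden-layer vectors of a randomly initialized fully connected network on a $\{-1, 1\}^N$ input (the ReLU activation and the weight and bias variances below are illustrative assumptions, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
N, depth = 100, 10
x = rng.choice([-1.0, 1.0], size=N)                        # input in {-1, 1}^N

h, lengths = x, []
for _ in range(depth):
    W = rng.normal(scale=np.sqrt(2.0 / N), size=(N, N))    # Gaussian weights
    b = rng.normal(scale=0.1, size=N)                      # Gaussian biases
    h = np.maximum(W @ h + b, 0.0)                         # next hidden layer
    lengths.append(np.linalg.norm(h))
print(lengths)                                             # one length per layer
```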
no code implementations • 9 Nov 2018 • Anindya De, Philip M. Long, Rocco A. Servedio
This implies that, for constant $d$, multivariate log-concave distributions can be learned in $\tilde{O}_d(1/\epsilon^{2d+2})$ time using $\tilde{O}_d(1/\epsilon^{d+2})$ samples, answering a question of [Diakonikolas, Kane and Stewart, 2016]. All of our results extend to a model of noise-tolerant density estimation using Huber's contamination model, in which the target distribution to be learned is a $(1-\epsilon,\epsilon)$ mixture of some unknown distribution in the class with some other arbitrary and unknown distribution, and the learning algorithm must output a hypothesis distribution with total variation distance error $O(\epsilon)$ from the target distribution.
no code implementations • 18 Jul 2018 • Anindya De, Philip M. Long, Rocco A. Servedio
For the case $| \mathcal{A} | = 3$, we give an algorithm for learning $\mathcal{A}$-sums to accuracy $\epsilon$ that uses $\mathsf{poly}(1/\epsilon)$ samples and runs in time $\mathsf{poly}(1/\epsilon)$, independent of $N$ and of the elements of $\mathcal{A}$.
1 code implementation • ICLR 2019 • Hanie Sedghi, Vineet Gupta, Philip M. Long
We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation.
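A sketch of the FFT-based computation this characterization suggests, assuming circular padding: take the 2D DFT of each filter over the spatial dimensions and collect the singular values of the resulting per-frequency channel matrices (the kernel layout and sizes here are illustrative).

```python
import numpy as np

def conv_singular_values(kernel, input_size):
    """Singular values of a circularly padded 2D convolution.

    kernel: array of shape (k, k, c_in, c_out); input_size: spatial side length.
    """
    transfer = np.fft.fft2(kernel, (input_size, input_size), axes=(0, 1))
    # transfer[u, v] is the c_in x c_out channel matrix at spatial frequency (u, v)
    return np.linalg.svd(transfer, compute_uv=False).ravel()

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3, 4, 8))
svals = conv_singular_values(kernel, input_size=16)
print("operator norm of the layer:", svals.max())
```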
no code implementations • 13 Apr 2018 • Peter L. Bartlett, Steven N. Evans, Philip M. Long
This implies that $h$ can be represented to any accuracy by a deep residual network whose nonlinear layers compute functions with a small Lipschitz constant.
no code implementations • ICML 2018 • Peter L. Bartlett, David P. Helmbold, Philip M. Long
We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix $\Phi$, in the case where the initial hypothesis $\Theta_1 = \cdots = \Theta_L = I$ has excess loss bounded by a small enough constant.
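A minimal simulation sketch of this setting, with illustrative dimensions, target, and step size: a deep linear network $\Theta_L \cdots \Theta_1$ initialized at the identity and trained by gradient descent on the squared loss toward a target map $\Phi$ that is close to the identity (so the initial excess loss is small).

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, eta = 5, 4, 0.05
Phi = np.eye(d) + 0.1 * rng.normal(size=(d, d))   # target close to the identity
Theta = [np.eye(d) for _ in range(L)]             # Theta_1 = ... = Theta_L = I

def prod(mats):
    """Product mats[-1] @ ... @ mats[0]; the empty product is the identity."""
    P = np.eye(d)
    for M in mats:
        P = M @ P
    return P

for _ in range(500):
    R = prod(Theta) - Phi                         # end-to-end residual
    grads = [prod(Theta[i + 1:]).T @ R @ prod(Theta[:i]).T for i in range(L)]
    Theta = [T - eta * G for T, G in zip(Theta, grads)]

print("final loss:", 0.5 * np.linalg.norm(prod(Theta) - Phi) ** 2)
```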
no code implementations • 14 Feb 2016 • David P. Helmbold, Philip M. Long
We analyze dropout in deep networks with rectified linear units and the quadratic loss.
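A minimal sketch of the mechanism under analysis: a rectified-linear layer followed by dropout, where each hidden unit is zeroed independently (the layer sizes, dropout rate, and the "inverted dropout" rescaling below are common conventions used for illustration, not necessarily those of the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_relu_layer(x, W, b, drop_prob=0.5):
    """ReLU layer with dropout: zero each unit independently with probability
    drop_prob, then rescale so the expected output matches the dropout-free layer."""
    h = np.maximum(W @ x + b, 0.0)
    mask = rng.random(h.shape) >= drop_prob
    return h * mask / (1.0 - drop_prob)

x = rng.normal(size=8)
W, b = rng.normal(size=(16, 8)), np.zeros(16)
print(dropout_relu_layer(x, W, b))
```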
no code implementations • 15 Dec 2014 • David P. Helmbold, Philip M. Long
Dropout is a simple but effective technique for learning in neural networks and other settings.
no code implementations • 31 Jul 2013 • Pranjal Awasthi, Maria Florina Balcan, Philip M. Long
For malicious noise, where the adversary can corrupt both the label and the features, we provide a polynomial-time algorithm for learning linear separators in $\mathbb{R}^d$ under isotropic log-concave distributions that can tolerate a nearly information-theoretically optimal noise rate of $\eta = \Omega(\epsilon)$.
no code implementations • 6 Nov 2012 • Maria Florina Balcan, Philip M. Long
We provide new results concerning label efficient, polynomial time, passive and active learning of linear separators.