1 code implementation • 21 Sep 2023 • Philip M. Long, Peter L. Bartlett
Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value.
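To see why $2/\eta$ is the natural threshold, here is a minimal sketch (a one-dimensional quadratic, not the neural networks studied in the paper) showing that gradient descent is stable exactly when the curvature stays below $2/\eta$:

```python
# Gradient descent on f(w) = 0.5 * a * w^2 updates w <- (1 - eta * a) * w,
# so the iterates shrink when a < 2/eta and blow up when a > 2/eta.
eta = 0.1                           # step size, so the threshold is 2/eta = 20
for a in [15.0, 19.0, 21.0]:        # curvatures below and above the threshold
    w = 1.0
    for _ in range(50):
        w -= eta * a * w            # one gradient step
    print(f"curvature {a:5.1f}: |w| after 50 steps = {abs(w):.3g}")
```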
no code implementations • 21 Apr 2023 • Peter L. Bartlett, Philip M. Long
We apply this result, together with techniques due to Haussler and to Benedek and Itai, to obtain new upper bounds on packing numbers in terms of this scale-sensitive notion of dimension.
no code implementations • 4 Oct 2022 • Peter L. Bartlett, Philip M. Long, Olivier Bousquet
We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems.
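A minimal sketch of the SAM update rule on a toy objective (the quadratic loss, step size, and perturbation radius here are illustrative, not taken from the paper): the gradient is evaluated at a nearby worst-case point rather than at the current iterate.

```python
import numpy as np

def sam_step(w, grad_fn, eta=0.1, rho=0.05):
    """One SAM step: ascend to w + rho * g/||g||, then descend using the
    gradient computed at that perturbed point."""
    g = grad_fn(w)
    w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)
    return w - eta * grad_fn(w_adv)

# Toy loss L(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sam_step(w, lambda v: v)
print(w)   # driven toward the minimizer at the origin
```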
1 code implementation • 19 Sep 2022 • Niladri S. Chatterji, Philip M. Long
We bound the excess risk of interpolating deep linear networks trained using gradient flow.
no code implementations • 8 Dec 2021 • Philip M. Long, Rocco A. Servedio
Van Rooyen et al. introduced a notion of convex loss functions being robust to random classification noise, and established that the "unhinged" loss function is robust in this sense.
no code implementations • 6 Oct 2021 • Niladri S. Chatterji, Philip M. Long
We prove a lower bound on the excess risk of sparse interpolating procedures for linear regression with Gaussian data in the overparameterized regime.
no code implementations • 25 Aug 2021 • Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett
The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data.
no code implementations • 21 May 2021 • Philip M. Long
The Neural Tangent Kernel (NTK) is the wide-network limit of a kernel defined using neural networks at initialization, whose embedding is the gradient of the output of the network with respect to its parameters.
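A minimal sketch of the empirical (finite-width) version of this kernel for a hypothetical one-hidden-layer network: the embedding of an input is the gradient of the network output with respect to all parameters, and the kernel value for two inputs is the inner product of their embeddings.

```python
import numpy as np

def ntk_embedding(x, W, v):
    """Gradient of f(x) = v . tanh(W x) with respect to (W, v), flattened."""
    h = np.tanh(W @ x)
    dW = np.outer(v * (1.0 - h ** 2), x)      # d f / d W
    dv = h                                    # d f / d v
    return np.concatenate([dW.ravel(), dv])

def empirical_ntk(x1, x2, W, v):
    return ntk_embedding(x1, W, v) @ ntk_embedding(x2, W, v)

rng = np.random.default_rng(0)
W = rng.normal(size=(200, 5)) / np.sqrt(5)    # random initialization
v = rng.normal(size=200) / np.sqrt(200)
x, xp = rng.normal(size=5), rng.normal(size=5)
print(empirical_ntk(x, xp, W, v))
```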
no code implementations • 9 Feb 2021 • Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett
We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence.
no code implementations • 4 Dec 2020 • Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett
We study the training of finite-width two-layer smoothed ReLU networks for binary classification using the logistic loss.
no code implementations • 16 Oct 2020 • Peter L. Bartlett, Philip M. Long
We consider bounds on the generalization performance of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data.
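A minimal sketch of the object being bounded, with illustrative dimensions: in the over-parameterized regime the least-norm regressor (computable via the pseudoinverse) fits the noisy training data exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                              # d >> n: over-parameterized
X = rng.normal(size=(n, d))
y = X[:, 0] + 0.1 * rng.normal(size=n)      # noisy labels from a simple signal

w = np.linalg.pinv(X) @ y                   # minimum-norm interpolating solution
print("max training residual:", np.max(np.abs(X @ w - y)))   # essentially zero
print("||w||:", np.linalg.norm(w))
```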
no code implementations • 25 Apr 2020 • Niladri S. Chatterji, Philip M. Long
We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification.
no code implementations • ICLR 2020 • Difan Zou, Philip M. Long, Quanquan Gu
We further propose modified identity input and output transformations, and show that a $(d+k)$-wide neural network is sufficient to guarantee the global convergence of GD/SGD, where $d$ and $k$ are the input and output dimensions, respectively.
no code implementations • 1 Feb 2020 • Niladri S. Chatterji, Peter L. Bartlett, Philip M. Long
We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove an information theoretic lower bound on the number of stochastic gradient queries of the log density needed.
no code implementations • 26 Jun 2019 • Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler
Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction.
no code implementations • ICLR 2020 • Philip M. Long, Hanie Sedghi
We prove bounds on the generalization error of convolutional networks.
no code implementations • 7 Jan 2019 • Philip M. Long, Hanie Sedghi
We analyze the joint probability distribution on the lengths of the vectors of hidden variables in different layers of a fully connected deep network, when the weights and biases are chosen randomly according to Gaussian distributions, and the input is in $\{ -1, 1\}^N$.
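A minimal simulation sketch of the quantities studied: the lengths of the hidden-layer vectors of a randomly initialized fully connected network on a $\{-1, 1\}^N$ input (the ReLU activation and the weight and bias variances below are illustrative assumptions, not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
N, depth = 100, 10
x = rng.choice([-1.0, 1.0], size=N)                        # input in {-1, 1}^N

h, lengths = x, []
for _ in range(depth):
    W = rng.normal(scale=np.sqrt(2.0 / N), size=(N, N))    # Gaussian weights
    b = rng.normal(scale=0.1, size=N)                      # Gaussian biases
    h = np.maximum(W @ h + b, 0.0)                         # next hidden layer
    lengths.append(np.linalg.norm(h))
print(lengths)                                             # one length per layer
```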
no code implementations • 9 Nov 2018 • Anindya De, Philip M. Long, Rocco A. Servedio
This implies that, for constant $d$, multivariate log-concave distributions can be learned in $\tilde{O}_d(1/\epsilon^{2d+2})$ time using $\tilde{O}_d(1/\epsilon^{d+2})$ samples, answering a question of [Diakonikolas, Kane and Stewart, 2016]. All of our results extend to a model of noise-tolerant density estimation using Huber's contamination model, in which the target distribution to be learned is a $(1-\epsilon,\epsilon)$ mixture of some unknown distribution in the class with some other arbitrary and unknown distribution, and the learning algorithm must output a hypothesis distribution with total variation distance error $O(\epsilon)$ from the target distribution.
no code implementations • 18 Jul 2018 • Anindya De, Philip M. Long, Rocco A. Servedio
For the case $| \mathcal{A} | = 3$, we give an algorithm for learning $\mathcal{A}$-sums to accuracy $\epsilon$ that uses $\mathsf{poly}(1/\epsilon)$ samples and runs in time $\mathsf{poly}(1/\epsilon)$, independent of $N$ and of the elements of $\mathcal{A}$.
1 code implementation • ICLR 2019 • Hanie Sedghi, Vineet Gupta, Philip M. Long
We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation.
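A sketch of the FFT-based computation this characterization suggests, assuming circular padding: take the 2D DFT of each filter over the spatial dimensions and collect the singular values of the resulting per-frequency channel matrices (the kernel layout and sizes here are illustrative).

```python
import numpy as np

def conv_singular_values(kernel, input_size):
    """Singular values of a circularly padded 2D convolution.

    kernel: array of shape (k, k, c_in, c_out); input_size: spatial side length.
    """
    transfer = np.fft.fft2(kernel, (input_size, input_size), axes=(0, 1))
    # transfer[u, v] is the c_in x c_out channel matrix at spatial frequency (u, v)
    return np.linalg.svd(transfer, compute_uv=False).ravel()

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3, 4, 8))
svals = conv_singular_values(kernel, input_size=16)
print("operator norm of the layer:", svals.max())
```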
no code implementations • 13 Apr 2018 • Peter L. Bartlett, Steven N. Evans, Philip M. Long
This implies that $h$ can be represented to any accuracy by a deep residual network whose nonlinear layers compute functions with a small Lipschitz constant.
no code implementations • ICML 2018 • Peter L. Bartlett, David P. Helmbold, Philip M. Long
We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix $\Phi$, in the case where the initial hypothesis $\Theta_1 = \cdots = \Theta_L = I$ has excess loss bounded by a small enough constant.
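A minimal simulation sketch of this setting, with illustrative dimensions, target, and step size: a deep linear network $\Theta_L \cdots \Theta_1$ initialized at the identity and trained by gradient descent on the squared loss toward a target map $\Phi$ that is close to the identity (so the initial excess loss is small).

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, eta = 5, 4, 0.05
Phi = np.eye(d) + 0.1 * rng.normal(size=(d, d))   # target close to the identity
Theta = [np.eye(d) for _ in range(L)]             # Theta_1 = ... = Theta_L = I

def prod(mats):
    """Product mats[-1] @ ... @ mats[0]; the empty product is the identity."""
    P = np.eye(d)
    for M in mats:
        P = M @ P
    return P

for _ in range(500):
    R = prod(Theta) - Phi                         # end-to-end residual
    grads = [prod(Theta[i + 1:]).T @ R @ prod(Theta[:i]).T for i in range(L)]
    Theta = [T - eta * G for T, G in zip(Theta, grads)]

print("final loss:", 0.5 * np.linalg.norm(prod(Theta) - Phi) ** 2)
```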
no code implementations • 14 Feb 2016 • David P. Helmbold, Philip M. Long
We analyze dropout in deep networks with rectified linear units and the quadratic loss.
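A minimal sketch of the mechanism under analysis: a rectified-linear layer followed by dropout, where each hidden unit is zeroed independently (the layer sizes, dropout rate, and the "inverted dropout" rescaling below are common conventions used for illustration, not necessarily those of the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_relu_layer(x, W, b, drop_prob=0.5):
    """ReLU layer with dropout: zero each unit independently with probability
    drop_prob, then rescale so the expected output matches the dropout-free layer."""
    h = np.maximum(W @ x + b, 0.0)
    mask = rng.random(h.shape) >= drop_prob
    return h * mask / (1.0 - drop_prob)

x = rng.normal(size=8)
W, b = rng.normal(size=(16, 8)), np.zeros(16)
print(dropout_relu_layer(x, W, b))
```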
no code implementations • 15 Dec 2014 • David P. Helmbold, Philip M. Long
Dropout is a simple but effective technique for learning in neural networks and other settings.
no code implementations • 31 Jul 2013 • Pranjal Awasthi, Maria Florina Balcan, Philip M. Long
For malicious noise, where the adversary can corrupt both the label and the features, we provide a polynomial-time algorithm for learning linear separators in $\mathbb{R}^d$ under isotropic log-concave distributions that can tolerate a nearly information-theoretically optimal noise rate of $\eta = \Omega(\epsilon)$.
no code implementations • 6 Nov 2012 • Maria Florina Balcan, Philip M. Long
We provide new results concerning label efficient, polynomial time, passive and active learning of linear separators.