Search Results for author: Philip M. Long

Found 26 papers, 3 papers with code

Sharpness-Aware Minimization and the Edge of Stability

1 code implementation • 21 Sep 2023 • Philip M. Long, Peter L. Bartlett

Recent experiments have shown that, often, when training a neural network with gradient descent (GD) with a step size $\eta$, the operator norm of the Hessian of the loss grows until it approximately reaches $2/\eta$, after which it fluctuates around this value.
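As a reminder of why $2/\eta$ is the relevant threshold, here is a minimal sketch (not from the paper): for GD on a quadratic, whose Hessian operator norm is just the curvature $a$, the iterates contract exactly when $a < 2/\eta$. The step size and curvatures below are illustrative.

```python
import numpy as np

# GD on f(x) = 0.5 * a * x**2; the Hessian of the loss is the constant a.
# The update x <- x - eta * a * x scales x by (1 - eta * a), which has magnitude
# below 1 exactly when a < 2 / eta -- the "edge of stability" threshold.
def gd_on_quadratic(a, eta, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x -= eta * a * x
    return abs(x)

eta = 0.1
for a in [0.5 / eta, 1.9 / eta, 2.1 / eta]:   # curvatures below, near, and above 2/eta
    print(f"a = {a:5.1f}  |x_20| = {gd_on_quadratic(a, eta):.2e}")
```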

Prediction, Learning, Uniform Convergence, and Scale-sensitive Dimensions

no code implementations • 21 Apr 2023 • Peter L. Bartlett, Philip M. Long

We apply this result, together with techniques due to Haussler and to Benedek and Itai, to obtain new upper bounds on packing numbers in terms of this scale-sensitive notion of dimension.

The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima

no code implementations • 4 Oct 2022 • Peter L. Bartlett, Philip M. Long, Olivier Bousquet

We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems.
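For readers unfamiliar with SAM, a minimal sketch of the usual SAM update (ascend a distance $\rho$ along the normalized gradient, then take the descent step using the gradient at the perturbed point) follows; the toy loss and the values of $\eta$ and $\rho$ are illustrative, not the paper's experimental setup.

```python
import numpy as np

def sam_step(w, loss_grad, eta=0.1, rho=0.05):
    """One SAM update: perturb to a nearby 'sharper' point, then descend from there.

    loss_grad(w) should return the gradient of the training loss at w."""
    g = loss_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # ascent step of radius rho
    return w - eta * loss_grad(w + eps)           # descend using the perturbed gradient

# Toy usage on the quadratic loss 0.5 * ||w||^2, whose gradient is w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sam_step(w, lambda v: v)
print(np.linalg.norm(w))
```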

Deep Linear Networks can Benignly Overfit when Shallow Ones Do

1 code implementation • 19 Sep 2022 • Niladri S. Chatterji, Philip M. Long

We bound the excess risk of interpolating deep linear networks trained using gradient flow.

The perils of being unhinged: On the accuracy of classifiers minimizing a noise-robust convex loss

no code implementations • 8 Dec 2021 • Philip M. Long, Rocco A. Servedio

Van Rooyen et al. introduced a notion of convex loss functions being robust to random classification noise, and established that the "unhinged" loss function is robust in this sense.
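For reference, the "unhinged" loss of van Rooyen et al. is (up to scaling) the hinge loss with the truncation at zero removed, so it is linear in the margin $y\,f(x)$; the short comparison below is purely illustrative.

```python
import numpy as np

def hinge(margin):      # margin = y * f(x)
    return np.maximum(0.0, 1.0 - margin)

def unhinged(margin):   # linear in the margin: no clipping at zero, hence "unhinged"
    return 1.0 - margin

margins = np.linspace(-2.0, 3.0, 6)
print(np.c_[margins, hinge(margins), unhinged(margins)])
```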

Foolish Crowds Support Benign Overfitting

no code implementations • 6 Oct 2021 • Niladri S. Chatterji, Philip M. Long

We prove a lower bound on the excess risk of sparse interpolating procedures for linear regression with Gaussian data in the overparameterized regime.

regression

The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks

no code implementations • 25 Aug 2021 • Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett

The recent success of neural network models has shed light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data.

Properties of the After Kernel

no code implementations • 21 May 2021 • Philip M. Long

The Neural Tangent Kernel (NTK) is the wide-network limit of a kernel defined using neural networks at initialization, whose embedding is the gradient of the output of the network with respect to its parameters.

Binary Classification, Data Augmentation
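For the "Properties of the After Kernel" entry above, the quoted definition translates directly into a finite-width computation: the tangent kernel at a pair of inputs is the inner product of the parameter gradients of the network output. Below is a hedged sketch for a toy one-hidden-layer ReLU network with hand-written gradients; the architecture and width are illustrative and not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 5000                       # input dimension and hidden width (illustrative)
W = rng.normal(size=(m, d))           # hidden-layer weights at initialization
a = rng.normal(size=m)                # output-layer weights at initialization

relu = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z > 0).astype(float)

def tangent_kernel(x1, x2):
    """<grad_theta f(x1), grad_theta f(x2)> for f(x) = a . relu(W x) / sqrt(m)."""
    z1, z2 = W @ x1, W @ x2
    output_part = relu(z1) @ relu(z2)                               # gradients w.r.t. a
    hidden_part = (a**2 * drelu(z1) * drelu(z2)).sum() * (x1 @ x2)  # gradients w.r.t. W
    return (output_part + hidden_part) / m

x1, x2 = rng.normal(size=d), rng.normal(size=d)
print(tangent_kernel(x1, x2), tangent_kernel(x1, x1))
```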

When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?

no code implementations • 9 Feb 2021 • Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett

We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence.

When does gradient descent with logistic loss find interpolating two-layer networks?

no code implementations • 4 Dec 2020 • Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett

We study the training of finite-width two-layer smoothed ReLU networks for binary classification using the logistic loss.

Binary Classification
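As a concrete (and heavily simplified) illustration of the setting in the two entries above, the sketch below runs GD with the logistic loss on a two-layer network whose activation is softplus, one possible smooth stand-in for ReLU; the specific smoothing, scaling, and data used in the papers differ, so treat this only as a toy setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 200                        # samples, input dimension, hidden width (illustrative)
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] > 0, 1.0, -1.0)        # a toy binary labeling

W = rng.normal(size=(m, d)) / np.sqrt(d)    # trained hidden-layer weights
a = rng.choice([-1.0, 1.0], size=m)         # output weights held fixed, as in many such analyses

softplus = lambda z: np.log1p(np.exp(z))    # smooth surrogate for ReLU; its derivative is the sigmoid
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def f(X):
    return softplus(X @ W.T) @ a / np.sqrt(m)

eta = 0.5
for _ in range(2000):
    margins = y * f(X)
    coeff = -y * sigmoid(-margins)          # derivative of log(1 + exp(-y f)) w.r.t. the output f
    grad_W = ((coeff[:, None] * sigmoid(X @ W.T)) * a).T @ X / (np.sqrt(m) * n)
    W -= eta * grad_W

print("training loss:", np.log1p(np.exp(-y * f(X))).mean(), "min margin:", (y * f(X)).min())
```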

Failures of model-dependent generalization bounds for least-norm interpolation

no code implementations • 16 Oct 2020 • Peter L. Bartlett, Philip M. Long

We consider bounds on the generalization performance of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data.

Generalization Bounds, Learning Theory +1
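The "least-norm linear regressor" in the entry above is the minimum-$\ell_2$-norm interpolant, which is easy to compute with a pseudoinverse when there are more features than samples; the dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                         # over-parameterized: more features than samples
X = rng.normal(size=(n, d))
y = rng.normal(size=n)                 # arbitrary (noisy) labels

theta = np.linalg.pinv(X) @ y          # minimum-norm solution of X theta = y
print("max residual:", np.max(np.abs(X @ theta - y)))   # ~0: it interpolates the data
print("norm:", np.linalg.norm(theta))
```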

Finite-sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime

no code implementations • 25 Apr 2020 • Niladri S. Chatterji, Philip M. Long

We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification.

On the Global Convergence of Training Deep Linear ResNets

no code implementations • ICLR 2020 • Difan Zou, Philip M. Long, Quanquan Gu

We further propose modified identity input and output transformations, and show that a $(d+k)$-wide neural network is sufficient to guarantee the global convergence of GD/SGD, where $d$ and $k$ are the input and output dimensions, respectively.

Oracle Lower Bounds for Stochastic Gradient Sampling Algorithms

no code implementations • 1 Feb 2020 • Niladri S. Chatterji, Peter L. Bartlett, Philip M. Long

We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove an information theoretic lower bound on the number of stochastic gradient queries of the log density needed.
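The lower bound concerns samplers that may only query noisy gradients of the log density. One familiar member of that class (not a construction from the paper) is stochastic gradient Langevin dynamics; the sketch below targets a standard Gaussian, with the noise level and step size chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
grad_log_p = lambda x: -x              # grad log density of N(0, I), a strongly log-concave target

def noisy_grad(x, sigma=0.5):
    """Stochastic gradient oracle: an unbiased but noisy query of grad log p(x)."""
    return grad_log_p(x) + sigma * rng.normal(size=x.shape)

def sgld(steps=5000, eta=0.01):
    x, samples = np.zeros(d), []
    for _ in range(steps):
        x = x + 0.5 * eta * noisy_grad(x) + np.sqrt(eta) * rng.normal(size=d)
        samples.append(x.copy())
    return np.array(samples)

chain = sgld()
print(chain[1000:].mean(axis=0))   # roughly 0 per coordinate
print(chain[1000:].var(axis=0))    # roughly 1 per coordinate (up to discretization error)
```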

Benign Overfitting in Linear Regression

no code implementations • 26 Jun 2019 • Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler

Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction.

regression

On the effect of the activation function on the distribution of hidden nodes in a deep network

no code implementations • 7 Jan 2019 • Philip M. Long, Hanie Sedghi

We analyze the joint probability distribution on the lengths of the vectors of hidden variables in different layers of a fully connected deep network, when the weights and biases are chosen randomly according to Gaussian distributions, and the input is in $\{ -1, 1\}^N$.
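A quick empirical companion to the entry above: propagate a hypercube input through a few random layers and record the length of each layer's hidden vector. The ReLU activation, the $1/\sqrt{N}$ weight scaling, and the bias scale below are my choices; the paper treats the joint distribution of these lengths analytically for various activations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, depth = 1000, 10
h = rng.choice([-1.0, 1.0], size=N)            # input on the hypercube {-1, 1}^N

lengths = [np.linalg.norm(h)]
for _ in range(depth):
    W = rng.normal(size=(N, N)) / np.sqrt(N)   # Gaussian weights (scaling is my choice)
    b = 0.1 * rng.normal(size=N)               # Gaussian biases (scale is my choice)
    h = np.maximum(W @ h + b, 0.0)             # ReLU layer
    lengths.append(np.linalg.norm(h))

print(np.round(lengths, 2))                    # lengths of the hidden vectors, layer by layer
```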

Density estimation for shift-invariant multidimensional distributions

no code implementations • 9 Nov 2018 • Anindya De, Philip M. Long, Rocco A. Servedio

This implies that, for constant $d$, multivariate log-concave distributions can be learned in $\tilde{O}_d(1/\epsilon^{2d+2})$ time using $\tilde{O}_d(1/\epsilon^{d+2})$ samples, answering a question of [Diakonikolas, Kane and Stewart, 2016]. All of our results extend to a model of noise-tolerant density estimation using Huber's contamination model, in which the target distribution to be learned is a $(1-\epsilon,\epsilon)$ mixture of some unknown distribution in the class with some other arbitrary and unknown distribution, and the learning algorithm must output a hypothesis distribution with total variation distance error $O(\epsilon)$ from the target distribution.

Density Estimation

Learning Sums of Independent Random Variables with Sparse Collective Support

no code implementations • 18 Jul 2018 • Anindya De, Philip M. Long, Rocco A. Servedio

For the case $| \mathcal{A} | = 3$, we give an algorithm for learning $\mathcal{A}$-sums to accuracy $\epsilon$ that uses $\mathsf{poly}(1/\epsilon)$ samples and runs in time $\mathsf{poly}(1/\epsilon)$, independent of $N$ and of the elements of $\mathcal{A}$.

The Singular Values of Convolutional Layers

1 code implementation • ICLR 2019 • Hanie Sedghi, Vineet Gupta, Philip M. Long

We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation.
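The computation the paper describes (this is the entry with released code) reduces, under circular padding, to taking a 2-D FFT of the kernel and an SVD of the resulting per-frequency channel matrices. A short sketch, assuming the kernel is stored as (height, width, in-channels, out-channels):

```python
import numpy as np

def conv_singular_values(kernel, n):
    """Singular values of a 2-D multi-channel circular convolution on n x n inputs.

    kernel: array of shape (k, k, c_in, c_out). Returns an (n, n, min(c_in, c_out))
    array whose entries, taken together, are the layer's singular values."""
    transforms = np.fft.fft2(kernel, s=(n, n), axes=(0, 1))   # one c_in x c_out block per frequency
    return np.linalg.svd(transforms, compute_uv=False)

kernel = np.random.default_rng(0).normal(size=(3, 3, 4, 8))
svals = conv_singular_values(kernel, n=32)
print(svals.shape, svals.max())    # svals.max() is the layer's operator norm
```

Compared with forming the full $n^2 c_{\mathrm{in}} \times n^2 c_{\mathrm{out}}$ matrix explicitly, this costs only an FFT of the kernel plus $n^2$ small SVDs.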

Representing smooth functions as compositions of near-identity functions with implications for deep network optimization

no code implementations • 13 Apr 2018 • Peter L. Bartlett, Steven N. Evans, Philip M. Long

This implies that $h$ can be represented to any accuracy by a deep residual network whose nonlinear layers compute functions with a small Lipschitz constant.

Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks

no code implementations • ICML 2018 • Peter L. Bartlett, David P. Helmbold, Philip M. Long

We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix $\Phi$, in the case where the initial hypothesis $\Theta_1 = ... = \Theta_L = I$ has excess loss bounded by a small enough constant.

Surprising properties of dropout in deep networks

no code implementations • 14 Feb 2016 • David P. Helmbold, Philip M. Long

We analyze dropout in deep networks with rectified linear units and the quadratic loss.
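For context on the object analyzed in this and the next entry, here is a generic inverted-dropout forward pass for one ReLU layer; the keep probability and the exact scaling convention used in these dropout papers may differ from this common formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_relu_layer(x, W, keep_prob=0.5, train=True):
    """One ReLU layer with inverted dropout applied to its input.

    Each coordinate of x is kept independently with probability keep_prob and
    rescaled by 1/keep_prob, so the layer's input is unbiased in expectation."""
    if train:
        mask = rng.random(x.shape) < keep_prob
        x = x * mask / keep_prob
    return np.maximum(W @ x, 0.0)

x = rng.normal(size=8)
W = rng.normal(size=(4, 8))
print(dropout_relu_layer(x, W))                 # stochastic training-time output
print(dropout_relu_layer(x, W, train=False))    # deterministic test-time output
```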

On the Inductive Bias of Dropout

no code implementations • 15 Dec 2014 • David P. Helmbold, Philip M. Long

Dropout is a simple but effective technique for learning in neural networks and other settings.

Inductive Bias

The Power of Localization for Efficiently Learning Linear Separators with Noise

no code implementations • 31 Jul 2013 • Pranjal Awasthi, Maria Florina Balcan, Philip M. Long

For malicious noise, where the adversary can corrupt both the label and the features, we provide a polynomial-time algorithm for learning linear separators in $\mathbb{R}^d$ under isotropic log-concave distributions that can tolerate a nearly information-theoretically optimal noise rate of $\eta = \Omega(\epsilon)$.

Active Learning

Active and passive learning of linear separators under log-concave distributions

no code implementations • 6 Nov 2012 • Maria Florina Balcan, Philip M. Long

We provide new results concerning label efficient, polynomial time, passive and active learning of linear separators.

Active Learning, Open-Ended Question Answering
