no code implementations • 13 Feb 2024 • Suzanna Parkinson, Greg Ongie, Rebecca Willett, Ohad Shamir, Nathan Srebro
We also show that a similar statement in the reverse direction is not possible: any function learnable with polynomial sample complexity by a norm-controlled depth-2 ReLU network with infinite width is also learnable with polynomial sample complexity by a norm-controlled depth-3 ReLU network.
no code implementations • 26 Dec 2023 • Daniel Barzilai, Ohad Shamir
It is by now well-established that modern over-parameterized models seem to elude the bias-variance tradeoff and generalize well despite overfitting noise.
no code implementations • 10 Jul 2023 • Guy Kornowski, Ohad Shamir
Recent works proposed several stochastic zero-order algorithms that solve this task, all of which suffer from a dimension-dependence of $\Omega(d^{3/2})$ where $d$ is the dimension of the problem, which was conjectured to be optimal.
no code implementations • NeurIPS 2023 • Guy Kornowski, Gilad Yehudai, Ohad Shamir
Thus, we show that the input dimension plays a crucial role in the type of overfitting in this setting, which we also validate empirically for intermediate dimensions.
no code implementations • 16 Feb 2023 • Michael I. Jordan, Guy Kornowski, Tianyi Lin, Ohad Shamir, Manolis Zampetakis
In particular, we prove a lower bound of $\Omega(d)$ for any deterministic algorithm.
no code implementations • 21 Sep 2022 • Guy Kornowski, Ohad Shamir
We study the oracle complexity of producing $(\delta,\epsilon)$-stationary points of Lipschitz functions, in the sense proposed by Zhang et al. [2020].
1 code implementation • 15 Jun 2022 • Niv Haim, Gal Vardi, Gilad Yehudai, Ohad Shamir, Michal Irani
We propose a novel reconstruction scheme that stems from recent theoretical results about the implicit bias in training neural networks with gradient-based methods.
no code implementations • 13 Feb 2022 • Gal Vardi, Ohad Shamir, Nathan Srebro
We study norm-based uniform convergence bounds for neural networks, aiming at a tight understanding of how these are affected by the architecture and type of norm constraint, for the simple class of scalar-valued one-hidden-layer networks, and inputs bounded in Euclidean norm.
no code implementations • 9 Feb 2022 • Gal Vardi, Gilad Yehudai, Ohad Shamir
Despite a great deal of research, it is still unclear why neural networks are so susceptible to adversarial examples.
no code implementations • 8 Feb 2022 • Gal Vardi, Gilad Yehudai, Ohad Shamir
We solve an open question from Lu et al. (2017), by showing that any target network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$ network (independent of the target network's architecture), whose number of parameters is essentially larger only by a linear factor.
no code implementations • 30 Jan 2022 • Nadav Timor, Gal Vardi, Ohad Shamir
We study the conjectured relationship between the implicit regularization in neural networks, trained with gradient-based methods, and rank minimization of their weight matrices.
no code implementations • 27 Jan 2022 • Ohad Shamir
In this paper, we provide several new results on when one can or cannot expect benign overfitting to occur, for both regression and classification tasks.
no code implementations • 8 Dec 2021 • Liran Szlak, Ohad Shamir
A commonly used heuristic in RL is experience replay (e.g., Lin, 1993; Mnih et al., 2015), in which a learner stores and re-uses past trajectories as if they were sampled online.
no code implementations • 8 Dec 2021 • Liran Szlak, Ohad Shamir
Experience replay (Lin, 1993; Mnih et al., 2015) is a widely used technique to achieve efficient use of data and improved performance in RL algorithms.
no code implementations • ICLR 2022 • Gal Vardi, Gilad Yehudai, Ohad Shamir
We prove that having such a large bit complexity is both necessary and sufficient for memorization with a sub-linear number of parameters.
no code implementations • NeurIPS 2021 • Brian Bullins, Kumar Kshitij Patel, Ohad Shamir, Nathan Srebro, Blake Woodworth
We propose and analyze a stochastic Newton algorithm for homogeneous distributed stochastic convex optimization, where each machine can calculate stochastic gradients of the same population objective, as well as stochastic Hessian-vector products (products of an independent unbiased estimator of the Hessian of the population objective with arbitrary vectors), with many such stochastic computations performed between rounds of communication.
no code implementations • 6 Oct 2021 • Gal Vardi, Ohad Shamir, Nathan Srebro
The implicit bias of neural networks has been extensively studied in recent years.
1 code implementation • NeurIPS 2021 • Itay Safran, Ohad Shamir
Perhaps surprisingly, we prove that when the condition number is taken into account, without-replacement SGD *does not* significantly improve on with-replacement SGD in terms of worst-case bounds, unless the number of epochs (passes over the data) is larger than the condition number.
no code implementations • NeurIPS 2021 • Gal Vardi, Gilad Yehudai, Ohad Shamir
We theoretically study the fundamental problem of learning a single neuron with a bias term ($\mathbf{x} \mapsto \sigma(\langle \mathbf{w},\mathbf{x} \rangle + b)$) in the realizable setting with the ReLU activation, using gradient descent.
no code implementations • NeurIPS 2021 • Guy Kornowski, Ohad Shamir
For this approach, we prove under a mild assumption an inherent trade-off between oracle complexity and smoothness: On the one hand, smoothing a nonsmooth nonconvex function can be done very efficiently (e.g., by randomized smoothing), but with dimension-dependent factors in the smoothness parameter, which can strongly affect iteration complexity when plugging into standard smooth optimization methods.
no code implementations • 2 Feb 2021 • Blake Woodworth, Brian Bullins, Ohad Shamir, Nathan Srebro
We resolve the min-max complexity of distributed stochastic convex optimization (up to a log factor) in the intermittent communication setting, where $M$ machines work in parallel over the course of $R$ rounds of communication to optimize the objective, and during each round of communication, each machine may sequentially compute $K$ stochastic gradient estimates.
no code implementations • 31 Jan 2021 • Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir
On the other hand, the fact that deep networks can efficiently express a target function does not mean that this target function can be learned efficiently by deep neural networks.
no code implementations • 30 Jan 2021 • Gal Vardi, Daniel Reichman, Toniann Pitassi, Ohad Shamir
We show a complexity-theoretic barrier to proving such results beyond size $O(d\log^2(d))$, but also show an explicit benign function, that can be approximated with networks of size $O(d)$ and not with networks of size $o(d/\log d)$.
1 code implementation • 9 Dec 2020 • Gal Vardi, Ohad Shamir
For one hidden-layer networks, we prove a similar result, where in general it is impossible to characterize implicit regularization properties in this manner, except for the "balancedness" property identified in Du et al. [2018].
no code implementations • 13 Oct 2020 • Guy Kornowski, Ohad Shamir
In this note, we consider the complexity of optimizing a highly smooth (Lipschitz $k$-th order derivative) and strongly convex function, via calls to a $k$-th order oracle which returns the value and first $k$ derivatives of the function at a given point, and where the dimension is unrestricted.
no code implementations • 30 Jun 2020 • Ohad Shamir
A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor.
1 code implementation • 1 Jun 2020 • Itay Safran, Gilad Yehudai, Ohad Shamir
We prove that while the objective is strongly convex around the global minima when the teacher and student networks possess the same number of neurons, it is not even *locally convex* after any amount of over-parameterization.
no code implementations • NeurIPS 2020 • Gal Vardi, Ohad Shamir
To show this, we study a seemingly unrelated problem of independent interest: Namely, whether there are polynomially-bounded functions which require super-polynomial weights in order to approximate with constant-depth neural networks.
no code implementations • 27 Feb 2020 • Ohad Shamir
It is well-known that given a bounded, smooth nonconvex function, standard gradient-based methods can find $\epsilon$-stationary points (where the gradient norm is less than $\epsilon$) in $\mathcal{O}(1/\epsilon^2)$ iterations.
no code implementations • ICML 2020 • Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro
We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method.
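As a rough illustration of the method named above, the following is a minimal sketch of local SGD on a toy one-dimensional objective. The objective, the step size, and all function and parameter names are assumptions chosen for illustration; they are not taken from the paper.

```python
import random


def local_sgd(num_machines, local_steps, rounds, step_size, mean, rng):
    """Local SGD on the toy objective F(x) = E[0.5 * (x - z)**2], z ~ N(mean, 1)."""
    x_shared = 0.0
    for _ in range(rounds):
        local_iterates = []
        for _ in range(num_machines):
            x = x_shared  # every machine starts the round from the shared average
            for _ in range(local_steps):
                z = rng.gauss(mean, 1.0)
                x -= step_size * (x - z)  # stochastic gradient step on 0.5 * (x - z)**2
            local_iterates.append(x)
        # Communication round: average the local iterates across machines.
        x_shared = sum(local_iterates) / num_machines
    return x_shared


x_final = local_sgd(num_machines=4, local_steps=10, rounds=50,
                    step_size=0.05, mean=3.0, rng=random.Random(0))
```

Machines only communicate once per round, so the number of local steps trades communication against how far the local iterates drift apart before averaging.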
no code implementations • ICML 2020 • Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir
The lottery ticket hypothesis (Frankle and Carbin, 2018) states that a randomly-initialized network contains a small subnetwork that, when trained in isolation, can compete with the performance of the original network.
no code implementations • 15 Jan 2020 • Gilad Yehudai, Ohad Shamir
We consider the fundamental problem of learning a single neuron $x \mapsto\sigma(w^\top x)$ using standard gradient methods.
no code implementations • ICML 2020 • Yoel Drori, Ohad Shamir
We study the iteration complexity of stochastic gradient descent (SGD) for minimizing the gradient norm of smooth, possibly nonconvex functions.
no code implementations • 31 Jul 2019 • Itay Safran, Ohad Shamir
In contrast to the majority of existing theoretical works, which assume that individual functions are sampled with replacement, we focus here on popular but poorly-understood heuristics, which involve going over random permutations of the individual functions.
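The two sampling schemes contrasted above can be sketched as follows. This is only a toy illustration on a one-dimensional finite sum; the objective, step size, and helper names are assumptions, not the paper's setting.

```python
import random


def with_replacement_order(n, epochs, rng):
    """Draw n * epochs indices independently and uniformly from range(n)."""
    return [rng.randrange(n) for _ in range(n * epochs)]


def random_reshuffling_order(n, epochs, rng):
    """Go over a fresh random permutation of range(n) in every epoch."""
    order = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        order.extend(perm)
    return order


def sgd(targets, order, step_size, x0=0.0):
    """Run SGD on F(x) = (1/n) * sum_i 0.5 * (x - targets[i])**2 in the given order."""
    x = x0
    for i in order:
        x -= step_size * (x - targets[i])  # gradient of the i-th individual function
    return x


rng = random.Random(0)
targets = [1.0, 2.0, 3.0, 4.0]
order_wr = with_replacement_order(len(targets), epochs=50, rng=rng)
order_rr = random_reshuffling_order(len(targets), epochs=50, rng=rng)
x_wr = sgd(targets, order_wr, step_size=0.05)
x_rr = sgd(targets, order_rr, step_size=0.05)
```

The only difference between the two runs is the index order: with random permutations every individual function is visited exactly once per epoch.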
no code implementations • 15 Apr 2019 • Itay Safran, Ronen Eldan, Ohad Shamir
Existing depth separation results for constant-depth networks essentially show that certain radial functions in $\mathbb{R}^d$, which can be easily approximated with depth $3$ networks, cannot be approximated by depth $2$ networks, even up to constant accuracy, unless their size is exponential in $d$.
no code implementations • NeurIPS 2019 • Gilad Yehudai, Ohad Shamir
Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error).
no code implementations • 13 Feb 2019 • Dylan J. Foster, Ayush Sekhari, Ohad Shamir, Nathan Srebro, Karthik Sridharan, Blake Woodworth
Notably, we show that in the global oracle/statistical learning model, only logarithmic dependence on smoothness is required to find a near-stationary point, whereas polynomial dependence on smoothness is necessary in the local stochastic oracle model.
no code implementations • 9 Feb 2019 • Yuval Dagan, Gil Kur, Ohad Shamir
We show that fundamental learning tasks, such as finding an approximate linear separator or linear regression, require memory at least *quadratic* in the dimension, in a natural streaming setting.
no code implementations • NeurIPS 2018 • Murat A. Erdogdu, Lester Mackey, Ohad Shamir
An Euler discretization of the Langevin diffusion is known to converge to the global minimizers of certain convex and non-convex optimization problems.
no code implementations • 23 Sep 2018 • Ohad Shamir
We study the dynamics of gradient descent on objective functions of the form $f(\prod_{i=1}^{k} w_i)$ (with respect to scalar parameters $w_1,\ldots, w_k$), which arise in the context of training depth-$k$ linear neural networks.
no code implementations • 26 Jun 2018 • Yossi Arjevani, Ohad Shamir, Nathan Srebro
We provide tight finite-time convergence bounds for gradient descent and stochastic gradient descent on quadratic functions, when the gradients are delayed and reflect iterates from $\tau$ rounds ago.
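To make the delayed update concrete, here is a minimal sketch of gradient descent with stale gradients on a one-dimensional quadratic. The specific quadratic, the handling of iterates before time 0, and the parameter values are assumptions for illustration and do not reproduce the paper's bounds.

```python
def delayed_gradient_descent(curvature, step_size, delay, steps, x0=1.0):
    """Run x_{t+1} = x_t - step_size * f'(x_{t-delay}) for f(x) = 0.5 * curvature * x**2."""
    # Assumption: iterates before time 0 are taken to equal x0.
    history = [x0] * (delay + 1)
    for _ in range(steps):
        stale_iterate = history[-(delay + 1)]  # the iterate from `delay` rounds ago
        gradient = curvature * stale_iterate
        history.append(history[-1] - step_size * gradient)
    return history[-1]


x_small_step = delayed_gradient_descent(curvature=1.0, step_size=0.1, delay=3, steps=300)
x_large_step = delayed_gradient_descent(curvature=1.0, step_size=1.0, delay=3, steps=300)
```

With the same delay, the small step size drives the iterate toward the minimizer at 0, while the large one (which would be harmless without delay) makes the iterates blow up.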
no code implementations • NeurIPS 2018 • Ohad Shamir
In this paper, we rigorously prove that arbitrarily deep, nonlinear residual units indeed exhibit this behavior, in the sense that the optimization landscape contains no local minima with value above what can be obtained with a linear predictor (namely a 1-layer network).
no code implementations • 4 Mar 2018 • Yuval Dagan, Ohad Shamir
We study the problem of identifying correlations in multivariate data, under information constraints: Either on the amount of memory that can be used by the algorithm, or the amount of communication when the data is distributed across several machines.
1 code implementation • ICML 2018 • Itay Safran, Ohad Shamir
We consider the optimization problem associated with training simple ReLU neural networks of the form $\mathbf{x}\mapsto \sum_{i=1}^{k}\max\{0,\mathbf{w}_i^\top \mathbf{x}\}$ with respect to the squared loss.
no code implementations • 18 Dec 2017 • Noah Golowich, Alexander Rakhlin, Ohad Shamir
We study the sample complexity of learning neural networks, by providing new bounds on their Rademacher complexity assuming norm constraints on the parameter matrix of each layer.
no code implementations • 2 Jun 2017 • Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah
Exploiting the great expressive power of Deep Neural Network architectures relies on the ability to train them.
no code implementations • 15 May 2017 • Nicolò Cesa-Bianchi, Ohad Shamir
We study how the regret guarantees of nonstochastic multi-armed bandits can be improved, if the effective range of the losses in each round is small (e.g., the maximal difference between two losses in a given round).
1 code implementation • ICML 2017 • Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah
In recent years, Deep Learning has become the go-to solution for a broad range of applications, often outperforming state-of-the-art.
no code implementations • ICML 2017 • Ohad Shamir, Liran Szlak
In this paper, we consider the applicability of this setting to convex online learning with delayed feedback, in which the feedback on the prediction made in round $t$ arrives with some delay $\tau$.
no code implementations • ICML 2017 • Dan Garber, Ohad Shamir, Nathan Srebro
We study algorithms for estimating the leading principal component of the population covariance matrix that are both communication-efficient and achieve estimation error of the order of the centralized ERM solution that uses all $mn$ samples.
no code implementations • NeurIPS 2016 • Ohad Shamir
Stochastic gradient methods for machine learning and optimization problems are usually analyzed assuming data points are sampled *with* replacement.
no code implementations • ICML 2017 • Yossi Arjevani, Ohad Shamir
Finite-sum optimization problems are ubiquitous in machine learning, and are commonly solved using first-order methods which rely on gradient computations.
no code implementations • ICML 2017 • Itay Safran, Ohad Shamir
We provide several new depth-based separation results for feed-forward neural networks, proving that various types of simple and natural functions can be better approximated using deeper networks than shallower ones, even if the shallower networks are much larger.
no code implementations • 5 Sep 2016 • Ohad Shamir
Although neural networks are routinely and successfully trained in practice using simple gradient-based methods, most existing theoretical results are negative, showing that learning such networks is difficult, in a worst-case sense over all data distributions.
no code implementations • NeurIPS 2016 • Yossi Arjevani, Ohad Shamir
Many canonical machine learning problems boil down to a convex optimization problem with a finite sum structure.
no code implementations • 11 May 2016 • Yossi Arjevani, Ohad Shamir
We consider a broad class of first-order optimization algorithms which are *oblivious*, in the sense that their step sizes are scheduled regardless of the function under consideration, except for limited side-information such as smoothness or strong convexity parameters.
no code implementations • 12 Dec 2015 • Ronen Eldan, Ohad Shamir
We show that there is a simple (approximately radial) function on $\mathbb{R}^d$, expressible by a small 3-layer feedforward neural network, which cannot be approximated by any 2-layer network, to more than a certain constant accuracy, unless its width is exponential in the dimension.
no code implementations • 9 Dec 2015 • Jonathan Rosenski, Ohad Shamir, Liran Szlak
We consider a variant of the stochastic multi-armed bandit problem, where multiple players simultaneously choose from the same set of arms and may collide, receiving no reward.
no code implementations • 13 Nov 2015 • Itay Safran, Ohad Shamir
Deep learning, in the form of artificial neural networks, has achieved remarkable practical success in recent years, for a variety of difficult machine learning applications.
no code implementations • 30 Sep 2015 • Ohad Shamir
We consider the problem of principal component analysis (PCA) in a streaming stochastic setting, where our goal is to find a direction of approximate maximal variance, based on a stream of i.i.d.
no code implementations • 31 Jul 2015 • Ohad Shamir
We consider the closely related problems of bandit convex optimization with two-point feedback, and zero-order stochastic convex optimization with two function evaluations per round.
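For intuition on two-point feedback, the sketch below builds a standard gradient estimate from two function evaluations along a random direction. It is a generic textbook-style estimator written under assumed names and parameters, not necessarily the exact construction analyzed in the paper.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample uniformly from the unit sphere by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def two_point_gradient_estimate(f, x, delta, rng):
    """Estimate the gradient of f at x from f(x + delta*u) and f(x - delta*u)."""
    dim = len(x)
    u = random_unit_vector(dim, rng)
    x_plus = [xi + delta * ui for xi, ui in zip(x, u)]
    x_minus = [xi - delta * ui for xi, ui in zip(x, u)]
    scale = dim * (f(x_plus) - f(x_minus)) / (2.0 * delta)
    return [scale * ui for ui in u]


def f(x):
    # Hypothetical test function: linear, with gradient (1, 2, 3) everywhere.
    return 1.0 * x[0] + 2.0 * x[1] + 3.0 * x[2]


rng = random.Random(0)
num_samples = 20000
average = [0.0, 0.0, 0.0]
for _ in range(num_samples):
    g = two_point_gradient_estimate(f, [0.5, -0.5, 1.0], delta=0.1, rng=rng)
    average = [a + gi / num_samples for a, gi in zip(average, g)]
```

For a linear function the estimate is unbiased, so averaging many estimates should approach the true gradient; each individual estimate still uses only two function values.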
no code implementations • 31 Jul 2015 • Ohad Shamir
We study the convergence properties of the VR-PCA algorithm introduced by Shamir (2015) for fast computation of leading singular vectors.
no code implementations • NeurIPS 2015 • Yossi Arjevani, Ohad Shamir
We study the fundamental limits to communication-efficient distributed methods for convex learning and optimization, under different assumptions on the information available to individual machines, and the types of functions considered.
no code implementations • 23 Mar 2015 • Yossi Arjevani, Shai Shalev-Shwartz, Ohad Shamir
This, in turn, reveals a powerful connection between a class of optimization algorithms and the analytic theory of polynomials whereby new lower and upper bounds are derived.
no code implementations • 5 Nov 2014 • Nicolò Cesa-Bianchi, Yishay Mansour, Ohad Shamir
In this paper, we study lower bounds on the error attainable by such methods as a function of the number of entries observed in the kernel matrix or the rank of an approximate kernel matrix.
no code implementations • 23 Oct 2014 • Doron Kukliansky, Ohad Shamir
In this paper we analyze a budgeted learning setting, in which the learner can only choose and observe a small subset of the attributes of each training example.
1 code implementation • NeurIPS 2014 • Roi Livni, Shai Shalev-Shwartz, Ohad Shamir
It is well-known that neural networks are computationally hard to train.
no code implementations • 30 Sep 2014 • Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, Ohad Shamir
This naturally models several situations where the losses of different actions are related, and knowing the loss of one action provides information on the loss of other actions.
no code implementations • 9 Sep 2014 • Ohad Shamir
We describe and analyze a simple algorithm for principal component analysis and singular value decomposition, VR-PCA, which uses computationally cheap stochastic iterations, yet converges exponentially fast to the optimal solution.
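The idea of combining cheap stochastic iterations with an occasional full pass can be sketched as below. This is a hedged illustration in the spirit of a variance-reduced PCA iteration: the exact update rule, step size, epoch length, and all names are assumptions, not a verified transcription of VR-PCA.

```python
import math
import random


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def vr_pca_sketch(data, step_size, epochs, rng):
    """Variance-reduced stochastic iterations for the top eigenvector of (1/n) * sum_i x_i x_i^T."""
    n, dim = len(data), len(data[0])
    w_snapshot = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    for _ in range(epochs):
        # Full pass: u = (1/n) * sum_i x_i * <x_i, w_snapshot>.
        u = [0.0] * dim
        for x in data:
            coeff = dot(x, w_snapshot) / n
            u = [uj + coeff * xj for uj, xj in zip(u, x)]
        w = list(w_snapshot)
        for _ in range(n):  # cheap stochastic iterations, one data point each
            x = data[rng.randrange(n)]
            correction = dot(x, w) - dot(x, w_snapshot)
            w = normalize([wj + step_size * (correction * xj + uj)
                           for wj, xj, uj in zip(w, x, u)])
        w_snapshot = w
    return w_snapshot


rng = random.Random(0)
# Hypothetical data: variance about 1 along the first axis, about 0.01 along the second.
data = [[rng.gauss(0.0, 1.0), 0.1 * rng.gauss(0.0, 1.0)] for _ in range(200)]
w = vr_pca_sketch(data, step_size=0.05, epochs=20, rng=rng)
```

The correction term vanishes whenever the current iterate equals the snapshot, so the stochastic noise shrinks as the iterates settle, which is the usual motivation for variance reduction.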
no code implementations • 11 Aug 2014 • Ohad Shamir
We study the attainable regret for online linear optimization problems with bandit feedback, where unlike the full-information setting, the player can only observe its own loss rather than the full loss vector.
no code implementations • 19 Jun 2014 • Ohad Shamir
In this short note, we provide a sample complexity lower bound for learning linear predictors with respect to the squared loss.
no code implementations • 10 Jun 2014 • Ethan Fetaya, Ohad Shamir, Shimon Ullman
We consider the problem of learning from a similarity matrix (such as spectral clustering and low-dimensional embedding), when computing pairwise similarities is costly, and only a limited number of entries can be observed.
1 code implementation • 30 Dec 2013 • Ohad Shamir, Nathan Srebro, Tong Zhang
We present a novel Newton-type method for distributed optimization, which is particularly well suited for stochastic optimization and learning problems.
no code implementations • NeurIPS 2013 • Nicolò Cesa-Bianchi, Ofer Dekel, Ohad Shamir
In particular, we show that with switching costs, the attainable rate with bandit feedback is $T^{2/3}$.
no code implementations • NeurIPS 2014 • Ohad Shamir
Many machine learning approaches are characterized by information constraints on how they interact with the training data.
no code implementations • CVPR 2013 • Baoyuan Liu, Fereshteh Sadeghi, Marshall Tappen, Ohad Shamir, Ce Liu
Large-scale recognition problems with thousands of classes pose a particular challenge because applying the classifier requires more computation as the number of classes grows.
no code implementations • 26 Apr 2013 • Roi Livni, Shai Shalev-Shwartz, Ohad Shamir
The main goal of this paper is the derivation of an efficient layer-by-layer algorithm for training such networks, which we denote as the *Basis Learner*.
no code implementations • NeurIPS 2013 • Nicolò Cesa-Bianchi, Ofer Dekel, Ohad Shamir
In particular, we show that with switching costs, the attainable rate with bandit feedback is $\widetilde{\Theta}(T^{2/3})$.
no code implementations • NeurIPS 2012 • Sasha Rakhlin, Ohad Shamir, Karthik Sridharan
We show a principled way of deriving online learning algorithms from a minimax analysis.
no code implementations • 11 Sep 2012 • Ohad Shamir
The problem of stochastic convex optimization with bandit feedback (in the learning community) or without knowledge of gradients (in the optimization community) has received much attention in recent years, in the form of algorithms and performance upper bounds.
no code implementations • NeurIPS 2011 • Rina Foygel, Ohad Shamir, Nati Srebro, Ruslan R. Salakhutdinov
We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions.
no code implementations • NeurIPS 2011 • Shie Mannor, Ohad Shamir
We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game.
no code implementations • NeurIPS 2011 • Nicolò Cesa-Bianchi, Ohad Shamir
Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader.
no code implementations • NeurIPS 2011 • Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan
Mini-batch algorithms have recently received significant attention as a way to speed up stochastic convex optimization problems.
no code implementations • NeurIPS 2011 • Sham M. Kakade, Varun Kanade, Ohad Shamir, Adam Kalai
In this paper, we provide algorithms for learning GLMs and SIMs, which are both computationally and statistically efficient.
no code implementations • 13 Jun 2011 • Nicolò Cesa-Bianchi, Ohad Shamir
Most traditional online learning algorithms are based on variants of mirror descent or follow-the-leader.
no code implementations • 31 Oct 2009 • Sham M. Kakade, Ohad Shamir, Karthik Sridharan, Ambuj Tewari
The versatility of exponential families, along with their attendant convexity properties, makes them a popular and effective statistical model.
no code implementations • NeurIPS 2008 • Ohad Shamir, Naftali Tishby
In this paper, we provide a set of general sufficient conditions, which ensure the reliability of clustering stability estimators in the large sample regime.