no code implementations • 30 Jan 2023 • Gal Dalal, Assaf Hallak, Gugan Thoppe, Shie Mannor, Gal Chechik
We prove that the resulting variance decays exponentially with the planning horizon, at a rate determined by the expansion policy.
no code implementations • 28 Sep 2022 • Gal Dalal, Assaf Hallak, Shie Mannor, Gal Chechik
This allows us to reduce the gradient variance by three orders of magnitude and to achieve better sample complexity than standard policy gradient.
1 code implementation • 30 May 2022 • Guy Tennenholtz, Nadav Merlis, Lior Shani, Shie Mannor, Uri Shalit, Gal Chechik, Assaf Hallak, Gal Dalal
We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds.
no code implementations • 28 Jan 2022 • Aviv Rosenberg, Assaf Hallak, Shie Mannor, Gal Chechik, Gal Dalal
Some of the most powerful reinforcement learning frameworks use planning for action selection.
no code implementations • ICLR 2022 • Guy Tennenholtz, Assaf Hallak, Gal Dalal, Shie Mannor, Gal Chechik, Uri Shalit
We analyze the limitations of learning from such data with and without external reward, and propose an adjustment of standard imitation learning algorithms to fit this setup.
1 code implementation • NeurIPS 2021 • Assaf Hallak, Gal Dalal, Steven Dalton, Iuri Frosio, Shie Mannor, Gal Chechik
We first discover and analyze a counter-intuitive phenomenon: action selection through tree search (TS) with a pre-trained value function often performs worse than the original pre-trained agent, even with access to the exact state and reward in future steps.
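The mechanism under study can be illustrated with a minimal sketch of depth-limited tree search that bootstraps a pre-trained value function at the leaves. The `env_model` object with `actions`/`step` methods and a deterministic transition model are illustrative assumptions, not an API from the paper.

```python
def tree_search_value(env_model, state, value_fn, depth, gamma=0.99):
    """Depth-limited exhaustive tree search (illustrative sketch).

    Assumes a deterministic model:
      env_model.actions(state) -> iterable of actions
      env_model.step(state, action) -> (next_state, reward)
    At depth 0, leaves are evaluated with the pre-trained value_fn.
    """
    if depth == 0:
        return value_fn(state)
    best = float("-inf")
    for action in env_model.actions(state):
        next_state, reward = env_model.step(state, action)
        # Bellman-style backup: immediate reward plus discounted subtree value.
        best = max(best, reward + gamma * tree_search_value(
            env_model, next_state, value_fn, depth - 1, gamma))
    return best
```

Errors in `value_fn` at the leaves propagate through every backup, which is one way such a search can underperform the agent that produced the value function in the first place.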
no code implementations • ICML 2017 • Assaf Hallak, Shie Mannor
The problem of online off-policy evaluation (OPE) has been actively studied in the last decade due to its importance both as a stand-alone problem and as a module in a policy improvement scheme.
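A standard baseline for OPE is the ordinary importance-sampling estimator, sketched below under simplifying assumptions (finite trajectories, known behavior-policy probabilities); the function name and signature are illustrative, not from the paper.

```python
import numpy as np

def ois_estimate(trajectories, pi_e, pi_b, gamma=0.99):
    """Ordinary importance-sampling estimate of the evaluation policy's value.

    trajectories: list of trajectories, each a list of (state, action, reward)
                  tuples collected under the behavior policy.
    pi_e, pi_b:   functions (state, action) -> action probability under the
                  evaluation and behavior policies, respectively.
    """
    returns = []
    for traj in trajectories:
        rho = 1.0  # cumulative importance ratio for the whole trajectory
        g = 0.0    # discounted return
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(s, a) / pi_b(s, a)
            g += (gamma ** t) * r
        returns.append(rho * g)
    return float(np.mean(returns))
```

The estimator is unbiased but its variance grows with the horizon, which is why much OPE work targets variance reduction.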
no code implementations • 23 Feb 2017 • Assaf Hallak, Yishay Mansour, Elad Yom-Tov
The LTV approach considers the future implications of the item recommendation, and seeks to maximize the cumulative gain over time.
no code implementations • 17 Sep 2015 • Assaf Hallak, Aviv Tamar, Remi Munos, Shie Mannor
We consider the off-policy evaluation problem in Markov decision processes with function approximation.
no code implementations • 14 Aug 2015 • Assaf Hallak, Aviv Tamar, Shie Mannor
Recently, Sutton et al. (2015) introduced the emphatic temporal differences (ETD) algorithm for off-policy evaluation in Markov decision processes.
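For context, a single ETD(0) update for linear off-policy evaluation can be sketched as follows. This is a minimal rendering of the emphatic-TD idea with unit interest and lambda = 0; the function name and signature are illustrative, not the paper's.

```python
import numpy as np

def etd0_update(w, F_prev, x, x_next, r, rho, rho_prev,
                interest=1.0, gamma=0.95, alpha=0.01):
    """One ETD(0) step (illustrative sketch, lambda = 0).

    w:          linear weight vector
    F_prev:     follow-on trace carried over from the previous step
    x, x_next:  feature vectors of the current and next state
    r:          reward
    rho, rho_prev: importance ratios pi(a|s)/mu(a|s) at the current
                   and previous steps
    Returns the updated weights and the new follow-on trace.
    """
    F = gamma * rho_prev * F_prev + interest  # follow-on trace (emphasis)
    delta = r + gamma * w @ x_next - w @ x    # TD error
    w = w + alpha * rho * F * delta * x       # emphasis- and ratio-weighted update
    return w, F
```

The follow-on trace re-weights updates by how much the target policy "cares about" each state, which is what restores stability under off-policy sampling.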
no code implementations • 11 Feb 2015 • Assaf Hallak, François Schnitzler, Timothy Mann, Shie Mannor
Off-policy learning in dynamic decision problems is essential for providing strong evidence that a new policy is better than the one in use.
no code implementations • 8 Feb 2015 • Assaf Hallak, Dotan Di Castro, Shie Mannor
The objective is to learn a strategy that maximizes the accumulated reward across all contexts.