V-trace is an off-policy actor-critic reinforcement learning algorithm that helps tackle the lag between when actions are generated by the actors and when the learner estimates the gradient. Consider a trajectory $\left(x_{t}, a_{t}, r_{t}\right)^{t=s+n}_{t=s}$ generated by the actor following some policy $\mu$. We can define the $n$-steps V-trace target for $V\left(x_{s}\right)$, our value approximation at state $x_{s}$ as:
$$ v_{s} = V\left(x_{s}\right) + \sum^{s+n-1}_{t=s}\gamma^{t-s}\left(\prod^{t-1}_{i=s}c_{i}\right)\delta_{t}V $$
Where $\delta_{t}V = \rho_{t}\left(r_{t} + \gamma{V}\left(x_{t+1}\right) - V\left(x_{t}\right)\right)$ is a temporal difference algorithm for $V$, and $\rho_{t} = \text{min}\left(\bar{\rho}, \frac{\pi\left(a_{t}\mid{x_{t}}\right)}{\mu\left(a_{t}\mid{x_{t}}\right)}\right)$ and $c_{i} = \text{min}\left(\bar{c}, \frac{\pi\left(a_{t}\mid{x_{t}}\right)}{\mu\left(a_{t}\mid{x_{t}}\right)}\right)$ are truncated importance sampling weights. We assume that the truncation levels are such that $\bar{\rho} \geq \bar{c}$.
Source: IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner ArchitecturesPaper | Code | Results | Date | Stars |
---|
Task | Papers | Share |
---|---|---|
Reinforcement Learning (RL) | 15 | 31.91% |
Starcraft | 7 | 14.89% |
Starcraft II | 7 | 14.89% |
Decision Making | 3 | 6.38% |
Atari Games | 3 | 6.38% |
Continuous Control | 2 | 4.26% |
OpenAI Gym | 2 | 4.26% |
Language Modelling | 1 | 2.13% |
Large Language Model | 1 | 2.13% |
Component | Type |
|
---|---|---|
🤖 No Components Found | You can add them if they exist; e.g. Mask R-CNN uses RoIAlign |