Adam is an adaptive learning rate optimization algorithm that uses both momentum and per-parameter gradient scaling, combining the benefits of RMSProp and SGD with Momentum. The optimizer is designed for non-stationary objectives and for problems with very noisy and/or sparse gradients.
The weight updates are performed as:
$$ w_{t} = w_{t-1} - \eta\frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon} $$
with
$$ \hat{m}_{t} = \frac{m_{t}}{1-\beta^{t}_{1}} $$
$$ \hat{v}_{t} = \frac{v_{t}}{1-\beta^{t}_{2}} $$
$$ m_{t} = \beta_{1}m_{t-1} + (1-\beta_{1})g_{t} $$
$$ v_{t} = \beta_{2}v_{t-1} + (1-\beta_{2})g_{t}^{2} $$
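$ \eta $ is the step size (learning rate), around 1e-3 in the original paper. $ \epsilon $ is a small constant, typically 1e-8 or 1e-10, that prevents division by zero. $ \beta_{1} $ and $ \beta_{2} $ are forgetting parameters (the exponential decay rates of the two moment estimates), with typical values 0.9 and 0.999, respectively. $ g_{t} $ is the gradient at step $ t $; $ m_{t} $ and $ v_{t} $ are running averages of the gradient and the squared gradient, both initialised to zero, and $ \hat{m}_{t} $ and $ \hat{v}_{t} $ are their bias-corrected versions.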
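The update equations above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `adam_step`, its argument order, and the toy quadratic objective are assumptions made for the example, and the defaults simply mirror the typical values quoted above.

```python
import numpy as np


def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update. `t` is the 1-based step count (hypothetical helper)."""
    m = beta1 * m + (1.0 - beta1) * g        # m_t: running average of the gradient
    v = beta2 * v + (1.0 - beta2) * g ** 2   # v_t: running average of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)           # bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


# Toy objective (an assumption for illustration): f(w) = sum(w**2), gradient 2w.
w0 = np.array([1.0, -2.0])
w, m, v = w0.copy(), np.zeros_like(w0), np.zeros_like(w0)

# First step: bias correction gives m_hat = g and v_hat = g**2,
# so each coordinate moves by roughly eta in the direction of -sign(g).
w1, _, _ = adam_step(w0, 2.0 * w0, m, v, t=1, eta=0.01)

# Run the optimizer for a while on the toy objective.
for t in range(1, 2001):
    g = 2.0 * w
    w, m, v = adam_step(w, g, m, v, t, eta=0.01)
```

Because $ m_{0} = v_{0} = 0 $, the bias correction at $ t = 1 $ yields $ \hat{m}_{1} = g_{1} $ and $ \hat{v}_{1} = g_{1}^{2} $, so the first step has magnitude close to $ \eta $ in every coordinate regardless of the gradient's scale.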
Source: Adam: A Method for Stochastic Optimization

Task | Papers | Share |
---|---|---|
Language Modelling | 56 | 7.39% |
Retrieval | 36 | 4.75% |
Question Answering | 34 | 4.49% |
Large Language Model | 29 | 3.83% |
Semantic Segmentation | 22 | 2.90% |
Prompt Engineering | 13 | 1.72% |
In-Context Learning | 13 | 1.72% |
Sentence | 12 | 1.58% |
Information Retrieval | 11 | 1.45% |