no code implementations • 26 Jul 2023 • Tokio Kajitsuka, Issei Sato
Existing analyses of the expressive capacity of Transformer models have required an excessive number of layers for data memorization, leading to a discrepancy with the Transformers actually used in practice.