Search Results for author: William Brandon

Found 3 papers, 2 papers with code

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

1 code implementation • 21 May 2024 • William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan-Kelley

Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) both modify the design of the attention block so that multiple query heads can share a single key/value head, reducing the number of distinct key/value heads by a large factor while only minimally degrading accuracy.
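
For context, the snippet above is about the sharing pattern that MQA/GQA introduce. The sketch below is a minimal, hedged illustration of that pattern in PyTorch; the tensor sizes and head counts are made-up assumptions, and it shows the GQA baseline rather than the paper's Cross-Layer Attention method.

```python
# Minimal sketch of the query-head / key-value-head sharing in MQA/GQA.
# All sizes below are illustrative assumptions, not the paper's configuration,
# and this is the GQA baseline rather than the Cross-Layer Attention method.
import torch

batch, seq, d_model = 2, 16, 256
n_query_heads, n_kv_heads = 8, 2            # 4 query heads share each KV head
head_dim = d_model // n_query_heads
group = n_query_heads // n_kv_heads

q = torch.randn(batch, n_query_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # KV cache shrinks 4x here
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Broadcast each KV head to its group of query heads, then attend as usual.
k_shared = k.repeat_interleave(group, dim=1)
v_shared = v.repeat_interleave(group, dim=1)
scores = (q @ k_shared.transpose(-2, -1)) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v_shared      # (batch, n_query_heads, seq, head_dim)
```

Multi-query attention is the special case with a single key/value head (n_kv_heads = 1), and the key/value tensors, and thus the KV cache, shrink in proportion to the reduction in key/value heads.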

Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding

no code implementations • 7 Feb 2024 • Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, William Brandon

In this work, we propose Hydra heads, a sequentially dependent, drop-in replacement for standard draft heads that significantly improves speculation accuracy.
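
To make the "sequentially dependent" distinction concrete, here is a hedged toy sketch: independent draft heads each read only the base model's hidden state, while sequentially dependent heads also condition on the tokens speculated so far. The layer shapes and the additive conditioning are assumptions for exposition, not the paper's exact design.

```python
# Toy contrast between independent draft heads and sequentially dependent ones.
# The module shapes and the additive token-conditioning below are illustrative
# assumptions for exposition, not the Hydra architecture itself.
import torch
import torch.nn as nn

d_model, vocab, n_heads = 256, 1000, 3
embed = nn.Embedding(vocab, d_model)
heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_heads))

hidden = torch.randn(1, d_model)        # base model's last hidden state

# Independent (Medusa-style): every draft head sees only the base hidden state.
independent_draft = [h(hidden).argmax(-1) for h in heads]

# Sequentially dependent: each head also conditions on the tokens already
# speculated by earlier heads (here, by folding in their embeddings).
state, dependent_draft = hidden, []
for h in heads:
    tok = h(state).argmax(-1)
    dependent_draft.append(tok)
    state = state + embed(tok)
```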

Striped Attention: Faster Ring Attention for Causal Transformers

1 code implementation • 15 Nov 2023 • William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley

In experiments running Striped Attention on A100 GPUs and TPUv4s, we are able to achieve up to 1.45x end-to-end throughput improvements over the original Ring Attention algorithm on causal transformer training at a sequence length of 256k.
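
For orientation, the imbalance Striped Attention targets can be seen with a toy partitioning example: under a causal mask, contiguous per-device blocks give the devices holding later tokens far more unmasked work, whereas a striped (round-robin) assignment spreads it more evenly. Everything in the sketch (device count, sequence length, the pair-counting proxy) is an illustrative assumption, not the paper's algorithm.

```python
# Toy illustration of why contiguous per-device blocks are imbalanced under a
# causal mask, and how a striped (round-robin) assignment evens the work out.
# Device count, sequence length, and the pair-counting proxy are illustrative
# assumptions, not the paper's workload model.
seq_len, n_devices = 16, 4
tokens = list(range(seq_len))
block = seq_len // n_devices

contiguous = [tokens[i * block:(i + 1) * block] for i in range(n_devices)]  # contiguous blocks
striped = [tokens[i::n_devices] for i in range(n_devices)]                  # striped assignment

def causal_work(queries):
    # Crude proxy: number of (query, key) pairs the causal mask leaves unmasked.
    return sum(q + 1 for q in queries)

print("contiguous:", [causal_work(p) for p in contiguous])   # heavily skewed
print("striped:   ", [causal_work(p) for p in striped])      # roughly even
```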
