Search Results for author: William Brandon

Found 3 papers, 2 papers with code

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

1 code implementation • 21 May 2024 • William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan-Kelley

Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) both modify the design of the attention block so that multiple query heads can share a single key/value head, reducing the number of distinct key/value heads by a large factor while only minimally degrading accuracy.
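
For context, the snippet above is about the sharing pattern that MQA/GQA introduce. The sketch below is a minimal, hedged illustration of that pattern in PyTorch; the tensor sizes and head counts are made-up assumptions, and it shows the GQA baseline rather than the paper's Cross-Layer Attention method.

```python
# Minimal sketch of the query-head / key-value-head sharing in MQA/GQA.
# All sizes below are illustrative assumptions, not the paper's configuration,
# and this is the GQA baseline rather than the Cross-Layer Attention method.
import torch

batch, seq, d_model = 2, 16, 256
n_query_heads, n_kv_heads = 8, 2            # 4 query heads share each KV head
head_dim = d_model // n_query_heads
group = n_query_heads // n_kv_heads

q = torch.randn(batch, n_query_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # KV cache shrinks 4x here
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Broadcast each KV head to its group of query heads, then attend as usual.
k_shared = k.repeat_interleave(group, dim=1)
v_shared = v.repeat_interleave(group, dim=1)
scores = (q @ k_shared.transpose(-2, -1)) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v_shared      # (batch, n_query_heads, seq, head_dim)
```

Multi-query attention is the special case with a single key/value head (n_kv_heads = 1), and the key/value tensors, and thus the KV cache, shrink in proportion to the reduction in key/value heads.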

Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding

no code implementations • 7 Feb 2024 • Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, William Brandon

In this work, we propose Hydra heads, a sequentially dependent, drop-in replacement for standard draft heads that significantly improves speculation accuracy.
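
To make the "sequentially dependent" distinction concrete, here is a hedged toy sketch: independent draft heads each read only the base model's hidden state, while sequentially dependent heads also condition on the tokens speculated so far. The layer shapes and the additive conditioning are assumptions for exposition, not the paper's exact design.

```python
# Toy contrast between independent draft heads and sequentially dependent ones.
# The module shapes and the additive token-conditioning below are illustrative
# assumptions for exposition, not the Hydra architecture itself.
import torch
import torch.nn as nn

d_model, vocab, n_heads = 256, 1000, 3
embed = nn.Embedding(vocab, d_model)
heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_heads))

hidden = torch.randn(1, d_model)        # base model's last hidden state

# Independent (Medusa-style): every draft head sees only the base hidden state.
independent_draft = [h(hidden).argmax(-1) for h in heads]

# Sequentially dependent: each head also conditions on the tokens already
# speculated by earlier heads (here, by folding in their embeddings).
state, dependent_draft = hidden, []
for h in heads:
    tok = h(state).argmax(-1)
    dependent_draft.append(tok)
    state = state + embed(tok)
```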

Striped Attention: Faster Ring Attention for Causal Transformers

1 code implementation • 15 Nov 2023 • William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley

In experiments running Striped Attention on A100 GPUs and TPUv4s, we are able to achieve up to 1.45x end-to-end throughput improvements over the original Ring Attention algorithm on causal transformer training at a sequence length of 256k.
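
For orientation, the imbalance Striped Attention targets can be seen with a toy partitioning example: under a causal mask, contiguous per-device blocks give the devices holding later tokens far more unmasked work, whereas a striped (round-robin) assignment spreads it more evenly. Everything in the sketch (device count, sequence length, the pair-counting proxy) is an illustrative assumption, not the paper's algorithm.

```python
# Toy illustration of why contiguous per-device blocks are imbalanced under a
# causal mask, and how a striped (round-robin) assignment evens the work out.
# Device count, sequence length, and the pair-counting proxy are illustrative
# assumptions, not the paper's workload model.
seq_len, n_devices = 16, 4
tokens = list(range(seq_len))
block = seq_len // n_devices

contiguous = [tokens[i * block:(i + 1) * block] for i in range(n_devices)]  # contiguous blocks
striped = [tokens[i::n_devices] for i in range(n_devices)]                  # striped assignment

def causal_work(queries):
    # Crude proxy: number of (query, key) pairs the causal mask leaves unmasked.
    return sum(q + 1 for q in queries)

print("contiguous:", [causal_work(p) for p in contiguous])   # heavily skewed
print("striped:   ", [causal_work(p) for p in striped])      # roughly even
```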
