Search Results for author: Yakun Sophia Shao

Found 5 papers, 3 papers with code

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

1 code implementation • 31 Jan 2024 • Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

LLMs are seeing growing use for applications such as document analysis and summarization which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference.

Quantization

217 GitHub stars


MoCA: Memory-Centric, Adaptive Execution for Multi-Tenant Deep Neural Networks

1 code implementation • 10 May 2023 • Seah Kim, Hasan Genc, Vadim Vadimovich Nikiforov, Krste Asanović, Borivoje Nikolić, Yakun Sophia Shao

Driven by the wide adoption of deep neural networks (DNNs) across different application domains, multi-tenancy execution, where multiple DNNs are deployed simultaneously on the same hardware, has been proposed to satisfy the latency requirements of different applications while improving the overall system utilization.

Fairness


Full Stack Optimization of Transformer Inference: a Survey

no code implementations • 27 Feb 2023 • Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami

In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures and their similarities and differences with previous convolutional models; (ii) implications of Transformer architecture on hardware, including the impact of non-linear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, on hardware design; (iii) approaches for optimizing a fixed Transformer architecture; (iv) challenges in finding the right mapping and scheduling of operations for Transformer models; and (v) approaches for optimizing Transformer models by adapting the architecture using neural architecture search.

Neural Architecture Search, Scheduling


CoSA: Scheduling by Constrained Optimization for Spatial Accelerators

no code implementations • 5 May 2021 • Qijing Huang, Minwoo Kang, Grace Dinh, Thomas Norell, Aravind Kalaiah, James Demmel, John Wawrzynek, Yakun Sophia Shao

Recent advances in Deep Neural Networks (DNNs) have led to active development of specialized DNN accelerators, many of which feature a large number of processing elements laid out spatially, together with a multi-level memory hierarchy and flexible interconnect.

Navigate, Scheduling


Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration

5 code implementations • 22 Nov 2019 • Hasan Genc, Seah Kim, Alon Amid, Ameer Haj-Ali, Vighnesh Iyer, Pranav Prakash, Jerry Zhao, Daniel Grubb, Harrison Liew, Howard Mao, Albert Ou, Colin Schmidt, Samuel Steffl, John Wright, Ion Stoica, Jonathan Ragan-Kelley, Krste Asanovic, Borivoje Nikolic, Yakun Sophia Shao

DNN accelerators are often developed and evaluated in isolation without considering the cross-stack, system-level effects in real-world environments.

1,459 GitHub stars

