Search Results for author: Ashish Panwar

Found 4 papers, 1 paper with code

Vidur: A Large-Scale Simulation Framework For LLM Inference

1 code implementation • 8 May 2024 • Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav Gulavani, Ramachandran Ramjee, Alexey Tumanov

Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end inference performance for different workloads by estimating several metrics of interest such as latency and throughput.

Scheduling
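A minimal, hypothetical sketch of the profile-then-predict idea described in the abstract above: latency measurements for an operator at a few input sizes are fed to a simple predictive model, which is then used to estimate iteration latency and throughput at unprofiled sizes. The numbers, names, and the linear model are illustrative assumptions, not Vidur's actual code or modeling choices.

```python
import numpy as np

# Pretend these came from experimental profiling: (tokens in batch, measured ms).
profiled_tokens = np.array([128, 256, 512, 1024, 2048], dtype=float)
profiled_latency_ms = np.array([1.1, 2.0, 3.9, 7.6, 15.2])

# Predictive model: least-squares fit, latency ~= a * tokens + b.
a, b = np.polyfit(profiled_tokens, profiled_latency_ms, deg=1)

def predict_operator_latency_ms(tokens: int) -> float:
    """Estimate operator latency for an unprofiled input size."""
    return a * tokens + b

# Estimate an end-to-end iteration as the sum of per-operator predictions,
# then derive a throughput figure from the predicted latency.
iteration_ms = sum(predict_operator_latency_ms(t) for t in (768, 768, 1536))
print(f"predicted iteration latency: {iteration_ms:.2f} ms")
print(f"predicted throughput: {3072 / (iteration_ms / 1000):.0f} tokens/s")
```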

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

no code implementations • 7 May 2024 • Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar

Thus, vAttention unburdens the attention kernel developer from having to explicitly support paging and avoids re-implementation of memory management in the serving framework.

Management
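The following sketch only illustrates why a virtually contiguous KV cache simplifies the attention kernel, as the abstract above argues; it is not vAttention's implementation (the paper relies on GPU virtual-memory mechanisms), and it uses numpy arrays as a stand-in for GPU memory. The block size, block table, and function names are assumptions made for the example.

```python
import numpy as np

BLOCK = 16            # tokens per physical block in the paged layout
seq_len, head_dim = 40, 8

# Paged layout: keys live in non-contiguous blocks, found via a block table.
block_table = [2, 0, 5]                       # logical block -> physical block
physical_blocks = np.random.rand(8, BLOCK, head_dim)

def gather_keys_paged(n_tokens):
    # The kernel must translate every token position through the block table.
    out = []
    for pos in range(n_tokens):
        blk = block_table[pos // BLOCK]
        out.append(physical_blocks[blk, pos % BLOCK])
    return np.stack(out)

# Virtually contiguous layout: the same keys addressable as one array,
# so the kernel does an ordinary contiguous read with no paging logic.
contiguous_keys = np.random.rand(seq_len, head_dim)

def gather_keys_contiguous(n_tokens):
    return contiguous_keys[:n_tokens]

print(gather_keys_paged(seq_len).shape, gather_keys_contiguous(seq_len).shape)
```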

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

no code implementations • 4 Mar 2024 • Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee

However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency.

Scheduling
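A back-of-the-envelope illustration of the interleaving problem stated above, with assumed (not measured) iteration costs: whenever a full prefill is scheduled into an iteration, every in-flight decode in the batch waits for it, inflating time-between-tokens; scheduling only decode-sized batches avoids the stall but leaves compute underused.

```python
# Hypothetical iteration costs; real values depend on model and hardware.
prefill_iteration_ms = 200.0   # assumed cost of one full-prefill iteration
decode_iteration_ms = 10.0     # assumed cost of one decode-only iteration

# A decode that would otherwise produce its next token in ~10 ms instead
# observes a ~200 ms stall when a prefill is slotted into the same iteration.
stall_factor = prefill_iteration_ms / decode_iteration_ms
print(f"token-latency inflation when a prefill preempts decodes: {stall_factor:.0f}x")
```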

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

no code implementations • 31 Aug 2023 • Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee

SARATHI employs chunked-prefills, which splits a prefill request into equal-sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes.

Language Modelling • Large Language Model
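A minimal sketch of the batch construction described in the abstract above: a prefill is split into equal-sized chunks, and each batch carries a single prefill chunk with the remaining slots filled by decode requests. Function names, slot counts, and chunk sizes are illustrative assumptions, not the paper's implementation.

```python
from typing import List

def chunk_prefill(prompt_tokens: List[int], chunk_size: int) -> List[List[int]]:
    """Split a prefill request into equal-sized chunks (the last may be shorter)."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

def decode_maximal_batch(prefill_chunk: List[int],
                         decode_requests: List[str],
                         batch_slots: int) -> dict:
    """One prefill chunk plus as many decode requests as fit in the remaining slots."""
    return {
        "prefill_chunk": prefill_chunk,
        "decodes": decode_requests[:batch_slots - 1],
    }

prompt = list(range(1024))                      # a 1024-token prompt
chunks = chunk_prefill(prompt, chunk_size=256)  # 4 equal chunks
batch = decode_maximal_batch(chunks[0], [f"req{i}" for i in range(16)], batch_slots=8)
print(len(chunks), len(batch["decodes"]))       # -> 4 7
```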
