no code implementations • 18 Jan 2021 • Arjun Balasubramanian, Adarsh Kumar, YuHan Liu, Han Cao, Shivaram Venkataraman, Aditya Akella
We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency DNN inference.
no code implementations • 7 Feb 2020 • Adarsh Kumar, Arjun Balasubramanian, Shivaram Venkataraman, Aditya Akella
In this work, we observe that caching intermediate layer outputs can help us avoid running all the layers of a DNN for a sizeable fraction of inference requests.
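This layer-output caching can be sketched as follows: run the network layer by layer, and after a chosen intermediate layer, look up a cache keyed on that layer's activation; on a hit, return the cached prediction and skip the remaining layers. This is a minimal illustrative sketch, not the paper's actual system; the quantized-hash keying, the `LayerCache` class, and the single fixed cache layer are all simplifying assumptions.

```python
import hashlib
import numpy as np

def activation_key(activation, decimals=1):
    """Hash a coarsely quantized activation to form a cache key.
    Quantizing first makes near-identical activations share a key
    (an illustrative assumption, not the paper's keying scheme)."""
    q = np.round(np.asarray(activation, dtype=np.float64), decimals=decimals)
    return hashlib.sha1(q.tobytes()).hexdigest()

class LayerCache:
    """Toy map from an intermediate layer's output to a final prediction."""
    def __init__(self):
        self.store = {}

    def lookup(self, activation):
        return self.store.get(activation_key(activation))

    def insert(self, activation, prediction):
        self.store[activation_key(activation)] = prediction

def infer(layers, x, cache, cache_layer=1):
    """Run `layers` sequentially; after layer `cache_layer`, consult the
    cache and skip the remaining layers on a hit.
    Returns (prediction, served_from_cache)."""
    h = x
    partial = None
    for i, layer in enumerate(layers):
        h = layer(h)
        if i == cache_layer:
            hit = cache.lookup(h)
            if hit is not None:
                return hit, True      # cache hit: layers i+1.. are skipped
            partial = h               # remember the lookup-point activation
    if partial is not None:
        cache.insert(partial, h)      # miss: populate cache with final output
    return h, False
```

For example, with three toy "layers" (`h + 1`, `h * 2`, `h - 3`), the first request for an input runs all layers and fills the cache, while a repeated request is answered right after the second layer.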