

How AnLLMs Cut Cache Size Without Sacrificing Accuracy


Authors:

(1) Jianhui Pang, from the University of Macau, and work was done when Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab ([email protected]);

(2) Fanghua Ye, University College London, and work was done when Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab ([email protected]);

(3) Derek F. Wong, University of Macau;

(4) Longyue Wang, Tencent AI Lab, and corresponding author.

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Anchor-based Large Language Models

3.1 Background

3.2 Anchor-based Self-Attention Networks

3.3 Anchor-based Inference

4 Experiments and 4.1 Our Implementation

4.2 Data and Training Procedure

4.3 Evaluation

5 Results

6 Analysis

7 Conclusion, Limitations, Ethics Statement, and References


A More Experimental Results

B Data Settings

5 Results

As evident from the results presented in Table 1, both the AnLLM-AC and AnLLM-EP models demonstrate promising accuracy, comparable to that of the base model, while simultaneously improving memory and inference efficiency.


Accuracy (Acc). The proposed AnLLM-EP and AnLLM-AC models exhibit commendable accuracy across various benchmarks.


In the zero-shot setting, with full attention, AnLLM-EP and AnLLM-AC achieve average accuracies of 64.6% and 65.1%, respectively, comparable to Llama2-7B’s 65.8% accuracy. This suggests that training with integrated anchor tokens barely affects the model capacity, emphasizing the robustness of LLMs. Furthermore, our models excel in OBQA, PIQA, and SCIQ tasks.


In the five-shot setting, with five prior examples, AnLLM-EP and AnLLM-AC maintain dependable performance using full attention. When the AnSAN technique is applied, a slight accuracy decline is observed across all models. This is expected: AnSAN, designed for memory efficiency, removes tokens from the cache, which can lead to information loss. The degradation is most pronounced on BoolQ, the task with the longest demonstrations, indicating that the longer the text, the greater the information loss after compression. However, the average accuracy reduction is minimal, approximately 1.5%, suggesting that AnSAN strikes an effective balance between memory savings and model performance.


Keys/Values Cache Reduction (C⇓). The size of the keys/values cache is a critical factor in the practical implementation of LLMs, particularly concerning memory efficiency and computational resources. In this respect, the AnLLM-EP and AnLLM-AC models offer significant advantages.



By adopting AnSAN, these models dramatically reduce the keys/values cache size during inference. As shown in Table 1, the average reduction is around 90% for AnLLM-EP and an impressive 99% for AnLLM-AC. This is a substantial improvement over conventional approaches, which typically require large caches to store keys/values. These reductions translate into considerable savings in memory and computational resources, rendering the models highly efficient for practical applications.
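To make the mechanism concrete, here is a minimal sketch (not the authors' implementation) of how a keys/values cache could be pruned to anchor positions after a text span has been processed: only the anchor token's entries are retained, since its hidden state is trained to compress the span's information. The helper name, tensor shapes, and the single-anchor toy example are assumptions for illustration.

```python
import torch

def prune_cache_to_anchors(past_keys, past_values, anchor_mask):
    """Keep only the keys/values at anchor positions (hypothetical helper).

    past_keys / past_values: [batch, heads, seq_len, head_dim]
    anchor_mask: [seq_len] boolean, True where a token is an anchor
    (e.g., the last token of a span for AnLLM-EP or an appended
    anchor token for AnLLM-AC).
    """
    idx = anchor_mask.nonzero(as_tuple=True)[0]            # anchor positions
    pruned_keys = past_keys.index_select(dim=2, index=idx)
    pruned_values = past_values.index_select(dim=2, index=idx)
    return pruned_keys, pruned_values

# Toy example: 1 batch, 2 heads, a 10-token span whose final token is the anchor.
keys = torch.randn(1, 2, 10, 64)
values = torch.randn(1, 2, 10, 64)
anchor_mask = torch.zeros(10, dtype=torch.bool)
anchor_mask[-1] = True                                      # keep only the anchor

k, v = prune_cache_to_anchors(keys, values, anchor_mask)
print(k.shape)  # torch.Size([1, 2, 1, 64]) -> 90% cache reduction on this span
```

In practice the reduction scales with span length, which is consistent with the roughly 90% and 99% figures reported for AnLLM-EP and AnLLM-AC.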


Inference Acceleration Ratio (T⇑). The inference acceleration ratio serves as a crucial metric reflecting the model's efficiency during the testing phase. By incorporating anchor tokens into natural language texts, we can repurpose the hidden states of anchor tokens as keys/values caches for the demonstrations and then adopt the inference strategy suggested by Wang et al. (2023). In this scenario, both the AnLLM-EP and AnLLM-AC models demonstrate significant improvements.
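The following toy calculation sketches why this yields speedups: the demonstration prefix is encoded once, compressed to its anchor positions, and the same small cache is reused for every query, so each query attends to far fewer cached keys/values. The token counts are assumed numbers for illustration, not figures from the paper.

```python
# Schematic of anchor-based cache reuse across queries (illustrative only).
DEMO_TOKENS = 5 * 120     # five in-context examples, ~120 tokens each (assumed)
ANCHORS = 5               # one anchor token retained per example
QUERY_TOKENS = 40         # tokens in each new test query (assumed)

# Full-attention baseline: every query attends to the entire demonstration cache.
baseline_cache = DEMO_TOKENS
# AnSAN: the demonstration cache is pruned to its anchor tokens once and reused.
ansan_cache = ANCHORS

for name, cache in [("full attention", baseline_cache), ("AnSAN", ansan_cache)]:
    per_query_kv = cache + QUERY_TOKENS
    print(f"{name}: each query attends to {per_query_kv} cached keys/values")
```

Because attention cost per generated token grows with the cache length, shrinking the reusable prefix cache from hundreds of entries to a handful directly reduces per-query computation and memory traffic.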



The AnLLM-EP and AnLLM-AC models exhibit remarkable performance on natural language understanding benchmarks, effectively balancing accuracy, memory efficiency, and inference speed. The incorporation of anchor tokens into AnLLMs, together with the AnSAN technique for reducing keys/values cache size, allows these models to maintain performance on par with the base model while significantly improving memory efficiency and inference speed. The equilibrium achieved between model performance and computational efficiency is noteworthy and opens up new possibilities for the advancement of LLMs.


This paper is available on arxiv under CC BY 4.0 DEED license.