Practical LLMs for Real-World Applications

(1) Jianhui Pang, University of Macau; work was done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab;

(2) Fanghua Ye, University College London; work was done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab;

(3) Derek F. Wong, University of Macau;

(4) Longyue Wang, Tencent AI Lab (corresponding author).

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Anchor-based Large Language Models

3.1 Background

3.2 Anchor-based Self-Attention Networks

3.3 Anchor-based Inference

4 Experiments and 4.1 Our Implementation

4.2 Data and Training Procedure

4.3 Evaluation

5 Results

6 Analysis

7 Conclusion, Limitations, Ethics Statement, and References

A More Experimental Results

B Data Settings

7 Conclusion

LLMs have emerged as a significant research area in artificial intelligence. However, despite their exceptional performance across various natural language tasks, their practical application is limited by substantial memory overhead and slow inference. Deploying LLMs on resource-constrained devices, such as smartphones, is particularly challenging. To address this issue, we propose anchor-based LLMs (AnLLMs) with the anchor-based self-attention network (AnSAN) technique. Our experiments demonstrate that, at the cost of a marginal 1.5% drop in accuracy, our approach saves 99% of keys/values cache memory while improving inference speed by up to 3.5 times. Applying our methods to machine translation showcases their compatibility and flexibility, effectively enhancing memory efficiency for practical use. Our approach is practical, straightforward, flexible, and compatible with existing methods, paving the way for further adoption of LLMs in real-world applications.
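
To give a sense of scale for the cache saving quoted above, the following back-of-the-envelope sketch estimates keys/values cache size. The model dimensions (32 layers, hidden size 4096, 16-bit values) are assumed here to resemble a 7B Llama2-style model, and the token counts (1,000 cached tokens reduced to 10 anchor tokens) are hypothetical numbers chosen only to produce a 99% reduction. They are not measurements from the experiments.

```python
def kv_cache_bytes(num_tokens, num_layers=32, hidden_size=4096, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens.

    Defaults are assumed dimensions for a 7B Llama2-style model in 16-bit
    precision; they are illustrative, not taken from the paper.
    """
    # Factor 2: one key vector and one value vector per token per layer.
    return 2 * num_layers * hidden_size * bytes_per_value * num_tokens


def anchor_cache_savings(total_tokens, anchor_tokens, **model):
    """Compare a full cache with one that keeps only anchor-token entries."""
    full = kv_cache_bytes(total_tokens, **model)
    kept = kv_cache_bytes(anchor_tokens, **model)
    return full, kept, 1.0 - kept / full


# Hypothetical workload: 1,000 prefix tokens compressed into 10 anchor tokens.
full, kept, saved = anchor_cache_savings(total_tokens=1000, anchor_tokens=10)
print(f"full cache:   {full / 2**20:.1f} MiB")
print(f"anchor cache: {kept / 2**20:.1f} MiB")
print(f"saved:        {saved:.0%}")
```

Under these assumed numbers, the full cache takes about 500 MiB and the anchor-only cache about 5 MiB. The actual saving depends on how many tokens each anchor token summarizes.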


Limitations

While our proposed AnLLMs demonstrate significant improvements in memory efficiency and inference acceleration, there are several limitations to consider:

  1. Accuracy trade-off: As observed in the experimental results, our method incurs a minor decrease in accuracy (within 1.5%) compared to the original model. This limitation stems from the information compression process, which may lead to information loss. Although the degradation in accuracy is relatively small, it is crucial to acknowledge this trade-off when deploying our method in practical applications.

  2. Applicability to various tasks: Our experiments primarily focus on question-answering benchmarks and machine translation tasks. The effectiveness of our method in other natural language processing tasks, such as summarization, sentiment analysis, and entity recognition, remains to be thoroughly investigated. Future work should explore the applicability and performance of our method across a broader range of tasks.

  3. Optimal anchor token selection: In our implementation, we chose the last token in a sequence as the anchor token. However, the optimal anchor token selection may vary across different tasks and domains. Further research is needed to develop more sophisticated strategies for identifying and leveraging the most suitable anchor tokens.

  4. Scalability to other LLMs: We have applied our method to the open-source Llama2 models. It remains to be seen how our approach would perform when applied to other open-source LLMs, such as Falcon and Qwen (Almazrouei et al., 2023; Bai et al., 2023). Evaluating the effectiveness and scalability of our method on more extensive language models is an essential direction for future research.
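
The anchor choice described in item 3 (the last token of each sequence) can be illustrated with a minimal sketch. The helper names `anchor_positions` and `keep_anchor_entries` are hypothetical, the split of the text into sentence-level sequences is an assumption, and a plain list of token strings stands in for the real key/value tensors. This is not the authors' implementation.

```python
from typing import List


def anchor_positions(sequences: List[List[str]]) -> List[int]:
    """Return the flat index of the last token of each non-empty sequence."""
    positions = []
    offset = 0
    for seq in sequences:
        if seq:
            # The last token of the sequence is treated as its anchor.
            positions.append(offset + len(seq) - 1)
        offset += len(seq)
    return positions


def keep_anchor_entries(cache: List[str], anchors: List[int]) -> List[str]:
    """Keep only the cache entries located at anchor positions."""
    return [cache[i] for i in anchors]


# Hypothetical input: two sentence-level sequences of tokens.
sequences = [["The", "cat", "sat", "."], ["It", "purred", "."]]
flat = [tok for seq in sequences for tok in seq]

anchors = anchor_positions(sequences)
kept = keep_anchor_entries(flat, anchors)
print(anchors, kept)
```

A different selection strategy, as suggested for future work, would only need to replace `anchor_positions`. The pruning step stays the same.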

Despite these limitations, our work presents a novel approach to enhance memory efficiency and inference acceleration in LLMs. Future research efforts should address these limitations, refining our method and extending its applicability to a wider range of tasks and model architectures.

Ethics Statement

In conducting this research, we have adhered to the highest ethical standards and principles of academic integrity. AnLLMs and AnSAN have been developed and implemented with the primary aim of improving the memory efficiency and inference speed of large language models, without any intention to cause harm or promote malicious applications.

Our methodology and experimental design have been thoroughly reviewed to ensure that the datasets and models employed are used responsibly and appropriately. The RedPajama datasets and the open-source Llama2 models, which we utilized in our study, are publicly available and widely recognized as reliable resources in the research community. All data used in this study have been processed and analyzed in compliance with relevant guidelines and best practices.

We acknowledge that the advancements in large language models and their applications may have potential implications for privacy, security, and fairness. In light of these concerns, we emphasize the importance of responsible usage and deployment of our proposed AnLLMs and AnSAN techniques. Researchers and practitioners adopting our methods should be aware of the potential risks and take necessary precautions to mitigate any unintended consequences.

Throughout this study, we have strived for transparency and reproducibility. Our results and findings are reported honestly and accurately, without any manipulation or misrepresentation. We are committed to sharing our knowledge and insights with the broader research community, and we encourage open discussion and constructive feedback to further advance the understanding and development of efficient and ethical large language models.

In conclusion, this research has been conducted in accordance with the highest ethical standards, and we are dedicated to fostering a responsible and collaborative research environment in the field of large language models and artificial intelligence.


This paper is available on arXiv under the CC BY 4.0 DEED license.