Authors:
(1) Jianhui Pang, University of Macau; work was done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab (nlp2ct.pangjh3@gmail.com);
(2) Fanghua Ye, University College London; work was done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab (fanghua.ye.19@ucl.ac.uk);
(3) Derek F. Wong, University of Macau;
(4) Longyue Wang, Tencent AI Lab, and corresponding author.

Table of Links

Abstract and 1 Introduction
2 Related Work
3 Anchor-based Large Language Models
3.1 Background
3.2 Anchor-based Self-Attention Networks
3.3 Anchor-based Inference
4 Experiments and 4.1 Our Implementation
4.2 Data and Training Procedure
4.3 Evaluation
5 Results
6 Analysis
7 Conclusion, Limitations, Ethics Statement, and References
A More Experimental Results
B Data Settings

4 Experiments

In this section, we first detail AnLLM’s implementation, then outline the training procedure and model perplexity. Finally, we introduce the evaluation datasets and metrics.

4.1 Our Implementation

We adopt Llama2-7b (Touvron et al., 2023b), an open-source, English-centric LLM, as the base model in our experiments. Following the principles outlined in Section 3, we present our implementations here. The crux is to identify which tokens in a sequence can serve as anchor tokens. We describe two implementation strategies: one uses the endpoints directly, and the other appends a new token at the end of a sequence to serve as the anchor token. The details are as follows:

• AnLLM-EP. This approach uses punctuation marks within the sequence as anchor tokens. Punctuation marks, such as commas, periods, and question marks, are viewed as semantic boundaries within a sequence. As such, they can serve as anchor tokens in AnLLM. In our AnLLM-EP experiments, we use the endpoints in English as the anchor tokens.

• AnLLM-AC. This strategy introduces a new token to act as the sequence anchor. In our implementation, we add a dedicated new token to the vocabulary and initialize its embedding using the mean value of the embedding matrix. For training data, we use the sentence tokenizer from the NLTK package to split texts into sentences, appending the new token at the end of each sentence as the anchor token.[1] During inference, anchor tokens can be added to the text flexibly based on user requirements, such as adding one anchor for each demonstration, which allows for flexible and controllable sequence compression.
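The two strategies in Section 4.1 can be illustrated with a minimal sketch. It is not the authors' code: the anchor token string `<ANCHOR>`, the endpoint set, and all function names are hypothetical placeholders, and a simple regular expression stands in for the NLTK sentence tokenizer so that the example has no external dependencies. The embedding matrix is a plain list of rows rather than the Llama2-7b embedding table.

```python
import re

# Hypothetical placeholder: the literal anchor token string used by the
# authors is not given in the text above.
ANCHOR_TOKEN = "<ANCHOR>"

# Assumption: '.', '!' and '?' count as sentence endpoints for AnLLM-EP.
ENDPOINTS = {".", "!", "?"}


def endpoint_anchor_positions(tokens):
    """AnLLM-EP style: indices of tokens that are endpoint punctuation."""
    return [i for i, tok in enumerate(tokens) if tok in ENDPOINTS]


def split_sentences(text):
    """Naive regex stand-in for a sentence tokenizer (the paper uses NLTK's)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def append_anchors(text, anchor=ANCHOR_TOKEN):
    """AnLLM-AC style: append the anchor token after every sentence."""
    return " ".join(f"{s} {anchor}" for s in split_sentences(text))


def mean_embedding(embedding_matrix):
    """Column-wise mean over all existing embedding rows."""
    n = len(embedding_matrix)
    dim = len(embedding_matrix[0])
    return [sum(row[d] for row in embedding_matrix) / n for d in range(dim)]


def add_anchor_embedding(embedding_matrix):
    """Copy the matrix and add one row, initialized to the mean embedding,
    for the newly added anchor token."""
    extended = [list(row) for row in embedding_matrix]
    extended.append(mean_embedding(embedding_matrix))
    return extended


if __name__ == "__main__":
    tokens = ["Anchors", "compress", "context", ".", "Do", "they", "help", "?"]
    print(endpoint_anchor_positions(tokens))
    print(append_anchors("Anchors compress context. Do they help?"))
    print(add_anchor_embedding([[1.0, 2.0], [3.0, 4.0]]))
```

In a real setup the same ideas would presumably operate on tokenizer IDs and on the model's actual embedding table; the list-based version here only shows the logic of choosing anchors and initializing the new embedding row.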
This paper is available on arxiv under CC BY 4.0 DEED license.

[1] https://www.nltk.org/api/nltk.tokenize.punkt.html

