Authors:
(1) Jianhui Pang, University of Macau, work done while interning at Tencent AI Lab ([email protected]);
(2) Fanghua Ye, University College London, work done while interning at Tencent AI Lab ([email protected]);
(3) Derek F. Wong, University of Macau;
(4) Longyue Wang, Tencent AI Lab, and corresponding author.
3 Anchor-based Large Language Models
3.2 Anchor-based Self-Attention Networks
4 Experiments and 4.1 Our Implementation
4.2 Data and Training Procedure
7 Conclusion, Limitations, Ethics Statement, and References
Anchor-based Attention Masks. To accomplish this, we devise anchor-based attention masks, as illustrated in Figure 2. When the current token in the sequence is a non-anchor token, we allow attention to previous non-anchor tokens within the same sequence and to anchor tokens from preceding sequences, while blocking attention to non-anchor tokens from previous sequences. This ensures that a non-anchor token can access only the anchor tokens of preceding sequences together with the tokens of its own sequence. Conversely, when the current token is an anchor token, i.e., the last token of its sequence, we permit attention only to previous non-anchor tokens within the same sequence and block all other attention. This constraint forces the anchor token to aggregate information solely from its own sequence. Accordingly, we replace Eq. (3) with the anchor-based attention masks in Eq. (4) to determine the mask of the i-th token in the input text with respect to the j-th token (assuming that the i-th token belongs to the k-th sequence).
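To make the masking rule concrete, below is a minimal sketch (not the authors' released code) that builds such a mask in PyTorch. It assumes per-token sequence ids and anchor flags are already available; the function name `anchor_attention_mask` and the choice to always keep self-attention on the diagonal are our own illustrative assumptions.

```python
# Sketch of the anchor-based attention mask described above (illustrative only).
# Assumptions: seq_ids[t] is the index of the sequence token t belongs to, and
# is_anchor[t] marks whether token t is the anchor (last) token of its sequence.
# The returned boolean mask has True where attention is allowed.
import torch


def anchor_attention_mask(seq_ids: torch.Tensor, is_anchor: torch.Tensor) -> torch.Tensor:
    n = seq_ids.size(0)
    pos = torch.arange(n)
    causal = pos.unsqueeze(1) >= pos.unsqueeze(0)             # standard causal constraint: j <= i
    same_seq = seq_ids.unsqueeze(1) == seq_ids.unsqueeze(0)   # key j is in the same sequence as query i
    anchor_j = is_anchor.unsqueeze(0).expand(n, n)            # key j is an anchor token
    anchor_i = is_anchor.unsqueeze(1).expand(n, n)            # query i is an anchor token

    # Non-anchor query: attend to its own sequence and to anchor tokens of preceding sequences.
    non_anchor_rule = same_seq | anchor_j
    # Anchor query: attend only to the non-anchor tokens of its own sequence.
    anchor_rule = same_seq & ~anchor_j

    mask = torch.where(anchor_i, anchor_rule, non_anchor_rule) & causal
    # Assumption: keep self-attention on the diagonal, as is conventional in causal attention.
    mask |= torch.eye(n, dtype=torch.bool)
    return mask


# Example: two sequences of lengths 3 and 2; the last token of each is its anchor.
seq_ids = torch.tensor([0, 0, 0, 1, 1])
is_anchor = torch.tensor([False, False, True, False, True])
print(anchor_attention_mask(seq_ids, is_anchor).int())
```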
Anchor Token Selection. By training LLMs with the AnSAN mechanism, we compel the model to compress sequence information into the anchor token and to generate new tokens based on anchor-token information from previous sequences and non-anchor-token information from the current sequence.
The challenge now lies in selecting an appropriate anchor token. In our experiments, we propose two implementations: one uses the endpoint of each sequence as the anchor token, and the other appends a new token to serve specifically as the anchor token.
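To illustrate the two options, here is a short sketch (our own assumption, not the paper's implementation) that prepares token ids and anchor flags under each strategy; the special token id `ANCHOR_ID` and the helper names are hypothetical, and the resulting flags could feed a mask construction like the one sketched earlier.

```python
# Illustrative sketch of the two anchor-selection strategies (assumptions, not released code).
# `sequences` is a list of token-id lists, one per sequence.
from typing import List, Tuple

ANCHOR_ID = 32000  # hypothetical id of a new special token added to the vocabulary


def mark_endpoint_anchor(sequences: List[List[int]]) -> Tuple[List[int], List[bool]]:
    """Strategy 1: reuse each sequence's existing final (endpoint) token as its anchor."""
    tokens, is_anchor = [], []
    for seq in sequences:
        tokens.extend(seq)
        is_anchor.extend([False] * (len(seq) - 1) + [True])  # last token is the anchor
    return tokens, is_anchor


def append_new_anchor(sequences: List[List[int]]) -> Tuple[List[int], List[bool]]:
    """Strategy 2: append a dedicated anchor token to the end of each sequence."""
    tokens, is_anchor = [], []
    for seq in sequences:
        tokens.extend(seq + [ANCHOR_ID])
        is_anchor.extend([False] * len(seq) + [True])
    return tokens, is_anchor


# Example with two toy sequences.
seqs = [[11, 12, 13], [21, 22]]
print(mark_endpoint_anchor(seqs))
print(append_new_anchor(seqs))
```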
This paper is available on arxiv under CC BY 4.0 DEED license.