
Pre-Training AnLLMs: Leveraging RedPajama Data for Enhanced Performance

by Anchoring, October 10th, 2024

Too Long; Didn't Read

In this section, we detail the training procedure for AnLLMs, utilizing the RedPajama-Data-1T Sample dataset. The training loss for both AnLLM-EP and AnLLM-AC decreases to around 1.9, with AnLLM-AC performing slightly better. Perplexity evaluations demonstrate that AnLLMs maintain strong language modeling capabilities comparable to the base model, Llama2-7B, confirming their effectiveness with anchor-based attention.

Authors:

(1) Jianhui Pang, University of Macau; the work was done when Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab ([email protected]);

(2) Fanghua Ye, University College London; the work was done when Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab ([email protected]);

(3) Derek F. Wong, University of Macau;

(4) Longyue Wang, Tencent AI Lab (corresponding author).

Abstract and 1 Introduction

2 Related Work

3 Anchor-based Large Language Models

3.1 Background

3.2 Anchor-based Self-Attention Networks

3.3 Anchor-based Inference

4 Experiments and 4.1 Our Implementation

4.2 Data and Training Procedure

4.3 Evaluation

5 Results

6 Analysis

7 Conclusion, Limitations, Ethics Statement, and References


A More Experimental Results

B Data Settings

4.2 Data and Training Procedure

Since AnLLMs are expected to predict subsequent tokens conditioned on the keys/values hidden states of anchor tokens, this poses a significant challenge for existing open-source LLMs. To this end, we substitute the standard self-attention networks with the anchor-based self-attention networks detailed in Section 3.2 and continually pre-train the Llama2 model on a publicly available corpus.
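To make the attention pattern concrete, below is a minimal sketch of how such an anchor-based attention mask could be constructed, assuming the anchor token sits at the end of each preceding sequence; the function name and mask-building details are illustrative and not taken from the authors' released code.

import torch

def anchor_attention_mask(seq_len, anchor_positions):
    """Boolean mask (True = may attend). Each query attends causally to the
    tokens of its own, still-unfinished sequence plus the anchor tokens of
    all earlier sequences, whose keys/values summarize that earlier text."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    anchors = sorted(anchor_positions)
    for q in range(seq_len):
        prev_anchors = [a for a in anchors if a < q]
        seg_start = prev_anchors[-1] + 1 if prev_anchors else 0
        mask[q, seg_start:q + 1] = True      # causal attention within the current sequence
        if prev_anchors:
            mask[q, prev_anchors] = True     # plus the anchor tokens of earlier sequences
    return mask

# Example: two sentences ending with anchors at positions 3 and 7 in a 10-token stream.
print(anchor_attention_mask(10, anchor_positions=[3, 7]).int())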


Data. We employ the RedPajama-Data-1T Sample dataset (Computer, 2023) for continual pre-training.[2] This dataset comprises 850,000 samples with approximately 1 billion tokens, which are right-truncated to fit the model context length of 4,096.
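As a rough illustration of this preparation step, the sketch below loads the sample dataset and applies right truncation at 4,096 tokens using the Hugging Face datasets and transformers libraries; apart from the dataset name and the context length, the details (tokenizer choice, column names) are assumptions rather than the authors' exact pipeline.

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed tokenizer

def tokenize_with_right_truncation(example):
    # Right truncation: keep the first 4,096 tokens and drop the remainder.
    return tokenizer(example["text"], truncation=True, max_length=4096)

tokenized = dataset.map(tokenize_with_right_truncation, remove_columns=dataset.column_names)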



Training Loss and Perplexity. The left-hand side of Figure 3 depicts the training loss of our models. The loss curves for AnLLM-EP and AnLLM-AC consistently decline to approximately 1.9, with AnLLM-AC achieving a slightly lower loss. This observation suggests that continually pre-training an LLM with anchor-based attention masks is indeed viable, enabling the model to effectively learn to compress sequence information into anchor tokens.


The right-hand side of Figure 3 displays the perplexity of the models at varying context lengths. Full attention is used to assess the language modeling capabilities of all models. Following the settings of Chen et al. (2023), perplexity is evaluated on the test samples of the Proof-Pile dataset (Rae et al., 2020). The results demonstrate that both the AnLLM-EP and AnLLM-AC models maintain promising performance, exhibiting language modeling capacity comparable to the base model, Llama2-7B. Moreover, this finding suggests that AnLLMs are compatible with full attention, as indicated by the minimal decline in perplexity.
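For reference, a full-attention perplexity evaluation can be sketched roughly as follows; the non-overlapping window scheme, model identifier, and data handling here are assumptions for illustration and not necessarily the exact protocol of Chen et al. (2023).

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

@torch.no_grad()
def perplexity(text, context_length=4096):
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    nlls, n_tokens = [], 0
    # Evaluate in non-overlapping windows of the chosen context length.
    for start in range(0, ids.size(1) - 1, context_length):
        window = ids[:, start:start + context_length]
        out = model(window, labels=window)   # labels are shifted internally
        n = window.size(1) - 1               # number of predicted tokens in this window
        nlls.append(out.loss * n)
        n_tokens += n
    return math.exp(torch.stack(nlls).sum().item() / n_tokens)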


This paper is available on arxiv under CC BY 4.0 DEED license.


[2] https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample