Large Language Models on Memory-Constrained Devices Using Flash Memory: Results for Falcon 7B Model

Authors: (1) Keivan Alizadeh; (2) Iman Mirzadeh, Major Contribution; (3) Dmitry Belenko, Major Contribution; (4) S. Karen Khatamifard; (5) Minsik Cho; (6) Carlo C Del Mundo; (7) Mohammad Rastegari; (8) Mehrdad Farajtabar.

This paper is available on arXiv under the CC BY-SA 4.0 DEED license.

Table of Links
Abstract and 1. Introduction
2. Flash Memory & LLM Inference and 2.1 Bandwidth and Energy Constraints
2.2 Read Throughput
3 Load From Flash
3.1 Reducing Data Transfer
3.2 Improving Transfer Throughput with Increased Chunk Sizes
3.3 Optimized Data Management in DRAM
4 Results
4.1 Results for OPT 6.7B Model
4.2 Results for Falcon 7B Model
5 Related Works
6 Conclusion and Discussion, Acknowledgements and References

4.2 Results for Falcon 7B Model

To verify that our findings generalize beyond OPT models, we also apply the idea of LLM in flash to the Falcon 7B model. Since the baseline Falcon model is not sparse, we used a sparsified (relufied) version with almost the same performance as the base version (Mirzadeh et al., 2023). As in the previous section, we present results obtained under the condition that approximately half of the model size is available for use in DRAM.

Predictors. In the Falcon 7B model, predictors of rank r = 256 are used for the initial 28 layers, and predictors of rank r = 1152 for the last four layers.

Window Configuration. Our model reserves memory for a window containing the last 4 tokens. This setup utilizes 33% of the Feed Forward Network (FFN). In terms of memory allocation, embeddings take 4.2% of the model size, attention weights account for 19.4%, and predictors require 4%. The active portion of the FFN, given our window size, is 25.3% (calculated as 0.33 × 76.8). Overall, this amounts to 52.93% of the model’s total size.
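The memory breakdown in the Window Configuration paragraph is simple arithmetic, and the short Python sketch below reproduces it. The percentages are the ones quoted in the text; the function and variable names are hypothetical and chosen only for illustration. From these rounded inputs the total comes out at roughly 52.9%, in line with the reported 52.93% (the small difference presumably comes from rounding in the published figures).

```python
# Memory shares quoted in the text, as percentages of the total model size.
EMBEDDINGS_PCT = 4.2
ATTENTION_PCT = 19.4
PREDICTORS_PCT = 4.0
FFN_PCT = 76.8               # FFN share used in the text's 0.33 x 76.8 step
FFN_ACTIVE_FRACTION = 0.33   # fraction of the FFN kept for a 4-token window


def dram_budget_pct(embeddings, attention, predictors, ffn, ffn_active_fraction):
    """Return (active FFN share, total DRAM share), both in percent.

    Hypothetical helper: it only adds up the components listed in the text.
    """
    active_ffn = ffn_active_fraction * ffn
    total = embeddings + attention + predictors + active_ffn
    return active_ffn, total


active_ffn_pct, total_pct = dram_budget_pct(
    EMBEDDINGS_PCT, ATTENTION_PCT, PREDICTORS_PCT, FFN_PCT, FFN_ACTIVE_FRACTION
)

print(f"active FFN share: {active_ffn_pct:.1f}%")
print(f"total DRAM share: {total_pct:.1f}%")
```

Changing `FFN_ACTIVE_FRACTION` shows how the DRAM share would move if a different window size kept a different fraction of the FFN resident; the mapping from window size to that fraction is not given here, so it has to be supplied as an input.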
