Authors:
(1) Keivan Alizadeh;
(2) Iman Mirzadeh, Major Contribution;
(3) Dmitry Belenko, Major Contribution;
(4) S. Karen Khatamifard;
(5) Minsik Cho;
(6) Carlo C Del Mundo;
(7) Mohammad Rastegari;
(8) Mehrdad Farajtabar.
2. Flash Memory & LLM Inference and 2.1 Bandwidth and Energy Constraints
3.2 Improving Transfer Throughput with Increased Chunk Sizes
3.3 Optimized Data Management in DRAM
4.1 Results for OPT 6.7B Model
4.2 Results for Falcon 7B Model
6 Conclusion and Discussion, Acknowledgements and References
Although data transfer within DRAM is more efficient than accessing flash memory, it still incurs a non-negligible cost. When introducing data for new neurons, naively reallocating the matrix and appending new rows can lead to significant overhead, because existing neuron data in DRAM must be rewritten. This is particularly costly when a substantial portion (approximately 25%) of the Feed-Forward Networks (FFNs) in DRAM needs to be rewritten. To address this issue, we adopt an alternative memory management strategy: we preallocate all necessary memory and maintain a corresponding data structure for efficient management. The data structure comprises the elements pointer, matrix, bias, num_used, and last_k_active, shown in Figure 7.
Each row of the matrix is the concatenation of a neuron’s ’up project’ row and ’down project’ column. The pointer vector records the original neuron index corresponding to each row of the matrix, and the bias element holds that neuron’s ’up project’ bias from the original model. The num_used parameter tracks the number of rows currently in use, initially set to zero. The matrix for the i-th layer is preallocated with size Reqi × 2dmodel, where Reqi denotes the maximum number of neurons required for the specified window size, measured on a subset of the C4 validation set. Allocating sufficient memory for each layer in advance minimizes the need for frequent reallocation. Finally, the last_k_active component identifies the neurons of the original model that were activated by the last k tokens.
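As a rough illustration, this per-layer structure can be viewed as a fixed-size buffer plus a few bookkeeping fields. The Python sketch below shows one way to preallocate it; the class name LayerNeuronCache, the argument name req_i, and the use of NumPy arrays are our own illustrative choices and not the paper’s implementation.

```python
import numpy as np

class LayerNeuronCache:
    """Illustrative preallocated DRAM storage for one FFN layer's active neurons
    (hypothetical names; not the authors' actual implementation)."""

    def __init__(self, req_i: int, d_model: int, dtype=np.float16):
        # Each row holds one neuron: its 'up project' row (d_model values)
        # concatenated with its 'down project' column (d_model values).
        self.matrix = np.zeros((req_i, 2 * d_model), dtype=dtype)
        # Original neuron index for each occupied row of `matrix`.
        self.pointer = np.zeros(req_i, dtype=np.int32)
        # 'Up project' bias of each cached neuron.
        self.bias = np.zeros(req_i, dtype=dtype)
        # Number of rows currently in use; starts at zero.
        self.num_used = 0
        # Neurons of the original model activated by the last k tokens.
        self.last_k_active: set[int] = set()
        self.d_model = d_model
```

Because the buffer is sized to Reqi rows up front, the working set later grows and shrinks only by moving rows inside this buffer, never by reallocating it.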
The following operations are performed during inference, as depicted in Figure 7.
1. Deleting Neurons: Neurons that are no longer required for the current window are removed without resizing the matrix: their rows in the matrix, together with the corresponding pointer and bias entries, are overwritten by the last used rows, and num_used is decreased accordingly.

2. Bringing in New Neurons: The necessary neuron data is retrieved from flash memory. The corresponding pointers and scalars are read from DRAM, and the new rows are inserted into the matrix from num_used to num_used + num_new. This eliminates the need to reallocate memory in DRAM and copy existing data, reducing inference latency.
3. Inference Process: For the inference operation, the first half of the matrix, matrix[:num_used, :d_model], is used as the ’up project’, and the transposed second half, matrix[:num_used, d_model:].transpose(), serves as the ’down project’. This is valid because the order of neurons in the intermediate output of the feed-forward layer does not alter the final output, allowing for a streamlined inference process.
These steps collectively ensure efficient memory management during inference, optimizing the neural network’s performance and resource utilization; a minimal sketch of these operations is given below.
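Continuing the illustrative LayerNeuronCache sketch above, the three steps might look as follows. The function names, the stale_rows argument, and the ReLU activation (as used in OPT) are assumptions made for the example; the flash read itself is left outside the sketch.

```python
import numpy as np

def delete_neurons(cache, stale_rows):
    """Step 1: drop neurons no longer needed by overwriting their rows with the
    last used rows and shrinking num_used (no reallocation, no full rewrite)."""
    for row in sorted(stale_rows, reverse=True):   # highest rows first keeps swaps valid
        last = cache.num_used - 1
        cache.matrix[row] = cache.matrix[last]
        cache.pointer[row] = cache.pointer[last]
        cache.bias[row] = cache.bias[last]
        cache.num_used -= 1

def insert_neurons(cache, new_ids, flash_rows, flash_biases):
    """Step 2: append rows loaded from flash after num_used; existing rows are untouched."""
    n_new = len(new_ids)
    start, end = cache.num_used, cache.num_used + n_new
    cache.matrix[start:end] = flash_rows           # concatenated up/down data from flash
    cache.pointer[start:end] = new_ids
    cache.bias[start:end] = flash_biases
    cache.num_used = end

def ffn_forward(cache, x):
    """Step 3: sparse FFN using only the cached neurons (ReLU assumed, as in OPT)."""
    used = cache.matrix[:cache.num_used]
    up = used[:, :cache.d_model]                   # 'up project' rows
    down = used[:, cache.d_model:]                 # 'down project' columns, stored row-wise
    hidden = np.maximum(x @ up.T + cache.bias[:cache.num_used], 0.0)
    return hidden @ down                           # neuron order does not affect the sum
```

The key property the sketch relies on is the one stated in step 3: because each neuron contributes an independent term to the FFN output, rows can be swapped, appended, or removed in any order without changing the result.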
This paper is available on arXiv under a CC BY-SA 4.0 DEED license.