
Large Language Models on Memory-Constrained Devices Using Flash Memory: Load From Flash

by Knapsack, July 31st, 2024

Too Long; Didn't Read

Efficiently run large language models on devices with limited DRAM by optimizing flash memory use, reducing data transfer, and enhancing throughput.

Authors:

(1) Keivan Alizadeh;

(2) Iman Mirzadeh, Major Contribution;

(3) Dmitry Belenko, Major Contribution;

(4) S. Karen Khatamifard;

(5) Minsik Cho;

(6) Carlo C Del Mundo;

(7) Mohammad Rastegari;

(8) Mehrdad Farajtabar.

Abstract and 1. Introduction

2. Flash Memory & LLM Inference and 2.1 Bandwidth and Energy Constraints

2.2 Read Throughput

3 Load From Flash

3.1 Reducing Data Transfer

3.2 Improving Transfer Throughput with Increased Chunk Sizes

3.3 Optimized Data Management in DRAM

4 Results

4.1 Results for OPT 6.7B Model

4.2 Results for Falcon 7B Model

5 Related Works

6 Conclusion and Discussion, Acknowledgements and References

3 Load From Flash

This section addresses the challenge of conducting inference on devices where the available DRAM is substantially smaller than the size of the model. This necessitates storing the full model weights in flash memory. Our primary metric for evaluating various flash loading strategies is latency, dissected into three distinct components: the I/O cost of loading from flash, the overhead of managing memory with newly loaded data, and the compute cost for inference operations.
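To make this metric concrete, the toy model below expresses per-token latency as the sum of those three components. It is a minimal sketch: the function name, the bandwidth figure, and the example numbers are illustrative assumptions, not measurements from the paper.

```python
# Toy decomposition of per-token latency into the three components above.
# All names and numbers here are illustrative assumptions, not measured values.

def per_token_latency(bytes_loaded_from_flash: float,
                      flash_bandwidth_bytes_per_s: float,
                      memory_management_s: float,
                      compute_s: float) -> float:
    """Per-token latency = flash I/O time + memory-management overhead + compute."""
    flash_io_s = bytes_loaded_from_flash / flash_bandwidth_bytes_per_s
    return flash_io_s + memory_management_s + compute_s

# Example: transferring 1 GiB of weights per token at ~3 GiB/s of flash
# bandwidth dominates the other two terms, which is what the strategies
# below set out to reduce.
latency_s = per_token_latency(
    bytes_loaded_from_flash=1 * 2**30,
    flash_bandwidth_bytes_per_s=3 * 2**30,
    memory_management_s=0.005,
    compute_s=0.020,
)
print(f"estimated per-token latency: {latency_s:.3f} s")
```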


Our proposed solutions for reducing latency under memory constraints are categorized into three strategic areas, each targeting a specific component of that latency:


Reducing Data Load: Aiming to decrease latency associated with flash I/O operations by loading less data[1].


Optimizing Data Chunk Size: Enhancing flash throughput by increasing the size of the data chunks read, thereby mitigating latency (a minimal read-chunking sketch follows this list).


Efficient Management of Loaded Data: Streamlining the management of data once it is loaded into memory to minimize overhead.
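To illustrate the second strategy, the sketch below reads the required weights from flash in larger contiguous chunks rather than as many small random reads. The file path, offsets, and chunk sizes are assumptions for illustration; this is not the paper's implementation, only a sketch of the underlying idea that fewer, larger reads sustain higher flash throughput.

```python
import os

def load_chunks(path: str, offsets: list[int], chunk_bytes: int) -> bytes:
    """Read one contiguous chunk of `chunk_bytes` at each byte offset."""
    out = bytearray()
    fd = os.open(path, os.O_RDONLY)  # OS-level reads from the (hypothetical) weights file
    try:
        for off in offsets:
            out += os.pread(fd, chunk_bytes, off)  # one I/O request per chunk
    finally:
        os.close(fd)
    return bytes(out)

# Usage (hypothetical file "weights.bin"): loading the same 32 MiB as
# 8 chunks of 4 MiB issues far fewer I/O requests than 8192 chunks of 4 KiB
# and typically runs much closer to the drive's sequential bandwidth, e.g.:
#   weights = load_chunks("weights.bin", offsets=[0, 4 * 2**20], chunk_bytes=4 * 2**20)
```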


It is important to note that our focus is not on the compute aspect of the process, as it is orthogonal to the core concerns of our work. This delineation allows us to concentrate on optimizing flash memory interactions and memory management to achieve efficient inference on memory-constrained devices.


Finally, we will elaborate on the implementation of these strategies in subsequent sections.


This paper is available on arXiv under the CC BY-SA 4.0 DEED license.


[1] Note that by data we mean the weights of the neural network; however, our techniques can easily be generalized to other data transferred and used for LLM inference, such as activations or the KV cache, as suggested by Sheng et al. (2023).