Authors:
(1) Keivan Alizadeh;
(2) Iman Mirzadeh, Major Contribution;
(3) Dmitry Belenko, Major Contribution;
(4) S. Karen Khatamifard;
(5) Minsik Cho;
(6) Carlo C Del Mundo;
(7) Mohammad Rastegari;
(8) Mehrdad Farajtabar.
2. Flash Memory & LLM Inference and 2.1 Bandwidth and Energy Constraints
3.2 Improving Transfer Throughput with Increased Chunk Sizes
3.3 Optimized Data Management in DRAM
4.1 Results for OPT 6.7B Model
4.2 Results for Falcon 7B Model
6 Conclusion and Discussion, Acknowledgements and References
Experimental Setup: Our experiment is designed to optimize inference efficiency on personal devices. To this end, we process sequences individually, running only one sequence at a time. This approach allows us to allocate a specific portion of DRAM for the Key-Value (KV) cache while primarily focusing on the model size. This strategy is particularly effective when dealing with only one sequence/query at a time.[2]
For the implementation of our inference process, we utilize the HuggingFace’s transformers and KV caching. This setup is tested under the condition where approximately half of the model size is available in DRAM. We select this amount as a showcase of the idea of hosting the LLM in flash. With a different level of sparsity or employing quantization, one can work with smaller available DRAM capacity as well. Such a configuration demonstrates the practicality of executing inference with lower memory footprints.
Hardware Configuration. Our models are evaluated using two distinct hardware setups. The first setup includes an Apple M1 Max with a 1TB solid-state drive (SSD) for flash memory. In this configuration, computations are performed on the CPU, and the models are maintained in a 32-bit format. The second setup involves a Linux machine equipped with a 24 GB NVIDIA GeForce RTX 4090 graphics card. For this machine, computations are GPU-based, and models are run in the bfloat16 format. For both setups, we operate under the assumption that almost half of the total available memory (DRAM plus GPU memory) is allocated for model computations.
Models. We use OPT 6.7B (Zhang et al., 2022b) and a sparsified Falcon 7B (Mirzadeh et al., 2023) model for our evaluations.
Baselines. For methods not employing sparsity or weight sharing, at least half of the model must be transferred from flash memory during the forward pass. This necessity arises because, initially, only half of the model is available in DRAM, but as the forward pass progresses, the entire model capacity is utilized. Consequently, any data not present at the start must be transferred at least once. Thus, the most efficient theoretical baseline involves loading half of the model size from the flash memory into DRAM. This optimal I/O scenario serves as our primary baseline. Comparative methods, such as FlexGen (Sheng et al., 2023) and Petals (Borzunov et al., 2023), are also constrained by the limited available DRAM or GPU memory, and therefore cannot surpass this theoretical I/O efficiency.
Flash memory Data Loading Implementation. To optimize data loading from flash memory, our system employs reads parallelized over 32 threads. This multithreaded approach is intended to both better amortize latency to first byte by not waiting for each read sequentially, and maximize read throughput by reading multiple streams at once (Figure 2b).
Caching Considerations for Data Loading from Flash Memory. When data is read from flash memory, the operating system typically caches these pages, anticipating future reuse. However, this caching mechanism consumes additional memory in DRAM beyond what is allocated for the model. To accurately assess the real throughput of flash memory under limited DRAM conditions, benchmarks should be conducted without relying on caching. Practical systems may or may not rely on filesystem cache, depending on requirements.
For the purpose of our hardware benchmarking in this study, we deliberately and significantly pessimize our NVMe throughput measurements. On macOS and iOS, we employ the F_NOCACHE flag with the fcntl() function, while on Linux, we use DirectIO. Additionally, on macOS, we clear any resident buffers before initiating the benchmark using the purge command. This approach provides a conservative lower bound of throughput in scenarios where no caching is permitted, and makes the benchmarks repeatable. It’s worth noting that these figures can improve if either the inference code or the operating system is allowed to cache some part of the weights.
While OS-level buffer caching is advantageous for general purpose applications with high cache hit rates, it lacks fine-grained control over cache usage per process or buffer eviction at the application level. In the context of on-device memory constraints and large model sizes, this could lead to a situation where filesystem level does not help, because in order to evaluate later layers earlier layers must be evicted in a rolling pattern, so the effective cache hit rate is close to zero. Aside from being inefficient, this can cause coexistence issues with other processes due to memory allocation pressure and Translation Lookaside Buffer (TLB) churn.
This paper is available on arxiv under CC BY-SA 4.0 DEED license.
[2] For OPT 6.7 B model with context length 2048 KV-cache requires 2048 × 2dmodel elements which is only 8% of model size. Also the KV-cache itself can be held in flash memory.