Large Language Models on Memory-Constrained Devices Using Flash Memory: Read Throughput

Authors:

(1) Keivan Alizadeh;
(2) Iman Mirzadeh, Major Contribution;
(3) Dmitry Belenko, Major Contribution;
(4) S. Karen Khatamifard;
(5) Minsik Cho;
(6) Carlo C Del Mundo;
(7) Mohammad Rastegari;
(8) Mehrdad Farajtabar.

Table of Links

Abstract and 1. Introduction
2. Flash Memory & LLM Inference and 2.1 Bandwidth and Energy Constraints
2.2 Read Throughput
3 Load From Flash
3.1 Reducing Data Transfer
3.2 Improving Transfer Throughput with Increased Chunk Sizes
3.3 Optimized Data Management in DRAM
4 Results
4.1 Results for OPT 6.7B Model
4.2 Results for Falcon 7B Model
5 Related Works
6 Conclusion and Discussion, Acknowledgements and References

2.2 Read Throughput

Flash memory systems perform optimally with large sequential reads. For instance, benchmarks on an Apple MacBook Pro M2 with 2TB flash demonstrate speeds exceeding 6GiB/s for a 1GiB linear read of an uncached file. However, this high bandwidth is not replicated for smaller, random reads because each such read passes through multiple phases, encompassing the operating system, drivers, interrupt handling, and the flash controller, among others. Each phase introduces latency, which disproportionately affects smaller reads.

To circumvent these limitations, we advocate two primary strategies, which can be employed jointly.

The first involves reading larger chunks of data. For smaller blocks, a substantial part of the overall read time is spent waiting for data transfer to begin. This is often referred to as latency to first byte. This latency considerably reduces the throughput of each read operation, because the measured throughput has to account not just for the speed of transfer once it begins, but also for the latency before it begins, which penalizes small reads. This means that if we coalesce the reads for rows and columns of the FFN matrices, we pay the latency cost only once for any given row/column pair in both matrices, and higher throughput can be realized.
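The latency argument above can be made concrete with a simple model in which every read pays a fixed latency to first byte and then transfers at peak bandwidth. The sketch below is only an illustration under that assumption: the 100 µs latency and the 8 KiB row size are hypothetical values chosen for the example, not measurements from the paper, and the peak bandwidth is merely set to the 6GiB/s sequential figure quoted in the text.

```python
KIB = 1024
GIB = 1024 ** 3

# Illustrative assumptions, not measurements from the paper.
LATENCY_S = 100e-6   # hypothetical latency to first byte per read
PEAK_BPS = 6 * GIB   # sequential bandwidth figure quoted in the text


def read_time(size_bytes, latency_s=LATENCY_S, peak_bps=PEAK_BPS):
    """Time for one read: fixed latency plus transfer time at peak bandwidth."""
    return latency_s + size_bytes / peak_bps


def effective_throughput(size_bytes, latency_s=LATENCY_S, peak_bps=PEAK_BPS):
    """Bytes per second actually delivered by a single read of the given size."""
    return size_bytes / read_time(size_bytes, latency_s, peak_bps)


# Larger chunks amortize the fixed latency over more bytes.
for size_kib in (4, 32, 256, 2048):
    tput = effective_throughput(size_kib * KIB)
    print(f"{size_kib:>5} KiB read: {tput / GIB:.3f} GiB/s")

# Coalescing a row/column pair: one read of 2*s bytes instead of two reads of s.
ROW_BYTES = 8 * KIB  # hypothetical size of one FFN row or column
separate = 2 * read_time(ROW_BYTES)   # latency paid twice
coalesced = read_time(2 * ROW_BYTES)  # latency paid once
print(f"separate: {separate * 1e6:.1f} us, coalesced: {coalesced * 1e6:.1f} us")
```

Under this model the coalesced read always saves exactly one latency term, which is why the saving matters most when the latency dominates the transfer time, that is, for small reads.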
This principle, that larger reads amortize the fixed latency, is depicted in Figure 2b. A perhaps counterintuitive yet interesting observation is that in some scenarios it is worthwhile to read more data than needed in larger chunks and then discard the excess, rather than to read only the strictly necessary parts in smaller chunks.

The second strategy leverages parallelized reads, utilizing the inherent parallelism within storage stacks and flash controllers. Our results indicate that throughputs appropriate for sparse LLM inference are achievable on modern off-the-shelf hardware using 32KiB or larger random reads across multiple threads.

Motivated by the challenges described in this section, in Section 3 we propose methods that optimize data transfer volume and enhance read throughput in order to significantly improve inference speed.

This paper is available on arXiv under CC BY-SA 4.0 DEED license.
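The second strategy from Section 2.2, issuing random reads of 32KiB or more from multiple threads, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the worker count, and the demo file layout are all hypothetical. It also reads through the operating system's page cache, so it shows the access pattern only and is not a benchmark of flash throughput.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = 32 * 1024  # 32 KiB, the smallest read size the text suggests


def read_chunk(path, offset, size=CHUNK):
    """Read `size` bytes at `offset`; each call uses its own file handle,
    so threads never share a file position."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


def parallel_read(path, offsets, size=CHUNK, workers=8):
    """Issue the reads concurrently; results follow the order of `offsets`.
    The worker count is a hypothetical default, not a tuned value."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda off: read_chunk(path, off, size), offsets))


def make_demo_file(num_chunks=16):
    """Write a temporary file whose i-th chunk is filled with byte value i."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        for i in range(num_chunks):
            tmp.write(bytes([i % 256]) * CHUNK)
        return tmp.name


if __name__ == "__main__":
    path = make_demo_file()
    try:
        wanted = [5, 0, 12, 7]  # chunk indices in arbitrary (random) order
        chunks = parallel_read(path, [i * CHUNK for i in wanted])
        print([c[0] for c in chunks], [len(c) for c in chunks])
    finally:
        os.remove(path)
```

Threads are a reasonable fit here because blocking file reads release CPython's global interpreter lock, so several reads can be in flight at once and the storage stack can service them in parallel.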
