Large Language Models on Memory-Constrained Devices Using Flash Memory: Flash Memory & LLM Inference

by Knapsack Technology, July 31st, 2024

Too Long; Didn't Read

Efficiently run large language models on devices with limited DRAM by optimizing flash memory use, reducing data transfer, and enhancing throughput.

Authors:

(1) Keivan Alizadeh;

(2) Iman Mirzadeh, Major Contribution;

(3) Dmitry Belenko, Major Contribution;

(4) S. Karen Khatamifard;

(5) Minsik Cho;

(6) Carlo C Del Mundo;

(7) Mohammad Rastegari;

(8) Mehrdad Farajtabar.

Table of Links

Abstract and 1. Introduction

2. Flash Memory & LLM Inference and 2.1 Bandwidth and Energy Constraints

2.2 Read Throughput

3 Load From Flash

3.1 Reducing Data Transfer

3.2 Improving Transfer Throughput with Increased Chunk Sizes

3.3 Optimized Data Management in DRAM

4 Results

4.1 Results for OPT 6.7B Model

4.2 Results for Falcon 7B Model

2 Flash Memory & LLM Inference

In this section, we explore the characteristics of memory storage systems (e.g., flash, DRAM), and their implications for large language model (LLM) inference. Our aim is to elucidate the challenges and hardware-specific considerations essential for algorithm design, particularly in optimizing inference when working with flash memory.

2.1 Bandwidth and Energy Constraints

Modern NAND flash memory offers high bandwidth and low latency, yet it falls well short of DRAM (Dynamic Random-Access Memory) in both latency and throughput. Figure 2a illustrates these differences. A naive inference implementation that relies on NAND flash might need to reload the entire model for each forward pass. This is time-consuming, often taking seconds even for compressed models, and it consumes more energy than transferring data from DRAM to the internal memory of the CPU or GPU.
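To make the cost of a full reload concrete, the following back-of-the-envelope sketch divides model size by sustained read bandwidth. The parameter count, the 16-bit weight size, and both bandwidth figures are illustrative assumptions chosen for round numbers; they are not measurements from the paper, and real devices will differ.

```python
def full_load_seconds(num_params: float, bytes_per_param: float,
                      bandwidth_gb_per_s: float) -> float:
    """Seconds needed to stream every weight once at a sustained bandwidth."""
    model_bytes = num_params * bytes_per_param
    return model_bytes / (bandwidth_gb_per_s * 1e9)


# Illustrative inputs only (assumptions, not figures reported in the paper).
NUM_PARAMS = 7e9        # a model with roughly 7 billion parameters
BYTES_PER_PARAM = 2     # 16-bit weights
FLASH_GB_PER_S = 1.0    # assumed sustained flash read bandwidth
DRAM_GB_PER_S = 100.0   # assumed DRAM bandwidth

flash_seconds = full_load_seconds(NUM_PARAMS, BYTES_PER_PARAM, FLASH_GB_PER_S)
dram_seconds = full_load_seconds(NUM_PARAMS, BYTES_PER_PARAM, DRAM_GB_PER_S)

print(f"full read from flash: {flash_seconds:.2f} s")
print(f"full read from DRAM:  {dram_seconds:.2f} s")
```

Under these assumed numbers, every forward pass that rereads the whole model from flash pays the full flash figure again, which is why the amount of data read per pass matters more than any single transfer.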

Load times can be a problem even in the traditional DRAM-resident setup, where weights are not reloaded partially: the initial full load of the model still incurs a penalty, particularly when the first token must be produced quickly. Our approach leverages activation sparsity in LLMs to address these challenges by reading model weights selectively, thereby reducing response latency.
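As a rough illustration of what selective reading can look like at the file level, the sketch below stores a small weight matrix row by row in a binary file and then seeks to and reads only the rows for neurons assumed to be active. This is not the authors' implementation: the file layout, the helper names `write_matrix` and `read_rows`, and the list of active indices are all hypothetical, and how the active neurons are identified is outside the scope of the sketch.

```python
import os
import struct
import tempfile

FLOAT32_BYTES = 4


def write_matrix(path, rows):
    """Store a matrix row by row as little-endian float32 values."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(f"<{len(row)}f", *row))


def read_rows(path, row_indices, num_cols):
    """Read only the requested rows, seeking past everything else."""
    row_bytes = num_cols * FLOAT32_BYTES
    loaded = {}
    with open(path, "rb") as f:
        for i in sorted(row_indices):
            f.seek(i * row_bytes)  # jump straight to the start of row i
            loaded[i] = list(struct.unpack(f"<{num_cols}f", f.read(row_bytes)))
    return loaded


def demo():
    num_rows, num_cols = 8, 4
    # Toy weights: row r holds the values r*10, r*10+1, ...
    matrix = [[float(r * 10 + c) for c in range(num_cols)]
              for r in range(num_rows)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ffn_weights.bin")  # hypothetical file name
        write_matrix(path, matrix)
        active = [1, 5]  # hypothetical indices of active neurons
        loaded = read_rows(path, active, num_cols)
        bytes_read = len(active) * num_cols * FLOAT32_BYTES
        total_bytes = os.path.getsize(path)
    return loaded, bytes_read, total_bytes


if __name__ == "__main__":
    loaded, bytes_read, total_bytes = demo()
    print(f"read {bytes_read} of {total_bytes} bytes")
```

In this toy layout the bytes read scale with the number of active rows rather than with the size of the matrix, which is the property that selective loading relies on.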

This paper is available on arXiv under the CC BY-SA 4.0 DEED license.
