Authors:
(1) Iason Ofeidis, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven {Equal contribution};
(2) Diego Kiedanski, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven {Equal contribution};
(3) Leandros Tassiulas, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven;
(4) Levon Ghukasyan, Activeloop, Mountain View, CA, USA.
In the process of training a Deep Learning model, the dataset needs to be read from storage and pre-processed before it can be passed as input to the model. Doing so naively requires loading the data into memory all at once. In most cases, and especially with large datasets, this leads to a memory shortage, since the amount of memory available in the system is limited, which in turn degrades the system’s response time. This bottleneck is often remedied in deep learning libraries by using a so-called dataloader. This structure provides a way to iterate over the dataset by leveraging parallel processing, pre-fetching, batching, and other techniques to reduce data loading time and memory overhead as much as possible (Paszke et al., 2019).
The main goal of a dataloader is to transfer data samples from a storage location to the memory co-located with the processing units, forming batches of samples to be fed into the model during training. These actions are constrained by the storage system’s bandwidth and, specifically, by its I/O performance. Thus, depending on the system’s hardware specifications, the filesystem serving it, and the throughput of the link to the computing units, data loading can have an immense influence on the total time needed to complete training.
The following specification of the dataloader component focuses mainly on PyTorch (torch.DataLoader() (PyTorch Core Team)); its TensorFlow counterpart (tf.Dataset() (Abadi et al., 2015)), albeit not identical, bears great similarities.
When employing a dataloader, apart from providing the dataset as input, the user has the option to configure a number of hyperparameters tailored to their needs and resources. A common one available in all dataloaders is the batch size, which, as mentioned before, defines the number of samples that will be used before updating the internal model parameters. This parameter is intrinsically linked with the concept of "mini-batch" in stochastic gradient descent (Hinton et al., 2012) and is therefore one of the first parameters that usually undergoes fine-tuning when better training results are needed.
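As a rough illustration, the sketch below shows how the batch size is specified when constructing a PyTorch dataloader; the synthetic TensorDataset and the particular batch size are purely illustrative and not part of the original specification.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 1,000 samples with 16 features each and a binary label.
features = torch.randn(1000, 16)
labels = torch.randint(0, 2, (1000,))
dataset = TensorDataset(features, labels)

# batch_size determines how many samples are grouped into each mini-batch
# before the model's internal parameters are updated.
loader = DataLoader(dataset, batch_size=64)

for batch_features, batch_labels in loader:
    print(batch_features.shape)  # torch.Size([64, 16]); the last batch may be smaller
    break
```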
Secondly, the user can define the sampling approach, which determines the strategy for drawing samples from the dataset and inserting them into a batch. This could include selecting samples based on specific criteria or according to a probability distribution. This step also offers the option of shuffling, where the samples can be rearranged before every dataset iteration, with the goal usually being to improve the generalization of the training model. Another parameter is the collate/padding function, which essentially specifies the process of linking together all the individual samples inside a batch (think of stacking vectors into a tensor) in order to form a single element to be fed as input to the training model. Moreover, the dataloader can be configured to automatically store fetched data samples in pinned (page-locked) memory, thus enabling faster data transfer to CUDA-enabled devices. A sketch combining these options is shown below.
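The following sketch, again assuming PyTorch and a toy TensorDataset, illustrates how a sampling strategy, a custom collate function, and pinned memory might be configured together; the weighting scheme and the collate function are illustrative choices, not prescriptions from the paper.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

features = torch.randn(1000, 16)
labels = torch.randint(0, 2, (1000,))
dataset = TensorDataset(features, labels)

# Sampling strategy: draw samples according to per-sample weights
# (here, samples with label 1 are twice as likely to be picked).
weights = 1.0 + labels.float()
sampler = WeightedRandomSampler(weights, num_samples=len(dataset))

# Collate function: link the individual samples of a batch into a single
# element by stacking the per-sample tensors into one tensor per field.
def collate(samples):
    xs, ys = zip(*samples)
    return torch.stack(xs), torch.stack(ys)

loader = DataLoader(
    dataset,
    batch_size=64,
    sampler=sampler,     # a sampler replaces shuffle=True; the two are mutually exclusive
    collate_fn=collate,
    pin_memory=True,     # page-locked memory for faster transfer to CUDA devices
)
```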
Dataloaders come with a component called workers, whose purpose is to optimize this data-transferring process. Workers are sub-processes responsible for carrying out the data loading in an asynchronous fashion. When creating an instance of a dataloader, the user has the option to specify the number of workers that will be spawned to take control of this operation. If the number of workers is equal to zero, no sub-processes are created, which in turn means that data fetching happens synchronously in the same process and, thus, the computing units (GPU) have to wait for the data loading to complete (PyTorch Core Team). Otherwise, a number of sub-processes equal to the number of workers is spawned, which prevents the computation code from being blocked by data loading. This is accomplished by pre-fetching future batches in advance so that they are ready when needed.
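A minimal sketch of the worker-related settings follows, assuming a recent PyTorch version in which the prefetch_factor and persistent_workers arguments are available; the specific values are arbitrary and chosen only for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

# num_workers=0: batches are fetched synchronously in the main process,
# so the GPU may sit idle while data is being loaded.
sync_loader = DataLoader(dataset, batch_size=64, num_workers=0)

# num_workers=4: four sub-processes load batches asynchronously, each
# pre-fetching prefetch_factor batches ahead of time so that computation
# is not blocked waiting for data.
async_loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,
    prefetch_factor=2,        # only valid when num_workers > 0
    persistent_workers=True,  # keep worker processes alive across epochs
)
```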
This paper is available on arXiv under a CC 4.0 license.