In this paper, researchers highlight dataloaders as key to improving ML training, comparing libraries for functionality, usability, and performance.
(1) Iason Ofeidis, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven {Equal contribution};

(2) Diego Kiedanski, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven {Equal contribution};

(3) Leandros Tassiulas, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven;

(4) Levon Ghukasyan, Activeloop, Mountain View, CA, USA.


In this paper, we explored the current landscape of PyTorch libraries that allow machine learning practitioners to load their datasets into their models. These libraries offer a wide array of features, ranging from increased speed to views over only a subset of the data to loading data from remote storage. We believe that remote loading holds the most promise of all these features, since it enables the decoupling of data storage and model training. Even though loading over the public internet is naturally slower than loading from a local disk, some libraries, such as Deep Lake, showed remarkable results (only a 13% increase in time). For the most part, we did not find a considerable difference in performance across libraries; the exceptions were FFCV in the multi-GPU setting and Deep Lake for networked loading, both of which performed remarkably well. However, we did notice that the documentation for most of these libraries is not readily available or comprehensive, which might result in misconfigured setups. Since good practices are hard to find, a programmer might reuse what works well in one dataloader, which need not work in a new library. At this point, the performance gains do not seem large enough to justify migrating existing code bases for small to medium jobs. For larger jobs, switching to one of the faster libraries could yield significant cost reductions. Finally, we believe that an innovative caching system designed for machine learning applications could be the final piece in realizing the vision of a truly decoupled dataset-model system. Any such approach would have to build on existing knowledge of dataset summarization and active learning.
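To make the caching idea in the closing sentences more concrete, the following is a deliberately naive sketch of a read-through cache placed in front of a slow remote read. It is not the innovative system envisioned above, nor the API of any library covered in this paper. The names `CachedRemoteDataset`, `fake_remote_fetch`, and `remote_reads` are hypothetical, and the remote read is simulated by a plain function so that the example runs with the standard library alone.

```python
from collections import OrderedDict
from typing import Any, Callable


class CachedRemoteDataset:
    """Map-style dataset with a least-recently-used (LRU) read-through cache.

    Hypothetical illustration only: `fetch` stands in for any slow remote
    read (for example, an object-store request).
    """

    def __init__(self, fetch: Callable[[int], Any], length: int, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._fetch = fetch
        self._length = length
        self._capacity = capacity
        self._cache: "OrderedDict[int, Any]" = OrderedDict()
        self.remote_reads = 0  # counts how often the slow path was taken

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, index: int) -> Any:
        if not 0 <= index < self._length:
            raise IndexError(index)
        if index in self._cache:
            self._cache.move_to_end(index)  # mark as most recently used
            return self._cache[index]
        sample = self._fetch(index)  # cache miss: go to "remote" storage
        self.remote_reads += 1
        self._cache[index] = sample
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used
        return sample


def fake_remote_fetch(index: int) -> int:
    """Assumed stand-in for a network read; returns a toy sample."""
    return index * index


if __name__ == "__main__":
    dataset = CachedRemoteDataset(fake_remote_fetch, length=4, capacity=4)
    for _epoch in range(3):
        for i in range(len(dataset)):
            _ = dataset[i]
    # The cache holds the whole dataset, so only the first epoch hits storage.
    print("remote reads:", dataset.remote_reads)
```

A cache like this only helps when it can hold most of the dataset. With a capacity smaller than the dataset and a fixed sequential pass, LRU eviction discards each sample just before it is requested again, so every access goes back to remote storage. This is one reason a general-purpose policy is unlikely to suffice and why a cache designed around training access patterns, as suggested above, may be needed. Because the class exposes only `__len__` and `__getitem__`, it follows the shape of a PyTorch map-style dataset, although with multiple loader worker processes each worker would presumably hold its own separate copy of the cache.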


The authors would like to thank the Activeloop team for their support and insights during the development of this project. The authors would also like to thank both Tryolabs and Activeloop for providing the resources used to run some of the experiments.


This paper is available on arXiv under a CC 4.0 license.