Authors:
(1) Sasun Hambardzumyan, Activeloop, Mountain View, CA, USA;
(2) Abhinav Tuli, Activeloop, Mountain View, CA, USA;
(3) Levon Ghukasyan, Activeloop, Mountain View, CA, USA;
(4) Fariz Rahman, Activeloop, Mountain View, CA, USA;
(5) Hrant Topchyan, Activeloop, Mountain View, CA, USA;
(6) David Isayan, Activeloop, Mountain View, CA, USA;
(7) Mark McQuade, Activeloop, Mountain View, CA, USA;
(8) Mikayel Harutyunyan, Activeloop, Mountain View, CA, USA;
(9) Tatevik Hakobyan, Activeloop, Mountain View, CA, USA;
(10) Ivo Stranic, Activeloop, Mountain View, CA, USA;
(11) Davit Buniatyan, Activeloop, Mountain View, CA, USA.
Multiple projects have tried to improve upon or create new formats for storing unstructured datasets, including TFRecord extending Protobuf [5], Petastorm [18] extending Parquet [79], Feather [7] extending Arrow [13], Squirrel using MessagePack [75], and Beton in FFCV [39]. Designing a universal dataset format that solves all use cases is very challenging. Our approach was mostly inspired by CloudVolume [11], a 4-D chunked NumPy storage for large volumetric biomedical data. There are other similar chunked NumPy array storage formats, such as Zarr [52], TensorStore [23], and TileDB [57]. Deep Lake introduced a typing system, dynamically shaped tensors, integration with fast deep learning streaming data loaders, queries on tensors, and in-browser visualization support.
An alternative approach to storing large-scale datasets is to use an HPC distributed file system such as Lustre [69], extended with a PyTorch cache [45], or a performant storage layer such as AIStore [26]. Deep Lake datasets can be stored on top of POSIX or REST API-compatible distributed storage systems while leveraging their benefits. Other comparable approaches evolve in vector databases [80, 8] for storing embeddings, feature stores [73, 16], and data version control systems such as DVC [46] or LakeFS [21]. In contrast, Deep Lake version control is built into the format without an external dependency such as Git. Tensor Query Language, similar in approach to TQP [41] and Velox [59], runs n-dimensional numeric operations on tensor storage by fully leveraging the capabilities of deep learning frameworks. Overall, Deep Lake draws parallels to data lakes such as Hudi, Iceberg, and Delta [27, 15, 10], and complements systems such as Databricks' Lakehouse [28] for deep learning applications.
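To make the chunked-array idea concrete, the sketch below stores a small volumetric array in fixed-size chunks using Zarr, one of the chunked NumPy storage formats named above. This is a minimal illustration of the general technique, not Deep Lake's own format or API; the file path, array shape, and chunk size are arbitrary assumptions.

```python
# Minimal sketch of chunked NumPy-style array storage with Zarr.
# Shape, chunk size, and path are illustrative assumptions.
import numpy as np
import zarr

# Create a 3-D array on disk, split into 64x64x64 chunks so that
# slices can be read or written without loading the whole volume.
volume = zarr.open(
    "example_volume.zarr",
    mode="w",
    shape=(256, 256, 256),
    chunks=(64, 64, 64),
    dtype="uint8",
)

# Writing touches only the chunks that overlap the written slice.
volume[0:64, 0:64, 0:64] = np.random.randint(
    0, 255, size=(64, 64, 64), dtype=np.uint8
)

# Reading a sub-volume likewise fetches only the needed chunks.
patch = volume[32:96, 32:96, 32:96]
print(patch.shape)  # (64, 64, 64)
```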
This paper is available on arXiv under a CC 4.0 license.