Authors:
(1) Sasun Hambardzumyan, Activeloop, Mountain View, CA, USA;
(2) Abhinav Tuli, Activeloop, Mountain View, CA, USA;
(3) Levon Ghukasyan, Activeloop, Mountain View, CA, USA;
(4) Fariz Rahman, Activeloop, Mountain View, CA, USA;
(5) Hrant Topchyan, Activeloop, Mountain View, CA, USA;
(6) David Isayan, Activeloop, Mountain View, CA, USA;
(7) Mark McQuade, Activeloop, Mountain View, CA, USA;
(8) Mikayel Harutyunyan, Activeloop, Mountain View, CA, USA;
(9) Tatevik Hakobyan, Activeloop, Mountain View, CA, USA;
(10) Ivo Stranic, Activeloop, Mountain View, CA, USA;
(11) Davit Buniatyan, Activeloop, Mountain View, CA, USA.
Multiple projects have tried to improve upon or create new formats for storing unstructured datasets, including TFRecord extending Protobuf [5], Petastorm [18] extending Parquet [79], Feather [7] extending Arrow [13], Squirrel using MessagePack [75], and Beton in FFCV [39]. Designing a universal dataset format that solves all use cases is very challenging. Our approach was mostly inspired by CloudVolume [11], a 4-D chunked NumPy storage for large volumetric biomedical data. There are other similar chunked NumPy array storage formats, such as Zarr [52], TensorStore [23], and TileDB [57]. Deep Lake introduced a typing system, dynamically shaped tensors, integration with fast deep learning streaming data loaders, queries on tensors, and in-browser visualization support.
An alternative approach to storing large-scale datasets is to use an HPC distributed file system such as Lustre [69], extended with a PyTorch cache [45], or a performant storage layer such as AIStore [26]. Deep Lake datasets can be stored on top of POSIX or REST API-compatible distributed storage systems while leveraging their benefits. Other comparable approaches evolve in vector databases [80, 8] for storing embeddings, feature stores [73, 16], and data version control systems such as DVC [46] or LakeFS [21]. In contrast, Deep Lake version control is built into the format without an external dependency such as Git. Tensor Query Language, similar in approach to TQP [41] and Velox [59], runs n-dimensional numeric operations on tensor storage by fully leveraging the capabilities of deep learning frameworks. Overall, Deep Lake draws parallels to data lakes such as Hudi, Iceberg, and Delta [27, 15, 10], and complements systems such as Databricks' Lakehouse [28] for deep learning applications.
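To make the chunked-array idea concrete, the sketch below stores a small volumetric array in fixed-size chunks using Zarr, one of the chunked NumPy storage formats named above. This is a minimal illustration of the general technique, not Deep Lake's own format or API; the file path, array shape, and chunk size are arbitrary assumptions.

```python
# Minimal sketch of chunked NumPy-style array storage with Zarr.
# Shape, chunk size, and path are illustrative assumptions.
import numpy as np
import zarr

# Create a 3-D array on disk, split into 64x64x64 chunks so that
# slices can be read or written without loading the whole volume.
volume = zarr.open(
    "example_volume.zarr",
    mode="w",
    shape=(256, 256, 256),
    chunks=(64, 64, 64),
    dtype="uint8",
)

# Writing touches only the chunks that overlap the written slice.
volume[0:64, 0:64, 0:64] = np.random.randint(
    0, 255, size=(64, 64, 64), dtype=np.uint8
)

# Reading a sub-volume likewise fetches only the needed chunks.
patch = volume[32:96, 32:96, 32:96]
print(patch.shape)  # (64, 64, 64)
```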
This paper is available on arXiv under a CC 4.0 license.