Authors:
(1) Sasun Hambardzumyan, Activeloop, Mountain View, CA, USA;
(2) Abhinav Tuli, Activeloop, Mountain View, CA, USA;
(3) Levon Ghukasyan, Activeloop, Mountain View, CA, USA;
(4) Fariz Rahman, Activeloop, Mountain View, CA, USA;
(5) Hrant Topchyan, Activeloop, Mountain View, CA, USA;
(6) David Isayan, Activeloop, Mountain View, CA, USA;
(7) Mark McQuade, Activeloop, Mountain View, CA, USA;
(8) Mikayel Harutyunyan, Activeloop, Mountain View, CA, USA;
(9) Tatevik Hakobyan, Activeloop, Mountain View, CA, USA;
(10) Ivo Stranic, Activeloop, Mountain View, CA, USA;
(11) Davit Buniatyan, Activeloop, Mountain View, CA, USA.

Table of Links
Abstract and Intro
Current Challenges
Tensor Storage Format
Deep Lake System Overview
Machine Learning Use Cases
Performance Benchmarks
Discussion and Limitations
Related Work
Conclusions, Acknowledgement, and References

2. CURRENT CHALLENGES

In this section, we discuss the current and historical challenges of managing unstructured or complex data.

2.1 Complex Data Types in a Database

It is generally not recommended to store binary data, such as images, directly in a database. Databases are not optimized for storing and serving large files, so doing so can cause performance issues and lead to slow load times for users. In addition, binary data does not fit well into a database's structured format, making it difficult to query and manipulate. Databases are also typically more expensive to operate and maintain than other types of storage, such as file systems or cloud storage services. Therefore, storing large amounts of binary data in a database can be more costly than other storage solutions.

2.2 Complex Data Along with Tabular Formats

Increases in large-scale analytical and BI workloads motivated the development of compressed structured formats like Parquet, ORC, Avro, or transient in-memory formats like Arrow [79, 6, 20, 13]. As tabular formats gained adoption, attempts to extend those formats for deep learning, such as Petastorm [18] or Feather [7], have emerged. To the best of our knowledge, these formats have yet to gain wide adoption. This approach primarily benefits from native integrations with the Modern Data Stack (MDS). However, as discussed previously, upstream tools require fundamental modifications to adapt to deep learning applications.

2.3 Object Storage for Deep Learning

The current cloud-native choice for storing large unstructured datasets is object storage such as AWS S3 [1], Google Cloud Storage (GCS) [3], or MinIO [17]. Object storage offers three main benefits over distributed network file systems: it is (a) cost-efficient, (b) scalable, and (c) a format-agnostic repository. However, cloud storage is not without drawbacks. Firstly, it introduces significant latency overhead, especially when iterating over many small files such as text or JSON. Next, unstructured data ingestion without metadata control can produce "data swamps". Furthermore, although object storage has built-in version control, it is rarely used in data science workflows. Lastly, data on object storage is copied to a virtual machine before training, resulting in storage overhead and additional costs.

2.4 Second Generation of Data Lakes

The second-generation data lakes led by Delta, Iceberg, and Hudi [27, 15, 10] extend object storage by managing tabular format files with the following primary properties.

(1) Update operations: inserting or deleting a row on top of a tabular format file.
(2) Streaming: downstream data ingestion with ACID properties and upstream integration with a query engine exposing a SQL interface.
(3) Schema evolution: evolving columnar structure while preserving backward compatibility.
(4) Time travel and audit log trailing: preserving historical state with a rollback property so that queries are reproducible, along with support for row-level control on data lineage.
(5) Layout optimization: built-in features to optimize file sizes and data compaction with custom ordering support, which significantly speeds up querying.

However, second-generation data lakes are still bound by the limitations of their underlying data formats when used for deep learning, as previously discussed in Section 2.2. Hence, in this paper we extend the second generation of data lake capabilities to deep learning use cases by rethinking the format and upstream features, including querying, visualization, and native integration with deep learning frameworks, to complete the ML lifecycle as shown in Fig. 2.

This paper is available on arXiv under CC 4.0 license.
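
As an illustration of properties (1) and (4) listed in Section 2.4, the following is a minimal sketch of how update operations and time travel can both be served by one append-only commit log. It is a toy model under stated assumptions: the class `VersionedTable` and its methods `upsert`, `delete`, and `snapshot` are hypothetical names chosen for this example and do not correspond to the API of Delta, Iceberg, Hudi, or Deep Lake. Real systems store commits as files on object storage rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]


@dataclass
class VersionedTable:
    """Toy in-memory table backed by an append-only commit log (hypothetical)."""

    # Each entry is (operation, row key, row payload or None for deletes).
    _log: List[Tuple[str, int, Optional[Row]]] = field(default_factory=list)

    def upsert(self, key: int, row: Row) -> int:
        """Insert or overwrite a row; returns the version created by this commit."""
        self._log.append(("upsert", key, dict(row)))
        return len(self._log)

    def delete(self, key: int) -> int:
        """Record a row deletion; returns the version created by this commit."""
        self._log.append(("delete", key, None))
        return len(self._log)

    def snapshot(self, version: Optional[int] = None) -> Dict[int, Row]:
        """Replay the first `version` commits; None means the latest state."""
        end = len(self._log) if version is None else version
        state: Dict[int, Row] = {}
        for op, key, row in self._log[:end]:
            if op == "upsert" and row is not None:
                state[key] = dict(row)
            else:
                state.pop(key, None)
        return state


table = VersionedTable()
v1 = table.upsert(1, {"label": "cat"})
v2 = table.upsert(2, {"label": "dog"})
v3 = table.delete(1)

print(table.snapshot(v2))  # state as of the second commit
print(table.snapshot())    # latest state, after the delete
```

Because commits are never rewritten, any earlier version can be reconstructed by replaying a prefix of the log, which is what makes a query against a fixed version reproducible. The cost of this design is that reads grow with log length, which is one reason production systems pair it with compaction.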

Deep Lake, a Lakehouse for Deep Learning: Current Challenges
