
Deep Lake, a Lakehouse for Deep Learning: Machine Learning Use Cases

Too Long; Didn't Read

Researchers introduce Deep Lake, an open-source lakehouse for deep learning, optimizing complex data storage and streaming for deep learning frameworks.

Authors:

(1) Sasun Hambardzumyan, Activeloop, Mountain View, CA, USA;

(2) Abhinav Tuli, Activeloop, Mountain View, CA, USA;

(3) Levon Ghukasyan, Activeloop, Mountain View, CA, USA;

(4) Fariz Rahman, Activeloop, Mountain View, CA, USA;

(5) Hrant Topchyan, Activeloop, Mountain View, CA, USA;

(6) David Isayan, Activeloop, Mountain View, CA, USA;

(7) Mark McQuade, Activeloop, Mountain View, CA, USA;

(8) Mikayel Harutyunyan, Activeloop, Mountain View, CA, USA;

(9) Tatevik Hakobyan, Activeloop, Mountain View, CA, USA;

(10) Ivo Stranic, Activeloop, Mountain View, CA, USA;

(11) Davit Buniatyan, Activeloop, Mountain View, CA, USA.

5. MACHINE LEARNING USE CASES

In this section, we review the applications of Deep Lake.


A typical scenario in a deep learning application starts with:


(1) A raw set of files collected in an object storage bucket. It might include images, videos, and other multimedia data in their native formats such as JPEG, PNG, or MP4.


(2) Associated metadata and labels stored in a relational database. Optionally, they could be stored in the same bucket alongside the raw data in a normalized tabular form such as CSV, JSON, or Parquet.


As shown in Fig. 4, an empty Deep Lake dataset is first created. Then, empty tensors are defined for storing both the raw data and the metadata. The number of tensors is arbitrary. A basic image classification task would have two tensors:


• images tensor with htype of image and sample compression of JPEG


• labels tensor with htype of class_label and chunk compression of LZ4.


After declaring the tensors, data can be appended to the dataset. If the compression of a raw image matches the tensor's sample compression, the binary is copied directly into a chunk without additional decoding. Label data is extracted from a SQL query or CSV table into categorical integers and appended to the labels tensor, whose chunks are stored using LZ4 compression. All Deep Lake data is stored in the bucket and is self-contained. After storage, the data can be accessed through a NumPy-like interface or as a streamable deep learning dataloader. The model running on a compute machine then iterates over the stream of image tensors and stores its output in a new tensor called predictions. Below, we discuss how one can train, version control, query, and inspect the quality of a Deep Lake dataset.
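As a concrete illustration, a minimal sketch of this workflow with the open-source Python package might look like the following; the bucket path, file names, and labels are placeholders, and exact method signatures may differ between Deep Lake releases.

```python
import deeplake

# Create an empty Deep Lake dataset directly on object storage
# (the bucket path is a placeholder).
ds = deeplake.empty("s3://my-bucket/animal-classification")

with ds:
    # Declare the two tensors from the image classification example.
    ds.create_tensor("images", htype="image", sample_compression="jpeg")
    ds.create_tensor("labels", htype="class_label", chunk_compression="lz4")

    # Append samples: JPEG bytes are copied into chunks without re-encoding,
    # labels are stored as categorical integers (file names are hypothetical).
    for path, label in [("cat_001.jpg", "cat"), ("dog_042.jpg", "dog")]:
        ds.append({"images": deeplake.read(path), "labels": label})

# Access the data through the NumPy-like interface...
first_image = ds.images[0].numpy()

# ...or stream it into a training loop as a deep learning dataloader.
train_loader = ds.pytorch(batch_size=32, shuffle=True)
```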

5.1 Deep Learning Model Training

Deep learning models are trained at multiple levels in an organization, ranging from exploratory training occurring on personal computers to large-scale training that occurs on distributed machines involving many GPUs. The time and effort required to bring the data from long-term storage to the training client are often comparable to the training itself. Deep Lake solves this problem by enabling rapid streaming of data without bottlenecking the downstream training process, thus avoiding the cost and time required to duplicate data on local storage.
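For instance, a training client can stream a remote Deep Lake dataset straight into a PyTorch loop without first copying it to local disk. The sketch below reuses the images/labels dataset from the example above; the dataloader options and the transform mapping are illustrative rather than prescribed.

```python
import deeplake
from torchvision import transforms

# Load the dataset lazily; samples are fetched on demand, not duplicated locally.
ds = deeplake.load("s3://my-bucket/animal-classification")

tform = transforms.Compose([
    transforms.ToTensor(),          # HWC uint8 -> CHW float in [0, 1]
    transforms.Resize((224, 224)),
])

# Stream batches over the network; decoding and transforms run in worker processes.
loader = ds.pytorch(
    batch_size=64,
    shuffle=True,
    num_workers=4,
    transform={"images": tform, "labels": None},
)

for batch in loader:
    images, labels = batch["images"], batch["labels"]
    # feed `images` and `labels` into the training step of any PyTorch model
```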

5.2 Data Lineage and Version Control

Deep learning data constantly evolves as new data is added and existing data is quality controlled. Analytical and training workloads occur in parallel while the data is changing. Hence, knowing which data version was used by a given workload is critical to understanding the relationship between the data and model performance. Deep Lake enables deep learning practitioners to understand which version of their data was used in any analytical workload and to time travel across these versions if an audit is required. Since all data is mutable, it can be edited to meet compliance-related privacy requirements. Like Git for code, Deep Lake also introduces the concept of data branches, allowing experimentation and editing of data without affecting colleagues' work.
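The version-control primitives follow a Git-like model of commits, branches, and checkouts. The following is a hypothetical session; the commit messages, branch name, and edited sample index are placeholders.

```python
import deeplake

ds = deeplake.load("s3://my-bucket/animal-classification")

# Record the current state of the data before a training run.
commit_id = ds.commit("Initial labeled snapshot")

# Experiment on a branch without affecting colleagues' work.
ds.checkout("relabeling-experiment", create=True)
ds.labels[14] = 1                      # correct a suspect label in place
ds.commit("Fix mislabeled sample 14")

# Time travel: return to the exact version used by an earlier workload.
ds.checkout(commit_id)

# Inspect the lineage of the dataset.
ds.log()
```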

5.3 Data Querying and Analytics

Training of deep learning models rarely occurs on all data collected by an organization for a particular application. Training datasets are often constructed by filtering the raw data based on conditions that increase model performance, which often means balancing the data, eliminating redundant samples, or selecting data that contains specific features. Deep Lake provides the tools to query and analyze data so that deep learning engineers can create datasets yielding the highest accuracy models.
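One way to express such filters is Deep Lake's SQL-like query language (TQL). The sketch below is illustrative rather than prescriptive: the query condition assumes string class names in the labels tensor, and running queries may require the package's optional query-engine extras.

```python
import deeplake

ds = deeplake.load("s3://my-bucket/animal-classification")

# Filter with a TQL query; the condition is illustrative and assumes
# the class_label tensor stores the class names "cat", "dog", etc.
cats = ds.query("SELECT * WHERE labels == 'cat'")

# The result is a view over the same underlying chunks, so it can be
# inspected or streamed into training like any other dataset.
print(len(cats))
loader = cats.pytorch(batch_size=32, shuffle=True)
```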

5.4 Data Inspection and Quality Control

Though unsupervised learning is becoming more applicable in real-world use cases, most deep learning applications still rely on supervised learning. Any supervised learning system is only as good as the quality of its data, which is often ensured by manual and exhaustive inspection. Since this process is time-consuming, it is critical to provide the humans in the loop with tools to examine vast amounts of data quickly. Deep Lake allows inspecting deep learning datasets of any size from the browser without setup time or the need to download data. Furthermore, the tools can be extended to compare model results with ground truth. Combined with querying and version control, this can be applied to the iterative improvement of data to achieve the best possible model.


This paper is available on arXiv under a CC 4.0 license.