Authors:
(1) Sasun Hambardzumyan, Activeloop, Mountain View, CA, USA;
(2) Abhinav Tuli, Activeloop, Mountain View, CA, USA;
(3) Levon Ghukasyan, Activeloop, Mountain View, CA, USA;
(4) Fariz Rahman, Activeloop, Mountain View, CA, USA;
(5) Hrant Topchyan, Activeloop, Mountain View, CA, USA;
(6) David Isayan, Activeloop, Mountain View, CA, USA;
(7) Mark McQuade, Activeloop, Mountain View, CA, USA;
(8) Mikayel Harutyunyan, Activeloop, Mountain View, CA, USA;
(9) Tatevik Hakobyan, Activeloop, Mountain View, CA, USA;
(10) Ivo Stranic, Activeloop, Mountain View, CA, USA;
(11) Davit Buniatyan, Activeloop, Mountain View, CA, USA.
In this section, we discuss the current and historical challenges of unstructured or complex data management.
It is generally not recommended to store binary data, such as images, directly in a database. Databases are not optimized for storing and serving large files, which can cause performance issues and slow load times for users. In addition, binary data does not fit well into a database’s structured format, making it difficult to query and manipulate. Databases are also typically more expensive to operate and maintain than other types of storage, such as file systems or cloud storage services, so storing large amounts of binary data in a database can cost more than the alternatives.
Increases in large-scale analytical and BI workloads motivated the development of compressed structured formats like Parquet, ORC, and Avro, and of transient in-memory formats like Arrow [79, 6, 20, 13]. As tabular formats gained adoption, attempts to extend them for deep learning, such as Petastorm [18] or Feather [7], emerged. To the best of our knowledge, these extensions have yet to gain wide adoption. This approach benefits primarily from native integration with the Modern Data Stack (MDS). However, as discussed previously, upstream tools require fundamental modifications to adapt to deep learning applications.
The current cloud-native choice for storing large unstructured datasets is object storage such as AWS S3 [1], Google Cloud Storage (GCS) [3], or MinIO [17]. Object storage offers three main benefits over distributed network file systems: it is (a) cost-efficient, (b) scalable, and (c) a format-agnostic repository. However, cloud object storage is not without drawbacks. First, it introduces significant latency overhead, especially when iterating over many small files such as text or JSON. Second, ingesting unstructured data without metadata control can produce "data swamps". Third, although object storage has built-in version control, it is rarely used in data science workflows. Lastly, data on object storage gets copied to a virtual machine before training, resulting in storage overhead and additional costs.
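The latency drawback can be made concrete with a back-of-envelope estimate. The sketch below is a simplified model, not a measurement: it assumes strictly sequential reads, and the per-request latency, bandwidth, sample size, and the idea of packing samples into larger 8 MB objects are all illustrative assumptions chosen for the arithmetic.

```python
def estimated_read_seconds(num_objects, object_bytes, latency_s, bandwidth_bytes_per_s):
    """Sequential estimate: one request round trip per object plus its transfer time."""
    return num_objects * (latency_s + object_bytes / bandwidth_bytes_per_s)


# Illustrative assumptions, not measured values for any real object store.
LATENCY_S = 0.05        # assumed fixed overhead per request
BANDWIDTH = 100e6       # assumed throughput in bytes per second
NUM_SAMPLES = 1_000_000
SAMPLE_BYTES = 1_000    # e.g. one small text or JSON record per object

# Case 1: every sample is stored as its own object.
small_files = estimated_read_seconds(NUM_SAMPLES, SAMPLE_BYTES, LATENCY_S, BANDWIDTH)

# Case 2: the same samples packed into larger objects (hypothetical 8 MB each).
CHUNK_BYTES = 8_000_000
samples_per_chunk = CHUNK_BYTES // SAMPLE_BYTES
num_chunks = -(-NUM_SAMPLES // samples_per_chunk)  # ceiling division
chunked = estimated_read_seconds(num_chunks, CHUNK_BYTES, LATENCY_S, BANDWIDTH)

print(f"one object per sample: {small_files:,.0f} s")
print(f"packed into chunks:    {chunked:,.2f} s")
```

Under these assumptions the fixed per-request overhead dominates the small-file case, while transfer time dominates once samples are packed together. Parallel requests would narrow the gap in practice, so the numbers indicate the direction of the effect rather than its exact size.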
The second-generation data lakes led by Delta, Iceberg, and Hudi [27, 15, 10] extend object storage by managing tabular format files with the following primary properties:
(1) Update operations: inserting or deleting a row on top of a tabular format file.
(2) Streaming: downstream data ingestion with ACID properties and upstream integration with a query engine exposing a SQL interface.
(3) Schema evolution: evolving columnar structure while preserving backward compatibility.
(4) Time travel and audit log trailing: preserving historical state with rollback so that queries are reproducible, plus support for row-level control over data lineage.
(5) Layout optimization: built-in features to optimize file sizes and compact data with custom ordering support, which significantly speeds up querying.
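Properties (1) and (4) can be illustrated with a toy append-only commit log. This is a minimal in-memory sketch and not how Delta, Iceberg, or Hudi are implemented; the `VersionedTable` class, its methods, and the `"upsert"`/`"delete"` operation names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedTable:
    """Toy append-only commit log; each commit is a list of (op, key, row) tuples."""

    commits: list = field(default_factory=list)

    def commit(self, operations):
        """Record a batch of operations as one commit and return its version number."""
        self.commits.append(list(operations))
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Rebuild the table by replaying commits up to and including `version`."""
        if version is None:
            version = len(self.commits) - 1  # default to the latest version
        rows = {}
        for operations in self.commits[: version + 1]:
            for op, key, row in operations:
                if op == "upsert":
                    rows[key] = row
                elif op == "delete":
                    rows.pop(key, None)
                else:
                    raise ValueError(f"unknown operation: {op}")
        return rows


table = VersionedTable()
v0 = table.commit([("upsert", 1, {"label": "cat"}), ("upsert", 2, {"label": "dog"})])
v1 = table.commit([("delete", 2, None), ("upsert", 3, {"label": "bird"})])

latest = table.snapshot()      # state after both commits
as_of_v0 = table.snapshot(v0)  # "time travel" back to the first commit
```

Because commits are never rewritten, a query pinned to `v0` sees the same rows after later updates, which is the reproducibility that time travel provides. The list of commits itself doubles as a simple audit log.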
However, second-generation data lakes are still bound by the limitations of their underlying data formats when used for deep learning, as previously discussed in section 2.2. Hence, in this paper, we extend second-generation data lake capabilities to deep learning use cases by rethinking the format and upstream features, including querying, visualization, and native integration with deep learning frameworks, to complete the ML lifecycle as shown in Fig. 2.
This paper is available on arXiv under a CC 4.0 license.