Authors:
(1) Sasun Hambardzumyan, Activeloop, Mountain View, CA, USA;
(2) Abhinav Tuli, Activeloop, Mountain View, CA, USA;
(3) Levon Ghukasyan, Activeloop, Mountain View, CA, USA;
(4) Fariz Rahman, Activeloop, Mountain View, CA, USA;
(5) Hrant Topchyan, Activeloop, Mountain View, CA, USA;
(6) David Isayan, Activeloop, Mountain View, CA, USA;
(7) Mark McQuade, Activeloop, Mountain View, CA, USA;
(8) Mikayel Harutyunyan, Activeloop, Mountain View, CA, USA;
(9) Tatevik Hakobyan, Activeloop, Mountain View, CA, USA;
(10) Ivo Stranic, Activeloop, Mountain View, CA, USA;
(11) Davit Buniatyan, Activeloop, Mountain View, CA, USA.
Conclusions:

We presented Deep Lake, the lakehouse for deep learning. Deep Lake is designed to help deep learning workflows run as seamlessly as analytical workflows run on the Modern Data Stack. Notably, Deep Lake retains prominent features of data lakes, such as time travel, querying, and rapid data ingestion at scale. One important distinction from traditional data lakes is Deep Lake’s ability to store unstructured data with all its metadata in a deep learning-native columnar format, which enables rapid data streaming. This allows materializing data subsets on the fly, visualizing them in-browser, or ingesting them into deep learning frameworks without sacrificing GPU utilization. Finally, we show via multiple benchmarks that Deep Lake achieves state-of-the-art performance for deep learning on large datasets.
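As an illustration of the streaming workflow summarized above, the following minimal sketch loads a Deep Lake dataset and feeds it into a PyTorch training loop. It assumes the open-source deeplake Python package and the public hub://activeloop/mnist-train dataset; exact method names and arguments may differ across package versions.

```python
# A minimal sketch of streaming a Deep Lake dataset into PyTorch.
# Assumes the open-source `deeplake` package (pip install deeplake);
# API details may vary between releases.
import deeplake
from torchvision import transforms

# Lazily open a public dataset stored in Deep Lake's columnar format;
# no tensor data is downloaded until it is actually read.
ds = deeplake.load("hub://activeloop/mnist-train")

# Build a PyTorch DataLoader that streams chunks over the network
# and decompresses them on the fly, keeping the GPU fed.
dataloader = ds.pytorch(
    batch_size=32,
    num_workers=2,
    transform={"images": transforms.ToTensor(), "labels": None},
    shuffle=True,
)

for batch in dataloader:
    images, labels = batch["images"], batch["labels"]
    # ... run the usual training step here ...
    break
```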
Acknowledgments:

The authors would like to thank Richard Socher, Travis Oliphant, Charu Rudrakshi, Artem Harutyunyan, Iason Ofeidis, Diego Kiedanski, Vishnu Nair, Fayaz Rahman, Dyllan McCreary, Benjamin Hindman, Eduard Grigoryan, Kristina Grigoryan, Ben Chislett, Joubin Houshyar, Andrii Liubimov, Assaf Pinhasi, Eshan Arora, Shashank Agarwal, Pawel Janowski, Kristina Arezina, Gevorg Karapetyan, Vigen Sahakyan, and the open-source community, including all contributors. The project was funded by Activeloop. We also thank the CIDR reviewers for their feedback.
References:

[1] 2006. Amazon S3. GitHub 2022, 1 (2006). https://aws.amazon.com/s3
[2] 2009. Clickhouse. GitHub 2022, 1 (2009). https://github.com/ClickHouse/ClickHouse
[3] 2010. Google Cloud Storage. GitHub 2022, 1 (2010). https://cloud.google.com/storage
[4] 2012. Google BigQuery. GitHub 2022, 1 (2012). https://cloud.google.com/bigquery
[5] 2014. Protocol Buffers - Google’s data interchange format. GitHub 2022, 1 (2014). https://github.com/protocolbuffers/protobuf
[6] 2015. The Apache Software Foundation: Apache ORC. GitHub 2022, 1 (2015). https://github.com/apache/orc
[7] 2016. Feather. GitHub 2022, 1 (2016). https://github.com/wesm/feather
[8] 2016. Weaviate: The ML-first vector search engine. GitHub 2022, 1 (2016). https://github.com/semi-technologies/weaviate
[9] 2017. Apache Airflow. GitHub 2022, 1 (2017). http://airflow.incubator.apache.org
[10] 2017. The Apache Software Foundation: Apache Hudi. GitHub 2022, 1 (2017). https://hudi.apache.org
[11] 2017. CloudVolume: IO for Neuroglancer Datasets. GitHub 2022, 1 (2017). https://github.com/seung-lab/cloud-volume
[12] 2018. Amazon Athena. GitHub 2022, 1 (2018). https://aws.amazon.com/athena
[13] 2018. The Apache Software Foundation: Apache Arrow. GitHub 2022, 1 (2018). https://arrow.apache.org
[14] 2018. The Apache Software Foundation: Apache Hadoop. GitHub 2022, 1 (2018). https://hadoop.apache.org
[15] 2018. The Apache Software Foundation: Apache Iceberg. GitHub 2022, 1 (2018). https://iceberg.apache.org
[16] 2018. Feast: open source feature store for machine learning. GitHub 2022, 1 (2018). https://github.com/feast-dev/feast
[17] 2018. MinIO high performance object storage server compatible with Amazon S3 API. GitHub 2022, 1 (2018). https://github.com/minio/minio
[18] 2018. Petastorm. GitHub 2022, 1 (2018). https://github.com/uber/petastorm
[19] 2018. The WebDataset Format. GitHub 2022, 1 (2018). https://github.com/webdataset/webdataset
[20] 2019. The Apache Software Foundation: Apache Avro. GitHub 2022, 1 (2019). https://avro.apache.org
[21] 2019. LakeFS: data lake with Git-like repository. GitHub 2022, 1 (2019). https://github.com/treeverse/lakeFS
[22] 2020. Airbyte. GitHub 2022, 1 (2020). https://github.com/ airbytehq/airbyte
[23] 2020. TensorStore: Library for reading and writing large multidimensional arrays. GitHub 2022, 1 (2020). https://github.com/google/tensorstore
[24] 2021. N5: specifies the primitive operations needed to store large chunked n-dimensional tensors, and arbitrary meta-data in a hierarchy of groups similar to HDF5. GitHub 2021, 1 (2021). https://github.com/saalfeldlab/n5
[25] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
[26] Alex Aizman, Gavin Maltby, and Thomas Breuel. 2019. High performance I/O for large scale deep learning. In 2019 IEEE International Conference on Big Data (Big Data). IEEE, 5965– 5967.
[27] Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja Łuszczak, et al. 2020. Delta lake: high-performance ACID table storage over cloud object stores. Proceedings of the VLDB Endowment 13, 12 (2020), 3411–3424.
[28] Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. 2021. Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics. In Proceedings of CIDR.
[29] Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. 2022. Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555 (2022).
[30] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[31] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs. http://github.com/google/jax
[32] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
[33] Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, et al. 2016. The snowflake elastic data warehouse. In Proceedings of the 2016 International Conference on Management of Data. 215– 226.
[34] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 248–255.
[35] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
[36] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[37] Markus Dreseler, Jan Kossmann, Martin Boissier, Stefan Klauck, Matthias Uflacker, and Hasso Plattner. 2019. Hyrise Re-engineered: An Extensible Database System for Research in Relational In-Memory Data Management. In Advances in Database Technology - 22nd International Conference on Extending Database Technology, EDBT 2019, Lisbon, Portugal, March 26-29, 2019, Melanie Herschel, Helena Galhardas, Berthold Reinwald, Irini Fundulaki, Carsten Binnig, and Zoi Kaoudi (Eds.). OpenProceedings.org, 313–324. https://doi.org/10.5441/002/edbt.2019.28
[38] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT press.
[39] Guillaume Leclerc, Logan Engstrom, Andrew Ilyas, Sam Park, and Hadi Salman. 2021. FFCV: Fast Forward Computer Vision. GitHub 2022, 1 (2021). https://github.com/libffcv/ffcv
[40] Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon redshift and the case for simpler data warehouses. In Proceedings of the 2015 ACM SIGMOD international conference on management of data. 1917–1923.
[41] Dong He, Supun Nakandala, Dalitso Banda, Rathijit Sen, Karla Saur, Kwanghyun Park, Carlo Curino, Jesús Camacho-Rodríguez, Konstantinos Karanasos, and Matteo Interlandi. 2022. Query Processing on Tensor Computation Runtimes. arXiv preprint arXiv:2203.01877 (2022).
[42] Yu Huang and Yue Chen. 2020. Survey of state-of-art autonomous driving technologies with deep learning. In 2020 IEEE 20th International Conference on Software Quality, Reliability and Security Companion (QRS-C). IEEE, 221–228.
[43] Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4401–4410.
[44] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
[45] Abhishek Vijaya Kumar and Muthian Sivathanu. 2020. Quiver: An informed storage cache for deep learning. In 18th USENIX Conference on File and Storage Technologies (FAST 20). 283–296.
[46] Ruslan Kuprieiev, skshetry, Dmitry Petrov, Paweł Redzyński, Peter Rowlands, Casper da Costa-Luis, Alexander Schepanovski, Ivan Shcheklein, Batuhan Taskaya, Gao, Jorge Orpinel, David de la Iglesia Castro, Fábio Santos, Aman Sharma, Dave Berenbaum, Zhanibek, Dani Hodovic, Nikita Kodenko, Andrew Grigorev, Earl, daniele, Nabanita Dash, George Vyshnya, maykulkarni, Max Hora, Vera, Sanidhya Mangal, and Wojciech Baranowski. 2022. DVC: Data Version Control - Git for Data & Models. https://doi.org/10.5281/zenodo.7039863
[47] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[48] Kisuk Lee, Jonathan Zung, Peter Li, Viren Jain, and H Sebastian Seung. 2017. Superhuman accuracy on the SNEMI3D connectomics challenge. arXiv preprint arXiv:1706.00120 (2017).
[49] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European conference on computer vision. Springer, 740–755.
[50] Frank Sifei Luan, Stephanie Wang, Samyukta Yagati, Sean Kim, Kenneth Lien, SangBin Cho, Eric Liang, and Ion Stoica. 2022. Exoshuffle: Large-Scale Shuffle at the Application Level. arXiv preprint arXiv:2203.05072 (2022).
[51] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[52] Alistair Miles, John Kirkham, Martin Durant, James Bourbeau, Tarik Onalan, Joe Hamman, Zain Patel, shikharsg, Matthew Rocklin, raphael dussin, Vincent Schut, Elliott Sales de Andrade, Ryan Abernathey, Charles Noyes, sbalmer, pyup.io bot, Tommy Tran, Stephan Saalfeld, Justin Swaney, Josh Moore, Joe Jevnik, Jerome Kelleher, Jan Funke, George Sakkis, Chris Barnes, and Anderson Banihirwe. 2020. zarr-developers/zarr-python: v2.4.0. https://doi.org/10.5281/zenodo.3773450
[53] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. 2018. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 561–577.
[54] Iason Ofeidis, Diego Kiedanski, and Leandros Tassiulas. 2022. An Overview of the Data-Loader Landscape: Comparative Performance Analysis. arXiv preprint arXiv:2209.13705 (2022).
[55] Travis E Oliphant. 2006. A guide to NumPy. Vol. 1. Trelgol Publishing USA.
[56] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).
[57] Stavros Papadopoulos, Kushal Datta, Samuel Madden, and Timothy Mattson. 2016. The TileDB array data storage manager. Proceedings of the VLDB Endowment 10, 4 (2016), 349–360.
[58] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
[59] Pedro Pedreira, Orri Erling, Masha Basmanova, Kevin Wilfong, Laith Sakka, Krishna Pai, Wei He, and Biswapesh Chattopadhyay. 2022. Velox: Meta’s Unified Execution Engine. Proceedings of the VLDB Endowment (2022).
[60] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
[61] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. 2017. CheXNet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225 (2017).
[62] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning. PMLR, 8821–8831.
[63] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 779–788.
[64] Amit Sabne. 2020. XLA: Compiling machine learning for peak performance. (2020).
[65] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv preprint arXiv:2205.11487 (2022).
[66] Bart Samwel, Tom van Bussel, Herman van Hovell, Maryann Xue, Reynold Xin, and Matei Zaharia. 2022. Photon: A Fast Query Engine for Lakehouse Systems. (2022).
[67] Christoph Schuhmann, Romain Beaumont, Cade W Gordon, Ross Wightman, Theo Coombes, Aarush Katta, Clayton Mullis, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, et al. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. (2022).
[68] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021).
[69] Philip Schwan et al. 2003. Lustre: Building a file system for 1000-node clusters. In Proceedings of the 2003 Linux symposium, Vol. 2003. 380–386.
[70] Raghav Sethi, Martin Traverso, Dain Sundstrom, David Phillips, Wenlei Xie, Yutian Sun, Nezih Yegitbasi, Haozhun Jin, Eric Hwang, Nileema Shingte, et al. 2019. Presto: SQL on everything. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 1802–1813.
[71] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, et al. 2010. The Hadoop distributed file system. In MSST, Vol. 10. 1–10.
[72] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 6419 (2018), 1140–1144.
[73] K Stumpf, S Bedratiuk, and O Cirit. 2018. Michelangelo PyML: introducing Uber’s platform for rapid python ML model development. Uber. See: https://eng.uber.com/michelangelo-pyml (2018).
[74] Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache.
[75] Squirrel Developer Team. 2022. Squirrel: A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way. GitHub. Note: https://github.com/merantix-momentum/squirrel-core (2022). https://doi.org/10.5281/zenodo.6418280
[76] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. 2010. Hive - a petabyte scale data warehouse using Hadoop. In 2010 IEEE 26th international conference on data engineering (ICDE 2010). IEEE, 996–1005.
[77] Kathryn Tunyasuvunakool, Jonas Adler, Zachary Wu, Tim Green, Michal Zielinski, Augustin Žídek, Alex Bridgland, Andrew Cowie, Clemens Meyer, Agata Laydon, et al. 2021. Highly accurate protein structure prediction for the human proteome. Nature 596, 7873 (2021), 590–596.
[78] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[79] Deepak Vohra. 2016. Apache parquet. In Practical Hadoop Ecosystem. Springer, 325–335.
[80] Jianguo Wang, Xiaomeng Yi, Rentong Guo, Hai Jin, Peng Xu, Shengjun Li, Xiangyu Wang, Xiangzhou Guo, Chengming Li, Xiaohai Xu, et al. 2021. Milvus: A Purpose-Built Vector Data Management System. In Proceedings of the 2021 International Conference on Management of Data. 2614–2627.
[81] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems 32 (2019).
[82] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. HotCloud 10, 10-10 (2010), 95.
[83] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems. 649–657.
This paper is available on arXiv under the CC 4.0 license.