The modern data landscape demands a new breed of infrastructure: one that seamlessly integrates structured and unstructured data, scales effortlessly, and powers efficient AI/ML workloads. This is where modern datalakes come in, providing a central hub for all your data needs. However, building and managing an effective data lake can be complex.

This blog post dives deep into three powerful tools that can optimize your current approach: Apache Iceberg, Tabular, and MinIO. The steps below walk you through how these services combine to create a robust, cloud-native data lake architecture specifically optimized for AI/ML workloads.

What is Tabular?

Tabular is a data platform created by the original creators of Apache Iceberg. It is designed to provide an independent, universal storage platform that connects to any compute layer, eliminating data vendor lock-in. This capability is critical to the modern data stack: it allows users to pick best-in-class compute and storage tools without being forced into a particular vendor's aged or mismatched tool set.

An architecture of MinIO and Iceberg can be enhanced by Tabular. Tabular can be used to manage and query Iceberg data stored in MinIO, allowing for the storage and management of structured data in a scalable, high-performance, cloud-native manner. These Kubernetes-native components work together smoothly, with very little friction, and build on each other's capabilities to perform at scale.

Why S3FileIO instead of Hadoop's file-io?

This implementation leverages Iceberg's S3FileIO. S3FileIO is considered better than Hadoop's file-io for several reasons, some of which we've already discussed elsewhere:

- Optimized for Cloud Storage: Iceberg's S3FileIO is designed to work with cloud-native storage.
- Improved Throughput and Minimized Throttling: Iceberg uses an ObjectStoreLocationProvider to distribute files across multiple prefixes in a MinIO bucket, which helps to minimize throttling and maximize throughput for S3-related IO operations.
- Strict Consistency: Iceberg has been updated to fully leverage strict consistency by eliminating redundant consistency checks that could impact performance.
- Progressive Multipart Upload: Iceberg's S3FileIO implements a progressive multipart upload algorithm, which uploads data file parts in parallel as soon as each part is ready, reducing local disk usage and increasing upload speed.
- Checksum Verification: Iceberg allows for checksum validations on S3 API writes to ensure the integrity of uploaded objects, which can be enabled by setting the appropriate catalog property.
- Custom Tags: Iceberg supports adding custom tags to objects during write and delete operations with the S3 API, which can be useful for cost tracking and management.
- Avoidance of Negative Caching: The FileIO interface in Iceberg does not require as strict guarantees as a Hadoop-compatible FileSystem, which allows it to avoid negative caching that could otherwise degrade performance.

In contrast, Hadoop's S3A FileSystem, which was used before S3FileIO, does not offer the same level of optimization for cloud storage. All this to say: don't hobble your future-facing data lake infrastructure with the trappings of the past.

Prerequisites

Before you begin, make sure your system meets the following requirements:

- Docker
- Docker Compose

If you're starting from scratch, you can install both using the Docker Desktop installer for your specific platform, which is often easier than downloading Docker and Docker Compose separately. Verify that Docker Compose is installed by running the following command:

```
docker-compose --version
```

Getting started

To get started, clone or copy the YAML file in Tabular's git repository. You just need the YAML for this tutorial. Feel free to explore the rest of the repository at a later time.

Breaking it Down

The provided YAML file is a Docker Compose configuration file.
It defines a set of services and their configurations for a multi-container Docker application. In this case, there are four services: spark-iceberg, an Iceberg REST catalog (rest), minio, and the MinIO client (mc). Let's break down each section:

1. Spark-Iceberg Service:

```yaml
spark-iceberg:
  image: tabulario/spark-iceberg
  container_name: spark-iceberg
  build: spark/
  networks:
    iceberg_net:
  depends_on:
    - rest
    - minio
  volumes:
    - ./warehouse:/home/iceberg/warehouse
    - ./notebooks:/home/iceberg/notebooks/notebooks
  environment:
    - AWS_ACCESS_KEY_ID=admin
    - AWS_SECRET_ACCESS_KEY=password
    - AWS_REGION=us-east-1
  ports:
    - 8888:8888
    - 8080:8080
    - 10000:10000
    - 10001:10001

rest:
  image: tabulario/iceberg-rest
  container_name: iceberg-rest
  networks:
    iceberg_net:
  ports:
    - 8181:8181
  environment:
    - AWS_ACCESS_KEY_ID=admin
    - AWS_SECRET_ACCESS_KEY=password
    - AWS_REGION=us-east-1
    - CATALOG_WAREHOUSE=s3://warehouse/
    - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
    - CATALOG_S3_ENDPOINT=http://minio:9000
```

- image: Specifies the Docker image to use for the spark-iceberg service. In this case, it uses the tabulario/spark-iceberg image.
- depends_on: Specifies that the spark-iceberg service depends on the rest and minio services.
- container_name: Assigns a specific name (spark-iceberg) to the container.
- environment: Sets environment variables for the container, including Spark and AWS credentials.
- volumes: Mounts local directories (./warehouse and ./notebooks) as volumes inside the container.
- ports: Maps container ports to host ports for accessing the Spark UI and other services.

2. MinIO Service:

```yaml
minio:
  image: minio/minio
  container_name: minio
  environment:
    - MINIO_ROOT_USER=admin
    - MINIO_ROOT_PASSWORD=password
    - MINIO_DOMAIN=minio
  networks:
    iceberg_net:
      aliases:
        - warehouse.minio
  ports:
    - 9001:9001
    - 9000:9000
  command: ["server", "/data", "--console-address", ":9001"]
```

- image: Specifies the Docker image for the MinIO service.
- container_name: Assigns a specific name (minio) to the container.
- environment: Sets environment variables for configuring MinIO, including root user credentials.
- ports: Maps container ports to host ports for accessing the MinIO UI.
- command: Specifies the command to start the MinIO server with specific parameters.

Another aspect of the MinIO service is mc, MinIO's command line tool.

```yaml
mc:
  depends_on:
    - minio
  image: minio/mc
  container_name: mc
  networks:
    iceberg_net:
  environment:
    - AWS_ACCESS_KEY_ID=admin
    - AWS_SECRET_ACCESS_KEY=password
    - AWS_REGION=us-east-1
  entrypoint: >
    /bin/sh -c "
    until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
    /usr/bin/mc rm -r --force minio/warehouse;
    /usr/bin/mc mb minio/warehouse;
    /usr/bin/mc policy set public minio/warehouse;
    tail -f /dev/null
    "
```

- depends_on: Specifies that the mc service depends on the minio service.
- image: Specifies the Docker image for the mc service.
- container_name: Assigns a specific name (mc) to the container.
- environment: Sets environment variables for configuring the MinIO client.
- entrypoint: Defines the entry point command for the container, including setup steps for the MinIO client.

The entrypoint script essentially performs the following tasks:

1. Waits until the MinIO server accepts connections, then registers it with mc under the alias minio.
2. Removes any existing warehouse bucket and its contents from the MinIO server.
3. Creates a new bucket named warehouse.
4. Sets the access policy of the warehouse bucket to public.
5. Tails /dev/null to keep the container running.

This Docker Compose file orchestrates a multi-container environment with services for Spark, an Iceberg REST catalog, and MinIO. It sets up the dependencies, environment variables, and commands necessary for running the services together. The services work in tandem to create a development environment for data processing using Spark and Iceberg, with MinIO as the object storage backend.
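To see what a round trip through this stack looks like, the snippet below is a minimal Spark SQL sketch you could run in the bundled notebook or spark-sql shell once the containers are up. The catalog and namespace names (demo, nyc) are assumptions borrowed from common Iceberg quickstart examples; check the image's spark-defaults.conf for the catalog name actually configured in your setup.

```sql
-- Minimal sketch: create, load, and read an Iceberg table backed by MinIO.
-- "demo" is an assumed catalog name pointing at the REST catalog service;
-- verify it against your Spark configuration before running.
CREATE TABLE IF NOT EXISTS demo.nyc.taxis (
    vendor_id   BIGINT,
    trip_id     BIGINT,
    fare_amount DOUBLE
) USING iceberg;

INSERT INTO demo.nyc.taxis VALUES (1, 1000371, 15.32), (2, 1000372, 7.80);

SELECT * FROM demo.nyc.taxis;
```

After a successful INSERT, the table's data and metadata files appear under the warehouse bucket in the MinIO console, which is a quick way to confirm that S3FileIO is writing where you expect.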
Starting Up

In a terminal window, cd into the tabular-spark-setup directory in the repository and run the following command:

```
docker-compose up
```

Log in to MinIO at http://127.0.0.1:9001 with the credentials admin:password to see that the warehouse bucket has been created.

Once all the containers are up and running, you can access your Jupyter Notebook server by navigating to http://localhost:8888.

Run one of the sample notebooks and return to MinIO at http://127.0.0.1:9001 to see your warehouse bucket populated with data.

Building Your Modern Datalake

This tutorial on building a modern datalake with Iceberg, Tabular, and MinIO is just the beginning. This powerful trio opens doors to a world of possibilities. With these tools, you can seamlessly integrate and analyze all of your data, structured and unstructured, to uncover hidden patterns and drive the data-driven decisions that fuel innovation. Leverage the efficiency and flexibility of this architecture in production to expedite your AI/ML initiatives and unlock the true potential of your machine learning models, accelerating your path to groundbreaking discoveries. Reach out to us at hello@min.io or on our Slack channel if you have any questions as you build.