The modern data landscape demands a new breed of infrastructure – one that seamlessly integrates structured and unstructured data, scales effortlessly, and empowers efficient AI/ML workloads. This is where Tabular, Apache Iceberg, and MinIO come in.
This blog post dives deep into three powerful tools that can optimize your current approach: Tabular, Apache Iceberg, and MinIO.
Tabular is a data platform created by the original creators of Apache Iceberg. It is designed to provide an independent, universal storage platform that connects to any compute layer, eliminating data vendor lock-in. This is critical to the modern data stack: it allows users to pick best-in-class compute and storage tools without being forced into a particular vendor's aging or mismatched tool set.
This implementation leverages Iceberg's S3FileIO, which is considered better than Hadoop's file I/O for several reasons, some of which we have already discussed:
Optimized for Cloud Storage: Iceberg's S3FileIO is designed specifically for cloud-native object storage.
Improved Throughput and Minimized Throttling: Iceberg uses an ObjectStoreLocationProvider to distribute data files across multiple prefixes in the bucket, which helps minimize throttling and maximize throughput for S3-related I/O operations.
Strict Consistency: Iceberg is designed to take full advantage of the strict consistency guarantees of modern object stores without sacrificing performance.
Progressive Multipart Upload: Iceberg's S3FileIO implements a progressive multipart upload algorithm that uploads parts of a data file concurrently as soon as each part is ready, reducing local disk usage and increasing upload speed.
Checksum Verification: Iceberg allows for checksum validation on writes to ensure the integrity of uploaded objects.
Custom Tags: Iceberg supports adding custom tags to objects when they are written to and deleted from S3, which is useful for cost tracking and lifecycle management.
Avoidance of Negative Caching: The FileIO interface in Iceberg does not require as strict guarantees as a Hadoop-compatible FileSystem, which allows it to avoid negative caching that would otherwise slow down reads.
In contrast, Hadoop's S3A FileSystem, which was used before S3FileIO, does not offer the same level of optimization for cloud storage. All this to say: Don’t hobble your future-facing data lake infrastructure with the trappings of the past.
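To make this concrete, here is a minimal sketch of how S3FileIO can be selected through Iceberg catalog properties on the Spark side. This is not part of the tutorial itself: it assumes the Iceberg Spark runtime and AWS bundle jars are on the classpath (the tabulario/spark-iceberg image used below ships them), and the catalog name demo, endpoint, and warehouse path are illustrative choices that mirror the Docker Compose file described later.

from pyspark.sql import SparkSession

# Sketch only: catalog name, endpoint, and warehouse path are assumptions that
# mirror the Compose file in this tutorial; adjust them to your environment.
spark = (
    SparkSession.builder.appName("s3fileio-sketch")
    # Register an Iceberg catalog named "demo" backed by the REST catalog service
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.catalog-impl", "org.apache.iceberg.rest.RESTCatalog")
    .config("spark.sql.catalog.demo.uri", "http://rest:8181")
    # Tell Iceberg to read and write data files through S3FileIO, pointed at MinIO
    .config("spark.sql.catalog.demo.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.demo.warehouse", "s3://warehouse/")
    .config("spark.sql.catalog.demo.s3.endpoint", "http://minio:9000")
    .getOrCreate()
)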
Before you begin, make sure your system has Docker and Docker Compose installed. If you're starting from scratch, you can install both by following Docker's official installation documentation, then verify the installation:
docker-compose --version
To get started, clone or copy the YAML file from Tabular's git repository.
The provided YAML file is a Docker Compose configuration. It defines a set of services and their configurations for a multi-container Docker application. In this case, there are two main services, Spark-Iceberg and MinIO, each with a companion container: an Iceberg REST catalog alongside Spark-Iceberg, and the mc client alongside MinIO. Let's break down each section:
1. Spark-Iceberg Service:
spark-iceberg:
  image: tabulario/spark-iceberg
  container_name: spark-iceberg
  build: spark/
  networks:
    iceberg_net:
  depends_on:
    - rest
    - minio
  volumes:
    - ./warehouse:/home/iceberg/warehouse
    - ./notebooks:/home/iceberg/notebooks/notebooks
  environment:
    - AWS_ACCESS_KEY_ID=admin
    - AWS_SECRET_ACCESS_KEY=password
    - AWS_REGION=us-east-1
  ports:
    - 8888:8888
    - 8080:8080
    - 10000:10000
    - 10001:10001
rest:
  image: tabulario/iceberg-rest
  container_name: iceberg-rest
  networks:
    iceberg_net:
  ports:
    - 8181:8181
  environment:
    - AWS_ACCESS_KEY_ID=admin
    - AWS_SECRET_ACCESS_KEY=password
    - AWS_REGION=us-east-1
    - CATALOG_WAREHOUSE=s3://warehouse/
    - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
    - CATALOG_S3_ENDPOINT=http://minio:9000
image: Specifies the Docker image to use for the spark-iceberg service. In this case, it uses the tabulario/spark-iceberg:latest image.
depends_on: Specifies that the spark-iceberg service depends on the rest and minio services.
container_name: Assigns a specific name (spark-iceberg) to the container.
environment: Sets environment variables for the container, in this case the AWS credentials and region used to access MinIO.
volumes: Mounts local directories (./warehouse and ./notebooks) as volumes inside the container.
ports: Maps container ports to host ports for accessing Spark UI and other services.
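The companion rest service runs the Iceberg REST catalog; its CATALOG_* environment variables point the catalog at the s3://warehouse/ location, select S3FileIO as the FileIO implementation, and direct S3 traffic to the MinIO endpoint. As a quick sanity check once the stack is up, you can hit the catalog's config endpoint from the host. This is an optional, hypothetical smoke test, assuming the 8181 port mapping above and Python 3 on your machine:

import json
import urllib.request

# The Iceberg REST catalog spec exposes GET /v1/config; port 8181 is mapped to the host.
with urllib.request.urlopen("http://localhost:8181/v1/config") as resp:
    print(json.dumps(json.load(resp), indent=2))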
2. MinIO Service:
minio:
  image: minio/minio
  container_name: minio
  environment:
    - MINIO_ROOT_USER=admin
    - MINIO_ROOT_PASSWORD=password
    - MINIO_DOMAIN=minio
  networks:
    iceberg_net:
      aliases:
        - warehouse.minio
  ports:
    - 9001:9001
    - 9000:9000
  command: ["server", "/data", "--console-address", ":9001"]
image: Specifies the Docker image for the MinIO service.
container_name: Assigns a specific name (minio) to the container.
environment: Sets environment variables for configuring MinIO, including root user credentials.
ports: Maps container ports to host ports for the MinIO console (9001) and S3 API (9000).
command: Specifies the command to start the MinIO server with specific parameters.
Another aspect of the MinIO service is the companion mc (MinIO Client) container, which initializes the object store:
mc:
  depends_on:
    - minio
  image: minio/mc
  container_name: mc
  networks:
    iceberg_net:
  environment:
    - AWS_ACCESS_KEY_ID=admin
    - AWS_SECRET_ACCESS_KEY=password
    - AWS_REGION=us-east-1
  entrypoint: >
    /bin/sh -c "
    until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
    /usr/bin/mc rm -r --force minio/warehouse;
    /usr/bin/mc mb minio/warehouse;
    /usr/bin/mc policy set public minio/warehouse;
    tail -f /dev/null
    "
depends_on: Specifies that the mc service depends on the MinIO service.
image: Specifies the Docker image for the mc service.
container_name: Assigns a specific name (mc) to the container.
environment: Sets environment variables for configuring the MinIO client.
entrypoint: Defines the command the container runs on startup; it waits until MinIO is reachable, registers it as a host alias, and then prepares the warehouse bucket:
/usr/bin/mc rm -r --force minio/warehouse;
/usr/bin/mc mb minio/warehouse;
/usr/bin/mc policy set public minio/warehouse;
tail -f /dev/null
This sequence of commands performs the following tasks:
Removes any existing warehouse bucket and its contents from the MinIO server.
Creates a new bucket named warehouse.
Sets the access policy of the warehouse bucket to public.
The final tail -f /dev/null simply keeps the container running after setup completes.
This Docker Compose file orchestrates a multi-container environment with services for Spark, the Iceberg REST catalog, and MinIO. It sets up the dependencies, environment variables, and commands necessary for running the services together. The services work in tandem to create a development environment for data processing using Spark and Iceberg, with MinIO as the object storage backend.
In a terminal window, cd into the tabular-spark-setup directory in the repository and run the following command:
docker-compose up
Log in to MinIO at http://127.0.0.1:9001 with the credentials admin:password to see that the warehouse bucket has been created.
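If you prefer to check from code rather than the console, a short script using the MinIO Python SDK can list the buckets. This is an optional extra, assuming you have installed the minio package (pip install minio) and the stack is running with the credentials from the Compose file:

from minio import Minio

# Endpoint and credentials mirror the Compose file; secure=False because MinIO runs over plain HTTP here.
client = Minio("127.0.0.1:9000", access_key="admin", secret_key="password", secure=False)
print([bucket.name for bucket in client.list_buckets()])  # expect "warehouse" in the list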
Once all the containers are up and running, you can access your Jupyter Notebook server by navigating to http://localhost:8888. Run one of the sample notebooks, then return to MinIO at http://127.0.0.1:9001 to see your warehouse bucket populated with data.
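If you would rather start from a blank notebook, the following sketch creates and populates a small Iceberg table. It assumes the image's Spark session exposes an Iceberg catalog named demo, as the stock notebooks do; adjust the catalog and namespace names if yours differ.

from pyspark.sql import SparkSession

# Inside the Jupyter container this picks up the preconfigured session and its Iceberg catalog.
spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.nyc")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.nyc.taxis (
        vendor_id BIGINT,
        trip_distance DOUBLE
    ) USING iceberg
""")
spark.sql("INSERT INTO demo.nyc.taxis VALUES (1, 1.8), (2, 7.4)")
spark.sql("SELECT * FROM demo.nyc.taxis").show()

After running it, the table's data and metadata files should appear under the warehouse bucket in MinIO.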
This tutorial on building a modern data lake with Iceberg, Tabular, and MinIO is just the beginning. This powerful trio opens doors to a world of possibilities. With these tools, you can seamlessly integrate and analyze all your data, structured and unstructured, to uncover hidden patterns and drive data-driven decisions that fuel innovation. Leverage the efficiency and flexibility of this architecture in production to expedite your AI/ML initiatives and unlock the true potential of your machine learning models, accelerating your path to groundbreaking discoveries. Reach out to us with any questions as you build.