In this post, we will use MinIO Bucket Notifications and Apache Tika for document text extraction, which is at the heart of critical downstream tasks like Large Language Model (LLM) training and Retrieval Augmented Generation (RAG).
Let’s say that I want to construct a dataset of text that I can then use to fine-tune a large language model.
The simplest way to get Apache Tika up and running is with the official Apache Tika Docker image.
In this example, I allow it to use and expose the default port, 9998.
docker pull apache/tika:<version>
docker run -d -p 127.0.0.1:9998:9998 apache/tika:<version>
Use docker ps or Docker Desktop to verify that the Tika container is running and has exposed port 9998.
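You can also confirm that Tika itself is responding by sending a GET request to its /tika endpoint (assuming the default port mapping above), which returns a short greeting:

curl http://localhost:9998/tika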
Now that Tika is running, we need to construct a server that can programmatically make Tika extraction requests for new objects. Following that, we need to configure webhooks on a MinIO bucket to alert this server about the arrival of new objects (in other words, PUT events for a bucket). Let’s walk through it step-by-step.
To keep things relatively simple and to highlight the portability of this approach, the text extraction server will be built in Python, using the popular Flask framework. Here’s the code for the server (also available in the MinIO Blog Resources repository):
"""
This is a simple Flask text extraction server that functions as a webhook service endpoint
for PUT events in a MinIO bucket. Apache Tika is used to extract the text from the new objects.
"""
from flask import Flask, request, abort, make_response
import io
import logging
from tika import parser
from minio import Minio
# Make sure the following are populated with your MinIO details
# (Best practice is to use environment variables!)
MINIO_ENDPOINT = ''
MINIO_ACCESS_KEY = ''
MINIO_SECRET_KEY = ''
# This depends on how you are deploying Tika (and this server):
TIKA_SERVER_URL = 'http://localhost:9998/tika'
client = Minio(
MINIO_ENDPOINT,
access_key=MINIO_ACCESS_KEY,
secret_key=MINIO_SECRET_KEY,
)
logger = logging.getLogger(__name__)
app = Flask(__name__)
@app.route('/', methods=['POST'])
async def text_extraction_webhook():
"""
This endpoint will be called when a new object is placed in the bucket
"""
if request.method == 'POST':
# Get the request event from the 'POST' call
event = request.json
bucket = event['Records'][0]['s3']['bucket']['name']
obj_name = event['Records'][0]['s3']['object']['key']
obj_response = client.get_object(bucket, obj_name)
obj_bytes = obj_response.read()
file_like = io.BytesIO(obj_bytes)
parsed_file = parser.from_buffer(file_like.read(), serverEndpoint=TIKA_SERVER_URL)
text = parsed_file["content"]
metadata = parsed_file["metadata"]
logger.info(text)
result = {
"text": text,
"metadata": metadata
}
resp = make_response(result, 200)
return resp
else:
abort(400)
if __name__ == '__main__':
app.run()
Let’s start the extraction server:
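Assuming the server code above is saved as extraction_server.py (a filename chosen here for illustration), it can be started with:

python extraction_server.py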
Make note of the hostname and port that the Flask application is running on.
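By default, Flask’s development server listens on http://127.0.0.1:5000. If you want to sanity-check the endpoint before wiring up MinIO, here is a minimal sketch (not part of the original setup) that POSTs a hand-built event payload in the same shape the handler expects; it assumes the server is running at the default address and that the bucket ‘extraction’ already contains an object named sample.pdf:

# smoke_test.py -- hypothetical local test for the extraction webhook.
import requests

# A minimal event payload mimicking the bucket notification structure
event = {
    "Records": [
        {"s3": {"bucket": {"name": "extraction"},
                "object": {"key": "sample.pdf"}}}
    ]
}

resp = requests.post("http://127.0.0.1:5000/", json=event)
resp.raise_for_status()
# Print the first 500 characters of the extracted text
print((resp.json()["text"] or "")[:500])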
Now, all that’s left is to configure the webhook for the bucket on the MinIO server so that any PUT events (a.k.a., new objects added) in the bucket will trigger a call to the extraction endpoint. With the mc tool, we can do this in just a few commands.
First, we need to set a few environment variables to signal to your MinIO server that you are enabling a webhook and the corresponding endpoint to be called. Replace <YOURFUNCTIONNAME> with a function name of your choosing. For simplicity, I went with ‘extraction.’ Also, make sure that the endpoint environment variable is set to the correct host and port for your extraction server. In this case, http://localhost:5000 is where our Flask application is running.
export MINIO_NOTIFY_WEBHOOK_ENABLE_<YOURFUNCTIONNAME>=on
export MINIO_NOTIFY_WEBHOOK_ENDPOINT_<YOURFUNCTIONNAME>=http://localhost:5000
Once you have set these environment variables, start the MinIO server (or restart it if it is already running, so that the new configuration takes effect). In the following steps, we will use mc, the MinIO Client command-line tool, so make sure you have it installed.
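mc talks to a deployment through an alias; if you haven’t set one up yet, you can create one like this (the alias name, endpoint, and credentials below are placeholders for your own values):

mc alias set myminio http://localhost:9000 <ACCESS_KEY> <SECRET_KEY>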
Next, let’s configure the event notification for our bucket and the type of event we want to be notified about. For the purposes of this project, I created a brand new bucket, also named ‘extraction’. You can do this either with mc or through the MinIO Console.
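For instance, the bucket can be created with mc (using your own alias in place of ALIAS):

mc mb ALIAS/extraction

With the bucket in place, register the webhook notification for PUT events: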
mc event add ALIAS/BUCKET arn:minio:sqs::<YOURFUNCTIONNAME>:webhook --event put
Finally, you can check that you have configured the correct event type for the bucket notifications by verifying that s3:ObjectCreated:* appears in the output when you run this command:
mc event ls ALIAS/BUCKET arn:minio:sqs::<YOURFUNCTIONNAME>:webhook
If you want to learn more about publishing bucket events to a webhook, check out the MinIO documentation on bucket notifications.
Here’s a document that I want to extract text from. It’s a PDF.
I put this PDF in my ‘extraction’ bucket using the MinIO Console.
This PUT event triggers a bucket notification which then gets published to the extraction server endpoint. Accordingly, the text is extracted by Tika and printed to the console.
Although we are just printing out the extracted text for now, this text could have been used for many downstream tasks, as hinted at in The Premise. For example:
Dataset creation for LLM fine-tuning: Imagine you want to fine-tune a large language model on a collection of corporate documents that exist in a variety of file formats (e.g., PDF, DOCX, PPTX, Markdown). To create the LLM-friendly text dataset for this task, you could collect all these documents into a MinIO bucket configured with a similar webhook and add the extracted text for each document to a dataframe of the fine-tuning/training set (see the sketch after this list). Furthermore, by having your dataset’s source files on MinIO, it becomes much easier to manage, audit, and track the composition of your datasets.
Retrieval Augmented Generation: RAG is a way for LLM applications to make use of precise context and avoid hallucination. A central aspect of this approach is ensuring your documents’ text can be extracted and then embedded into vectors, thereby enabling semantic search. In addition, it’s generally a best practice to store the actual source documents of these vectors in an object store (like MinIO!). With the approach outlined in this post, you can easily achieve both. If you want to learn more about RAG and its benefits, check out this blog post.
LLM Application: With a programmatic way to instantly extract the text from a newly stored document, the possibilities are endless, especially if you can utilize an LLM. Think keyword detection (e.g., Prompt: “What stock tickers are mentioned?”), content assessment (e.g., Prompt: “Per the rubric, what score should this essay submission get?”), or pretty much any kind of text-based analysis (e.g., Prompt: “Based on this log output, when did the first error occur?”).
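To make the first idea concrete, here is a minimal sketch (the function name and output path are illustrative, not from the original post) of how the webhook handler’s output could be accumulated into a dataframe for fine-tuning:

# dataset_builder.py -- hypothetical helper for accumulating extracted text.
import pandas as pd

rows = []

def record_extraction(obj_name, text):
    # Append one extracted document to the in-memory dataset
    rows.append({"source_object": obj_name, "text": text})

# Inside the webhook handler, after extraction, you would call:
#     record_extraction(obj_name, text)

# Once the bucket has been processed, persist the dataset:
df = pd.DataFrame(rows)
df.to_csv("fine_tuning_corpus.csv", index=False)  # illustrative output path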
Beyond the utility of Bucket Notifications for these tasks, MinIO is built to afford world-class fault tolerance and performance to any type and number of objects, whether they are PowerPoint files, images, or code snippets.
If you have any questions, join our Slack channel!