My friend and I built CocoIndex, an open-source ETL framework that prepares data for AI applications such as semantic search and retrieval-augmented generation (RAG). It offers a data-driven programming model that simplifies creating and maintaining data indexing pipelines, ensuring data freshness and consistency.
🔗 GitHub Repo: https://github.com/cocoindex-io/cocoindex
I also created a video tutorial to accompany the Quickstart Guide.
Installation: Install the CocoIndex Python library:
pip install cocoindex
Set Up Postgres with pgvector Extension: Ensure Docker Compose is installed, then start a Postgres database:
docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d
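Before defining a flow, it can help to confirm the database is reachable and pgvector is available. Below is a minimal Python sketch under a few assumptions not in the post: the default cocoindex/cocoindex credentials on localhost:5432 from the compose file above, and the psycopg client library (pip install "psycopg[binary]"). CocoIndex reads its connection string from the COCOINDEX_DATABASE_URL environment variable, so the sketch sets that as well.

# Sanity check for the Postgres setup above. Assumptions (not from the post):
# default cocoindex/cocoindex credentials on localhost:5432, psycopg installed.
import os
import psycopg

# CocoIndex reads the connection string from this environment variable.
url = os.environ.setdefault(
    "COCOINDEX_DATABASE_URL",
    "postgres://cocoindex:cocoindex@localhost:5432/cocoindex")

with psycopg.connect(url) as conn:
    # Confirm the pgvector extension is available in this server build.
    row = conn.execute(
        "SELECT name FROM pg_available_extensions WHERE name = 'vector'").fetchone()
    print("pgvector available:", row is not None)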
Define Your Indexing Flow: Create a flow to index your data. For example:
import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Read markdown files from a local directory as the source.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))
    doc_embeddings = data_scope.add_collector()
    with data_scope["documents"].row() as doc:
        # Split each document into overlapping chunks.
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(
                language="markdown", chunk_size=300, chunk_overlap=100))
        with doc["chunks"].row() as chunk:
            # Embed each chunk with a SentenceTransformers model.
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])
    # Export the collected rows to Postgres with a cosine-similarity vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
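After running the flow (the Quickstart Guide covers the exact CLI command to build the index), the exported table can be queried from any Postgres client. The sketch below shows one way to run a semantic search over it; note the table name doc_embeddings is an assumption (CocoIndex may derive the actual table name from the flow and target names), and it assumes the pgvector and sentence-transformers Python packages are installed.

# A sketch of querying the exported table directly with pgvector. Assumptions
# (not from the post): the export landed in a table named "doc_embeddings"
# (check your database for the actual name), and pgvector + sentence-transformers
# are installed (pip install pgvector "psycopg[binary]" sentence-transformers).
import os
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

# Use the same model as the indexing flow so query and document vectors match.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def search(query: str, top_k: int = 5):
    query_vec = model.encode(query)
    with psycopg.connect(os.environ["COCOINDEX_DATABASE_URL"]) as conn:
        register_vector(conn)  # adapt numpy arrays to the Postgres vector type
        # <=> is pgvector's cosine-distance operator, matching the
        # COSINE_SIMILARITY index declared in the flow.
        return conn.execute(
            "SELECT filename, text, embedding <=> %s AS distance "
            "FROM doc_embeddings ORDER BY distance LIMIT %s",
            (query_vec, top_k)).fetchall()

for filename, text, distance in search("how do I configure chunking?"):
    print(f"{distance:.3f}  {filename}: {text[:80]}")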
I'm sincerely looking for feedback and would love to learn from your thoughts. Thank you so much!