Use ScyllaDB to perform semantic search across movie plot descriptions.

We built a sample movie recommendation app to showcase ScyllaDB’s new vector search capabilities. The sample app gives you a simple way to experience building low-latency semantic search and vector-based applications with ScyllaDB.

In this post, we’ll show how to perform semantic search across movie plot descriptions to find movies by meaning, not keywords. This example also shows how you can add ScyllaDB Vector Search to your existing applications.

Before diving into the application, let’s clarify what we mean by semantic search and provide some context about similarity functions.

About vector similarity functions

Similarity between two vectors can be calculated in several ways. The most common methods are cosine similarity, dot product (inner product), and L2 (Euclidean) distance. ScyllaDB Vector Search supports all of these functions.

For text embeddings, cosine similarity is the most commonly used similarity function. That’s because, when working with text, we mostly care about the direction of the vector rather than its magnitude. Cosine similarity considers only the angle between the vectors (i.e., the difference in directions) and ignores the magnitude (the length of the vector).

For example, a short document (1 page) and a longer document (10 pages) on the same topic will still point in similar directions in the vector space even though they are different lengths. This is what makes cosine similarity ideal for capturing topical similarity.

In practice, many embedding models (e.g., OpenAI models) produce normalized vectors. Normalized vectors all have the same length (magnitude of 1).
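To make these functions concrete, here is a minimal sketch in plain Python. The helper names and the toy 3-dimensional vectors are our own, chosen for illustration; real text embeddings have hundreds of dimensions and come from an embedding model. The sketch shows that cosine similarity ignores magnitude, while the dot product and L2 distance do not.

```python
import math


def dot(a, b):
    """Dot (inner) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def magnitude(a):
    """Length (Euclidean norm) of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Dot product divided by the product of the magnitudes."""
    return dot(a, b) / (magnitude(a) * magnitude(b))


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    """Scale a vector so that its magnitude is 1."""
    m = magnitude(a)
    return [x / m for x in a]


# Toy vectors: same direction, but long_doc is 10x longer than short_doc.
short_doc = [1.0, 2.0, 2.0]
long_doc = [10.0, 20.0, 20.0]
other_doc = [2.0, 1.0, 2.0]

# Same direction, so cosine similarity is 1 (up to rounding),
# even though the dot product and L2 distance reflect the size difference.
print(cosine_similarity(short_doc, long_doc))
print(dot(short_doc, long_doc))
print(l2_distance(short_doc, long_doc))

# After normalization, the dot product matches cosine similarity.
print(cosine_similarity(short_doc, other_doc))
print(dot(normalize(short_doc), normalize(other_doc)))
```

The last two printed values agree up to floating-point rounding, which is why the choice between cosine similarity and dot product matters less once vectors are normalized.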
For normalized vectors, cosine similarity and the dot product return the same result. This is because cosine similarity divides the dot product by the magnitudes of the vectors, which are all 1 when vectors are normalized. The L2 function produces different distance values compared to the dot product or cosine similarity, but the ranking of the embeddings remains the same (assuming normalized vectors). For unit vectors, the squared L2 distance equals 2 minus twice the dot product, so a larger dot product always means a smaller distance.

Now that you have a better understanding of semantic similarity functions, let’s explain how the recommendation app works.

App overview

The application lets users describe what kind of movie they want to watch. For example, if you type “American football,” the app compares your input to the plots of movies stored in the database. The first result is the best match, followed by other similar recommendations. This comparison uses ScyllaDB Vector Search.

You can find the source code on GitHub, along with setup instructions and a step-by-step tutorial in the documentation. For the dataset, we are reusing a TMDB dataset available on Kaggle.

Project requirements

To run the application, you need a ScyllaDB Cloud account and a vector search enabled cluster. Right now, you need to use the API to create a vector search enabled cluster. Follow the instructions here to get started!

The application depends on a few Python packages:

- ScyllaDB Python driver: for connecting to and querying ScyllaDB.
- Sentence Transformers: to generate embeddings locally without requiring OpenAI or other paid APIs.
- Streamlit: for the UI.
- Pydantic: to make working with query results easier.

By default, the app uses the all-MiniLM-L6-v2 model so anyone can run it locally without heavy compute requirements. Other than ScyllaDB Cloud, no commercial or paid services are needed to run the example.

Configuration and database connection

A config.py file stores ScyllaDB Cloud credentials, including the host address and connection details.

A separate ScyllaDB helper module handles the following:

- Creating the connection and session
- Inserting and querying data
- Providing helper functions for clean database interactions

Database schema

The schema is defined in a schema.cql file, which is executed when you run the project’s migration script. It includes:

- Keyspace creation (with a replication factor of 3)
- Table definition for movies, storing fields like release_date, title, genre, and plot
- Vector search index

```cql
-- Keyspace with a replication factor of 3
CREATE KEYSPACE recommend
WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': '3'};

-- Movie metadata stored alongside the plot embedding; CDC is enabled on the table
CREATE TABLE recommend.movies (
    id INT,
    release_date TIMESTAMP,
    title TEXT,
    tagline TEXT,
    genre TEXT,
    imdb_id TEXT,
    poster_url TEXT,
    plot TEXT,
    plot_embedding VECTOR<FLOAT, 384>,
    PRIMARY KEY (id)
) WITH cdc = {'enabled': 'true'};

-- ANN index on the embedding column
CREATE INDEX IF NOT EXISTS ann_index
ON recommend.movies(plot_embedding)
USING 'vector_index'
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
```

Schema highlights:

- `plot`: text, stores the movie description used for similarity comparison.
- `plot_embedding`: vector, the embedding representation of the plot, defined using the vector data type with 384 dimensions (matching the Sentence Transformers model).
- Primary key: `id` is the partition key, for efficient lookups when querying by id.
- CDC enabled: required for ScyllaDB vector search.
- Vector index: an Approximate Nearest Neighbor (ANN) index created on the `plot_embedding` column to enable efficient vector queries.

The goal of this schema is to allow efficient search on the plot embeddings and to store additional information alongside the vectors.

Embeddings

An Embedding Creator class handles text embedding generation with Sentence Transformers. Its embedding function accepts any text input and returns a list of float values that you can insert into ScyllaDB’s `vector` column.

Recommendations implemented with vector search

The app’s main function is to provide movie recommendations. These recommendations are implemented using vector search.
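
Conceptually, vector search ranks the stored plot embeddings by how similar they are to the embedding of the user’s query. The following is a minimal brute-force sketch of that idea in plain Python, not how ScyllaDB implements it: the function name `top_k_similar`, the movie titles, and the 3-dimensional vectors are made up for illustration. An ANN index returns approximate results without scoring every row the way this loop does.

```python
def dot(a, b):
    """Dot product, matching the DOT_PRODUCT similarity function in the schema."""
    return sum(x * y for x, y in zip(a, b))


def top_k_similar(query_embedding, rows, k=5):
    """Exact (brute-force) search: score every row and keep the k best.

    rows is a list of (title, embedding) pairs. A higher dot product
    means a closer match, assuming normalized embeddings.
    """
    ranked = sorted(
        rows,
        key=lambda row: dot(query_embedding, row[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:k]]


# Hypothetical titles with toy unit vectors instead of real 384-dimensional embeddings.
movies = [
    ("Movie A", [1.0, 0.0, 0.0]),
    ("Movie B", [0.0, 1.0, 0.0]),
    ("Movie C", [0.6, 0.8, 0.0]),
]
query = [0.8, 0.6, 0.0]

print(top_k_similar(query, movies, k=2))  # ['Movie C', 'Movie A']
```

Scoring every row like this gets slower as the table grows, which is the problem an ANN index is designed to avoid.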

So we create a module called recommender that handles:

- Taking the input text
- Turning the text into embeddings
- Running vector search

```python
from db.scylladb import ScyllaClient
from embedding_creator import EmbeddingCreator
from models import Movie


class MovieRecommender:
    def __init__(self):
        self.scylla_client = ScyllaClient()
        self.embedding_creator = EmbeddingCreator("all-MiniLM-L6-v2")

    def similar_movies(self, user_query: str, top_k=5) -> list[Movie]:
        # Convert the user's text into an embedding with the same model
        # that produced the stored plot embeddings.
        user_query_embedding = self.embedding_creator.create_embedding(user_query)
        # ANN OF orders rows by similarity to the query embedding.
        db_query = """
            SELECT *
            FROM recommend.movies
            ORDER BY plot_embedding ANN OF %s LIMIT %s;
        """
        values = [user_query_embedding, top_k]
        # Reuse the client created in __init__ instead of opening a new one.
        results = self.scylla_client.query_data(db_query, values)
        return [Movie(**row) for row in results]
```

Let’s break down the vector search query:

```cql
SELECT *
FROM recommend.movies
ORDER BY plot_embedding ANN OF [0.1, 0.2, 0.3, …] LIMIT 5;
```

- User input is first converted to an embedding, ensuring that we’re comparing embedding to embedding.
- The rows in the table are ordered by similarity using the ANN operator (`ANN OF`).
- Results are limited to five similar movies.
- The `SELECT` statement retrieves all columns from the table.

In similarity search, we calculate the distance between two vectors. The closer the vectors are in vector space, the more similar their underlying content. In other words, a smaller distance suggests higher similarity. Therefore, the `ORDER BY` clause sorts results in ascending order of distance, so the closest matches appear first.

Streamlit UI

The UI, defined in app.py, ties everything together.

- It takes the user’s query, converts it to an embedding, and executes a vector search.
- The UI displays the best match and a list of other similar movie recommendations.

Try it yourself!

If you want to get started building with ScyllaDB Vector Search, you have several options:

- Explore the source code on GitHub
- Use the README to set up the app on your computer
- Follow the tutorial to build the app from scratch

And if you have questions, use the forum and we’ll be happy to help.

About Attila Tóth

Attila Tóth is a developer advocate at ScyllaDB. He writes tutorials and blog posts, speaks at events, and creates demos and sample applications to help developers build high-performance applications.