Summarizing Large Datasets of Customer Feedback Using Retrieval-Augmented Generation (RAG)

Introduction

Customer feedback can be a goldmine, full of insights about what people love and what could be improved. But with thousands of reviews, finding meaningful takeaways is like searching for needles in a haystack. That’s where Retrieval-Augmented Generation (RAG) can be a game-changer, helping us pull out the most relevant feedback and summarize it in a way that’s easy to understand. Let’s dive in!

In this tutorial, we’ll go step-by-step through:

- Setting up a simple, powerful retrieval system with Whoosh to search Amazon reviews effectively.
- Using Hugging Face’s BART model to craft concise, meaningful summaries.
- Putting everything together into a smooth pipeline that lets you generate focused summaries based on specific keywords.

Step 1: Loading the Dataset

To kick things off, download the Amazon Customer Reviews dataset, either from Kaggle or Amazon Open Data. Let’s say you’ve saved it as amazon_reviews.csv. Now, let’s load it and take a quick look at what’s inside to get familiar with the data. Here's how to get started:

    import pandas as pd

    # Load dataset
    data = pd.read_csv("amazon_reviews.csv")

    # Display the first few rows
    print(data.head())

Dataset Structure

The dataset we’ll be working with contains four key columns:

- reviewText: the full body of each customer review. This is where the real feedback lives.
- summary: a brief synopsis provided by the reviewer.
- overall: the star rating, on a scale from 1 to 5, capturing the reviewer's general satisfaction.
- productId: the unique identifier for each product.

For our purposes, we’ll zero in on reviewText, as it’s the most direct source of customer sentiment.

Step 2: Setting Up the Retrieval System with Whoosh

Whoosh, a Python library, makes it easy and efficient to search through text by creating an index. We’ll create an index of customer reviews so we can quickly search for specific keywords.
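To make the idea of an index more concrete, here is a toy sketch of an inverted index in plain Python. It is only a conceptual illustration, not how Whoosh is implemented (Whoosh also handles text analysis, relevance scoring, and on-disk storage). The function names and sample reviews are hypothetical.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each token to the set of document positions that contain it."""
    inverted = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in tokenize(text):
            inverted[token].add(doc_id)
    return inverted


def lookup(inverted, query):
    """Return sorted ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set(inverted.get(tokens[0], set()))
    for token in tokens[1:]:
        matches &= inverted.get(token, set())
    return sorted(matches)


# Hypothetical sample reviews, for illustration only
sample_reviews = [
    "Battery life is short but the sound is great",
    "Great sound quality and a comfortable fit",
    "The battery drains fast and life span is poor",
]
toy_index = build_inverted_index(sample_reviews)
print(lookup(toy_index, "battery life"))  # → [0, 2]
```

Because the index maps words to documents ahead of time, a lookup only touches the entries for the query words instead of scanning every review.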
Defining the Schema

Our schema, in essence, defines which fields we want to include in our index. Here, we’ll focus on productId, reviewText, and summary. Here’s how to define the schema and create the index directory:

    import os

    from whoosh.fields import Schema, TEXT, ID
    from whoosh.index import create_in

    # Define schema for the search index
    schema = Schema(
        productId=ID(stored=True),
        reviewText=TEXT(stored=True),
        summary=TEXT(stored=True),
    )

    # Create the index directory if it does not exist yet
    os.makedirs("indexdir", exist_ok=True)
    index = create_in("indexdir", schema)

Indexing the Reviews

Next, let’s add each review to the index. This allows Whoosh to quickly retrieve reviews relevant to specific keywords. Whoosh expects string values for its text fields, so we first drop rows with a missing reviewText and replace missing summaries with empty strings.

    # Drop rows without review text and blank out missing summaries
    data = data.dropna(subset=["reviewText"]).fillna({"summary": ""})

    # The writer commits automatically when the with-block exits
    with index.writer() as writer:
        for _, row in data.iterrows():
            writer.add_document(
                productId=str(row["productId"]),
                reviewText=str(row["reviewText"]),
                summary=str(row["summary"]),
            )

Step 3: Defining a Function to Retrieve Reviews by Keyword

We’ll create a search function to retrieve reviews matching specific keywords, such as "battery life" or "sound quality". This function returns a list of the reviews most relevant to the search term.

    from whoosh.index import open_dir
    from whoosh.qparser import QueryParser

    def search_reviews(keyword, index_dir="indexdir"):
        """Return the text of the reviews that best match the keyword."""
        index = open_dir(index_dir)
        results_list = []
        with index.searcher() as searcher:
            query = QueryParser("reviewText", index.schema).parse(keyword)
            results = searcher.search(query, limit=10)  # Adjust limit based on need
            for hit in results:
                results_list.append(hit["reviewText"])
        return results_list

    # Example search
    retrieved_reviews = search_reviews("battery life")
    print("Retrieved Reviews:", retrieved_reviews)

Explanation:

- index_dir="indexdir": The location of the index we created.
- limit=10: Retrieves the top 10 reviews for simplicity (adjust as needed).

The retrieved_reviews list will contain text from reviews related to the specified keyword, making it ready for summarization.
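Retrieved reviews may contain stray whitespace or near-identical entries, which waste space in the text passed to the summarizer. A light cleanup pass could look like the sketch below. The helper name prepare_reviews and the example reviews are hypothetical, and whether duplicates should be dropped at all depends on your use case.

```python
import re


def prepare_reviews(reviews):
    """Collapse whitespace, drop empty entries, and remove case-insensitive duplicates.

    The original order of the remaining reviews is preserved.
    """
    seen = set()
    cleaned = []
    for review in reviews:
        text = re.sub(r"\s+", " ", str(review)).strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned


# Hypothetical retrieved reviews, for illustration only
example_reviews = [
    "Battery life is  great.\n",
    "battery life is great.",
    "   ",
    "Battery died after a week.",
]
print(prepare_reviews(example_reviews))
```

In the pipeline, this would sit between the search step and the summarization step, for example prepare_reviews(search_reviews("battery life")).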
Step 4: Summarizing the Retrieved Reviews with Hugging Face’s Transformers

For summarization, we’ll use Hugging Face’s transformers library, specifically the facebook/bart-large-cnn model, which is designed for summarization tasks.

Setting Up the Summarizer

Load the BART model for summarization, then input the retrieved reviews as a single text block. The model accepts at most 1,024 input tokens, so we pass truncation=True to cut off longer inputs, which can otherwise cause an error.

    from transformers import pipeline

    # Initialize summarization pipeline
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    # Concatenate retrieved reviews and generate summary
    feedback_text = " ".join(retrieved_reviews)
    summary = summarizer(
        feedback_text,
        max_length=50,
        min_length=25,
        do_sample=False,
        truncation=True,  # Cut inputs that exceed the model's input limit
    )
    print("Summary:", summary[0]["summary_text"])

Explanation:

- max_length and min_length: These control the length of the summary, measured in tokens.
- do_sample=False: Disables random sampling for more consistent summaries.

Step 5: Creating a Complete RAG Pipeline Function

Combine the retrieval and summarization into a single function. This function will search for relevant reviews based on a keyword, retrieve the top results, and summarize them.

    def generate_review_summary(keyword, index_dir="indexdir"):
        """Retrieve reviews matching the keyword and return a short summary."""
        # Retrieve reviews based on keyword
        retrieved_reviews = search_reviews(keyword, index_dir)
        if not retrieved_reviews:
            return "No relevant reviews found."

        # Concatenate reviews for summarization
        feedback_text = " ".join(retrieved_reviews)

        # Generate summary
        summary = summarizer(
            feedback_text,
            max_length=50,
            min_length=25,
            do_sample=False,
            truncation=True,
        )
        return summary[0]["summary_text"]

    # Example usage
    print("Generated Summary for 'battery life':", generate_review_summary("battery life"))

Sample Output

For a keyword like "battery life", you might see an output like:

"Users often mention short battery life and overheating issues. While many enjoy the device's functionality, they note a need for better power management."
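The pipeline above joins all retrieved reviews into one string, and summarization models accept only a limited input length. One way to handle larger result sets is to split the reviews into chunks, summarize each chunk, and join the partial summaries. The sketch below assumes a word budget as a rough stand-in for the model's token limit. The names chunk_reviews and summarize_in_chunks, the default of 400 words, and the stand-in summarizer are all hypothetical.

```python
def chunk_reviews(reviews, max_words=400):
    """Group reviews into chunks whose total word count stays within max_words.

    A single review longer than max_words is split into several pieces.
    """
    chunks, current, current_len = [], [], 0
    for review in reviews:
        words = review.split()
        # Split oversized reviews into pieces of at most max_words words
        pieces = [words[i:i + max_words] for i in range(0, len(words), max_words)]
        for piece in pieces:
            # Start a new chunk when adding this piece would exceed the budget
            if current and current_len + len(piece) > max_words:
                chunks.append(" ".join(current))
                current, current_len = [], 0
            current.extend(piece)
            current_len += len(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_in_chunks(reviews, summarize, max_words=400):
    """Summarize each chunk with the given callable and join the partial summaries."""
    return " ".join(summarize(chunk) for chunk in chunk_reviews(reviews, max_words))


def first_words(text):
    """Stand-in summarizer for illustration: keeps the first five words."""
    return " ".join(text.split()[:5])


demo_reviews = ["good battery " * 3, "poor sound " * 3, "ok"]
print(chunk_reviews(demo_reviews, max_words=6))
print(summarize_in_chunks(demo_reviews, first_words, max_words=6))
```

With the real model, the summarize argument could wrap the summarizer from Step 4, for example lambda text: summarizer(text, max_length=50, min_length=25, do_sample=False)[0]["summary_text"]. Words and tokens are not the same unit, so the budget should be chosen conservatively.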
Step 6: Advanced Enhancements

Expanding Search Options:

- Multi-Field Search: Enable searches across several fields, like productId or summary, to help users zero in on what they need.
- Sentiment Analysis: Add sentiment filters to highlight only positive or negative reviews, offering a quick pulse on customer feedback.

Fine-Tuning Summaries:

- Adjusting Summary Length: Control the summary length as needed by setting max_length and min_length for longer or shorter insights.
- Batch Processing for Large Texts: For bigger data sets, divide reviews into smaller chunks, helping the system manage memory better.

Visualizing Insights:

- Word Cloud: Create a word cloud from review keywords, making popular terms easy to spot at a glance.
- Summary Metrics: Show average ratings or key themes from reviews, offering a quick, quantitative snapshot.

Conclusion

This approach highlights how RAG can transform huge volumes of customer feedback into clear, actionable insights. By combining Whoosh for search and Hugging Face’s BART for summaries, you can quickly extract relevant feedback and make it manageable, turning unstructured data into decisions you can act on. This pipeline is versatile: it can also be used to summarize research papers, meeting notes, or incident reports. With a few tweaks, it can be tailored to meet industry-specific needs or adapt to various types of unstructured data.

Disclaimer: The opinions expressed here are my own and do not necessarily reflect the views of CVS Health or its affiliates.