Large Language Models are incredibly powerful, and it’s impressive that they have been trained on such vast corpora of data. However, how many of us, as we embark on our founder journey, have the budget to train these models in-house? Very few, most likely.
But are these pre-trained LLMs still useful to us? Of course - as long as they are open source, and fortunately quite a few are available now.
So, how exactly are they helpful? Most of us in the AI field are aware of RAG, or have at least heard of it. Let me give a simple one-line definition: Retrieval-Augmented Generation is exactly what it sounds like - it retrieves data from the external sources we provide and augments the LLM’s prompt with it, so the model’s output is grounded in that data.
It’s extremely useful because it leverages the generative power of LLMs while incorporating the knowledge we supply externally, so answers come from the corpus we care about. And if the external corpus is limited, we can let the model fall back on the LLM’s general knowledge.
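To make that concrete, here is a minimal sketch of the retrieve-augment-generate loop. The retriever and llm callables are hypothetical placeholders; the real components are built step by step in the notebook below.
# Conceptual sketch of RAG (illustrative only - the real pipeline is built step by step below)
def rag_answer(query, retriever, llm):
    # 1. Retrieve: pull the chunks of the external corpus that are relevant to the query
    context_chunks = retriever(query)
    # 2. Augment: fold the retrieved context into the prompt
    context = "\n".join(context_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."
    # 3. Generate: let the LLM produce an answer grounded in that context
    return llm(prompt)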
I am particularly interested in how we eat and deeply believe in the concept of “garbage in, garbage out” when it comes to food and bodies. If we nourish ourselves with naturally healthy foods, we reflect nature - strong, vibrant, and unstoppable. But if we consume artificial, lifeless foods, we begin to look and feel the same - drained and unnatural. One of the worst consequences of excess artificial and refined food consumption today is diabetes.
And who truly understands the real pain points of living with diabetes? It’s simple - the people who experience it firsthand. Given my interest in sourcing experiential data on diabetes using LLMs, I ran this experiment with Ollama - a tool for running open-source LLMs (such as Llama 2) locally, which is well suited to tasks like this.
I am sharing my notebook step by step with an explanation at each stage. Additionally, to aid comprehension, I am including a high-level architecture diagram.
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OllamaEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.llms import Ollama
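A quick note on the imports: depending on which LangChain version you have installed, some of these classes may have moved to the langchain_community package, and the paths above may raise deprecation warnings or fail outright. If so, the equivalent imports look roughly like this:
# Alternative import paths for newer LangChain releases (adjust to your installed version)
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import Ollama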
# List of URLs to scrape
urls = [
'https://www.reddit.com/r/diabetes/comments/1broigp/what_are_your_biggest_frustrations_with_diabetes/',
'https://www.reddit.com/r/diabetes_t2/comments/156znkx/whats_the_most_challenging_part_about_dealing/',
'https://www.reddit.com/r/diabetes/comments/qcsgji/what_is_the_hardest_part_about_managing_diabetes/',
'https://www.reddit.com/r/diabetes_t1/comments/1hdlipr/diabetes_and_pain/',
'https://www.reddit.com/r/diabetes/comments/ww6mrj/what_does_diabetic_nerve_pain_feel_like/',
'https://www.reddit.com/r/AskReddit/comments/avl1x0/diabetics_of_reddit_what_is_your_experience_of/',
'https://www.reddit.com/r/diabetes_t2/comments/1jggxi9/my_fathers_sugar_levels_are_not_dropping/',
'https://www.reddit.com/r/diabetes_t2/comments/1jglmie/shaky_feeling/',
'https://www.reddit.com/r/diabetes_t2/comments/1jgccvo/rant_from_a_depressedeating_disordered_diabetic/'
]
# Initialize text storage
all_texts = []
# Step 1: Fetch and process content from multiple URLs
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract text from <p> tags
    text = ' '.join([para.get_text() for para in soup.find_all('p')])
    if text:  # Store only if text is found
        all_texts.append(text)
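One caveat worth flagging: Reddit is aggressive about blocking anonymous scrapers, so bare requests.get(url) calls can come back as 429 errors or near-empty pages. A slightly more defensive version of the fetch loop - the User-Agent string here is just an illustrative placeholder - might look like this:
# More defensive fetching (sketch): identify the client, set a timeout, and skip failed requests
headers = {"User-Agent": "Mozilla/5.0 (compatible; diabetes-rag-experiment/0.1)"}
for url in urls:
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Skipping {url}: {e}")
        continue
    soup = BeautifulSoup(response.text, 'html.parser')
    text = ' '.join(para.get_text() for para in soup.find_all('p'))
    if text:
        all_texts.append(text)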
# Step 2: Split all content into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
all_chunks = []
for text in all_texts:
    all_chunks.extend(text_splitter.split_text(text))
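Before embedding anything, it is worth a quick sanity check that the scrape actually produced text - if all_chunks is empty or tiny, the retrieval step later will have nothing useful to work with:
# Quick sanity check on the scraped and chunked text
print(f"Collected {len(all_texts)} pages and {len(all_chunks)} chunks")
if all_chunks:
    print("Sample chunk:", all_chunks[0][:200])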
# Step 3: Initialize Ollama embeddings
embeddings = OllamaEmbeddings(model="llama2") # Adjust model name if needed
# Step 4: Create a FAISS vector store using all chunks
vector_store = FAISS.from_texts(all_chunks, embeddings)
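Embedding every chunk with a local model can take a while, so you may want to persist the FAISS index to disk and reload it in later sessions instead of rebuilding it each run. A rough sketch (the faiss_index folder name is arbitrary):
# Optional: save the index so it doesn't have to be rebuilt every run
vector_store.save_local("faiss_index")
# In a later session:
# vector_store = FAISS.load_local("faiss_index", embeddings)
# (newer LangChain versions may also require allow_dangerous_deserialization=True here)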
# Step 5: Initialize the Ollama LLM
llm = Ollama(model="llama2", temperature=0.3)
Together, steps 3-5 enable RAG, where the LLM can pull knowledge from stored chunks in the vector store and use the retrieved context to answer a user’s query in the most relevant way.
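As an aside, the RetrievalQA chain we imported earlier can wire these same pieces together in a few lines, if you prefer LangChain’s built-in abstraction to the hand-rolled function in Step 6 below. Treat this as a sketch, since the exact call signature varies a little across LangChain versions:
# Alternative: let LangChain's RetrievalQA chain handle retrieve + prompt + generate
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # stuff all retrieved chunks into a single prompt
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
)
# result = qa_chain({"query": "What is the hardest part about managing diabetes?"})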
# Step 6: Create the question-answering function
def ask_question_with_fallback(query):
    # Retrieve relevant documents
    docs = vector_store.similarity_search(query, k=3)
    for doc in docs:
        print(f"Retrieved doc: {doc.page_content[:200]}")
    # If no relevant documents or low similarity, use general knowledge
    # (see the score-based sketch right after this function)
    # if not docs or all(doc.metadata.get('score', 1.0) < 0.3 for doc in docs):
    #     return use_general_knowledge(query)
    if not docs:
        return use_general_knowledge(query)
    # Format retrieved documents as context
    context = "\n\n".join([doc.page_content for doc in docs])
    # Construct RAG prompt
    rag_prompt = f"""
Use the following pieces of context to answer the question at the end.
If you don't know the answer based on this context, respond with "NO_ANSWER_FOUND".
Context:
{context}
Question: {query}
Provide a direct and concise answer to the question based only on the context above:
"""
    rag_answer = llm(rag_prompt)
    # Check for fallback trigger
    if "NO_ANSWER_FOUND" in rag_answer or "don't know" in rag_answer.lower() or "cannot find" in rag_answer.lower():
        return use_general_knowledge(query)
    return {
        "answer": rag_answer,
        "source": "URL content",
        "source_documents": docs
    }
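The commented-out lines inside the function hint at a score-based fallback, but FAISS does not populate doc.metadata['score'], which is why they are disabled. If you want to gate on similarity, similarity_search_with_score returns (document, distance) pairs you can threshold instead. A rough sketch - the 1.0 cutoff is an arbitrary placeholder, and FAISS returns an L2 distance, so lower means more similar:
# Sketch of a score-based retrieval gate using FAISS distances (lower = more similar)
def retrieve_with_threshold(query, k=3, max_distance=1.0):  # max_distance is a made-up cutoff
    docs_and_scores = vector_store.similarity_search_with_score(query, k=k)
    good_docs = [doc for doc, score in docs_and_scores if score <= max_distance]
    return good_docs  # empty list -> fall back to general knowledge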
# Step 7: Define fallback general knowledge function
def use_general_knowledge(query):
    general_prompt = f"""
Answer this question using your general knowledge: {query}
Provide a direct and helpful response. If you don't know, simply say so.
"""
    general_answer = llm(general_prompt)
    return {
        "answer": general_answer,
        "source": "General knowledge",
        "source_documents": []
    }
# Step 8: Example usage
query = "What is the hardest part about managing diabetes?" # Replace with your actual question
result = ask_question_with_fallback(query)
# Display results
print("Answer:")
print(result["answer"])
print(f"\nSource: {result['source']}")
if result["source_documents"]:
    print("\nSource Documents:")
    for i, doc in enumerate(result["source_documents"]):
        print(f"Source {i+1}:")
        print(doc.page_content[:200] + "...")  # Print first 200 chars of each source
        print()
My output from Step 8 is below. The model uses RAG: it retrieves the most similar chunks from the FAISS store and answers my query based on that context.
In conclusion, leveraging RAG with URLs is a powerful approach for enhanced knowledge retrieval in diabetes-related queries. By layering real-world insights from community platforms like Reddit on top of open-source LLMs, we can offer personalized and accurate information - after all, who understands it better than those living with it every day?
This method is not only cost-effective but also fosters collaboration, ultimately improving support for individuals managing diabetes. As AI continues to evolve, its potential to improve healthcare and well-being remains vast.
Feature photo by Suzy Hazelwood: https://www.pexels.com/photo/close-up-photo-of-sugar-cubes-in-glass-jar-2523650/