Large Language Models are incredibly powerful, and it’s impressive that they have been trained on such vast corpora of data. However, how many of us, as we embark on our founder journey, have the budget to train these models in-house? Very few, most likely.
But are these pre-trained LLMs still useful to us? Of course - as long as they are open source, and fortunately quite a few are available now.
So, how exactly are they helpful? Most of us in the AI field are aware of RAG, or have at least heard of it. Let me give a simple one-line definition: Retrieval-Augmented Generation is exactly what it sounds like - it retrieves data from the external sources we provide and augments the LLM’s prompt with it, so the model’s output is grounded in that data.
It’s extremely useful because it leverages the generative power of LLMs while incorporating the knowledge we supply externally, so answers come from the corpus we care about. And if the external corpus is limited, we can let the model fall back on the LLM’s general knowledge.
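To make that concrete, here is a minimal sketch of the retrieve-augment-generate loop. The retriever and llm callables are hypothetical placeholders; the real components are built step by step in the notebook below.
# Conceptual sketch of RAG (illustrative only - the real pipeline is built step by step below)
def rag_answer(query, retriever, llm):
    # 1. Retrieve: pull the chunks of the external corpus that are relevant to the query
    context_chunks = retriever(query)
    # 2. Augment: fold the retrieved context into the prompt
    context = "\n".join(context_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."
    # 3. Generate: let the LLM produce an answer grounded in that context
    return llm(prompt)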
I am particularly interested in how we eat and deeply believe in the concept of “garbage in, garbage out” when it comes to food and bodies. If we nourish ourselves with naturally healthy foods, we reflect nature - strong, vibrant, and unstoppable. But if we consume artificial, lifeless foods, we begin to look and feel the same - drained and unnatural. One of the worst consequences of excess artificial and refined food consumption today is diabetes.
And who truly understands the real pain points of living with diabetes? It’s simple - the people who experience it firsthand. Given my interest in sourcing experiential data on diabetes using LLMs, I ran this experiment with Ollama - a tool for running open-source LLMs (such as Llama 2) locally, which is well suited to tasks like this.
I am sharing my notebook step by step with an explanation at each stage. Additionally, to aid comprehension, I am including a high-level architecture diagram.
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OllamaEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.llms import Ollama
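A quick note on the imports: depending on which LangChain version you have installed, some of these classes may have moved to the langchain_community package, and the paths above may raise deprecation warnings or fail outright. If so, the equivalent imports look roughly like this:
# Alternative import paths for newer LangChain releases (adjust to your installed version)
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import Ollama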
# List of URLs to scrape
urls = [
'https://www.reddit.com/r/diabetes/comments/1broigp/what_are_your_biggest_frustrations_with_diabetes/',
'https://www.reddit.com/r/diabetes_t2/comments/156znkx/whats_the_most_challenging_part_about_dealing/',
'https://www.reddit.com/r/diabetes/comments/qcsgji/what_is_the_hardest_part_about_managing_diabetes/',
'https://www.reddit.com/r/diabetes_t1/comments/1hdlipr/diabetes_and_pain/',
'https://www.reddit.com/r/diabetes/comments/ww6mrj/what_does_diabetic_nerve_pain_feel_like/',
'https://www.reddit.com/r/AskReddit/comments/avl1x0/diabetics_of_reddit_what_is_your_experience_of/',
'https://www.reddit.com/r/diabetes_t2/comments/1jggxi9/my_fathers_sugar_levels_are_not_dropping/',
'https://www.reddit.com/r/diabetes_t2/comments/1jglmie/shaky_feeling/',
'https://www.reddit.com/r/diabetes_t2/comments/1jgccvo/rant_from_a_depressedeating_disordered_diabetic/'
]
# Initialize text storage
all_texts = []
# Step 1: Fetch and process content from multiple URLs
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract text from <p> tags
    text = ' '.join([para.get_text() for para in soup.find_all('p')])
    if text:  # Store only if text is found
        all_texts.append(text)
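One caveat worth flagging: Reddit is aggressive about blocking anonymous scrapers, so bare requests.get(url) calls can come back as 429 errors or near-empty pages. A slightly more defensive version of the fetch loop - the User-Agent string here is just an illustrative placeholder - might look like this:
# More defensive fetching (sketch): identify the client, set a timeout, and skip failed requests
headers = {"User-Agent": "Mozilla/5.0 (compatible; diabetes-rag-experiment/0.1)"}
for url in urls:
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Skipping {url}: {e}")
        continue
    soup = BeautifulSoup(response.text, 'html.parser')
    text = ' '.join(para.get_text() for para in soup.find_all('p'))
    if text:
        all_texts.append(text)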
# Step 2: Split all content into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
all_chunks = []
for text in all_texts:
    all_chunks.extend(text_splitter.split_text(text))
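Before embedding anything, it is worth a quick sanity check that the scrape actually produced text - if all_chunks is empty or tiny, the retrieval step later will have nothing useful to work with:
# Quick sanity check on the scraped and chunked text
print(f"Collected {len(all_texts)} pages and {len(all_chunks)} chunks")
if all_chunks:
    print("Sample chunk:", all_chunks[0][:200])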
# Step 3: Initialize Ollama embeddings
embeddings = OllamaEmbeddings(model="llama2") # Adjust model name if needed
# Step 4: Create a FAISS vector store using all chunks
vector_store = FAISS.from_texts(all_chunks, embeddings)
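Embedding every chunk with a local model can take a while, so you may want to persist the FAISS index to disk and reload it in later sessions instead of rebuilding it each run. A rough sketch (the faiss_index folder name is arbitrary):
# Optional: save the index so it doesn't have to be rebuilt every run
vector_store.save_local("faiss_index")
# In a later session:
# vector_store = FAISS.load_local("faiss_index", embeddings)
# (newer LangChain versions may also require allow_dangerous_deserialization=True here)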
# Step 5: Initialize the Ollama LLM
llm = Ollama(model="llama2", temperature=0.3)
Together, steps 3-5 enable RAG, where the LLM can pull knowledge from stored chunks in the vector store and use the retrieved context to answer a user’s query in the most relevant way.
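As an aside, the RetrievalQA chain we imported earlier can wire these same pieces together in a few lines, if you prefer LangChain’s built-in abstraction to the hand-rolled function in Step 6 below. Treat this as a sketch, since the exact call signature varies a little across LangChain versions:
# Alternative: let LangChain's RetrievalQA chain handle retrieve + prompt + generate
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # stuff all retrieved chunks into a single prompt
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
)
# result = qa_chain({"query": "What is the hardest part about managing diabetes?"})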
# Step 6: Create the question-answering function
def ask_question_with_fallback(query):
    # Retrieve relevant documents
    docs = vector_store.similarity_search(query, k=3)
    for doc in docs:
        print(f"Retrieved doc: {doc.page_content[:200]}")
    # If no relevant documents or low similarity, use general knowledge
    # (see the score-based sketch right after this function)
    # if not docs or all(doc.metadata.get('score', 1.0) < 0.3 for doc in docs):
    #     return use_general_knowledge(query)
    if not docs:
        return use_general_knowledge(query)
    # Format retrieved documents as context
    context = "\n\n".join([doc.page_content for doc in docs])
    # Construct RAG prompt
    rag_prompt = f"""
Use the following pieces of context to answer the question at the end.
If you don't know the answer based on this context, respond with "NO_ANSWER_FOUND".
Context:
{context}
Question: {query}
Provide a direct and concise answer to the question based only on the context above:
"""
    rag_answer = llm(rag_prompt)
    # Check for fallback trigger
    if "NO_ANSWER_FOUND" in rag_answer or "don't know" in rag_answer.lower() or "cannot find" in rag_answer.lower():
        return use_general_knowledge(query)
    return {
        "answer": rag_answer,
        "source": "URL content",
        "source_documents": docs
    }
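The commented-out lines inside the function hint at a score-based fallback, but FAISS does not populate doc.metadata['score'], which is why they are disabled. If you want to gate on similarity, similarity_search_with_score returns (document, distance) pairs you can threshold instead. A rough sketch - the 1.0 cutoff is an arbitrary placeholder, and FAISS returns an L2 distance, so lower means more similar:
# Sketch of a score-based retrieval gate using FAISS distances (lower = more similar)
def retrieve_with_threshold(query, k=3, max_distance=1.0):  # max_distance is a made-up cutoff
    docs_and_scores = vector_store.similarity_search_with_score(query, k=k)
    good_docs = [doc for doc, score in docs_and_scores if score <= max_distance]
    return good_docs  # empty list -> fall back to general knowledge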
# Step 7: Define fallback general knowledge function
def use_general_knowledge(query):
    general_prompt = f"""
Answer this question using your general knowledge: {query}
Provide a direct and helpful response. If you don't know, simply say so.
"""
    general_answer = llm(general_prompt)
    return {
        "answer": general_answer,
        "source": "General knowledge",
        "source_documents": []
    }
# Step 8: Example usage
query = "What is the hardest part about managing diabetes?" # Replace with your actual question
result = ask_question_with_fallback(query)
# Display results
print("Answer:")
print(result["answer"])
print(f"\nSource: {result['source']}")
if result["source_documents"]:
    print("\nSource Documents:")
    for i, doc in enumerate(result["source_documents"]):
        print(f"Source {i+1}:")
        print(doc.page_content[:200] + "...")  # Print first 200 chars of each source
        print()
My output from Step 8 is below. The model uses RAG: it retrieves the most similar chunks from the FAISS store and answers my query based on that context.
In conclusion, leveraging RAG with URLs is a powerful approach for enhanced knowledge retrieval in diabetes-related queries. By layering real-world insights from community platforms like Reddit on top of open-source LLMs, we can offer personalized and accurate information - after all, who understands it better than those living with it every day?
This method is not only cost-effective but also fosters collaboration, ultimately improving support for individuals managing diabetes. As AI continues to evolve, its potential to improve healthcare and well-being remains vast.
Feature photo by Suzy Hazelwood: https://www.pexels.com/photo/close-up-photo-of-sugar-cubes-in-glass-jar-2523650/