I’ve always been interested in the relationship between code and “The Code” – the legal systems that govern our world. In Singapore, the laws are specific, but for the layman or legal researcher, wading through thick legislation and past changes can be like looking for a needle in a digital haystack.
I chose to create a custom RAG (Retrieval-Augmented Generation) engine to address this problem. But I soon realized that when working with legal data, hallucinations can be costly (and not just frustrating).
The Technical Challenge with Custom RAGs
Traditional RAG architectures are brittle. What if your main model goes down or your context window gets congested? The system breaks. I wanted to create something that was “Apple-level” in its polish but “Industrial-grade” in its robustness.
This led me to build a triple-AI failover backend (sketched in code right after this list):
- Gemini – Primary model powering the webpage.
- Llama 3.3 (via OpenRouter) – Automatically takes over if Gemini hits a rate limit.
- Groq (running Llama 3.3) – Acts as the final fallback if both Gemini and OpenRouter fail.
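Conceptually the cascade is just "try each provider in order and return the first success." Here is a minimal sketch of the idea (my own illustration; the real implementation with the actual clients and prompts is in the walkthrough below):
def ask_with_failover(prompt, providers):
    # Generic failover: ask providers one by one until someone answers
    for name, call in providers:
        try:
            return call(prompt)            # first model that answers wins
        except Exception as e:
            print(f"{name} failed: {e}")   # log it and fall through to the next provider
    return "All models are currently unavailable."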
The engine uses BGE-M3 embeddings with FAISS for semantic search, so it does not just match keywords but also comprehends the meaning of Singaporean legislative language.
Where It Clicked
The moment of truth wasn’t only in the search results but also in the UI. I shifted away from the “chatbox” style and opted for a high-precision, interactive UI that resembles more of a research tool than a playground.
Technical Information
- Uses the BGE-M3 embedding model to convert PDFs from Gov.sg into a vector database. It also converts user queries into vectors and retrieves relevant documents using relative vector distance.
- Interaction with the API URL provided by Hugging Face is managed through Flask.
- Multiple models are used to keep the AI running 24/7, so if one crashes, others automatically handle the request.
- 594 PDFs were converted into a vector database using Google Colab with a Tesla T4 GPU, completing in approximately 55 minutes.
- Different modes are used to handle different types of queries:
  - Mode 1 (Travel, food, culture, infrastructure, etc.): Responses are generated using general knowledge.
  - Mode 2 (Law): Legal queries are handled using the RAG system for grounded answers.
- To build the vector database, each PDF was divided into smaller text chunks to reduce computational load and processing time (a sketch of this ingestion step follows the list).
- The database currently contains over 30,000 pages, giving the engine broad coverage for well-grounded answers.
- The BGE-M3 embedding model runs locally and smoothly on a system with 12GB RAM and a 2-core CPU (default Hugging Face plan).
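Here is a minimal sketch of that ingestion step (my own reconstruction, not the actual Colab notebook; the loader, chunk sizes, and folder names are illustrative):
# Build the FAISS index from the Gov.sg PDFs: load -> chunk -> embed with BGE-M3 -> save
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS

docs = PyPDFDirectoryLoader("pdfs/").load()              # folder of the Gov.sg PDFs (path is illustrative)
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100                   # chunk sizes are assumptions, not the project's exact values
).split_documents(docs)

embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={"device": "cuda"},                     # Tesla T4 in Colab; use "cpu" if you have no GPU
    encode_kwargs={"normalize_embeddings": True},
)

vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("faiss_index_bgem3")              # the folder the backend loads later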
Check the product out!
I have decided to open-source the engine. Whether you are interested in LegalTech, analyzing RAG architectures, or simply want to see how a multi-model failover is implemented in Python, you should take a look.
- Live Experience: https://adityaprasad-sudo.github.io/ExploreSingapore/
- Deep Dive into the Code: https://github.com/adityaprasad-sudo/ExploreSingapore
What's Next?
This is only the beginning. As Singapore’s policies change, so will the engine. I am currently researching how to build a more accurate text embedding model, which would make the research even more autonomous.
Under The Hood
1. Import dependencies and the tools we need
import os
import time
import random
import numpy as np
from flask import Flask, request, jsonify
from flask_cors import CORS
from dotenv import load_dotenv
from google import genai                  # the google-genai SDK (provides genai.Client used below)
from google.genai import types
from openai import OpenAI                 # OpenRouter and Groq are called through the OpenAI-compatible client
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS
i. os, time, numpy: just standard Python stuff. os handles file paths so there is no confusion about directories, and numpy is for the math when working with vectors.
ii. flask, jsonify: this creates the server. We use jsonify to output the data as JSON so the frontend can use it.
iii. flask_cors: this is super important. CORS allows the frontend (which is usually on a different domain) to talk to this backend without the browser blocking it.
iv. dotenv: this loads the secret keys from the .env file so I don't leak my API keys on GitHub lol.
v. FAISS: this is basically the vector database from Facebook, and it is really fast for similarity search.
2. setup and keys
For this step, we're initializing the keys.
# set api keys
load_dotenv()
API_KEY = os.getenv("GENAI_API_KEY")
# add the backup models' api keys the same way (the env variable names here are illustrative)
OR_API_KEY = os.getenv("OPENROUTER_API_KEY")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")

# initialize clients
OR_CLIENT = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=OR_API_KEY) if OR_API_KEY else None
GROQ_CLIENT = OpenAI(base_url="https://api.groq.com/openai/v1", api_key=GROQ_API_KEY) if GROQ_API_KEY else None
i. load_dotenv(): reads the .env file.
ii. OR_CLIENT / GROQ_CLIENT: here I'm creating the clients. Notice I'm using the standard OpenAI client structure but changing the base_url; it's a trick that lets me call OpenRouter and Groq through the same interface as ChatGPT (see the sketch below).
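To make that concrete, here is a small helper of my own (not code from the repo) showing how the same call shape works against either provider; the model names match the ones used later in this post:
# One helper, two providers: only the client (base_url) and the model name change
def chat(client, model, system_msg, user_msg):
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
        temperature=0.3,
        max_tokens=300,
    )
    return resp.choices[0].message.content

# e.g. chat(OR_CLIENT, "meta-llama/llama-3.3-70b-instruct:free", "You are helpful.", "Hi")
#      chat(GROQ_CLIENT, "llama-3.3-70b-versatile", "You are helpful.", "Hi")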
3. app init and the 'brain'
Setting up the Flask app and the embedding model.
app = Flask(__name__)
CORS(app, resources={r"/*": {"origins": "*"}})  # allow CORS for all routes so the browser doesn't block our API

# This block sets up the "memory" (the vector data) which is then used for search
print("Initializing AI Memory...")  # I use this for debugging

# This AI model runs locally on the backend platform, SO MAKE SURE YOUR PLATFORM HAS SUFFICIENT POWER AND RAM
# It must match the model used to build the vector data
embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}  # crucial for BGE performance
)
i. CORS(...): allowing all origins (*), because I was lazy and just wanted it to work during testing.
ii. HuggingFaceBgeEmbeddings: this is the model that transforms text into numbers (see the quick check below).
iii. device: 'cpu': I forced it to run on the CPU because the server I am using is free tier and has no GPU. It runs just a bit slower.
iv. normalize_embeddings: keeps the math correct for cosine similarity.
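As a quick illustration (my own snippet, not part of the original code), you can embed a query directly and inspect the vector; BGE-M3's dense embeddings should come out 1024-dimensional:
# Turn a question into a normalized vector and peek at it
vec = embeddings.embed_query("What is the penalty for littering in Singapore?")
print(len(vec))   # expected: 1024 dimensions for BGE-M3
print(vec[:5])    # first few floats of the (normalized) embedding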
4. Loading the vector database
This part looks for the folder where I saved the FAISS index.
# 1. Get the absolute path of the directory where this script (app.py) is located
script_dir = os.path.dirname(os.path.abspath(__file__))
# 2. Join it with your folder name to get the FULL path
vector_FOLDER = os.path.join(script_dir, "faiss_index_bgem3")
print(f"Looking for vector DB at: {vector_FOLDER}") # Debug print
vectorstore = None
if os.path.exists(vector_FOLDER):
    try:
        vectorstore = FAISS.load_local(
            vector_FOLDER,
            embeddings,
            allow_dangerous_deserialization=True
        )
        print("Vector Database Loaded Successfully")
    except Exception as e:
        print(f"Error while loading vector DB: {e}")
else:
    # This will now tell you EXACTLY where it looked and failed
    print(f"Folder NOT found at: {vector_FOLDER}")
    print(f"Current Working Directory is: {os.getcwd()}")
i. load_local: this is the command that actually loads the "memory" into RAM.
ii. allow_dangerous_deserialization: yes, pickle files can be dangerous to load, but since I created this index myself, it is safe to use.
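If you want a quick sanity check after loading, here is a small snippet of my own (the LangChain FAISS wrapper exposes the raw index as .index) that prints how many vectors made it into memory:
# Confirm the index actually contains vectors after load_local
if vectorstore is not None:
    print(f"Vectors loaded into memory: {vectorstore.index.ntotal}")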
5. The search function
This is where the retrieval magic happens.
def findbestmatch(query):
    """
    Searches the vector database for the most relevant document (if any).
    Returns: (content_text, confidence, source_name, page_num)
    """
    if vectorstore is None:
        # most probable cause: the vector DB failed to load, or the database is corrupted
        return None, 0.0, "System Error", 0
    try:
        # Search for the top 1 closest chunk, i.e. the most relevant one
        # Note: FAISS (the vector database) defaults to L2 distance (lower is better)
        # A distance of 0 is a perfect match; a distance > 1.5 is usually irrelevant.
        # (This is the distance from the user query's vector to the retrieved document's vector.)
        results = vectorstore.similarity_search_with_score(query, k=1)
        if not results:
            return None, 0.0, None, 0
        doc, score = results[0]
        page_num = doc.metadata.get('page', 0)
        source_name = doc.metadata.get('source', 'Unknown PDF')
        # Convert the distance to an approximate "confidence score":
        # a distance of 0 maps to 1.0 (100% confidence), a distance of 1 maps to 0.5 (50%).
        # The score is the raw vector distance; confidence is just a friendlier way to express
        # how sure the AI is based on that distance.
        confidence = 1.0 / (1.0 + score)
        return doc.page_content, confidence, source_name, page_num
    except Exception as e:
        print(f"Search Error: {e}")  # usually happens when server RAM is exhausted,
                                     # which stops our local BGE-M3 model from working
        return None, 0.0, None, 0
i. Math stuff: FAISS returns a raw distance (0 = identical; the larger the distance, the less related the chunk). Humans do not think in distances, so I convert it with 1 / (1 + score) into a confidence value between 0 and 1 (quick sanity check below).
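To make the mapping concrete, here is a tiny check (my own snippet) of what a few distances turn into:
# Distance-to-confidence mapping used in findbestmatch
for d in (0.0, 0.5, 1.0, 1.5):
    print(f"distance={d:.1f} -> confidence={1.0 / (1.0 + d):.2f}")
# distance=0.0 -> confidence=1.00
# distance=0.5 -> confidence=0.67
# distance=1.0 -> confidence=0.50
# distance=1.5 -> confidence=0.40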
6. System Instructions
The large instruction strings—instructionsgemini, instructionsopenrouter, and instructionsgroq (not included here due to length; see the GitHub repository)—define the system-level behavior for each model.
They function as structured prompts that establish role and constraints, such as: “You are a Singapore expert,” “Do not hallucinate,” and “Use only the provided context.” These strict guardrails are designed to ensure accurate, grounded responses.
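For a feel of what these look like, here is a purely illustrative sketch (not one of the actual strings from the repo, which are much longer) of the kind of guardrails described above:
# Illustrative only -- the real instruction strings live in the GitHub repo
instructionsgemini = """
You are a Singapore expert and legal research assistant.
Rules:
1. For legal questions, use ONLY the information inside <context>.
2. If the context does not answer the question, say so; do not guess.
3. Cite the SOURCE and PAGE_NUMBER you relied on.
4. Never hallucinate statutes, sections, or penalties.
"""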
7. the api route (/ask) and the prompt finalizer
This is the endpoint the frontend hits.
# API route
@app.route('/ask', methods=['POST'])  # /ask is where the frontend sends the prompt for the chat-generating AI
def ask_gemini():
    data = request.get_json(force=True, silent=True) or {}
    user_query = data.get('query', '')
    if not user_query:  # if enter is pressed without typing a prompt
        return jsonify({"answer": "Please ask a question."})

    # retrieving the info our search model gave us
    contexttext = ""
    sourcename = "General Knowledge"

    # this performs the search based on the user query (the user prompt)
    found_text, score, src, page_num = findbestmatch(user_query)

    # if the retrieved content has a confidence greater than 0.5, use it
    if found_text and score > 0.5:
        print(f"Match Found: {src} (Confidence: {score:.2f})")
        contexttext = found_text
        sourcename = src
        final_page_num = page_num
        context_block = f"""
<document>
SOURCE: {src}
PAGE_NUMBER: {page_num}
CONTENT: {found_text}
</document>
"""
    else:
        context_block = "No relevant document found."

    # Final prompt with explicit page structure
    finalprompt = f"""
Use the following verified document fragment to answer the question.
{context_block}
USER QUESTION: {user_query}
REQUIREMENT: You MUST cite the [PAGE_NUMBER] provided above in your answer.
"""

    # Constructing the final prompt for our chat-generation AI
    trinityprompt = f"""
<context>
SOURCE: {sourcename}
PAGE_NUMBER: {page_num}
CONTENT: {contexttext}
</context>
USER_QUERY: {user_query}
INSTRUCTION: Analyze the <context> to answer the USER_QUERY. Follow the System Role instructions exactly.
"""
i. Threshold: I added a condition requiring the confidence score to be greater than 0.5. If the best match falls below that threshold, the context block becomes "No relevant document found." instead of passing weak context to the AI and risking a misleading response.
8. the multi-model logic (the cascade)
This is the coolest part. It is a failover system.
Step 1: Try Gemini (primary).
if API_KEY:
    try:
        print("attempting primary (gemini)")
        client = genai.Client(api_key=API_KEY)
        response = client.models.generate_content(
            model="gemini-2.5-flash",  # using the latest available model
            contents=[instructionsgemini, trinityprompt],  # feeding the AI the system instructions and the final prompt
            config=types.GenerateContentConfig(temperature=0.3, max_output_tokens=300)
        )
        return jsonify({"answer": response.text})
    except Exception as e:
        print(f"Gemini Failed: {e}")  # would probably fail if we hit the Gemini rate limit
i. I try Gemini Flash first since it is fast. If it succeeds, the answer is returned and the function terminates; if it throws (most likely a rate limit), we fall through to the backup.
Step 2: Try OpenRouter (Backup)
if OR_CLIENT:
    try:
        print("attempting backup (openRouter)")
        response = OR_CLIENT.chat.completions.create(
            model="meta-llama/llama-3.3-70b-instruct:free",  # fast & free
            messages=[
                {"role": "system", "content": instructionsopenrouter},
                {"role": "user", "content": trinityprompt}
            ],
            temperature=0.3,  # controls the creativity of the AI
            max_tokens=300    # the maximum number of tokens the AI can output
        )
        return jsonify({"answer": f"{response.choices[0].message.content}\n\n"})
    except Exception as e:
        print(f"OpenRouter Failed: {e}")  # would fail if the model is down
Step 3: Try Groq (last resort)
if GROQ_CLIENT:
    try:
        print("attempting tertiary (groq)")
        response = GROQ_CLIENT.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=[
                {"role": "system", "content": instructionsgroq},
                {"role": "user", "content": trinityprompt}
            ],
            temperature=0.3,
            max_tokens=300
        )
        return jsonify({"answer": f"{response.choices[0].message.content}\n\n"})
    except Exception as e:
        print(f"groq failed: {e}")

# If every model failed, return a graceful message instead of None (which would make Flask error out)
return jsonify({"answer": "All AI models are currently unavailable. Please try again later."})
9. running it
if __name__ == '__main__':
    app.run(host="0.0.0.0", port=7860)
The host 0.0.0.0 is what makes cloud deployment work, for example on Hugging Face Spaces. If you leave the default (localhost), the container will not accept outside requests and the app will not work.
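Once it is running, you can test the /ask endpoint with a small script like this (my own example; swap localhost for your Space's URL when deployed):
# Quick local test of the /ask endpoint
import requests

resp = requests.post(
    "http://localhost:7860/ask",
    json={"query": "What does the Personal Data Protection Act cover?"},
)
print(resp.json()["answer"])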
Thank you for reading this far.
