In the previous article, we learned about Retrieval-Augmented Generation (RAG), which has emerged as a powerful technique for enhancing the capabilities of large language models (LLMs). RAG allows LLMs to provide more accurate, relevant, and context-specific answers, mitigating issues like “hallucination” and outdated information.
This article, part two of a three-part series, provides a detailed walkthrough of creating a RAG-powered chat application entirely within your local Python environment. We'll explore how to integrate several key technologies to build an interactive and informative tool:
- Reflex: A modern, pure-Python framework designed for rapidly building and deploying interactive web applications without needing separate frontend expertise (like JavaScript). Its reactive nature simplifies state management.
- LangChain: A comprehensive framework specifically designed for developing applications powered by language models. It provides modular components and chains to streamline complex workflows, such as RAG, making pipeline construction significantly easier.
- Ollama: An increasingly popular tool that enables users to download, run, and manage various open-source LLMs (like Google's Gemma, Meta's Llama series, Mistral models, etc.) directly on their local machine, promoting privacy and offline capabilities.
- FAISS (Facebook AI Similarity Search): A highly optimized library for performing efficient similarity searches on large datasets of vectors. In our RAG context, it's crucial for quickly finding relevant text passages based on the user's query embedding.
- Hugging Face Datasets & Transformers: The de facto standard libraries in the NLP ecosystem. Datasets provides easy access to a vast collection of datasets, while sentence-transformers (built on Transformers) offers convenient ways to generate high-quality text embeddings.
Our objective is to create a web-based chat application where users can pose questions. The application will then:
- Convert the question into a numerical vector (embedding).
- Search a pre-indexed vector store (built from a dataset using FAISS) to find text passages with similar embeddings (i.e., relevant context).
- Provide this retrieved context, along with the original question, to a locally running LLM (via Ollama).
- Display the LLM's generated answer, which is now grounded in the retrieved information, back to the user in the chat interface built in Reflex.
Initial code structure and setup
Note: You can find the complete code at this GitHub repository.
Folder/File Layout
Layout for our RAG application.
rag_app/ # Root folder for this project
│
├── .env
├── requirements.txt
├── rxconfig.py
│
└── rag_gemma_reflex/
├── __init__.py
├── rag_logic.py
├── state.py
└── rag_gemma_reflex.py
Requirements
Make sure to create a virtual environment, using uv or plain pip, and install the following packages:
reflex
langchain
langchain-community
langchain-huggingface
datasets
faiss-cpu
sentence-transformers
ollama
python-dotenv
langchain-ollama
For an easier installation, download the requirements.txt file from here. Then, in your virtual environment, run one of the following commands:
pip install -r requirements.txt
## if you're using uv
uv pip install -r requirements.txt
Each library plays its role:
- reflex builds the interactive frontend
- langchain orchestrates the RAG flow
- datasets provides the knowledge source
- sentence-transformers and faiss-cpu handle the retrieval mechanism
- ollama runs the local LLM
- python-dotenv helps manage configuration
Diving into the code
Downloading the local AI model
Ollama provides many AI models to download and get started with. For this demo we'll be using gemma3:4b-it-qat, a 4-billion-parameter quantized model that uses roughly 3x less RAM than its non-quantized counterpart. You're free to swap in a bigger or smaller model for this demo.
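Before starting the app, pull the model locally; this is the same command the app will suggest later if the model isn't found:
ollama pull gemma3:4b-it-qat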
Dataset we’re using
For this demo we're using the neural-bridge/rag-dataset-12000 dataset from Hugging Face. Many other datasets are available; you can pick a different one and change the name in the settings.
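If you'd like to peek at the data before wiring it into the pipeline, a quick standalone check like the one below (illustrative only) prints a single row; the context, question, and answer fields are the ones the loading code later in this article relies on:
from datasets import load_dataset

# Load a single row to inspect the fields used later (context, question, answer).
sample = load_dataset("neural-bridge/rag-dataset-12000", split="train[:1]")[0]
print(sample.keys())
print(sample["context"][:200])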
Reflex Configuration (rxconfig.py):
This file contains basic settings for the Reflex application. For this project, it’s minimal, primarily just naming the app:
import reflex as rx
# Basic configuration defining the application's name
config = rx.Config(
app_name="rag_app",
)
Implementing the RAG Core Logic
This script is the engine of our application and is responsible for setting up and executing the entire RAG pipeline. You can find the code for this file here.
Key Steps:
DEFAULT_OLLAMA_MODEL = "gemma3:4b-it-qat"
DATASET_NAME = "neural-bridge/rag-dataset-12000"
DATASET_SUBSET_SIZE = 100
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", DEFAULT_OLLAMA_MODEL)
FAISS_INDEX_PATH = "faiss_index_neural_bridge"
- Configuration: Defines constants for the dataset name (neural-bridge/rag-dataset-12000), the embedding model (all-MiniLM-L6-v2), the default Ollama model (gemma3:4b-it-qat), and the FAISS index path. The Ollama model name can also be read from the environment (OLLAMA_MODEL) for flexibility.
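Because python-dotenv is in the requirements and the project layout includes a .env file, these values can be overridden without touching the code. A minimal example .env, assuming the defaults above (OLLAMA_HOST is read later when initializing the LLM):
# .env: optional overrides picked up via python-dotenv
OLLAMA_MODEL=gemma3:4b-it-qat
# Uncomment to point at a non-default Ollama host:
# OLLAMA_HOST=http://localhost:11434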
def load_and_split_data():
"""
Loads the neural-bridge/rag-dataset-12000 dataset and converts
contexts into LangChain Documents.
"""
print(f"Loading dataset '{DATASET_NAME}'...")
try:
if DATASET_SUBSET_SIZE:
print(f"Loading only the first {DATASET_SUBSET_SIZE} entries.")
dataset = load_dataset(DATASET_NAME, split=f"train[:{DATASET_SUBSET_SIZE}]")
else:
print("Loading the full dataset...")
dataset = load_dataset(DATASET_NAME, split="train")
documents = [
Document(
page_content=row["context"],
metadata={"question": row["question"], "answer": row["answer"]},
)
for row in dataset
if row.get("context")
]
print(f"Loaded {len(documents)} documents.")
return documents
except Exception as e:
print(f"Error loading dataset '{DATASET_NAME}': {e}")
print(traceback.format_exc())
return []
- Data Loading (load_and_split_data):
- Loads the specified dataset from Hugging Face (datasets.load_dataset).
- Handles loading a subset (DATASET_SUBSET_SIZE) for faster testing.
- Converts each row's 'context' into a LangChain Document object, storing the 'question' and 'answer' in the metadata. (We initially tried rag-datasets/rag-mini-wikipedia but switched back due to loading complexities.)
def get_embeddings_model():
"""Initializes and returns the HuggingFace embedding model."""
print(f"Loading embedding model '{EMBEDDING_MODEL_NAME}'...")
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": False}
embeddings = HuggingFaceEmbeddings(
model_name=EMBEDDING_MODEL_NAME,
model_kwargs=model_kwargs,
encode_kwargs=encode_kwargs,
)
print("Embedding model loaded.")
return embeddings
- Embeddings (get_embeddings_model): Initializes a sentence transformer model (all-MiniLM-L6-v2) using langchain_huggingface.HuggingFaceEmbeddings to convert text documents into numerical vectors (embeddings).
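As a quick illustration (not part of the app's flow), you can embed a single query and check the vector size; all-MiniLM-L6-v2 produces 384-dimensional embeddings:
# Standalone check of the embedding model defined above.
embeddings = get_embeddings_model()
vector = embeddings.embed_query("What is retrieval-augmented generation?")
print(len(vector))  # 384 for all-MiniLM-L6-v2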
def create_or_load_vector_store(documents, embeddings):
"""Creates a FAISS vector store from documents or loads it if it exists."""
if os.path.exists(FAISS_INDEX_PATH) and os.listdir(FAISS_INDEX_PATH):
print(f"Loading existing FAISS index from '{FAISS_INDEX_PATH}'...")
try:
vector_store = FAISS.load_local(
FAISS_INDEX_PATH,
embeddings,
allow_dangerous_deserialization=True
)
print("FAISS index loaded.")
except Exception as e:
print(f"Error loading FAISS index: {e}")
print("Attempting to rebuild the index...")
vector_store = None
else:
vector_store = None
if vector_store is None:
if not documents:
print("Error: No documents loaded to create FAISS index.")
return None
print("Creating new FAISS index...")
vector_store = FAISS.from_documents(documents, embeddings)
print("FAISS index created.")
print(f"Saving FAISS index to '{FAISS_INDEX_PATH}'...")
try:
vector_store.save_local(FAISS_INDEX_PATH)
print("FAISS index saved.")
except Exception as e:
print(f"Error saving FAISS index: {e}")
return vector_store
- Vector Store (create_or_load_vector_store):
- Uses FAISS (langchain_community.vectorstores.FAISS) to create a vector index from the document embeddings.
- Crucially, it checks if an index already exists at FAISS_INDEX_PATH and loads it to avoid re-processing on every run. If not found, it creates and saves a new index.
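To sanity-check retrieval on its own, you can query the store directly. This standalone snippet (illustrative, reusing the helpers defined above; the query text is arbitrary) prints the top matches:
# Build (or load) the index, then run a raw similarity search.
documents = load_and_split_data()
embeddings = get_embeddings_model()
vector_store = create_or_load_vector_store(documents, embeddings)

results = vector_store.similarity_search("What topics does the dataset cover?", k=3)
for doc in results:
    print(doc.page_content[:120], "...")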
def get_ollama_llm():
"""Initializes and returns the Ollama LLM using the new package."""
global OLLAMA_MODEL
current_ollama_model = os.getenv("OLLAMA_MODEL", DEFAULT_OLLAMA_MODEL)
if OLLAMA_MODEL != current_ollama_model:
print(f"Ollama model changed to '{current_ollama_model}'.")
OLLAMA_MODEL = current_ollama_model
global _rag_chain
_rag_chain = None
print(f"Initializing Ollama LLM with model '{OLLAMA_MODEL}'...")
try:
ollama_client.show(OLLAMA_MODEL)
print(f"Confirmed Ollama model '{OLLAMA_MODEL}' is available locally.")
except ollama_client.ResponseError as e:
if "model not found" in str(e).lower():
print(f"Error: Ollama model '{OLLAMA_MODEL}' not found locally.")
print(f"Please pull it first using: ollama pull {OLLAMA_MODEL}")
return None
else:
print(f"An error occurred while checking the Ollama model: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred while checking Ollama model: {e}")
return None
ollama_base_url = os.getenv("OLLAMA_HOST")
if ollama_base_url:
print(f"Using Ollama host: {ollama_base_url}")
llm = Ollama(model=OLLAMA_MODEL, base_url=ollama_base_url)
else:
print("Using default Ollama host (http://localhost:11434).")
llm = Ollama(model=OLLAMA_MODEL)
print("Ollama LLM initialized.")
return llm
- LLM Initialization (get_ollama_llm):
- Connects to the locally running Ollama service.
- Uses the langchain_ollama.OllamaLLM class (after updating from the deprecated langchain_community version).
- Specifies the desired model (OLLAMA_MODEL, defaulting to gemma3:4b-it-qat).
- Includes error handling to check if the specified model has been pulled in Ollama (ollama pull <model_name>).
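Once the model has been pulled, a quick direct call (illustrative only, outside the RAG chain) confirms the LLM is reachable:
# Standalone check: send one prompt straight to the local model.
llm = get_ollama_llm()
if llm is not None:
    print(llm.invoke("In one sentence, what is retrieval-augmented generation?"))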
def get_rag_chain():
"""Returns the initialized RAG chain, setting it up if necessary."""
if _rag_chain is None:
setup_rag_chain()
if _rag_chain is None:
print("Warning: RAG chain is not available.")
return _rag_chain
- Chain Setup (setup_rag_chain, get_rag_chain):
- Retriever: Creates a retriever from the FAISS vector store (vector_store.as_retriever) to find the top k documents most similar to a user's query.
- Prompt Template: Defines a ChatPromptTemplate instructing the LLM how to answer the question using the provided context. We updated the template variable from {question} to {input} to match what the create_retrieval_chain expects.
- Stuff Documents Chain: Uses create_stuff_documents_chain to take the retrieved documents and the user input, format them into the prompt, and pass them to the LLM.
- Retrieval Chain: Uses create_retrieval_chain to tie the retriever and the stuff documents chain together into the final RAG chain.
- The get_rag_chain function acts as a lazy initializer, setting up the chain only on the first request or if the configuration (like the Ollama model) changes.
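The full setup_rag_chain implementation is in the linked repository; as a rough sketch of how the pieces described above fit together (the k value and the prompt wording here are illustrative assumptions, not the repository's exact values):
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

def setup_rag_chain():
    """Wires retriever -> prompt -> LLM into a RAG chain and caches it globally."""
    global _rag_chain
    documents = load_and_split_data()
    embeddings = get_embeddings_model()
    vector_store = create_or_load_vector_store(documents, embeddings)
    llm = get_ollama_llm()
    if vector_store is None or llm is None:
        _rag_chain = None
        return
    # Retriever: top-k similarity search over the FAISS index.
    retriever = vector_store.as_retriever(search_kwargs={"k": 3})
    # The prompt uses {context} (filled by the stuff-documents chain) and {input}.
    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {input}"
    )
    combine_docs_chain = create_stuff_documents_chain(llm, prompt)
    _rag_chain = create_retrieval_chain(retriever, combine_docs_chain)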
Reflex App (State and UI)
Managing UI State (state.py)
This file defines the Reflex State class, which holds the data for the UI and the logic to handle user interactions. You can find the code here.
import reflex as rx
from . import rag_logic
import traceback
class QA(rx.Base):
"""A question and answer pair."""
question: str
answer: str
is_loading: bool = False
class State(rx.State):
"""Manages the application state for the RAG chat interface."""
question: str = ""
chat_history: list[QA] = []
is_loading: bool = False
async def handle_submit(self):
"""Handles the user submitting a question."""
if not self.question.strip():
return
user_question = self.question
self.chat_history.append(QA(question=user_question, answer="", is_loading=True))
self.question = ""
yield
try:
rag_chain = rag_logic.get_rag_chain()
if rag_chain is None:
raise Exception("RAG chain could not be initialized. Check logs.")
response = await rag_chain.ainvoke({"input": user_question})
answer = response.get("answer", "Sorry, I couldn't find an answer.")
self.chat_history[-1].answer = answer
self.chat_history[-1].is_loading = False
except Exception as e:
print(f"Error processing question: {e}")
print(traceback.format_exc())
self.chat_history[-1].answer = f"An error occurred: {e}. Check the console logs."
self.chat_history[-1].is_loading = False
finally:
if self.chat_history:
self.chat_history[-1].is_loading = False
- question: Stores the current text in the input field.
- chat_history: A list of QA objects (a simple rx.Base model holding question, answer, and loading status) to display the conversation.
- is_loading: A boolean for global loading states (though we primarily used the per-message loading).
- handle_submit: An asynchronous event handler triggered when the user submits the form.
UI in Reflex
For the UI, we're using Reflex components to build the visual chat interface. See the Reflex styling guide to learn more about styling your own Reflex components, and find the code here. Here's the code for rag_gemma_reflex.py:
import reflex as rx
from .state import State, QA
# --- UI Styles ---
colors = {
"background": "#0F0F10",
"text_primary": "#E3E3E3",
"text_secondary": "#BDC1C6",
"input_bg": "#1F1F21",
"input_border": "#3C4043",
"button_bg": "#8AB4F8",
"button_text": "#202124",
"button_hover_bg": "#AECBFA",
"user_bubble_bg": "#3C4043",
"bot_bubble_bg": "#1E1F21",
"bubble_border": "#5F6368",
"loading_text": "#9AA0A6",
"heading_gradient_start": "#8AB4F8",
"heading_gradient_end": "#C3A0F8",
}
base_style = {
"background_color": colors["background"],
"color": colors["text_primary"],
"font_family": "'Roboto', sans-serif",
"font_weight": "200",
"height": "100vh",
"width": "100%",
}
input_style = {
"background_color": colors["input_bg"],
"border": f"1px solid {colors['input_border']}",
"color": colors["text_primary"],
"border_radius": "24px",
"padding": "12px 18px",
"width": "100%",
"font_weight": "400",
"_placeholder": {
"color": colors["text_secondary"],
"font_weight": "300",
},
":focus": {
"border_color": colors["button_bg"],
"box_shadow": f"0 0 0 1px {colors['button_bg']}",
},
}
button_style = {
"background_color": colors["button_bg"],
"color": colors["button_text"],
"border": "none",
"border_radius": "24px",
"padding": "12px 20px",
"cursor": "pointer",
"font_weight": "500",
"font_family": "'Roboto', sans-serif",
"transition": "background-color 0.2s ease",
":hover": {
"background_color": colors["button_hover_bg"],
},
}
chat_box_style = {
"padding": "1em 0",
"flex_grow": 1,
"overflow_y": "auto",
"display": "flex",
"flex_direction": "column-reverse",
"width": "100%",
"&::-webkit-scrollbar": {
"width": "8px",
},
"&::-webkit-scrollbar-track": {
"background": colors["input_bg"],
"border_radius": "4px",
},
"&::-webkit-scrollbar-thumb": {
"background": colors["bubble_border"],
"border_radius": "4px",
},
"&::-webkit-scrollbar-thumb:hover": {
"background": colors["text_secondary"],
},
}
qa_style = {
"margin_bottom": "1em",
"padding": "12px 18px",
"border_radius": "18px",
"word_wrap": "break-word",
"max_width": "85%",
"box_shadow": "0 1px 3px 0 rgba(0, 0, 0, 0.15)",
"line_height": "1.6",
"font_weight": "400",
"code": {
"background_color": "rgba(255, 255, 255, 0.1)",
"padding": "0.2em 0.4em",
"font_size": "85%",
"border_radius": "4px",
"font_family": "monospace",
},
"a": {
"color": colors["button_bg"],
"text_decoration": "underline",
":hover": {
"color": colors["button_hover_bg"],
},
},
"p": {
"margin": "0",
},
}
question_style = {
**qa_style,
"background_color": colors["user_bubble_bg"],
"color": colors["text_primary"],
"align_self": "flex-end",
"border_bottom_right_radius": "4px",
}
answer_style = {
**qa_style,
"background_color": colors["bot_bubble_bg"],
"color": colors["text_primary"],
"align_self": "flex-start",
"border_bottom_left_radius": "4px",
}
loading_style = {
"color": colors["loading_text"],
"font_style": "italic",
"font_weight": "300",
}
# --- UI Components ---
def message_bubble(qa: QA):
"""Displays a single question and its answer."""
return rx.vstack(
rx.box(qa.question, style=question_style),
rx.cond(
qa.is_loading,
rx.box("Thinking...", style={**answer_style, **loading_style}),
rx.markdown(qa.answer, style=answer_style),
),
align_items="stretch",
width="100%",
spacing="1",
)
# --- Main Page ---
def index() -> rx.Component:
"""The main chat interface page."""
heading_style = {
"size": "7",
"margin_bottom": "0.25em",
"font_weight": "400",
"background_image": f"linear-gradient(to right, {colors['heading_gradient_start']}, {colors['heading_gradient_end']})",
"background_clip": "text",
"-webkit-background-clip": "text",
"color": "transparent",
"width": "fit-content",
}
return rx.container(
rx.vstack(
rx.box(
rx.heading("RAG Chat with Gemma", **heading_style),
rx.text(
"Ask a question based on the loaded context.",
color=colors["text_secondary"],
font_weight="300",
),
padding_bottom="0.5em",
width="100%",
text_align="center",
),
rx.box(
rx.foreach(State.chat_history, message_bubble),
style=chat_box_style,
),
rx.form(
rx.hstack(
rx.input(
name="question",
placeholder="Ask your question...",
value=State.question,
on_change=State.set_question,
style=input_style,
flex_grow=1,
height="50px",
),
rx.button(
"Ask",
type="submit",
style=button_style,
is_loading=State.is_loading,
height="50px",
),
width="100%",
align_items="center",
),
on_submit=State.handle_submit,
width="100%",
),
align_items="center",
width="100%",
height="100%",
padding_x="1em",
padding_y="1em",
spacing="4",
),
max_width="900px",
height="100vh",
padding=0,
margin="auto",
)
# --- App Setup ---
stylesheets = [
"https://fonts.googleapis.com/css2?family=Roboto:wght@200;300;400;500&display=swap",
]
app = rx.App(style=base_style, stylesheets=stylesheets)
app.add_page(index, title="Reflex Chat")
Building the Application
Make sure the folder structure described above is followed. Then, from the root directory of the project, run these two commands and your app will be up and running:
reflex init
reflex run
Note: Please ensure that Ollama is up and running before you start the application. A successful startup message would look like this. Some of the messages might appear only after you drop a "Hi" into the chat interface.
Chat Interface
Sending a message to our RAG Application:
And then validating the claim from the dataset.
For this answer, our RAG app responds correctly and provides the supporting context; the dataset screenshot is from Hugging Face. So we can say the RAG pipeline is working well.
How to make this app more accurate/production ready?
For this application to perform better and more accurately, there are a few steps you can take:
- Use a bigger model from Ollama, such as Gemma 27B, Llama 3.3 70B, Qwen 2.5 70B, or DeepSeek V3, for better accuracy.
- Use a dedicated vector database like Qdrant, Pinecone, Milvus, etc. to store and index the embeddings.
- Create a personalized, dedicated dataset instead of an internet corpus for specific use cases. Remember, RAG is only as good as the data you provide; for best results, the data should be well cleaned and prepared for use with AI.
- Visual overhauls: You can improve the chat interface by following Reflex's guides to add more features, extra states, a database to store chats, etc.
Conclusion
Through this iterative process, we successfully constructed a functional, locally-hosted RAG chat application. This project demonstrates the power of combining Reflex for rapid UI development, LangChain for sophisticated LLM workflow orchestration, FAISS for efficient vector search, and Ollama for the privacy and control offered by local LLM inference.