In the previous article, we learned about Retrieval-Augmented Generation (RAG), which has emerged as a powerful technique for enhancing the capabilities of large language models (LLMs). RAG allows LLMs to provide more accurate, relevant, and context-specific answers, mitigating issues like “hallucination” and outdated information.
This article, part two of a three-part series, provides a detailed walkthrough of creating a RAG-powered chat application entirely within your local Python environment. We'll explore how to integrate several key technologies to build an interactive and informative tool:
- Reflex: A modern, pure-Python framework designed for rapidly building and deploying interactive web applications without needing separate frontend expertise (like JavaScript). Its reactive nature simplifies state management.
- LangChain: A comprehensive framework specifically designed for developing applications powered by language models. It provides modular components and chains to streamline complex workflows, such as RAG, making pipeline construction significantly easier.
- Ollama: An increasingly popular tool that enables users to download, run, and manage various open-source LLMs (like Google's Gemma, Meta's Llama series, Mistral models, etc.) directly on their local machine, promoting privacy and offline capabilities.
- FAISS (Facebook AI Similarity Search): A highly optimized library for performing efficient similarity searches on large datasets of vectors. In our RAG context, it's crucial for quickly finding relevant text passages based on the user's query embedding.
- Hugging Face Datasets & Transformers: The de facto standard libraries in the NLP ecosystem. Datasets provides easy access to a vast collection of datasets, while sentence-transformers (built on Transformers) offers convenient ways to generate high-quality text embeddings.
Our objective is to create a web-based chat application where users can pose questions. The application will then:
- Convert the question into a numerical vector (embedding).
- Search a pre-indexed vector store (built from a dataset using FAISS) to find text passages with similar embeddings (i.e., relevant context).
- Provide this retrieved context, along with the original question, to a locally running LLM (via Ollama).
- Display the LLM's generated answer, which is now grounded in the retrieved information, back to the user in the chat interface built in Reflex.
Initial code structure and setup
Note: You can find the complete code at this GitHub repository.
Folder/File Layout
Layout for our RAG application.
rag_app/ # Root folder for this project
│
├── .env
├── requirements.txt
├── rxconfig.py
│
└── rag_gemma_reflex/
├── __init__.py
├── rag_logic.py
├── state.py
└── rag_gemma_reflex.py
Requirements
Make sure to create a virtual environment, using uv or plain pip, and install the following packages:
reflex
langchain
langchain-community
langchain-huggingface
datasets
faiss-cpu
sentence-transformers
ollama
python-dotenv
langchain-ollama
For an easier installation, download the requirements.txt file from here. Then, in your virtual environment, run one of the following commands:
pip install -r requirements.txt
## if you're using uv
uv pip install -r requirements.txt
Each library plays its role:
- reflex builds the interactive frontend
- langchain orchestrates the RAG flow
- datasets provides the knowledge source
- sentence-transformers and faiss-cpu handle the retrieval mechanism
- ollama runs the local LLM
- python-dotenv helps manage configuration
Diving into the code
Downloading the local AI model
Ollama provides many AI models to download and get started with. For this demo we'll be using gemma3:4b-it-qat, a 4-billion-parameter quantized model that uses roughly 3x less RAM than its non-quantized counterpart. You're free to swap in a bigger or smaller model for this demo.
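Before starting the app, pull the model locally; this is the same command the app will suggest later if the model isn't found:
ollama pull gemma3:4b-it-qat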
Dataset we’re using
For this demo we're using the neural-bridge/rag-dataset-12000 dataset from Hugging Face. Many other datasets are available; you can pick a different one and change the name in the settings.
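If you'd like to peek at the data before wiring it into the pipeline, a quick standalone check like the one below (illustrative only) prints a single row; the context, question, and answer fields are the ones the loading code later in this article relies on:
from datasets import load_dataset

# Load a single row to inspect the fields used later (context, question, answer).
sample = load_dataset("neural-bridge/rag-dataset-12000", split="train[:1]")[0]
print(sample.keys())
print(sample["context"][:200])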
Reflex Configuration (rxconfig.py):
This file contains basic settings for the Reflex application. For this project, it’s minimal, primarily just naming the app:
import reflex as rx
# Basic configuration defining the application's name
config = rx.Config(
app_name="rag_app",
)
Implementing the RAG Core Logic
This script is the engine of our application and is responsible for setting up and executing the entire RAG pipeline. You can find the code for this file here.
Key Steps:
DEFAULT_OLLAMA_MODEL = "gemma3:4b-it-qat"
DATASET_NAME = "neural-bridge/rag-dataset-12000"
DATASET_SUBSET_SIZE = 100
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", DEFAULT_OLLAMA_MODEL)
FAISS_INDEX_PATH = "faiss_index_neural_bridge"
- Configuration: Defines constants for the dataset name (neural-bridge/rag-dataset-12000), the embedding model (all-MiniLM-L6-v2), the default Ollama model (gemma3:4b-it-qat), and the FAISS index path. The Ollama model name can also be read from the environment (OLLAMA_MODEL) for flexibility.
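Because python-dotenv is in the requirements and the project layout includes a .env file, these values can be overridden without touching the code. A minimal example .env, assuming the defaults above (OLLAMA_HOST is read later when initializing the LLM):
# .env: optional overrides picked up via python-dotenv
OLLAMA_MODEL=gemma3:4b-it-qat
# Uncomment to point at a non-default Ollama host:
# OLLAMA_HOST=http://localhost:11434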
def load_and_split_data():
"""
Loads the neural-bridge/rag-dataset-12000 dataset and converts
contexts into LangChain Documents.
"""
print(f"Loading dataset '{DATASET_NAME}'...")
try:
if DATASET_SUBSET_SIZE:
print(f"Loading only the first {DATASET_SUBSET_SIZE} entries.")
dataset = load_dataset(DATASET_NAME, split=f"train[:{DATASET_SUBSET_SIZE}]")
else:
print("Loading the full dataset...")
dataset = load_dataset(DATASET_NAME, split="train")
documents = [
Document(
page_content=row["context"],
metadata={"question": row["question"], "answer": row["answer"]},
)
for row in dataset
if row.get("context")
]
print(f"Loaded {len(documents)} documents.")
return documents
except Exception as e:
print(f"Error loading dataset '{DATASET_NAME}': {e}")
print(traceback.format_exc())
return []
- Data Loading (load_and_split_data):
- Loads the specified dataset from Hugging Face (datasets.load_dataset).
- Handles loading a subset (DATASET_SUBSET_SIZE) for faster testing.
- Converts each row's 'context' into a LangChain Document object, storing the 'question' and 'answer' in the metadata. (We initially tried rag-datasets/rag-mini-wikipedia but switched back due to loading complexities.)
def get_embeddings_model():
"""Initializes and returns the HuggingFace embedding model."""
print(f"Loading embedding model '{EMBEDDING_MODEL_NAME}'...")
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": False}
embeddings = HuggingFaceEmbeddings(
model_name=EMBEDDING_MODEL_NAME,
model_kwargs=model_kwargs,
encode_kwargs=encode_kwargs,
)
print("Embedding model loaded.")
return embeddings
- Embeddings (get_embeddings_model): Initializes a sentence transformer model (all-MiniLM-L6-v2) using langchain_huggingface.HuggingFaceEmbeddings to convert text documents into numerical vectors (embeddings).
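As a quick illustration (not part of the app's flow), you can embed a single query and check the vector size; all-MiniLM-L6-v2 produces 384-dimensional embeddings:
# Standalone check of the embedding model defined above.
embeddings = get_embeddings_model()
vector = embeddings.embed_query("What is retrieval-augmented generation?")
print(len(vector))  # 384 for all-MiniLM-L6-v2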
def create_or_load_vector_store(documents, embeddings):
"""Creates a FAISS vector store from documents or loads it if it exists."""
if os.path.exists(FAISS_INDEX_PATH) and os.listdir(FAISS_INDEX_PATH):
print(f"Loading existing FAISS index from '{FAISS_INDEX_PATH}'...")
try:
vector_store = FAISS.load_local(
FAISS_INDEX_PATH,
embeddings,
allow_dangerous_deserialization=True
)
print("FAISS index loaded.")
except Exception as e:
print(f"Error loading FAISS index: {e}")
print("Attempting to rebuild the index...")
vector_store = None
else:
vector_store = None
if vector_store is None:
if not documents:
print("Error: No documents loaded to create FAISS index.")
return None
print("Creating new FAISS index...")
vector_store = FAISS.from_documents(documents, embeddings)
print("FAISS index created.")
print(f"Saving FAISS index to '{FAISS_INDEX_PATH}'...")
try:
vector_store.save_local(FAISS_INDEX_PATH)
print("FAISS index saved.")
except Exception as e:
print(f"Error saving FAISS index: {e}")
return vector_store
- Vector Store (create_or_load_vector_store):
- Uses FAISS (langchain_community.vectorstores.FAISS) to create a vector index from the document embeddings.
- Crucially, it checks if an index already exists at FAISS_INDEX_PATH and loads it to avoid re-processing on every run. If not found, it creates and saves a new index.
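To sanity-check retrieval on its own, you can query the store directly. This standalone snippet (illustrative, reusing the helpers defined above; the query text is arbitrary) prints the top matches:
# Build (or load) the index, then run a raw similarity search.
documents = load_and_split_data()
embeddings = get_embeddings_model()
vector_store = create_or_load_vector_store(documents, embeddings)

results = vector_store.similarity_search("What topics does the dataset cover?", k=3)
for doc in results:
    print(doc.page_content[:120], "...")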
def get_ollama_llm():
"""Initializes and returns the Ollama LLM using the new package."""
global OLLAMA_MODEL
current_ollama_model = os.getenv("OLLAMA_MODEL", DEFAULT_OLLAMA_MODEL)
if OLLAMA_MODEL != current_ollama_model:
print(f"Ollama model changed to '{current_ollama_model}'.")
OLLAMA_MODEL = current_ollama_model
global _rag_chain
_rag_chain = None
print(f"Initializing Ollama LLM with model '{OLLAMA_MODEL}'...")
try:
ollama_client.show(OLLAMA_MODEL)
print(f"Confirmed Ollama model '{OLLAMA_MODEL}' is available locally.")
except ollama_client.ResponseError as e:
if "model not found" in str(e).lower():
print(f"Error: Ollama model '{OLLAMA_MODEL}' not found locally.")
print(f"Please pull it first using: ollama pull {OLLAMA_MODEL}")
return None
else:
print(f"An error occurred while checking the Ollama model: {e}")
return None
except Exception as e:
print(f"An unexpected error occurred while checking Ollama model: {e}")
return None
ollama_base_url = os.getenv("OLLAMA_HOST")
if ollama_base_url:
print(f"Using Ollama host: {ollama_base_url}")
llm = Ollama(model=OLLAMA_MODEL, base_url=ollama_base_url)
else:
print("Using default Ollama host (http://localhost:11434).")
llm = Ollama(model=OLLAMA_MODEL)
print("Ollama LLM initialized.")
return llm
- LLM Initialization (get_ollama_llm):
- Connects to the locally running Ollama service.
- Uses the langchain_ollama.OllamaLLM class (after updating from the deprecated langchain_community version).
- Specifies the desired model (OLLAMA_MODEL, defaulting to gemma3:4b-it-qat).
- Includes error handling to check if the specified model has been pulled in Ollama (ollama pull <model_name>).
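Once the model has been pulled, a quick direct call (illustrative only, outside the RAG chain) confirms the LLM is reachable:
# Standalone check: send one prompt straight to the local model.
llm = get_ollama_llm()
if llm is not None:
    print(llm.invoke("In one sentence, what is retrieval-augmented generation?"))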
def get_rag_chain():
"""Returns the initialized RAG chain, setting it up if necessary."""
if _rag_chain is None:
setup_rag_chain()
if _rag_chain is None:
print("Warning: RAG chain is not available.")
return _rag_chain
- Chain Setup (setup_rag_chain, get_rag_chain):
- Retriever: Creates a retriever from the FAISS vector store (vector_store.as_retriever) to find the top k documents most similar to a user's query.
- Prompt Template: Defines a ChatPromptTemplate instructing the LLM how to answer the question using the provided context. We updated the template variable from {question} to {input} to match what the create_retrieval_chain expects.
- Stuff Documents Chain: Uses create_stuff_documents_chain to take the retrieved documents and the user input, format them into the prompt, and pass them to the LLM.
- Retrieval Chain: Uses create_retrieval_chain to tie the retriever and the stuff documents chain together into the final RAG chain.
- The get_rag_chain function acts as a lazy initializer, setting up the chain only on the first request or if the configuration (like the Ollama model) changes.
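The full setup_rag_chain implementation is in the linked repository; as a rough sketch of how the pieces described above fit together (the k value and the prompt wording here are illustrative assumptions, not the repository's exact values):
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

def setup_rag_chain():
    """Wires retriever -> prompt -> LLM into a RAG chain and caches it globally."""
    global _rag_chain
    documents = load_and_split_data()
    embeddings = get_embeddings_model()
    vector_store = create_or_load_vector_store(documents, embeddings)
    llm = get_ollama_llm()
    if vector_store is None or llm is None:
        _rag_chain = None
        return
    # Retriever: top-k similarity search over the FAISS index.
    retriever = vector_store.as_retriever(search_kwargs={"k": 3})
    # The prompt uses {context} (filled by the stuff-documents chain) and {input}.
    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {input}"
    )
    combine_docs_chain = create_stuff_documents_chain(llm, prompt)
    _rag_chain = create_retrieval_chain(retriever, combine_docs_chain)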
Reflex App (State and UI)
Managing UI State (state.py)
This file defines the Reflex State class, which holds the data for the UI and the logic to handle user interactions. You can find the code here.
import reflex as rx
from . import rag_logic
import traceback
class QA(rx.Base):
"""A question and answer pair."""
question: str
answer: str
is_loading: bool = False
class State(rx.State):
"""Manages the application state for the RAG chat interface."""
question: str = ""
chat_history: list[QA] = []
is_loading: bool = False
async def handle_submit(self):
"""Handles the user submitting a question."""
if not self.question.strip():
return
user_question = self.question
self.chat_history.append(QA(question=user_question, answer="", is_loading=True))
self.question = ""
yield
try:
rag_chain = rag_logic.get_rag_chain()
if rag_chain is None:
raise Exception("RAG chain could not be initialized. Check logs.")
response = await rag_chain.ainvoke({"input": user_question})
answer = response.get("answer", "Sorry, I couldn't find an answer.")
self.chat_history[-1].answer = answer
self.chat_history[-1].is_loading = False
except Exception as e:
print(f"Error processing question: {e}")
print(traceback.format_exc())
self.chat_history[-1].answer = f"An error occurred: {e}. Check the console logs."
self.chat_history[-1].is_loading = False
finally:
if self.chat_history:
self.chat_history[-1].is_loading = False
- question: Stores the current text in the input field.
- chat_history: A list of QA objects (a simple rx.Base model holding question, answer, and loading status) to display the conversation.
- is_loading: A boolean for global loading states (though we primarily used the per-message loading).
- handle_submit: An asynchronous event handler triggered when the user submits the form.
UI in Reflex
For the UI, we're using Reflex components to build the visual chat interface. See the Reflex styling guide to learn more about styling your own Reflex components, and find the code here. Here's the code for rag_gemma_reflex.py:
import reflex as rx
from .state import State, QA
# --- UI Styles ---
colors = {
"background": "#0F0F10",
"text_primary": "#E3E3E3",
"text_secondary": "#BDC1C6",
"input_bg": "#1F1F21",
"input_border": "#3C4043",
"button_bg": "#8AB4F8",
"button_text": "#202124",
"button_hover_bg": "#AECBFA",
"user_bubble_bg": "#3C4043",
"bot_bubble_bg": "#1E1F21",
"bubble_border": "#5F6368",
"loading_text": "#9AA0A6",
"heading_gradient_start": "#8AB4F8",
"heading_gradient_end": "#C3A0F8",
}
base_style = {
"background_color": colors["background"],
"color": colors["text_primary"],
"font_family": "'Roboto', sans-serif",
"font_weight": "200",
"height": "100vh",
"width": "100%",
}
input_style = {
"background_color": colors["input_bg"],
"border": f"1px solid {colors['input_border']}",
"color": colors["text_primary"],
"border_radius": "24px",
"padding": "12px 18px",
"width": "100%",
"font_weight": "400",
"_placeholder": {
"color": colors["text_secondary"],
"font_weight": "300",
},
":focus": {
"border_color": colors["button_bg"],
"box_shadow": f"0 0 0 1px {colors['button_bg']}",
},
}
button_style = {
"background_color": colors["button_bg"],
"color": colors["button_text"],
"border": "none",
"border_radius": "24px",
"padding": "12px 20px",
"cursor": "pointer",
"font_weight": "500",
"font_family": "'Roboto', sans-serif",
"transition": "background-color 0.2s ease",
":hover": {
"background_color": colors["button_hover_bg"],
},
}
chat_box_style = {
"padding": "1em 0",
"flex_grow": 1,
"overflow_y": "auto",
"display": "flex",
"flex_direction": "column-reverse",
"width": "100%",
"&::-webkit-scrollbar": {
"width": "8px",
},
"&::-webkit-scrollbar-track": {
"background": colors["input_bg"],
"border_radius": "4px",
},
"&::-webkit-scrollbar-thumb": {
"background": colors["bubble_border"],
"border_radius": "4px",
},
"&::-webkit-scrollbar-thumb:hover": {
"background": colors["text_secondary"],
},
}
qa_style = {
"margin_bottom": "1em",
"padding": "12px 18px",
"border_radius": "18px",
"word_wrap": "break-word",
"max_width": "85%",
"box_shadow": "0 1px 3px 0 rgba(0, 0, 0, 0.15)",
"line_height": "1.6",
"font_weight": "400",
"code": {
"background_color": "rgba(255, 255, 255, 0.1)",
"padding": "0.2em 0.4em",
"font_size": "85%",
"border_radius": "4px",
"font_family": "monospace",
},
"a": {
"color": colors["button_bg"],
"text_decoration": "underline",
":hover": {
"color": colors["button_hover_bg"],
},
},
"p": {
"margin": "0",
},
}
question_style = {
**qa_style,
"background_color": colors["user_bubble_bg"],
"color": colors["text_primary"],
"align_self": "flex-end",
"border_bottom_right_radius": "4px",
}
answer_style = {
**qa_style,
"background_color": colors["bot_bubble_bg"],
"color": colors["text_primary"],
"align_self": "flex-start",
"border_bottom_left_radius": "4px",
}
loading_style = {
"color": colors["loading_text"],
"font_style": "italic",
"font_weight": "300",
}
# --- UI Components ---
def message_bubble(qa: QA):
"""Displays a single question and its answer."""
return rx.vstack(
rx.box(qa.question, style=question_style),
rx.cond(
qa.is_loading,
rx.box("Thinking...", style={**answer_style, **loading_style}),
rx.markdown(qa.answer, style=answer_style),
),
align_items="stretch",
width="100%",
spacing="1",
)
# --- Main Page ---
def index() -> rx.Component:
"""The main chat interface page."""
heading_style = {
"size": "7",
"margin_bottom": "0.25em",
"font_weight": "400",
"background_image": f"linear-gradient(to right, {colors['heading_gradient_start']}, {colors['heading_gradient_end']})",
"background_clip": "text",
"-webkit-background-clip": "text",
"color": "transparent",
"width": "fit-content",
}
return rx.container(
rx.vstack(
rx.box(
rx.heading("RAG Chat with Gemma", **heading_style),
rx.text(
"Ask a question based on the loaded context.",
color=colors["text_secondary"],
font_weight="300",
),
padding_bottom="0.5em",
width="100%",
text_align="center",
),
rx.box(
rx.foreach(State.chat_history, message_bubble),
style=chat_box_style,
),
rx.form(
rx.hstack(
rx.input(
name="question",
placeholder="Ask your question...",
value=State.question,
on_change=State.set_question,
style=input_style,
flex_grow=1,
height="50px",
),
rx.button(
"Ask",
type="submit",
style=button_style,
is_loading=State.is_loading,
height="50px",
),
width="100%",
align_items="center",
),
on_submit=State.handle_submit,
width="100%",
),
align_items="center",
width="100%",
height="100%",
padding_x="1em",
padding_y="1em",
spacing="4",
),
max_width="900px",
height="100vh",
padding=0,
margin="auto",
)
# --- App Setup ---
stylesheets = [
"https://fonts.googleapis.com/css2?family=Roboto:wght@200;300;400;500&display=swap",
]
app = rx.App(style=base_style, stylesheets=stylesheets)
app.add_page(index, title="Reflex Chat")
Building the Application
Make sure the folder structure described above is followed. Then, from the root directory of the project, run these two commands and your app will be up and running:
reflex init
reflex run
Note: Please ensure that Ollama is up and running before you start the application. A successful startup message would look like this. Some of the messages might appear only after you drop a "Hi" into the chat interface.
Chat Interface
Sending a message to our RAG Application:
And then validating the claim from the dataset.
For this answer, our RAG app responds correctly and provides the supporting context; the dataset screenshot is from Hugging Face. So we can say the RAG pipeline is working well.
How to make this app more accurate/production ready?
For this application to perform better and more accurately, there are a few steps you can take:
- Use a bigger model from Ollama, such as Gemma 27B, Llama 3.3 70B, Qwen 2.5 70B, or DeepSeek V3, for better accuracy.
- Use a dedicated vector database like Qdrant, Pinecone, Milvus, etc. to store and index the embeddings.
- Create a personalized, dedicated dataset instead of an internet corpus for specific use cases. Remember, RAG is only as good as the data you provide; for best results, the data should be well cleaned and prepared for use with AI.
- Visual overhauls: You can improve the chat interface by following Reflex's guides to add more features, extra states, a database to store chats, etc.
Conclusion
Through this iterative process, we successfully constructed a functional, locally-hosted RAG chat application. This project demonstrates the power of combining Reflex for rapid UI development, LangChain for sophisticated LLM workflow orchestration, FAISS for efficient vector search, and Ollama for the privacy and control offered by local LLM inference.