
Building a Local RAG Chat App with Reflex, LangChain, Huggingface, and Ollama

Learn how to create a fully local, privacy-friendly RAG-powered chat app using Reflex, LangChain, Huggingface, FAISS, and Ollama. This step-by-step guide walks you through building an interactive chat UI, embedding search, and local LLM integration—all without needing frontend skills or cloud dependencies.

Saurabh Rai

25 min read

In the previous article, we learned about Retrieval-Augmented Generation (RAG), which has emerged as a powerful technique to enhance the capabilities of large language models (LLMs). RAG allows LLMs to provide more accurate, relevant, and context-specific answers, mitigating issues like “hallucination” and outdated information.

This is part two of a three-part series. This article provides a detailed walkthrough of creating a RAG-powered chat application entirely within your local Python environment. We'll explore how to integrate several key technologies to build an interactive and informative tool:

  • Reflex: A modern, pure-Python framework designed for rapidly building and deploying interactive web applications without needing separate frontend expertise (like JavaScript). Its reactive nature simplifies state management.
  • LangChain: A comprehensive framework specifically designed for developing applications powered by language models. It provides modular components and chains to streamline complex workflows, such as RAG, making pipeline construction significantly easier.
  • Ollama: An increasingly popular tool that enables users to download, run, and manage various open-source LLMs (like Google's Gemma, Meta's Llama series, Mistral models, etc.) directly on their local machine, promoting privacy and offline capabilities.
  • FAISS (Facebook AI Similarity Search): A highly optimized library for performing efficient similarity searches on large datasets of vectors. In our RAG context, it's what lets us quickly find relevant text passages based on the user's query embedding.
  • Hugging Face Datasets & Transformers: The de facto standard libraries in the NLP ecosystem. Datasets provides easy access to a vast collection of datasets, while sentence-transformers (built on Transformers) offers convenient ways to generate high-quality text embeddings.

Our objective is to create a web-based chat application where users can pose questions. The application will then:

  1. Convert the question into a numerical vector (embedding).
  2. Search a pre-indexed vector store (built from a dataset using FAISS) to find text passages with similar embeddings (i.e., relevant context).
  3. Provide this retrieved context, along with the original question, to a locally running LLM (via Ollama).
  4. Display the LLM's generated answer, which is now grounded in the retrieved information, back to the user in the chat interface built in Reflex.

Initial code structure and setup

Note: You can find the complete code at this GitHub repository.

Folder/File Layout

Layout for our RAG application.

rag_app/  # Root folder for this project
│
├── .env                 # Optional environment variables (e.g., OLLAMA_MODEL, OLLAMA_HOST)
├── requirements.txt     # Python dependencies
├── rxconfig.py          # Reflex configuration
│
└── rag_gemma_reflex/    # The application package
    ├── __init__.py
    ├── rag_logic.py     # RAG pipeline: dataset loading, embeddings, FAISS, Ollama
    ├── state.py         # Reflex state and event handlers
    └── rag_gemma_reflex.py  # UI components and page setup

Requirements

Make sure to create and activate a virtual environment (using uv or Python's built-in venv) before installing the packages below:

    reflex             
    langchain           
    langchain-community 
    langchain-huggingface 
    datasets            
    faiss-cpu           
    sentence-transformers
    ollama              
    python-dotenv       
    langchain-ollama

For an easier installation, download the requirements.txt file from here. Then, inside your virtual environment, run one of the following commands:

pip install -r requirements.txt

# if you're using uv
uv pip install -r requirements.txt
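
If you haven't created the virtual environment yet, one way to set it up (a minimal sketch, assuming Python 3.10+ and, for the second variant, uv installed):

# standard library venv
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate

# or, if you're using uv
uv venv
source .venv/bin/activate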

Each library plays its part:

  • reflex builds the interactive frontend
  • langchain orchestrates the RAG flow
  • datasets provides the knowledge source
  • sentence-transformers and faiss-cpu handle the retrieval mechanism
  • ollama runs the local LLM
  • python-dotenv helps manage configuration

Diving into the code

Downloading the local AI model

Ollama provides a wide range of AI models to download and get started with. For this demo, we'll be using gemma3:4b-it-qat, a 4-billion-parameter quantization-aware-trained (QAT) model that takes roughly 3x less RAM than its non-quantized counterpart. You're free to switch to a bigger or smaller model for this demo.
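
Assuming Ollama is installed and its local server is running, you can pull the model from the terminal before starting the app (the tag below matches the default used in this article):

ollama pull gemma3:4b-it-qat

# optional: quick sanity check directly from the terminal
ollama run gemma3:4b-it-qat "Reply with one short sentence."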

understanding rag - gemma

Dataset we’re using

For this demo, we're using the neural-bridge/rag-dataset-12000 dataset from Hugging Face. There are many other datasets available; you can pick a different one and change the name in the configuration.
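
If you'd like to peek at the dataset before wiring it into the app, a minimal, hypothetical check in a Python shell (using the datasets package from requirements.txt) shows the fields the RAG logic relies on later:

from datasets import load_dataset

# Load just a few rows to inspect the schema
dataset = load_dataset("neural-bridge/rag-dataset-12000", split="train[:5]")
print(dataset.column_names)       # the app uses the 'context', 'question', and 'answer' fields
print(dataset[0]["question"])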

Reflex Configuration (rxconfig.py):

This file contains basic settings for the Reflex application. For this project, it’s minimal, primarily just naming the app:

import reflex as rx

# Basic configuration defining the application's name
config = rx.Config(
    app_name="rag_app",
)

Implementing the RAG Core Logic

This script is the engine of our application and is responsible for setting up and executing the entire RAG pipeline. You can find the code for this file here.

Key Steps:

DEFAULT_OLLAMA_MODEL = "gemma3:4b-it-qat"
DATASET_NAME = "neural-bridge/rag-dataset-12000"
DATASET_SUBSET_SIZE = 100
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", DEFAULT_OLLAMA_MODEL)
FAISS_INDEX_PATH = "faiss_index_neural_bridge"
  1. Configuration: Defined constants for the dataset name (neural-bridge/rag-dataset-12000), embedding model (all-MiniLM-L6-v2), the default Ollama model (gemma3:4b-it-qat), and the FAISS index path. Added logic to read the Ollama model name from the environment (OLLAMA_MODEL) for flexibility.
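Because the model name and host are read from the environment and python-dotenv is listed in the requirements, a hypothetical .env at the project root could look like this (both variables are optional; the defaults above apply otherwise):

# .env (optional overrides; values shown are examples)
OLLAMA_MODEL=gemma3:4b-it-qat
OLLAMA_HOST=http://localhost:11434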
def load_and_split_data():
    """
    Loads the neural-bridge/rag-dataset-12000 dataset and converts
    contexts into LangChain Documents.
    """
    print(f"Loading dataset '{DATASET_NAME}'...")
    try:
        if DATASET_SUBSET_SIZE:
            print(f"Loading only the first {DATASET_SUBSET_SIZE} entries.")
            dataset = load_dataset(DATASET_NAME, split=f"train[:{DATASET_SUBSET_SIZE}]")
        else:
            print("Loading the full dataset...")
            dataset = load_dataset(DATASET_NAME, split="train")

        documents = [
            Document(
                page_content=row["context"],
                metadata={"question": row["question"], "answer": row["answer"]},
            )
            for row in dataset
            if row.get("context")
        ]

        print(f"Loaded {len(documents)} documents.")
        return documents

    except Exception as e:
        print(f"Error loading dataset '{DATASET_NAME}': {e}")
        print(traceback.format_exc())
        return []
  2. Data Loading (load_and_split_data):
  • Loads the specified dataset from Hugging Face (datasets.load_dataset).
  • Handles loading a subset (DATASET_SUBSET_SIZE) for faster testing.
  • Converts each row's 'context' into a LangChain Document object, storing the 'question' and 'answer' in the metadata. (We initially tried rag-datasets/rag-mini-wikipedia but switched back due to loading complexities.) An optional splitting sketch follows this list.
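Note that the contexts are indexed as-is here. If you swap in a dataset with longer documents, a hypothetical splitting step using RecursiveCharacterTextSplitter (from the langchain-text-splitters package, which you may need to add to requirements.txt) could be applied before returning the documents:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Hypothetical: only needed when contexts are long; chunk sizes are illustrative
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
documents = splitter.split_documents(documents)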
def get_embeddings_model():
    """Initializes and returns the HuggingFace embedding model."""
    print(f"Loading embedding model '{EMBEDDING_MODEL_NAME}'...")
    model_kwargs = {"device": "cpu"}
    encode_kwargs = {"normalize_embeddings": False}
    embeddings = HuggingFaceEmbeddings(
        model_name=EMBEDDING_MODEL_NAME,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
    )
    print("Embedding model loaded.")
    return embeddings
  3. Embeddings (get_embeddings_model): Initializes a sentence transformer model (all-MiniLM-L6-v2) using langchain_huggingface.HuggingFaceEmbeddings to convert text documents into numerical vectors (embeddings).
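As a quick, hypothetical sanity check (not part of the app code), you can embed a short string and inspect the vector size; all-MiniLM-L6-v2 produces 384-dimensional embeddings:

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector = embeddings.embed_query("What is retrieval-augmented generation?")
print(len(vector))  # 384 for all-MiniLM-L6-v2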
def create_or_load_vector_store(documents, embeddings):
    """Creates a FAISS vector store from documents or loads it if it exists."""
    if os.path.exists(FAISS_INDEX_PATH) and os.listdir(FAISS_INDEX_PATH):
        print(f"Loading existing FAISS index from '{FAISS_INDEX_PATH}'...")
        try:
            vector_store = FAISS.load_local(
                FAISS_INDEX_PATH,
                embeddings,
                allow_dangerous_deserialization=True
            )
            print("FAISS index loaded.")
        except Exception as e:
            print(f"Error loading FAISS index: {e}")
            print("Attempting to rebuild the index...")
            vector_store = None
    else:
        vector_store = None

    if vector_store is None:
        if not documents:
            print("Error: No documents loaded to create FAISS index.")
            return None
        print("Creating new FAISS index...")
        vector_store = FAISS.from_documents(documents, embeddings)
        print("FAISS index created.")
        print(f"Saving FAISS index to '{FAISS_INDEX_PATH}'...")
        try:
            vector_store.save_local(FAISS_INDEX_PATH)
            print("FAISS index saved.")
        except Exception as e:
            print(f"Error saving FAISS index: {e}")

    return vector_store
  4. Vector Store (create_or_load_vector_store):
  • Uses FAISS (langchain_community.vectorstores.FAISS) to create a vector index from the document embeddings.
  • Crucially, it checks if an index already exists at FAISS_INDEX_PATH and loads it to avoid re-processing on every run. If not found, it creates and saves a new index. A quick retrieval check is sketched after this list.
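As that quick retrieval check (a hypothetical debugging snippet, not part of the app flow), you can query the store directly with the helpers defined above and print the top matches:

# Hypothetical debugging snippet reusing the helpers from rag_logic.py
documents = load_and_split_data()
embeddings = get_embeddings_model()
vector_store = create_or_load_vector_store(documents, embeddings)

if vector_store is not None:
    results = vector_store.similarity_search("What topics does this dataset cover?", k=3)
    for doc in results:
        print(doc.page_content[:200], "...")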
def get_ollama_llm():
    """Initializes and returns the Ollama LLM using the new package."""
    global OLLAMA_MODEL
    current_ollama_model = os.getenv("OLLAMA_MODEL", DEFAULT_OLLAMA_MODEL)

    if OLLAMA_MODEL != current_ollama_model:
        print(f"Ollama model changed to '{current_ollama_model}'.")
        OLLAMA_MODEL = current_ollama_model
        global _rag_chain
        _rag_chain = None

    print(f"Initializing Ollama LLM with model '{OLLAMA_MODEL}'...")

    try:
        ollama_client.show(OLLAMA_MODEL)
        print(f"Confirmed Ollama model '{OLLAMA_MODEL}' is available locally.")
    except ollama_client.ResponseError as e:
        if "model not found" in str(e).lower():
            print(f"Error: Ollama model '{OLLAMA_MODEL}' not found locally.")
            print(f"Please pull it first using: ollama pull {OLLAMA_MODEL}")
            return None
        else:
            print(f"An error occurred while checking the Ollama model: {e}")
            return None
    except Exception as e:
        print(f"An unexpected error occurred while checking Ollama model: {e}")
        return None

    ollama_base_url = os.getenv("OLLAMA_HOST")
    if ollama_base_url:
        print(f"Using Ollama host: {ollama_base_url}")
        llm = Ollama(model=OLLAMA_MODEL, base_url=ollama_base_url)
    else:
        print("Using default Ollama host (http://localhost:11434).")
        llm = Ollama(model=OLLAMA_MODEL)

    print("Ollama LLM initialized.")
    return llm
  5. LLM Initialization (get_ollama_llm):
  • Connects to the locally running Ollama service.
  • Uses the langchain_ollama.OllamaLLM class (after updating from the deprecated langchain_community version).
  • Specifies the desired model (OLLAMA_MODEL, defaulting to gemma3:4b-it-qat).
  • Includes error handling to check if the specified model has been pulled in Ollama (ollama pull <model_name>). A quick standalone test is sketched after this list.
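That standalone test is a quick, hypothetical way to confirm the LLM works on its own before wiring up the full chain (assuming the model has already been pulled):

llm = get_ollama_llm()
if llm is not None:
    # invoke() sends a single prompt to the local Ollama server and returns the generated text
    print(llm.invoke("In one sentence, what is retrieval-augmented generation?"))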
def get_rag_chain():
    """Returns the initialized RAG chain, setting it up if necessary."""
    if _rag_chain is None:
        setup_rag_chain()
    if _rag_chain is None:
        print("Warning: RAG chain is not available.")
    return _rag_chain
  6. Chain Setup (setup_rag_chain, get_rag_chain); a sketch of this wiring follows the list:
  • Retriever: Creates a retriever from the FAISS vector store (vector_store.as_retriever) to find the top k documents most similar to a user’s query.

  • Prompt Template: Defines a ChatPromptTemplate instructing the LLM how to answer the question using the provided context. We updated the template variable from {question} to {input} to match what the create_retrieval_chain expects.

  • Stuff Documents Chain: Uses create_stuff_documents_chain to take the retrieved documents and the user input, format them into the prompt, and pass them to the LLM.

  • Retrieval Chain: Uses create_retrieval_chain to tie the retriever and the stuff documents chain together into the final RAG chain.

  • The get_rag_chain function acts as a lazy initializer, setting up the chain only on the first request or if the configuration (like the Ollama model) changes.
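
The full setup_rag_chain implementation lives in the repository; a minimal sketch of that wiring, based on the steps above (the prompt text and k value are illustrative assumptions), looks like this:

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

_rag_chain = None  # module-level cache reused by get_rag_chain()

def setup_rag_chain():
    """Builds the retriever -> prompt -> LLM pipeline and caches it."""
    global _rag_chain

    documents = load_and_split_data()
    embeddings = get_embeddings_model()
    vector_store = create_or_load_vector_store(documents, embeddings)
    llm = get_ollama_llm()
    if vector_store is None or llm is None:
        _rag_chain = None
        return

    # Retrieve the top-k most similar documents for each query (k is illustrative)
    retriever = vector_store.as_retriever(search_kwargs={"k": 3})

    # The template uses {input} (not {question}) to match create_retrieval_chain
    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {input}"
    )

    # Stuff the retrieved documents into the prompt and pass it to the LLM
    combine_docs_chain = create_stuff_documents_chain(llm, prompt)
    _rag_chain = create_retrieval_chain(retriever, combine_docs_chain)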

Reflex App (State and UI)

Managing UI State (state.py)

This file defines the Reflex State class, which holds the data for the UI and the logic to handle user interactions. You can find the code here.

import reflex as rx
from . import rag_logic
import traceback

class QA(rx.Base):
    """A question and answer pair."""
    question: str
    answer: str
    is_loading: bool = False

class State(rx.State):
    """Manages the application state for the RAG chat interface."""
    question: str = ""
    chat_history: list[QA] = []
    is_loading: bool = False

    async def handle_submit(self):
        """Handles the user submitting a question."""
        if not self.question.strip():
            return

        user_question = self.question
        self.chat_history.append(QA(question=user_question, answer="", is_loading=True))
        self.question = ""
        yield

        try:
            rag_chain = rag_logic.get_rag_chain()
            if rag_chain is None:
                raise Exception("RAG chain could not be initialized. Check logs.")

            response = await rag_chain.ainvoke({"input": user_question})
            answer = response.get("answer", "Sorry, I couldn't find an answer.")
            self.chat_history[-1].answer = answer
            self.chat_history[-1].is_loading = False

        except Exception as e:
            print(f"Error processing question: {e}")
            print(traceback.format_exc())
            self.chat_history[-1].answer = f"An error occurred: {e}. Check the console logs."
            self.chat_history[-1].is_loading = False

        finally:
            if self.chat_history:
                self.chat_history[-1].is_loading = False
  • question: Stores the current text in the input field.
  • chat_history: A list of QA objects (a simple rx.Base model holding question, answer, and loading status) to display the conversation.
  • is_loading: A boolean for global loading states (though we primarily used the per-message loading).
  • handle_submit: An asynchronous event handler triggered when the user submits the form.

UI in Reflex

For the UI, we're using Reflex components to build the visual chat interface. You can check the Reflex styling guide to learn more about styling your own Reflex components, and find the code here. Here's the code for rag_gemma_reflex.py:

import reflex as rx
from .state import State, QA

# --- UI Styles ---
colors = {
    "background": "#0F0F10",
    "text_primary": "#E3E3E3",
    "text_secondary": "#BDC1C6",
    "input_bg": "#1F1F21",
    "input_border": "#3C4043",
    "button_bg": "#8AB4F8",
    "button_text": "#202124",
    "button_hover_bg": "#AECBFA",
    "user_bubble_bg": "#3C4043",
    "bot_bubble_bg": "#1E1F21",
    "bubble_border": "#5F6368",
    "loading_text": "#9AA0A6",
    "heading_gradient_start": "#8AB4F8",
    "heading_gradient_end": "#C3A0F8",
}

base_style = {
    "background_color": colors["background"],
    "color": colors["text_primary"],
    "font_family": "'Roboto', sans-serif",
    "font_weight": "200",
    "height": "100vh",
    "width": "100%",
}

input_style = {
    "background_color": colors["input_bg"],
    "border": f"1px solid {colors['input_border']}",
    "color": colors["text_primary"],
    "border_radius": "24px",
    "padding": "12px 18px",
    "width": "100%",
    "font_weight": "400",
    "_placeholder": {
        "color": colors["text_secondary"],
        "font_weight": "300",
    },
    ":focus": {
        "border_color": colors["button_bg"],
        "box_shadow": f"0 0 0 1px {colors['button_bg']}",
    },
}

button_style = {
    "background_color": colors["button_bg"],
    "color": colors["button_text"],
    "border": "none",
    "border_radius": "24px",
    "padding": "12px 20px",
    "cursor": "pointer",
    "font_weight": "500",
    "font_family": "'Roboto', sans-serif",
    "transition": "background-color 0.2s ease",
    ":hover": {
        "background_color": colors["button_hover_bg"],
    },
}

chat_box_style = {
    "padding": "1em 0",
    "flex_grow": 1,
    "overflow_y": "auto",
    "display": "flex",
    "flex_direction": "column-reverse",
    "width": "100%",
    "&::-webkit-scrollbar": {
        "width": "8px",
    },
    "&::-webkit-scrollbar-track": {
        "background": colors["input_bg"],
        "border_radius": "4px",
    },
    "&::-webkit-scrollbar-thumb": {
        "background": colors["bubble_border"],
        "border_radius": "4px",
    },
    "&::-webkit-scrollbar-thumb:hover": {
        "background": colors["text_secondary"],
    },
}

qa_style = {
    "margin_bottom": "1em",
    "padding": "12px 18px",
    "border_radius": "18px",
    "word_wrap": "break-word",
    "max_width": "85%",
    "box_shadow": "0 1px 3px 0 rgba(0, 0, 0, 0.15)",
    "line_height": "1.6",
    "font_weight": "400",
    "code": {
        "background_color": "rgba(255, 255, 255, 0.1)",
        "padding": "0.2em 0.4em",
        "font_size": "85%",
        "border_radius": "4px",
        "font_family": "monospace",
    },
    "a": {
        "color": colors["button_bg"],
        "text_decoration": "underline",
        ":hover": {
            "color": colors["button_hover_bg"],
        },
    },
    "p": {
        "margin": "0",
    },
}

question_style = {
    **qa_style,
    "background_color": colors["user_bubble_bg"],
    "color": colors["text_primary"],
    "align_self": "flex-end",
    "border_bottom_right_radius": "4px",
}

answer_style = {
    **qa_style,
    "background_color": colors["bot_bubble_bg"],
    "color": colors["text_primary"],
    "align_self": "flex-start",
    "border_bottom_left_radius": "4px",
}

loading_style = {
    "color": colors["loading_text"],
    "font_style": "italic",
    "font_weight": "300",
}

# --- UI Components ---
def message_bubble(qa: QA):
    """Displays a single question and its answer."""
    return rx.vstack(
        rx.box(qa.question, style=question_style),
        rx.cond(
            qa.is_loading,
            rx.box("Thinking...", style={**answer_style, **loading_style}),
            rx.markdown(qa.answer, style=answer_style),
        ),
        align_items="stretch",
        width="100%",
        spacing="1",
    )

# --- Main Page ---
def index() -> rx.Component:
    """The main chat interface page."""
    heading_style = {
        "size": "7",
        "margin_bottom": "0.25em",
        "font_weight": "400",
        "background_image": f"linear-gradient(to right, {colors['heading_gradient_start']}, {colors['heading_gradient_end']})",
        "background_clip": "text",
        "-webkit-background-clip": "text",
        "color": "transparent",
        "width": "fit-content",
    }

    return rx.container(
        rx.vstack(
            rx.box(
                rx.heading("RAG Chat with Gemma", **heading_style),
                rx.text(
                    "Ask a question based on the loaded context.",
                    color=colors["text_secondary"],
                    font_weight="300",
                ),
                padding_bottom="0.5em",
                width="100%",
                text_align="center",
            ),
            rx.box(
                rx.foreach(State.chat_history, message_bubble),
                style=chat_box_style,
            ),
            rx.form(
                rx.hstack(
                    rx.input(
                        name="question",
                        placeholder="Ask your question...",
                        value=State.question,
                        on_change=State.set_question,
                        style=input_style,
                        flex_grow=1,
                        height="50px",
                    ),
                    rx.button(
                        "Ask",
                        type="submit",
                        style=button_style,
                        is_loading=State.is_loading,
                        height="50px",
                    ),
                    width="100%",
                    align_items="center",
                ),
                on_submit=State.handle_submit,
                width="100%",
            ),
            align_items="center",
            width="100%",
            height="100%",
            padding_x="1em",
            padding_y="1em",
            spacing="4",
        ),
        max_width="900px",
        height="100vh",
        padding=0,
        margin="auto",
    )

# --- App Setup ---
stylesheets = [
    "https://fonts.googleapis.com/css2?family=Roboto:wght@200;300;400;500&display=swap",
]

app = rx.App(style=base_style, stylesheets=stylesheets)
app.add_page(index, title="Reflex Chat")

Building the Application

Make sure the folder structure described above is in place. Then, from the root directory of the project, run these two commands and your app will be up and running.

  1. reflex init
  2. reflex run

Note: Please ensure that Ollama is up and running before you start the application. A successful startup looks like this (some of the messages might only appear after you send a first "Hi" in the chat interface).

understanding rag - reflex run

understanding rag - rag chat with gemma

Chat Interface

Sending a message to our RAG Application:

understanding rag - chat with gemma 2

And then validating the claim from the dataset.

understanding rag - berry export

For this question, our RAG app answers correctly and provides the right context; the dataset screenshot is from Hugging Face. So we can say that RAG is working well enough.

How to make this app more accurate and production-ready

For this application to perform better and more accurately, there are several steps you can take.

  • Use a bigger model from Ollama, like Gemma 3 27B, Llama 3.3 70B, Qwen 2.5 72B, or DeepSeek V3, for better accuracy.
  • Use a dedicated vector database like Qdrant, Pinecone, Milvus, etc. to store and index embeddings.
  • Create a personalized, dedicated dataset for your specific use case instead of relying on an internet corpus. Remember, RAG is only as good as the data you provide; for best results, the data should be well cleaned and primed for use with AI.
  • Visual overhauls: improve the chat interface by following Reflex's guides to add more features, extra states, a database to store chats, etc.

Conclusion

Through this iterative process, we successfully constructed a functional, locally-hosted RAG chat application. This project demonstrates the power of combining Reflex for rapid UI development, LangChain for sophisticated LLM workflow orchestration, FAISS for efficient vector search, and Ollama for the privacy and control offered by local LLM inference.

