Your business documentation is a goldmine of institutional knowledge, but if your team is still digging through PDFs, Notion pages, Confluence, or SharePoint folders to find answers, you're hemorrhaging productivity. A Retrieval-Augmented Generation (RAG) chatbot can change that in under eight hours. Unlike fine-tuning a model on your data, which requires expensive GPUs and weeks of iteration, RAG lets you plug a retrieval layer on top of an existing LLM. The result? A chatbot that cites verifiable sources from your own documents and stays up to date without re-training. In this guide, you'll build one using LangChain for orchestration, Qdrant as your vector database, and a free LLM like Mistral 7B via Hugging Face or Ollama. By the end of the day, you'll have a working prototype that answers questions about your policies, product specs, onboarding, or compliance guidelines. No hype, no fluff: just a repeatable pipeline that turns static docs into a conversational assistant your team actually wants to use.
Why RAG Beats Fine-Tuning for Business Documentation
Before you write a single line of code, it's worth understanding why RAG is the right pattern here. Fine-tuning adapts a model's weights to a specific dataset, but it's overkill and often counterproductive for document Q&A. If your document set changes, say an updated expense policy supersedes an older one, a fine-tuned model still carries the old policy in its weights until you retrain it. RAG sidesteps this entirely: you keep your source documents in a vector store, and every query retrieves the most relevant chunks in real time. When a document changes, you simply re-index that file. The LLM itself stays untouched, so you never have to worry about model drift or catastrophic forgetting.
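To make the "re-index that file" idea concrete, here is a toy sketch. It is not how Qdrant stores data; `TinyIndex` and the file names are hypothetical, and the point is only that chunks are grouped by source so one file can be replaced without touching the rest.

```python
from collections import defaultdict


class TinyIndex:
    """Toy in-memory index keyed by source file (illustration only)."""

    def __init__(self):
        self._chunks_by_source = defaultdict(list)

    def index_file(self, source, chunks):
        # Replacing the list drops every stale chunk for this file at once.
        self._chunks_by_source[source] = list(chunks)

    def all_chunks(self):
        return [
            (source, chunk)
            for source, chunks in self._chunks_by_source.items()
            for chunk in chunks
        ]


index = TinyIndex()
index.index_file("expense_policy.pdf", ["Meals are capped at $40 per day."])
index.index_file("travel_policy.pdf", ["Book flights 14 days ahead."])

# The expense policy is superseded: re-index only that file.
index.index_file("expense_policy.pdf", ["Meals are capped at $60 per day."])
```

In a real vector store the equivalent step is usually "delete the points whose source metadata matches this file, then insert the new chunks"; the model is never involved.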
For business documentation, retrieval accuracy matters more than fluency. A RAG pipeline lets you ground every answer in a specific chunk of text, and you can display the source citation back to the user. This auditability is critical for regulated industries like finance, healthcare, or legal. Moreover, the cost is dramatically lower. Running a small, quantized model like Mistral 7B on your own hardware or via a free API tier means you can handle thousands of queries without burning through tokens. You also avoid vendor lock-in: swap the LLM or the vector store later without rewriting your entire stack.
Setting Up Your Environment: LangChain, Qdrant, and Your LLM
We'll use three core components. LangChain handles orchestration: it provides a unified interface for prompt templates, retrievers, and LLM wrappers, so you don't have to glue together disparate APIs manually. Qdrant is an open-source vector database that runs locally or in Docker; it handles high-dimensional embeddings with built-in filtering, so you can scope queries to specific document collections later. For the LLM, we'll use Mistral 7B Instruct via Ollama (free, local) or the Hugging Face Inference API (also free for limited usage). If you prefer a cloud option, Google's Gemini free tier works well too.
- Install Docker if you don't have it, then pull the Qdrant image:
docker pull qdrant/qdrant
- Launch Qdrant on port 6333:
docker run -p 6333:6333 qdrant/qdrant
- Create a Python virtual environment and install dependencies:
pip install langchain langchain-community qdrant-client sentence-transformers ollama pypdf
- Install Ollama from ollama.ai and pull Mistral:
ollama pull mistral
That's it. You now have a local vector database, a local LLM, and the LangChain ecosystem ready to wire them together. No cloud credits required. If you prefer a hosted Qdrant, their free tier offers 1 GB of storage, which is plenty for a few hundred pages of documentation.
Loading and Chunking Your Documents
The quality of your RAG pipeline depends heavily on how you chunk your documents. Too large, and you overwhelm the LLM's context window and lose precision. Too small, and you miss surrounding context. For business docs like policy PDFs or Notion and Confluence exports, aim for chunks of 500–1000 characters with a 100-character overlap. LangChain's RecursiveCharacterTextSplitter is the default choice because it tries to split on paragraph and sentence boundaries before falling back to smaller separators.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = PyPDFLoader("your_policy_document.pdf")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
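If you want to see what chunk_size and chunk_overlap mean mechanically, the following dependency-free sketch uses a plain sliding window. It is a simplification: RecursiveCharacterTextSplitter also looks for natural separators, which this hypothetical `chunk_text` helper does not.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=100):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2000))
pieces = chunk_text(sample, chunk_size=800, chunk_overlap=100)
print([len(p) for p in pieces])
```

The last 100 characters of each window reappear at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.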
If your documents are in Markdown or HTML (common with Notion/Confluence exports), use UnstructuredMarkdownLoader or UnstructuredHTMLLoader instead. The key insight: add a metadata field like "source": filename to every chunk. This will let your chatbot cite exactly which document it pulled an answer from. For multi-page policies, also include a page number. LangChain preserves metadata through the splitter automatically if the original loader extracted it.
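As a rough illustration of metadata travelling with each chunk, here is a small sketch in plain Python. The `Chunk` class, the `split_page` helper, and the file name are all hypothetical stand-ins, not LangChain objects.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def split_page(page_text, source, page, chunk_size=800):
    """Split one page and copy its source/page metadata onto every chunk."""
    return [
        Chunk(
            text=page_text[i:i + chunk_size],
            metadata={"source": source, "page": page},
        )
        for i in range(0, len(page_text), chunk_size)
    ]


page_chunks = split_page("x" * 1700, source="travel_policy_2024.pdf", page=5)
for chunk in page_chunks:
    print(len(chunk.text), chunk.metadata)
```

Because every chunk carries its own copy of the source and page, any chunk the retriever returns later can be traced back to a specific place in a specific file.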
Creating Embeddings and Storing Them in Qdrant
With your chunks ready, you need to convert each one into a vector embedding. We'll use sentence-transformers/all-MiniLM-L6-v2—a lightweight model that produces 384-dimensional embeddings. It's fast, runs entirely offline, and offers solid retrieval quality for general business text. If you're handling specialized legal or medical terminology, consider BAAI/bge-large-en-v1.5 for a small accuracy bump at the cost of slower inference.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant
from qdrant_client import QdrantClient
client = QdrantClient(host="localhost", port=6333)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_store = Qdrant.from_documents(
    documents=chunks,
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name="business_docs",
    force_recreate=True,  # set to False on subsequent runs
)
print("Indexing complete.")
The .from_documents() call handles everything: it creates the Qdrant collection if it doesn't exist, generates embeddings for each chunk, and upserts them. The force_recreate=True parameter is useful during development because it wipes the collection every time you re-run. In production, set it to False and use add_documents() for incremental updates. You can verify your data in Qdrant's web UI at http://localhost:6333/dashboard.
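Under the hood, a vector search boils down to "score every stored vector against the query vector and keep the best k". The sketch below shows that idea with cosine similarity on tiny hand-made 3-dimensional vectors. It assumes cosine distance and a brute-force scan; a real vector database uses approximate indexes, and real embeddings here would have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=4):
    """Return the k (score, payload) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), payload) for vec, payload in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical toy vectors standing in for embedded chunks.
stored = [
    ([1.0, 0.0, 0.0], "remote work policy"),
    ([0.9, 0.1, 0.0], "hybrid schedule rules"),
    ([0.0, 1.0, 0.0], "expense approvals"),
    ([0.0, 0.0, 1.0], "security training"),
]
hits = top_k([1.0, 0.05, 0.0], stored, k=2)
print([payload for _, payload in hits])
```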
Building the Retrieval Chain with LangChain
Now comes the orchestration layer. You'll build a chain that takes a user question, retrieves the top-k relevant chunks from your Qdrant vector store, stuffs them into a prompt template, and sends the whole thing to the LLM. LangChain's RetrievalQA chain is the simplest path, but for more control you can use the lower-level LLMChain with a custom prompt.
from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama
from langchain.prompts import PromptTemplate
llm = Ollama(model="mistral", temperature=0.1)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
prompt_template = """You are a helpful assistant for our company documentation.
Use the following context to answer the user's question.
Always cite the source document name in your answer.
If you don't know, say you don't know.
Context: {context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": PROMPT},
)
The temperature=0.1 setting keeps the model close to the retrieved text, which is essential for business documentation. The k=4 parameter retrieves the four most relevant chunks; you can tune this based on your chunk size and the complexity of questions. The prompt explicitly asks for source citations, which you can format as a markdown reference at the end of each answer. Test this immediately by calling qa_chain.invoke({"query": "What is our remote work policy?"}).
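One thing worth checking: the model can only cite a document name if that name actually appears in the context it receives. Depending on how your chain formats documents, chunk metadata may not be included by default. The sketch below shows the "stuff" idea in plain Python, with each chunk labeled by its source. The `build_prompt` helper and the sample chunks are hypothetical, and this is not LangChain's internal implementation.

```python
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n"
    "Cite the source document name in your answer.\n"
    "If you don't know, say you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question, chunks):
    """Concatenate ("stuff") chunks into one prompt, labeling each with its source."""
    context = "\n\n".join(
        f"[source: {c['source']}, page {c['page']}]\n{c['text']}" for c in chunks
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)


retrieved = [
    {"source": "travel_policy.pdf", "page": 5, "text": "Expenses over $500 require pre-approval."},
    {"source": "expense_policy.pdf", "page": 2, "text": "Submit receipts within 30 days."},
]
prompt = build_prompt("Who approves travel expenses over $500?", retrieved)
print(prompt)
```

Printing the final prompt like this is also a quick way to confirm what the LLM really sees before you blame it for a missing citation.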
Connecting the LLM and Running Your First Query
With the chain assembled, let's run a complete test. Open a Python script or a Jupyter notebook and execute the chain. If you encounter a blank or hallucinated answer, first check that the retriever actually pulls relevant chunks—print them with retriever.get_relevant_documents("your question"). If the chunks look wrong, your embedding model might not capture domain-specific terminology, or your chunk size is too large. If the chunks are correct but the LLM gives a poor answer, adjust the prompt template to be more directive.
question = "What is the approval process for travel expenses over $500?"
result = qa_chain.invoke({"query": question})
print(result["result"])
A well-built pipeline should return something like: “According to the Travel Policy 2024 document (page 5), expenses over $500 require pre-approval from your manager and a second approval from the finance director. Submit via the expense portal with supporting receipts.” This answer is grounded, cited, and immediately actionable. If the answer names a document that actually exists in your chunk metadata, that's a good sign the retrieval is working. From here, you can wrap this in a simple Streamlit or Gradio UI in about 30 minutes, turning your script into a chat interface your team can access via a local URL.
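Rather than relying only on the model to phrase citations, you can also build a sources footer yourself from the metadata of the retrieved chunks. Here is a minimal sketch, assuming you have a list of metadata dicts with "source" and optionally "page" keys; `format_citations` is a hypothetical helper.

```python
def format_citations(metadatas):
    """Build a deduplicated, ordered 'Sources' footer from chunk metadata dicts."""
    seen = []
    for meta in metadatas:
        label = meta.get("source", "unknown source")
        if "page" in meta:
            label += f" (page {meta['page']})"
        if label not in seen:  # keep first-seen order, skip repeats
            seen.append(label)
    return "Sources:\n" + "\n".join(f"- {label}" for label in seen)


footer = format_citations([
    {"source": "travel_policy_2024.pdf", "page": 5},
    {"source": "travel_policy_2024.pdf", "page": 5},
    {"source": "expense_policy.pdf"},
])
print(footer)
```

A footer generated this way reflects what was actually retrieved, so it stays trustworthy even when the model's own wording of a citation is off.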
Running and Testing Your Chatbot
For a quick web UI, use Streamlit (install it with pip install streamlit). Create an app.py file that builds or imports your chain, renders a text input, and displays the answer along with the source documents. Streamlit's session state lets you maintain conversation history if you want a multi-turn experience. Here's a minimal implementation:
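import streamlit as st

# qa_chain is the RetrievalQA chain built earlier; define or import it above this line.
st.title("DocuBot - Your Documentation Assistant")
question = st.text_input("Ask a question about our policies:")
if question:
    with st.spinner("Searching..."):
        result = qa_chain.invoke({"query": question})
    st.markdown(result["result"])
    # Present only if the chain was created with return_source_documents=True.
    for doc in result.get("source_documents", []):
        st.caption(f"Source: {doc.metadata.get('source', 'unknown')}")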
Run it with streamlit run app.py. You now have a functioning RAG chatbot in under a day. To make it production-ready, add authentication (a password read from an environment variable is a reasonable start), log queries for auditing, and mount a persistent volume for Qdrant so your embeddings survive container restarts. For larger document sets, consider batch indexing and a background worker. But for a day-one prototype that solves a real business problem, this is enough.
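For the query-logging step, one simple approach is to append a JSON line per question. The sketch below is a minimal version under stated assumptions: `log_query`, the field names, and the sample users are hypothetical, and an in-memory buffer stands in for an append-mode log file.

```python
import io
import json
import time


def log_query(stream, user, question, sources):
    """Append one JSON line per query so audits can replay who asked what."""
    record = {
        "ts": time.time(),
        "user": user,
        "question": question,
        "sources": sources,
    }
    stream.write(json.dumps(record) + "\n")


audit_log = io.StringIO()  # stand-in for open("audit.jsonl", "a")
log_query(audit_log, "alice", "What is the remote work policy?", ["hr_handbook.pdf"])
log_query(audit_log, "bob", "Who approves travel over $500?", ["travel_policy.pdf"])

records = [json.loads(line) for line in audit_log.getvalue().splitlines()]
print(len(records), records[0]["user"])
```

Logging the retrieved sources alongside the question makes it much easier to review later why the bot gave a particular answer.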
The barrier to building a useful AI tool has never been lower. You just vectorized your docs, wired up a local LLM, and gave your team a chatbot that grounds its answers in your own documents instead of guessing. Try indexing your employee handbook or product FAQ next and watch whether people start asking the bot instead of digging through folders. If you hit friction, the LangChain and Qdrant communities are active and helpful. Now go ship your bot. Your documentation is finally working for you.
What chunk size should I use for legal or technical documentation?
For dense, jargon-heavy documents, reduce your chunk size to 400–600 characters with a 50-character overlap. Smaller chunks improve retrieval precision for domain-specific terms. Test with a few sample queries: if the retriever pulls irrelevant chunks, your chunk size is likely too large. You can also experiment with the embedding model: a compact general-purpose model such as BAAI/bge-small-en-v1.5 keeps indexing fast, while a model fine-tuned on your domain, if one is available, may handle legal or technical vocabulary better.
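To compare chunk sizes with more than a gut feeling, you can score each configuration against a handful of questions whose correct source you already know. The sketch below computes a simple hit rate. The `hit_rate` function, the fake retriever, and the sample data are hypothetical; in practice `retrieve` would wrap your real retriever and return the source of each chunk.

```python
def hit_rate(test_cases, retrieve, k=4):
    """Fraction of questions whose expected source appears in the top-k results."""
    if not test_cases:
        return 0.0
    hits = 0
    for question, expected_source in test_cases:
        sources = [doc["source"] for doc in retrieve(question)[:k]]
        if expected_source in sources:
            hits += 1
    return hits / len(test_cases)


# Fake retriever standing in for a real vector search.
fake_results = {
    "What is the notice period?": [{"source": "contract.pdf"}, {"source": "handbook.pdf"}],
    "Who signs NDAs?": [{"source": "handbook.pdf"}],
}


def fake_retrieve(question):
    return fake_results.get(question, [])


cases = [
    ("What is the notice period?", "contract.pdf"),
    ("Who signs NDAs?", "legal_policy.pdf"),
]
rate = hit_rate(cases, fake_retrieve, k=4)
print(rate)
```

Re-run the same test cases after each change to chunk size, overlap, or embedding model, and keep the configuration with the highest hit rate.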


