If you've been relying on cloud-based AI services for document Q&A, you're paying a premium in both latency and privacy. Building a local retrieval-augmented generation (RAG) system on your own hardware eliminates API costs, keeps sensitive data in-house, and gives you full control over the pipeline. In this tutorial, you'll learn how to pair Ollama for local LLM inference with Qdrant as your vector database to create a self-hosted RAG stack that handles document ingestion, embedding, semantic search, and conversational retrieval. We'll walk through every step—from installing Ollama and pulling a quantised model to chunking PDFs, generating embeddings with nomic-embed-text, and wiring up a chat interface that queries your documents in real time. By the end, you'll have a production-ready local RAG system that runs entirely on your hardware, with no external API calls and no data leaving your machine. Let's build it.
Why Build a Local RAG System?
Cloud-based RAG solutions from providers like OpenAI, Pinecone, or Google Vertex AI charge per token and per vector operation. For a team processing 10,000 documents per month, those costs can easily exceed $500, and that's before you factor in egress fees and latency from round trips to remote servers. A local RAG system removes the per-request charges and the network round trips. With Ollama running an 8B-parameter model like llama3.1:8b on a single consumer GPU (e.g., an RTX 3060 with 12 GB VRAM), you can achieve inference speeds of 30–50 tokens per second while keeping every document, every embedding, and every query response on your own network.
Beyond cost, privacy is the primary driver. Healthcare records, legal contracts, internal financial data—none of it should touch a third-party API if you can avoid it. A local stack means your ingestion pipeline, vector store, and LLM all reside on machines you control. Qdrant, for instance, can run as a single binary or inside Docker, and it supports on-disk storage with optional memory mapping, so even a modest server with 16 GB RAM and an SSD can handle millions of vectors. The trade-off is that you're responsible for maintenance and scaling, but for most small-to-medium knowledge bases (10,000–100,000 documents), a single machine is more than sufficient.
Finally, local RAG gives you freedom to experiment. You can swap embedding models, change chunking strategies, or switch from cosine similarity to dot-product scoring without worrying about API deprecations or rate limits. This agility is critical when you're iterating on retrieval quality—something that's notoriously hard to optimise in a black-box cloud setup.
Setting Up Ollama for Local LLM Inference
Ollama is one of the easiest ways to run open-source LLMs on your own hardware. It wraps model downloading, quantised model management, and inference into a single CLI and REST API. Start by installing Ollama on your target machine: Linux, macOS, or Windows. On Linux, the official installer is a one-liner: curl -fsSL https://ollama.com/install.sh | sh. On macOS and Windows, use the installers from ollama.com. Once installed, pull a model that balances quality with resource constraints, for example ollama pull llama3.1:8b. For most RAG workloads, llama3.1:8b (8 billion parameters, Q4_K_M quantisation) is the sweet spot: it uses about 6 GB of VRAM and delivers coherent, context-aware answers. If you're on a CPU-only machine, try a 4-bit quantised 7B model such as mistral:7b, which fits in 8 GB of system RAM and typically generates around 5–10 tokens per second, depending on the CPU.
After pulling your model, test the Ollama REST API. By default, it listens on http://localhost:11434. A quick curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "prompt": "Hello"}' should return a streaming response: one JSON object per line, each carrying a fragment of the answer. You'll also need an embedding model for the RAG pipeline. Ollama hosts nomic-embed-text (137 million parameters, 768-dimensional vectors), which is purpose-built for retrieval tasks. Nomic reports that it scores higher than OpenAI's text-embedding-ada-002 on the MTEB benchmark, and it runs entirely locally. Pull it with ollama pull nomic-embed-text.
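To make the streaming format concrete, here is a minimal sketch of how a client could stitch the line-delimited JSON back into one answer. The helper name collect_stream and the canned sample lines are illustrative assumptions, not part of Ollama; the sketch relies only on each streamed object carrying a response fragment and a done flag.

```python
import json

def collect_stream(lines):
    """Join the 'response' fragments of a line-delimited JSON stream.

    `lines` is any iterable of JSON strings (or bytes), one object per line.
    """
    parts = []
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        event = json.loads(raw)
        parts.append(event.get("response", ""))
        if event.get("done"):
            break  # the final object marks the end of the answer
    return "".join(parts)

# Canned lines in the shape the endpoint streams, used instead of a live server.
sample = [
    '{"model": "llama3.1:8b", "response": "Hel", "done": false}',
    '{"model": "llama3.1:8b", "response": "lo!", "done": false}',
    '{"model": "llama3.1:8b", "response": "", "done": true}',
]
print(collect_stream(sample))  # Hello!
```

Against a running server, the same helper should accept the iterator returned by requests.post(..., stream=True).iter_lines(), since that yields one bytes line per JSON object.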
One practical tip: set the OLLAMA_NUM_PARALLEL environment variable to 2 or 4 if you plan to handle concurrent embedding requests. Depending on your Ollama version and available memory, the default may be 1, which serialises all calls and will bottleneck your ingestion pipeline. Also, consider running Ollama as a systemd service so it starts automatically on boot, which matters if your RAG system is meant to be always available.
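Raising OLLAMA_NUM_PARALLEL only helps if the client actually sends requests concurrently. The following is a minimal sketch of one way to do that with a thread pool. The embed_all helper and the fake_embed stand-in are hypothetical names used so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor

def embed_all(chunks, embed_fn, workers=4):
    """Embed chunks concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(embed_fn, chunks))

# Stand-in embedder so the sketch runs offline (hypothetical).
def fake_embed(text):
    return [float(len(text))]

vectors = embed_all(["alpha", "be", "gam"], fake_embed, workers=2)
print(vectors)  # [[5.0], [2.0], [3.0]]
```

With the real client, embed_fn could be a small function that calls the embeddings endpoint for nomic-embed-text and returns the vector. Keeping workers at or below the value of OLLAMA_NUM_PARALLEL avoids queueing requests on the server side.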
Installing and Configuring Qdrant as Your Vector Database
Qdrant is a high-performance vector database written in Rust, designed for production similarity search. It builds an HNSW index for approximate nearest-neighbour search, supports payload filtering and vector quantisation, and can run as a single-node instance for local setups. The fastest way to get started is with Docker: docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant. This exposes the REST API on port 6333 and the gRPC interface on 6334. For a persistent setup, mount a volume: -v $(pwd)/qdrant_storage:/qdrant/storage. If Docker isn't an option, download the binary from GitHub releases and run ./qdrant directly. It ships as a single executable.
Once Qdrant is running, create a collection to hold your document vectors. The key parameters are vectors (size and distance metric; the Python client calls this vectors_config) and optimizers_config. For nomic-embed-text, set size: 768 and distance: Cosine. Cosine similarity is the standard for text embeddings because it compares the direction of two vectors and ignores their magnitude, so scores stay comparable across chunks of different lengths. Here's a sample curl command to create the collection:
curl -X PUT http://localhost:6333/collections/documents \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "size": 768,
      "distance": "Cosine"
    },
    "optimizers_config": {
      "default_segment_number": 2,
      "memmap_threshold": 20000
    }
  }'
For local setups, set a memory-mapping threshold (memmap_threshold, in kilobytes) to reduce RAM usage: segments larger than the threshold are mapped from disk instead of being loaded entirely into memory. This makes a real difference on machines with limited RAM. Also, set default_segment_number to 2 or 4 to keep indexing overhead low during ingestion. You can tune these later as your collection grows.
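To judge whether memory mapping matters for a given collection, it helps to estimate how much space the raw vectors occupy. This is a back-of-the-envelope sketch that assumes 4-byte float32 values and counts only the vectors themselves; the HNSW index and payloads add more on top. The function name raw_vector_bytes is illustrative.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Bytes needed for the raw vectors alone (no index, no payload)."""
    return num_vectors * dim * bytes_per_value

# 768 dimensions matches nomic-embed-text.
for n in (50_000, 1_000_000):
    gib = raw_vector_bytes(n, 768) / 2**30
    print(f"{n:>9,} vectors x 768 dims = {gib:.2f} GiB")
```

At 768 dimensions each vector takes 3,072 bytes, so a million chunks come to roughly 3 GB before indexing overhead, which is where disk-backed storage starts to pay off on a 16 GB machine.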
Building the Document Ingestion Pipeline
The ingestion pipeline is where raw documents become searchable vectors. Start by choosing a chunking strategy. Fixed-size chunking with 512 tokens and a 128-token overlap is a solid baseline—it captures enough context while keeping each chunk focused. Use a library like langchain or llama_index to handle splitting, or write your own with tiktoken for token-aware chunking. For PDFs, PyMuPDF (fitz) is fast and reliable; for HTML or Markdown, beautifulsoup4 or markdown-it-py work well. Here's a minimal Python snippet that reads a PDF, chunks it, and generates embeddings:
import fitz  # PyMuPDF
import requests
from ollama import Client
ollama = Client(host='http://localhost:11434')
doc = fitz.open("report.pdf")
chunks = []  # chunk text
pages = []   # 1-based page number of each chunk
for page_number, page in enumerate(doc, start=1):
    text = page.get_text()
    # Simple chunk: 512-character windows with a 128-character overlap (stride 384)
    for i in range(0, len(text), 384):
        chunk = text[i:i+512]
        if len(chunk) > 50:
            chunks.append(chunk)
            pages.append(page_number)
embeddings = []
for chunk in chunks:
    resp = ollama.embeddings(model='nomic-embed-text', prompt=chunk)
    embeddings.append(resp['embedding'])
Once you have embeddings, upsert them into Qdrant. Each point needs a unique ID, the vector, and a payload (typically the chunk text and metadata like source file and page number). Batch your upserts rather than sending one point per request; a batch size of 100–200 is a safe starting point for local setups. Here's the upsert call, continuing the snippet above:
points = [
    {
        "id": i,
        "vector": emb,
        "payload": {"text": chunks[i], "source": "report.pdf", "page": pages[i]},
    }
    for i, emb in enumerate(embeddings)
]
# Upsert in batches of 100; wait=true blocks until each batch has been applied
for start in range(0, len(points), 100):
    resp = requests.put(
        "http://localhost:6333/collections/documents/points",
        params={"wait": "true"},
        json={"points": points[start:start + 100]},
    )
    resp.raise_for_status()
Pro tip: store the chunk text in the payload so you can retrieve it directly during search without a separate lookup. Note that sequential integer IDs restart at 0 for every file, so ingesting a second document would overwrite the first; use UUIDs or a running offset for multi-document collections. Also, add a timestamp field to your payload if you plan to implement time-based filtering later. Qdrant supports payload filtering natively, which is useful for document versioning.
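The baseline described at the start of this section counts tokens, whereas the PDF snippet slices characters. Below is a minimal sketch of the token-window variant. The chunk_tokens helper is a hypothetical name, and whitespace splitting stands in for a real tokenizer so the example has no dependencies.

```python
def chunk_tokens(tokens, size=512, overlap=128):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows

# Whitespace splitting stands in for a real tokenizer here (assumption).
tokens = "one two three four five six seven eight nine ten".split()
for window in chunk_tokens(tokens, size=4, overlap=1):
    print(" ".join(window))
```

To use real token counts, the token list could come from a tokenizer such as tiktoken (encode the text, window the IDs, decode each window). Those counts will only approximate what nomic-embed-text sees, since the embedding model has its own tokenizer.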
Implementing Semantic Search and Retrieval
With your documents ingested, the retrieval step converts a user query into an embedding and searches Qdrant for the most similar chunks. The query embedding uses the same nomic-embed-text model—consistency between ingestion and retrieval is critical for accurate cosine similarity. Send the query to Ollama's embeddings endpoint, then call Qdrant's /collections/documents/points/search with the vector and a limit parameter (typically 3–5 chunks). Here's the full retrieval flow in Python:
import requests
query = "What were the Q3 revenue figures?"
# `ollama` is the Client instance created in the ingestion snippet
query_emb = ollama.embeddings(model='nomic-embed-text', prompt=query)['embedding']
search_result = requests.post(
    "http://localhost:6333/collections/documents/points/search",
    json={
        "vector": query_emb,
        "limit": 5,
        "with_payload": True
    }
).json()
context_chunks = [hit['payload']['text'] for hit in search_result['result']]
One common pitfall is retrieving chunks that are semantically similar but contextually irrelevant, for example matching “revenue” in a footnote rather than the main financial table. To improve precision, add a score_threshold parameter (e.g., 0.75) to filter out low-similarity results. You can also experiment with MMR (maximal marginal relevance) reranking. Recent Qdrant releases expose MMR through the Query API rather than the search endpoint used here, so check the documentation for your version; with older versions, MMR can be applied client-side to the returned hits. MMR diversifies results by penalising chunks that are too similar to each other, so you get coverage across different parts of the document.
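MMR can also be applied client-side to whatever hits Qdrant returns, provided the search requests the vectors along with the payloads. The following is a minimal pure-Python sketch of the selection step. The function names, the toy 2-D vectors, and the lam weighting parameter are illustrative assumptions, not Qdrant's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, candidates, k=3, lam=0.7):
    """Return indices of up to k candidates chosen by maximal marginal relevance.

    lam=1.0 ranks purely by similarity to the query; lower values
    penalise candidates that resemble ones already selected.
    """
    relevance = [cosine(query_vec, c) for c in candidates]
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            # Similarity to the closest already-selected candidate
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy 2-D vectors: the first two candidates are near-duplicates.
query = [1.0, 0.3]
candidates = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
print(mmr(query, candidates, k=2, lam=0.5))  # [1, 2]
```

With lam=1.0 the same call returns the two near-duplicates, which is the behaviour MMR is meant to avoid. Values between 0.5 and 0.8 are a reasonable range to experiment with.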
For performance, Qdrant's HNSW index is already optimised for approximate nearest neighbour search. On a collection of 50,000 vectors, a search with limit: 5 typically returns in under 10 milliseconds on an SSD-backed machine. If you need exact search (no approximation), set "params": {"exact": true} in the search request, but be aware that this scans every vector, so query time grows linearly with collection size.
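The search options discussed in this section can be combined in one request body. The helper below is a hedged sketch: build_search_body and its newer_than argument are hypothetical names, and the filter assumes each payload carries a numeric timestamp field. The exact flag is placed under params, where the search endpoint expects it.

```python
def build_search_body(vector, limit=5, score_threshold=None,
                      exact=False, newer_than=None):
    """Assemble the JSON body for POST /collections/<name>/points/search."""
    body = {"vector": vector, "limit": limit, "with_payload": True}
    if score_threshold is not None:
        # Drop hits whose similarity score falls below the threshold
        body["score_threshold"] = score_threshold
    if exact:
        # Bypass the HNSW index and scan every vector
        body["params"] = {"exact": True}
    if newer_than is not None:
        # Assumes each payload carries a numeric `timestamp` field (hypothetical)
        body["filter"] = {
            "must": [{"key": "timestamp", "range": {"gte": newer_than}}]
        }
    return body

body = build_search_body([0.1, 0.2], score_threshold=0.75, newer_than=1700000000)
print(body)
```

The resulting dictionary can be passed as the json argument of the requests.post call used for retrieval.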
Creating the Chat Interface
The final piece is a chat interface that takes a user question, retrieves relevant chunks, and feeds them as context to the LLM. This is the “generation” part of RAG. Build a simple function that concatenates the retrieved chunks into a prompt template, then calls Ollama's generate endpoint. A proven template looks like this:
def rag_chat(query):
    # Retrieve
    context = retrieve_chunks(query)  # returns list of text strings
    # Build prompt
    prompt = f"""Use the following context to answer the question. If the context doesn't contain the answer, say "I don't have enough information."
Context:
{chr(10).join(context)}
Question: {query}
Answer:"""
    # Generate
    response = ollama.generate(model='llama3.1:8b', prompt=prompt)
    return response['response']
For a more polished interface, wrap this in a Gradio or Streamlit app. Gradio is particularly easy: a few lines of Python give you a text input box, a submit button, and a streaming output area. Here's a minimal Gradio app that connects to your RAG pipeline:
import gradio as gr
def respond(message, history):
    # Gradio passes the chat history as well; this minimal version ignores it
    return rag_chat(message)
gr.ChatInterface(fn=respond).launch()
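The respond function above returns the whole answer at once. Gradio's ChatInterface also accepts generator functions and displays each yielded value as the reply so far, so streaming comes down to yielding the accumulated text. Here is a minimal sketch of that accumulation step; stream_reply is a hypothetical helper, and the canned fragments stand in for a live model stream.

```python
def stream_reply(fragments):
    """Yield the reply accumulated so far after each streamed fragment."""
    text = ""
    for fragment in fragments:
        text += fragment
        yield text

# Canned fragments stand in for a live model stream.
for partial in stream_reply(["The ", "Q3 ", "figures..."]):
    print(partial)
```

In a live setup, the fragments would likely come from the Ollama client's generate call with stream=True, taking the response field of each streamed part, and respond would yield from stream_reply instead of returning a string. Treat that wiring as an assumption to verify against the client and Gradio versions you have installed.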