Imagine having a language model that speaks your industry’s jargon, follows your internal formatting rules, and maintains your brand voice—all while running on a desktop GPU, not a $100,000 cloud cluster. Fine-tuning open source models with Low-Rank Adaptation (LoRA) makes this possible for businesses of any size. LoRA inserts trainable matrices into a frozen base model, slashing VRAM requirements so that a 7-billion-parameter model fits on a consumer-grade RTX 3090 or 4090. The result is a specialized assistant that excels at tasks like summarizing technical documentation, generating compliance reports, or answering product-specific support queries—without sacrificing data privacy. In this guide, you’ll walk through every step: preparing a clean dataset, selecting a base model, configuring LoRA hyperparameters, running training on a local machine, evaluating performance, and deploying the finished model via Ollama for real-world inference. Each section delivers concrete commands and configuration options so you can replicate the process with your own data today.
1. Preparing Your Dataset: The Foundation of Successful Fine-Tuning
Your fine-tuned model is only as good as the data it learns from. Start by collecting 500–5,000 high-quality examples that represent the exact tasks your business needs. For instruction-following models like Llama 3 or Mistral, structure your data in the conversational format the base model expects: a user prompt followed by a response, optionally with a system prompt. Use JSON or JSONL files with keys like instruction, input, and output (the Alpaca-style single-turn format). Clean the dataset by removing duplicates, correcting typos, and ensuring consistent tone. Below is a checklist for dataset readiness:
- Task specificity: Each example should mirror a real business query (e.g., “Summarize this contract clause” or “Generate a product description for SKU-4321”).
- Balanced length: Include a mix of short and long responses to prevent the model from becoming verbose or terse.
- No PII leakage: Scrub personal data, internal IDs, and confidential information before training.
- Train/test split: Reserve 10–20% of examples for evaluation. Keep them in a separate JSONL file.
Once your dataset is ready, upload it to your working directory. Tools like datasets from Hugging Face can load local files directly. If you lack data, consider using a synthetic generation pipeline with a strong model (e.g., GPT-4) to create seed examples, then manually review and edit them. Quantity matters less than quality—a few hundred perfectly curated examples often outperform thousands of noisy ones.
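The cleaning and splitting steps from the checklist can be sketched with the standard library alone. This is a minimal illustration, assuming Alpaca-style records with instruction, input, and output keys; the function names (clean_and_split, write_jsonl), the sample records, and the file names are hypothetical and not part of any library.

```python
import json
import os
import random
import tempfile


def clean_and_split(records, test_fraction=0.1, seed=42):
    """Drop incomplete and duplicate records, then return (train, test) lists."""
    seen = set()
    cleaned = []
    for rec in records:
        instruction = (rec.get("instruction") or "").strip()
        context = (rec.get("input") or "").strip()
        output = (rec.get("output") or "").strip()
        if not instruction or not output:
            continue  # skip records missing a prompt or a response
        key = (instruction.lower(), context.lower())
        if key in seen:
            continue  # skip prompts we have already kept
        seen.add(key)
        cleaned.append({"instruction": instruction, "input": context, "output": output})
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(cleaned)
    n_test = max(1, round(len(cleaned) * test_fraction)) if len(cleaned) > 1 else 0
    return cleaned[n_test:], cleaned[:n_test]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


# Hypothetical sample records, including one duplicate prompt.
examples = [
    {"instruction": "Summarize this contract clause", "input": "Clause text A", "output": "Summary A"},
    {"instruction": "Summarize this contract clause", "input": "Clause text A", "output": "Summary A again"},
    {"instruction": "Generate a product description for SKU-4321", "input": "", "output": "Description"},
    {"instruction": "Draft a support reply", "input": "Ticket text", "output": "Reply"},
]
train, test = clean_and_split(examples, test_fraction=0.2)

# Written to a temporary directory here; use your working directory in practice.
with tempfile.TemporaryDirectory() as tmp:
    write_jsonl(os.path.join(tmp, "train.jsonl"), train)
    write_jsonl(os.path.join(tmp, "test.jsonl"), test)
print(len(train), "training examples,", len(test), "held-out examples")
```

Exact-match deduplication on the normalized prompt is the simplest possible choice; near-duplicate detection and PII scrubbing need dedicated tooling and a manual review pass.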
2. Choosing the Right Base Model and Hardware Setup
Not every open source model is ideal for fine-tuning on consumer GPUs. For most business applications, a model in the 7–8 billion parameter range strikes the best balance between capability and VRAM footprint. Popular choices include Llama 3 8B and Mistral 7B, with Phi-3-mini (3.8B) as a lighter option. All three support LoRA fine-tuning out of the box via the PEFT library. On the hardware side, you need a GPU with at least 12 GB of VRAM for 7B models using 4-bit quantization (QLoRA). A 24 GB RTX 4090 provides comfortable headroom and allows for larger batch sizes. If you only have 8 GB, consider a smaller model such as Phi-3-mini (3.8B) or TinyLlama (1.1B).
Below is a quick VRAM reference table for typical LoRA fine-tuning setups (all figures assume a batch size of 1 with gradient accumulation):
- 1–1.5B model (e.g., TinyLlama, 1.1B) with 4-bit QLoRA: ~5 GB VRAM
- 3–4B model (e.g., Phi-3-mini, 3.8B) with 4-bit QLoRA: ~7 GB VRAM
- 7B model (e.g., Mistral) with 4-bit QLoRA: ~12 GB VRAM
- 7B model with 16-bit LoRA (no quantization): ~16 GB VRAM
- 13B model with 4-bit QLoRA: ~20 GB VRAM (pushing limits on 24 GB)
For CPU-offloading or multi-GPU setups, frameworks like accelerate and DeepSpeed can help, but they increase complexity. Start with a single consumer GPU and a 7B model to keep the workflow straightforward. Once you confirm the base model works for your domain, you can scale up.
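To see where the figures in the reference list come from, it helps to separate the memory taken by the frozen weights from everything else. The sketch below computes only the weight footprint (parameters times bits per weight); the helper name weight_memory_gb is hypothetical. The gap between these numbers and the totals above is taken up by activations, LoRA gradients and optimizer state, quantization constants, and framework overhead, all of which depend on sequence length and batch size.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the frozen base weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Weight-only footprint for a few model sizes and precisions.
for params_billion, bits in [(1.1, 4), (3.8, 4), (7, 4), (7, 16), (13, 4)]:
    gb = weight_memory_gb(params_billion, bits)
    print(f"{params_billion}B parameters at {bits}-bit: {gb:.2f} GB for weights")
```

Treat the result as a lower bound: if the weights alone already approach your card's capacity, the setup will not train without offloading.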
3. Setting Up Your Training Environment and Tools
Install the core Python libraries inside a clean environment. Use pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 (or your CUDA version). Then install the fine-tuning ecosystem: transformers, datasets, peft, bitsandbytes, accelerate, and trl (Transformer Reinforcement Learning, which includes the SFTTrainer class). If you plan to quantize the base model to 4-bit, bitsandbytes is mandatory. For a full script, you can also install wandb for experiment tracking. Here is a minimal setup block:
pip install torch transformers datasets peft bitsandbytes accelerate trl wandb
After installation, log in to Hugging Face if you plan to use gated models (e.g., Llama 3 requires acceptance of terms). Use huggingface-cli login and paste an access token. Then test that your GPU is recognized by running torch.cuda.is_available() in a Python session; it should return True. With the environment ready, download your selected base model using AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", quantization_config=BitsAndBytesConfig(load_in_4bit=True)), which caches the weights locally. Recent transformers releases expect the 4-bit option inside a BitsAndBytesConfig rather than as a bare load_in_4bit=True argument. Now you’re ready to load your dataset and begin training configuration.
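The 4-bit loading step can be sketched as a small helper. This is one reasonable configuration rather than the only one: the NF4 quantization type, bfloat16 compute dtype, and double quantization are assumptions commonly used for QLoRA, and the function name load_base_model_4bit is hypothetical. The heavy imports sit inside the function so the file can be imported on a machine without a GPU.

```python
def load_base_model_4bit(model_id: str):
    """Load a causal LM in 4-bit precision along with its tokenizer.

    Requires torch, transformers, bitsandbytes, and a CUDA-capable GPU.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store base weights in 4-bit
        bnb_4bit_quant_type="nf4",              # assumption: NF4 quantization
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: bf16 matmuls (Ampere or newer)
        bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",                      # place layers on the available GPU
    )
    return model, tokenizer


# Usage (downloads the weights on first call; gated models need a prior login):
# model, tokenizer = load_base_model_4bit("meta-llama/Meta-Llama-3-8B")
```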
4. Configuring LoRA Hyperparameters for Optimal Performance
LoRA reduces trainable parameters by inserting rank-decomposition matrices into attention layers. The two key hyperparameters are r (rank) and lora_alpha. A rank of 8 or 16 works well for most business fine-tuning; higher ranks add adaptation capacity but increase VRAM use. lora_alpha controls the scaling of the adapter update (the update is multiplied by lora_alpha / r) and is typically set to 16 or 32. Target modules should include all linear layers in the attention blocks. For Llama 3, that’s q_proj, k_proj, v_proj, o_proj. Include gate_proj, up_proj, down_proj for the feed-forward network if you want more adaptation capacity. Below is a sample PEFT configuration:
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                # rank of the low-rank update matrices
    lora_alpha=32,       # the adapter update is scaled by lora_alpha / r
    # Attention projections plus the feed-forward projections
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,   # dropout applied on the adapter path
    bias="none",         # keep bias terms frozen
    task_type="CAUSAL_LM",
)
Other critical settings: learning_rate starts at 2e-4 for LoRA (use a scheduler like cosine with warmup). A per_device_train_batch_size of 1 to 4, combined with gradient_accumulation_steps of 4 to 8, gives an effective batch size of 4–32. Train for 3–5 epochs on datasets under 5,000 examples; monitor validation loss to avoid overfitting. Use a max_seq_length of 512 or 1024 tokens depending on your data. If you hit OOM errors, reduce max_seq_length, the batch size, or r, or enable 4-bit quantization.
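To make the cost of rank and target-module choices concrete, the sketch below counts the trainable parameters LoRA adds. Each adapted linear layer with shape (in_features, out_features) gains two matrices, A (r × in_features) and B (out_features × r), so it adds r × (in_features + out_features) parameters. The layer shapes and layer count are assumptions taken from the published Llama 3 8B configuration; the helper name lora_trainable_params is hypothetical.

```python
# (in_features, out_features) of each linear layer in one transformer block.
# Assumed from the Llama 3 8B config: hidden size 4096, 8 KV heads of
# dimension 128 (so k/v project to 1024), feed-forward size 14336.
LLAMA3_8B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32


def lora_trainable_params(shapes, target_modules, r, num_layers):
    """Total parameters in the LoRA A and B matrices across all layers."""
    per_layer = sum(r * (shapes[m][0] + shapes[m][1]) for m in target_modules)
    return per_layer * num_layers


attention_only = ["q_proj", "k_proj", "v_proj", "o_proj"]
all_linear = list(LLAMA3_8B_SHAPES)

for r in (8, 16):
    attn = lora_trainable_params(LLAMA3_8B_SHAPES, attention_only, r, NUM_LAYERS)
    full = lora_trainable_params(LLAMA3_8B_SHAPES, all_linear, r, NUM_LAYERS)
    print(f"r={r}: attention only {attn:,} params, all linear layers {full:,} params")

# Scaling applied to the adapter update for r=16, lora_alpha=32.
print("scaling factor lora_alpha / r =", 32 / 16)
```

Even with every linear layer targeted at r=16, the adapter stays well under 1% of the 8 billion base parameters, which is why the optimizer state fits alongside a quantized base model.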
5. Running the Fine-Tuning Process: Tips and Monitoring
With the configuration in place, start training using the SFTTrainer from TRL. The trainer handles formatting, packing, and gradient updates. A typical command inside a Python script or notebook looks like this:
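from trl import SFTTrainer

# Argument names below follow older TRL releases. Newer releases take
# processing_class instead of tokenizer and read max_seq_length and packing
# from an SFTConfig, so check the signature of your installed version.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=training_args,
    tokenizer=tokenizer,
    peft_config=lora_config,
    max_seq_length=512,
    packing=True,  # concatenate short examples to fill each sequence
)
trainer.train()
Monitor loss every few steps. Expect the training loss to drop from ~2.0 to below 1.0 after a few hundred steps. If loss plateaus too early, increase r or reduce the learning rate. If loss spikes, lower the learning rate or check for corrupted data. Use the wandb integration to log gradients and learning rate curves. Training a 7B model with LoRA on a 24 GB GPU typically takes anywhere from 1 to 8 hours for 1,000 examples over 3 epochs, depending on sequence length and batch size. Save checkpoints periodically by setting save_strategy="steps" and save_steps in the training arguments; you can also save the current state manually with trainer.save_model("checkpoint-step-500"). This lets you roll back if later steps degrade quality.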
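The training_args object passed to the trainer is not defined in the text, so here is a sketch of what it might contain, together with the step arithmetic that follows from it. The dictionary keys mirror keyword arguments of transformers.TrainingArguments and could be passed as TrainingArguments(**hparams); the specific values (output directory, warmup ratio, save interval) are illustrative assumptions, and training_schedule is a hypothetical helper. The step count is approximate: the Trainer's own rounding may differ slightly, and packing=True changes the number of sequences.

```python
import math

# Illustrative hyperparameters; keys match transformers.TrainingArguments.
hparams = {
    "output_dir": "lora-out",           # hypothetical directory name
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,               # assumption: 3% of steps for warmup
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 3,
    "logging_steps": 10,
    "save_strategy": "steps",
    "save_steps": 50,                   # assumption: checkpoint every 50 steps
}


def training_schedule(num_examples, hp):
    """Estimate optimizer steps, warmup steps, and checkpoint count."""
    effective_batch = hp["per_device_train_batch_size"] * hp["gradient_accumulation_steps"]
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * hp["num_train_epochs"]
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * hp["warmup_ratio"]),
        "checkpoints": total_steps // hp["save_steps"],
    }


print(training_schedule(1000, hparams))
```

Knowing the total step count up front makes it easier to pick logging and checkpoint intervals that produce a useful number of data points rather than one or two.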
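Once training finishes, merge the LoRA weights into the base model if you plan to deploy with Ollama as a single model file. If you trained on a 4-bit base (QLoRA), reload the base model in 16-bit precision and attach the adapter before merging, since merging into quantized weights loses precision. Use model = model.merge_and_unload() and save the full model with model.save_pretrained("final-model"), then save the tokenizer to the same directory with tokenizer.save_pretrained("final-model"). Alternatively, keep the adapter separate and load it at inference time using PeftModel.from_pretrained(base_model, adapter_path) for Hugging Face pipelines.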
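What merging does numerically can be shown on a toy layer. For a frozen weight W and adapter matrices A and B, the merged weight is W + (lora_alpha / r) · B · A, and applying it to an input gives the same result as running the base layer and the adapter side by side. The sketch below uses plain Python lists and tiny hand-picked matrices purely for illustration; the helper names are hypothetical, and real merges are done by merge_and_unload() on full-size tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))] for i in range(len(W))]


# Toy layer: 3 inputs, 2 outputs, rank-1 adapter.
W = [[1, 0, 2], [0, 1, 1]]  # frozen base weight, shape (out, in)
B = [[1], [2]]              # shape (out, r)
A = [[1, 2, 3]]             # shape (r, in)
alpha, r = 2, 1
x = [[1], [1], [1]]         # one input column vector

# Path 1: base output plus scaled adapter output, as during training.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
separate = [[base_out[i][0] + (alpha / r) * adapter_out[i][0]] for i in range(len(W))]

# Path 2: a single matmul with the merged weight, as after merging.
merged = matmul(merge_lora(W, A, B, alpha, r), x)

print("separate:", separate)
print("merged:  ", merged)
```

Because the two paths agree, merging removes the adapter's extra inference cost without changing the model's outputs, up to floating-point rounding on real tensors.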
6. Evaluating Your Fine-Tuned Model
Quantitative metrics like perplexity are useful but not sufficient for business use cases. Start by calculating perplexity on your held-out test set—a drop of at least 10–20% from the base model indicates learning. However, the real test is qualitative: ask the model to perform actual business tasks. Create a small evaluation set of 20–50 prompts that mirror production queries. Compare the fine-tuned model’s outputs against the base model’s outputs and a human-written gold standard. Score them on accuracy, tone, and adherence to formatting. Use a rubric like the one below:
- Accuracy: Does the output correctly answer the question or complete the task without hallucination?
- Format following: Does it match your required structure (e.g., bullet points, JSON, email style)?
- Brevity vs. detail: Is the response appropriately concise for the use case?
If the fine-tuned model performs worse than expected, revisit your dataset. Common issues include insufficient diversity, mismatched tokenizer padding, or overfitting on short examples. You can also use an automated evaluator like an LLM-as-a-judge (e.g., GPT-4) to rate outputs on a scale of 1–5, but always spot-check manually. Once you achieve consistent improvements across your test set, the model is ready for deployment.
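The two kinds of evaluation described in this section reduce to simple arithmetic. Perplexity is the exponential of the mean per-token negative log-likelihood (so exp(eval_loss) when the trainer reports a mean cross-entropy loss), and rubric scores can be averaged per criterion to compare models. The sketch below assumes you have already collected per-token losses and 1–5 rubric scores; all function names and sample numbers are hypothetical.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_drop(base_ppl, tuned_ppl):
    """Fractional perplexity reduction of the tuned model versus the base."""
    return (base_ppl - tuned_ppl) / base_ppl


def rubric_means(scored_outputs):
    """Average each rubric criterion over a list of {criterion: score} dicts."""
    criteria = scored_outputs[0].keys()
    return {c: sum(s[c] for s in scored_outputs) / len(scored_outputs) for c in criteria}


# Hypothetical per-token losses for the same held-out text under both models.
base_ppl = perplexity([2.1, 1.9, 2.3, 2.0])
tuned_ppl = perplexity([1.6, 1.5, 1.9, 1.7])
print(f"base {base_ppl:.2f}, tuned {tuned_ppl:.2f}, drop {relative_drop(base_ppl, tuned_ppl):.1%}")

# Hypothetical reviewer scores (1-5) for three prompts.
scores = [
    {"accuracy": 4, "format": 5, "brevity": 3},
    {"accuracy": 5, "format": 4, "brevity": 4},
    {"accuracy": 3, "format": 5, "brevity": 5},
]
print(rubric_means(scores))
```

Per-criterion averages are more useful than a single overall score because they show whether a regression comes from accuracy, formatting, or length.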
7. Deploying with Ollama for Production Use
Ollama provides an easy way to run language models locally with an OpenAI-compatible API. To deploy your fine-tuned model, first convert the merged PyTorch checkpoint to GGUF format, which Ollama uses for efficient CPU/GPU inference. Use the llama.cpp project’s conversion script (convert_hf_to_gguf.py in current releases; older releases shipped it as convert.py) to generate a GGUF file from your final-model directory. Then register the model with the ollama create command and a Modelfile. Write a Modelfile that points to your local weights (the GGUF file, or the final-model directory for architectures Ollama can import directly):
FROM ./final-model
# Customize the template if needed
TEMPLATE "{{ .Prompt }}"
PARAMETER temperature 0.7
PARAMETER top_p 0.9
Run ollama create my-business-model -f Modelfile to build the model, then ollama run my-business-model to test it interactively. API usage is straightforward: send POST requests to http://localhost:11434/api/generate with a JSON body containing the model name and your prompt. This allows integration into chatbots, internal tools, or automation pipelines. Ollama also supports multiple GPUs and works on Linux, macOS, and Windows. For production, consider setting OLLAMA_NUM_PARALLEL=4 to handle


