Fine-Tuning
Fine-Tuning Methods
Full Fine-Tuning: Update all model parameters on task-specific data. Most effective approach but computationally expensive, typically requiring GPU clusters. Best quality for the target domain given sufficient data (10K+ examples). Risk of catastrophic forgetting on original tasks.
LoRA (Low-Rank Adaptation): Freeze the base model and train small low-rank matrices injected into attention layers. Only 0.1-1% of parameters are trained. Reduces memory use roughly 3x and training time 2-3x. Composable: swap LoRA adapters for different tasks. The PEFT method of choice for most use cases.
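The low-rank idea can be sketched in a few lines of NumPy. This is an illustrative forward pass, not the PEFT library's implementation: the frozen weight W gets an additive update (alpha/r) * B @ A, and only A and B would be trained. Dimensions and init scales here are arbitrary choices for the sketch.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Frozen weight W plus a scaled low-rank update (alpha/r) * B @ A."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

d, r = 2048, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)) * 0.02  # frozen base weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-init

x = rng.standard_normal((2, d))
y = lora_forward(x, W, A, B)

frac = (A.size + B.size) / W.size
print(f"trainable fraction: {frac:.4%}")  # ~0.78% for r=8, d=2048
```

Zero-initializing B means the adapted layer starts out exactly equal to the frozen one, which is the standard LoRA init so training begins from the base model's behavior.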
QLoRA: LoRA on top of a 4-bit quantized base model for memory efficiency. Combines NormalFloat4 (NF4) quantization with LoRA. Fine-tune 65B-parameter models on a single GPU with minimal quality loss versus standard LoRA. Enables large-model fine-tuning on consumer hardware.
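To make the quantization side concrete, here is a simplified blockwise 4-bit absmax quantizer in NumPy. Real QLoRA uses the NF4 data type (quantile-based levels) plus double quantization, which this sketch does not reproduce; it only shows the block-scale-and-round mechanism and the reconstruction error it costs.

```python
import numpy as np

def quantize_4bit(w, block=64):
    """Blockwise absmax quantization to the signed 4-bit range [-7, 7].

    Simplified stand-in for NF4: one float scale per block of 64 weights,
    values rounded to 15 uniform levels (NF4 uses non-uniform levels)."""
    flat = w.reshape(-1, block)
    scales = np.abs(flat).max(axis=1, keepdims=True)
    q = np.round(flat / scales * 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q, scales, shape):
    return (q.astype(np.float64) / 7 * scales).reshape(shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
q, s = quantize_4bit(W)
W_hat = dequantize_4bit(q, s, W.shape)
err = np.abs(W - W_hat).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```

Storing 4-bit codes plus one scale per 64 weights is what cuts the frozen base model's memory roughly 8x versus fp32 (4x versus fp16); the small LoRA matrices stay in higher precision and carry all the gradient updates.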
Prefix Tuning: Prepend trainable continuous vectors (prefixes) to each layer while keeping the model frozen. Task-specific prefixes guide model behavior with roughly 0.1% of parameters trainable. Effective for text generation and conditional tasks. Soft prompts, as opposed to hard discrete prompts.
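A toy single-head attention layer shows where the prefixes enter: trainable key/value vectors are concatenated ahead of the (frozen) projections of the real tokens, so every query can attend to them. All dimensions here are illustrative.

```python
import numpy as np

def attention_with_prefix(x, Wq, Wk, Wv, Pk, Pv):
    """Single attention head with trainable prefix keys/values prepended."""
    q = x @ Wq                                # frozen query projection
    k = np.concatenate([Pk, x @ Wk], axis=0)  # trainable prefix keys first
    v = np.concatenate([Pv, x @ Wv], axis=0)  # trainable prefix values first
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d, prefix_len, n_layers = 512, 20, 24
rng = np.random.default_rng(0)
x = rng.standard_normal((10, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
Pk = rng.standard_normal((prefix_len, d))
Pv = rng.standard_normal((prefix_len, d))

out = attention_with_prefix(x, Wq, Wk, Wv, Pk, Pv)
trainable = n_layers * 2 * prefix_len * d  # one K and one V prefix per layer
print(out.shape, trainable)
```

With 24 layers, a 20-token prefix, and hidden size 512, the prefixes total about half a million parameters, a tiny fraction of even a small transformer.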
Adapter Layers: Insert small trainable modules (adapters) between frozen model layers. A bottleneck architecture compresses then expands representations. 1-4% of parameters trainable. Multiple adapters are composable for multi-task learning. Houlsby and Pfeiffer are the main adapter variants.
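The bottleneck itself is simple enough to sketch directly: down-project, apply a nonlinearity, up-project, and add a residual connection. This is a schematic of the structure, not the Houlsby or Pfeiffer implementation (those use GELU and sit at specific points inside the transformer block).

```python
import numpy as np

def adapter(h, W_down, W_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    z = np.maximum(0, h @ W_down)  # ReLU here; Houlsby adapters use GELU
    return h + z @ W_up            # residual keeps the frozen layer's output

d, bottleneck = 768, 48
rng = np.random.default_rng(0)
h = rng.standard_normal((4, d))
W_down = rng.standard_normal((d, bottleneck)) * 0.01
W_up = np.zeros((bottleneck, d))   # zero-init: adapter starts as identity

out = adapter(h, W_down, W_up)
print(out.shape, W_down.size + W_up.size)  # trainable params per adapter
```

The zero-initialized up-projection makes each adapter an identity function at the start of training, so inserting adapters cannot hurt the frozen model before any gradient steps are taken.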
Instruction Tuning: Fine-tune on diverse instruction-following datasets to improve zero-shot task generalization and enable chat-style interactions. Examples: FLAN, T0, Dolly, Alpaca. Bridges base models and aligned assistants via multi-task mixture training.
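Instruction datasets are usually rendered into a fixed prompt template before training. The Alpaca-style template below is one common convention; the exact wording and section markers vary by dataset, so treat this as a sketch of the idea rather than a canonical format.

```python
# Alpaca-style template (one common convention; wording varies by dataset).
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "{input_block}### Response:\n{output}"
)

def format_example(instruction, output, inp=""):
    """Render one instruction-input-output record as a training string."""
    input_block = f"### Input:\n{inp}\n\n" if inp else ""
    return TEMPLATE.format(instruction=instruction,
                           input_block=input_block, output=output)

prompt = format_example("Summarize the text.", "A short summary.",
                        inp="Some long passage.")
print(prompt)
```

Keeping the template identical between training and inference matters: the model learns to generate after the `### Response:` marker, so the same marker must appear at generation time.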
Fine-Tuning vs RAG Decision Matrix
| Scenario | Best Approach | Why | Considerations |
|---|---|---|---|
| New domain knowledge (medical, legal) | Fine-Tuning | Deeply embed domain patterns, terminology, reasoning | Requires quality dataset |
| Frequently changing information | RAG | Easy updates without retraining | Retrieval latency |
| Proprietary/sensitive data | Fine-Tuning | Data stays internal, no external calls | Higher upfront cost |
| Behavioral change (tone, format) | Fine-Tuning | Consistent style adherence | Hard to update behavior |
| Factual question answering | RAG | Source attribution, easy fact updates | Context window limits |
| Both needed | Hybrid (Fine-Tune + RAG) | Tuned model with retrieval augmentation | Complexity overhead |
Training Infrastructure & Tools
Frameworks: Software libraries for fine-tuning LLMs. Hugging Face Transformers (most popular; model hub integration), PyTorch Lightning (training-loop abstraction), DeepSpeed (Microsoft; distributed training), Accelerate (device abstraction), the PEFT library for parameter-efficient methods, and TRL (Transformer Reinforcement Learning).
Dataset Preparation: Curate and format training data for fine-tuning. Quality over quantity: 100 high-quality examples often beat 10K low-quality ones. Data formatting (instruction-input-output or conversational), deduplication and cleaning, train/val/test splits, data augmentation and synthetic generation. JSONL is the standard format.
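A minimal stdlib-only sketch of that pipeline: deduplicate on the serialized record, shuffle with a fixed seed, split, and write JSONL (one JSON object per line). Field names and the 90/10 split are illustrative choices.

```python
import json
import random

examples = [
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},  # duplicate
    {"instruction": "Summarize.", "input": "A long passage.", "output": "A summary."},
]

# Deduplicate on the canonically serialized record.
seen, clean = set(), []
for ex in examples:
    key = json.dumps(ex, sort_keys=True)
    if key not in seen:
        seen.add(key)
        clean.append(ex)

# Seeded shuffle, then a 90/10 train/val split.
random.Random(42).shuffle(clean)
split = int(0.9 * len(clean))
train, val = clean[:split], clean[split:]

# JSONL: one JSON object per line.
with open("train.jsonl", "w") as f:
    for ex in train:
        f.write(json.dumps(ex) + "\n")
print(len(train), len(val))
```

Real pipelines usually add near-duplicate detection (hashing normalized text) and length or language filters on top of this exact-match dedup.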
Hyperparameters: Learning rate (critical; use warmup and decay), batch size (gradient accumulation for large effective batches), epochs (1-3 typical), LoRA rank and alpha, weight decay, gradient clipping. Use validation loss for early stopping. Tools: Weights & Biases, Ray Tune, Optuna.
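The warmup-then-cosine-decay schedule is easy to write down explicitly. The peak and minimum learning rates below are illustrative values in the typical 1e-5 to 1e-4 range, not recommendations for any particular model.

```python
import math

def lr_schedule(step, total_steps, warmup_steps, peak_lr=1e-4, min_lr=1e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total, warmup = 1000, 100
lrs = [lr_schedule(s, total, warmup) for s in range(total + 1)]
print(lrs[0], max(lrs), lrs[-1])  # 0 at start, peak after warmup, min at end
```

Warmup avoids large, destabilizing updates while optimizer statistics are still cold; cosine decay then anneals smoothly so the final steps make only small refinements.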
Distributed Training: Scale training across multiple GPUs or nodes. Data parallelism (split batches across devices), model/tensor parallelism (split model layers), pipeline parallelism (split into pipeline stages). FSDP (Fully Sharded Data Parallel), DeepSpeed ZeRO stages, Megatron-LM. Gradient accumulation for large effective batch sizes.
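Gradient accumulation can be demonstrated on a toy least-squares problem: averaging several micro-batch gradients before a single update yields the same step as one large batch, which is why it simulates big batches on limited memory. Sizes and the learning rate are arbitrary for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 3))
y = rng.standard_normal(32)
w0 = np.zeros(3)

accum_steps, micro_bs, lr = 4, 8, 0.1
grad = np.zeros_like(w0)
for i in range(accum_steps):
    xb = X[i * micro_bs:(i + 1) * micro_bs]
    yb = y[i * micro_bs:(i + 1) * micro_bs]
    g = 2 * xb.T @ (xb @ w0 - yb) / micro_bs  # micro-batch gradient
    grad += g / accum_steps                   # average across micro-batches
w = w0 - lr * grad                            # one optimizer step

full_grad = 2 * X.T @ (X @ w0 - y) / len(X)   # gradient of one 32-sample batch
print(np.allclose(grad, full_grad))           # True
```

In a real framework the same pattern appears as calling `backward()` on a loss divided by the accumulation count for several micro-batches, then stepping the optimizer once.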
Fine-Tuning Best Practices
Data Quality
- Diverse examples covering target distribution
- High-quality human-written or curated data
- Remove duplicates and low-quality samples
- Balance dataset across categories
- Include edge cases and error handling
Training Strategy
- Start with small learning rate (1e-5 to 1e-4)
- Use warmup and cosine decay schedule
- Monitor train/val loss curves (watch overfitting)
- Gradient accumulation for larger effective batch
- Save checkpoints regularly
Evaluation
- Task-specific metrics (BLEU, ROUGE, accuracy)
- Human evaluation for quality assessment
- Test on held-out data (not seen during training)
- Compare against base model and baselines
- Check for catastrophic forgetting
