Why Fine-tune?
Fine-tuning adapts a pre-trained model to your specific task or domain by training it further on your own data. Done well, it improves task performance, cuts per-call cost (prompts can be much shorter), and gives you more control over output style and behavior.
| Aspect | Pre-trained Model | Fine-tuned Model |
|---|---|---|
| Performance on your task | General, may be inconsistent | Specialized, more accurate |
| Prompt length | Long (needs examples) | Short (behavior learned) |
| Cost per call | Higher (more tokens) | Lower (shorter prompts) |
| Response consistency | Variable | More reliable |
🎓 Fine-tuning Methods
Full Fine-tuning
Update all parameters
- Best performance
- Requires lots of GPU memory
- Slow and expensive
- Risk of catastrophic forgetting
LoRA (Low-Rank Adaptation)
Add small trainable matrices
- 90% less memory
- Fast training
- Easy to swap adapters
- Nearly same performance
QLoRA
LoRA + quantization
- Even less memory (4-bit)
- Fine-tune on consumer GPUs
- Minimal quality loss
- Best for resource constraints
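To make the LoRA idea concrete, here is a minimal sketch in plain PyTorch (not the actual peft implementation) of how a small trainable low-rank update is added next to a frozen weight matrix; the scaling follows the usual LoRA formulation Wx + (alpha/r)·BAx, and the layer size is just an example.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # down-projection to rank r
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # up-projection back to full size
        nn.init.zeros_(self.lora_B.weight)   # start as a no-op: B @ A = 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / {total:,}")  # 65,536 trainable out of ~16.8M for this one layer
Only the two small matrices receive gradients, which is why memory and adapter size stay tiny even for a 7B-parameter model.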
💻 Full Fine-tuning Example
Using Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset
import torch
# Load model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
# Prepare dataset
dataset = load_dataset("your-dataset")
def preprocess(examples):
    # Format: "Instruction: ... Response: ..."
    texts = []
    for instruction, response in zip(examples["instruction"], examples["response"]):
        text = f"Instruction: {instruction}\nResponse: {response}"
        texts.append(text)
    return tokenizer(texts, truncation=True, padding="max_length", max_length=512)
tokenized_dataset = dataset.map(preprocess, batched=True)
# Training arguments
training_args = TrainingArguments(
    output_dir="./fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=10,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    fp16=True  # Mixed precision for speed
)
# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)  # builds causal-LM labels from input_ids
)
trainer.train()
# Save
model.save_pretrained("./my-fine-tuned-model")
tokenizer.save_pretrained("./my-fine-tuned-model")
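As a quick sanity check after saving, you can reload the model from disk and generate from it. A minimal sketch, assuming the output directory above and the Instruction/Response format used during training:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Reload the fine-tuned weights from disk
tokenizer = AutoTokenizer.from_pretrained("./my-fine-tuned-model")
model = AutoModelForCausalLM.from_pretrained(
    "./my-fine-tuned-model",
    torch_dtype=torch.float16,
    device_map="auto"
)

prompt = "Instruction: Summarize this text:\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))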
⚡ LoRA Fine-tuning (Recommended)
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType
import torch
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)
# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                  # Rank (4, 8, 16, 32)
    lora_alpha=32,        # Scaling factor
    lora_dropout=0.1,     # Dropout for regularization
    target_modules=[      # Which layers to adapt
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj"
    ]
)
# Apply LoRA
model = get_peft_model(model, lora_config)
# Check trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%
# Train with Trainer (same as before)
# ...
LoRA Benefits Explained
Instead of updating all 7B parameters:
- Add small matrices to attention layers
- Only train ~4M parameters (0.06%!)
- 90% less GPU memory needed
- Can fine-tune on single RTX 4090
- Easy to distribute (LoRA weights are tiny: ~20MB vs 14GB)
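Because only the adapter is trained, saving and distributing it is lightweight. A minimal sketch of saving the adapter, later attaching it to the frozen base model, and optionally merging it for adapter-free inference (paths are placeholders):
from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

# After training: saves only the LoRA adapter weights (tens of MB), not the base model
model.save_pretrained("./my-lora-adapter")

# Later: load the base model and attach the adapter
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base, "./my-lora-adapter")

# Optional: fold the adapter into the base weights for deployment
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")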
🚀 QLoRA - Even More Efficient
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)
# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
# Prepare for training
model = prepare_model_for_kbit_training(model)
# Apply LoRA (same as before)
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Train with ~5GB GPU memory!
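To actually stay within a small memory budget, QLoRA is usually combined with gradient checkpointing and a paged 8-bit optimizer during training. A minimal sketch of the training setup, assuming the tokenizer and tokenized dataset from the earlier example; actual memory use depends on sequence length and batch size:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./qlora-model",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # keep the effective batch size reasonable
    gradient_checkpointing=True,      # trade compute for memory
    optim="paged_adamw_8bit",         # paged optimizer from bitsandbytes
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
trainer.train()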
| Method | GPU Memory (7B model) | Training Speed | Quality |
|---|---|---|---|
| Full Fine-tuning | ~80GB (needs A100) | Slow | 100% |
| LoRA | ~24GB (needs RTX 3090) | Fast | ~99% |
| QLoRA | ~5GB (works on RTX 3060) | Fast | ~97% |
📊 Preparing Training Data
Format Your Data
# instruction_dataset.jsonl
{"instruction": "Translate to French:", "input": "Hello, how are you?", "output": "Bonjour, comment allez-vous?"}
{"instruction": "Translate to French:", "input": "Good morning", "output": "Bonjour"}
{"instruction": "Summarize this text:", "input": "Long text...", "output": "Summary..."}
# Load with datasets
from datasets import load_dataset
dataset = load_dataset("json", data_files="instruction_dataset.jsonl")
# Format for training
def format_instruction(example):
    if example["input"]:
        text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": text}
dataset = dataset.map(format_instruction)
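The Trainer examples above expect separate train and validation splits; you can carve one out of a single JSONL file. A minimal sketch using the datasets split helper (load_dataset("json", ...) puts everything under "train"):
# Hold out 10% of the data for validation
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
train_data = split["train"]
val_data = split["test"]
print(len(train_data), len(val_data))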
Data Quality Tips
- Quantity: 100-1000 examples for task-specific fine-tuning
- Diversity: Cover various aspects of your task
- Quality: High-quality examples > large quantity
- Format: Consistent formatting helps model learn faster
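Several of these checks are easy to automate before training. A small sketch that flags empty fields and exact duplicates in an instruction dataset (field names match the JSONL example above):
import json

seen = set()
problems = 0
with open("instruction_dataset.jsonl") as f:
    for i, line in enumerate(f):
        example = json.loads(line)
        # Flag records with a missing instruction or output
        if not example.get("instruction", "").strip() or not example.get("output", "").strip():
            print(f"Line {i}: empty instruction or output")
            problems += 1
        # Flag exact duplicates (same instruction + input)
        key = (example.get("instruction", ""), example.get("input", ""))
        if key in seen:
            print(f"Line {i}: duplicate of an earlier example")
            problems += 1
        seen.add(key)
print(f"{problems} potential issues found")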
🎯 OpenAI Fine-tuning API
import os
from openai import OpenAI

os.environ["OPENAI_API_KEY"] = "your-key-here"
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prepare data (JSONL format)
# {"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

# Upload training file
with open("training_data.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")

# Create fine-tune job
fine_tune = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 3
    }
)
print(f"Fine-tune job created: {fine_tune.id}")

# Check status
status = client.fine_tuning.jobs.retrieve(fine_tune.id)
print(status.status)  # "running", "succeeded", or "failed"

# Once the job succeeds, use your fine-tuned model
response = client.chat.completions.create(
    model=status.fine_tuned_model,  # e.g., "ft:gpt-3.5-turbo:org:model:id"
    messages=[{"role": "user", "content": "Your prompt"}]
)
print(response.choices[0].message.content)
Pricing (GPT-3.5-turbo)
- Training: $0.008 per 1K tokens
- Usage: $0.012 per 1K tokens (vs $0.0015 for base model)
- Cost-benefit: Higher per-call cost, but shorter prompts = net savings
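A rough back-of-the-envelope calculation shows when the shorter prompt pays off, using the rates above and hypothetical token counts; savings only appear once the prompt shrinks by more than the ~8x price ratio.
# Hypothetical example: few-shot prompt vs. fine-tuned prompt (input tokens only)
base_rate = 0.0015 / 1000       # $ per input token, base gpt-3.5-turbo
ft_rate = 0.012 / 1000          # $ per input token, fine-tuned model

few_shot_prompt_tokens = 2000   # long prompt with instructions + examples
fine_tuned_prompt_tokens = 150  # short prompt, behavior learned in the weights

base_cost = few_shot_prompt_tokens * base_rate
ft_cost = fine_tuned_prompt_tokens * ft_rate
print(f"Base model:  ${base_cost:.4f} per call")   # $0.0030
print(f"Fine-tuned:  ${ft_cost:.4f} per call")     # $0.0018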
⚙️ Best Practices
1. Start Small
- Try prompt engineering first
- Then few-shot learning
- Fine-tune only if needed
2. Choose the Right Method
- OpenAI API: Easiest, no infrastructure needed
- LoRA: Best balance of efficiency and performance
- QLoRA: For consumer GPUs
- Full fine-tuning: Only if you need absolute best performance
3. Monitor Training
# Use Weights & Biases for tracking
import wandb
wandb.init(project="my-fine-tune")
training_args = TrainingArguments(
    # ...
    report_to="wandb"
)
4. Evaluate Properly
- Hold out test set for evaluation
- Use task-specific metrics
- Compare with base model
- Test edge cases
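One simple way to apply the last two points is to run the same held-out prompts through both models and compare outputs side by side. A minimal sketch; base_model, fine_tuned_model, and the test examples are placeholders, and a task-specific metric should replace eyeballing where possible:
def generate(model, tokenizer, prompt, max_new_tokens=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# test_set: held-out examples never seen during training
test_set = [
    {"prompt": "Instruction: Translate to French:\nInput: Good evening\nResponse:", "expected": "Bonsoir"},
    # ... more held-out examples, including edge cases
]

for example in test_set:
    base_out = generate(base_model, tokenizer, example["prompt"])
    tuned_out = generate(fine_tuned_model, tokenizer, example["prompt"])
    print("PROMPT:  ", example["prompt"])
    print("BASE:    ", base_out)
    print("TUNED:   ", tuned_out)
    print("EXPECTED:", example["expected"])
    print("-" * 40)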
🎯 Key Takeaways
- Fine-tuning customizes models for your specific task
- LoRA is the best method for most use cases (efficient + effective)
- QLoRA enables fine-tuning on consumer hardware
- Data quality matters more than quantity
- Start simple - try prompting before fine-tuning