🎯 Fine-tuning Large Language Models

Customize LLMs for your specific use case

Why Fine-tune?

Fine-tuning adapts a pre-trained model to your specific task or domain by training it on your data. This improves performance, reduces costs, and gives you more control.

Aspect                   | Pre-trained Model            | Fine-tuned Model
------------------------ | ---------------------------- | ---------------------------
Performance on your task | General, may be inconsistent | Specialized, more accurate
Prompt length            | Long (needs examples)        | Short (behavior learned)
Cost per call            | Higher (more tokens)         | Lower (shorter prompts)
Response consistency     | Variable                     | More reliable

🎓 Fine-tuning Methods

Full Fine-tuning

Update all parameters

  • Best performance
  • Requires lots of GPU memory
  • Slow and expensive
  • Risk of catastrophic forgetting

LoRA (Low-Rank Adaptation)

Add small trainable matrices (see the sketch below)

  • 90% less memory
  • Fast training
  • Easy to swap adapters
  • Nearly same performance

QLoRA

LoRA + quantization

  • Even less memory (4-bit)
  • Fine-tune on consumer GPUs
  • Minimal quality loss
  • Best for resource constraints
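
Both LoRA and QLoRA rely on the same trick: the pre-trained weight W stays frozen and only a small low-rank update BA is trained (QLoRA additionally stores W in 4-bit). A minimal PyTorch sketch of that idea, not the actual PEFT implementation:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pre-trained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B starts at zero, so the update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8 * (4096 + 4096) = 65,536 trainable params vs 16,777,216 frozen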

💻 Full Fine-tuning Example

Using Hugging Face Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset
import torch

# Load model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token by default
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Prepare dataset
dataset = load_dataset("your-dataset")

def preprocess(examples):
    # Format: "Instruction: ... Response: ..."
    texts = []
    for instruction, response in zip(examples["instruction"], examples["response"]):
        text = f"Instruction: {instruction}\nResponse: {response}"
        texts.append(text)
    
    return tokenizer(texts, truncation=True, padding="max_length", max_length=512)

tokenized_dataset = dataset.map(preprocess, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=10,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    fp16=True  # Mixed precision for speed
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    # Causal LM collator copies input_ids into labels so the Trainer can compute a loss
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()

# Save
model.save_pretrained("./my-fine-tuned-model")
tokenizer.save_pretrained("./my-fine-tuned-model")
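
After training, it is worth sanity-checking the saved model with a quick generation pass. A minimal sketch, reusing the same Instruction/Response template the model was trained on (the prompt text is just an example):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("./my-fine-tuned-model")
model = AutoModelForCausalLM.from_pretrained(
    "./my-fine-tuned-model",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Use the same "Instruction: ... Response:" template as in training
prompt = "Instruction: Summarize the benefits of fine-tuning.\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))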

⚡ LoRA Fine-tuning (Recommended)

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                    # Rank (4, 8, 16, 32)
    lora_alpha=32,          # Scaling factor
    lora_dropout=0.1,       # Dropout for regularization
    target_modules=[        # Which layers to adapt
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj"
    ]
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
# Output (approx.): trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.12

# Train with Trainer (same as before)
# ...
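
This is also where "easy to swap adapters" pays off: calling save_pretrained on a PEFT model writes only the adapter weights, which can later be re-attached to the base model or merged into it. A sketch, assuming the adapter directory is named ./lora-adapter:

from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Save only the LoRA weights (tens of MB, not the full multi-GB model)
model.save_pretrained("./lora-adapter")

# Later: reload the base model and attach the adapter
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base, "./lora-adapter")

# Optionally fold the adapter into the base weights for deployment
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")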

LoRA Benefits Explained

Instead of updating all 7B parameters:

  • Add small low-rank matrices to the attention layers
  • Only train ~8M parameters (~0.12% of the model; see the arithmetic sketch below)
  • Roughly 90% less GPU memory needed
  • Can fine-tune on a single RTX 4090
  • Easy to distribute (the LoRA adapter is tiny: tens of MB vs ~13GB for the full fp16 model)
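
That trainable-parameter figure follows directly from the adapter shapes: each targeted projection in Llama-2-7B is 4096x4096, and a rank-8 adapter adds r x (d_in + d_out) parameters to it. A quick back-of-the-envelope check in plain Python:

hidden = 4096      # Llama-2-7B hidden size
layers = 32        # transformer blocks
r = 8              # LoRA rank
targets = 4        # q_proj, k_proj, v_proj, o_proj

per_matrix = r * (hidden + hidden)            # A is r x 4096, B is 4096 x r
trainable = layers * targets * per_matrix
print(f"{trainable:,} trainable parameters")  # 8,388,608 (~0.12% of ~6.7B)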

🚀 QLoRA - Even More Efficient

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Apply LoRA (same as before)
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# Train with the Trainer as before; the whole run fits in roughly 5GB of GPU memory

Method           | GPU Memory (7B model)     | Training Speed | Quality
---------------- | ------------------------- | -------------- | -------
Full Fine-tuning | ~80GB (needs A100)        | Slow           | 100%
LoRA             | ~24GB (needs RTX 3090)    | Fast           | ~99%
QLoRA            | ~5GB (works on RTX 3060)  | Fast           | ~97%
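
To actually stay in that ~5GB budget, QLoRA runs are usually configured with small per-device batches, gradient checkpointing, and a paged 8-bit optimizer. A sketch of TrainingArguments along those lines (the exact values are illustrative, not tuned):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./qlora-model",
    num_train_epochs=3,
    per_device_train_batch_size=1,     # keep per-step memory low on consumer GPUs
    gradient_accumulation_steps=16,    # recover a larger effective batch size
    gradient_checkpointing=True,       # trade compute for activation memory
    optim="paged_adamw_8bit",          # paged 8-bit optimizer from bitsandbytes
    learning_rate=2e-4,                # LoRA/QLoRA typically uses a higher LR than full fine-tuning
    logging_steps=10,
    fp16=True
)

# Then pass training_args to Trainer exactly as in the full fine-tuning example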

📊 Preparing Training Data

Format Your Data

# instruction_dataset.jsonl
{"instruction": "Translate to French:", "input": "Hello, how are you?", "output": "Bonjour, comment allez-vous?"}
{"instruction": "Translate to French:", "input": "Good morning", "output": "Bonjour"}
{"instruction": "Summarize this text:", "input": "Long text...", "output": "Summary..."}

# Load with datasets
from datasets import load_dataset

dataset = load_dataset("json", data_files="instruction_dataset.jsonl")

# Format for training
def format_instruction(example):
    if example["input"]:
        text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": text}

dataset = dataset.map(format_instruction)
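
If your JSONL file only has a single split, you can carve out the validation set the earlier Trainer call expects with train_test_split:

# load_dataset("json", ...) puts everything under the "train" split
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]

print(len(train_dataset), len(eval_dataset))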

Data Quality Tips

  • Quality beats quantity: a few hundred clean, consistent examples often outperform thousands of noisy ones
  • Use one consistent prompt/response format across the whole dataset, and reuse it at inference time
  • Deduplicate and proofread; the model will faithfully learn your typos and inconsistencies
  • Make the training examples match the kinds of requests you expect in production

🎯 OpenAI Fine-tuning API

import os
from openai import OpenAI

os.environ["OPENAI_API_KEY"] = "your-key-here"
client = OpenAI()  # picks up OPENAI_API_KEY from the environment

# Prepare data (JSONL format)
# {"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

# Upload training file
with open("training_data.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")

# Create fine-tune job
fine_tune = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 3
    }
)

print(f"Fine-tune job created: {fine_tune.id}")

# Check status
status = client.fine_tuning.jobs.retrieve(fine_tune.id)
print(status.status)  # e.g. "running", "succeeded", or "failed"

# Once complete, use your fine-tuned model
response = client.chat.completions.create(
    model=status.fine_tuned_model,  # e.g. "ft:gpt-3.5-turbo:org:model:id"
    messages=[{"role": "user", "content": "Your prompt"}]
)

print(response.choices[0].message.content)
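
The training file itself has to use the chat "messages" format shown in the comment above. A small sketch that converts the instruction-style records from earlier into that JSONL layout (the system prompt is just a placeholder):

import json

records = [
    {"instruction": "Translate to French:", "input": "Hello, how are you?", "output": "Bonjour, comment allez-vous?"},
    {"instruction": "Translate to French:", "input": "Good morning", "output": "Bonjour"},
]

with open("training_data.jsonl", "w") as f:
    for r in records:
        user_content = f"{r['instruction']}\n{r['input']}" if r["input"] else r["instruction"]
        example = {
            "messages": [
                {"role": "system", "content": "You are a helpful translation assistant."},
                {"role": "user", "content": user_content},
                {"role": "assistant", "content": r["output"]},
            ]
        }
        f.write(json.dumps(example, ensure_ascii=False) + "\n")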

Pricing (GPT-3.5-turbo)

Fine-tuned GPT-3.5-turbo is billed separately for training tokens and for usage, and inference on a fine-tuned model costs more per token than the base model; check OpenAI's pricing page for current rates, as they change frequently.

⚙️ Best Practices

1. Start Small

  • Try prompt engineering first
  • Then few-shot learning
  • Fine-tune only if needed

2. Choose the Right Method

  • OpenAI API: Easiest, no infrastructure needed
  • LoRA: Best balance of efficiency and performance
  • QLoRA: For consumer GPUs
  • Full fine-tuning: Only if you need absolute best performance

3. Monitor Training

# Use Weights & Biases for tracking
import wandb

wandb.init(project="my-fine-tune")

training_args = TrainingArguments(
    # ...
    report_to="wandb"
)

4. Evaluate Properly

  • Hold out test set for evaluation
  • Use task-specific metrics
  • Compare with the base model on the same held-out prompts (see the sketch below)
  • Test edge cases
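
A minimal side-by-side check along those lines, generating from the base and fine-tuned models on a few held-out prompts (the paths, prompts, and comparison criteria are placeholders to adapt to your task):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def generate(model, tokenizer, prompt, max_new_tokens=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Strip the prompt so only the newly generated text is returned
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
tuned = AutoModelForCausalLM.from_pretrained(
    "./my-fine-tuned-model", torch_dtype=torch.float16, device_map="auto"
)

held_out_prompts = [
    "Instruction: Translate to French:\nInput: Good evening\nResponse:",
]

for prompt in held_out_prompts:
    print("BASE: ", generate(base, tokenizer, prompt))
    print("TUNED:", generate(tuned, tokenizer, prompt))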

🎯 Key Takeaways