🎯 Fine-tuning Large Language Models

Customize LLMs for your specific use case

Why Fine-tune?

Fine-tuning adapts a pre-trained model to your specific task or domain by training it on your data. This improves performance, reduces costs, and gives you more control.

Aspect                   | Pre-trained Model            | Fine-tuned Model
------------------------ | ---------------------------- | ---------------------------
Performance on your task | General, may be inconsistent | Specialized, more accurate
Prompt length            | Long (needs examples)        | Short (behavior learned)
Cost per call            | Higher (more tokens)         | Lower (shorter prompts)
Response consistency     | Variable                     | More reliable

🎓 Fine-tuning Methods

Full Fine-tuning

Update all parameters

  • Best performance
  • Requires lots of GPU memory
  • Slow and expensive
  • Risk of catastrophic forgetting

LoRA (Low-Rank Adaptation)

Add small trainable matrices (see the sketch below)

  • 90% less memory
  • Fast training
  • Easy to swap adapters
  • Nearly same performance

QLoRA

LoRA + quantization

  • Even less memory (4-bit)
  • Fine-tune on consumer GPUs
  • Minimal quality loss
  • Best for resource constraints
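
Both LoRA and QLoRA rely on the same trick: the pre-trained weight W stays frozen and only a small low-rank update BA is trained (QLoRA additionally stores W in 4-bit). A minimal PyTorch sketch of that idea, not the actual PEFT implementation:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pre-trained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B starts at zero, so the update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8 * (4096 + 4096) = 65,536 trainable params vs 16,777,216 frozen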

💻 Full Fine-tuning Example

Using Hugging Face Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset
import torch

# Load model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token by default
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Prepare dataset
dataset = load_dataset("your-dataset")

def preprocess(examples):
    # Format: "Instruction: ... Response: ..."
    texts = []
    for instruction, response in zip(examples["instruction"], examples["response"]):
        text = f"Instruction: {instruction}\nResponse: {response}"
        texts.append(text)
    
    return tokenizer(texts, truncation=True, padding="max_length", max_length=512)

tokenized_dataset = dataset.map(preprocess, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./fine-tuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=10,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    fp16=True  # Mixed precision for speed
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    # Causal LM collator copies input_ids into labels so the Trainer can compute a loss
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

trainer.train()

# Save
model.save_pretrained("./my-fine-tuned-model")
tokenizer.save_pretrained("./my-fine-tuned-model")
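
After training, it is worth sanity-checking the saved model with a quick generation pass. A minimal sketch, reusing the same Instruction/Response template the model was trained on (the prompt text is just an example):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("./my-fine-tuned-model")
model = AutoModelForCausalLM.from_pretrained(
    "./my-fine-tuned-model",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Use the same "Instruction: ... Response:" template as in training
prompt = "Instruction: Summarize the benefits of fine-tuning.\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))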

⚡ LoRA Fine-tuning (Recommended)

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                    # Rank (4, 8, 16, 32)
    lora_alpha=32,          # Scaling factor
    lora_dropout=0.1,       # Dropout for regularization
    target_modules=[        # Which layers to adapt
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj"
    ]
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
# Output (approx.): trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.12

# Train with Trainer (same as before)
# ...
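
This is also where "easy to swap adapters" pays off: calling save_pretrained on a PEFT model writes only the adapter weights, which can later be re-attached to the base model or merged into it. A sketch, assuming the adapter directory is named ./lora-adapter:

from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Save only the LoRA weights (tens of MB, not the full multi-GB model)
model.save_pretrained("./lora-adapter")

# Later: reload the base model and attach the adapter
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base, "./lora-adapter")

# Optionally fold the adapter into the base weights for deployment
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")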

LoRA Benefits Explained

Instead of updating all 7B parameters:

  • Add small low-rank matrices to the attention layers
  • Only train ~8M parameters (~0.12% of the model; see the arithmetic sketch below)
  • Roughly 90% less GPU memory needed
  • Can fine-tune on a single RTX 4090
  • Easy to distribute (the LoRA adapter is tiny: tens of MB vs ~13GB for the full fp16 model)
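
That trainable-parameter figure follows directly from the adapter shapes: each targeted projection in Llama-2-7B is 4096x4096, and a rank-8 adapter adds r x (d_in + d_out) parameters to it. A quick back-of-the-envelope check in plain Python:

hidden = 4096      # Llama-2-7B hidden size
layers = 32        # transformer blocks
r = 8              # LoRA rank
targets = 4        # q_proj, k_proj, v_proj, o_proj

per_matrix = r * (hidden + hidden)            # A is r x 4096, B is 4096 x r
trainable = layers * targets * per_matrix
print(f"{trainable:,} trainable parameters")  # 8,388,608 (~0.12% of ~6.7B)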

🚀 QLoRA - Even More Efficient

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Apply LoRA (same as before)
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

# Train with the Trainer as before; the whole run fits in roughly 5GB of GPU memory

Method           | GPU Memory (7B model)     | Training Speed | Quality
---------------- | ------------------------- | -------------- | -------
Full Fine-tuning | ~80GB (needs A100)        | Slow           | 100%
LoRA             | ~24GB (needs RTX 3090)    | Fast           | ~99%
QLoRA            | ~5GB (works on RTX 3060)  | Fast           | ~97%
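
To actually stay in that ~5GB budget, QLoRA runs are usually configured with small per-device batches, gradient checkpointing, and a paged 8-bit optimizer. A sketch of TrainingArguments along those lines (the exact values are illustrative, not tuned):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./qlora-model",
    num_train_epochs=3,
    per_device_train_batch_size=1,     # keep per-step memory low on consumer GPUs
    gradient_accumulation_steps=16,    # recover a larger effective batch size
    gradient_checkpointing=True,       # trade compute for activation memory
    optim="paged_adamw_8bit",          # paged 8-bit optimizer from bitsandbytes
    learning_rate=2e-4,                # LoRA/QLoRA typically uses a higher LR than full fine-tuning
    logging_steps=10,
    fp16=True
)

# Then pass training_args to Trainer exactly as in the full fine-tuning example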

📊 Preparing Training Data

Format Your Data

# instruction_dataset.jsonl
{"instruction": "Translate to French:", "input": "Hello, how are you?", "output": "Bonjour, comment allez-vous?"}
{"instruction": "Translate to French:", "input": "Good morning", "output": "Bonjour"}
{"instruction": "Summarize this text:", "input": "Long text...", "output": "Summary..."}

# Load with datasets
from datasets import load_dataset

dataset = load_dataset("json", data_files="instruction_dataset.jsonl")

# Format for training
def format_instruction(example):
    if example["input"]:
        text = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
    else:
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return {"text": text}

dataset = dataset.map(format_instruction)
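
If your JSONL file only has a single split, you can carve out the validation set the earlier Trainer call expects with train_test_split:

# load_dataset("json", ...) puts everything under the "train" split
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]

print(len(train_dataset), len(eval_dataset))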

Data Quality Tips

  • Quality beats quantity: a few hundred clean, consistent examples often outperform thousands of noisy ones
  • Use one consistent prompt/response format across the whole dataset, and reuse it at inference time
  • Deduplicate and proofread; the model will faithfully learn your typos and inconsistencies
  • Make the training examples match the kinds of requests you expect in production

🎯 OpenAI Fine-tuning API

import os
from openai import OpenAI

os.environ["OPENAI_API_KEY"] = "your-key-here"
client = OpenAI()  # picks up OPENAI_API_KEY from the environment

# Prepare data (JSONL format)
# {"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

# Upload training file
with open("training_data.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")

# Create fine-tune job
fine_tune = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 3
    }
)

print(f"Fine-tune job created: {fine_tune.id}")

# Check status
status = client.fine_tuning.jobs.retrieve(fine_tune.id)
print(status.status)  # e.g. "running", "succeeded", or "failed"

# Once complete, use your fine-tuned model
response = client.chat.completions.create(
    model=status.fine_tuned_model,  # e.g. "ft:gpt-3.5-turbo:org:model:id"
    messages=[{"role": "user", "content": "Your prompt"}]
)

print(response.choices[0].message.content)
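
The training file itself has to use the chat "messages" format shown in the comment above. A small sketch that converts the instruction-style records from earlier into that JSONL layout (the system prompt is just a placeholder):

import json

records = [
    {"instruction": "Translate to French:", "input": "Hello, how are you?", "output": "Bonjour, comment allez-vous?"},
    {"instruction": "Translate to French:", "input": "Good morning", "output": "Bonjour"},
]

with open("training_data.jsonl", "w") as f:
    for r in records:
        user_content = f"{r['instruction']}\n{r['input']}" if r["input"] else r["instruction"]
        example = {
            "messages": [
                {"role": "system", "content": "You are a helpful translation assistant."},
                {"role": "user", "content": user_content},
                {"role": "assistant", "content": r["output"]},
            ]
        }
        f.write(json.dumps(example, ensure_ascii=False) + "\n")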

Pricing (GPT-3.5-turbo)

Fine-tuned GPT-3.5-turbo is billed separately for training tokens and for usage, and inference on a fine-tuned model costs more per token than the base model; check OpenAI's pricing page for current rates, as they change frequently.

⚙️ Best Practices

1. Start Small

  • Try prompt engineering first
  • Then few-shot learning
  • Fine-tune only if needed

2. Choose the Right Method

  • OpenAI API: Easiest, no infrastructure needed
  • LoRA: Best balance of efficiency and performance
  • QLoRA: For consumer GPUs
  • Full fine-tuning: Only if you need absolute best performance

3. Monitor Training

# Use Weights & Biases for tracking
import wandb

wandb.init(project="my-fine-tune")

training_args = TrainingArguments(
    # ...
    report_to="wandb"
)

4. Evaluate Properly

  • Hold out test set for evaluation
  • Use task-specific metrics
  • Compare with the base model on the same held-out prompts (see the sketch below)
  • Test edge cases
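
A minimal side-by-side check along those lines, generating from the base and fine-tuned models on a few held-out prompts (the paths, prompts, and comparison criteria are placeholders to adapt to your task):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def generate(model, tokenizer, prompt, max_new_tokens=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Strip the prompt so only the newly generated text is returned
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
tuned = AutoModelForCausalLM.from_pretrained(
    "./my-fine-tuned-model", torch_dtype=torch.float16, device_map="auto"
)

held_out_prompts = [
    "Instruction: Translate to French:\nInput: Good evening\nResponse:",
]

for prompt in held_out_prompts:
    print("BASE: ", generate(base, tokenizer, prompt))
    print("TUNED:", generate(tuned, tokenizer, prompt))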

🎯 Key Takeaways