The Transformer Revolution
Introduced in the 2017 paper "Attention Is All You Need," Transformers have become the foundation of modern AI. They power GPT-4, BERT, Claude, and virtually every state-of-the-art language model.
🎯 Why Transformers Changed Everything
- Process entire sequences in parallel (not sequential like RNNs)
- Capture long-range dependencies effectively
- Scale to billions of parameters
- Enable transfer learning across tasks
🧠 The Core Idea: Attention
What is Attention?
Attention allows the model to focus on different parts of the input when processing each element, much like you pay more attention to certain words when reading a sentence.
Example: "The animal didn't cross the street because it was too tired."
→ "it" refers to "animal" (not "street") - attention helps the model figure this out!
The Attention Formula
# Self-Attention in one equation
Attention(Q, K, V) = softmax(Q @ K.T / √d_k) @ V
Where:
- Q (Query): "What am I looking for?"
- K (Key): "What do I contain?"
- V (Value): "What do I actually output?"
- d_k: dimension of keys (for scaling)
# Step by step:
1. Compute attention scores: Q @ K.T
2. Scale down: divide by √d_k
3. Convert to probabilities: softmax
4. Weight the values: multiply by V
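The four steps translate almost line for line into code. Here is a minimal sketch in PyTorch (not a production implementation; PyTorch 2.x also ships torch.nn.functional.scaled_dot_product_attention, which computes the same formula with optimized kernels):
import math
import torch
def attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # steps 1-2: scores, then scale
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # step 3: probabilities
    return weights @ V                                    # step 4: weighted sum of values
# Quick shape check: 6 tokens with 64-dimensional queries/keys/values
Q = K = V = torch.randn(6, 64)
print(attention(Q, K, V).shape)  # torch.Size([6, 64])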
🏗️ Transformer Architecture
A Transformer consists of two main parts:
Encoder (Understanding)
- Reads and processes input
- Creates rich representations
- Used in BERT, T5
- Example: Understanding a sentence
Decoder (Generation)
- Generates output one token at a time (auto-regressive; see the causal-mask sketch below)
- In encoder-decoder models, also attends to the encoder's representations via cross-attention
- Decoder-only variants are used in GPT, ChatGPT
- Example: Writing a response
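The practical difference shows up in the attention mask: an encoder lets every token attend to every other token, while a decoder applies a causal (lower-triangular) mask so each position only sees earlier positions. A minimal sketch:
import torch
seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # 1 = may attend, 0 = blocked
print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
# Blocked positions are set to -inf before the softmax, so their attention weight becomes 0.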
Full Architecture Diagram
Input: "Hello World"
↓
[Token Embedding]
↓
[+ Positional Encoding] ← Adds position information
↓
┌─────────────────────────────┐
│ ENCODER LAYER 1 │
│ ┌──────────────────────┐ │
│ │ Multi-Head │ │ ← Attention to all words
│ │ Self-Attention │ │
│ └──────────────────────┘ │
│ ↓ │
│ ┌──────────────────────┐ │
│ │ Feed Forward │ │ ← Process each position
│ │ Neural Network │ │
│ └──────────────────────┘ │
└─────────────────────────────┘
↓
┌─────────────────────────────┐
│ ENCODER LAYER 2-N │ ← Repeat 6-96 times
└─────────────────────────────┘
↓
Rich Contextual Representations
🔍 Deep Dive: Self-Attention
Step-by-Step Example
Let's process: "The cat sat on the mat"
import torch
import torch.nn as nn
# 1. Input Embeddings
words = ["The", "cat", "sat", "on", "the", "mat"]
embedding_dim = 512
# Each word becomes a 512-dimensional vector
embeddings = torch.randn(len(words), embedding_dim) # [seq_len, d_model]
# 2. Create Q, K, V matrices
d_k = 64 # Dimension of queries/keys
W_q = nn.Linear(512, 64) # Query projection
W_k = nn.Linear(512, 64) # Key projection
W_v = nn.Linear(512, 64) # Value projection
Q = W_q(embeddings) # [6, 64]
K = W_k(embeddings) # [6, 64]
V = W_v(embeddings) # [6, 64]
# 3. Compute attention scores
scores = Q @ K.transpose(-2, -1) # [6, 6]
# scores[i][j] = how much word i should attend to word j
scores = scores / (d_k ** 0.5) # Scale by √d_k
# 4. Apply softmax (convert to probabilities)
attention_weights = torch.softmax(scores, dim=-1) # [6, 6]
# Example attention weights for "cat" (illustrative values; randomly initialized projections will differ):
# cat → The: 0.05
# cat → cat: 0.40 ← pays most attention to itself
# cat → sat: 0.25 ← "cat" is related to "sat"
# cat → on: 0.10
# cat → the: 0.05
# cat → mat: 0.15 ← "cat" on the "mat"
# 5. Weighted sum of values
output = attention_weights @ V # [6, 64]
print(f"Input shape: {embeddings.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
Visualizing Attention
import matplotlib.pyplot as plt
import seaborn as sns
# Visualize attention matrix
plt.figure(figsize=(8, 6))
sns.heatmap(attention_weights.detach().numpy(),
            xticklabels=words,
            yticklabels=words,
            cmap='YlOrRd',
            annot=True,
            fmt='.2f')
plt.title('Self-Attention Weights')
plt.xlabel('Keys')
plt.ylabel('Queries')
plt.show()
# Darker cells = stronger attention
🎯 Multi-Head Attention
Why Multiple Heads?
Instead of one attention mechanism, use multiple "heads" that learn different relationships:
- Head 1: Might focus on syntactic relationships (subject-verb)
- Head 2: Might capture semantic meaning
- Head 3: Might track co-references (pronouns)
- Head 4-8: Other patterns...
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # 512 / 8 = 64
        # Linear layers for Q, K, V and the output projection
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into multiple heads."""
        # x: [batch, seq_len, d_model]
        x = x.view(batch_size, -1, self.num_heads, self.d_k)  # [batch, seq_len, num_heads, d_k]
        return x.transpose(1, 2)                               # [batch, num_heads, seq_len, d_k]

    def forward(self, x, mask=None):
        batch_size = x.size(0)
        # 1. Linear projections
        Q = self.W_q(x)  # [batch, seq_len, d_model]
        K = self.W_k(x)
        V = self.W_v(x)
        # 2. Split into multiple heads
        Q = self.split_heads(Q, batch_size)
        K = self.split_heads(K, batch_size)
        V = self.split_heads(V, batch_size)
        # 3. Scaled dot-product attention (for each head)
        scores = Q @ K.transpose(-2, -1) / (self.d_k ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attention_weights = torch.softmax(scores, dim=-1)
        attention_output = attention_weights @ V
        # 4. Concatenate heads
        attention_output = attention_output.transpose(1, 2).contiguous()
        attention_output = attention_output.view(batch_size, -1, self.num_heads * self.d_k)
        # 5. Final linear layer
        output = self.W_o(attention_output)
        return output, attention_weights
# Usage
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(32, 50, 512) # [batch, seq_len, d_model]
output, weights = mha(x)
print(f"Output shape: {output.shape}") # [32, 50, 512]
📍 Positional Encoding
The Problem
Transformers process all tokens in parallel → they don't know word order!
"Cat chases dog" vs "Dog chases cat" would look the same without position info.
The Solution: Add Position Information
import numpy as np
def positional_encoding(seq_len, d_model):
    """Create positional encodings using sine/cosine functions."""
    position = np.arange(seq_len)[:, np.newaxis]  # [seq_len, 1]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = np.cos(position * div_term)  # Odd dimensions
    return torch.FloatTensor(pe)
# Add to embeddings
seq_len, d_model = 50, 512
embeddings = torch.randn(1, seq_len, d_model)
pos_encoding = positional_encoding(seq_len, d_model)
# Combine
input_with_position = embeddings + pos_encoding
print(f"Positional encoding shape: {pos_encoding.shape}")
Visualize Positional Encodings
pe = positional_encoding(100, 512)
plt.figure(figsize=(12, 6))
plt.pcolormesh(pe.numpy(), cmap='RdBu')
plt.xlabel('Embedding Dimension')
plt.ylabel('Position in Sequence')
plt.colorbar()
plt.title('Positional Encoding Pattern')
plt.show()
# Creates unique "fingerprint" for each position
🔄 Complete Transformer Block
class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Multi-head attention
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # 1. Multi-head attention + residual connection
        attention_output, _ = self.attention(x, mask)
        x = x + self.dropout1(attention_output)  # Residual
        x = self.norm1(x)                        # Layer norm
        # 2. Feed-forward + residual connection
        ffn_output = self.ffn(x)
        x = x + self.dropout2(ffn_output)        # Residual
        x = self.norm2(x)                        # Layer norm
        return x

# Full Transformer
class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_layers=6, num_heads=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads)
            for _ in range(num_layers)
        ])
        self.output_layer = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # Embed tokens
        x = self.embedding(x)
        # Add positional encoding
        x = x + positional_encoding(x.size(1), x.size(2))
        # Process through transformer blocks
        for block in self.blocks:
            x = block(x)
        # Output projection
        logits = self.output_layer(x)
        return logits
# Create model
model = Transformer(vocab_size=50000, d_model=512, num_layers=6)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
🎭 Encoder vs Decoder vs Encoder-Decoder
Encoder-Only
Examples: BERT, RoBERTa
Use Cases:
- Text classification
- Named entity recognition
- Question answering
- Sentiment analysis
Key Feature: Bidirectional context
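Bidirectional context is easy to see with a masked-word demo. A minimal sketch, assuming the Hugging Face transformers library is installed:
from transformers import pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The cat sat on the [MASK].")[:3]:
    print(f"{pred['token_str']:>10}  {pred['score']:.3f}")
# BERT uses words on BOTH sides of [MASK] to fill it in.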
Decoder-Only
Examples: GPT-3, GPT-4, LLaMA
Use Cases:
- Text generation
- Story writing
- Code completion
- Chat
Key Feature: Auto-regressive generation
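Auto-regressive generation is just a loop: predict the next token, append it, and feed the longer sequence back in. A minimal greedy-decoding sketch using the toy Transformer defined above (a real decoder-only model also applies the causal mask shown earlier and is trained, neither of which holds for this toy, so the output ids are meaningless):
def generate(model, prompt_ids, max_new_tokens=10):
    tokens = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(tokens)                                        # [batch, seq_len, vocab_size]
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)    # greedy pick for the last position
        tokens = torch.cat([tokens, next_token], dim=1)               # append and repeat
    return tokens
prompt = torch.randint(0, 50000, (1, 5))   # pretend these ids encode a prompt
print(generate(model, prompt).shape)       # [1, 15] = 5 prompt tokens + 10 generated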
Encoder-Decoder
Examples: T5, BART
Use Cases:
- Translation
- Summarization
- Paraphrasing
- Question generation
Key Feature: Input → Output transformation
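The input → output pattern maps directly onto text-to-text APIs. A minimal sketch with T5, again assuming the transformers library is installed:
from transformers import pipeline
translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("The cat sat on the mat.")[0]["translation_text"])
# The encoder reads the English sentence; the decoder generates the German one.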
⚡ Why Transformers Are So Powerful
1. Parallelization
Unlike RNNs, all tokens are processed simultaneously
- Much faster training
- Uses modern GPU hardware efficiently
- Scales to longer sequences (though attention cost grows quadratically with length)
2. Long-Range Dependencies
Attention directly connects any two tokens
- Avoids the vanishing-gradient problem RNNs face over long sequences
- Captures context across an entire document
- Handles long-range relationships better than RNNs
3. Interpretability
Attention weights show what the model focuses on
- Visualize decision-making
- Debug model behavior
- Understand predictions
4. Transfer Learning
Pre-train once, fine-tune for many tasks
- Saves computation
- Improves performance
- Works with less labeled data
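In practice, transfer learning usually means loading pretrained weights and adding a small task-specific head. A minimal sketch with a BERT classifier, assuming the transformers library (dataset, training loop, and hyperparameters are left out):
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # pretrained encoder + a fresh 2-class head
)
# From here you would fine-tune on labeled examples, e.g. with the Trainer API
# or a standard PyTorch training loop, typically for only a few epochs.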
📊 Transformer Variants
GPT (Generative Pre-trained Transformer)
Decoder-only, autoregressive
Size: 175B parameters (GPT-3)
BERT (Bidirectional Encoder)
Encoder-only, masked language modeling
Size: 110M - 340M parameters
T5 (Text-to-Text Transfer)
Encoder-decoder, unified framework
Size: 220M - 11B parameters
CLIP (Contrastive Language-Image)
Dual encoders for text + images
Use: Image-text understanding
🎯 Key Takeaways
- Attention is the core mechanism - allows model to focus on relevant parts
- Multi-head attention learns different types of relationships simultaneously
- Positional encoding gives the model sense of word order
- Parallelization makes Transformers much faster than RNNs
- Scalability - works well from millions to trillions of parameters
- Powers virtually all modern LLMs (GPT-4, Claude, Gemini)